The Huai River Analysis

Policy Analysis on China’s Huai River Policy

Introduction

Abstract

The Huai River Policy is a Chinese policy that provides free (or highly subsidized) coal for residents north of the river since the policy’s institution in 1950.

According to the empirical conclusions from Ebenstein et. al in a PNAS research paper, the analysis suggests that the Huai River Policy has had an adverse effect on human health. The policy led to PM10 concentrations being at levels that are 46 percent higher in the north and a reduce life expectancy by 3.1 years (caused by cardiorespiratory mortality). In general, the research suggests that an additional 10 μg/m3 of PM10 is associated with a 0.64-year decrease in life expectancy.

The implications of these findings predict that Chinese-compliance to their Class I PM10 standard (of 40 μg/m3) would lead to an increase of 3.7 billion life-years for their current population. In addition, less resources would be required for dedication to pollution-related issues on both the individual and national levels.

Personal Thoughts

It would be interesting to see if there is any research that can predict what the total loss of human life would be in the absence of these policies. - Also leads to a deeper question regarding how to define what is a “good policy” – total life years, total number of pollutant-related deaths, - i.e.: If the Huai River policy was never enacted (and as a result more people died annually during winter-months due to insufficient heating capabilities), how would those figures compare to the figures from this research design and reality in China today?

Data Prep

Loading packages

pacman::p_load( # Package manager for installing/loading
   dplyr,       # For cleaning/wrangling syntax
   magrittr,     # For pipes
   ggplot2,       # For graphing/visuals
   ggpubr,      # For arranging/formatting graphs
   rdd, 
   rddtools   # For RDD modeling/design
   )

Loading data

# HTML included in script above to limit bright-light eye strain

# Reading data
river = haven::read_dta("huairiver.dta")

# Modify column names for easier reference
river %<>% rename(
   id = 1,
   wind = 2,
   precip = 4,
   north_of_river=5
)

Question 1 - Why would a simple analysis comparing northern vs souther cities be inaccurate?

Cities that are further north have colder winter temperatures. Lower winter temperatures means more demand for heat (essential for not freezing to death) which means more demand for energy and thus more demand for particulate-emitting coal-burning that contribute to higher PM10 levels.

Even if the Huai River policy did not exist, it would be likely that we would see higher PM10 concentrations in areas further-north. The image below provides an overview of mean January temperatures in China over the specified time interval.

Regression discontinuity design is able to overcome this issue because it allows for analysis of populations that are receiving a given treatment versus populations that are included in the control group.

LIMITATIONS OF EXISTING STUDIES - Literature is composed of observational studies that include confounding variables (e.g. health effects associated with pollution exposure) - Current evidence looks largely at populations with modest levels of PM common among developing countries - Limited evidence and research surrounding long-term exposure effects

WAYS IN WHICH PREVIOUS LIMITATIONS WERE ADDRESSED - Research design based on China’s Huai River Policy - Average PM10 concentrations in China are five-times higher than the WHO standard - The Huai River Policy results in sustained differences in PM10 between both sides of the river that have persisted since data was first collected - Also includes periods of loosened migration-restrictions

Question 2 - What are the outcome and assignment variables in Fig.2 of the Ebenstein et.al paper?

The outcome variable is pm10, the value of which is the average PM10 level (measured in μg/m3) for the given county-code.

The assignment variable (i.e. “running variable”) is dist_huai

Question 3 - What is a binned scatter plot?

A binned scatter plot is a plot where data points for a given variable are grouped into bins and then an aggregate statistics is used to summarize each bin. It is thus known as a non-parametric way of obtaining E[Y|X] (i.e. the conditional expectation of Y given X)

Question 4 - Graphical Visualization

Exploring the Data More

Creating a scatter plot, creating groups based on distance to river

# Creating bin for each mile
river_round = river %>% group_by(dist_huai) %>%
   round(digits = 2) %>% # making another df for rounded data 
   round(digits = 1) %>%
   round(digits = 0) %>%
   ungroup()             
ggplot(data=river_round,aes(x=river_round$dist_huai,y=river_round$pm10)) +
   geom_point()+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F)

Since there is an outlier value at dist_huai = 2, it would taking another look to see how the predicted regression values change if we remove the single value.

Looking at the data we can see that outlier is associated with id = 653221, so now removing that row and running the same code for the graph above

river_round %>% filter(id!=653221)%>%
   ggplot(data=.,aes(x=.$dist_huai,y=.$pm10)) +
      geom_point()+
      geom_smooth(method = "lm", color="blue",se=F)+
      geom_smooth(method = "loess", color="red",se=F)+
      ylim(0,308) # to preserve graph proportions for comparing plots

Even if the outlier is removed, the regression predictions change by an amount small amount. In the scope of this analysis, it is important to consider the significance in the shapes of each line. The blue line has a positive slope, from which we can use to support the notion of counties that are further north of the river are associated with higher levels of PM10. The red line representing a loess regression has a convex shape, which predicts that there is a positive correlation between distance to the Huai river PM10 concentration up until a few miles north of the river, where the trend begins again in the opposite direction.

We can also look at how the graph would change if we change display the data by methodologically creating bins (rather than simply plotting the rounded distance) or we could look at the same plotting as the graphs above, but with absolute value. Doing so may lose the value of the the distinction of which side of the river an observation is on, but would also help reassure us of other trends that may exist in the data (plus we could create a binary variable indicating if a given observation is north or south of the river)

# This time using the non-rounded data to preserve some additional information
# lost by rounding

# Considering range of value in distance
range(river$dist_huai) # apprx. -12.78 and 16.48 for min/max
## [1] -12.77608  16.47519
diff(range(river$dist_huai)) # apprx. 29.25
## [1] 29.25127
# Also removing outlier from original df, saving for later if wanted
outlier_obs = river %>% filter(id==653221)
river %<>% filter(id!=653221)

PM10

# Making a bin for approximately each degree
plot1 = ggplot(river, aes(x=dist_huai,y=pm10)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=30, # Bin for each mile, similar to 
                   color='blue', size=2, geom='point')+ # plots above
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily large number of bins
plot2 = ggplot(river, aes(x=dist_huai,y=pm10)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily small number of bins
plot3 = ggplot(river, aes(x=dist_huai,y=pm10)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=10, 
                   color='blue', size=2, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrary number of bins using absolute values for distance
plot4 = ggplot(river, aes(x=abs(dist_huai),y=pm10)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

ggarrange(plot1, plot2, plot3, plot4,vjust=1,hjust=-4,labels = "AUTO",
          font.label = list(size=10, face="bold.italic",
                            color = "navy")) %>%
   annotate_figure(.,
   left = text_grob("Average PM10 level (μg/m3)",
      color = "black",rot = 90,face="bold"),
   bottom = text_grob("Distance from Huai River (degrees North)",
      color = "black",face="bold")) %>%
               annotate_figure(.,
            bottom = text_grob("Figure (D.) scale in absolute values",
               color = "black",face="italic",
               hjust = -1.25,vjust = -5, size = 6)) -> plot4x_pm
plot4x_pm

There is a clear linear trend between how far north a data point is and average PM10 levels. The positive slope matches what we might expect without deeper analysis: cities that are further north require heating (and thus produce more emissions in the heating-process) than do cities that are further south. If we look at Figure D, however, we see a different trend. If we ignore if an observation is considered to be northern or southern (relative to its positioning to the river), we see that there is a negative-linear relationship between PM10 concentration and nearness to the river.

In addition, if we think of the LOESS regression line (in red) as a parabolic function, we can see that the vertex of the concave-line in Figured A-C is positions slightly to the right of zero.

Temperature

Now to graphically illustrate the correlation between distance from the river and temperature, precipitation, and changes in wind speed

## Temperature

# Note: I am reusing plot1:4 naming conventions for the sake of less
# computational strain

# Making a bin for approximately each degree
plot1 = ggplot(river, aes(x=dist_huai,y=temp)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=30, # Bin for each mile, similar to 
                   color='blue', size=2, geom='point')+ # plots above
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily large number of bins
plot2 = ggplot(river, aes(x=dist_huai,y=temp)) +
  geom_point(size=1.5, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily small number of bins
plot3 = ggplot(river, aes(x=dist_huai,y=temp)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=10, 
                   color='blue', size=2, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrary number of bins using absolute values for distance
plot4 = ggplot(river, aes(x=abs(dist_huai),y=temp)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

ggarrange(plot1, plot2, plot3, plot4,vjust=1,hjust=-4,labels = "AUTO",
          font.label = list(size=10, face="bold.italic",
                            color = "navy")) %>%
   annotate_figure(.,
   left = text_grob("Average Temperature (°F)",
      color = "black",rot = 90,face="bold"),
   bottom = text_grob("Distance from Huai River (degrees North)",
      color = "black",face="bold")) %>%
               annotate_figure(.,
            bottom = text_grob("Figure (D.) scale in absolute values",
               color = "black",face="italic",
               hjust = -1.25,vjust = -5, size = 6)) -> plot4x_temp
plot4x_temp

Temperature appears to show a strong linear correlation using both LOESS and lm regression methods. Notice in figure D of the temperature plots how the LOESS regression seems to fit the poorly, while there is still a linear trend (albeit poorly fitting). This is interesting to consider.

These trends mean that, even if we ignore which side of the river we are looking at, there is still a strong-enough trend in the data for the line of best fit to suggest that there is a relationship between nearness to the river and average temperatures. We can further note how even though it is clear that the lm line is still not very well-fitting, it looks as if there is another underlying trend in the data not picked up by either regressions in Figure D.

If we assume that these trend exist because of natural temperature variations in northern versus southern continues, then we would be ignoring the information that is provided in Figure D.

Precipitation

## Precipitation

# Making a bin for approximately each degree
plot1 = ggplot(river, aes(x=dist_huai,y=precip)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=30, # Bin for each mile, similar to 
                   color='blue', size=2, geom='point')+ # plots above
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily large number of bins
plot2 = ggplot(river, aes(x=dist_huai,y=precip)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily small number of bins
plot3 = ggplot(river, aes(x=dist_huai,y=precip)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=10, 
                   color='blue', size=2, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrary number of bins using absolute values for distance
plot4 = ggplot(river, aes(x=abs(dist_huai),y=precip)) +
  geom_point(size=1.5, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

ggarrange(plot1, plot2, plot3, plot4,vjust=1,hjust=-4,labels = "AUTO",
          font.label = list(size=10, face="bold.italic",
                            color = "navy")) %>%
   annotate_figure(.,
   left = text_grob("Average Precipitation (mm)",
      color = "black",rot = 90,face="bold"),
   bottom = text_grob("Distance from Huai River (degrees North)",
      color = "black",face="bold")) %>%
               annotate_figure(.,
            bottom = text_grob("Figure (D.) scale in absolute values",
               color = "black",face="italic",
               hjust = -1.25,vjust = -5, size = 6)) -> plot4x_precip
plot4x_precip

While the linear relationship that appears in each plot is worth noting and our consideration, it is also interesting to see the shape of the LOESS regression line for Figures A-C. In each of these figures, that appears to be an infection point centered (or at least roughly centered) at zero. This might suggest that there is perhaps a relationship that can describe the data more truthfully if the “haves” and the “have-nots” (in terms of free/subsidized coal because of the policy) are analyzed with a different approach.

Wind

## Wind speed

# Making a bin for approximately each degree
plot1 = ggplot(river, aes(x=dist_huai,y=wind)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=30, # Bin for each mile, similar to 
                   color='blue', size=2, geom='point')+ # plots above
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily large number of bins
plot2 = ggplot(river, aes(x=dist_huai,y=wind)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrarily small number of bins
plot3 = ggplot(river, aes(x=dist_huai,y=wind)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=10, 
                   color='blue', size=2, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arbitrary number of bins using absolute values for distance
plot4 = ggplot(river, aes(x=abs(dist_huai),y=wind)) +
  geom_point(size=1.5, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   geom_smooth(method = "loess", color="red",se=F, size=.75)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())

# Arranging plots
ggarrange(plot1, plot2, plot3, plot4,vjust=1,hjust=-4,labels = "AUTO",
          font.label = list(size=10, face="bold.italic",
                            color = "navy")) %>%
   annotate_figure(.,
   left = text_grob("Average Wind Speed (m/s)",
      color = "black",rot = 90,face="bold"),
   bottom = text_grob("Distance from Huai River (degrees North)",
      color = "black",face="bold")) %>%
               annotate_figure(.,
            bottom = text_grob("Figure (D.) scale in absolute values",
               color = "black",face="italic",
               hjust = -1.25,vjust = -5, size = 6)) -> plot4x_wind
plot4x_wind

When looking at wind speed, there is again a similar trend that shows in each graph. The models with the smaller number of bins-used provide additional inference that can be potentially lost with large bin sizes. There is also a slight trend that appears in Figures A-C with a local maximum around zero. While this may be something that warrants additional investigation, my hypothesis is that this can be explained by the fact that there are no wind-barriers (e.g. trees, buildings) on the river itself that may exist further inland.


Takeaways Moving Forward

After considering all of the graphs and interpretations above, we can arrive at a few generalization in the scope of our analysis:

  • Large number of bins for the data may obscure underlying trends
  • Graphically visualizing the data in terms of absolute-values provide some additional insight
  • We can make some changes in the analytical methods above to perhaps provide a better-fitting and robustly-insightful statistics, visualizations, and interpretations

Question 5 - Regression Analysis

In attempting to replicate the bandwidth selection from the Ebstein et.al paper, while we could make an educated guess by looking at the number of points on a given side of the axis (and count 13 points plotted with negative degree-values and 17 with positive degree-values), we can also recall that the range of the original values for degree-distance to the river

range(river$dist_huai)
## [1] -12.77608  16.47519
range(river_round$dist_huai)
## [1] -13  16

which is approximately (-12.78, 16.47) and rounded to (-13, 16), the same range of points visually observed in the Ebstein et.al plot and the same as the bin selection of one of the models we used earlier. Although the number of positive/negative values differ, I expect that this is because of differences in rounding. While I’m sure I could manipulate the data with perhaps a celing() or floor() function to obtain the exact same range/bin values (or perhaps there is a better function for rounding than the one I used), since the range is the same and less value is lost in rounding by all values up/down regardless of distance to nearest whole number/integer, I will use the rounding/bin-selection as described above and earlier in this analysis.

If we plot the same data using a regression discontinuity design, we can gain better insight regarding the difference between the groups/treatment

river_round %<>% group_by(dist_huai) %>%
   mutate(north_of_river = as.factor(north_of_river)) %>% ungroup(.)

# PM10, similar to the graph included in the paper 
river %>% na.omit() %>% 
   mutate(north_of_river = as.factor(north_of_river)) %>%
   ggplot(.,aes(dist_huai, pm10, color=north_of_river)) +
      geom_point(alpha=0.35)+
     stat_summary_bin(fun.y='mean', bins=30, 
                   color='black', geom='point')+
            geom_smooth(method = "lm", formula = y~poly(x,2),se=F)+
      geom_vline(xintercept = 0, linetype="longdash")+
      ylim(40,160)+
      ylab("Distance from Huai River")+ 
      xlab("Average PM10")

Now to formally define our models

library(rdrobust)
river %>% na.omit() %>% attach(.)

summary(
   rdplot(y=pm10, x=dist_huai, nbins = 15, ci = .95, shade=TRUE,
         title="PM10 and Distance to Huai River",
         y.label="PM10",
         x.label="Distance to River")
   )

## Call: rdplot
## 
## Number of Obs.                  146
## Kernel                      Uniform
## 
## Number of Obs.                   75              71
## Eff. Number of Obs.              75              71
## Order poly. fit (p)               4               4
## BW poly. fit (h)             12.448          16.475
## Number of bins scale              1               1
## 
## Bins Selected                    15              15
## Average Bin Length            0.830           1.098
## Median Bin Length             0.830           1.098
## 
## IMSE-optimal bins                 5               5
## Mimicking Variance bins           7               9
## 
## Relative to IMSE-optimal:
## Implied scale                 3.000           3.000
## WIMSE variance weight         0.036           0.036
## WIMSE bias weight             0.964           0.964
summary(
   rdplot(y=temp, x=dist_huai, nbins = 14, ci = .95, shade=TRUE,
         title="Temperature and Distance to Huai River",
         y.label="Temerature (F)",
         x.label="Distance to River")
   )

## Call: rdplot
## 
## Number of Obs.                  146
## Kernel                      Uniform
## 
## Number of Obs.                   75              71
## Eff. Number of Obs.              75              71
## Order poly. fit (p)               4               4
## BW poly. fit (h)             12.448          16.475
## Number of bins scale              1               1
## 
## Bins Selected                    14              14
## Average Bin Length            0.889           1.177
## Median Bin Length             0.889           1.177
## 
## IMSE-optimal bins                 7               7
## Mimicking Variance bins          13              16
## 
## Relative to IMSE-optimal:
## Implied scale                 2.000           2.000
## WIMSE variance weight         0.111           0.111
## WIMSE bias weight             0.889           0.889
summary(
   rdplot(y=wind, x=dist_huai, nbins = 9, ci = .95, shade=TRUE,
         title="Wind-speeds and Distance to Huai River",
         y.label="Average Wind Speed",
         x.label="Distance to River")
   )

## Call: rdplot
## 
## Number of Obs.                  146
## Kernel                      Uniform
## 
## Number of Obs.                   75              71
## Eff. Number of Obs.              75              71
## Order poly. fit (p)               4               4
## BW poly. fit (h)             12.448          16.475
## Number of bins scale              1               1
## 
## Bins Selected                     9               9
## Average Bin Length            1.383           1.831
## Median Bin Length             1.383           1.831
## 
## IMSE-optimal bins                 5               5
## Mimicking Variance bins           6               7
## 
## Relative to IMSE-optimal:
## Implied scale                 1.800           1.800
## WIMSE variance weight         0.146           0.146
## WIMSE bias weight             0.854           0.854

The graph above showing the wind variable shows a violation of covariate smoothness, so regression discontinuity design will not be very insightful for insight between these two variables (or at least we are unable to say that the difference in wind speeds given other characteristics of a given observation cannot be independent attributed to exposure to the treatment/side of the river)

summary(
   rdplot(y=precip, x=dist_huai, nbins = 12, ci = .95, shade=TRUE,
         title="Precipitation and Distance to Huai River",
         y.label="Precipitation",
         x.label="Distance to River")
   )

## Call: rdplot
## 
## Number of Obs.                  146
## Kernel                      Uniform
## 
## Number of Obs.                   75              71
## Eff. Number of Obs.              75              71
## Order poly. fit (p)               4               4
## BW poly. fit (h)             12.448          16.475
## Number of bins scale              1               1
## 
## Bins Selected                    12              12
## Average Bin Length            1.037           1.373
## Median Bin Length             1.037           1.373
## 
## IMSE-optimal bins                 5               6
## Mimicking Variance bins           7              10
## 
## Relative to IMSE-optimal:
## Implied scale                 2.400           2.000
## WIMSE variance weight         0.067           0.111
## WIMSE bias weight             0.933           0.889

Question 6 - The Identification Assumption

In RD design, the identification assumption would hold true if we are able to determine that the jump that occurs in the data at the cutoff point is only explained by exposure to the treatment (i.e. the change in Di is the only reason for the change in Yi)

As mentioned above, the assumption hold true except for the plot for the interaction with the wind variable

Question 7 - Manipulation Testing

If we consider the distribution of the variable in the data

ggplot(river, aes(wind)) +  geom_histogram()

ggplot(river, aes(temp)) +  geom_histogram()

ggplot(river, aes(pm10)) +  geom_histogram()

ggplot(river, aes(wind)) +  geom_histogram()

we do not need to worry about performing a manipulation test. Although none of the plots above have perfectly normal/skewed distribution, they do not show any apparent/significant jump in observed-characters that may potentially be explaining a significant portion of the trend in the modeling.

There is a spike in the distribution of temperature, but since temperature is correlated with PM10 concentration (which is normal enough), therein lies a hint about a possible trend in the data and we can move on in our analysis with confidence.

Question 8 - The Placebo Test

The purpose of running a placebo test is to ensure that there are not trends being derived from analysis of a specific set of data that may in fact be able to be attributed to another source that is not contained in study’s data set.

By considering this potential and with the use of false locations, the authors are able to better-isolate the true effect of treatment that might potentially be obfuscated if simply looking at latitudinal position (which might result in natural variation of variables as the river bends)

In the paper the author further explain that, since the predictions of the test fall within the 95% CI of predicted values for PM10 and life expectancy We can utilize similar logic (along with the hint from the assignment outline) to perform a similar test

# Creating placebo
river = river %>% group_by(dist_huai) %>% round(digits = 0)
attach(river)

river = river %>%
   mutate(placebo = if_else(dist_huai==1, dist_huai+1,dist_huai))
summary(
   rdplot(y=pm10, x=placebo, nbins = 15, ci = .95, shade=TRUE,
         title="PM10 and Distance to Huai River (placebo)",
         y.label="PM10",
         x.label="Distance to River")
   )
## Error in complete.cases(x): object 'placebo' not found

We can see that the result still hold true and we be further assured of the reliability of our conclusions.