Inferences about relationships
Kyle Brewster
T/F Questions
- True/False/Uncertain: P-values tell us the probability that the null hypothesis is true. Justify your answer.
FALSE, the p-values tell us the probability of getting a result at least as extreme as the observed result if the null hypothesis is true.
- True/False/Uncertain: A result is important if it is statistically significant. Justify your answer.
FALSE, all that statistical significance means is that it is relatively unlikely that we would have gotten a result at least as extreme as was observed if the null hypothesis is true.
The Data
The file polls2016_clean.csv is based on data downloaded from FiveThiryEight.com which collected polls from HuffPost Pollster, RealClearPolitics, polling firms and news reports. Parts of the data have been recoded.
Name | Description |
---|---|
state |
State |
startdate |
Start Date of Poll |
enddate |
End Date of Poll |
grade |
538’s Grade for Pollster |
samplesize |
Poll’s Sample Size |
population |
lv Likely Voters or rv Registered Voters |
rawpoll_clinton |
Clinton Share in Poll |
rawpoll_trump |
Trump Share in Poll |
The file federalelections2016.xlsx includes results from the 2016 US presidential election. It was downloaded from: https://transition.fec.gov/general/FederalElections2016.shtml.
library(dplyr)
library(readr)
library(readxl)
library(ggplot2)
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
polls <- read.csv("polls2016_clean.csv")
polls <- polls %>% filter(population=="lv" | population=="rv")%>%
mutate(startdate=parse_date(startdate,"%m/%d/%Y"),
enddate=parse_date(enddate,"%m/%d/%Y"),
rawpoll_clinton = rawpoll_clinton/100,
rawpoll_trump = rawpoll_trump/100,
rawpoll_diff = rawpoll_clinton-rawpoll_trump)
# Extensive cleaning of data completed in Microsoft Excel.
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
votes <- read.csv("votes_clean.csv")
votes <- votes %>%
mutate(clinton_share = trump_votes/total_votes,
trump_share = clin_votes/total_votes,
diff_share = clinton_share-trump_share)
merged_df <- merge(polls, votes, by="state")
dim(merged_df)
## [1] 9054 18
Trends in Pollings
The graph shows that Clinton’s predicted share of votes was increasing until just before April 2016 (perhaps having to do with the Clinton email scandal) and then decreased and remained relatively stable from June 2016 to late-September when it stated to increase again. Once it became closer to election, the predictions for Trump and Clinton were close, with Clinton predictions being slightly higher
ggplot(polls,aes(x=enddate,y=rawpoll_clinton,color=population))+
geom_point(stat="summary", fun.y="mean")+
geom_smooth()+
xlab("Date")+ylab("Clinton Share")
# Using GAM for regression line
Again where lv
and rv
are “likely votes” and “registed votes”.
Polls and Results
Let’s the sample to polls of likely voters that ended on or after September 1st, 2016.
Use plotting functionality to make a scatter plot of the predicted Clinton share rawpoll_clinton
against Clinton’s actual vote share ClintonShare
in each state. Divide rawpoll_clinton
by 100 to make it range from 0 to 1 rather than 0 to 100.
merged_df <- merged_df %>%
filter(enddate>"9-1-2016")
ggplot(merged_df,aes(x=rawpoll_clinton,y=clinton_share)) +
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
geom_smooth(method = "loess", se=FALSE, size=.5, color="red")+
xlab("Clinton Polling")+ylab("Clinton Vote Share")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Bias in Polls
Calculating the average bias in the polls, where bias is defined as:
Answer
merged_df <- merged_df %>%
mutate(bias = rawpoll_clinton-clinton_share)
Compared to the grade for the pollsters (grade
) from the dataset?
state_bias <- merged_df %>%
group_by(state, grade) %>%
summarise(avg_bias = mean(bias, na.rm=TRUE)) %>%
arrange(desc(abs(avg_bias))) %>% ungroup()
tibble(state_bias)
## # A tibble: 330 x 3
## state grade avg_bias
## <chr> <chr> <dbl>
## 1 DC "C-" 0.829
## 2 DC "B" 0.510
## 3 WY "B-" -0.492
## 4 WY "" -0.482
## 5 WY "B" -0.462
## 6 WY "C-" -0.458
## 7 WV "B-" -0.445
## 8 WV "B" -0.419
## 9 WV "C-" -0.407
## 10 AL "B" -0.406
## # ... with 320 more rows
If we look at the absolute value of the bias (to consider those who showed positive and negative bias) and look at our results, it seems that there is a considerable amount of differentiation between the grades of the pollsters and the actual results of the election.
Margins of Error
The variance of a binary random variable, , is equal to . We saw in class that the standard error of a mean is given by:
Since support for Clinton is a binary variable, the standard error of a poll asking Clinton’s vote share is given by:
Since is unknown at the time of the poll, the estimated standard error is:
Using rawpoll_clinton
(divided by 100) and the sample size to calculate the standard error and 95% confidence interval around the predicted vote share with the interval’s lower and upper bounds being and , respectively.
Answer
merged_df <- merged_df %>%
mutate(se = sqrt((rawpoll_clinton*(1-rawpoll_clinton))/samplesize),
l_CI = rawpoll_clinton-1.96*se,
u_CI = rawpoll_clinton+1.96*se)
For what percentage of polls was Clinton’s true vote share outside of the 95% confidence interval around the predicted vote share? (Remember that the standard error is based on a probability ranging from 0 to 1 but shares are sometimes 0 to 100.)
Answer
merged_df <- merged_df %>%
mutate(C_Interval = clinton_share<u_CI, clinton_share>l_CI)
mean(merged_df$C_Interval,na.rm=TRUE)
## [1] 0.4418296