Inferences about Relationships

T/F Questions

  1. True/False/Uncertain: P-values tell us the probability that the null hypothesis is true. Justify your answer.

FALSE. A p-value tells us the probability of getting a result at least as extreme as the observed result if the null hypothesis is true. It is not the probability that the null hypothesis itself is true.

  2. True/False/Uncertain: A result is important if it is statistically significant. Justify your answer.

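FALSE. All that statistical significance means is that it is relatively unlikely that we would have gotten a result at least as extreme as the one observed if the null hypothesis is true. Whether the result is important depends on the size of the effect, which significance alone does not tell us.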
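
The second answer can be illustrated with a minimal sketch in base R. All numbers here are made up for illustration and are not taken from the polling data, and the helper z_test_p is a hypothetical name. The same tiny effect (an observed share of 50.1% against a null of 50%) is tested with a small and a very large sample using a two-sided z-test for a proportion.

```r
# Hypothetical numbers for illustration only (not from the polling data).
p_null <- 0.500   # null hypothesis: the true share is 50%
p_hat  <- 0.501   # observed share: a practically negligible 0.1-point gap

# Two-sided p-value from a z-test for a single proportion.
z_test_p <- function(p_hat, p_null, n) {
  se <- sqrt(p_null * (1 - p_null) / n)  # standard error under the null
  z  <- (p_hat - p_null) / se
  2 * pnorm(-abs(z))  # P(result at least this extreme | null is true)
}

p_small <- z_test_p(p_hat, p_null, n = 1e3)  # small sample
p_large <- z_test_p(p_hat, p_null, n = 1e7)  # very large sample

print(c(small_sample = p_small, large_sample = p_large))
```

The effect size is identical in both cases. Only the sample size changes, and with it the p-value, so significance by itself says nothing about whether a 0.1-point gap matters.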

The Data

The file polls2016_clean.csv is based on data downloaded from FiveThirtyEight.com, which collected polls from HuffPost Pollster, RealClearPolitics, polling firms and news reports. Parts of the data have been recoded.

Name             Description
state            State
startdate        Start Date of Poll
enddate          End Date of Poll
grade            538’s Grade for Pollster
samplesize       Poll’s Sample Size
population       lv (Likely Voters) or rv (Registered Voters)
rawpoll_clinton  Clinton Share in Poll
rawpoll_trump    Trump Share in Poll

The file federalelections2016.xlsx includes results from the 2016 US presidential election. It was downloaded from: https://transition.fec.gov/general/FederalElections2016.shtml.

library(dplyr)
library(readr)
library(readxl)
library(ggplot2)
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
polls <- read.csv("polls2016_clean.csv")

polls <- polls %>% filter(population=="lv" | population=="rv")%>%
  mutate(startdate=parse_date(startdate,"%m/%d/%Y"),
         enddate=parse_date(enddate,"%m/%d/%Y"),
         rawpoll_clinton = rawpoll_clinton/100,
         rawpoll_trump = rawpoll_trump/100,
         rawpoll_diff = rawpoll_clinton-rawpoll_trump)
# Extensive cleaning of data completed in Microsoft Excel.
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
votes <- read.csv("votes_clean.csv")

votes <- votes %>% 
  # Each candidate's share is computed from that candidate's own vote count
  mutate(clinton_share = clin_votes/total_votes,
         trump_share = trump_votes/total_votes,
         diff_share = clinton_share-trump_share)


merged_df <- merge(polls, votes, by="state")
dim(merged_df)
## [1] 9054   18

Polls and Results

Let’s restrict the sample to polls of likely voters that ended on or after September 1st, 2016.

Use plotting functionality to make a scatter plot of the predicted Clinton share rawpoll_clinton against Clinton’s actual vote share clinton_share in each state. rawpoll_clinton was already divided by 100 when the data were loaded, so it ranges from 0 to 1 rather than 0 to 100.

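merged_df <- merged_df %>%
  # Keep likely-voter polls that ended on or after September 1st, 2016
  filter(population == "lv", enddate >= as.Date("2016-09-01"))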

ggplot(merged_df,aes(x=rawpoll_clinton,y=clinton_share)) +
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  geom_smooth(method = "loess", se=FALSE, size=.5, color="red")+
  xlab("Clinton Polling")+ylab("Clinton Vote Share")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Bias in Polls

Calculate the average bias in the polls, where bias is defined as:

$$\text{Bias} = \hat{p} - p$$

Answer

merged_df <- merged_df %>%
  mutate(bias = rawpoll_clinton-clinton_share)

How does the average bias compare with the grade for the pollsters (grade) from the dataset?

state_bias <- merged_df %>%
  group_by(state, grade) %>%
  summarise(avg_bias = mean(bias, na.rm=TRUE)) %>%
  arrange(desc(abs(avg_bias))) %>% ungroup()
tibble(state_bias)
## # A tibble: 330 x 3
##    state grade avg_bias
##    <chr> <chr>    <dbl>
##  1 DC    "C-"     0.829
##  2 DC    "B"      0.510
##  3 WY    "B-"    -0.492
##  4 WY    ""      -0.482
##  5 WY    "B"     -0.462
##  6 WY    "C-"    -0.458
##  7 WV    "B-"    -0.445
##  8 WV    "B"     -0.419
##  9 WV    "C-"    -0.407
## 10 AL    "B"     -0.406
## # ... with 320 more rows

Sorting by the absolute value of the average bias (so that polls that overstated and understated Clinton’s share are treated alike), the largest biases are spread across pollster grades: B, B-, C- and ungraded pollsters all appear among the top rows. In this table, a pollster’s grade does not obviously track how far its polls were from the actual election results.
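
One caveat when averaging bias within a group is that positive and negative errors can cancel. The following is a minimal sketch with made-up numbers (the data frame toy and its values are hypothetical, not taken from merged_df) that contrasts the signed average with the average absolute bias.

```r
# Hypothetical biases for four polls (not taken from the merged data).
toy <- data.frame(
  grade = c("A", "A", "B", "B"),
  bias  = c(0.04, -0.04, 0.01, 0.01)
)

# Signed average: positive and negative errors cancel within a grade.
signed_avg <- tapply(toy$bias, toy$grade, mean)

# Average absolute bias: measures the typical size of the miss.
abs_avg <- tapply(abs(toy$bias), toy$grade, mean)

print(rbind(signed_avg, abs_avg))
```

In this toy data, grade A looks unbiased on the signed average even though each of its polls missed by more than the grade B polls did. Taking the absolute value before averaging, rather than after, avoids that cancellation.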

Margins of Error

The variance of a binary random variable, X, is equal to $P(X=1)\times(1-P(X=1))$. We saw in class that the standard error of a mean is given by:

$$SE(\bar{X}) = \sqrt{\frac{\sigma^2}{N}}$$
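
The variance of a binary variable follows in one line from the definition of variance. Writing $p = P(X=1)$ and using the fact that $X^2 = X$ when X only takes the values 0 and 1:

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$$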

Since support for Clinton is a binary variable, the standard error of a poll asking Clinton’s vote share is given by:

$$SE(\bar{X}) = \sqrt{\frac{p(1-p)}{N}}$$
where p is the true level of support for Clinton.

Since p is unknown at the time of the poll, the estimated standard error is:

$$\widehat{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$

Use rawpoll_clinton (already divided by 100 when the data were loaded) and the sample size to calculate the standard error and a 95% confidence interval around the predicted vote share, with the interval’s lower and upper bounds being $\hat{p} - 1.96\times SE(\hat{p})$ and $\hat{p} + 1.96\times SE(\hat{p})$, respectively.

Answer

merged_df <- merged_df %>% 
  mutate(se = sqrt((rawpoll_clinton*(1-rawpoll_clinton))/samplesize),
         l_CI = rawpoll_clinton-1.96*se,
         u_CI = rawpoll_clinton+1.96*se)
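
To make the formula concrete, here is a worked example for a single hypothetical poll. The 45% share and the sample size of 1,000 are assumptions chosen for illustration, not values from the data.

```r
# Hypothetical poll: 45% support for Clinton among 1,000 respondents.
p_hat <- 0.45
n     <- 1000

se   <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
l_ci <- p_hat - 1.96 * se              # lower bound of the 95% CI
u_ci <- p_hat + 1.96 * se              # upper bound of the 95% CI

print(round(c(se = se, lower = l_ci, upper = u_ci), 4))
```

With these numbers the standard error is about 0.016, so the margin of error is roughly 3 percentage points on either side of the poll’s estimate.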

For what percentage of polls was Clinton’s true vote share outside of the 95% confidence interval around the predicted vote share? (Remember that the standard error is based on a probability ranging from 0 to 1 but shares are sometimes 0 to 100.)

Answer

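# TRUE when the actual share falls outside the poll's 95% confidence interval
merged_df <- merged_df %>% 
  mutate(outside_CI = clinton_share < l_CI | clinton_share > u_CI)

# Share of polls whose interval misses the actual vote share
mean(merged_df$outside_CI,na.rm=TRUE)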
## [1] 0.4418296
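
As a benchmark for the question above, the sketch below simulates what the miss rate would look like if random sampling were the only source of error. The true share, the sample size, the number of polls and the seed are all arbitrary assumptions, and the simulated polls are unbiased by construction, which real polls need not be.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

p_true  <- 0.48    # hypothetical true vote share
n       <- 1000    # hypothetical sample size per poll
n_polls <- 10000   # number of simulated polls

# Each simulated poll is an unbiased random sample of n voters.
p_hat <- rbinom(n_polls, size = n, prob = p_true) / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# TRUE when the true share falls outside the poll's 95% interval.
outside   <- p_true < p_hat - 1.96 * se | p_true > p_hat + 1.96 * se
miss_rate <- mean(outside)
print(miss_rate)
```

Under these assumptions the miss rate should land near 5%. A much higher miss rate in actual polls would suggest error beyond random sampling, such as systematic bias in who responds.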