Inferences about relationships

T/F Questions

True/False/Uncertain: P-values tell us the probability that the null hypothesis is true. Justify your answer.

FALSE. A p-value is the probability of getting a result at least as extreme as the observed result if the null hypothesis is true. It is not the probability that the null hypothesis itself is true.
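As an illustration of this definition, the sketch below computes a one-sided p-value for a made-up coin-flip experiment (60 heads in 100 flips, with a fair coin as the null hypothesis). The numbers and variable names are assumptions chosen for the example and are not part of the assignment data.

```r
# Hypothetical experiment: 60 heads in 100 flips; H0: the coin is fair (p = 0.5)
n_flips  <- 100
observed <- 60
p_null   <- 0.5

# P(X >= observed | H0): probability of a result at least as extreme as observed.
# pbinom(q, lower.tail = FALSE) returns P(X > q), hence observed - 1.
p_value <- pbinom(observed - 1, size = n_flips, prob = p_null, lower.tail = FALSE)

# Same quantity obtained by summing the binomial probabilities directly
p_value_check <- sum(dbinom(observed:n_flips, size = n_flips, prob = p_null))

print(p_value)
```

The result is a statement about the data given the null hypothesis; it does not say how likely it is that the coin is fair.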

True/False/Uncertain: A result is important if it is statistically significant. Justify your answer.

FALSE. Statistical significance only means that a result at least as extreme as the one observed would be relatively unlikely if the null hypothesis were true. It says nothing about whether the effect is large enough to matter.
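A small sketch of why significance and importance differ, using made-up numbers: with a very large sample, even a difference of a tenth of a percentage point produces a tiny p-value. The sample size and shares below are assumptions for illustration only.

```r
# Hypothetical: observed share 0.501 vs. a null value of 0.5 in a very large sample
n      <- 1e7
p_null <- 0.5
p_hat  <- 0.501

se_null <- sqrt(p_null * (1 - p_null) / n)   # standard error under H0
z       <- (p_hat - p_null) / se_null        # z statistic
p_value <- 2 * pnorm(-abs(z))                # two-sided p-value

effect_size <- p_hat - p_null                # a tenth of a percentage point
print(c(z = z, p_value = p_value, effect_size = effect_size))
```

The p-value is far below any conventional threshold, yet the estimated effect is too small to matter in most applications.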

The Data

The file polls2016_clean.csv is based on data downloaded from FiveThirtyEight.com, which collected polls from HuffPost Pollster, RealClearPolitics, polling firms, and news reports. Parts of the data have been recoded.

Name	Description
`state`	State
`startdate`	Start Date of Poll
`enddate`	End Date of Poll
`grade`	538’s Grade for Pollster
`samplesize`	Poll’s Sample Size
`population`	`lv` Likely Voters or `rv` Registered Voters
`rawpoll_clinton`	Clinton Share in Poll
`rawpoll_trump`	Trump Share in Poll

The file federalelections2016.xlsx includes results from the 2016 US presidential election. It was downloaded from: https://transition.fec.gov/general/FederalElections2016.shtml.

library(dplyr)
library(readr)
library(readxl)
library(ggplot2)

setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
polls <- read.csv("polls2016_clean.csv")

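# Keep polls of likely (lv) or registered (rv) voters, parse the dates,
# and rescale the candidate shares from 0-100 to 0-1
polls <- polls %>% filter(population %in% c("lv","rv")) %>%
  mutate(startdate=parse_date(startdate,"%m/%d/%Y"),
         enddate=parse_date(enddate,"%m/%d/%Y"),
         rawpoll_clinton = rawpoll_clinton/100,
         rawpoll_trump = rawpoll_trump/100,
         rawpoll_diff = rawpoll_clinton-rawpoll_trump)  # Clinton minus Trump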

# Extensive cleaning of data completed in Microsoft Excel.
votes <- read.csv("votes_clean.csv")

# Each candidate's share uses that candidate's own vote count
votes <- votes %>% 
  mutate(clinton_share = clin_votes/total_votes,
         trump_share = trump_votes/total_votes,
         diff_share = clinton_share-trump_share)


merged_df <- merge(polls, votes, by="state")
dim(merged_df)

## [1] 9054   18
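Because `merge()` keeps only keys found in both tables by default, it can be worth checking whether any polled states were dropped. A minimal sketch on made-up miniature tables (`polls_toy` and `votes_toy` are hypothetical stand-ins, not the real data):

```r
# Made-up miniature versions of the two tables (not the real data)
polls_toy <- data.frame(state = c("OR", "OR", "WA", "XX"),
                        rawpoll_clinton = c(0.48, 0.50, 0.51, 0.40))
votes_toy <- data.frame(state = c("OR", "WA", "CA"),
                        clinton_share = c(0.50, 0.53, 0.62))

# merge() performs an inner join by default: only states in both tables survive
merged_toy <- merge(polls_toy, votes_toy, by = "state")

# States that appear in the polls but have no matching row in the results
unmatched_states <- setdiff(polls_toy$state, votes_toy$state)

print(nrow(merged_toy))
print(unmatched_states)
```

Running the same `setdiff()` check on `polls$state` and `votes$state` would show whether any poll rows were lost in the merge.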

Trends in Polling

The graph shows that Clinton’s polled share of votes was increasing until just before April 2016 (perhaps having to do with the Clinton email scandal), then decreased and remained relatively stable from June 2016 to late September, when it started to increase again. As the election approached, the polled shares for Trump and Clinton were close, with Clinton’s being slightly higher.

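# Mean Clinton share on each poll end date, colored by voter population
ggplot(polls,aes(x=enddate,y=rawpoll_clinton,color=population))+
  geom_point(stat="summary", fun="mean")+  # `fun` replaces the deprecated `fun.y`
  geom_smooth()+                           # using GAM for regression line
  xlab("Date")+ylab("Clinton Share")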

Here `lv` and `rv` denote “likely voters” and “registered voters”.

Polls and Results

Let’s restrict the sample to polls of likely voters that ended on or after September 1st, 2016.

Use plotting functionality to make a scatter plot of the predicted Clinton share `rawpoll_clinton` against Clinton’s actual vote share `clinton_share` in each state. Note that `rawpoll_clinton` was already divided by 100 when the polls were loaded, so it ranges from 0 to 1 rather than 0 to 100.

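merged_df <- merged_df %>%
  # Keep likely-voter polls ending on or after September 1, 2016; compare with
  # a Date, since a bare "9-1-2016" string is not parsed as that date
  filter(population == "lv", enddate >= as.Date("2016-09-01"))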

ggplot(merged_df,aes(x=rawpoll_clinton,y=clinton_share)) +
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  geom_smooth(method = "loess", se=FALSE, size=.5, color="red")+
  xlab("Clinton Polling")+ylab("Clinton Vote Share")

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
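The cutoff in the filter is a calendar date, so it helps to compare `enddate` with a value of class `Date`. A minimal sketch with made-up end dates, assuming the month-day-year ordering used in the text (`cutoff`, `enddate_toy`, and `keep` are illustrative names):

```r
# Parse a month-day-year string explicitly instead of relying on default parsing
cutoff <- as.Date("9-1-2016", format = "%m-%d-%Y")

# Hypothetical poll end dates
enddate_toy <- as.Date(c("2016-08-15", "2016-09-01", "2016-10-20"))

# ">=" keeps polls ending on the cutoff date itself; ">" would drop them
keep <- enddate_toy >= cutoff
print(enddate_toy[keep])
```

With an explicit format the cutoff is unambiguous, and the choice between `>` and `>=` decides whether polls ending exactly on September 1st are retained.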

Bias in Polls

Calculate the average bias in the polls, where bias is defined as:

$Bias=\hat{p}-p$ ,

with $\hat{p}$ the polled Clinton share and $p$ her actual vote share.

Answer

merged_df <- merged_df %>%
  mutate(bias = rawpoll_clinton-clinton_share)
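The definition of bias can be checked by hand on a small made-up example: three polls of one state, all compared with the same actual share. The values below are assumptions for illustration.

```r
# Made-up polls for a single state (shares on the 0-1 scale)
p_hat <- c(0.46, 0.48, 0.44)   # polled Clinton shares
p     <- 0.47                  # actual Clinton share in that state

bias     <- p_hat - p          # per-poll bias: -0.01, 0.01, -0.03
avg_bias <- mean(bias)         # average bias across the three polls

print(avg_bias)
```

A negative average means the polls understated Clinton's support on average; a positive one means they overstated it.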

How does the bias compare with the grades for the pollsters (`grade`) in the dataset?

state_bias <- merged_df %>%
  group_by(state, grade) %>%
  summarise(avg_bias = mean(bias, na.rm=TRUE)) %>%
  arrange(desc(abs(avg_bias))) %>% ungroup()
tibble(state_bias)

## # A tibble: 330 x 3
##    state grade avg_bias
##    <chr> <chr>    <dbl>
##  1 DC    "C-"     0.829
##  2 DC    "B"      0.510
##  3 WY    "B-"    -0.492
##  4 WY    ""      -0.482
##  5 WY    "B"     -0.462
##  6 WY    "C-"    -0.458
##  7 WV    "B-"    -0.445
##  8 WV    "B"     -0.419
##  9 WV    "C-"    -0.407
## 10 AL    "B"     -0.406
## # ... with 320 more rows

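Sorting by the absolute value of the average bias (so that over- and under-estimates are treated alike), pollsters graded B, B-, and C- all appear among the ten largest state-level biases, so in this ranking a better grade does not obviously correspond to a smaller bias. Biases of 0.4 to 0.8 in magnitude are far larger than typical polling errors, so these values are worth re-checking against how `clinton_share` and `trump_share` are computed from the vote counts.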

Margins of Error

The variance of a binary random variable, $X$, is equal to $P(X=1)\times(1-P(X=1))$. We saw in class that the standard error of a mean is given by:

$SE(\bar{X})=\sqrt{\frac{\sigma^2}{N}}$ .

Since support for Clinton is a binary variable, the standard error of a poll estimating Clinton’s vote share is given by:

$SE(\bar{X})=\sqrt{\frac{p(1-p)}{N}}$ ,

where $p$ is the true level of support for Clinton.

Since $p$ is unknown at the time of the poll, the estimated standard error is:

$\hat{SE}(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$ .

Use `rawpoll_clinton` (already divided by 100) and the sample size to calculate the standard error and a 95% confidence interval around the predicted vote share, with the interval’s lower and upper bounds being $\hat{p}-1.96\times \hat{SE}(\hat{p})$ and $\hat{p}+1.96\times \hat{SE}(\hat{p})$ , respectively.

Answer

merged_df <- merged_df %>% 
  mutate(se = sqrt((rawpoll_clinton*(1-rawpoll_clinton))/samplesize),
         l_CI = rawpoll_clinton-1.96*se,
         u_CI = rawpoll_clinton+1.96*se)
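To get a feel for the magnitudes, the same formulas can be applied to a single hypothetical poll. The share of 0.45 and the sample size of 1,000 below are made-up values, not taken from the dataset.

```r
# Hypothetical poll: 45% support for Clinton among 1,000 respondents
p_hat <- 0.45
n     <- 1000

se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
l_CI   <- p_hat - 1.96 * se_hat           # lower bound of the 95% interval
u_CI   <- p_hat + 1.96 * se_hat           # upper bound of the 95% interval

print(round(c(se = se_hat, lower = l_CI, upper = u_CI), 4))
```

The half-width of the interval is roughly three percentage points, which is the margin of error usually quoted for polls of about 1,000 people.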

For what percentage of polls was Clinton’s true vote share outside of the 95% confidence interval around the predicted vote share? (Remember that the standard error is based on a probability ranging from 0 to 1 but shares are sometimes 0 to 100.)

Answer

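merged_df <- merged_df %>% 
  # TRUE when the actual share lies outside [l_CI, u_CI]; the two bounds are
  # joined with | because a comma inside mutate() would create two separate columns
  mutate(outside_CI = clinton_share < l_CI | clinton_share > u_CI)

# Share of polls whose 95% interval misses the actual result
mean(merged_df$outside_CI,na.rm=TRUE)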
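As a benchmark for interpreting the share of misses, the simulation below sketches what the miss rate would be if sampling error were the only source of error. The true share, sample size, and number of simulated polls are assumptions chosen for illustration.

```r
set.seed(1)                       # reproducible draws

p_true  <- 0.48                   # assumed true support (made up)
n       <- 1000                   # respondents per simulated poll
n_polls <- 10000                  # number of simulated polls

# Each poll's share is a binomial count divided by the sample size
p_hat  <- rbinom(n_polls, size = n, prob = p_true) / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

# TRUE when the true share falls outside the poll's 95% interval
outside   <- p_true < p_hat - 1.96 * se_hat | p_true > p_hat + 1.96 * se_hat
miss_rate <- mean(outside)
print(miss_rate)
```

Under pure sampling error the miss rate should be close to 5%, so a much larger share in the real polls would point to error sources the formula ignores, such as nonresponse or changes in opinion between the poll and the election.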

Kyle M. Brewster
