Regression Analysis

The data in this analysis come from the National Longitudinal Survey of Youth 1979 (NLSY79) and are used to measure the relationship between income and years of education.

The data file nlsy79.csv includes this information along with some additional variables used in later analyses. The variables are described in the table below.

| Name | Description |
|---|---|
| CASEID | Unique identifier |
| earn2009 | Earnings in 2009 |
| hgc | Years of education (highest grade completed) |
| race | Race and Ethnicity |
| sex | Gender |
| bmonth | Birth Month |
| byear | Birth Year |
| afqt | Armed Forces Qualifying Test Percentile |
| region_1979 | Region |
| faminc1978 | Family Income in 1978 |
| nsibs79 | Number of Siblings |

Regressing earnings on years of education to determine the average increase in earnings for every additional year of schooling.

library(readr)
library(ggplot2)
library(dplyr)
# Note: setwd() points at a machine-specific folder; adjust the path as needed
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
nlsy79 <- read_csv("nlsy79.csv")
## Rows: 6110 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): race, sex
## dbl (9): CASEID, earn2009, bmonth, byear, afqt, region_1979, faminc1978, hgc...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Note: this filtered result is printed but not assigned back to nlsy79,
# so the regressions below still use the full sample (hence the
# "observations deleted due to missingness" notes in the output)
nlsy79 %>% filter(earn2009 > 0, !is.na(earn2009))
reg_1 <- lm(earn2009~hgc, data=nlsy79)
summary(reg_1)
## 
## Call:
## lm(formula = earn2009 ~ hgc, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -116311  -35064   -9092   16297  308252 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -82664.1     4719.5  -17.52   <2e-16 ***
## hgc           9948.8      349.7   28.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57060 on 4332 degrees of freedom
##   (1776 observations deleted due to missingness)
## Multiple R-squared:  0.1575, Adjusted R-squared:  0.1573 
## F-statistic: 809.6 on 1 and 4332 DF,  p-value: < 2.2e-16
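As a sanity check on the fitted line, the coefficients above imply predicted 2009 earnings of about −82,664 + 9,949 × hgc. A quick sketch using the printed (rounded) estimates, so the results are approximate:

```r
# Predicted 2009 earnings from the fitted line, using the printed
# (rounded) coefficient estimates from reg_1.
b0 <- -82664.1
b1 <- 9948.8
predict_earn <- function(hgc) b0 + b1 * hgc

predict_earn(12)  # about 36,722 for a high-school graduate
predict_earn(16)  # about 76,517 for a college graduate
```
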

Using ggplot2 to plot the conditional expectation of earnings given years of education.

ggplot(nlsy79, aes(x = hgc, y = earn2009)) +
  geom_point(stat = "summary", fun = "mean") +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", color = "green", se = FALSE) +
  xlab("Years of Education") + ylab("Income") +
  theme_bw(base_size = 24)

Comparing the linear regression (in blue) with, for example, a loess regression (in green), the return does not seem to be more exponential than linear.

Creating a variable that equals years of education squared, then regressing earnings on years of education and years of education squared. How much do earnings increase for someone who gets 10 instead of 9 years of schooling? What about someone who gets 17 instead of 16?

nlsy79$hgc_sq <- nlsy79$hgc^2
reg_2 <- lm(data = nlsy79, earn2009~hgc+hgc_sq)
summary(reg_2)
## 
## Call:
## lm(formula = earn2009 ~ hgc + hgc_sq, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -140661  -33628   -9090   15742  310794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47122.7    16206.1   2.908  0.00366 ** 
## hgc          -9711.8     2375.7  -4.088 4.43e-05 ***
## hgc_sq         719.4       86.0   8.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56610 on 4331 degrees of freedom
##   (1776 observations deleted due to missingness)
## Multiple R-squared:  0.1709, Adjusted R-squared:  0.1705 
## F-statistic: 446.2 on 2 and 4331 DF,  p-value: < 2.2e-16
The marginal benefit of schooling is the derivative of the fitted quadratic with respect to years of education: −9711.8 + 2(719.4) × hgc. Evaluated at hgc = 9 this gives roughly $3,237 per additional year, and at hgc = 16 roughly $13,309. The exact predicted changes are slightly larger: −9711.8 + 719.4(10² − 9²) ≈ $3,957 for going from 9 to 10 years, and −9711.8 + 719.4(17² − 16²) ≈ $14,028 for going from 16 to 17 years.
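The predicted change between any two education levels can be computed directly from the printed (rounded) coefficients; a quick sketch:

```r
# Change in predicted earnings implied by the quadratic fit,
# using the printed (rounded) estimates from reg_2.
b1 <- -9711.8   # coefficient on hgc
b2 <- 719.4     # coefficient on hgc_sq
delta <- function(from, to) b1 * (to - from) + b2 * (to^2 - from^2)

delta(9, 10)   # exact predicted change for 9 -> 10 years: 3,956.8
delta(16, 17)  # exact predicted change for 16 -> 17 years: 14,028.4
```
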

Coding years of education as a factor, then regressing earnings on years of education.

nlsy79 <- nlsy79 %>%
  mutate(hgc_factor = factor(hgc))
reg_3 <- lm(data = nlsy79, earn2009~hgc_factor-1)
summary(reg_3)
## 
## Call:
## lm(formula = earn2009 ~ hgc_factor - 1, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -164578  -33580   -8613   15686  309851 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## hgc_factor0     12150      39907   0.304 0.760798    
## hgc_factor1         0      56438   0.000 1.000000    
## hgc_factor3     19882      56438   0.352 0.724648    
## hgc_factor5         0      39907   0.000 1.000000    
## hgc_factor6     17673      18813   0.939 0.347576    
## hgc_factor7      9770      10666   0.916 0.359693    
## hgc_factor8     20708       7000   2.958 0.003111 ** 
## hgc_factor9     17720       5508   3.217 0.001304 ** 
## hgc_factor10    19041       5008   3.802 0.000145 ***
## hgc_factor11    22701       5089   4.461 8.37e-06 ***
## hgc_factor12    35122       1298  27.055  < 2e-16 ***
## hgc_factor13    45874       2914  15.740  < 2e-16 ***
## hgc_factor14    48245       2794  17.267  < 2e-16 ***
## hgc_factor15    53461       4394  12.168  < 2e-16 ***
## hgc_factor16    81130       2278  35.620  < 2e-16 ***
## hgc_factor17    74130       4770  15.541  < 2e-16 ***
## hgc_factor18    97361       4548  21.408  < 2e-16 ***
## hgc_factor19   120435       6947  17.336  < 2e-16 ***
## hgc_factor20   164578       7348  22.399  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56440 on 4315 degrees of freedom
##   (1776 observations deleted due to missingness)
## Multiple R-squared:  0.4963, Adjusted R-squared:  0.4941 
## F-statistic: 223.8 on 19 and 4315 DF,  p-value: < 2.2e-16
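Because the intercept is suppressed with `- 1`, each `hgc_factor` coefficient is simply the mean of earnings within that education level. A minimal sketch with simulated (not NLSY79) data illustrates the equivalence:

```r
# With y ~ f - 1 for a factor f, the coefficients are the group means.
set.seed(42)
g <- factor(sample(c("low", "high"), 100, replace = TRUE))
y <- rnorm(100, mean = ifelse(g == "low", 10, 20))
fit <- lm(y ~ g - 1)

# Coefficients (in factor-level order) match the per-group means:
all.equal(unname(coef(fit)), unname(tapply(y, g, mean)))  # TRUE
```
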

Regressing the natural logarithm of earnings on an indicator variable for being male to estimate the difference in log earnings between men and women.

nlsy79 <- nlsy79 %>%
  mutate(log_ind = ifelse(earn2009>0,log(earn2009),NA),male = (sex=="MALE"))
reg_4 <- lm(data = nlsy79, log_ind~male)
summary(reg_4)
## 
## Call:
## lm(formula = log_ind ~ male, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6544 -0.4070  0.1103  0.5964  2.4247 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.32655    0.02421  426.52   <2e-16 ***
## maleTRUE     0.52160    0.03439   15.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 3557 degrees of freedom
##   (2551 observations deleted due to missingness)
## Multiple R-squared:  0.06074,    Adjusted R-squared:  0.06048 
## F-statistic:   230 on 1 and 3557 DF,  p-value: < 2.2e-16
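With a log outcome and a dummy regressor, the coefficient is a gap in log points, not an exact percentage; the implied percentage difference is exp(b) − 1. A quick sketch using the printed estimate:

```r
# Converting the log-points gap to an approximate percentage difference.
b <- 0.52160   # maleTRUE coefficient from reg_4
exp(b) - 1     # about 0.685: men earn roughly 68.5% more, not 52%
```
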

Regressing log earnings on years of education and an indicator for being male. Next, consider coding the indicator for being female instead. Compare the estimated returns to education across specifications. What do you notice?

reg_5 <- lm(data = nlsy79, log_ind~hgc+male)
summary(reg_5)
## 
## Call:
## lm(formula = log_ind ~ hgc + male, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4606 -0.3248  0.1363  0.5648  2.6467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.372145   0.093144   89.88   <2e-16 ***
## hgc         0.144366   0.006665   21.66   <2e-16 ***
## maleTRUE    0.533960   0.032533   16.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.963 on 3503 degrees of freedom
##   (2604 observations deleted due to missingness)
## Multiple R-squared:  0.1718, Adjusted R-squared:  0.1713 
## F-statistic: 363.3 on 2 and 3503 DF,  p-value: < 2.2e-16

For comparison, regressing log earnings on an indicator for being female instead. The coefficient on female is the exact mirror image of the male coefficient from reg_4 (−0.5216 versus 0.5216), and the residuals and R-squared are identical: the two codings are equivalent reparameterizations of the same model.

nlsy79 <- nlsy79 %>%
  mutate(female = (sex=="FEMALE"))
reg_6 <- lm(data = nlsy79, log_ind~female)
summary(reg_6)
## 
## Call:
## lm(formula = log_ind ~ female, data = nlsy79)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6544 -0.4070  0.1103  0.5964  2.4247 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.84815    0.02442  444.18   <2e-16 ***
## femaleTRUE  -0.52160    0.03439  -15.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 3557 degrees of freedom
##   (2551 observations deleted due to missingness)
## Multiple R-squared:  0.06074,    Adjusted R-squared:  0.06048 
## F-statistic:   230 on 1 and 3557 DF,  p-value: < 2.2e-16
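The equivalence of the two codings holds in general: swapping a 0/1 indicator flips the slope's sign and shifts the intercept by the same amount, while the fit is unchanged. A minimal sketch with simulated (not NLSY79) data:

```r
# Swapping a 0/1 indicator's coding flips the slope's sign and moves
# the intercept by exactly that amount; R-squared is unchanged.
set.seed(1)
n <- 200
male <- rbinom(n, 1, 0.5)
y <- 10 + 0.5 * male + rnorm(n)
m_male   <- lm(y ~ male)
m_female <- lm(y ~ I(1 - male))

coef(m_male)    # intercept = baseline (female), slope = male gap
coef(m_female)  # intercept = baseline (male), slope = -1 * male gap
all.equal(summary(m_male)$r.squared, summary(m_female)$r.squared)  # TRUE
```
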