Regression Analysis
Kyle Brewster
This analysis uses data from the National Longitudinal Survey of Youth 1979 (NLSY79) to measure the relationship between earnings and years of education.
The data file nlsy79.csv includes this information and some additional variables we may use in future analyses. The variables in the data are described in the table below.
Name | Description |
---|---|
CASEID | Unique identifier |
earn2009 | Earnings in 2009 |
hgc | Years of education |
race | Race and Ethnicity |
sex | Gender |
bmonth | Birth Month |
byear | Birth Year |
afqt | Armed Forces Qualifying Test Percentile |
region_1979 | Region |
faminc1978 | Family Income in 1978 |
nsibs79 | Number of Siblings |
Regressing earnings on years of education to determine the average increase in earnings for every additional year of schooling.
library(readr)
library(ggplot2)
library(dplyr)
setwd("C:/Users/Kyle/Desktop/FALL 21 - UO/Econometrics")
nlsy79 <- read_csv("nlsy79.csv")
## Rows: 6110 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): race, sex
## dbl (9): CASEID, earn2009, bmonth, byear, afqt, region_1979, faminc1978, hgc...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
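# Preview the rows with positive, non-missing 2009 earnings.
# The filtered result is not assigned, so the models below still use the full
# nlsy79 data; lm() drops rows with missing values on its own.
nlsy79 %>% filter(!is.na(earn2009) & earn2009 > 0)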
reg_1 <- lm(earn2009~hgc, data=nlsy79)
summary(reg_1)
##
## Call:
## lm(formula = earn2009 ~ hgc, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116311 -35064 -9092 16297 308252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -82664.1 4719.5 -17.52 <2e-16 ***
## hgc 9948.8 349.7 28.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57060 on 4332 degrees of freedom
## (1776 observations deleted due to missingness)
## Multiple R-squared: 0.1575, Adjusted R-squared: 0.1573
## F-statistic: 809.6 on 1 and 4332 DF, p-value: < 2.2e-16
Using graphing functionality to plot the conditional expectation of earnings with respect to years of education.
ggplot(nlsy79, aes(x = hgc, y = earn2009)) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  geom_point(stat = "summary", fun = "mean") +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", color = "green", se = FALSE) +
  xlab("Years of Education") + ylab("Income") +
  theme_bw(base_size = 24)
# Comparing the linear fit (in blue) with a loess fit (in green), the return to education seems to be more exponential than linear.
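The points in the plot are sample means of earnings at each level of schooling. As a minimal sketch, the same conditional expectation can be computed directly with dplyr. The data frame `toy` and its values are hypothetical stand-ins, not the NLSY79 data.

```r
library(dplyr)

# Hypothetical toy data standing in for nlsy79 (values are made up)
toy <- tibble(
  hgc      = c(12, 12, 12, 16, 16, 18),
  earn2009 = c(30000, 40000, NA, 70000, 90000, 100000)
)

# Conditional mean of earnings at each level of schooling,
# after dropping rows with missing earnings
cond_mean <- toy %>%
  filter(!is.na(earn2009)) %>%
  group_by(hgc) %>%
  summarise(mean_earn = mean(earn2009), n = n())

print(cond_mean)
```

Swapping `toy` for `nlsy79` should give the table of means behind the plotted points, along with the number of respondents at each level of schooling.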
Creating a variable that equals years of education squared, then regressing earnings on years of education and years of education squared. How much do earnings increase for someone who gets 10 instead of 9 years of schooling? What about someone who gets 17 instead of 16?
nlsy79$hgc_sq <- nlsy79$hgc^2
reg_2 <- lm(data = nlsy79, earn2009~hgc+hgc_sq)
summary(reg_2)
##
## Call:
## lm(formula = earn2009 ~ hgc + hgc_sq, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140661 -33628 -9090 15742 310794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47122.7 16206.1 2.908 0.00366 **
## hgc -9711.8 2375.7 -4.088 4.43e-05 ***
## hgc_sq 719.4 86.0 8.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56610 on 4331 degrees of freedom
## (1776 observations deleted due to missingness)
## Multiple R-squared: 0.1709, Adjusted R-squared: 0.1705
## F-statistic: 446.2 on 2 and 4331 DF, p-value: < 2.2e-16
# The marginal benefit is the partial derivative of the regression equation with respect to the variable of interest (here, years of education): (hgc coefficient) + 2 * (hgc_sq coefficient) * hgc. Evaluating it at the starting level of schooling, with the estimates from the regression that includes hgc^2, gives approximately
## -9711.8 + 2(719.4*9) = 3,237.40 for hgc from 9 to 10
## -9711.8 + 2(719.4*16) = 13,309.00 for hgc from 16 to 17
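The derivative is a local approximation to the effect of one more year of schooling. The sketch below compares it with the exact one-year change in predicted earnings, using the coefficients copied from the `summary(reg_2)` output. The helper names `marginal` and `one_year_change` are made up for illustration.

```r
# Coefficients copied from the summary(reg_2) output
b_hgc    <- -9711.8
b_hgc_sq <- 719.4

# Derivative of the fitted quadratic with respect to hgc
marginal <- function(h) b_hgc + 2 * b_hgc_sq * h

# Exact change in predicted earnings when moving from h to h + 1 years
one_year_change <- function(h) {
  b_hgc * ((h + 1) - h) + b_hgc_sq * ((h + 1)^2 - h^2)
}

marginal(c(9, 16))         # derivative evaluated at 9 and 16 years
one_year_change(c(9, 16))  # discrete change for 9 -> 10 and 16 -> 17
```

The discrete changes come to 3,956.80 and 14,028.40. Each exceeds the derivative by exactly the `hgc_sq` coefficient, because (h + 1)^2 - h^2 = 2h + 1.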
Coding years of education as a factor, then regressing earnings on years of education.
nlsy79 <- nlsy79 %>%
  mutate(hgc_factor = factor(hgc))
reg_3 <- lm(data = nlsy79, earn2009~hgc_factor-1)
summary(reg_3)
##
## Call:
## lm(formula = earn2009 ~ hgc_factor - 1, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -164578 -33580 -8613 15686 309851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## hgc_factor0 12150 39907 0.304 0.760798
## hgc_factor1 0 56438 0.000 1.000000
## hgc_factor3 19882 56438 0.352 0.724648
## hgc_factor5 0 39907 0.000 1.000000
## hgc_factor6 17673 18813 0.939 0.347576
## hgc_factor7 9770 10666 0.916 0.359693
## hgc_factor8 20708 7000 2.958 0.003111 **
## hgc_factor9 17720 5508 3.217 0.001304 **
## hgc_factor10 19041 5008 3.802 0.000145 ***
## hgc_factor11 22701 5089 4.461 8.37e-06 ***
## hgc_factor12 35122 1298 27.055 < 2e-16 ***
## hgc_factor13 45874 2914 15.740 < 2e-16 ***
## hgc_factor14 48245 2794 17.267 < 2e-16 ***
## hgc_factor15 53461 4394 12.168 < 2e-16 ***
## hgc_factor16 81130 2278 35.620 < 2e-16 ***
## hgc_factor17 74130 4770 15.541 < 2e-16 ***
## hgc_factor18 97361 4548 21.408 < 2e-16 ***
## hgc_factor19 120435 6947 17.336 < 2e-16 ***
## hgc_factor20 164578 7348 22.399 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56440 on 4315 degrees of freedom
## (1776 observations deleted due to missingness)
## Multiple R-squared: 0.4963, Adjusted R-squared: 0.4941
## F-statistic: 223.8 on 19 and 4315 DF, p-value: < 2.2e-16
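Because the intercept is dropped (`- 1`), each coefficient in this regression is the mean of `earn2009` among respondents with that many years of education. The sketch below checks that equivalence on a small hypothetical data frame (`toy`, with made-up values).

```r
# Hypothetical toy data (values are made up, not from nlsy79)
toy <- data.frame(
  hgc      = c(12, 12, 16, 16, 16, 18),
  earn2009 = c(30000, 40000, 70000, 80000, 90000, 100000)
)
toy$hgc_factor <- factor(toy$hgc)

# No-intercept regression on the factor: one coefficient per level
fit <- lm(earn2009 ~ hgc_factor - 1, data = toy)

# Group means computed directly
group_means <- tapply(toy$earn2009, toy$hgc_factor, mean)

print(unname(coef(fit)))
print(unname(group_means))
```

One caution when comparing fits: for a model without an intercept, `summary()` computes R-squared around zero rather than around the mean of the outcome. That likely explains why the value reported here is so much larger than in the earlier models, and it is not directly comparable to them.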
Regressing the natural logarithm of earnings on an indicator variable for being male to estimate the difference in log earnings between men and women.
nlsy79 <- nlsy79 %>%
  mutate(log_ind = ifelse(earn2009 > 0, log(earn2009), NA),
         male = (sex == "MALE"))
reg_4 <- lm(data = nlsy79, log_ind~male)
summary(reg_4)
##
## Call:
## lm(formula = log_ind ~ male, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6544 -0.4070 0.1103 0.5964 2.4247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.32655 0.02421 426.52 <2e-16 ***
## maleTRUE 0.52160 0.03439 15.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 3557 degrees of freedom
## (2551 observations deleted due to missingness)
## Multiple R-squared: 0.06074, Adjusted R-squared: 0.06048
## F-statistic: 230 on 1 and 3557 DF, p-value: < 2.2e-16
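A coefficient in a log-earnings regression is a gap in log points, which only approximates a percentage gap when it is small. As a sketch, the exact proportional difference implied by the `maleTRUE` estimate can be recovered by exponentiating. The variable names here are illustrative.

```r
# Coefficient on maleTRUE copied from the summary(reg_4) output
b_male <- 0.52160

# Exact proportional difference implied by a log-point gap
pct_gap <- (exp(b_male) - 1) * 100
round(pct_gap, 1)
```

With the reported estimate this comes to roughly 68 percent. Because the model is in logs, that is a statement about geometric mean earnings of men relative to women among respondents with positive earnings.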
Regressing log earnings on years of education and an indicator for being male. Next, regress log earnings on years of education and an indicator for being female. Compare the estimated returns to education from both specifications. What do you notice?
reg_5 <- lm(data = nlsy79, log_ind~hgc+male)
summary(reg_5)
##
## Call:
## lm(formula = log_ind ~ hgc + male, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4606 -0.3248 0.1363 0.5648 2.6467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.372145 0.093144 89.88 <2e-16 ***
## hgc 0.144366 0.006665 21.66 <2e-16 ***
## maleTRUE 0.533960 0.032533 16.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.963 on 3503 degrees of freedom
## (2604 observations deleted due to missingness)
## Multiple R-squared: 0.1718, Adjusted R-squared: 0.1713
## F-statistic: 363.3 on 2 and 3503 DF, p-value: < 2.2e-16
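Swapping the male indicator for a female indicator re-labels the reference group without changing the fit, so the estimated return to education should be identical in both specifications. The sketch below illustrates this on simulated data; the data frame `toy` and the parameter values used to generate it are hypothetical.

```r
set.seed(1)

# Hypothetical simulated data (not the NLSY79 sample)
n   <- 200
toy <- data.frame(
  hgc = sample(8:20, n, replace = TRUE),
  sex = sample(c("MALE", "FEMALE"), n, replace = TRUE)
)
toy$male    <- toy$sex == "MALE"
toy$female  <- toy$sex == "FEMALE"
toy$log_ind <- 8 + 0.1 * toy$hgc + 0.5 * toy$male + rnorm(n)

fit_male   <- lm(log_ind ~ hgc + male,   data = toy)
fit_female <- lm(log_ind ~ hgc + female, data = toy)

# The education coefficient is identical in both specifications
coef(fit_male)["hgc"]
coef(fit_female)["hgc"]

# The indicator coefficient only flips sign
coef(fit_male)["maleTRUE"]
coef(fit_female)["femaleTRUE"]
```

The intercept also shifts by the size of the indicator coefficient, since it now describes the other group.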
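For comparison, regressing log earnings on an indicator for being female instead of male. The coefficient has the same size as in the male-indicator regression but the opposite sign, and the intercept is now the mean log earnings of men (10.32655 + 0.52160 = 10.84815).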
nlsy79 <- nlsy79 %>%
  mutate(female = (sex == "FEMALE"))
reg_6 <- lm(data = nlsy79, log_ind~female)
summary(reg_6)
##
## Call:
## lm(formula = log_ind ~ female, data = nlsy79)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6544 -0.4070 0.1103 0.5964 2.4247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.84815 0.02442 444.18 <2e-16 ***
## femaleTRUE -0.52160 0.03439 -15.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 3557 degrees of freedom
## (2551 observations deleted due to missingness)
## Multiple R-squared: 0.06074, Adjusted R-squared: 0.06048
## F-statistic: 230 on 1 and 3557 DF, p-value: < 2.2e-16