Penalized Regression with LASSO, Ridge, and Elasticnet

Project 002

Part 0: Prep Work

pacman::p_load(dplyr, magrittr,
               glmnet, tidymodels,
               dvmisc, parsnip, tidyverse, tidyr, tune,
               workflows, recipes, yardstick, caret, rsample)
election = read.csv("election-2016.csv")

# Setting seed
set.seed(123)

Part 1: Penalized regression

First, we pre-process the data:

# Processing recipe
elect_recipe = election %>% recipe(i_republican_2016 ~ .) %>% 
  update_role(fips, new_role = "id variable") %>% 
  step_normalize(all_predictors() & all_numeric()) %>% 
  step_dummy(all_predictors() & all_nominal()) %>% 
  step_rename_at(everything(), fn = stringr::str_to_lower)

# Juicing
elect_clean = elect_recipe %>% prep() %>% juice()

# Defining folds
elect_cv = elect_clean %>% vfold_cv(v = 5)

# Setting range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
alphas = seq(from = 0, to = 1, by = 0.1)

01. Using 5-fold cross validation: tune a Lasso regression model

# Creating cross-validated model
lasso_mod = cv.glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   standardize = F, # predictors were already normalized in the recipe
   alpha = 1,       # alpha = 1 gives the pure LASSO penalty
   lambda = lambdas,
   type.measure = "mse",
   nfolds = 5)

# Create data frame of our results
lasso_summary = data.frame(
  lambda = lasso_mod$lambda,
  rmse = sqrt(lasso_mod$cvm))

# Printing various results
summary(lasso_mod)
##            Length Class  Mode     
## lambda     1000   -none- numeric  
## cvm        1000   -none- numeric  
## cvsd       1000   -none- numeric  
## cvup       1000   -none- numeric  
## cvlo       1000   -none- numeric  
## nzero      1000   -none- numeric  
## call          8   -none- call     
## name          1   -none- character
## glmnet.fit   12   elnet  list     
## lambda.min    1   -none- numeric  
## lambda.1se    1   -none- numeric  
## index         2   -none- numeric
plot(lasso_mod)

# Row with the lowest cross-validated RMSE (attained here at the smallest λ in the grid)
lasso_summary %>% 
    group_by(lambda) %>% 
    slice(which.min(rmse)) %>%
    head(., n = 1)
## # A tibble: 1 × 2
## # Groups:   lambda [1]
##   lambda  rmse
##    <dbl> <dbl>
## 1   0.01 0.206
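
For reference, the same tuning could also be run through the tidymodels objects prepared in Part 0 rather than calling cv.glmnet() directly. This is only a sketch of that alternative, not the approach used above; elect_cv_raw, lasso_spec, lasso_wf, and lasso_tuned are illustrative names, and the folds are built on the raw data so the recipe's preprocessing is re-estimated within each fold.

# A sketch (not run): tune the same LASSO penalty with tidymodels
elect_cv_raw = election %>% vfold_cv(v = 5)

lasso_spec = linear_reg(penalty = tune(), mixture = 1) %>% # mixture = 1 is the Lasso
   set_engine("glmnet")

lasso_wf = workflow() %>%
   add_model(lasso_spec) %>%
   add_recipe(elect_recipe)

lasso_tuned = lasso_wf %>%
   tune_grid(
      resamples = elect_cv_raw,
      grid = data.frame(penalty = lambdas),
      metrics = metric_set(rmse))

# Lowest cross-validated RMSE and its penalty
show_best(lasso_tuned, metric = "rmse", n = 1)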

02. What is the penalty for your ‘best’ model?

library(ggplot2)
ggplot(
  data = lasso_summary,
  aes(x = lambda, y = rmse)) +
   geom_line() +
   geom_point(data = lasso_summary %>% filter(rmse == min(rmse)),
              size = 3.5,
              color = "blue") +
   scale_y_continuous("RMSE") +
   scale_x_continuous(expression(lambda), trans = "log10",
                      labels = c("0.1", "10", "1,000", "100,000"),
                      breaks = c(0.1, 10, 1000, 100000))

As shown in the previous question, the best λ (i.e., the penalty) is 0.01, which is the smallest value in our λ grid, with an RMSE of 0.205627.

03. Which metric did you use to define the ‘best’ model? Does it make sense in this setting? Explain your answer.

RMSE was selected to define the ‘best’ model because, as we assess model performance across a large number of λ values, we can see how the associated RMSE changes and choose the λ that minimizes that measure of prediction error.
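
For reference, the RMSE used above is just the square root of the cross-validated mean squared error (cvm) that cv.glmnet() reports:

$$\mathrm{RMSE}(\lambda) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{(-k(i))}(\lambda)\right)^2},$$

where $\hat{y}_i^{(-k(i))}(\lambda)$ is the prediction for county $i$ made by the model fit on the folds that exclude it.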

04. Now tune an elasticnet prediction model.

net_mod = cv.glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   standardize = T,
   alpha = 0.9, # Arbitrary selection, near the LASSO end of the spectrum
   lambda = lambdas,
   type.measure = "mse",
   nfolds = 5)

# Create data frame of our results
net_summary = data.frame(
  lambda = net_mod$lambda,
  rmse = sqrt(net_mod$cvm))

# Printing various results
summary(net_mod)
##            Length Class  Mode     
## lambda     1000   -none- numeric  
## cvm        1000   -none- numeric  
## cvsd       1000   -none- numeric  
## cvup       1000   -none- numeric  
## cvlo       1000   -none- numeric  
## nzero      1000   -none- numeric  
## call          8   -none- call     
## name          1   -none- character
## glmnet.fit   12   elnet  list     
## lambda.min    1   -none- numeric  
## lambda.1se    1   -none- numeric  
## index         2   -none- numeric
plot(net_mod)

net_summary %>% 
    group_by(lambda) %>% 
    slice(which.min(rmse)) %>%
    head(., n = 1)
## # A tibble: 1 × 2
## # Groups:   lambda [1]
##   lambda  rmse
##    <dbl> <dbl>
## 1   0.01 0.205
net_mod2 = cv.glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   standardize = T,
   alpha = 0.1, # Arbitrary selection, near the Ridge end of the spectrum
   lambda = lambdas,
   type.measure = "mse",
   nfolds = 5)

# Create data frame of our results
net_summary2 = data.frame(
  lambda = net_mod2$lambda,
  rmse = sqrt(net_mod2$cvm))

# Printing various results
summary(net_mod2)
##            Length Class  Mode     
## lambda     1000   -none- numeric  
## cvm        1000   -none- numeric  
## cvsd       1000   -none- numeric  
## cvup       1000   -none- numeric  
## cvlo       1000   -none- numeric  
## nzero      1000   -none- numeric  
## call          8   -none- call     
## name          1   -none- character
## glmnet.fit   12   elnet  list     
## lambda.min    1   -none- numeric  
## lambda.1se    1   -none- numeric  
## index         2   -none- numeric
plot(net_mod2)

net_summary2 %>% 
    group_by(lambda) %>% 
    slice(which.min(rmse)) %>%
    head(., n = 1)
## # A tibble: 1 × 2
## # Groups:   lambda [1]
##   lambda  rmse
##    <dbl> <dbl>
## 1   0.01 0.221

The two elasticnet models, with α values of 0.9 and 0.1, gave RMSEs of 0.2049942 and 0.2241721, respectively, compared with the LASSO model’s RMSE of 0.205627.
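
For completeness, the α grid defined in Part 0 (alphas) was only sampled at two arbitrary values above. Below is a sketch of searching the whole grid with cv.glmnet() (not run; x_mat and net_grid are illustrative names, and for a strictly fair comparison across α the same fold assignments should be reused via cv.glmnet()'s foldid argument).

# A sketch (not run): search the full α grid from Part 0
x_mat = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix()

net_grid = lapply(alphas, function(a) {
   fit = cv.glmnet(
      x = x_mat,
      y = elect_clean$i_republican_2016,
      alpha = a,
      lambda = lambdas,
      type.measure = "mse",
      nfolds = 5)
   # Best λ and RMSE for this α
   data.frame(alpha = a, lambda = fit$lambda.min, rmse = sqrt(min(fit$cvm)))
}) %>% bind_rows()

# Best (α, λ) pair overall
net_grid %>% arrange(rmse) %>% head(1)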

05. What do the chosen hyperparameters for the elasticnet tell you about the Ridge vs. Lasso in this setting?

Since the elasticnet’s RMSE is lower (better) when the α value is closer to 1, and the near-Lasso fits (the LASSO itself and the α = 0.9 elasticnet) clearly outperform the near-Ridge fit (α = 0.1), we can be fairly confident that a Lasso-style penalty suits this setting better than a Ridge-style one; the elasticnet results point in the same direction.
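
One concrete way to see the Ridge-vs-Lasso contrast is sparsity: at its best λ the Lasso zeroes out many coefficients, while Ridge shrinks them without dropping any. A sketch (not run; ridge_mod is an illustrative object that was not fit above):

# A sketch (not run): compare how many coefficients each penalty keeps
ridge_mod = cv.glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   alpha = 0, # alpha = 0 is the pure Ridge penalty
   lambda = lambdas,
   type.measure = "mse",
   nfolds = 5)

# Nonzero coefficients at each model's cross-validated lambda.min
sum(coef(lasso_mod, s = "lambda.min") != 0) # Lasso keeps a sparse subset
sum(coef(ridge_mod, s = "lambda.min") != 0) # Ridge keeps essentially everything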

Part 2: Logistic regression

06. Now fit a logistic regression (logistic_reg() in tidymodels) model—using 5-fold cross validation to get a sense of your model’s performance (record the following metrics: accuracy, precision, specificity, sensitivity, ROC AUC).

Hint: You can tell tune_grid() or fit_resamples() which metrics to collect via the metrics argument. You’ll want to give the argument a metric_set().

election2 <- election %>% select(-c("county")) %>%
   mutate(state = as.factor(state),
          i_republican_2012 = as.factor(i_republican_2012),
          i_republican_2016 = as.factor(i_republican_2016))

# Processing recipe
elect_recipe = election2 %>%
   recipe(i_republican_2016 ~ .) %>% 
  update_role(fips, new_role = "id variable") %>% 
  step_normalize(all_predictors() & all_numeric()) %>% 
  step_dummy(all_predictors() & all_nominal()) %>% 
  step_rename_at(everything(), fn = stringr::str_to_lower)
elect_clean = elect_recipe %>% prep() %>% juice()
elect_cv = elect_clean %>% vfold_cv(v = 5)

# Defining model
mod_logi = logistic_reg(mode = "classification") %>% 
   set_engine("glm")

# Fitting
# Note: fit_xy() fits the model once on the full processed data;
# it does not itself use the CV folds defined above.
lm_form_fit <- mod_logi %>%
   fit_xy(
      x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
      y = elect_clean$i_republican_2016,
      control = control_parsnip())  
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Making predictions
elect_clean$preds = predict(lm_form_fit, elect_clean)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
temp = elect_clean %>% select(preds) %>% 
   mutate(new_c = as.factor(elect_clean$i_republican_2016),
          preds = as.factor(if_else(preds==1,1,0))) %>% as.data.frame()
   
# Confusion matrix and other metrics
confusionMatrix(temp$preds, temp$new_c)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  463   34
##          1   29 2590
##                                           
##                Accuracy : 0.9798          
##                  95% CI : (0.9742, 0.9844)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9243          
##                                           
##  Mcnemar's Test P-Value : 0.6143          
##                                           
##             Sensitivity : 0.9411          
##             Specificity : 0.9870          
##          Pos Pred Value : 0.9316          
##          Neg Pred Value : 0.9889          
##              Prevalence : 0.1579          
##          Detection Rate : 0.1486          
##    Detection Prevalence : 0.1595          
##       Balanced Accuracy : 0.9640          
##                                           
##        'Positive' Class : 0               
## 
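
The confusion matrix above is computed from in-sample predictions on the full processed data. The cross-validated metrics the prompt asks for could be collected the tidymodels way, as the hint suggests; a sketch follows (not run; logi_wf and logi_cv_fit are illustrative names, with folds built on the raw factor data so the recipe is applied within each fold).

# A sketch (not run): 5-fold CV metrics via fit_resamples()
logi_wf = workflow() %>%
   add_model(mod_logi) %>%
   add_recipe(elect_recipe)

logi_cv_fit = logi_wf %>%
   fit_resamples(
      resamples = election2 %>% vfold_cv(v = 5),
      metrics = metric_set(accuracy, precision, specificity, sensitivity, roc_auc))

# Average of each metric across the five folds
collect_metrics(logi_cv_fit)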

07. What is the cross-validated accuracy of this logistic model?

The accuracy computed above is 97.98 percent.

08. Is your accuracy “good”? Explain your answer—including a comparison to the null classifier.

I would say this is a fairly good model: accuracy and most of the other reported metrics are above 90 percent, and the 95% confidence interval around the accuracy is both high and narrow. Compared to the null classifier, which always predicts the majority class (a Republican win) and would therefore be right about 84.2% of the time (the ‘No Information Rate’ above), the model’s 97.98% accuracy is a clear improvement.
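
As a quick check, the null classifier's accuracy is just the share of counties in the majority class, which is what the 'No Information Rate' above reports; a one-line sketch using the raw data loaded in Part 0:

# Share of counties won by the Republican candidate in 2016 --
# the accuracy of a null classifier that always predicts 'Republican'
mean(election$i_republican_2016 == 1)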

Part 3: Logistic Lasso

10. Now fit a logistic Lasso regression (logistic_reg() in tidymodels, but now tuning the penalty) model—using 5-fold cross validation. Again: record the following metrics: accuracy, precision, specificity, sensitivity, ROC AUC.
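
One way to do this directly in tidymodels, mirroring the fit_resamples() sketch from Part 2 but now with the penalty tuned, is sketched below; the glmnet-based approach actually used follows it. logit_lasso_spec and logit_lasso_cv are illustrative names.

# A sketch (not run): tune a logistic Lasso (mixture = 1) over the penalty grid
logit_lasso_spec = logistic_reg(penalty = tune(), mixture = 1) %>%
   set_engine("glmnet")

logit_lasso_cv = workflow() %>%
   add_model(logit_lasso_spec) %>%
   add_recipe(elect_recipe) %>%
   tune_grid(
      resamples = election2 %>% vfold_cv(v = 5),
      grid = data.frame(penalty = lambdas),
      metrics = metric_set(accuracy, precision, specificity, sensitivity, roc_auc))

# Best penalty by cross-validated ROC AUC
show_best(logit_lasso_cv, metric = "roc_auc", n = 1)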

election2 <- election %>% select(-c("county")) %>%
   mutate(state = as.factor(state),
          i_republican_2012 = as.factor(i_republican_2012),
          i_republican_2016 = as.factor(i_republican_2016))

# Processing recipe
elect_recipe = election2 %>%
   recipe(i_republican_2016 ~ .) %>% 
  update_role(fips, new_role = "id variable") %>% 
  step_normalize(all_predictors() & all_numeric()) %>% 
  step_dummy(all_predictors() & all_nominal()) %>% 
  step_rename_at(everything(), fn = stringr::str_to_lower)
elect_clean = elect_recipe %>% prep() %>% juice()
elect_cv = elect_clean %>% vfold_cv(v = 5)

lasso_mod_fin = cv.glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   standardize = F,
   family = "binomial",
   alpha = 1,
   lambda = lambdas,
   type.measure = "mse",
   nfolds = 5)

# Refit at the cross-validated lambda.min to inspect the coefficients
mode1 <- glmnet(
   x = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix(),
   y = elect_clean$i_republican_2016,
   alpha = 1, family = "binomial",
   lambda = lasso_mod_fin$lambda.min)
head(coef(mode1), n = 15)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                 s0
## (Intercept)              0.5078591
## n_votes_republican_2012  .        
## n_votes_democrat_2012    .        
## n_votes_other_2012       .        
## n_votes_total_2012       .        
## pop                      .        
## pop_pct_change           .        
## pop_pct_below18          .        
## pop_pct_above65          .        
## pop_pct_female           .        
## pop_pct_asian            .        
## pop_pct_black           -0.4662835
## pop_pct_native           .        
## pop_pct_pacific          .        
## pop_pct_white            0.3988517
plot(lasso_mod_fin)

# Making predictions
elect_clean$pred_final = predict(mode1,type = "response",
               s = lasso_mod_fin$lambda.min,
               newx = elect_clean %>% select(-i_republican_2016, -fips) %>% as.matrix())

lassss = elect_clean %>% select(pred_final,i_republican_2016) %>%
   mutate( # Using different cutoff values to see variation
      sm_conf = as.factor(
         if_else(pred_final<0.2,0,1)),
      fifty_fifty = as.factor(
         if_else(pred_final<0.5,0,1)),
      bg_conf = as.factor(
         if_else(pred_final<0.8,0,1)))

# Creating confusion matrices
# Note: caret::confusionMatrix(data, reference) expects the predictions first;
# here the truth (i_republican_2016) is passed as `data` and the predicted class
# as `reference`, so the values labeled Sensitivity/Specificity are actually the
# positive/negative predictive values and vice versa (overall accuracy is unaffected).
confusionMatrix(lassss$i_republican_2016,
                lassss$sm_conf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  293  199
##          1    7 2617
##                                           
##                Accuracy : 0.9339          
##                  95% CI : (0.9246, 0.9424)
##     No Information Rate : 0.9037          
##     P-Value [Acc > NIR] : 1.141e-09       
##                                           
##                   Kappa : 0.7046          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.97667         
##             Specificity : 0.92933         
##          Pos Pred Value : 0.59553         
##          Neg Pred Value : 0.99733         
##              Prevalence : 0.09628         
##          Detection Rate : 0.09403         
##    Detection Prevalence : 0.15789         
##       Balanced Accuracy : 0.95300         
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(lassss$i_republican_2016,
                lassss$fifty_fifty)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  441   51
##          1   53 2571
##                                           
##                Accuracy : 0.9666          
##                  95% CI : (0.9597, 0.9726)
##     No Information Rate : 0.8415          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8747          
##                                           
##  Mcnemar's Test P-Value : 0.9219          
##                                           
##             Sensitivity : 0.8927          
##             Specificity : 0.9805          
##          Pos Pred Value : 0.8963          
##          Neg Pred Value : 0.9798          
##              Prevalence : 0.1585          
##          Detection Rate : 0.1415          
##    Detection Prevalence : 0.1579          
##       Balanced Accuracy : 0.9366          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(lassss$i_republican_2016,
                lassss$bg_conf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  477   15
##          1  156 2468
##                                           
##                Accuracy : 0.9451          
##                  95% CI : (0.9365, 0.9529)
##     No Information Rate : 0.7969          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8152          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7536          
##             Specificity : 0.9940          
##          Pos Pred Value : 0.9695          
##          Neg Pred Value : 0.9405          
##              Prevalence : 0.2031          
##          Detection Rate : 0.1531          
##    Detection Prevalence : 0.1579          
##       Balanced Accuracy : 0.8738          
##                                           
##        'Positive' Class : 0               
## 
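
The prompt also asks for ROC AUC, which the confusion matrices above cannot provide. Below is a sketch of computing it from the in-sample predicted probabilities with yardstick (not run, and not a cross-validated estimate):

# A sketch (not run): ROC AUC from the in-sample predicted probabilities
lassss %>%
   mutate(pred_prob = as.numeric(pred_final)) %>% # predict() returned a one-column matrix
   roc_auc(truth = i_republican_2016, pred_prob,
           event_level = "second") # level '1' (Republican) is the event of interest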

11. How does the performance of this logistic Lasso compare to the logistic regression in Part 2?

While some of the metrics differ by one to several percentage points (accuracy, the 95% confidence interval, the kappa value, and so on), there is a considerable amount of variation in McNemar’s test p-value depending on the cutoff used. Whether that p-value matters depends on the chosen significance threshold: if it falls below that level (often 0.05), we reject the test’s null hypothesis that the two kinds of disagreement between prediction and truth are equally common.

Looking at the results above, the 0.5 cutoff actually gives the highest accuracy (96.7%) of the three cutoffs tried, which is still slightly below the plain logistic regression’s 97.98%; moving the cutoff to 0.2 or 0.8 trades overall accuracy for a different balance between the two error types.

12. Do you think moving to a logistic elasticnet would improve anything? Explain.

I don’t think it would change much. We saw earlier in the assignment that, at least in the context of this analysis, the pure LASSO performed about as well as the near-Lasso elasticnet and clearly better than the near-Ridge one, so we would not expect to gain much performance by moving to a logistic elasticnet.

Part 4: Reflection

13. Why might we prefer Lasso to elasticnet (or vice versa)?

There may be situations where the context makes it preferable to sit at a more extreme point on the bias-variance spectrum. If an incorrect prediction tied to a particular variable carries a high cost, it becomes important to have a high level of certainty in the model’s predictions, and a sparser Lasso fit that keeps only the strongest predictors can help with that.

If the goal of the analysis is instead a well-rounded model that makes accurate predictions with precision and consistency, then a more balanced position in the bias-variance tradeoff (which the elasticnet allows) would be ideal, avoiding overfitting while still capturing the underlying trends.

14. What are the differences between logistic regression and linear regression? What are the similarities?

Linear regression is used when the outcome variable is a continuous numeric variable, while logistic regression is used when the outcome is categorical, most commonly a binary (two-level) factor. As the name suggests, linear regression models the outcome itself as a linear function of the predictors, whereas logistic regression models the probability of the outcome through the log-odds, which keeps its predictions between 0 and 1 when a straight-line fit to the outcome would not make sense.

They are similar because they both describe trends in the data and use information from a list of explanatory variables (and possibly their interactions) to make inferences about the dependent/outcome variable.
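
Concretely, both models are linear in the parameters; linear regression models the conditional mean of the outcome directly, while logistic regression models the log-odds of a binary outcome:

$$E[y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad \text{versus} \qquad
\log\frac{\Pr(y = 1 \mid x)}{1 - \Pr(y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$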

15. Imagine you worked for a specific political party. Which metric would you use for assessing model performance?

In the context of politics, the information and insight from this analysis could be helpful for understanding voter trends and demographics. During a campaign, for example, a candidate who knows that a certain county is 99% likely to vote for their party can devote less time and resources to that county and instead focus on counties whose outcomes are less certain and that would be more influential in winning the election.

With that context in mind, I would use accuracy for assessing model performance. The data could also be transformed further so that the outcome variable flags counties whose certainty of outcome is below some threshold, say 30% (an arbitrary number, just for example). Counties or states that meet this criterion would then be grouped and labeled as “swing” areas, considered more important in the race, and targeted with more campaign spending.

Given the binary outcome of a campaign (win or lose), accuracy is important because it is the basis for allocating limited campaign funds well. In terms of marginal products, the marginal benefit of an additional dollar of campaigning would be higher in swing counties/states than in places with entrenched voting patterns.