Predicting Rates of Opioid Prescriptions


Executive Summary

The purpose of this project is to provide exploratory data analysis to better understand trends in opioid prescription rates across the distinguishable characteristics of the providers included in the data. Higher national rates of drug abuse carry adverse health and financial effects, including abuse of drugs obtained through legal means (i.e., prescribed by a doctor). Over a million Americans have misused opioids in the past year.

In this analysis, models were constructed to predict the total number of opioid prescriptions written based on distinguishable characteristics of a provider. Additional models were constructed to predict the gender of a provider based on provider characteristics. The best-performing models predicted gender with an accuracy of 77.37 percent and predicted total opioid prescriptions with an out-of-sample RMSE of 109.4 (against an outcome standard deviation of 224).

The Big Picture

In 2017, the US Department of Health and Human Services (HHS) declared a public health emergency to combat the growing opioid crisis in the US and the resulting strain. In 2016, more than 42,000 deaths were attributed to opioid overdoses. According to the most recent figures provided by HHS, 1.6 million Americans had an opioid use disorder in the past year, 10.1 million people misused prescription opioids, and more than 48,000 deaths were attributed to overdoses on synthetic opioids other than methadone (more on terms and definitions is available from HHS).

This problem is also associated with an overall drug misuse problem in the country. The National Survey on Drug Use and Health (2019) found that 745,000 people used heroin in the past year, 50,000 used heroin for the first time, and around 14,500 deaths resulted from heroin overdose.

According to the National Institute on Drug Abuse, there is a correlation between misuse of opioids and heroin use. Heroin is an opiate, a class of naturally derived opioid, that induces similar bodily reactions and carries a similar potential for abuse as synthetically processed opioids (such as those prescribed by a doctor).

In this analysis, I develop models using classification methods to look at trends involving the gender of the provider, and I use regression methods to look at trends in the number of prescriptions written. I created a variable called summ_var, the sum of all opioid prescriptions written by a single provider, to use for modeling prescribing rates. The original data only contained the total prescriptions written for each individual drug, including many non-opioid medications, so the summ_var variable was helpful for providing insight at a broader scope.

The Data

The data sets used in this analysis were found on Kaggle and are sourced from CMS.gov (Centers for Medicare & Medicaid Services).

One challenge working with this data was the formatting. There were thousands of spelling inconsistencies in providers' specialties and credentials, which resulted in some observations being lost when cleaning the data.

The data covering only 2014 also presented some issues. While there is some insight to be gained by examining characteristics of providers from a single year, it is difficult to understand how the trends have changed over time and how much can be explained by new policies (e.g., Prescription Drug Monitoring Programs) or developments in healthcare (e.g., technology; drug alternatives).

A time-series panel with these same variables for multiple years would provide better insight into the questions worth asking from a healthcare or policy perspective. Pulling in additional sources of data along with the time-panel component would also help enrich the insight from this analysis.

Methods

Forward and backward stepwise selection was used to select the variables included in the regression. For the classification models, variable importance and selection were determined algorithmically. A random forest was also used in an attempt to predict the summ_var variable. Grouping summ_var into bins of similar means, each represented by a factor level, converted the regression problem into a classification problem and provided a more generalizable prediction model for prescribing rates.
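
As a minimal sketch of that binning step (assuming a data frame called provider_df with the numeric summ_var column described above; this is illustrative, not the exact code used later in the report):

library(dplyr)
# Sketch: bin the numeric outcome so classification methods can be applied
provider_df = provider_df %>%
  mutate(new_bin = as.factor(ntile(summ_var, n = 30))) %>%  # 30 equal-frequency bins
  group_by(new_bin) %>%
  mutate(avg_sum_var = mean(summ_var)) %>%  # mean prescriptions represented by each bin
  ungroup()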

Results and Conclusions

RMSE was used to assess accuracy when predicting the numeric outcome. The out-of-sample RMSE for the multivariate regression model was 109.4 when predicting the summ_var variable (with a training RMSE of 103.6). While it would be ideal for this value to be closer to zero, it does not necessarily indicate that the model performed poorly, since the standard deviation of the outcome was above 220 in both the training and testing sets and the residuals were evenly distributed.

Accuracy was an important measurement for the classification models. The support vector machine performed best out-of-sample when predicting gender, with an accuracy of 77.37%. A random forest was used to predict the summ_var variable as a classification problem, with the outcome levels determined by binned means. When the number of bins was set to n = 30 the accuracy was 61.4%, and accuracy increased as the number of bins decreased.

Something that limited performance on this project was the computational capability of my computer. Rendering the final HTML took a while to process since I did not utilize parallelization in the analysis. It also would have been helpful to have looked for a time-series panel of this data as well as additional sources. Not only would this have enriched my findings, but it could also provide insight into other trends, such as heroin usage/overdose rates compared to opioid prescription rates over time.

Exploratory Data Analysis - Trends in Opioid Prescriptions

Data Cleaning & Setup

library(pacman)
p_load(dplyr, readr, skimr, ggplot2, tidyverse, tidymodels, janitor, magrittr)

opioid_list = read.csv("opioids.csv")
overdose_df = read.csv("overdoses.csv")
prescriber_df = read.csv("prescriber_info.csv")

Overview

#skim(prescriber_df)

Cleaning up

The Credentials column

# To get an idea of some of the more-common types of credentials
top_counts(prescriber_df$Credentials, max_levels = 20)
## [1] "MD: 7034, M.D: 6772, DDS: 1145, D.O: 866, PA-: 845, emp: 763, D.D: 717, DO: 549, NP: 469, DMD: 449, PA: 437, O.D: 353, FNP: 262, ARN: 256, DPM: 247, M.D: 243, D.M: 234, OD: 175, P.A: 168, CRN: 150"
# Cleaning up the acronyms for some of the most common types of credential
scribe_MD <- prescriber_df %>% 
  group_by(Credentials) %>%
  filter(Credentials == "MD"|
         Credentials == "M.D."|
         Credentials == "M.D"|
         Credentials == "M.D.,"|
         Credentials == "M,D"|
         Credentials == "MD."|
         Credentials == "M.D,"|
         Credentials == "M.D")
scribe_DDS <- prescriber_df %>% 
  filter(Credentials == "DDS"|
         Credentials == "DDS, MS,"|
         Credentials == "D.D.S."|
         Credentials == "DDS, MS"|
         Credentials == "DDS. MS"|
         Credentials == "MD, DDS"|
         Credentials == "DDS,MD"|
         Credentials == "DMD"| 
         Credentials == "D.M.D."|
         Credentials == "D.M.D"|
         Credentials == "DDS, MD")
scribe_DO <- prescriber_df %>% 
  group_by(Credentials) %>%
  filter(Credentials == "DO"|
         Credentials == "D.O."|
         Credentials == "D.O"|
         Credentials == "DO,"|
         Credentials == "D.O., M.D"|
         Credentials == "D.O., M.D."|
         Credentials == "D,O."|
         Credentials == "D.O.,")
scribe_PA <- prescriber_df %>% 
  group_by(Credentials) %>%
  filter(Credentials == "PA"|
         Credentials == "P.A."|
         Credentials == "P.A.-C"|
         Credentials == "PA-C"|
         Credentials == "P.A"|
         Credentials == "PA-"|
         Credentials == "PA,"|
         Credentials == "PA."|
         Credentials == "P,A."|
         Credentials == "P.A.C"|
         Credentials == "P.A.,")  
scribe_NP <- prescriber_df %>% 
  group_by(Credentials) %>%
  filter(Credentials == "NP"|
         Credentials == "N.P."|
         Credentials == "N.P"|
         Credentials == "NP."|
         Credentials == "NP,"|
         Credentials == "N.P.,"|
         Credentials == "N P"|
         Credentials == "NURSE PRACTITIONER"|
         Credentials == "NP-C"|
         Credentials == "NP."|
         Credentials == "N P")
    ## Now to make the values more uniform
scribe_DDS = replace(scribe_DDS, 4,"DDS_DDM")
scribe_DO  = replace(scribe_DO, 4, "DO")
scribe_MD  = replace(scribe_MD, 4, "MD")
scribe_PA  = replace(scribe_PA, 4, "PA")
scribe_NP  = replace(scribe_NP, 4, "NP")
scribe_joint = rbind(scribe_DDS, scribe_DO,
                     scribe_MD, scribe_NP,
                     scribe_PA)

The Specialty column

Looking at the top specialties

tab <- as.data.frame(table(prescriber_df$Specialty))
tab =arrange(tab, desc(Freq)) 

Filtering the data to only some of the more-common values

scribe_joint = scribe_joint %>%
  group_by(Specialty) %>%
  filter(Specialty == "Internal Medicine"|
         Specialty == "Family Practice"|
         Specialty == "Dentist"|
         Specialty == "Physician Assistant"|
         Specialty == "Nurse Practitioner"|
         Specialty == "Emergency Medicine"|
         Specialty == "Psychiatry"|
         Specialty == "Cardiology"|
         Specialty == "Obstetrics/Gynecology"|
         Specialty == "Orthopedic Surgery"|
         Specialty == "Optometry"|
         Specialty == "General Surgery")

The State Naming Convention

Using the built-in state.abb vector to save some effort in narrowing this set down to the 50 states and excluding territories, miscellaneous, or missing categories

list_of_states = as.data.frame(cbind(state.abb))
list_of_states = list_of_states %>%
  mutate(State = state.abb) # Let's make the state variable have the same naming convention in both; and then filter out providers outside of our scope of interest
scribe_joint_nonstates = anti_join(scribe_joint, list_of_states, by = 'State')

# And now to get our sets for only the 50 states
scribe_joint = scribe_joint[
  !scribe_joint$State %in% scribe_joint_nonstates$State,
  ]

Creating Variable for Total Number of Opioids Prescribed

Since we are working with multiple variables involving opioids, we must avoid the mistake of including an opioid variable among both the predictors and the outcome. Doing so would inflate the apparent accuracy of the model, since predictors directly correlated with the outcome would simply reproduce an outcome we already know.

Looking at the opioid_list data frame.

unique(opioid_list$Generic.Name) # To see which medications are opioids
##  [1] "FENTANYL CITRATE"               "ACETAMINOPHEN WITH CODEINE"    
##  [3] "CODEINE/BUTALBITAL/ASA/CAFFEIN" "DIHYDROCODEINE/ASPIRIN/CAFFEIN"
##  [5] "MORPHINE SULFATE"               "OPIUM/BELLADONNA ALKALOIDS"    
##  [7] "BUPRENORPHINE HCL"              "BUTALBIT/ACETAMIN/CAFF/CODEINE"
##  [9] "BUTORPHANOL TARTRATE"           "BUPRENORPHINE"                 
## [11] "CODEINE/CARISOPRODOL/ASPIRIN"   "CODEINE SULFATE"               
## [13] "HYDROCODONE/ACETAMINOPHEN"      "TRAMADOL HCL"                  
## [15] "MEPERIDINE HCL"                 "MEPERIDINE HCL/PF"             
## [17] "HYDROMORPHONE HCL"              "HYDROMORPHONE HCL/PF"          
## [19] "METHADONE HCL"                  "FENTANYL"                      
## [21] "MORPHINE SULFATE/PF"            "OXYCODONE HCL/ACETAMINOPHEN"   
## [23] "OXYCODONE HCL/ASPIRIN"          "HYDROCODONE/IBUPROFEN"         
## [25] "LEVORPHANOL TARTRATE"           "NALBUPHINE HCL"                
## [27] "TAPENTADOL HCL"                 "OXYMORPHONE HCL"               
## [29] "OPIUM TINCTURE"                 "OXYCODONE HCL"                 
## [31] "IBUPROFEN/OXYCODONE HCL"        "PENTAZOCINE HCL/ACETAMINOPHEN" 
## [33] "PENTAZOCINE HCL/NALOXONE HCL"   "PENTAZOCINE LACTATE"           
## [35] "TRAMADOL HCL/ACETAMINOPHEN"     "DHCODEINE BT/ACETAMINOPHN/CAFF"
## [37] "HYDROCODONE BITARTRATE"
top_opioids = rbind(
  opioid_list[grep("BUPRENORPHINE", opioid_list$Generic.Name),],
  opioid_list[grep("BUTORPHANOL", opioid_list$Generic.Name),],
  opioid_list[grep("CODEINE", opioid_list$Generic.Name),],
  opioid_list[grep("FENTANYL", opioid_list$Generic.Name),],
  opioid_list[grep("HYDROCODONE", opioid_list$Generic.Name),],
  opioid_list[grep("HYDROMORPHONE", opioid_list$Generic.Name),],
  opioid_list[grep("LEVORPHANOL", opioid_list$Generic.Name),],  
  opioid_list[grep("MEPERIDINE", opioid_list$Generic.Name),],
  opioid_list[grep("METHADONE", opioid_list$Generic.Name),],
  opioid_list[grep("MORPHINE", opioid_list$Generic.Name),],
  opioid_list[grep("NALBUPHINE", opioid_list$Generic.Name),],
  opioid_list[grep("OPIUM", opioid_list$Generic.Name),],
  opioid_list[grep("OXYCODONE", opioid_list$Generic.Name),],
  opioid_list[grep("OXYMORPHONE", opioid_list$Generic.Name),],
  opioid_list[grep("PENTAZOCINE", opioid_list$Generic.Name),],
  opioid_list[grep("TRAMADOL", opioid_list$Generic.Name),])

Here we can see which drug variables from the original prescriber_df are opioids. Next we can use these drug types to count the total number of opioid prescriptions written by each provider.

working_df = scribe_joint %>%
  select(contains(c("NPI","Gender","State","Credentials","Specialty",
                  "HYDROCODONE","OXYCONTIN",
                  "CODEINE","FENTANYL","PENTAZOCINE",
                  "DIHYDROCODEINE","Opioid.Prescriber",
                  "OPIUM","BUPRENORPHINE","BUTALBIT","BUTORPHANOL",
                  "TRAMADOL","MEPERIDINE","HYDROMORPHONE","METHADONE",
                  "MORPHINE","OXYCODONE","LEVORPHANOL","NALBUPHINE",
                  "TAPENTADOL","OXYMORPHONE","OPIUM")))

# Now we can calculate the sum of opioid prescriptions written by each provider for the year
working_df$summ_var = rowSums(working_df[ ,c(6:18)])

# adding that column to our original data frame
scribe_joint$summ_var <- working_df$summ_var
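
The rowSums() call above depends on column positions 6:18. As a sketch of a name-based alternative (not the version used for the results in this report), the same total can be computed by excluding the identifier columns, the Opioid.Prescriber flag, and IPRATROPIUM.BROMIDE, which is picked up by contains("OPIUM") even though it is not an opioid:

# Sketch only: sum the opioid drug-count columns by name rather than by position
drug_cols = setdiff(
  names(working_df),
  c("NPI", "Gender", "State", "Credentials", "Specialty",
    "Opioid.Prescriber",      # 0/1 flag, not a prescription count
    "IPRATROPIUM.BROMIDE",    # matched by contains("OPIUM") but not an opioid
    "summ_var"))              # already created above
working_df$summ_var_named = rowSums(working_df[, drug_cols], na.rm = TRUE)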

Removing Outliers

Looking at the summ_var variable that we created, we can see that one provider in particular prescribed a significantly higher amount than any other provider. Let's remove that observation from our data.
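
A quick way to confirm this (a sketch; the corresponding output is not shown here) is to sort the providers by summ_var:

# Sketch: inspect the largest totals before dropping the extreme observation
scribe_joint %>%
  ungroup() %>%
  arrange(desc(summ_var)) %>%
  select(NPI, Specialty, State, summ_var) %>%
  head(5)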

scribe_joint = subset(scribe_joint, summ_var<4000)

Creating Factor Variables

Thinking back to the overview of the data, the categorical columns (gender, state, credentials, and specialty) should be stored as factors before modeling.

scribe_joint = scribe_joint %>%
  mutate(Gender = as.factor(Gender),
         State  = as.factor(State),
         Credentials = as.factor(Credentials),
         Specialty   = as.factor(Specialty))

Adjustments for Population

# Creating a variable for per capita rates
overdose_df$Deaths <- as.numeric(gsub(",","",overdose_df$Deaths))
overdose_df$Population <- as.numeric(
  gsub(",","",overdose_df$Population))
    # Removing commas first to avoid errors in math
overdose_df$deaths_per_capita = (
  overdose_df$Deaths/overdose_df$Population
  )*100000 # per 100 thousand

df = scribe_joint %>%
  group_by(State) %>%
  summarise(state_opioid_volume = sum(summ_var)) %>%
  arrange(State)
overdose_df$state_opioid_volume <- df$state_opioid_volume
    # To give us total opioid prescriptions in each state

# which will allow us to determine the rate-per-thousand people
overdose_df %<>% mutate(opioids_per_cap =
                          (state_opioid_volume/Population)*1000)
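
The positional assignment above assumes overdose_df and the summarised df list the states in the same order. A more defensive sketch (assuming overdose_df stores the two-letter state code in a column named Abbrev; adjust the key to whatever the actual column is called in overdoses.csv) would attach the totals with a keyed join:

# Sketch: attach state totals by joining on the state code, not by row order.
# "Abbrev" is an assumed column name for the state abbreviation in overdose_df.
state_totals = scribe_joint %>%
  group_by(State) %>%
  summarise(state_opioid_volume = sum(summ_var)) %>%
  mutate(State = as.character(State))   # match the character key in overdose_df
overdose_joined = overdose_df %>%
  select(-state_opioid_volume) %>%      # drop the positionally assigned column
  left_join(state_totals, by = c("Abbrev" = "State"))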

Data Visualization

# Differences when grouped by specialty
ggp1 = ggplot(
  scribe_joint,aes(x = Specialty, y = summ_var, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Specialty",
       y = "Individual Prescriptions Written",
       fill = "Gender",
       title = "Opioid Prescription Rates by Specialty & Gender") +
  theme(plot.title = element_text(hjust = 0.5, face="bold.italic"), 
        axis.title.x = element_text(size = 12),
        axis.text.x = element_text(size = 7, angle=50,margin = margin(t = 20)),
        axis.title.y = element_text(size = 12),
        legend.title = element_text(face="bold", size = 10))
# Differences when grouped by credentials
ggp2 = ggplot(
  scribe_joint,aes(x = Credentials, y = summ_var, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Credential",
       y = "Individual Prescriptions Written",
       fill = "Gender",
       title = "Opioid Prescription Rates by Credentials & Gender") +
  theme(plot.title = element_text(hjust = 0.5, face="bold.italic"), 
        axis.title.x = element_text(size = 12),
        axis.text.x = element_text(size = 7),
        axis.title.y = element_text(size = 12),
        legend.title = element_text(face="bold", size = 10))
ggp1

ggp2

ggp3 = scribe_joint %>%
  group_by(Gender) %>%
  summarise(plyr::count(Specialty))%>%
  ggplot(.,aes(x = x, y = freq, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge")+
  labs(x = "Specialty",
       y = "Number of Providers",
       fill = "Gender",
       title = "Number of Providers by Specialty & Gender") +
  theme(plot.title = element_text(hjust = 0.5, face="bold.italic"), 
        axis.title.x = element_text(size = 12),
        axis.text.x = element_text(size = 7, angle=50,margin = margin(t = 20)),
        axis.title.y = element_text(size = 12),
        legend.title = element_text(face="bold", size = 10))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggp4 = scribe_joint %>%
  group_by(Gender) %>%
  summarise(plyr::count(Credentials))%>%
  ggplot(.,aes(x = x, y = freq, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge")+
  labs(x = "Credential",
       y = "Number of Providers",
       fill = "Gender",
       title = "Number of Providers by Credentials & Gender") +
  theme(plot.title = element_text(hjust = 0.5, face="bold.italic"), 
        axis.title.x = element_text(size = 12),
        axis.text.x = element_text(size = 7),
        axis.title.y = element_text(size = 12),
        legend.title = element_text(face="bold", size = 10))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggp3

ggp4

library(usmap)
library(ggpattern)
statepop <- statepop %>% filter(abbr!='DC')
statepop$per_cap_death <- overdose_df$deaths_per_capita
  
map_of_deaths <- plot_usmap(data = statepop, values = "per_cap_death", color = "black")+
  labs(title = "Deaths From Opioid Overdose in US",
               subtitle = "Population per 100,000")+
  theme(plot.title = element_text(face="bold", size = 14),
        plot.subtitle = element_text(face = "italic"),
        legend.title = element_text(face = "bold", size = 8),
        legend.position = c(0.9, 0.15))+
  scale_fill_continuous(
    low = "white", high = "red", name = "Death Rate (2014)")
map_of_deaths

plot_overdose_script = ggplot(
  overdose_df, aes(x=opioids_per_cap, y=deaths_per_capita)) + 
  geom_point(size=4, alpha=.6, color="black")+
  geom_smooth(method = 'lm', se=F, color="red")+
  geom_smooth(method = "loess", se=F, color="blue")+
  xlim(0,102)+
  ylab("Overdose Death Rate (per 100,000)")+
  xlab("Opioid Prescriptions Written (per 1,000 people)")+
  labs(title = 
         "State Overdose Death Rate & Number of Opioid Prescrptiions")
plot_overdose_script
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Training & Testing

Now that the data has been cleaned and explored graphically, we can begin to train models to fit it. There are several different ways this data could be modeled; we will focus on a few of them.

Partitioning of Data for Training and Testing

set.seed(1234) # Setting seed for reproducibility
train_df = scribe_joint %>%
  sample_frac(0.8) # 80% for training; the remaining 20% is held out for testing
test_df  = anti_join(
  scribe_joint, train_df, by = 'NPI')

# Now let's remove the `NPI` variable to make later work easier
train_df = train_df %>% select(-c("NPI"))
test_df = test_df %>% select(-c("NPI"))

Model 1: Multivariate Regression

# Let's make a copy of the train/test set that we can play with for this model
m1_train <- train_df
m1_test  <- test_df

To predict total opioid prescriptions, we should remove the variables for the individual opioids and keep only the summ_var variable we created for the total. Removing those variables prevents double-counting in the data.

# Using variable from working_df from before to get list of variables to remove from m1_train
colnames(working_df)
##  [1] "NPI"                       "Gender"                   
##  [3] "State"                     "Credentials"              
##  [5] "Specialty"                 "HYDROCODONE.ACETAMINOPHEN"
##  [7] "OXYCONTIN"                 "ACETAMINOPHEN.CODEINE"    
##  [9] "FENTANYL"                  "Opioid.Prescriber"        
## [11] "IPRATROPIUM.BROMIDE"       "TRAMADOL.HCL"             
## [13] "HYDROMORPHONE.HCL"         "METHADONE.HCL"            
## [15] "MORPHINE.SULFATE"          "MORPHINE.SULFATE.ER"      
## [17] "OXYCODONE.ACETAMINOPHEN"   "OXYCODONE.HCL"            
## [19] "summ_var"
m1_train = m1_train %>%
  select(-c("HYDROCODONE.ACETAMINOPHEN",
            "OXYCONTIN","ACETAMINOPHEN.CODEINE",
            "FENTANYL","Opioid.Prescriber",
            "TRAMADOL.HCL","HYDROMORPHONE.HCL",
            "METHADONE.HCL","MORPHINE.SULFATE",
            "MORPHINE.SULFATE.ER",
            "OXYCODONE.ACETAMINOPHEN",
            "OXYCODONE.HCL"))
p_load(tidyverse, caret, leaps)

set.seed(123)
# Forward step wise selection
fwd_step = train(
  summ_var ~.,
  data = m1_train,
  trControl = trainControl(method = "cv", number = 5),
  method = "leapForward",
  tuneGrid = expand.grid(nvmax = 1:15)
)
fwd_step$results
##    nvmax     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      1 142.9462 0.5889351 67.95358 9.012730 0.02018443 1.853411
## 2      2 128.1006 0.6694549 62.71504 4.961419 0.01403452 1.281382
## 3      3 120.2452 0.7102281 59.75820 5.833873 0.01121955 1.506387
## 4      4 115.5554 0.7320890 57.86897 4.411791 0.01680322 1.402685
## 5      5 113.0243 0.7437949 55.51938 4.602188 0.01436761 1.105121
## 6      6 111.1988 0.7516913 53.73136 3.880417 0.01627388 1.393002
## 7      7 108.0603 0.7648023 52.33373 3.102894 0.01421277 0.809943
## 8      8 107.1782 0.7686659 51.76848 3.298020 0.01190997 1.205059
## 9      9 106.0462 0.7735734 51.37989 3.802985 0.01091996 1.261378
## 10    10 104.9922 0.7784379 51.00413 3.694032 0.01140687 1.131310
## 11    11 105.3073 0.7775196 50.87328 4.063669 0.01591674 1.419487
## 12    12 104.6231 0.7803991 49.95771 4.434914 0.01709791 1.191680
## 13    13 106.2271 0.7742025 49.73829 3.873788 0.01478199 1.207771
## 14    14 106.0036 0.7751666 49.52943 4.773148 0.01895048 1.298049
## 15    15 105.7240 0.7764202 49.50593 4.313467 0.01663582 1.149661
fwd_step$bestTune
##    nvmax
## 12    12
# Backward step wise selection
back_step = train(
  summ_var ~.,
  data = m1_train,
  trControl = trainControl(method = "cv", number = 5),
  method = "leapBackward",
  tuneGrid = expand.grid(nvmax = 1:15)
)
back_step$results
##    nvmax     RMSE  Rsquared      MAE    RMSESD  RsquaredSD    MAESD
## 1      1 147.5288 0.5616769 71.38966 13.139949 0.042923270 8.705169
## 2      2 131.1565 0.6529783 63.98359  8.908965 0.019530571 4.074801
## 3      3 122.5604 0.6973548 60.70188  9.706783 0.025613525 3.701787
## 4      4 116.6978 0.7260008 58.50078  7.923951 0.013846784 3.196471
## 5      5 115.5123 0.7317133 57.38717  7.242762 0.011822267 2.972636
## 6      6 112.4236 0.7456669 55.99648  6.190942 0.009952097 3.008983
## 7      7 110.3052 0.7549938 54.12002  5.867603 0.011367704 2.565357
## 8      8 108.2452 0.7638922 52.85418  5.909550 0.011694650 2.515367
## 9      9 105.9964 0.7736059 52.04085  5.262633 0.010746906 2.488759
## 10    10 106.6327 0.7712167 51.33036  8.351863 0.021385624 2.788668
## 11    11 105.8898 0.7742964 51.09143  6.888476 0.017936588 2.433166
## 12    12 105.6434 0.7750603 50.81237  6.862724 0.021387713 2.200488
## 13    13 105.0630 0.7772380 50.40047  5.924330 0.022142826 1.958428
## 14    14 106.1059 0.7731446 50.34965  7.771442 0.028811771 1.938923
## 15    15 106.4856 0.7717127 50.37024  8.212358 0.029832796 1.689108
back_step$bestTune
##    nvmax
## 13    13
fin_fwd_step_list = as.data.frame(coef(fwd_step$finalModel,12))
fin_bck_step_list = as.data.frame(coef(back_step$finalModel,13))
print(fin_bck_step_list)
##                             coef(back_step$finalModel, 13)
## (Intercept)                                     13.3716205
## SpecialtyOrthopedic Surgery                    119.6345098
## BACLOFEN                                         0.9503030
## CARISOPRODOL                                     3.8013034
## CYCLOBENZAPRINE.HCL                              2.9703701
## DIVALPROEX.SODIUM.ER                            -1.7089292
## GABAPENTIN                                       0.4771488
## LORAZEPAM                                        0.9163253
## LYRICA                                           1.8918060
## METHOCARBAMOL                                    2.6123489
## ONDANSETRON.HCL                                  2.5591120
## PREDNISONE                                       0.6453914
## PROAIR.HFA                                       0.5089056
## TIZANIDINE.HCL                                   1.6215341
print(fin_fwd_step_list)
##                             coef(fwd_step$finalModel, 12)
## (Intercept)                                    12.8747926
## SpecialtyOrthopedic Surgery                   119.0406428
## CARISOPRODOL                                    4.0987142
## CYCLOBENZAPRINE.HCL                             3.2235913
## DIVALPROEX.SODIUM.ER                           -1.5447693
## GABAPENTIN                                      0.5583611
## LEVOTHYROXINE.SODIUM                            0.1294654
## LORAZEPAM                                       0.7885353
## LYRICA                                          1.8600224
## METHOCARBAMOL                                   2.6424817
## ONDANSETRON.HCL                                 2.3856531
## PREDNISONE                                      0.6304748
## TIZANIDINE.HCL                                  1.6664675

Now that we have a list of variables, let's think about which to include in our model. The Orthopedic Surgery specialty is highly significant in both models, so let's create a dummy variable indicating whether a provider is an orthopedic surgeon. CARISOPRODOL, CYCLOBENZAPRINE.HCL, METHOCARBAMOL, ONDANSETRON.HCL, LYRICA, TIZANIDINE.HCL, LORAZEPAM, PREDNISONE, and GABAPENTIN are the drugs that appeared in both best-selected models.

DIVALPROEX.SODIUM.ER had a negative coefficient in both selected models, though it is not included in the final specification fitted below.

m1_train = m1_train %>%
  mutate(orth_surg_dummy = as.factor(
    if_else(Specialty == "Orthopedic Surgery",1,0))
  )

mod1 = lm(data = m1_train, method = "lm", summ_var~
           orth_surg_dummy + CARISOPRODOL +
           CYCLOBENZAPRINE.HCL+METHOCARBAMOL+
           ONDANSETRON.HCL+LYRICA+TIZANIDINE.HCL+
            LORAZEPAM+PREDNISONE+GABAPENTIN)
summary(mod1)
## 
## Call:
## lm(formula = summ_var ~ orth_surg_dummy + CARISOPRODOL + CYCLOBENZAPRINE.HCL + 
##     METHOCARBAMOL + ONDANSETRON.HCL + LYRICA + TIZANIDINE.HCL + 
##     LORAZEPAM + PREDNISONE + GABAPENTIN, data = m1_train, method = "lm")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1235.33   -16.99   -13.54    12.46  1688.54 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          13.53637    1.09557   12.36   <2e-16 ***
## orth_surg_dummy1    116.78447    5.03220   23.21   <2e-16 ***
## CARISOPRODOL          4.09686    0.10102   40.56   <2e-16 ***
## CYCLOBENZAPRINE.HCL   3.41291    0.10672   31.98   <2e-16 ***
## METHOCARBAMOL         2.71512    0.19653   13.81   <2e-16 ***
## ONDANSETRON.HCL       2.67239    0.15231   17.55   <2e-16 ***
## LYRICA                2.10255    0.07984   26.34   <2e-16 ***
## TIZANIDINE.HCL        1.56360    0.07336   21.32   <2e-16 ***
## LORAZEPAM             0.57493    0.02746   20.93   <2e-16 ***
## PREDNISONE            0.87213    0.04227   20.63   <2e-16 ***
## GABAPENTIN            0.65028    0.02476   26.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.7 on 11436 degrees of freedom
## Multiple R-squared:  0.7835, Adjusted R-squared:  0.7833 
## F-statistic:  4138 on 10 and 11436 DF,  p-value: < 2.2e-16

Looking at some measurements of model accuracy

m1_train$resid <- resid(mod1)
m1_train$fitted <- fitted(mod1)

sqrt(mean((m1_train$summ_var - m1_train$fitted)^2)) # to calculate RMSE
## [1] 103.6392

Not the best-performing model, but let's see nonetheless how it performs against the testing set. Perhaps the insight to take away is that opioid prescription rates cannot be successfully predicted from these lists of other prescribed medications.
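
As a side note, a strictly out-of-sample check would score the model fit on the training data against the held-out test set rather than refitting; a sketch of that check (the Testing section below refits the same specification on the test data) looks like:

# Sketch: score the training-set model (mod1) on the held-out test data
m1_oos = test_df %>%
  mutate(orth_surg_dummy = as.factor(
    if_else(Specialty == "Orthopedic Surgery", 1, 0)))
oos_pred = predict(mod1, newdata = m1_oos)
sqrt(mean((m1_oos$summ_var - oos_pred)^2))   # out-of-sample RMSE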

Model 2: Decision Trees

Are we able to predict the Gender of a provider based on prescription rates and other variables?

library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
m2_train <- train_df %>% na.omit(.)
set.seed(1234)
default_cv = m2_train %>% vfold_cv(v =5)
# Define the decision tree
default_tree = decision_tree(mode ="classification",
                             cost_complexity = tune(),
                             tree_depth = tune()) %>%
  set_engine("rpart")
# Defining recipe
default_recipe = recipe(Gender ~., data = m2_train)
# Defining workflow
default_flow = workflow() %>%
  add_model(default_tree) %>%
  add_recipe(default_recipe)
# Tuning
default_cv_fit = default_flow %>%
  tune_grid(
    default_cv,
    grid = expand_grid(
      cost_complexity = seq(0, 0.15, by = 0.01),
      tree_depth = c(1,2,5,10),
    ),
    metrics = metric_set(accuracy, roc_auc)
  )
# Fitting the best model
best_flow = default_flow %>%
  finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
  fit(data = m2_train)
# Choosing the best model
best_tree = best_flow %>% extract_fit_parsnip()
# Plotting the tree
best_tree$fit %>% rpart.plot(roundint = FALSE)

Model 3: Binscatter Random Forest

m3_train <- train_df
m3_test <- test_df
# Adding an indicator variable for orthopedic surgeons
m3_train %<>%
  mutate(orth_surg_dummy = as.factor(
    if_else(Specialty == "Orthopedic Surgery",1,0)))
m3_test %<>%
  mutate(orth_surg_dummy = as.factor(
    if_else(Specialty == "Orthopedic Surgery",1,0)))

# Splitting data into bins
m3_train %<>% mutate(new_bin = ntile(summ_var, n=30)) %>%
   mutate(new_bin = as.factor(new_bin))
str(m3_train$new_bin)
##  Factor w/ 30 levels "1","2","3","4",..: 1 1 1 24 1 1 26 1 1 1 ...
m3_train = m3_train %>%
   group_by(new_bin) %>%
   mutate(avg_sum_var = mean(summ_var))

The training accuracy for this model is 61.4%.
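
The code behind that figure is not shown above; a sketch of how such a training accuracy can be computed (assuming the same random forest specification used in the Testing section below) is:

# Sketch: in-sample (training) accuracy for the binned random forest
rf_fit_train = randomForest::randomForest(new_bin ~ ., data = m3_train, ntree = 50)
train_pred = predict(rf_fit_train, newdata = m3_train, type = "response")
mean(train_pred == m3_train$new_bin)   # share of bins predicted correctly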

Model 4: Support Vector Machine (SVM)

For this model, let’s see if we can accurately predict the gender of a provider.

m4_train <- train_df
m4_test <- test_df

pacman::p_load(e1071)

Now to prepare the data for making our predictions of interest

# Defining our variables of prediction and outcome
x = m4_train[,-1]
y = m4_train[1]

Now running the model (takes a while, be patient)

svm_model = svm(Gender ~ .,
                 data = m4_train)
summary(svm_model)
## 
## Call:
## svm(formula = Gender ~ ., data = m4_train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  7820
## 
##  ( 4270 3550 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M
preds = predict(svm_model,x)
# Ensure caret package is loaded for line below
confusionMatrix(preds, y$Gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    M
##          F 2011  554
##          M 2059 6823
##                                           
##                Accuracy : 0.7717          
##                  95% CI : (0.7639, 0.7794)
##     No Information Rate : 0.6444          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4569          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4941          
##             Specificity : 0.9249          
##          Pos Pred Value : 0.7840          
##          Neg Pred Value : 0.7682          
##              Prevalence : 0.3556          
##          Detection Rate : 0.1757          
##    Detection Prevalence : 0.2241          
##       Balanced Accuracy : 0.7095          
##                                           
##        'Positive' Class : F               
## 
summary(svm_model)
## 
## Call:
## svm(formula = Gender ~ ., data = m4_train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  7820
## 
##  ( 4270 3550 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M

Testing

Model 1: Multivariate Regression

m1_test = m1_test %>%
  mutate(orth_surg_dummy = as.factor(
    if_else(Specialty == "Orthopedic Surgery",1,0))
  )
# Using the variables selected with stepwise algorithm
mod1t = lm(data = m1_test, summ_var~
           orth_surg_dummy + CARISOPRODOL +
           CYCLOBENZAPRINE.HCL+METHOCARBAMOL+
           ONDANSETRON.HCL+LYRICA+TIZANIDINE.HCL+
            LORAZEPAM+PREDNISONE+GABAPENTIN)
summary(mod1t)
## 
## Call:
## lm(formula = summ_var ~ orth_surg_dummy + CARISOPRODOL + CYCLOBENZAPRINE.HCL + 
##     METHOCARBAMOL + ONDANSETRON.HCL + LYRICA + TIZANIDINE.HCL + 
##     LORAZEPAM + PREDNISONE + GABAPENTIN, data = m1_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1267.8   -14.2   -14.2    12.9  1168.4 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         14.19839    2.31422   6.135 9.68e-10 ***
## orth_surg_dummy1    77.78581   10.64788   7.305 3.57e-13 ***
## CARISOPRODOL         6.09944    0.23139  26.360  < 2e-16 ***
## CYCLOBENZAPRINE.HCL  3.88640    0.20764  18.717  < 2e-16 ***
## METHOCARBAMOL        1.85239    0.38520   4.809 1.60e-06 ***
## ONDANSETRON.HCL      2.00514    0.33900   5.915 3.72e-09 ***
## LYRICA               1.83852    0.20389   9.017  < 2e-16 ***
## TIZANIDINE.HCL       0.95089    0.16993   5.596 2.40e-08 ***
## LORAZEPAM            0.67389    0.05293  12.733  < 2e-16 ***
## PREDNISONE           0.41070    0.08398   4.890 1.06e-06 ***
## GABAPENTIN           0.83131    0.05162  16.103  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.6 on 2852 degrees of freedom
## Multiple R-squared:  0.7672, Adjusted R-squared:  0.7663 
## F-statistic: 939.7 on 10 and 2852 DF,  p-value: < 2.2e-16
m1_test$resid <- resid(mod1t)
m1_test$fitted <- fitted(mod1t)
sqrt(mean((m1_test$summ_var - m1_test$fitted)^2)) # to calculate RMSE
## [1] 109.4119

Looking at the testing results, while the model may not be very accurate, it is at least consistently inaccurate across the data. In the plot below we can see a wide range of predicted versus observed values, but on average the points follow the 45-degree reference line.

plot(x=m1_test$fitted,
     m1_test$summ_var,
     xlab = "Predicted Values",
     ylab = "Observed Values",
     main = "Linear Model")
abline(a = 0,                                        
       b = 1,
       col = "red",
       lwd = 2)

Model 2: Decision Tree

m2_test <- test_df %>% na.omit(.)
set.seed(1234)
default_cv = m2_test %>% vfold_cv(v =5)
default_recipe = recipe(Gender ~., data = m2_test)
# Defining workflow
default_flow = workflow() %>%
  add_model(default_tree) %>%
  add_recipe(default_recipe)
# Tuning
default_cv_fit = default_flow %>%
  tune_grid(
    default_cv,
    grid = expand_grid(
      cost_complexity = seq(0, 0.15, by = 0.01),
      tree_depth = c(1,2,5,10),
    ),
    metrics = metric_set(accuracy, roc_auc)
  )
# Fitting the best model
best_flow = default_flow %>%
  finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
  fit(data = m2_test)
# Choosing the best model
best_tree = best_flow %>% extract_fit_parsnip()
# Plotting the tree
best_tree$fit %>% rpart.plot(roundint = FALSE)

Model 3: Random Forest

m3_test %<>% mutate(new_bin = ntile(summ_var, n=30)) %>%
   mutate(new_bin = as.factor(new_bin)) %>%
   group_by(new_bin) %>%
   mutate(avg_sum_var = mean(summ_var))

rf_mod = randomForest::randomForest(formula = new_bin ~.,
                                    data = m3_train,
                                    ntree = 50)
m3_test$pred_rf = predict(rf_mod, newdata = m3_test, type = "response")

# to obtain the number of times the predicted values were the actual values
length(which(m3_test$pred_rf==m3_test$new_bin)) -> accurate #1758
# divided by total observations to obtain accuracy
accurate / (nrow(m3_test))
## [1] 0.5864478

Model 4: Support Vector Machine

Using the same SVM approach as with the training set.

xt = m4_test[,-1]
yt = m4_test[1]

svm_model_test = svm(Gender ~ .,
                 data = m4_test)
summary(svm_model_test)
## 
## Call:
## svm(formula = Gender ~ ., data = m4_test)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  2257
## 
##  ( 997 1260 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M
predst = predict(svm_model_test,xt)
# Ensure caret package is loaded for line below
confusionMatrix(predst, yt$Gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    F    M
##          F  484  110
##          M  538 1731
##                                           
##                Accuracy : 0.7737          
##                  95% CI : (0.7579, 0.7889)
##     No Information Rate : 0.643           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4563          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4736          
##             Specificity : 0.9402          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.7629          
##              Prevalence : 0.3570          
##          Detection Rate : 0.1691          
##    Detection Prevalence : 0.2075          
##       Balanced Accuracy : 0.7069          
##                                           
##        'Positive' Class : F               
## 

Again, the results are comparable.

Conclusions

The linear model predicting the summ_var variable did a decent job of fitting the general trend in the data, but its out-of-sample RMSE of 109.4 might suggest some concern. Considering that the range of the outcome values was around 3,000, however, that is a fairly small margin in context. It was also reassuring to see that the same general trend existed in both the training and the testing sets.

plot(x=m1_test$fitted,
     m1_test$summ_var,
     xlab = "Predicted Values",
     ylab = "Observed Values",
     main = "Linear Model")
abline(a = 0,                                        
       b = 1,
       col = "red",
       lwd = 2)

We can go one step further and create a binned scatter plot of the actual versus predicted values, and see that the linear fit becomes stronger.

p1= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())
## Warning: `fun.y` is deprecated. Use `fun` instead.
p2 = ggplot(m1_test, aes(x=fitted,y=summ_var)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=15, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())
## Warning: `fun.y` is deprecated. Use `fun` instead.
ggpubr::ggarrange(p1,p2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

This is especially true if we change the number of bins used.

p3= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=75, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())+
      xlim(0,2000)+ylim(0,2000)
## Warning: `fun.y` is deprecated. Use `fun` instead.
p4= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
  geom_point(size=2, alpha = 0.1, color="red") +
  stat_summary_bin(fun.y='mean', bins=15, 
                   color='blue', size=1.5, geom='point')+
   geom_smooth(method = "lm", color="blue",se=F)+
   theme(axis.title.x = element_blank(),
         axis.title.y = element_blank())+
   xlim(0,2000)+ylim(0,2000)
## Warning: `fun.y` is deprecated. Use `fun` instead.
ggpubr::ggarrange(p3,p4)
## Warning: Removed 5 rows containing non-finite values (stat_summary_bin).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing non-finite values (stat_summary_bin).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

The random forest model had a training accuracy of 61.4% (58.6% out-of-sample). By comparison, the SVM model had an accuracy of 77.37% (with a 95% confidence interval of (0.7579, 0.7889)). The decision tree model had an accuracy of 73.94%, as calculated from the confusion matrix below:

m2_test$tree_mod = predict(best_tree, new_data =  m2_test)
temp_df = m2_test %>% rename(tree_mod = 257) %>%
   mutate(tree_pred = if_else(tree_mod =="M", "MALE","FEMALE" ))
table(Actual = temp_df$Gender,
      Predicted = temp_df$tree_pred)
##       Predicted
## Actual FEMALE MALE
##      F    387  635
##      M    163 1678
(522+1595)/(522+1595+500+246)
## [1] 0.7394342

Accuracy was not a metric worth considering when predicting continuous variables. Due to the natural smoothing that results from modeling, it is unlikely that the model will get many (if any) predictions correct down to the exact integer.
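
To illustrate with the regression model (a sketch using an arbitrary tolerance; these figures are not reported above):

# Sketch: exact integer matches versus a tolerance-based measure
mean(round(m1_test$fitted) == m1_test$summ_var)     # exact integer matches, likely near zero
mean(abs(m1_test$fitted - m1_test$summ_var) <= 25)  # share within 25 prescriptions (arbitrary tolerance)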

There are also other measurements of model quality beyond accuracy that are important to consider when building or assessing a model. Looking at a plot of the residuals for the regression model, we can see strong clustering around 0, but there is still noise left uncaptured by the model.

plot(m1_test$resid)

If we look at the listed variables of importance for the decision tree model, we can see which variables were valued more in each model. Knowledge of how these models are constructed is important when attempting to use them to predict new data in which changes in policy or healthcare (like prescription drug monitoring programs) have altered trends in opioid prescriptions.

best_tree$fit$variable.importance
##             Specialty           Credentials              PREMARIN 
##           163.6312629           128.5249192             2.3800911 
##         METHADONE.HCL   MORPHINE.SULFATE.ER CLOBETASOL.PROPIONATE 
##             0.8925342             0.8925342             0.5950228

When it came to the classification prediction of the gender variable, the SVM outperformed the decision tree. It would be interesting to see how these models perform on more recent data, since the methods used to construct the models should adjust for potential differences in volume by working with proportions.