Executive Summary
The purpose of this project is to provide exploratory data analysis to better understand trends in opioid prescription rates across the distinguishable characteristics of the providers included in the data. Higher national rates of drug abuse carry adverse health and financial effects, including abuse of drugs obtained through legal means (i.e., prescribed by a doctor). Over a million Americans have misused opioids in the past year.
In this analysis, models were constructed to predict the total number of opioid prescriptions written based on distinguishable characteristics of providers. Additional models were constructed to predict the gender of a provider based on provider characteristics. The best-performing models predicted gender with an accuracy of 77.37 percent and predicted total opioid prescriptions with an RMSE of 109.4 (against an outcome standard deviation of roughly 224).
The Big Picture
In 2017, the US Department of Health and Human Services (HHS) declared a public health emergency to combat the growing opioid crisis in the US and the strain it has caused. In 2016, more than 42,000 deaths were attributed to opioid overdoses. According to the most recent figures provided by HHS, 1.6 million Americans had an opioid use disorder in the past year, 10.1 million people misused prescription opioids, and more than 48,000 deaths were attributed to overdoses involving synthetic opioids other than methadone (more on terms and definitions by HHS).
This problem is also tied to an overall drug misuse problem in the country. The National Survey on Drug Use and Health (2019) found that 745,000 people used heroin in the past year, 50,000 used heroin for the first time, and around 14,500 deaths resulted from heroin overdose.
According to the National Institute on Drug Abuse, there is a correlation between misuse of prescription opioids and heroin use. Heroin is an opiate (a naturally occurring opioid) that induces similar bodily reactions and carries similar potential for abuse as synthetically processed opioids, such as those prescribed by a doctor.
In this analysis, I develop models using classification methods to look at trends involving the gender of the provider, and I use regression methods to look at trends in the number of prescriptions written. I created a variable called summ_var, the sum of all opioid prescriptions written by a single provider, to use for modeling opioid prescribing rates. The original data only contained the number of prescriptions written for each individual drug, many of which are non-opioid medications, so the summ_var variable was helpful for looking at prescribing from a broader scope.
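As a rough sketch of the idea (the actual construction appears in the data-cleaning section below; the three drug columns named here are just an illustrative subset of the opioid columns in the data):
# Sketch only: the total is a row sum over the opioid drug columns for each provider
opioid_cols = c("FENTANYL", "MORPHINE.SULFATE", "OXYCODONE.HCL") # illustrative subset
prescriber_df$summ_var = rowSums(prescriber_df[ , opioid_cols])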
The Data
The data sets used in this analysis were found on Kaggle and are sourced from CMS.gov (Centers for Medicare & Medicaid Services).
One challenge working with this data involved formatting. There were thousands of spelling inconsistencies among providers' specialties and credentials, which resulted in some observations being lost when cleaning the data.
The data covering only the year 2014 also presented some issues. While there is some insight to be gained by examining characteristics of providers from a single year, it is difficult to understand how the trends have changed over time and how much can be explained by new policies (e.g. Prescription Drug Monitoring Programs) or developments in healthcare (e.g. technology; drug alternatives).
A time series panel of data with these same variables across multiple years would provide better insight into the things worth knowing from a healthcare or policy perspective. Pulling in additional sources of data along with the time panel component would also help enrich the insight from this analysis.
Methods
Forward and backward stepwise selection was used to choose the variables included in the regression. For the classification models, variable importance and selection were determined algorithmically. A random forest was also used in an attempt to predict the summ_var variable. Binning summ_var into factor levels, each represented by its bin mean, converted the regression problem into a classification problem and provided a more generalizable prediction model for prescribing rates.
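A minimal sketch of that binning step (assuming train_df is the training split created later; n = 30 matches the number of bins used in Model 3):
# Sketch: cut summ_var into 30 equal-sized bins and treat the bin label as a class
binned_df = train_df %>%
mutate(new_bin = as.factor(ntile(summ_var, n = 30))) %>%
group_by(new_bin) %>%
mutate(avg_sum_var = mean(summ_var)) %>% # bin mean, usable as a point prediction
ungroup()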
Results and Conclusions
RMSE was used to assess accuracy when predicting the numeric outcomes. The out-of-sample RMSE for the multivariate regression model was 109.4 when predicting the summ_var variable (with a training RMSE of 103.6). While it would be ideal for this value to be closer to zero, it does not necessarily indicate that the model performed poorly, since the standard deviation of summ_var was above 220 in both the training and testing sets and the residuals were evenly distributed.
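One way to put that RMSE in context is to compare it with a naive baseline that always predicts the mean of summ_var; that baseline's RMSE is roughly the standard deviation of the outcome, so a useful model should land well below it. A sketch, using the Model 1 test-set fitted values computed later:
rmse_model = sqrt(mean((m1_test$summ_var - m1_test$fitted)^2)) # about 109
rmse_baseline = sd(m1_test$summ_var) # roughly the RMSE of always predicting the mean
rmse_model / rmse_baseline # well below 1, so the model beats the mean-only baseline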
Accuracy was an important measurement for the classification models. The support vector machine model performed the best out-of-sample when predicting gender, with an accuracy of 77.37%. A random forest was used to predict the summ_var variable as a classification problem, with the levels determined by bins of means. When the number of bins was set to n = 30 the accuracy was 61.4%, and the model became more accurate as the number of bins was decreased.
Something that limited performance on this project was the computational capability of my computer. Rendering the final HTML took a while to process since I did not utilize parallelization in the analysis. It also would have been helpful to look for a time series panel for this data as well as additional sources. Not only would this have enriched my findings, but it could also provide insight into other trends, such as heroin usage and overdose rates compared with opioid prescription rates over time.
Exploratory Data Analysis - Trends in Opioid Prescriptions
Kyle Brewster
Data Cleaning & Setup
library(pacman)
p_load(dplyr, readr, skimr, ggplot2, tidyverse, tidymodels, janitor, magrittr)
opioid_list = read.csv("opioids.csv")
overdose_df = read.csv("overdoses.csv")
prescriber_df = read.csv("prescriber_info.csv")
Overview
#skim(prescriber_df)
Cleaning up
The Credentials column
# To get an idea of some of the more-common types of credentials
top_counts(prescriber_df$Credentials, max_levels = 20)
## [1] "MD: 7034, M.D: 6772, DDS: 1145, D.O: 866, PA-: 845, emp: 763, D.D: 717, DO: 549, NP: 469, DMD: 449, PA: 437, O.D: 353, FNP: 262, ARN: 256, DPM: 247, M.D: 243, D.M: 234, OD: 175, P.A: 168, CRN: 150"
# Cleaning up the acronyms for some of the most common types of credential
scribe_MD <- prescriber_df %>%
group_by(Credentials) %>%
filter(Credentials == "MD"|
Credentials == "M.D."|
Credentials == "M.D"|
Credentials == "M.D.,"|
Credentials == "M,D"|
Credentials == "MD."|
Credentials == "M.D,"|
Credentials == "M.D")
scribe_DDS <- prescriber_df %>%
filter(Credentials == "DDS"|
Credentials == "DDS, MS,"|
Credentials == "D.D.S."|
Credentials == "DDS, MS"|
Credentials == "DDS. MS"|
Credentials == "MD, DDS"|
Credentials == "DDS,MD"|
Credentials == "DMD"|
Credentials == "D.M.D."|
Credentials == "D.M.D"|
Credentials == "DDS, MD")
scribe_DO <- prescriber_df %>%
group_by(Credentials) %>%
filter(Credentials == "DO"|
Credentials == "D.O."|
Credentials == "D.O"|
Credentials == "DO,"|
Credentials == "D.O., M.D"|
Credentials == "D.O., M.D."|
Credentials == "D,O."|
Credentials == "D.O.,")
scribe_PA <- prescriber_df %>%
group_by(Credentials) %>%
filter(Credentials == "PA"|
Credentials == "P.A."|
Credentials == "P.A.-C"|
Credentials == "PA-C"|
Credentials == "P.A"|
Credentials == "PA-"|
Credentials == "PA,"|
Credentials == "PA."|
Credentials == "P,A."|
Credentials == "P.A.C"|
Credentials == "P.A.,")
scribe_NP <- prescriber_df %>%
group_by(Credentials) %>%
filter(Credentials == "NP"|
Credentials == "N.P."|
Credentials == "N.P"|
Credentials == "NP."|
Credentials == "NP,"|
Credentials == "N.P.,"|
Credentials == "N P"|
Credentials == "NURSE PRACTITIONER"|
Credentials == "NP-C"|
Credentials == "NP."|
Credentials == "N P")
## Now to make the values more uniform
scribe_DDS = replace(scribe_DDS, 4,"DDS_DDM")
scribe_DO = replace(scribe_DO, 4, "DO")
scribe_MD = replace(scribe_MD, 4, "MD")
scribe_PA = replace(scribe_PA, 4, "PA")
scribe_NP = replace(scribe_NP, 4, "NP")
scribe_joint = rbind(scribe_DDS, scribe_DO,
scribe_MD, scribe_NP,
scribe_PA)
The Specialty column
Looking at the top specialties
tab <- as.data.frame(table(prescriber_df$Specialty))
tab =arrange(tab, desc(Freq))
Filtering the data to keep only some of the more common values
scribe_joint = scribe_joint %>%
group_by(Specialty) %>%
filter(Specialty == "Internal Medicine"|
Specialty == "Family Practice"|
Specialty == "Dentist"|
Specialty == "Physician Assistant"|
Specialty == "Nurse Practitioner"|
Specialty == "Emergency Medicine"|
Specialty == "Psychiatry"|
Specialty == "Cardiology"|
Specialty == "Obstetrics/Gynecology"|
Specialty == "Orthopedic Surgery"|
Specialty == "Optometry"|
Specialty == "General Surgery")
The State Naming Convention
Using the built-in state.abb vector (from the datasets package) to save some effort in narrowing this set down to the 50 states and excluding territories, miscellaneous, or missing categories
list_of_states = as.data.frame(cbind(state.abb))
list_of_states = list_of_states %>%
mutate(State = state.abb) # Let's make the state variable have the same naming convention in both; and then filter out providers outside of our scope of interest
scribe_joint_nonstates = anti_join(scribe_joint, list_of_states, by = 'State')
# And now to get our sets for only the 50 states
scribe_joint = scribe_joint[
!scribe_joint$State %in% scribe_joint_nonstates$State,
]
Creating Variable for Total Number of Opioids Prescribed
Since we are working with multiple variables involving opioids, we must avoid the mistake of including an opioid variable among both the predictors and the outcome. Doing so would artificially inflate model accuracy, since the predictors and the outcome would be directly correlated and the model would simply be predicting an outcome we already know.
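As a quick illustration of the concern (a sketch, using one of the opioid columns and the summ_var total constructed below): the total is literally a sum that includes each opioid column, so predictor and outcome would be mechanically related.
with(working_df, cor(HYDROCODONE.ACETAMINOPHEN, summ_var)) # not independent: summ_var includes this column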
Looking at the opioid_list
data frame.
unique(opioid_list$Generic.Name) # To see which medications are opioids
## [1] "FENTANYL CITRATE" "ACETAMINOPHEN WITH CODEINE"
## [3] "CODEINE/BUTALBITAL/ASA/CAFFEIN" "DIHYDROCODEINE/ASPIRIN/CAFFEIN"
## [5] "MORPHINE SULFATE" "OPIUM/BELLADONNA ALKALOIDS"
## [7] "BUPRENORPHINE HCL" "BUTALBIT/ACETAMIN/CAFF/CODEINE"
## [9] "BUTORPHANOL TARTRATE" "BUPRENORPHINE"
## [11] "CODEINE/CARISOPRODOL/ASPIRIN" "CODEINE SULFATE"
## [13] "HYDROCODONE/ACETAMINOPHEN" "TRAMADOL HCL"
## [15] "MEPERIDINE HCL" "MEPERIDINE HCL/PF"
## [17] "HYDROMORPHONE HCL" "HYDROMORPHONE HCL/PF"
## [19] "METHADONE HCL" "FENTANYL"
## [21] "MORPHINE SULFATE/PF" "OXYCODONE HCL/ACETAMINOPHEN"
## [23] "OXYCODONE HCL/ASPIRIN" "HYDROCODONE/IBUPROFEN"
## [25] "LEVORPHANOL TARTRATE" "NALBUPHINE HCL"
## [27] "TAPENTADOL HCL" "OXYMORPHONE HCL"
## [29] "OPIUM TINCTURE" "OXYCODONE HCL"
## [31] "IBUPROFEN/OXYCODONE HCL" "PENTAZOCINE HCL/ACETAMINOPHEN"
## [33] "PENTAZOCINE HCL/NALOXONE HCL" "PENTAZOCINE LACTATE"
## [35] "TRAMADOL HCL/ACETAMINOPHEN" "DHCODEINE BT/ACETAMINOPHN/CAFF"
## [37] "HYDROCODONE BITARTRATE"
top_opioids = rbind(
opioid_list[grep("BUPRENORPHINE", opioid_list$Generic.Name),],
opioid_list[grep("BUTORPHANOL", opioid_list$Generic.Name),],
opioid_list[grep("CODEINE", opioid_list$Generic.Name),],
opioid_list[grep("FENTANYL", opioid_list$Generic.Name),],
opioid_list[grep("HYDROCODONE", opioid_list$Generic.Name),],
opioid_list[grep("HYDROMORPHONE", opioid_list$Generic.Name),],
opioid_list[grep("LEVORPHANOL", opioid_list$Generic.Name),],
opioid_list[grep("MEPERIDINE", opioid_list$Generic.Name),],
opioid_list[grep("METHADONE", opioid_list$Generic.Name),],
opioid_list[grep("MORPHINE", opioid_list$Generic.Name),],
opioid_list[grep("NALBUPHINE", opioid_list$Generic.Name),],
opioid_list[grep("OPIUM", opioid_list$Generic.Name),],
opioid_list[grep("OXYCODONE", opioid_list$Generic.Name),],
opioid_list[grep("OXYMORPHONE", opioid_list$Generic.Name),],
opioid_list[grep("PENTAZOCINE", opioid_list$Generic.Name),],
opioid_list[grep("TRAMADOL", opioid_list$Generic.Name),])
Here we can see which of the drug variables from the original prescriber_df correspond to opioids. Next we can use these drug types to count the total number of opioid prescriptions written by each provider.
working_df = scribe_joint %>%
select(contains(c("NPI","Gender","State","Credentials","Specialty",
"HYDROCODONE","OXYCONTIN",
"CODEINE","FENTANYL","PENTAZOCINE",
"DIHYDROCODEINE","Opioid.Prescriber",
"OPIUM","BUPRENORPHINE","BUTALBIT","BUTORPHANOL",
"TRAMADOL","MEPERIDINE","HYDROMORPHONE","METHADONE",
"MORPHINE","OXYCODONE","LEVORPHANOL","NALBUPHINE",
"TAPENTADOL","OXYMORPHONE","OPIUM")))
# Now we can calculate the sum of opioid prescriptions written by each provider for the year
working_df$summ_var = rowSums(working_df[ ,c(6:18)])
# adding that column to our original data frame
scribe_joint$summ_var <- working_df$summ_var
Removing Outliers
Looking at the summ_var variable that we created, we can see that there is one provider in particular who prescribed a significantly higher amount than any other provider. Let's remove that observation from our data.
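A quick check like the following (a sketch) flags the extreme observation before dropping it:
summary(scribe_joint$summ_var) # the maximum sits far above the upper quartile
scribe_joint[which.max(scribe_joint$summ_var), c("NPI", "Specialty", "summ_var")]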
scribe_joint = subset(scribe_joint, summ_var<4000)
Creating Factor Variables
Thinking back to the overview of the data, we convert the categorical variables (Gender, State, Credentials, and Specialty) into factors.
scribe_joint = scribe_joint %>%
mutate(Gender = as.factor(Gender),
State = as.factor(State),
Credentials = as.factor(Credentials),
Specialty = as.factor(Specialty))
Adjustments for Population
# Creating a variable for per capita rates
overdose_df$Deaths <- as.numeric(gsub(",","",overdose_df$Deaths))
overdose_df$Population <- as.numeric(
gsub(",","",overdose_df$Population))
# Removing commas first to avoid errors in math
overdose_df$deaths_per_capita = (
overdose_df$Deaths/overdose_df$Population
)*100000 # per 100 thousand
df = scribe_joint %>%
group_by(State) %>%
summarise(state_opioid_volume = sum(summ_var)) %>%
arrange(State)
overdose_df$state_opioid_volume <- df$state_opioid_volume
# To give us total opioid prescriptions in each state
# which will allow us to determine the rate-per-thousand people
overdose_df %<>% mutate(opioids_per_cap =
(state_opioid_volume/Population)*1000)
Data Visualization
# Differences when grouped by specialty
ggp1 = ggplot(
scribe_joint,aes(x = Specialty, y = summ_var, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Specialty",
y = "Individual Prescriptions Written",
fill = "Gender",
title = "Opioid Prescription Rates by Specialty & Gender") +
theme(plot.title = element_text(hjust = 0.5, face="bold.italic"),
axis.title.x = element_text(size = 12),
axis.text.x = element_text(size = 7, angle=50,margin = margin(t = 20)),
axis.title.y = element_text(size = 12),
legend.title = element_text(face="bold", size = 10))
# Differences when grouped by credentials
ggp2 = ggplot(
scribe_joint,aes(x = Credentials, y = summ_var, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Credential",
y = "Individual Prescriptions Written",
fill = "Gender",
title = "Opioid Prescription Rates by Credentials & Gender") +
theme(plot.title = element_text(hjust = 0.5, face="bold.italic"),
axis.title.x = element_text(size = 12),
axis.text.x = element_text(size = 7),
axis.title.y = element_text(size = 12),
legend.title = element_text(face="bold", size = 10))
ggp1
ggp2
ggp3 = scribe_joint %>%
group_by(Gender) %>%
summarise(plyr::count(Specialty))%>%
ggplot(.,aes(x = x, y = freq, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge")+
labs(x = "Specialty",
y = "Number of Providers",
fill = "Gender",
title = "Number of Providers by Specialty & Gender") +
theme(plot.title = element_text(hjust = 0.5, face="bold.italic"),
axis.title.x = element_text(size = 12),
axis.text.x = element_text(size = 7, angle=50,margin = margin(t = 20)),
axis.title.y = element_text(size = 12),
legend.title = element_text(face="bold", size = 10))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggp4 = scribe_joint %>%
group_by(Gender) %>%
summarise(plyr::count(Credentials))%>%
ggplot(.,aes(x = x, y = freq, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge")+
labs(x = "Credential",
y = "Number of Providers",
fill = "Gender",
title = "Number of Providers by Credentials & Gender") +
theme(plot.title = element_text(hjust = 0.5, face="bold.italic"),
axis.title.x = element_text(size = 12),
axis.text.x = element_text(size = 7),
axis.title.y = element_text(size = 12),
legend.title = element_text(face="bold", size = 10))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggp3
ggp4
library(usmap)
library(ggpattern)
statepop <- statepop %>% filter(abbr!='DC')
statepop$per_cap_death <- overdose_df$deaths_per_capita
map_of_deaths <- plot_usmap(data = statepop, values = "per_cap_death", color = "black")+
labs(title = "Deaths From Opioid Overdose in US",
subtitle = "Rate per 100,000 population")+
theme(plot.title = element_text(face="bold", size = 14),
plot.subtitle = element_text(face = "italic"),
legend.title = element_text(face = "bold", size = 8),
legend.position = c(0.9, 0.15))+
scale_fill_continuous(
low = "white", high = "red", name = "Death Rate (2014)")
map_of_deaths
plot_overdose_script = ggplot(
overdose_df, aes(x=opioids_per_cap, y=deaths_per_capita)) +
geom_point(size=4, alpha=.6, color="black")+
geom_smooth(method = 'lm', se=F, color="red")+
geom_smooth(method = "loess", se=F, color="blue")+
xlim(0,102)+
ylab("Overdose Death Rate (per 100,000)")+
xlab("Opioid Prescriptions Written (per 1,000 people)")+
labs(title =
"State Overdose Death Rate & Number of Opioid Prescriptions")
plot_overdose_script
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Training & Testing
Now that the data have been cleaned and visualized, we can begin to train models to fit the data. There are several different ways that this data could be modeled; we will focus on a handful of them.
Partitioning of Data for Training and Testing
set.seed(1234) # Setting seed for reproducibility
train_df = scribe_joint %>%
sample_frac(0.8) # Partitioning 20% to be used for testing
test_df = anti_join(
scribe_joint, train_df, by = 'NPI')
# Now let's remove the `NPI` variable to make work easier henceforth
train_df = train_df %>% select(-c("NPI"))
test_df = test_df %>% select(-c("NPI"))
Model 1: Multivariate Regression
# Let's make a copy of the train/test set that we can play with for this model
m1_train <- train_df
m1_test <- test_df
To model total opioid prescribing with this regression, we should remove the variables for the individual opioids and include only the summ_var variable we created for total opioids prescribed. Removing these variables prevents double-counting in the data.
# Using variable from working_df from before to get list of variables to remove from m1_train
colnames(working_df)
## [1] "NPI" "Gender"
## [3] "State" "Credentials"
## [5] "Specialty" "HYDROCODONE.ACETAMINOPHEN"
## [7] "OXYCONTIN" "ACETAMINOPHEN.CODEINE"
## [9] "FENTANYL" "Opioid.Prescriber"
## [11] "IPRATROPIUM.BROMIDE" "TRAMADOL.HCL"
## [13] "HYDROMORPHONE.HCL" "METHADONE.HCL"
## [15] "MORPHINE.SULFATE" "MORPHINE.SULFATE.ER"
## [17] "OXYCODONE.ACETAMINOPHEN" "OXYCODONE.HCL"
## [19] "summ_var"
m1_train = m1_train %>%
select(-c("HYDROCODONE.ACETAMINOPHEN",
"OXYCONTIN","ACETAMINOPHEN.CODEINE",
"FENTANYL","Opioid.Prescriber",
"TRAMADOL.HCL","HYDROMORPHONE.HCL",
"METHADONE.HCL","MORPHINE.SULFATE",
"MORPHINE.SULFATE.ER",
"OXYCODONE.ACETAMINOPHEN",
"OXYCODONE.HCL"))
p_load(tidyverse, caret, leaps)
set.seed(123)
# Forward step wise selection
fwd_step = train(
summ_var ~.,
data = m1_train,
trControl = trainControl(method = "cv", number = 5),
method = "leapForward",
tuneGrid = expand.grid(nvmax = 1:15)
)
fwd_step$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 142.9462 0.5889351 67.95358 9.012730 0.02018443 1.853411
## 2 2 128.1006 0.6694549 62.71504 4.961419 0.01403452 1.281382
## 3 3 120.2452 0.7102281 59.75820 5.833873 0.01121955 1.506387
## 4 4 115.5554 0.7320890 57.86897 4.411791 0.01680322 1.402685
## 5 5 113.0243 0.7437949 55.51938 4.602188 0.01436761 1.105121
## 6 6 111.1988 0.7516913 53.73136 3.880417 0.01627388 1.393002
## 7 7 108.0603 0.7648023 52.33373 3.102894 0.01421277 0.809943
## 8 8 107.1782 0.7686659 51.76848 3.298020 0.01190997 1.205059
## 9 9 106.0462 0.7735734 51.37989 3.802985 0.01091996 1.261378
## 10 10 104.9922 0.7784379 51.00413 3.694032 0.01140687 1.131310
## 11 11 105.3073 0.7775196 50.87328 4.063669 0.01591674 1.419487
## 12 12 104.6231 0.7803991 49.95771 4.434914 0.01709791 1.191680
## 13 13 106.2271 0.7742025 49.73829 3.873788 0.01478199 1.207771
## 14 14 106.0036 0.7751666 49.52943 4.773148 0.01895048 1.298049
## 15 15 105.7240 0.7764202 49.50593 4.313467 0.01663582 1.149661
fwd_step$bestTune
## nvmax
## 12 12
# Backward step wise selection
back_step = train(
summ_var ~.,
data = m1_train,
trControl = trainControl(method = "cv", number = 5),
method = "leapBackward",
tuneGrid = expand.grid(nvmax = 1:15)
)
back_step$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 147.5288 0.5616769 71.38966 13.139949 0.042923270 8.705169
## 2 2 131.1565 0.6529783 63.98359 8.908965 0.019530571 4.074801
## 3 3 122.5604 0.6973548 60.70188 9.706783 0.025613525 3.701787
## 4 4 116.6978 0.7260008 58.50078 7.923951 0.013846784 3.196471
## 5 5 115.5123 0.7317133 57.38717 7.242762 0.011822267 2.972636
## 6 6 112.4236 0.7456669 55.99648 6.190942 0.009952097 3.008983
## 7 7 110.3052 0.7549938 54.12002 5.867603 0.011367704 2.565357
## 8 8 108.2452 0.7638922 52.85418 5.909550 0.011694650 2.515367
## 9 9 105.9964 0.7736059 52.04085 5.262633 0.010746906 2.488759
## 10 10 106.6327 0.7712167 51.33036 8.351863 0.021385624 2.788668
## 11 11 105.8898 0.7742964 51.09143 6.888476 0.017936588 2.433166
## 12 12 105.6434 0.7750603 50.81237 6.862724 0.021387713 2.200488
## 13 13 105.0630 0.7772380 50.40047 5.924330 0.022142826 1.958428
## 14 14 106.1059 0.7731446 50.34965 7.771442 0.028811771 1.938923
## 15 15 106.4856 0.7717127 50.37024 8.212358 0.029832796 1.689108
back_step$bestTune
## nvmax
## 13 13
fin_fwd_step_list = as.data.frame(coef(fwd_step$finalModel,12))
fin_bck_step_list = as.data.frame(coef(back_step$finalModel,13))
print(fin_bck_step_list)
## coef(back_step$finalModel, 13)
## (Intercept) 13.3716205
## SpecialtyOrthopedic Surgery 119.6345098
## BACLOFEN 0.9503030
## CARISOPRODOL 3.8013034
## CYCLOBENZAPRINE.HCL 2.9703701
## DIVALPROEX.SODIUM.ER -1.7089292
## GABAPENTIN 0.4771488
## LORAZEPAM 0.9163253
## LYRICA 1.8918060
## METHOCARBAMOL 2.6123489
## ONDANSETRON.HCL 2.5591120
## PREDNISONE 0.6453914
## PROAIR.HFA 0.5089056
## TIZANIDINE.HCL 1.6215341
print(fin_fwd_step_list)
## coef(fwd_step$finalModel, 12)
## (Intercept) 12.8747926
## SpecialtyOrthopedic Surgery 119.0406428
## CARISOPRODOL 4.0987142
## CYCLOBENZAPRINE.HCL 3.2235913
## DIVALPROEX.SODIUM.ER -1.5447693
## GABAPENTIN 0.5583611
## LEVOTHYROXINE.SODIUM 0.1294654
## LORAZEPAM 0.7885353
## LYRICA 1.8600224
## METHOCARBAMOL 2.6424817
## ONDANSETRON.HCL 2.3856531
## PREDNISONE 0.6304748
## TIZANIDINE.HCL 1.6664675
Now that we have a list of variables, let's think about which to include in our model. The Orthopedic Surgery specialty is highly significant in both models, so let's create a dummy variable indicating whether a provider is an orthopedic surgeon. CARISOPRODOL, CYCLOBENZAPRINE.HCL, METHOCARBAMOL, ONDANSETRON.HCL, LYRICA, TIZANIDINE.HCL, LORAZEPAM, PREDNISONE, and GABAPENTIN are the drugs that appeared in both best-selected models, so they are included as well. DIVALPROEX.SODIUM.ER was the one variable with a negative coefficient in both models, though it is left out of the final specification below.
m1_train = m1_train %>%
mutate(orth_surg_dummy = as.factor(
if_else(Specialty == "Orthopedic Surgery",1,0))
)
mod1 = lm(data = m1_train, method = "lm", summ_var~
orth_surg_dummy + CARISOPRODOL +
CYCLOBENZAPRINE.HCL+METHOCARBAMOL+
ONDANSETRON.HCL+LYRICA+TIZANIDINE.HCL+
LORAZEPAM+PREDNISONE+GABAPENTIN)
summary(mod1)
##
## Call:
## lm(formula = summ_var ~ orth_surg_dummy + CARISOPRODOL + CYCLOBENZAPRINE.HCL +
## METHOCARBAMOL + ONDANSETRON.HCL + LYRICA + TIZANIDINE.HCL +
## LORAZEPAM + PREDNISONE + GABAPENTIN, data = m1_train, method = "lm")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1235.33 -16.99 -13.54 12.46 1688.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.53637 1.09557 12.36 <2e-16 ***
## orth_surg_dummy1 116.78447 5.03220 23.21 <2e-16 ***
## CARISOPRODOL 4.09686 0.10102 40.56 <2e-16 ***
## CYCLOBENZAPRINE.HCL 3.41291 0.10672 31.98 <2e-16 ***
## METHOCARBAMOL 2.71512 0.19653 13.81 <2e-16 ***
## ONDANSETRON.HCL 2.67239 0.15231 17.55 <2e-16 ***
## LYRICA 2.10255 0.07984 26.34 <2e-16 ***
## TIZANIDINE.HCL 1.56360 0.07336 21.32 <2e-16 ***
## LORAZEPAM 0.57493 0.02746 20.93 <2e-16 ***
## PREDNISONE 0.87213 0.04227 20.63 <2e-16 ***
## GABAPENTIN 0.65028 0.02476 26.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.7 on 11436 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7833
## F-statistic: 4138 on 10 and 11436 DF, p-value: < 2.2e-16
Looking at some measurements of model accuracy
m1_train$resid <- resid(mod1)
m1_train$fitted <- fitted(mod1)
sqrt(mean((m1_train$summ_var - m1_train$fitted)^2)) # to calculate RMSE
## [1] 103.6392
Not the best-performing model, but let's see nonetheless how it performs when facing the testing set. Perhaps the insight to be learned is that we cannot successfully predict opioid prescription rates from these lists of other prescribed medications alone.
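An alternative, and arguably stricter, check is to keep the coefficients estimated on the training set and only call predict() on the testing set, rather than re-estimating the model there. A sketch, assuming the same orth_surg_dummy variable is added to m1_test as in the Testing section below:
m1_test_sk = m1_test %>%
mutate(orth_surg_dummy = as.factor(if_else(Specialty == "Orthopedic Surgery", 1, 0)))
preds_oos = predict(mod1, newdata = m1_test_sk) # uses the training-set coefficients
sqrt(mean((m1_test_sk$summ_var - preds_oos)^2)) # out-of-sample RMSE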
Model 2: Decision Trees
Are we able to predict the gender of a provider based on prescription rates and other variables?
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
m2_train <- train_df %>% na.omit(.)
set.seed(1234)
default_cv = m2_train %>% vfold_cv(v =5)
# Define the decision tree
default_tree = decision_tree(mode ="classification",
cost_complexity = tune(),
tree_depth = tune()) %>%
set_engine("rpart")
# Defining recipe
default_recipe = recipe(Gender ~., data = m2_train)
# Defining workflow
default_flow = workflow() %>%
add_model(default_tree) %>%
add_recipe(default_recipe)
# Tuning
default_cv_fit = default_flow %>%
tune_grid(
default_cv,
grid = expand_grid(
cost_complexity = seq(0, 0.15, by = 0.01),
tree_depth = c(1,2,5,10),
),
metrics = metric_set(accuracy, roc_auc)
)
# Fitting the best model
best_flow = default_flow %>%
finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
fit(data = m2_train)
# Choosing the best model
best_tree = best_flow %>% extract_fit_parsnip()
# Plotting the tree
best_tree$fit %>% rpart.plot(roundint = FALSE)
Model 3: Binscatter Random Forest
m3_train <- train_df
m3_test <- test_df
# Adding an indicator variable for orthopedic surgeons
m3_train %<>%
mutate(orth_surg_dummy = as.factor(
if_else(Specialty == "Orthopedic Surgery",1,0)))
m3_test %<>%
mutate(orth_surg_dummy = as.factor(
if_else(Specialty == "Orthopedic Surgery",1,0)))
# Splitting data into bins
m3_train %<>% mutate(new_bin = ntile(summ_var, n=30)) %>%
mutate(new_bin = as.factor(new_bin))
str(m3_train$new_bin)
## Factor w/ 30 levels "1","2","3","4",..: 1 1 1 24 1 1 26 1 1 1 ...
m3_train = m3_train %>%
group_by(new_bin) %>%
mutate(avg_sum_var = mean(summ_var))
The training accuracy for this model is 61.4%.
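One way such a training accuracy can be computed (a sketch; rf_mod is the same randomForest fit on m3_train that appears in the Testing section below):
rf_mod = randomForest::randomForest(new_bin ~ ., data = m3_train, ntree = 50)
train_pred = predict(rf_mod, newdata = m3_train, type = "response")
mean(train_pred == m3_train$new_bin) # share of training observations whose bin is predicted exactly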
Model 4: Support Vector Machine (SVM)
For this model, let’s see if we can accurately predict the gender of a provider.
m4_train <- train_df
m4_test <- test_df
pacman::p_load(e1071)
Now to prepare the data for making our predictions of interest
# Defining our variables of prediction and outcome
x = m4_train[,-1]
y = m4_train[1]
Now running the model (takes a while, be patient)
svm_model = svm(Gender ~ .,
data = m4_train)
summary(svm_model)
##
## Call:
## svm(formula = Gender ~ ., data = m4_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 7820
##
## ( 4270 3550 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
preds = predict(svm_model,x)
# Ensure caret package is loaded for line below
confusionMatrix(preds, y$Gender)
## Confusion Matrix and Statistics
##
## Reference
## Prediction F M
## F 2011 554
## M 2059 6823
##
## Accuracy : 0.7717
## 95% CI : (0.7639, 0.7794)
## No Information Rate : 0.6444
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4569
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4941
## Specificity : 0.9249
## Pos Pred Value : 0.7840
## Neg Pred Value : 0.7682
## Prevalence : 0.3556
## Detection Rate : 0.1757
## Detection Prevalence : 0.2241
## Balanced Accuracy : 0.7095
##
## 'Positive' Class : F
##
summary(svm_model)
##
## Call:
## svm(formula = Gender ~ ., data = m4_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 7820
##
## ( 4270 3550 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
Testing
Model 1: Multivariate Regression
m1_test = m1_test %>%
mutate(orth_surg_dummy = as.factor(
if_else(Specialty == "Orthopedic Surgery",1,0))
)
# Using the variables selected with stepwise algorithm
mod1t = lm(data = m1_test, summ_var~
orth_surg_dummy + CARISOPRODOL +
CYCLOBENZAPRINE.HCL+METHOCARBAMOL+
ONDANSETRON.HCL+LYRICA+TIZANIDINE.HCL+
LORAZEPAM+PREDNISONE+GABAPENTIN)
summary(mod1t)
##
## Call:
## lm(formula = summ_var ~ orth_surg_dummy + CARISOPRODOL + CYCLOBENZAPRINE.HCL +
## METHOCARBAMOL + ONDANSETRON.HCL + LYRICA + TIZANIDINE.HCL +
## LORAZEPAM + PREDNISONE + GABAPENTIN, data = m1_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1267.8 -14.2 -14.2 12.9 1168.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.19839 2.31422 6.135 9.68e-10 ***
## orth_surg_dummy1 77.78581 10.64788 7.305 3.57e-13 ***
## CARISOPRODOL 6.09944 0.23139 26.360 < 2e-16 ***
## CYCLOBENZAPRINE.HCL 3.88640 0.20764 18.717 < 2e-16 ***
## METHOCARBAMOL 1.85239 0.38520 4.809 1.60e-06 ***
## ONDANSETRON.HCL 2.00514 0.33900 5.915 3.72e-09 ***
## LYRICA 1.83852 0.20389 9.017 < 2e-16 ***
## TIZANIDINE.HCL 0.95089 0.16993 5.596 2.40e-08 ***
## LORAZEPAM 0.67389 0.05293 12.733 < 2e-16 ***
## PREDNISONE 0.41070 0.08398 4.890 1.06e-06 ***
## GABAPENTIN 0.83131 0.05162 16.103 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.6 on 2852 degrees of freedom
## Multiple R-squared: 0.7672, Adjusted R-squared: 0.7663
## F-statistic: 939.7 on 10 and 2852 DF, p-value: < 2.2e-16
m1_test$resid <- resid(mod1t)
m1_test$fitted <- fitted(mod1t)
sqrt(mean((m1_test$summ_var - m1_test$fitted)^2)) # to calculate RMSE
## [1] 109.4119
Looking at the testing results, while the model may not be very accurate, it is at least consistently inaccurate across the data. In the plot below we can see a wide range of predicted versus observed values, but the red reference line (intercept 0, slope 1) shows that the predictions track the observed values on average.
plot(x=m1_test$fitted,
m1_test$summ_var,
xlab = "Predicted Values",
ylab = "Observed Values",
main = "Linear Model")
abline(a = 0,
b = 1,
col = "red",
lwd = 2)
Model 2: Decision Tree
m2_test <- test_df %>% na.omit(.)
set.seed(1234)
default_cv = m2_test %>% vfold_cv(v =5)
default_recipe = recipe(Gender ~., data = m2_test)
# Defining workflow
default_flow = workflow() %>%
add_model(default_tree) %>%
add_recipe(default_recipe)
# Tuning
default_cv_fit = default_flow %>%
tune_grid(
default_cv,
grid = expand_grid(
cost_complexity = seq(0, 0.15, by = 0.01),
tree_depth = c(1,2,5,10),
),
metrics = metric_set(accuracy, roc_auc)
)
# Fitting the best model
best_flow = default_flow %>%
finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
fit(data = m2_test)
# Choosing the best model
best_tree = best_flow %>% extract_fit_parsnip()
# Plotting the tree
best_tree$fit %>% rpart.plot(roundint = FALSE)
Model 3: Random Forest
m3_test %<>% mutate(new_bin = ntile(summ_var, n=30)) %>%
mutate(new_bin = as.factor(new_bin)) %>%
group_by(new_bin) %>%
mutate(avg_sum_var = mean(summ_var))
rf_mod = randomForest::randomForest(formula = new_bin ~.,
data = m3_train,
ntree = 50)
m3_test$pred_rf = predict(rf_mod, newdata = m3_test, type = "response")
# to obtain the number of times the predicted values were the actual values
length(which(m3_test$pred_rf==m3_test$new_bin)) -> accurate #1758
# divided by total observations to obtain accuracy
accurate / (nrow(m3_test))
## [1] 0.5864478
Model 4: Support Vector Machine
Using the same SVM approach as with the training set.
xt = m4_test[,-1]
yt = m4_test[1]
svm_model_test = svm(Gender ~ .,
data = m4_test)
summary(svm_model_test)
##
## Call:
## svm(formula = Gender ~ ., data = m4_test)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2257
##
## ( 997 1260 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
predst = predict(svm_model_test,xt)
# Ensure caret package is loaded for line below
confusionMatrix(predst, yt$Gender)
## Confusion Matrix and Statistics
##
## Reference
## Prediction F M
## F 484 110
## M 538 1731
##
## Accuracy : 0.7737
## 95% CI : (0.7579, 0.7889)
## No Information Rate : 0.643
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4563
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4736
## Specificity : 0.9402
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.7629
## Prevalence : 0.3570
## Detection Rate : 0.1691
## Detection Prevalence : 0.2075
## Balanced Accuracy : 0.7069
##
## 'Positive' Class : F
##
Again, the results are comparable.
Conclusions
The linear model predicting the summ_var variable did a decent job of fitting the general trend in the data, but its root mean squared error (RMSE) of roughly 109 might suggest some concern. Considering that the range of the values was around 3,000, however, that is a fairly small margin in context. It was also reassuring to see that the same general trend existed in both the training and the testing sets.
plot(x=m1_test$fitted,
m1_test$summ_var,
xlab = "Predicted Values",
ylab = "Observed Values",
main = "Linear Model")
abline(a = 0,
b = 1,
col = "red",
lwd = 2)
We can go one step further and create a binned scatter plot of the actual versus predicted values and see that the linear fit becomes stronger.
p1= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
geom_point(size=2, alpha = 0.1, color="red") +
stat_summary_bin(fun.y='mean', bins=75,
color='blue', size=1.5, geom='point')+
geom_smooth(method = "lm", color="blue",se=F)+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
## Warning: `fun.y` is deprecated. Use `fun` instead.
p2 = ggplot(m1_test, aes(x=fitted,y=summ_var)) +
geom_point(size=2, alpha = 0.1, color="red") +
stat_summary_bin(fun.y='mean', bins=15,
color='blue', size=1.5, geom='point')+
geom_smooth(method = "lm", color="blue",se=F)+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
## Warning: `fun.y` is deprecated. Use `fun` instead.
ggpubr::ggarrange(p1,p2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The fit looks even stronger if we restrict the axis range to exclude the most extreme values:
p3= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
geom_point(size=2, alpha = 0.1, color="red") +
stat_summary_bin(fun.y='mean', bins=75,
color='blue', size=1.5, geom='point')+
geom_smooth(method = "lm", color="blue",se=F)+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())+
xlim(0,2000)+ylim(0,2000)
## Warning: `fun.y` is deprecated. Use `fun` instead.
p4= ggplot(m1_test, aes(x=fitted,y=summ_var)) +
geom_point(size=2, alpha = 0.1, color="red") +
stat_summary_bin(fun.y='mean', bins=15,
color='blue', size=1.5, geom='point')+
geom_smooth(method = "lm", color="blue",se=F)+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())+
xlim(0,2000)+ylim(0,2000)
## Warning: `fun.y` is deprecated. Use `fun` instead.
ggpubr::ggarrange(p3,p4)
## Warning: Removed 5 rows containing non-finite values (stat_summary_bin).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing non-finite values (stat_summary_bin).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
The random forest model had an out-of-sample accuracy of 58.6% (61.4% on the training set). By comparison, the SVM model had an accuracy of 77.37% (with a 95% confidence interval of (0.7579, 0.7889)), and the decision tree model had an accuracy of 72.13%, as calculated from the confusion matrix below:
m2_test$tree_mod = predict(best_tree, new_data = m2_test)
temp_df = m2_test %>% rename(tree_mod = 257) %>%
mutate(tree_pred = if_else(tree_mod =="M", "MALE","FEMALE" ))
table(Actual = temp_df$Gender,
Predicted = temp_df$tree_pred)
## Predicted
## Actual FEMALE MALE
## F 387 635
## M 163 1678
(387+1678)/(387+635+163+1678)
## [1] 0.7212714
Accuracy was not a metric worth considering when predicting continuous variables. Due to the natural smoothing that results from modeling, it is not very likely that the model will get many (if any) predictions correct down to the exact integer.
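To illustrate, we can check how often the regression's rounded predictions exactly match the observed counts; the share should be tiny compared with the classification accuracies above (a sketch using the Model 1 test-set fitted values):
mean(round(m1_test$fitted) == m1_test$summ_var) # share of predictions that hit the exact count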
There are also other measures of performance in the models above that are important to consider when building or assessing a model. Looking at a plot of the residuals for the regression model, we can see strong clustering around zero, but there is still noise left uncaptured by the model.
plot(m1_test$resid)
If we look at the listed variables of importance for the decision tree model, we can see which variables were valued more in each model. Knowledge of how these models are constructed is important when attempting to use them to predict new data in which changes in policy or other developments (like prescription drug monitoring programs) have altered trends in healthcare and opioid prescriptions.
best_tree$fit$variable.importance
## Specialty Credentials PREMARIN
## 163.6312629 128.5249192 2.3800911
## METHADONE.HCL MORPHINE.SULFATE.ER CLOBETASOL.PROPIONATE
## 0.8925342 0.8925342 0.5950228
When it came to the classification prediction of the gender variable, the SVM outperformed the decision tree. It would be interesting to see how these models perform when facing more recent data, since the methods used to construct the models should adjust for potential differences in volume by using proportions of the whole.