Predicting Housing Prices
Kyle Brewster
2022-05-30
01 - Briefly explain why being able to predict whether my model underpredicted a house’s price means that your model “beats” my model.
If I am able to predict when the original model under-predicts a house’s price, then I have effectively grouped the data in much the same way as the model that produced the undervalued labels. Along the way, I will also gather an idea of how confident each prediction of undervalued is. By the end of my modeling I will therefore (in theory) not only have predicted whether an observation is undervalued from its characteristics, but I will also have an idea of “how much” and “in which direction” the actual value differs from the suggested valuation. Knowing the direction of the original model’s errors would let me adjust its predictions toward the true prices, so my adjusted predictions can only do at least as well — which is why my model “beats” the original.
02 - Use two different models to predict undervalued
housing = read.csv("final-data.csv")
Since we are attempting to predict a T/F value, we will be using methods of classification.
To get an overview of the data:
# For handy viewing from environment
skimr::skim(housing) -> skim_df
skimr::skim(housing)
Name | housing |
Number of rows | 1460 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
character | 43 |
logical | 1 |
numeric | 37 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ms_zoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
lot_shape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
land_contour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
lot_config | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
land_slope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
bldg_type | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
house_style | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
roof_style | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
roof_matl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
mas_vnr_type | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
exter_qual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
exter_cond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
bsmt_qual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_cond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_exposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_fin_type1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
bsmt_fin_type2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
heating_qc | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
central_air | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
kitchen_qual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
fireplace_qu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
garage_type | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
garage_finish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
garage_qual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
garage_cond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
paved_drive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
pool_qc | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
misc_feature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
sale_type | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
sale_condition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
undervalued | 0 | 1 | 0.48 | FAL: 754, TRU: 706 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
ms_sub_class | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
lot_frontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
lot_area | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
overall_qual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
overall_cond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
year_built | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
year_remod_add | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
mas_vnr_area | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
bsmt_fin_sf1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
bsmt_fin_sf2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
bsmt_unf_sf | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
total_bsmt_sf | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
x1st_flr_sf | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
x2nd_flr_sf | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
low_qual_fin_sf | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
gr_liv_area | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
bsmt_full_bath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
bsmt_half_bath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
full_bath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
half_bath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
bedroom_abv_gr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
kitchen_abv_gr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
tot_rms_abv_grd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
garage_yr_blt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
garage_cars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
garage_area | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
wood_deck_sf | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
open_porch_sf | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
enclosed_porch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
x3ssn_porch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
screen_porch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
pool_area | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
misc_val | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
mo_sold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
yr_sold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
Looking at the description of the data, we can see that an NA value is assigned to observations that do not possess the amenity described by the variable. Taking bsmt_qual as an example, a house without a basement would have an NA value for that variable, which would not necessarily indicate a missing or incomplete response (although the complete_rate should be considered at the same time).

Only three numeric variables are missing values, while several of the character variables are also missing data. We can formally list all variables with missing values with the following commands:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
skim_df %>% filter(skim_type=="numeric" & n_missing>0) -> num_na
skim_df %>% filter(skim_type=="character" & n_missing>0)-> char_na
num_na
Name | housing |
Number of rows | 1460 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
lot_frontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59 | 69 | 80 | 313 | ▇▃▁▁▁ |
mas_vnr_area | 8 | 0.99 | 103.69 | 181.07 | 0 | 0 | 0 | 166 | 1600 | ▇▁▁▁▁ |
garage_yr_blt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961 | 1980 | 2002 | 2010 | ▁▁▅▅▇ |
char_na
Name | housing |
Number of rows | 1460 |
Number of columns | 81 |
_______________________ | |
Column type frequency: | |
character | 16 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
mas_vnr_type | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
bsmt_qual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_cond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_exposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_fin_type1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
bsmt_fin_type2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
fireplace_qu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
garage_type | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
garage_finish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
garage_qual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
garage_cond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
pool_qc | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
misc_feature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
Even though the complete_rate for a few of these variables is low enough that we might consider dropping them in some contexts, we should remember that this is to be expected: not every house has a pool, fencing, a fireplace, etc. Similar logic applies to the numeric variables. Not all houses have a garage, street frontage connected to the property, or masonry veneer area that can be quantified, because those features simply do not exist for them. We will therefore assume that a missing value in these variables indicates the observation does not possess the given feature, and we will replace those values with zero.
Cleaning
After looking at the data, there are a few things we should change before modeling:

- Convert variables from character class to factors
- Replace NA values
- Add log-transformed variables
- Normalize the data (with min/max scaling)
# Including again to have single chunk to run for cleaning/wrangling
housing = read.csv("final-data.csv")
# For comparing original and modified data frames
clean_house <- housing
# Loading packages
pacman::p_load( # package manager
dplyr, # for wrangling/cleaning/syntax
magrittr, # le pipe
tidyverse, # modeling
caret # also modeling
)
# Converting character variables to factors
clean_house[sapply(clean_house, is.character)] <- lapply(
clean_house[sapply(clean_house, is.character)], as.factor)
# Removing NA's for numeric and factor variables
clean_house %<>% mutate(
across(where(is.numeric),~replace_na(.,0)),
across(where(is.factor),~fct_explicit_na(.,'none')))
# Define min-max normalization function
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))}
# Applying function to all non-factor variables
clean_house %<>% mutate_if(is.integer, min_max_norm) %>%
mutate(id = seq(nrow(.)),
undervalued = as.factor(undervalued))
# Splitting into training and testing sets by 80-20 splitting
set.seed(123)
train = clean_house %>% mutate(undervalued = as.factor(undervalued)) %>%
sample_frac(0.8)
test = anti_join(clean_house, train, by = 'id')
# Dropping ID variable
train %<>% select(-c("id"))
test %<>% select(-c("id"))
If desired, we can also run the code below to get an overview of the data and check for other issues before moving forward (all variables should be either numeric or factor class at this point):
str(train)
Modeling
Binary Logistic Regression
Let’s take a look at how our predictions would perform using only regression methods, first with all variables in the data.
# GLM model using all variables
glm_mod_all = glm(
undervalued ~.,
data = train,
family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_mod_all)
##
## Call:
## glm(formula = undervalued ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.06390 -0.71949 -0.00022 0.71026 2.90806
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.276e+02 3.737e+04 -0.009 0.993007
## ms_sub_class 5.253e+00 2.139e+00 2.456 0.014061 *
## ms_zoningFV 1.166e+00 1.883e+00 0.619 0.535967
## ms_zoningRH 2.254e+00 1.982e+00 1.137 0.255343
## ms_zoningRL 6.449e-01 1.712e+00 0.377 0.706327
## ms_zoningRM 4.738e-02 1.642e+00 0.029 0.976973
## lot_frontage -9.176e-01 9.887e-01 -0.928 0.353362
## lot_area 2.289e+01 4.800e+00 4.769 1.85e-06 ***
## streetPave 3.032e+01 1.195e+03 0.025 0.979763
## alleyPave 4.068e-01 9.098e-01 0.447 0.654779
## alleynone 8.510e-01 6.784e-01 1.254 0.209711
## lot_shapeIR2 7.342e-01 6.167e-01 1.190 0.233891
## lot_shapeIR3 1.206e+00 1.567e+00 0.770 0.441391
## lot_shapeReg 1.710e-01 2.190e-01 0.781 0.434747
## land_contourHLS -2.906e-01 7.989e-01 -0.364 0.716011
## land_contourLow -2.600e+00 9.512e-01 -2.734 0.006258 **
## land_contourLvl -7.018e-01 5.663e-01 -1.239 0.215234
## utilitiesNoSeWa -1.229e+01 3.956e+03 -0.003 0.997521
## lot_configCulDSac -1.874e+00 4.819e-01 -3.888 0.000101 ***
## lot_configFR2 -1.510e+00 5.396e-01 -2.798 0.005148 **
## lot_configFR3 -1.709e+01 2.775e+03 -0.006 0.995087
## lot_configInside -2.528e-01 2.439e-01 -1.036 0.300074
## land_slopeMod 1.256e+00 6.066e-01 2.071 0.038356 *
## land_slopeSev -3.870e+00 2.184e+00 -1.772 0.076370 .
## neighborhoodBlueste -1.635e+01 3.956e+03 -0.004 0.996702
## neighborhoodBrDale -1.364e+00 1.422e+00 -0.960 0.337304
## neighborhoodBrkSide -9.250e-01 1.292e+00 -0.716 0.474100
## neighborhoodClearCr -3.335e+00 1.375e+00 -2.426 0.015286 *
## neighborhoodCollgCr -9.551e-01 9.175e-01 -1.041 0.297885
## neighborhoodCrawfor 1.709e+00 1.186e+00 1.441 0.149633
## neighborhoodEdwards -1.293e+00 1.052e+00 -1.229 0.219104
## neighborhoodGilbert -4.685e-01 9.632e-01 -0.486 0.626675
## neighborhoodIDOTRR -1.071e+00 1.473e+00 -0.727 0.467283
## neighborhoodMeadowV -2.519e+00 1.461e+00 -1.724 0.084650 .
## neighborhoodMitchel -3.493e+00 1.095e+00 -3.189 0.001428 **
## neighborhoodNAmes -1.863e+00 1.012e+00 -1.841 0.065656 .
## neighborhoodNoRidge 2.980e+00 1.090e+00 2.733 0.006269 **
## neighborhoodNPkVill 1.625e+01 2.187e+03 0.007 0.994070
## neighborhoodNridgHt -2.268e-01 9.525e-01 -0.238 0.811798
## neighborhoodNWAmes -1.036e+00 1.020e+00 -1.015 0.309912
## neighborhoodOldTown -2.986e-01 1.333e+00 -0.224 0.822754
## neighborhoodSawyer -1.998e+00 1.043e+00 -1.916 0.055420 .
## neighborhoodSawyerW -2.128e+00 1.029e+00 -2.069 0.038591 *
## neighborhoodSomerst 3.075e-01 1.100e+00 0.280 0.779826
## neighborhoodStoneBr 3.170e+00 1.115e+00 2.843 0.004475 **
## neighborhoodSWISU -3.625e+00 1.627e+00 -2.227 0.025919 *
## neighborhoodTimber -1.623e+00 1.075e+00 -1.510 0.131005
## neighborhoodVeenker -7.997e-01 1.349e+00 -0.593 0.553431
## condition1Feedr 2.357e-01 7.799e-01 0.302 0.762505
## condition1Norm -7.175e-01 6.748e-01 -1.063 0.287658
## condition1PosA -1.344e-01 1.556e+00 -0.086 0.931182
## condition1PosN 1.004e+00 1.069e+00 0.939 0.347820
## condition1RRAe -2.279e+00 1.317e+00 -1.731 0.083539 .
## condition1RRAn 9.407e-01 1.027e+00 0.916 0.359891
## condition1RRNe 1.491e+00 2.063e+00 0.722 0.470070
## condition1RRNn 2.535e-01 2.048e+00 0.124 0.901462
## condition2Feedr 2.226e-01 3.958e+00 0.056 0.955150
## condition2Norm 1.554e+00 3.284e+00 0.473 0.636014
## condition2PosA 5.329e+00 5.595e+03 0.001 0.999240
## condition2PosN -2.184e+01 2.277e+03 -0.010 0.992345
## condition2RRAn -1.937e+01 3.956e+03 -0.005 0.996093
## condition2RRNn 1.762e+01 3.956e+03 0.004 0.996445
## bldg_type2fmCon -1.759e+00 1.980e+00 -0.889 0.374164
## bldg_typeDuplex -3.454e+00 1.324e+00 -2.608 0.009102 **
## bldg_typeTwnhs -3.685e+00 1.486e+00 -2.480 0.013151 *
## bldg_typeTwnhsE -2.461e+00 1.308e+00 -1.881 0.059965 .
## house_style1.5Unf 5.821e-01 1.435e+00 0.406 0.684966
## house_style1Story 2.872e-01 6.433e-01 0.446 0.655328
## house_style2.5Fin -1.784e+01 1.431e+03 -0.012 0.990058
## house_style2.5Unf 4.718e-01 1.332e+00 0.354 0.723233
## house_style2Story -6.294e-02 5.072e-01 -0.124 0.901241
## house_styleSFoyer -8.474e-02 9.172e-01 -0.092 0.926388
## house_styleSLvl 4.004e-01 7.911e-01 0.506 0.612758
## overall_qual -1.099e+01 1.487e+00 -7.388 1.49e-13 ***
## overall_cond 1.134e-01 1.025e+00 0.111 0.911959
## year_built 3.340e+00 1.727e+00 1.934 0.053156 .
## year_remod_add 4.541e-01 4.697e-01 0.967 0.333713
## roof_styleGable -1.558e+01 3.956e+03 -0.004 0.996858
## roof_styleGambrel -1.601e+01 3.956e+03 -0.004 0.996770
## roof_styleHip -1.601e+01 3.956e+03 -0.004 0.996772
## roof_styleMansard -9.735e+00 3.956e+03 -0.002 0.998037
## roof_styleShed 1.247e+01 5.595e+03 0.002 0.998221
## roof_matlCompShg -9.804e+01 3.405e+04 -0.003 0.997703
## roof_matlMetal -1.245e+02 3.451e+04 -0.004 0.997122
## roof_matlRoll -1.110e+02 3.428e+04 -0.003 0.997415
## roof_matlTar&Grv -1.135e+02 3.428e+04 -0.003 0.997357
## roof_matlWdShake -1.009e+02 3.405e+04 -0.003 0.997636
## roof_matlWdShngl -8.174e+01 3.409e+04 -0.002 0.998087
## exterior1stAsphShn -1.644e+01 3.956e+03 -0.004 0.996685
## exterior1stBrkComm 3.191e+01 2.938e+03 0.011 0.991335
## exterior1stBrkFace 1.696e+01 1.962e+03 0.009 0.993102
## exterior1stCBlock -2.042e+01 3.956e+03 -0.005 0.995882
## exterior1stCemntBd 1.106e+01 1.962e+03 0.006 0.995502
## exterior1stHdBoard 1.466e+01 1.962e+03 0.007 0.994039
## exterior1stImStucc 2.967e+01 4.416e+03 0.007 0.994640
## exterior1stMetalSd 1.431e+01 1.962e+03 0.007 0.994178
## exterior1stPlywood 1.451e+01 1.962e+03 0.007 0.994100
## exterior1stStone 2.773e+01 4.416e+03 0.006 0.994990
## exterior1stStucco 1.411e+01 1.962e+03 0.007 0.994261
## exterior1stVinylSd 1.499e+01 1.962e+03 0.008 0.993904
## exterior1stWd Sdng 1.316e+01 1.962e+03 0.007 0.994646
## exterior1stWdShing 1.534e+01 1.962e+03 0.008 0.993761
## exterior2ndAsphShn NA NA NA NA
## exterior2ndBrk Cmn -3.124e+01 2.938e+03 -0.011 0.991516
## exterior2ndBrkFace -1.526e+01 1.962e+03 -0.008 0.993793
## exterior2ndCBlock NA NA NA NA
## exterior2ndCmentBd -1.025e+01 1.962e+03 -0.005 0.995831
## exterior2ndHdBoard -1.444e+01 1.962e+03 -0.007 0.994129
## exterior2ndImStucc -1.484e+01 1.962e+03 -0.008 0.993966
## exterior2ndMetalSd -1.388e+01 1.962e+03 -0.007 0.994354
## exterior2ndPlywood -1.360e+01 1.962e+03 -0.007 0.994468
## exterior2ndStone -1.479e+01 1.962e+03 -0.008 0.993986
## exterior2ndStucco -1.211e+01 1.962e+03 -0.006 0.995075
## exterior2ndVinylSd -1.432e+01 1.962e+03 -0.007 0.994176
## exterior2ndWd Sdng -1.257e+01 1.962e+03 -0.006 0.994887
## exterior2ndWd Shng -1.538e+01 1.962e+03 -0.008 0.993743
## mas_vnr_typeBrkFace 1.441e+00 8.968e-01 1.607 0.108056
## mas_vnr_typeNone 5.383e-01 9.010e-01 0.597 0.550210
## mas_vnr_typeStone 1.202e+00 9.522e-01 1.263 0.206660
## mas_vnr_typenone -1.649e+00 1.577e+00 -1.046 0.295709
## mas_vnr_area -4.149e+00 1.316e+00 -3.152 0.001622 **
## exter_qualFa 2.368e-01 2.034e+00 0.116 0.907293
## exter_qualGd -1.055e+00 6.232e-01 -1.694 0.090338 .
## exter_qualTA 2.654e-01 6.941e-01 0.382 0.702170
## exter_condFa -1.562e+01 3.956e+03 -0.004 0.996849
## exter_condGd -1.583e+01 3.956e+03 -0.004 0.996807
## exter_condPo -3.429e+01 5.595e+03 -0.006 0.995110
## exter_condTA -1.544e+01 3.956e+03 -0.004 0.996886
## foundationCBlock 9.155e-01 4.982e-01 1.837 0.066148 .
## foundationPConc 1.709e-01 5.492e-01 0.311 0.755658
## foundationSlab -2.828e-01 1.819e+00 -0.155 0.876432
## foundationStone -1.053e+00 1.864e+00 -0.565 0.572063
## foundationWood -1.652e+00 1.838e+00 -0.899 0.368855
## bsmt_qualFa -1.722e+00 9.085e-01 -1.896 0.058018 .
## bsmt_qualGd -5.530e-01 4.549e-01 -1.216 0.224116
## bsmt_qualTA -4.060e-01 5.667e-01 -0.716 0.473815
## bsmt_qualnone 1.968e+01 3.956e+03 0.005 0.996031
## bsmt_condGd -9.703e-01 8.049e-01 -1.205 0.228048
## bsmt_condPo 2.265e+01 3.956e+03 0.006 0.995433
## bsmt_condTA -3.868e-01 6.282e-01 -0.616 0.538124
## bsmt_condnone NA NA NA NA
## bsmt_exposureGd -1.057e+00 4.345e-01 -2.432 0.015011 *
## bsmt_exposureMn -2.288e-01 4.081e-01 -0.561 0.575013
## bsmt_exposureNo 2.101e-01 2.947e-01 0.713 0.475865
## bsmt_exposurenone -1.752e+01 3.956e+03 -0.004 0.996467
## bsmt_fin_type1BLQ -3.960e-01 3.818e-01 -1.037 0.299610
## bsmt_fin_type1GLQ 9.821e-02 3.504e-01 0.280 0.779277
## bsmt_fin_type1LwQ -9.037e-02 5.341e-01 -0.169 0.865654
## bsmt_fin_type1Rec 1.707e-01 4.075e-01 0.419 0.675236
## bsmt_fin_type1Unf 1.702e+00 4.291e-01 3.967 7.28e-05 ***
## bsmt_fin_type1none NA NA NA NA
## bsmt_fin_sf1 3.108e+01 4.880e+00 6.368 1.91e-10 ***
## bsmt_fin_type2BLQ -2.654e+00 1.115e+00 -2.380 0.017320 *
## bsmt_fin_type2GLQ 1.227e+00 1.387e+00 0.884 0.376441
## bsmt_fin_type2LwQ -2.066e+00 1.069e+00 -1.931 0.053429 .
## bsmt_fin_type2Rec -1.823e+00 1.050e+00 -1.735 0.082653 .
## bsmt_fin_type2Unf -1.877e+00 1.126e+00 -1.667 0.095534 .
## bsmt_fin_type2none NA NA NA NA
## bsmt_fin_sf2 5.648e+00 1.927e+00 2.931 0.003380 **
## bsmt_unf_sf 6.918e+00 1.858e+00 3.723 0.000197 ***
## total_bsmt_sf NA NA NA NA
## heatingGasA -1.254e+01 3.956e+03 -0.003 0.997470
## heatingGasW -1.436e+01 3.956e+03 -0.004 0.997103
## heatingGrav -1.209e+01 3.956e+03 -0.003 0.997563
## heatingOthW -2.547e+01 4.803e+03 -0.005 0.995769
## heatingWall -1.191e+01 3.956e+03 -0.003 0.997598
## heating_qcFa 2.795e-01 6.874e-01 0.407 0.684318
## heating_qcGd -6.900e-01 2.864e-01 -2.409 0.015979 *
## heating_qcPo -1.458e+01 3.956e+03 -0.004 0.997059
## heating_qcTA 2.524e-03 2.827e-01 0.009 0.992875
## central_airY -2.904e-01 6.529e-01 -0.445 0.656425
## electricalFuseF -8.145e-01 9.155e-01 -0.890 0.373686
## electricalFuseP -2.378e+01 2.477e+03 -0.010 0.992340
## electricalSBrkr -2.521e-01 4.465e-01 -0.565 0.572308
## electricalnone 1.600e+01 3.956e+03 0.004 0.996773
## x1st_flr_sf -5.360e+00 3.748e+00 -1.430 0.152702
## x2nd_flr_sf 6.689e-01 1.627e+00 0.411 0.681047
## low_qual_fin_sf -2.248e+00 1.628e+00 -1.381 0.167259
## gr_liv_area NA NA NA NA
## bsmt_full_bath -2.936e+00 8.218e-01 -3.573 0.000353 ***
## bsmt_half_bath -1.642e+00 8.679e-01 -1.892 0.058428 .
## full_bath -2.507e+00 9.462e-01 -2.649 0.008071 **
## half_bath -1.116e-01 5.714e-01 -0.195 0.845161
## bedroom_abv_gr -1.915e+00 1.600e+00 -1.197 0.231310
## kitchen_abv_gr -1.435e+00 3.517e+00 -0.408 0.683275
## kitchen_qualFa -2.508e+00 9.562e-01 -2.623 0.008718 **
## kitchen_qualGd -1.025e+00 4.742e-01 -2.161 0.030685 *
## kitchen_qualTA -1.422e+00 5.317e-01 -2.675 0.007475 **
## tot_rms_abv_grd 9.037e-01 1.593e+00 0.567 0.570591
## functionalMaj2 -1.763e+01 1.629e+03 -0.011 0.991364
## functionalMin1 1.549e-01 1.274e+00 0.122 0.903216
## functionalMin2 1.048e+00 1.314e+00 0.797 0.425478
## functionalMod -7.276e-01 1.476e+00 -0.493 0.622005
## functionalSev -1.558e+01 3.956e+03 -0.004 0.996858
## functionalTyp 1.181e-01 1.106e+00 0.107 0.914961
## fireplaces -1.353e+00 1.118e+00 -1.211 0.226080
## fireplace_quFa 1.280e-01 9.171e-01 0.140 0.888981
## fireplace_quGd -1.235e-01 6.787e-01 -0.182 0.855593
## fireplace_quPo 5.784e-01 9.969e-01 0.580 0.561765
## fireplace_quTA -4.158e-01 7.180e-01 -0.579 0.562538
## fireplace_qunone -4.572e-02 8.390e-01 -0.055 0.956536
## garage_typeAttchd 1.681e+00 1.744e+00 0.964 0.335223
## garage_typeBasment 1.345e+00 1.977e+00 0.681 0.496126
## garage_typeBuiltIn 1.469e+00 1.788e+00 0.822 0.411041
## garage_typeCarPort 8.613e-01 2.311e+00 0.373 0.709342
## garage_typeDetchd 1.712e+00 1.736e+00 0.986 0.324233
## garage_typenone 9.011e+00 2.700e+03 0.003 0.997337
## garage_yr_blt 2.409e+01 1.950e+01 1.235 0.216823
## garage_finishRFn 2.580e-01 2.699e-01 0.956 0.339213
## garage_finishUnf 1.517e-01 3.383e-01 0.448 0.653846
## garage_finishnone NA NA NA NA
## garage_cars -3.817e+00 1.233e+00 -3.096 0.001961 **
## garage_area 3.412e+00 1.576e+00 2.165 0.030399 *
## garage_qualFa -2.550e+01 4.541e+03 -0.006 0.995520
## garage_qualGd -2.261e+01 4.541e+03 -0.005 0.996028
## garage_qualPo -1.489e+01 4.925e+03 -0.003 0.997589
## garage_qualTA -2.378e+01 4.541e+03 -0.005 0.995821
## garage_qualnone NA NA NA NA
## garage_condFa 5.994e+00 5.283e+03 0.001 0.999095
## garage_condGd -1.292e+01 5.396e+03 -0.002 0.998089
## garage_condPo 8.946e+00 5.283e+03 0.002 0.998649
## garage_condTA 6.596e+00 5.283e+03 0.001 0.999004
## garage_condnone NA NA NA NA
## paved_driveP -2.476e-01 8.771e-01 -0.282 0.777760
## paved_driveY -2.333e-01 5.221e-01 -0.447 0.655037
## wood_deck_sf -1.002e-01 7.091e-01 -0.141 0.887613
## open_porch_sf 1.556e+00 8.841e-01 1.760 0.078452 .
## enclosed_porch 1.718e+00 1.065e+00 1.614 0.106462
## x3ssn_porch 2.481e+00 1.484e+00 1.672 0.094607 .
## screen_porch 2.758e+00 9.139e-01 3.018 0.002547 **
## pool_area 5.202e+02 9.602e+04 0.005 0.995677
## pool_qcFa -6.444e+01 1.567e+04 -0.004 0.996718
## pool_qcGd -1.069e+02 2.707e+04 -0.004 0.996849
## pool_qcnone 3.744e+02 6.947e+04 0.005 0.995701
## fenceGdWo 1.729e+00 7.408e-01 2.334 0.019590 *
## fenceMnPrv 9.479e-01 6.579e-01 1.441 0.149609
## fenceMnWw -2.829e-01 1.072e+00 -0.264 0.791854
## fencenone 9.841e-01 6.075e-01 1.620 0.105268
## misc_featureOthr 3.048e+01 4.359e+03 0.007 0.994421
## misc_featureShed 5.611e+01 3.956e+03 0.014 0.988684
## misc_featureTenC 1.381e+02 1.813e+04 0.008 0.993924
## misc_featurenone 5.768e+01 3.956e+03 0.015 0.988368
## misc_val 4.284e+01 1.844e+01 2.324 0.020142 *
## mo_sold -8.836e-02 3.613e-01 -0.245 0.806805
## yr_sold -1.877e-01 2.801e-01 -0.670 0.502690
## sale_typeCon 1.966e+01 3.956e+03 0.005 0.996034
## sale_typeConLD -1.824e-01 1.468e+00 -0.124 0.901088
## sale_typeConLI -9.632e-02 1.476e+00 -0.065 0.947960
## sale_typeConLw -5.276e-01 1.585e+00 -0.333 0.739176
## sale_typeCWD 3.597e+00 1.876e+00 1.917 0.055244 .
## sale_typeNew -2.999e+00 1.956e+00 -1.533 0.125252
## sale_typeOth 2.478e+00 1.593e+00 1.556 0.119737
## sale_typeWD 7.296e-02 5.553e-01 0.131 0.895463
## sale_conditionAdjLand 2.148e+01 1.995e+03 0.011 0.991412
## sale_conditionAlloca 9.700e-01 1.407e+00 0.690 0.490465
## sale_conditionFamily 1.874e+00 7.691e-01 2.436 0.014839 *
## sale_conditionNormal 1.490e+00 3.942e-01 3.779 0.000157 ***
## sale_conditionPartial 3.555e+00 1.879e+00 1.892 0.058538 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1614.99 on 1167 degrees of freedom
## Residual deviance: 999.38 on 920 degrees of freedom
## AIC: 1495.4
##
## Number of Fisher Scoring iterations: 16
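One way to pull the terms flagged as significant out of this summary programmatically, rather than reading the table by eye, is a short sketch like the following (note that these are coefficient names, so individual factor levels rather than whole variables):

# Extract coefficient names with p-values below 0.05
coef_tab = summary(glm_mod_all)$coefficients
rownames(coef_tab)[coef_tab[, "Pr(>|z|)"] < 0.05]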
Now let’s narrow down the number of variables used to include only those considered above to be statistically significant.
# Same modeling but only with significant variables
glm_mod_sig = glm(
undervalued ~
ms_sub_class+lot_area+land_contour+lot_config+
bldg_type+overall_qual+mas_vnr_area+bsmt_fin_type1+
bsmt_fin_sf1+bsmt_fin_sf2+bsmt_unf_sf+bsmt_full_bath+
kitchen_qual,
data = train,
family = "binomial"
)
summary(glm_mod_sig)
##
## Call:
## glm(formula = undervalued ~ ms_sub_class + lot_area + land_contour +
## lot_config + bldg_type + overall_qual + mas_vnr_area + bsmt_fin_type1 +
## bsmt_fin_sf1 + bsmt_fin_sf2 + bsmt_unf_sf + bsmt_full_bath +
## kitchen_qual, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3715 -1.0735 -0.7088 1.1288 1.8968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.24396 0.67286 3.335 0.000853 ***
## ms_sub_class 0.84666 0.59610 1.420 0.155510
## lot_area -1.08436 1.90857 -0.568 0.569930
## land_contourHLS 0.78456 0.44776 1.752 0.079742 .
## land_contourLow 0.27721 0.51219 0.541 0.588350
## land_contourLvl 0.26330 0.31670 0.831 0.405753
## lot_configCulDSac -0.47828 0.29690 -1.611 0.107204
## lot_configFR2 -0.73615 0.38435 -1.915 0.055455 .
## lot_configFR3 -13.17837 374.94743 -0.035 0.971962
## lot_configInside -0.14764 0.16256 -0.908 0.363764
## bldg_type2fmCon -0.21554 0.68297 -0.316 0.752316
## bldg_typeDuplex -1.45946 0.44556 -3.276 0.001054 **
## bldg_typeTwnhs -1.17756 0.59442 -1.981 0.047590 *
## bldg_typeTwnhsE -0.27117 0.38680 -0.701 0.483276
## overall_qual -4.40082 0.70662 -6.228 4.72e-10 ***
## mas_vnr_area -0.95832 0.65646 -1.460 0.144334
## bsmt_fin_type1BLQ -0.05452 0.25160 -0.217 0.828447
## bsmt_fin_type1GLQ -0.08703 0.22180 -0.392 0.694763
## bsmt_fin_type1LwQ -0.21270 0.32220 -0.660 0.509148
## bsmt_fin_type1Rec 0.08247 0.25726 0.321 0.748541
## bsmt_fin_type1Unf 0.31652 0.25462 1.243 0.213826
## bsmt_fin_type1none 0.66984 0.54496 1.229 0.219018
## bsmt_fin_sf1 9.01509 1.85870 4.850 1.23e-06 ***
## bsmt_fin_sf2 1.76016 0.68550 2.568 0.010238 *
## bsmt_unf_sf 1.49910 0.64875 2.311 0.020846 *
## bsmt_full_bath -1.19286 0.52282 -2.282 0.022512 *
## kitchen_qualFa -2.03516 0.55195 -3.687 0.000227 ***
## kitchen_qualGd -0.89918 0.28263 -3.181 0.001466 **
## kitchen_qualTA -1.15107 0.32581 -3.533 0.000411 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1615.0 on 1167 degrees of freedom
## Residual deviance: 1514.1 on 1139 degrees of freedom
## AIC: 1572.1
##
## Number of Fisher Scoring iterations: 12
Now let’s adjust the GLM model once more to remove some of the extra noise.
glm_mod_fin = glm(
undervalued ~
overall_qual + bsmt_fin_sf1 + kitchen_qual + bldg_type,
family = "binomial",
data = train)
cbind(
coef(glm_mod_fin),
odds_ratio=exp(coef(glm_mod_fin)),
exp(confint(glm_mod_fin))
)
## odds_ratio 2.5 % 97.5 %
## (Intercept) 2.83164697 16.97339248 6.242159919 47.44852835
## overall_qual -3.56801540 0.02821179 0.009176981 0.08452687
## bsmt_fin_sf1 2.47021650 11.82500672 2.536111574 56.96379760
## kitchen_qualFa -2.09642922 0.12289448 0.041739934 0.34609383
## kitchen_qualGd -1.02566034 0.35855962 0.208902716 0.60830650
## kitchen_qualTA -1.23559638 0.29066137 0.156725395 0.53070272
## bldg_type2fmCon 0.44222182 1.55616088 0.653857113 3.94345726
## bldg_typeDuplex -1.04574754 0.35142901 0.172328382 0.67928675
## bldg_typeTwnhs -0.89237553 0.40968139 0.164711039 0.92662089
## bldg_typeTwnhsE 0.06949113 1.07196256 0.696911460 1.64488364
To determine the cutoff value, we can create multiple classification tables, look at the differences between accuracy, sensitivity, and specificity, and then choose a value that minimizes the differences between these measures. We want to minimize the error the final version of this model makes when applied to new data, lest we sacrifice the model’s ability to generalize for higher performance in training (i.e. overfitting and the bias-variance tradeoff).
# Cutoff Value set to 0.3
train$predprob <- round(fitted(glm_mod_fin),2)
class.table = table(Actual = train$undervalued,
Predicted = train$predprob>0.3)
class.table
## Predicted
## Actual FALSE TRUE
## FALSE 82 537
## TRUE 36 513
# Cutoff Value set to 0.5
train$predprob <- round(fitted(glm_mod_fin),2)
class.table = table(Actual = train$undervalued,
Predicted = train$predprob>0.5)
class.table
## Predicted
## Actual FALSE TRUE
## FALSE 437 182
## TRUE 273 276
# Cutoff Value set to 0.7
train$predprob <- round(fitted(glm_mod_fin),2)
class.table = table(Actual = train$undervalued,
Predicted = train$predprob>0.7)
class.table
## Predicted
## Actual FALSE TRUE
## FALSE 610 9
## TRUE 529 20
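Rather than repeating the same chunk for every cutoff, we could also sweep a grid of cutoffs in one pass; a minimal sketch, assuming glm_mod_fin and train as defined above:

# Compute accuracy, sensitivity, and specificity over a grid of cutoffs
cutoffs = seq(0.1, 0.9, by = 0.1)
probs = fitted(glm_mod_fin)
actual = train$undervalued == "TRUE"
cutoff_metrics = sapply(cutoffs, function(k) {
  pred = probs > k
  c(accuracy = mean(pred == actual),
    sensitivity = sum(pred & actual) / sum(actual),
    specificity = sum(!pred & !actual) / sum(!actual))
})
colnames(cutoff_metrics) = cutoffs
round(cutoff_metrics, 3)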
We can see a significant variation in model predictions by adjusting these values. Let’s now take a look at some metrics to judge the strength of our model.
Accuracy can be calculated by taking the sum of the correct predictions divided by the total number of observations (i.e. sum(top-left+bottom-right)/n.total.obs)
m1_0.3 = (82+513)/1168
m2_0.5 = (437+276)/1168
m3_0.7 = (610+20)/1168
metric = "accuracy"
accuracy_metrics = cbind.data.frame(metric, m1_0.3, m2_0.5, m3_0.7)
Sensitivity can be calculated by dividing the number of correct positive assessments (observations correctly predicted to be undervalued) by the total number of observations that actually are undervalued (i.e. bottom-right/sum(across-bottom)).
513/(36+513) -> m1_0.3
276/(273+276) -> m2_0.5
20/(529+20) -> m3_0.7
metric = "senstivity"
accuracy_metrics2 = cbind.data.frame(metric, m1_0.3, m2_0.5, m3_0.7)
Specificity can be calculated by dividing the number of observations correctly predicted as not undervalued by the total number of observations that actually are not undervalued (i.e. top-left/sum(across-top)).
82/(537+82) -> m1_0.3
437/619 -> m2_0.5
610/619 -> m3_0.7
metric = "specificity"
accuracy_metrics3 = cbind.data.frame(metric, m1_0.3, m2_0.5, m3_0.7)
# Overview of measurement metrics
rbind(accuracy_metrics, accuracy_metrics2, accuracy_metrics3)
## metric m1_0.3 m2_0.5 m3_0.7
## 1 accuracy 0.5094178 0.6104452 0.53938356
## 2 sensitivity 0.9344262 0.5027322 0.03642987
## 3 specificity 0.1324717 0.7059774 0.98546042
It is important to consider the cutoff value used in the modeling because we face a tradeoff between bias and variance when applying the model to the real world or to a testing set. We can see the resulting variation in the metric overview table above and note a few things:
- The cutoff value of 0.5 has the greatest accuracy of our selected models
- Sensitivity increases significantly for smaller cutoff values
- Specificity increases with higher cutoff values
To save some scripting time and reduce the risk of typos, we can also save this as a template for future calculations, where:

- TN = number of true negative assessments
- TP = number of true positive assessments
- FP = number of false positives
- FN = number of false negatives
- (TP + FP) = total observations with positive assessments
- (FN + TN) = total observations with negative assessments
- N = (TP + TN + FP + FN) = total number of observations
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
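Wrapping that template in a small helper function saves retyping it each time; a sketch (the conf_metrics name is just illustrative), checked against the 0.5-cutoff table above:

# Helper that turns confusion-matrix counts into the three metrics
conf_metrics = function(TP, TN, FP, FN) {
  c(Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    Accuracy = (TP + TN) / (TP + TN + FP + FN))
}
# Example: the 0.5-cutoff table above (TN = 437, TP = 276, FP = 182, FN = 273)
conf_metrics(TP = 276, TN = 437, FP = 182, FN = 273)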
We could also create a model similar to the code shown below if we wanted a more methodical way of determining which variables to include in our regression models.
# GLM modeling using step-wise selection of variables
glm.step <- MASS::stepAIC(glm_mod_all, trace = FALSE,
data = train, steps = 1000) # 1000 by default
glm.step$anova
From the results of the preceding chunks, however, we can already see that regression might not be the best approach for making predictions given the tradeoffs involved, so we can opt to save the computational time and effort for other models we expect to perform better.

Even though we know that regression might not be the best type of model in this context, we can still gain some insight from our calculations. For example, when determining which variables to include, several of the factor variables had individual levels that were statistically significant even though the variable as a whole was not. Perhaps this indicates that specific features carry more weight in home valuation than others. If so, then perhaps decision trees can model our data with greater accuracy.
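If we wanted to test whether a factor is significant as a whole (rather than level by level), one option is a likelihood-ratio test on each term; a sketch using drop1() on the reduced model fitted above (it refits the model once per term, so it can take a moment):

# Likelihood-ratio test for dropping each term in its entirety
drop1(glm_mod_sig, test = "LRT")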
Decision Trees and Random Forest
Let’s see what happens if we build a model algorithmically, this time using decision trees to predict our outcome.
# Refreshing our training and testing sets
set.seed(123)
train = clean_house %>% mutate(undervalued = as.factor(undervalued)) %>%
sample_frac(0.8)
test = anti_join(clean_house, train, by = 'id')
# Dropping ID variable
train %<>% select(-c("id"))
test %<>% select(-c("id"))
pacman::p_load(tidymodels,
rpart.plot)
# Chunk takes ~5 minutes to execute
default_cv = train %>% vfold_cv(v =5)
default_tree = decision_tree(mode ="classification",
cost_complexity = tune(),
tree_depth = tune()) %>%
set_engine("rpart")
# Defining recipe
default_recipe = recipe(undervalued ~., data = train)
# Defining workflow
default_flow = workflow() %>%
add_model(default_tree) %>%
add_recipe(default_recipe)
# Tuning
default_cv_fit = default_flow %>%
tune_grid(
default_cv,
grid = expand_grid(
cost_complexity = seq(0, 0.15, by = 0.01),
tree_depth = c(1,2,5,10),
),
metrics = metric_set(accuracy, roc_auc))
# Fitting and selecting best model
best_flow = default_flow %>%
finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
fit(data = train)
best_tree = best_flow %>% extract_fit_parsnip()
best_tree$fit %>% rpart.plot::rpart.plot(roundint=F)
# Summary statistics and plotting
printcp(best_tree$fit)
##
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, cp = ~0, maxdepth = ~2)
##
## Variables actually used in tree construction:
## [1] gr_liv_area neighborhood overall_qual
##
## Root node error: 549/1168 = 0.47003
##
## n= 1168
##
## CP nsplit rel error xerror xstd
## 1 0.165756 0 1.00000 1.00000 0.031070
## 2 0.078324 1 0.83424 0.90346 0.030770
## 3 0.047359 2 0.75592 0.86885 0.030599
## 4 0.000000 3 0.70856 0.82332 0.030320
best_tree$fit$variable.importance
## neighborhood overall_qual gr_liv_area year_built bsmt_qual
## 38.579931 25.102234 24.271346 9.810069 9.636950
## garage_cars tot_rms_abv_grd ms_zoning x2nd_flr_sf x1st_flr_sf
## 9.579243 6.247153 6.040665 5.726557 5.205961
## bedroom_abv_gr total_bsmt_sf exterior2nd exterior1st land_slope
## 2.863278 2.082384 1.895110 1.776666 1.421333
## alley
## 1.184444
# Predicting values
as.data.frame(predict(best_tree, new_data=train)) -> df1
Looking at our first decision tree, there are several things worth considering:

- The neighborhood variable was ranked as the most important variable.
  - This might be useful if we apply the model to future housing data from the same set of neighborhoods, but it wouldn’t perform well on data from other areas where the distinction between neighborhoods is less explanatory.
- The important variables differ from those suggested and used in our previous model.
For the next trees, we will omit the neighborhood variable in an effort to produce a more generalizable model. While we could likely develop a way to preserve the information this variable contributes (for example, by creating additional variables with neighborhood-level metrics such as crime rate, median income, average education, or demographics), it could still be insightful to build a model that cannot look at the neighborhood of an observation.
set.seed(123)
train_hold = clean_house %>%
select(-c("neighborhood")) %>%
mutate(undervalued = as.factor(undervalued)) %>%
sample_frac(0.8)
test_hold = anti_join(clean_house, train_hold, by = 'id')
train_hold %<>% select(-c("id"))
test_hold %<>% select(-c("id"))
And now to plant another tree
default_recipe = recipe(undervalued ~., data = train_hold)
default_flow = workflow() %>%
add_model(default_tree) %>%
add_recipe(default_recipe)
default_cv_fit = default_flow %>%
tune_grid(
default_cv,
grid = expand_grid(
cost_complexity = seq(0, 0.15, by = 0.01),
tree_depth = c(1,2,5,10),),
metrics = metric_set(accuracy, roc_auc))
best_flow = default_flow %>%
finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
fit(data = train_hold)
best_tree2 = best_flow %>% extract_fit_parsnip()
best_tree2$fit %>% rpart.plot::rpart.plot(roundint=F)
# Summary and plotting
printcp(best_tree2$fit)
##
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, cp = ~0.02, maxdepth = ~5)
##
## Variables actually used in tree construction:
## [1] exterior1st gr_liv_area overall_qual
##
## Root node error: 549/1168 = 0.47003
##
## n= 1168
##
## CP nsplit rel error xerror xstd
## 1 0.165756 0 1.00000 1.00000 0.031070
## 2 0.049180 1 0.83424 0.83424 0.030393
## 3 0.047359 2 0.78506 0.85428 0.030516
## 4 0.029144 3 0.73770 0.81421 0.030257
## 5 0.020000 4 0.70856 0.79417 0.030110
best_tree2$fit$variable.importance
## gr_liv_area overall_qual x2nd_flr_sf tot_rms_abv_grd full_bath
## 38.1941946 25.5376667 12.4301505 11.4037635 11.2919544
## exterior1st year_built bsmt_qual garage_cars exterior2nd
## 10.0149479 9.8100685 9.6369497 9.5792434 6.9669203
## x1st_flr_sf total_bsmt_sf bedroom_abv_gr condition2 functional
## 6.9248310 3.6293675 2.8632784 0.2177163 0.2177163
## sale_type
## 0.2177163
# Predicting values
as.data.frame(predict(best_tree2, new_data=train_hold)) -> df2
By removing the neighborhood variable, we can note several differences in the resulting model. The variable gr_liv_area (above-ground living area in square feet) replaced neighborhood at the top of the importance ranking, and exterior1st took its place among the splitting variables. We can also see that removing this one column changed the listed variables of importance. For example, the second tree’s ranking includes functional and sale_type, which do not appear in the first tree’s, while variables like ms_zoning and alley appear in the first tree’s ranking but not the second’s.
From this seemingly small change, we can gather some additional insight into the story hiding in our data. Clearly the neighborhood variable must hold some importance, since its omission changes the resulting tree, but I believe there is additional information that neighborhood could carry.
While the name of a neighborhood is nothing more than an arrangement of letters and spaces, geographic boundaries can imply several characteristics that apply to a home sold within them. Even if two homes are identical in floor plan and characteristics apart from their neighborhood, there could still be differences that affect sale prices:
- Crime rates
  - A quick online search can turn up crime reports for a given geographic area, knowledge of which could affect the price one is willing to pay for a home located in a crime-heavy area
- Public education jurisdiction
  - School assignment creates distinct differences determined by specific boundaries; even the side of the street a home is on could determine whether the household’s children attend an A-rated school versus a D-rated school
- Community
  - Not as black-and-white as education or crime, but could potentially provide some explanatory effect
  - Examples: willingness to pay more than one otherwise would in order to live close to family or friends, cultural significance (e.g. a Spanish-speaking community), class status/“prestige”, etc.
  - The examples above could lead consumers to pay a price either higher or lower than they would otherwise accept
- Presence of a Homeowners Association
  - Because who the hell wants to be told what color they are allowed to paint their house, or whether they can have a garden visible from the street, or how long their grass is “allowed” to grow? This is America! (unless we are looking at data from another country)
Before moving on to another model, let’s first consider the predictions from the previous trees.
df1 %<>% rename(pred_tree = names(.)[1])
# Creating confusion matrix
table(Actual = train$undervalued,
Predicted = df1$pred_tree)
## Predicted
## Actual FALSE TRUE
## FALSE 390 229
## TRUE 160 389
TN = 390
TP = 389
FP = 229
FN = 160
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_tree1 = rbind(Sensitivity,Specificity,Accuracy)
metrics_tree1
## [,1]
## Sensitivity 0.7085610
## Specificity 0.6300485
## Accuracy 0.6669521
df2 %<>% rename(pred_tree = names(.)[1])
table(Actual = train_hold$undervalued,
Predicted = df2$pred_tree)
## Predicted
## Actual FALSE TRUE
## FALSE 446 173
## TRUE 216 333
TN = 446
TP = 333
FP = 173
FN = 216
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_tree2 = rbind(Sensitivity,Specificity,Accuracy)
metrics_tree2
## [,1]
## Sensitivity 0.6065574
## Specificity 0.7205170
## Accuracy 0.6669521
It looks like we were able to produce more stable models with decent accuracy (on the training set at least; we will have to see how they perform on testing). If single decision trees produced these results, let’s see how an ensemble of trees (i.e. a random forest) performs.
Random Forest
Checking unique(test$neighborhood) against the training set showed that the same neighborhoods appear in both splits, so I figured it would be good practice to keep the neighborhood variable for the random forest modeling. If the distribution of neighborhoods were uneven between the two sets, we would need to consider a different resampling method or remove the variable.
# Refreshing our training and testing sets
set.seed(123)
train = clean_house %>% mutate(undervalued = as.factor(undervalued)) %>%
sample_frac(0.8)
test = anti_join(clean_house, train, by = 'id')
train %<>% select(-c("id"))
test %<>% select(-c("id"))
library(randomForest)
mod_rf = randomForest(formula = undervalued ~ .,
data = train,
importance = T,
ntree = 100)
importance(mod_rf)
## FALSE TRUE MeanDecreaseAccuracy MeanDecreaseGini
## ms_sub_class 2.37110510 -0.59289756 2.230481757 6.25671639
## ms_zoning 2.96050446 -0.39285711 2.660849202 2.98502094
## lot_frontage 1.31352137 -0.75593306 0.525960886 14.47293022
## lot_area 0.14130204 2.97691136 2.481027539 21.74603953
## street 0.00000000 1.00503782 1.005037815 0.14684310
## alley 2.08679934 -0.93761809 1.052651776 1.55984791
## lot_shape 0.53106960 -0.86906473 -0.280213261 4.27020668
## land_contour 1.39066020 0.96429231 1.761896764 3.48457769
## utilities 0.00000000 0.00000000 0.000000000 0.01714286
## lot_config 0.84471029 1.31603075 1.538559412 5.68282684
## land_slope 0.47421537 1.33882923 1.771745823 1.46127417
## neighborhood 9.15263439 -1.41384978 7.012149979 49.00285106
## condition1 1.52000381 0.91315354 1.974712262 6.14739613
## condition2 0.00000000 -1.00503782 -1.005037815 0.19665702
## bldg_type 1.12996782 0.12753681 0.942682077 3.37526159
## house_style 1.38336800 0.66013400 1.800959183 5.70180504
## overall_qual 1.68452586 6.57037197 6.717004190 16.11463389
## overall_cond 2.11388375 -0.11103405 1.758216763 6.25304209
## year_built 3.16800540 0.88441877 3.404948037 14.99962580
## year_remod_add 2.88187116 1.79992658 4.174840176 14.38416237
## roof_style 1.49534911 -1.44243954 -0.011418427 3.19500972
## roof_matl -0.01075588 0.33854578 0.254120308 0.59345575
## exterior1st 4.30936617 -1.75965058 2.426072343 17.01934930
## exterior2nd 4.91475920 -1.17302712 2.957139085 18.57794048
## mas_vnr_type 1.29290657 -0.07143388 0.953393308 3.65983121
## mas_vnr_area 2.40413643 -1.79750713 0.688785035 11.17923784
## exter_qual 3.28861788 2.58876793 4.100838893 4.85561014
## exter_cond 3.03014026 -0.62215309 2.054099581 2.12103282
## foundation 1.80653485 2.53050396 3.088919431 4.08226693
## bsmt_qual 0.21747787 1.18317604 1.048360612 3.49740634
## bsmt_cond 1.34455606 0.58695672 1.156748477 1.93102645
## bsmt_exposure 0.39992031 3.24006287 2.459550251 8.83336503
## bsmt_fin_type1 -0.75334871 1.62290860 0.644859625 11.48029368
## bsmt_fin_sf1 0.25873156 3.55411626 2.947305142 16.96627929
## bsmt_fin_type2 1.96117137 -0.71964105 1.023802693 5.58847525
## bsmt_fin_sf2 2.45084614 -0.68022909 1.576533782 3.83102914
## bsmt_unf_sf 2.45604685 1.09652020 3.087944915 18.86575270
## total_bsmt_sf 2.91840785 1.87437567 3.844247593 19.86606691
## heating -0.85114868 -0.46607137 -1.289622553 0.41671989
## heating_qc -0.89839832 -0.04319601 -0.783790997 6.03820039
## central_air 0.31879951 -0.37323215 -0.001255129 0.76931049
## electrical -0.95169657 0.68866317 -0.255548189 1.40753503
## x1st_flr_sf 3.07262823 2.52068801 4.954511150 21.21923469
## x2nd_flr_sf 2.00559147 1.62843691 3.094340721 10.54007485
## low_qual_fin_sf -0.73691411 -0.47976615 -0.672646514 0.60291609
## gr_liv_area 1.05040323 5.21595445 5.685237227 26.44762641
## bsmt_full_bath 0.27016929 0.42142287 0.459635305 3.18937797
## bsmt_half_bath -1.18006922 1.85579676 0.673713255 1.30181844
## full_bath 3.19254213 0.47346541 2.732841615 3.06827333
## half_bath -0.21995308 1.99493077 1.456179460 3.04303319
## bedroom_abv_gr 1.55899450 -0.23970178 0.970044209 5.69155134
## kitchen_abv_gr 1.16430043 1.15335008 1.754335069 0.95336992
## kitchen_qual 0.69174921 -0.16207846 0.567714859 4.03953453
## tot_rms_abv_grd 1.41720781 0.97973296 1.787462605 11.03212578
## functional 0.52206798 0.22259156 0.671310737 2.68818438
## fireplaces 2.51830238 -1.52092312 0.985649842 3.91226395
## fireplace_qu 0.69102897 -0.91224663 -0.070888822 7.15109410
## garage_type -1.57165483 0.25323769 -1.101717772 4.37105256
## garage_yr_blt 1.69176319 -0.03213763 1.368718929 14.56753377
## garage_finish 0.64110738 -1.20190809 -0.512738270 5.79121789
## garage_cars 1.43213918 3.53908494 4.114635413 5.03218419
## garage_area 1.68224003 3.72926157 4.470294774 22.20102103
## garage_qual 1.40040682 1.13201326 1.843622225 2.69165645
## garage_cond 1.27111869 1.59375679 1.956286436 1.68613947
## paved_drive 1.01075020 -2.26737841 -1.005071285 1.35452269
## wood_deck_sf 1.57367433 0.33060731 1.733572642 11.20831386
## open_porch_sf 0.70978466 1.53498262 1.850285953 14.98678612
## enclosed_porch -2.33560242 0.35132891 -1.775893268 3.88019036
## x3ssn_porch 0.05564202 -1.01153916 -0.599054567 1.26608555
## screen_porch 0.55810559 0.05994389 0.480398720 4.11634414
## pool_area 0.00000000 -1.00503782 -1.005037815 0.09971477
## pool_qc 0.00000000 0.03414999 0.031273130 0.10317498
## fence 2.17475352 0.02116753 1.877720232 4.87639051
## misc_feature 1.36030512 -1.48585604 -0.136666022 1.17537986
## misc_val 1.90386644 -0.17248438 1.208188046 1.93359543
## mo_sold 1.49376084 0.49456602 1.248362677 13.38199354
## yr_sold -2.26464306 -0.93846176 -2.191942636 8.68738217
## sale_type -0.91609319 -0.18198627 -0.890808478 2.18995595
## sale_condition 0.89141467 0.57870719 1.058412573 5.50860524
train$pred_rf = predict(mod_rf, type="response", newdata = train)
# Creating confusion matrix
table(Actual = train$undervalued,
Predicted = train$pred_rf)
## Predicted
## Actual FALSE TRUE
## FALSE 619 0
## TRUE 0 549
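The perfect fit above comes from predicting back onto the exact rows the forest was trained on; the out-of-bag predictions usually give a more honest view of training performance. A minimal sketch:

# predict() with no newdata returns the out-of-bag (OOB) class predictions
table(Actual = train$undervalued, Predicted = predict(mod_rf))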
Looks like the neighborhood ended up being a significant variable according to the random forest model. Good thing we kept it!
Testing
First with the most-recently constructed random forest model
test$pred_rf = predict(mod_rf, type="response", newdata = test)
table(Actual = test$undervalued,
Predicted = test$pred_rf)
## Predicted
## Actual FALSE TRUE
## FALSE 93 42
## TRUE 68 89
TN = 93
TP = 89
FP = 42
FN = 68
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_rf = rbind(Sensitivity,Specificity,Accuracy)
metrics_rf
## [,1]
## Sensitivity 0.5668790
## Specificity 0.6888889
## Accuracy 0.6232877
Not quite the same performance as with training, but we can compare these results to the test predictions of the other models as well.
GLM model with cutoff value of 0.5
# refreshing sets
set.seed(123)
train = clean_house %>% mutate(undervalued = as.factor(undervalued)) %>%
sample_frac(0.8)
test = anti_join(clean_house, train, by = 'id')
train %<>% select(-c("id"))
test %<>% select(-c("id"))
glm_mod_fin = glm(
undervalued ~
overall_qual + bsmt_fin_sf1 + kitchen_qual + bldg_type,
family = "binomial",
data = test)
# Cutoff Value set to 0.5
test$predprob <- round(fitted(glm_mod_fin),2)
table(Actual = test$undervalued,
Predicted = test$predprob>0.5)
## Predicted
## Actual FALSE TRUE
## FALSE 60 75
## TRUE 35 122
TN = 60
TP = 122
FP = 75
FN = 35
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_glm = rbind(Sensitivity,Specificity,Accuracy)
metrics_glm
## [,1]
## Sensitivity 0.7770701
## Specificity 0.4444444
## Accuracy 0.6232877
Single tree (with Neighborhood)
# Predicting values
as.data.frame(predict(best_tree, new_data=test)) -> df1
df1 %<>% rename(pred_tree = names(.)[1])
# Creating confusion matrix
table(Actual = test$undervalued,
Predicted = df1$pred_tree)
## Predicted
## Actual FALSE TRUE
## FALSE 84 51
## TRUE 63 94
TN = 84
TP = 94
FP = 51
FN = 63
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_tree1 = rbind(Sensitivity,Specificity,Accuracy)
metrics_tree1
## [,1]
## Sensitivity 0.5987261
## Specificity 0.6222222
## Accuracy 0.6095890
Single tree (without Neighborhood)
# Predicting values
as.data.frame(predict(best_tree2, new_data=test_hold)) -> df2
df2 %<>% rename(pred_tree = names(.)[1])
# Creating confusion matrix
table(Actual = test_hold$undervalued,
Predicted = df2$pred_tree)
## Predicted
## Actual FALSE TRUE
## FALSE 97 38
## TRUE 75 82
TN = 97
TP = 82
FP = 38
FN = 75
Sensitivity = TP/(TP + FN)
# (Number of true positive assessment)/(Number of
# all positive assessment)
Specificity = TN/(TN + FP)
# (Number of true negative assessment)/(Number of
# all negative assessment)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
# (Number of correct assessments)/Number of
# all assessments)
metrics_tree2 = rbind(Sensitivity,Specificity,Accuracy)
metrics_tree2
## [,1]
## Sensitivity 0.5222930
## Specificity 0.7185185
## Accuracy 0.6130137
03 - How did you do? Compare your models’ levels of accuracy to the null classifier?
as.data.frame(cbind(metrics_glm,
metrics_tree1,metrics_tree2,metrics_rf)) -> comp_df
comp_df %>% rename("GLM" = 1,
"Tree #1" = 2,
"Tree #3" = 3,
"Random Forest"=4)
## GLM Tree #1 Tree #2 Random Forest
## Sensitivity 0.7770701 0.5987261 0.5222930 0.5668790
## Specificity 0.4444444 0.6222222 0.7185185 0.6888889
## Accuracy 0.6232877 0.6095890 0.6130137 0.6232877
All of the models beat the null classifier: always predicting the majority class of the test set (TRUE, 157 of 292 observations) would give an accuracy of roughly 54%, while our models reach about 61-62%. As mentioned above, the random forest model performed the best on the training data, but not as well as we would have hoped on the testing data. Something we could consider would be to create bins when attempting binary logistic regression; doing so might reveal a linear trend and help reduce the effect of outliers. It would also be interesting to see the results of models that automatically choose which variables to include. We can already see the variation that arises from slight changes to the model or data, so perhaps a model we did not try is the one that was used to construct the undervalued variable.
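For reference, the null-classifier benchmark can be computed directly from the test split; a minimal sketch:

# Null classifier: always predict the most common class in the test set
max(table(test$undervalued)) / nrow(test) # 157/292, roughly 0.54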
04 - Are all errors equal in this setting? Briefly explain your answer.
All errors are NOT equal. For example, consider two different observations that both have a TRUE value for the undervalued variable. Even though both observations are considered undervalued, suppose observation 1 is undervalued by 500% whereas observation 2 is undervalued by only 1%. If our model misclassified both observations, the marginal cost in terms of the classification metrics would be the same (each counts as one error), but the incorrect classification of observation 1 would have a much greater impact on the true, practical performance of our model.
05 - Why would it be a bad idea to use a linear model here (for example plain OLS or lasso)?
Because the outcome we are predicting is a binary categorical variable rather than a continuous one, a plain linear model is a poor fit: its predictions are not constrained to valid probabilities. Linear regression models are also more susceptible to the influence of outliers. And even if our data had ideal characteristics for a regression model, there still might not be a linear trend for the model to fit; decision trees, on the other hand, are much better at capturing non-linear relationships in the data.

In addition, linear models assume the included variables enter independently (additively). While we could attempt to work around this by adding interaction terms to the model, it would be more worthwhile to use a method better suited to classification problems.
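As a quick illustration of the first point, we could fit plain OLS to the 0/1 outcome and check whether the fitted values stay inside [0, 1]; a sketch that reuses the predictors from the final GLM (ols_mod is just an illustrative name):

# Linear probability model: fitted values are not constrained to [0, 1]
ols_mod = lm(as.numeric(undervalued == "TRUE") ~ overall_qual + bsmt_fin_sf1 +
               kitchen_qual + bldg_type, data = train)
range(fitted(ols_mod)) # values outside [0, 1] would confirm the problem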