In this document we will conduct binomial logistic regression modeling on a data set relating to recruiting data and whether a person is hired or not hired based on test scores, interviews, and gender. Data fields:
# Download the recruiting dataset from the URL provided
url <- "https://peopleanalytics-regression-book.org/data/recruiting.csv"
recruiting <- read.csv(url)
# view the first few rows in the data set to acquaint yourself with the data
head(recruiting)
## gender sat gpa apttest int1 int2 int3 hired
## 1 F 1230 3.57 70 4 3 4 0
## 2 F 1250 3.58 20 3 4 3 0
## 3 F 930 2.89 77 2 3 3 0
## 4 F 1110 3.49 85 5 3 4 0
## 5 M 1260 3.39 93 3 4 4 1
## 6 M 1370 3.41 81 2 4 4 0
# What do you think the datatypes are?
# gender: binary factor
# sat: numeric
# gpa: numeric
# apttest: numeric
# int1: ordered factor
# int2: ordered factor
# int3: ordered factor
# hired: binary factor
# Perform any necessary datatype adjustments
recruiting$gender <- as.factor(recruiting$gender)
recruiting$hired <- as.factor(recruiting$hired)
recruiting$int1 <- ordered(recruiting$int1,
levels = 1:5)
recruiting$int2 <- ordered(recruiting$int2,
levels = 1:5)
recruiting$int3 <- ordered(recruiting$int3,
levels = 1:5)
# get to the know the data a bit. what is the split of hired/not hired,
# do you notice differences between the interviewers, etc
# Hint: summary(), GGally::ggpairs() are good first steps
summary(recruiting)
## gender sat gpa apttest int1 int2 int3
## F:494 Min. : 420 Min. :2.46 Min. : 8.00 1: 6 1: 0 1: 0
## M:472 1st Qu.: 970 1st Qu.:3.26 1st Qu.:51.00 2:157 2: 31 2: 10
## Median :1110 Median :3.45 Median :62.00 3:409 3:429 3:554
## Mean :1106 Mean :3.38 Mean :62.25 4:370 4:491 4:396
## 3rd Qu.:1258 3rd Qu.:3.55 3rd Qu.:76.00 5: 24 5: 15 5: 6
## Max. :1600 Max. :3.87 Max. :99.00
## hired
## 0:780
## 1:186
##
##
##
##
GGally::ggpairs(recruiting)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Run a simple binomial logistic regression model to estimate the influence of SAT scores
# on hire success, saving your model using a name of your choice
model <- glm(data = recruiting, formula = hired ~ sat, family = "binomial")
# Examine the coefficients of your saved model
summary(model)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.140822718 0.4942049503 -6.355304 2.080148e-10
## sat 0.001518944 0.0004267464 3.559360 3.717589e-04
# Take a look at the odds ratio for our independent variable. How do you interpret this?
exp(model$coefficients)
## (Intercept) sat
## 0.0432472 1.0015201
# Do you think this is a good model?
DescTools::PseudoR2(model)
## McFadden
## 0.01390745
# What variables look like they might be interesting to include?
GGally::ggpairs(recruiting)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# make a new model with gender and apttest
model2 <- glm(hired ~ apttest + gender, data = recruiting, family = "binomial")
# View the coefficients and log odds.
summary(model2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.88658877 0.468463826 -12.565727 3.258639e-36
## apttest 0.06362264 0.006391116 9.954857 2.401685e-23
## genderM 0.31346702 0.177697958 1.764044 7.772457e-02
exp(model2$coefficients)
## (Intercept) apttest genderM
## 0.002776432 1.065690176 1.368160340
# Assess model fit - is this better than the simple model?
DescTools::PseudoR2(model2)
## McFadden
## 0.1441964
AIC(model)
## [1] 937.3132
AIC(model2)
## [1] 815.9978
Write a few sentences on the data and the results of your model. Odds can be difficult to explain - can you try to explain to a stakeholder what this means for a candidate with a 60 on their aptitude test vs a candidate with a 70?
Each additional point in the aptitude tests results in a ~6.5% increase in the odds of being hired. The low Pseudo R^2 value means that we are only explaining a small amount of what leads to a person being hired.
A candidate with a 70 has 10 more points than a candidate with 60. Each point relates to a 6.5% increase in odds, so all else being equal, the person with a 70 has 1.065^10 = ~88% increased odds of being hired vs the candidate with a 60.