03A - Binomial Logistic Regression Solutions

In this document we will conduct binomial logistic regression modeling on a data set relating to recruiting data and whether a person is hired or not hired based on test scores, interviews, and gender. Data fields:

gender: gender
sat: SAT test score of applicant
gpa: GPA of applicant
apttest: Score on an aptitude test given to applicant
int1: applicant rating given by one of two line manager interviewers (1-5 where high is positive)
int2: applicant rating given by second of two line manager interviewers (1-5 where high is positive)
int3: applicant rating given by HR interviewer (1-5 where high is positive)
hired: hired (1) or not hired (0)

Exercise 1 - Running a simple binomial logistic regression model

# Download the recruiting dataset from the URL provided
url <- "https://peopleanalytics-regression-book.org/data/recruiting.csv"
recruiting <- read.csv(url)

# view the first few rows in the data set to acquaint yourself with the data
head(recruiting)

##   gender  sat  gpa apttest int1 int2 int3 hired
## 1      F 1230 3.57      70    4    3    4     0
## 2      F 1250 3.58      20    3    4    3     0
## 3      F  930 2.89      77    2    3    3     0
## 4      F 1110 3.49      85    5    3    4     0
## 5      M 1260 3.39      93    3    4    4     1
## 6      M 1370 3.41      81    2    4    4     0

# What do you think the datatypes are?
# gender: binary factor
# sat: numeric
# gpa: numeric
# apttest: numeric
# int1: ordered factor
# int2: ordered factor
# int3: ordered factor
# hired: binary factor

# Perform any necessary datatype adjustments
recruiting$gender <- as.factor(recruiting$gender)
recruiting$hired <- as.factor(recruiting$hired)
recruiting$int1 <- ordered(recruiting$int1,
                           levels = 1:5)
recruiting$int2 <- ordered(recruiting$int2,
                           levels = 1:5)
recruiting$int3 <- ordered(recruiting$int3,
                           levels = 1:5)

# get to the know the data a bit.  what is the split of hired/not hired,
# do you notice differences between the interviewers, etc
# Hint: summary(), GGally::ggpairs() are good first steps
summary(recruiting)

##  gender       sat            gpa          apttest      int1    int2    int3   
##  F:494   Min.   : 420   Min.   :2.46   Min.   : 8.00   1:  6   1:  0   1:  0  
##  M:472   1st Qu.: 970   1st Qu.:3.26   1st Qu.:51.00   2:157   2: 31   2: 10  
##          Median :1110   Median :3.45   Median :62.00   3:409   3:429   3:554  
##          Mean   :1106   Mean   :3.38   Mean   :62.25   4:370   4:491   4:396  
##          3rd Qu.:1258   3rd Qu.:3.55   3rd Qu.:76.00   5: 24   5: 15   5:  6  
##          Max.   :1600   Max.   :3.87   Max.   :99.00                          
##  hired  
##  0:780  
##  1:186  
##         
##         
##         
##

GGally::ggpairs(recruiting)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Run a simple binomial logistic regression model to estimate the influence of SAT scores
# on hire success, saving your model using a name of your choice
model <- glm(data = recruiting, formula = hired ~  sat, family = "binomial")

Exercise 2 - Interpreting the coefficients

# Examine the coefficients of your saved model
summary(model)$coefficients

##                 Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept) -3.140822718 0.4942049503 -6.355304 2.080148e-10
## sat          0.001518944 0.0004267464  3.559360 3.717589e-04

# Take a look at the odds ratio for our independent variable. How do you interpret this?
exp(model$coefficients)

## (Intercept)         sat 
##   0.0432472   1.0015201

# Do you think this is a good model?
DescTools::PseudoR2(model)

##   McFadden 
## 0.01390745

Exercise 3 - Running a multiple binomial logistic regression model

# What variables look like they might be interesting to include?
GGally::ggpairs(recruiting)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# make a new model with gender and apttest
model2 <- glm(hired ~ apttest + gender,  data = recruiting, family = "binomial")

# View the coefficients and log odds.
summary(model2)$coefficients

##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept) -5.88658877 0.468463826 -12.565727 3.258639e-36
## apttest      0.06362264 0.006391116   9.954857 2.401685e-23
## genderM      0.31346702 0.177697958   1.764044 7.772457e-02

exp(model2$coefficients)

## (Intercept)     apttest     genderM 
## 0.002776432 1.065690176 1.368160340

# Assess model fit - is this better than the simple model?
DescTools::PseudoR2(model2)

##  McFadden 
## 0.1441964

AIC(model)

## [1] 937.3132

AIC(model2)

## [1] 815.9978

Write a few sentences on the data and the results of your model. Odds can be difficult to explain - can you try to explain to a stakeholder what this means for a candidate with a 60 on their aptitude test vs a candidate with a 70?

Each additional point in the aptitude tests results in a ~6.5% increase in the odds of being hired. The low Pseudo R^2 value means that we are only explaining a small amount of what leads to a person being hired.

A candidate with a 70 has 10 more points than a candidate with 60. Each point relates to a 6.5% increase in odds, so all else being equal, the person with a 70 has 1.065^10 = ~88% increased odds of being hired vs the candidate with a 60.