In this document we will conduct linear regression modeling on a data set relating to graduate salaries in the United States, to try to understand economic factors which influence graduate salaries.
# Download the graduates dataset from the URL provided
url <- "https://peopleanalytics-regression-book.org/data/graduates.csv"
graduates <- read.csv(url)
# view the first few rows in the data set to acquaint yourself with the data
head(graduates)
## Major Discipline Total
## 1 GENERAL AGRICULTURE Agriculture & Natural Resources 128148
## 2 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources 95326
## 3 AGRICULTURAL ECONOMICS Agriculture & Natural Resources 33955
## 4 ANIMAL SCIENCES Agriculture & Natural Resources 103549
## 5 FOOD SCIENCE Agriculture & Natural Resources 24280
## 6 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources 79409
## Unemployment_rate Median_salary
## 1 0.02614711 50000
## 2 0.02863606 54000
## 3 0.03024832 63000
## 4 0.04267890 46000
## 5 0.04918845 62000
## 6 0.03179089 50000
# Determine how many observations we have in this data set
nrow(graduates)
## [1] 173
# use the colnames() function to get the names of the columns in the data set
colnames(graduates)
## [1] "Major" "Discipline" "Total"
## [4] "Unemployment_rate" "Median_salary"
# Use the cor() function to determine the correlation between unemployment rate and median salary
cor(graduates$Unemployment_rate, graduates$Median_salary)
## [1] -0.3019457
# Consider the scale of the Unemployment_rate column.
# Transform it into a more useful scale
# (Hint: we will want to understand the impact of a percentage point change)
graduates$Unemployment_rate <- 100*graduates$Unemployment_rate
# Run a simple linear regression model to estimate the influence of unemployment rate
# on graduate salaries, saving your model using a name of your choice
model <- lm(data = graduates, formula = Median_salary ~ Unemployment_rate)
# Examine the coefficients of your saved model
model$coefficients
## (Intercept) Unemployment_rate
## 70096.911 -2315.512
# Determine the 95% confidence interval for the coefficients
confint(model)
## 2.5 % 97.5 %
## (Intercept) 63424.936 76768.887
## Unemployment_rate -3419.067 -1211.958
Write below an interpretation of these coefficients using the estimates and the confidence intervals which you just calculated.
# EXTENSION: If you are familiar with using ggplot2, create a plot of unemployment rate and
# median salary and show the estimated linear model using geom_smooth()
library(ggplot2)
ggplot(data = graduates, aes(x = Unemployment_rate, y = Median_salary)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Unemployment rate (%)",
y = "Median salary ($)") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
# Run a multiple linear regression to determine the influence of both unemployment rate
# and total graduate employees on median salary
model2 <- lm(Median_salary ~ Unemployment_rate + Total, graduates)
# View the coefficients and confidence intervals
model2$coefficients
## (Intercept) Unemployment_rate Total
## 7.069198e+04 -2.343223e+03 -1.894123e-03
confint(model2)
## 2.5 % 97.5 %
## (Intercept) 6.382648e+04 7.755749e+04
## Unemployment_rate -3.450722e+03 -1.235725e+03
## Total -6.926153e-03 3.137908e-03
Write below an interpretation of the coefficients of this model.
# Add discipline as an input variable to the previous model you created in
# Exercise 3.
model3 <- lm(Median_salary ~ Unemployment_rate + Total + Discipline,
graduates)
# Inspect the results and determine which value was used as a reference for
# discipline in the model
confint(model3)
## 2.5 % 97.5 %
## (Intercept) 5.309890e+04 6.715682e+04
## Unemployment_rate -2.248705e+03 -3.458837e+02
## Total -3.647491e-03 3.819117e-03
## DisciplineArts -1.528832e+04 4.772593e+03
## DisciplineBiology & Life Science -1.064108e+04 4.968144e+03
## DisciplineBusiness -9.672146e+02 1.595158e+04
## DisciplineCommunications & Journalism -1.324829e+04 9.850149e+03
## DisciplineComputers & Mathematics 5.439571e+03 2.224382e+04
## DisciplineEducation -1.788046e+04 -2.630368e+03
## DisciplineEngineering 1.724532e+04 3.113145e+04
## DisciplineHealth -5.639558e+03 1.050717e+04
## DisciplineHumanities & Liberal Arts -1.326690e+04 3.142300e+03
## DisciplineIndustrial Arts & Consumer Services -9.302614e+03 9.497455e+03
## DisciplineInterdisciplinary -2.703539e+04 1.282003e+04
## DisciplineLaw & Public Policy -9.150384e+03 1.206888e+04
## DisciplinePhysical Sciences 8.546786e+02 1.782305e+04
## DisciplinePsychology & Social Work -1.486827e+04 3.889056e+03
## DisciplineSocial Science -7.424027e+03 1.060487e+04
# view all unique values for discipline
unique(graduates$Discipline)
## [1] "Agriculture & Natural Resources" "Biology & Life Science"
## [3] "Engineering" "Humanities & Liberal Arts"
## [5] "Communications & Journalism" "Computers & Mathematics"
## [7] "Industrial Arts & Consumer Services" "Education"
## [9] "Law & Public Policy" "Interdisciplinary"
## [11] "Health" "Social Science"
## [13] "Physical Sciences" "Psychology & Social Work"
## [15] "Arts" "Business"
# it appears that the first discipline in alphabetical order has been used
# as reference (Agriculture & Natural Resources)
Write below your thoughts on whether any specific disciplines have a significant influence on median salary.
Compared to a reference discipline of Agriculture & Natural Resources, the following disciplines have a significant positive influence on median salary (assuming similar unemployment rates and total graduate numbers):
and the following have a significant negative influence:
# EXTENSION: Change your reference discipline to a discipline of your choice,
# rerun the model and inspect the results. How have they changed?
graduates$Discipline <- as.factor(graduates$Discipline) |>
relevel(ref = "Computers & Mathematics")
model3 <- lm(Median_salary ~ Unemployment_rate + Total + Discipline,
graduates)
confint(model3)
## 2.5 % 97.5 %
## (Intercept) 6.590542e+04 8.203370e+04
## Unemployment_rate -2.248705e+03 -3.458837e+02
## Total -3.647491e-03 3.819117e-03
## DisciplineAgriculture & Natural Resources -2.224382e+04 -5.439571e+03
## DisciplineArts -2.820131e+04 -9.997809e+03
## DisciplineBiology & Life Science -2.427128e+04 -9.085053e+03
## DisciplineBusiness -1.432016e+04 1.621136e+03
## DisciplineCommunications & Journalism -2.655835e+04 -4.523188e+03
## DisciplineEducation -3.152224e+04 -1.667198e+04
## DisciplineEngineering 3.671058e+03 1.702232e+04
## DisciplineHealth -1.929639e+04 -3.519399e+03
## DisciplineHumanities & Liberal Arts -2.639708e+04 -1.141091e+04
## DisciplineIndustrial Arts & Consumer Services -2.278212e+04 -4.706431e+03
## DisciplineInterdisciplinary -4.054532e+04 -1.353428e+03
## DisciplineLaw & Public Policy -2.249663e+04 -2.268268e+03
## DisciplinePhysical Sciences -1.268773e+04 3.682062e+03
## DisciplinePsychology & Social Work -2.792243e+04 -1.074018e+04
## DisciplineSocial Science -2.069284e+04 -3.809716e+03
Because we have referenced on one of the higher salary disciplines, we now see a lot more disciplines which have a significantly negative influence on median salary.
# Determine the fit of your model from Exercise 4
summary(model3)$r.squared
## [1] 0.626935
# Determine if any of the input variables can be removed to form a
# more parsimonious model
# we know from Exercise 4 that we can remove Total
model4 <- lm(Median_salary ~ Unemployment_rate + Discipline, graduates)
# Verify that these is no substantial difference in the fit of your
# more parsimonious model
summary(model4)$r.squared
## [1] 0.6269301
# Run a summary of your parsimonious model
summary(model4)
##
## Call:
## lm(formula = Median_salary ~ Unemployment_rate + Discipline,
## data = graduates)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18681 -5185 -665 3455 47871
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 74001.1 4010.1 18.454
## Unemployment_rate -1300.3 475.7 -2.734
## DisciplineAgriculture & Natural Resources -13856.0 4228.1 -3.277
## DisciplineArts -19085.7 4582.8 -4.165
## DisciplineBiology & Life Science -16686.7 3827.0 -4.360
## DisciplineBusiness -6299.8 3871.1 -1.627
## DisciplineCommunications & Journalism -15513.1 5526.3 -2.807
## DisciplineEducation -24089.6 3743.1 -6.436
## DisciplineEngineering 10340.8 3366.1 3.072
## DisciplineHealth -11404.3 3979.8 -2.866
## DisciplineHumanities & Liberal Arts -18893.5 3774.1 -5.006
## DisciplineIndustrial Arts & Consumer Services -13745.8 4560.5 -3.014
## DisciplineInterdisciplinary -20954.1 9887.7 -2.119
## DisciplineLaw & Public Policy -12378.4 5102.9 -2.426
## DisciplinePhysical Sciences -4509.4 4127.7 -1.092
## DisciplinePsychology & Social Work -19320.8 4329.0 -4.463
## DisciplineSocial Science -12238.0 4249.7 -2.880
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Unemployment_rate 0.00699 **
## DisciplineAgriculture & Natural Resources 0.00129 **
## DisciplineArts 5.14e-05 ***
## DisciplineBiology & Life Science 2.35e-05 ***
## DisciplineBusiness 0.10567
## DisciplineCommunications & Journalism 0.00564 **
## DisciplineEducation 1.44e-09 ***
## DisciplineEngineering 0.00251 **
## DisciplineHealth 0.00474 **
## DisciplineHumanities & Liberal Arts 1.49e-06 ***
## DisciplineIndustrial Arts & Consumer Services 0.00301 **
## DisciplineInterdisciplinary 0.03566 *
## DisciplineLaw & Public Policy 0.01642 *
## DisciplinePhysical Sciences 0.27631
## DisciplinePsychology & Social Work 1.54e-05 ***
## DisciplineSocial Science 0.00454 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9432 on 156 degrees of freedom
## Multiple R-squared: 0.6269, Adjusted R-squared: 0.5887
## F-statistic: 16.38 on 16 and 156 DF, p-value: < 2.2e-16
The \(p\)-value of the F-statistic is very small, indicating high confidence that the model is better than a null model.
# Determine if the two numerical input variables have a significant correlation
cor.test(graduates$Unemployment_rate, graduates$Total)
##
## Pearson's product-moment correlation
##
## data: graduates$Unemployment_rate and graduates$Total
## t = -0.87117, df = 171, p-value = 0.3849
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.21355488 0.08355632
## sample estimates:
## cor
## -0.06647261
# Using the package mctest, use the imcdiag() function to test if collinearity
# exists in your full model from Exercise 3. Use the VIF method for this.
library(mctest)
mctest::imcdiag(model3, method = "VIF")
##
## Call:
## mctest::imcdiag(mod = model3, method = "VIF")
##
##
## VIF Multicollinearity Diagnostics
##
## VIF detection
## Unemployment_rate 1.6389 0
## Total 1.2223 0
## DisciplineAgriculture & Natural Resources 1.9038 0
## DisciplineArts 1.8092 0
## DisciplineBiology & Life Science 2.1234 0
## DisciplineBusiness 2.1863 0
## DisciplineCommunications & Journalism 1.3576 0
## DisciplineEducation 2.2914 0
## DisciplineEngineering 3.0790 0
## DisciplineHealth 1.9891 0
## DisciplineHumanities & Liberal Arts 2.2016 0
## DisciplineIndustrial Arts & Consumer Services 1.5704 0
## DisciplineInterdisciplinary 1.0928 0
## DisciplineLaw & Public Policy 1.4217 0
## DisciplinePhysical Sciences 1.8067 0
## DisciplinePsychology & Social Work 1.8024 0
## DisciplineSocial Science 1.7402 0
##
## NOTE: VIF Method Failed to detect multicollinearity
##
##
## 0 --> COLLINEARITY is not detected by the test
##
## ===================================
# View a density plot of the Median salaries in the data set - what do you observe?
density(graduates$Median_salary) |>
plot(main = "Median Salary Density")
The distribution appears to be somewhat left-skewed due to a tail of high median salaries.
# use the qqnorm() function to determine how the residuals of your model
# compare to a normal distribution
# comment on your observations
# consider re-running your model to address any concerns you have
qqnorm(model4$residuals)
Observations: * Linear regression assumes a normal distribution of residuals * Therefore we want this to be as close to a straight line as possible * Two high median salary outliers appear to be affecting this * We could consider removing these outliers to help build a more reliable model
# remove highest two median salaries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
adjusted_graduates <- graduates |>
dplyr::arrange(Median_salary) |>
head(nrow(graduates) - 2)
# rerun model
model5 <- lm(Median_salary ~ Unemployment_rate + Discipline,
adjusted_graduates)
# test residuals now
qqnorm(model5$residuals)
Healthier looking residual distribution.
# view results
summary(model5)
##
## Call:
## lm(formula = Median_salary ~ Unemployment_rate + Discipline,
## data = adjusted_graduates)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16830.0 -4713.6 -577.4 3618.5 22186.0
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 72367.3 3281.2 22.055
## Unemployment_rate -1025.4 389.8 -2.630
## DisciplineAgriculture & Natural Resources -13309.9 3454.6 -3.853
## DisciplineArts -19859.9 3744.8 -5.303
## DisciplineBiology & Life Science -16425.5 3126.5 -5.254
## DisciplineBusiness -6164.0 3162.5 -1.949
## DisciplineCommunications & Journalism -15779.4 4514.7 -3.495
## DisciplineEducation -23741.1 3058.1 -7.763
## DisciplineEngineering 8926.5 2761.8 3.232
## DisciplineHealth -15452.2 3313.7 -4.663
## DisciplineHumanities & Liberal Arts -19168.2 3083.3 -6.217
## DisciplineIndustrial Arts & Consumer Services -13721.3 3725.6 -3.683
## DisciplineInterdisciplinary -21444.3 8077.7 -2.655
## DisciplineLaw & Public Policy -12609.7 4168.8 -3.025
## DisciplinePhysical Sciences -4374.8 3372.0 -1.297
## DisciplinePsychology & Social Work -19827.4 3537.0 -5.606
## DisciplineSocial Science -12409.8 3471.8 -3.574
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Unemployment_rate 0.009400 **
## DisciplineAgriculture & Natural Resources 0.000171 ***
## DisciplineArts 3.89e-07 ***
## DisciplineBiology & Life Science 4.89e-07 ***
## DisciplineBusiness 0.053100 .
## DisciplineCommunications & Journalism 0.000619 ***
## DisciplineEducation 1.07e-12 ***
## DisciplineEngineering 0.001503 **
## DisciplineHealth 6.71e-06 ***
## DisciplineHumanities & Liberal Arts 4.55e-09 ***
## DisciplineIndustrial Arts & Consumer Services 0.000319 ***
## DisciplineInterdisciplinary 0.008771 **
## DisciplineLaw & Public Policy 0.002916 **
## DisciplinePhysical Sciences 0.196442
## DisciplinePsychology & Social Work 9.34e-08 ***
## DisciplineSocial Science 0.000469 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7705 on 154 degrees of freedom
## Multiple R-squared: 0.6957, Adjusted R-squared: 0.6641
## F-statistic: 22.01 on 16 and 154 DF, p-value: < 2.2e-16
Overall fit has improved, with similar significant variables, but some changes to coefficients.