Teaching modern modeling with tidymodels

rstudio::conf(2022)
Designing the data science classroom

Maria Tackett

Session outline

  • Teaching modern modeling
  • Introducing tidymodels
  • Teaching with tidymodels
    • Feature engineering
    • Prediction + model evaluation
    • Train / test data
  • Putting it all together

Teaching modern modeling

GAISE guidelines

  • Teach statistical thinking.
    • Teach statistics as an investigative process of problem-solving and decision-making.
    • Give students experience with multivariable thinking.
  • Integrate real data with a context and purpose.
  • Use technology to explore concepts and analyze data.

See the Guidelines for Assessment and Instruction in Statistics Education (GAISE) 2016 report for the full guidelines.

Teaching modern regression

Facilitate opportunities for students to…

  • Regularly engage with real-world applications and complex data

  • Develop proficiency using professional statistical software and using a reproducible workflow

  • Identify appropriate methods based on the primary analysis objective: inference or prediction

  • Develop important non-technical skills, specifically written communication and teamwork

Introducing tidymodels

Tidymodels

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.


install.packages("tidymodels")

Tidymodels

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0
✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.0     ✔ yardstick    1.0.0
✔ recipes      1.0.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

Data: Loans from Lending Club

The data is the loans_full_schema data set from the openintro package, which is featured in the OpenIntro textbooks. It contains information about 10,000 loans made through the Lending Club platform. The variables we’ll use in this presentation are:

  • interest_rate: Interest rate of the loan the applicant received.
  • debt_to_income: Debt-to-income ratio.
  • term: The number of months of the loan the applicant received.
  • delinq_2y: Number of delinquencies on lines of credit in the last 2 years.

Data: Loans from Lending Club

library(openintro)
glimpse(loans_full_schema)
Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
$ grade                            <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Exploratory data analysis

ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Interest rate", y = "Count",
    title = "Distribution of loan interest rates"
  )

Exploratory data analysis

ggplot(data = loans_full_schema, aes(x = debt_to_income, y = interest_rate )) +
  geom_point() + 
  labs(x = "Debt to income ratio", y = "Interest rate", 
       title = "Interest rate by debt to income ratio")

Exploratory data analysis

ggplot(data = loans_full_schema, aes(x = interest_rate, fill = factor(term))) +
  geom_density(alpha = 0.5) +
  labs(x = "Interest rate", y = "Density",
       fill = "Term (in months)",
       title = "Interest rate by term length")

Regression syntax

Fit a linear regression model to predict the interest rate using the debt to income ratio.

base_lm <- lm(interest_rate ~ debt_to_income, data = loans_full_schema)
summary(base_lm)

Call:
lm(formula = interest_rate ~ debt_to_income, data = loans_full_schema)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.7391  -3.7203  -0.7945   2.7351  18.6274 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    11.511445   0.080732  142.59   <2e-16 ***
debt_to_income  0.047183   0.003302   14.29   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.948 on 9974 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.02007,   Adjusted R-squared:  0.01997 
F-statistic: 204.2 on 1 and 9974 DF,  p-value: < 2.2e-16

tidy_lm <- linear_reg() |>
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income, data = loans_full_schema)

tidy(tidy_lm)
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     11.5      0.0807      143.  0       
2 debt_to_income   0.0472   0.00330      14.3 7.05e-46

Model summaries using broom

We can use functions from the broom package to produce tidy summaries of models fit using base R or the tidymodels framework:

  • tidy(): summarizes information about model components

  • glance(): reports information about the entire model

  • augment(): adds information about observations to a data set

tidy()

base_lm <- lm(interest_rate ~ debt_to_income, data = loans_full_schema)
tidy(base_lm)
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     11.5      0.0807      143.  0       
2 debt_to_income   0.0472   0.00330      14.3 7.05e-46

tidy_lm <- linear_reg() |>
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income, data = loans_full_schema)
tidy(tidy_lm)
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     11.5      0.0807      143.  0       
2 debt_to_income   0.0472   0.00330      14.3 7.05e-46

glance()

glance(base_lm)
# A tibble: 1 × 12
  r.squared adj.r.s…¹ sigma stati…²  p.value    df  logLik    AIC    BIC devia…³
      <dbl>     <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1    0.0201    0.0200  4.95    204. 7.05e-46     1 -30105. 60217. 60238. 244166.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
# ℹ Use `colnames()` to see all variable names

glance(tidy_lm)
# A tibble: 1 × 12
  r.squared adj.r.s…¹ sigma stati…²  p.value    df  logLik    AIC    BIC devia…³
      <dbl>     <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1    0.0201    0.0200  4.95    204. 7.05e-46     1 -30105. 60217. 60238. 244166.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
# ℹ Use `colnames()` to see all variable names

augment()

base_lm_aug <- augment(base_lm)
base_lm_aug
# A tibble: 9,976 × 9
   .rownames interest_rate debt_…¹ .fitted .resid    .hat .sigma .cooksd .std.…²
   <chr>             <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
 1 1                 14.1    18.0     12.4  1.71  1.01e-4   4.95 6.02e-6  0.345 
 2 2                 12.6     5.04    11.7  0.861 1.91e-4   4.95 2.89e-6  0.174 
 3 3                 17.1    21.2     12.5  4.58  1.02e-4   4.95 4.36e-5  0.926 
 4 4                  6.72   10.2     12.0 -5.27  1.38e-4   4.95 7.80e-5 -1.07  
 5 5                 14.1    58.0     14.2 -0.176 7.65e-4   4.95 4.86e-7 -0.0356
 6 6                  6.72    6.46    11.8 -5.10  1.74e-4   4.95 9.22e-5 -1.03  
 7 7                 13.6    23.7     12.6  0.962 1.09e-4   4.95 2.06e-6  0.194 
 8 8                 12.0    16.2     12.3 -0.285 1.05e-4   4.95 1.74e-7 -0.0577
 9 9                 13.6    36.5     13.2  0.357 2.32e-4   4.95 6.04e-7  0.0722
10 10                 6.71   18.9     12.4 -5.69  1.00e-4   4.95 6.64e-5 -1.15  
# … with 9,966 more rows, and abbreviated variable names ¹​debt_to_income,
#   ²​.std.resid
# ℹ Use `print(n = ...)` to see more rows

tidy_lm_aug <- augment(tidy_lm$fit)
tidy_lm_aug
# A tibble: 9,976 × 9
   .rownames interest_rate debt_…¹ .fitted .resid    .hat .sigma .cooksd .std.…²
   <chr>             <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
 1 1                 14.1    18.0     12.4  1.71  1.01e-4   4.95 6.02e-6  0.345 
 2 2                 12.6     5.04    11.7  0.861 1.91e-4   4.95 2.89e-6  0.174 
 3 3                 17.1    21.2     12.5  4.58  1.02e-4   4.95 4.36e-5  0.926 
 4 4                  6.72   10.2     12.0 -5.27  1.38e-4   4.95 7.80e-5 -1.07  
 5 5                 14.1    58.0     14.2 -0.176 7.65e-4   4.95 4.86e-7 -0.0356
 6 6                  6.72    6.46    11.8 -5.10  1.74e-4   4.95 9.22e-5 -1.03  
 7 7                 13.6    23.7     12.6  0.962 1.09e-4   4.95 2.06e-6  0.194 
 8 8                 12.0    16.2     12.3 -0.285 1.05e-4   4.95 1.74e-7 -0.0577
 9 9                 13.6    36.5     13.2  0.357 2.32e-4   4.95 6.04e-7  0.0722
10 10                 6.71   18.9     12.4 -5.69  1.00e-4   4.95 6.64e-5 -1.15  
# … with 9,966 more rows, and abbreviated variable names ¹​debt_to_income,
#   ²​.std.resid
# ℹ Use `print(n = ...)` to see more rows

Why Tidymodels?

There are advantages for more advanced modeling:

  • Consistent syntax for different model types (linear, logistic, random forest, Bayesian, etc.)
  • Streamlined modeling workflow
    • Split data into train and test sets
    • Transform and create new variables
    • Assess model performance
    • Use model for prediction and inference

Teaching with tidymodels

Tidymodels syntax

linear_reg() |> 
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income, data = loans_full_schema) |>
  tidy()
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     11.5      0.0807      143.  0       
2 debt_to_income   0.0472   0.00330      14.3 7.05e-46


Let’s break down the syntax.

1️⃣ Specify model

linear_reg() 
Linear Regression Model Specification (regression)

Computational engine: lm 

2️⃣ Set computational engine

linear_reg() |> 
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 

3️⃣ Fit the model

linear_reg() |> 
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income, data = loans_full_schema)
parsnip model object


Call:
stats::lm(formula = interest_rate ~ debt_to_income, data = data)

Coefficients:
   (Intercept)  debt_to_income  
      11.51145         0.04718  

4️⃣ Summarize output

linear_reg() |> 
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income, data = loans_full_schema) |>
  tidy()
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     11.5      0.0807      143.  0       
2 debt_to_income   0.0472   0.00330      14.3 7.05e-46

Consistent syntax for other models

The syntax is the same if we fit a more advanced model, such as a logistic regression model.

Fit a model to predict the loan term length (36 or 60 months) based on the loan amount.

logistic_reg() |>
  set_engine("glm") |>
  fit(factor(term) ~ loan_amount, data = loans_full_schema) |>
  tidy()
# A tibble: 2 × 5
  term          estimate  std.error statistic   p.value
  <chr>            <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept) -2.28      0.0493         -46.2 0        
2 loan_amount  0.0000814 0.00000232      35.1 1.02e-269
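
The logistic regression coefficients are on the log-odds scale, so a little arithmetic helps when reading them. The sketch below copies the two rounded estimates from the output above and assumes the model describes the log-odds of the second factor level (a 60-month term), which is the usual glm convention. The names b0, b1, or_per_1000 and p_10k are made up for illustration, and the $10,000 loan is a hypothetical example.

```r
# Coefficients copied (rounded) from the tidy() output above
b0 <- -2.28       # intercept
b1 <- 0.0000814   # slope for loan_amount (log-odds per dollar)

# Multiplicative change in the odds for a $1,000 larger loan
or_per_1000 <- exp(b1 * 1000)

# Predicted probability for a hypothetical $10,000 loan
p_10k <- plogis(b0 + b1 * 10000)

round(c(or_per_1000 = or_per_1000, p_10k = p_10k), 3)
```

Rescaling the slope to a $1,000 change is only a presentation choice; the per-dollar coefficient is too small to interpret comfortably on its own.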

Feature engineering

  • Feature engineering is the process of transforming raw variables in preparation for use in a statistical model.

  • You may be familiar with doing feature engineering using dplyr before fitting the model.

  • The recipes package makes it possible to do feature engineering as part of the modeling workflow using “dplyr-like” functions.

Example: Predicting interest rate

  • Goal: Fit a model to predict the interest rate based on the term, debt to income ratio, and number of delinquencies in the past two years.

  • We need to do the following to prepare the predictors for the model:

    • Make term a factor.

    • Mean-center debt_to_income.

    • Split delinq_2y into the categories 0, 1, 2, 3, 4+.

Feature engineering using dplyr

# Feature engineering
loans_full_schema <- loans_full_schema |>
  mutate(term_fct = as_factor(term)) |>
  mutate(debt_to_income_cent = 
           debt_to_income - mean(debt_to_income, na.rm = TRUE)) |>
  mutate(delinq_2y_cat = 
           cut(delinq_2y, breaks = c(-Inf,0,1, 2, 3, Inf)))

# Fit the model 
lm(interest_rate ~ term_fct + debt_to_income_cent + delinq_2y_cat,
   data = loans_full_schema) |>
  tidy()
# A tibble: 7 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)            11.0      0.0584     189.   0        
2 term_fct60              3.86     0.100       38.5  6.67e-303
3 debt_to_income_cent     0.0429   0.00307     14.0  5.47e- 44
4 delinq_2y_cat(0,1]      1.37     0.153        8.95 4.08e- 19
5 delinq_2y_cat(1,2]      1.68     0.291        5.76 8.57e-  9
6 delinq_2y_cat(2,3]      2.46     0.492        5.00 5.78e-  7
7 delinq_2y_cat(3, Inf]   2.85     0.563        5.05 4.39e-  7
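
To see what each of the three transformations does, here is a minimal base R sketch on a tiny made-up data frame. The values, the name toy, and the new column names are hypothetical; only the breaks vector is taken from the pipeline above. Custom labels are added so the bins are easy to read.

```r
# Toy data with hypothetical values (not the Lending Club loans)
toy <- data.frame(
  term           = c(36, 60, 36, 60, 36),
  debt_to_income = c(10, 20, 30, 40, 50),
  delinq_2y      = c(0, 1, 2, 3, 7)
)

# 1. Treat term as categorical rather than numeric
toy$term_fct <- factor(toy$term)

# 2. Mean-center: subtract the sample mean so the new variable has mean 0
toy$dti_cent <- toy$debt_to_income - mean(toy$debt_to_income, na.rm = TRUE)

# 3. Bin the counts; intervals are right-closed, so (3, Inf] holds 4 or more
toy$delinq_cat <- cut(
  toy$delinq_2y,
  breaks = c(-Inf, 0, 1, 2, 3, Inf),
  labels = c("0", "1", "2", "3", "4+")
)

toy
```

With these breaks, cut() produces five bins, which matches the four delinq_2y_cat terms plus the baseline level in the model output above.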

Discussion

What is a disadvantage of this approach?


Feature engineering with recipes

Illustration by Allison Horst

Specify the variables

interest_rec <- recipe(
  interest_rate ~ term + debt_to_income + delinq_2y, 
  data = loans_full_schema 
  )

interest_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Define pre-processing steps

Make term a factor.

interest_rec <- interest_rec |>
  step_mutate(term = as_factor(term))

interest_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Operations:

Variable mutation for as_factor(term)

Define pre-processing steps

Mean-center debt_to_income.

interest_rec <- interest_rec |>
  step_center(debt_to_income)

interest_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Operations:

Variable mutation for as_factor(term)
Centering for debt_to_income

Define pre-processing steps

Break delinq_2y into the categories 0, 1, 2, 3, 4+.

interest_rec <- interest_rec |>
  step_cut(delinq_2y, breaks = c(-Inf,0,1, 2, 3, Inf))

interest_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Operations:

Variable mutation for as_factor(term)
Centering for debt_to_income
Cut numeric for delinq_2y

Putting it all together

interest_rec <- recipe(interest_rate ~ term + debt_to_income + delinq_2y,
                       data = loans_full_schema) |>
  step_mutate(term = as_factor(term))|>
  step_center(debt_to_income) |>
  step_cut(delinq_2y, breaks = c(-Inf,0,1, 2, 3, Inf))

Putting it all together

interest_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Operations:

Variable mutation for as_factor(term)
Centering for debt_to_income
Cut numeric for delinq_2y

Prep and bake to see created variables

interest_rec |>
  prep()|>
  bake(loans_full_schema) |>
  head()
# A tibble: 6 × 4
  term  debt_to_income delinq_2y interest_rate
  <fct>          <dbl> <fct>             <dbl>
1 60             -1.30 [-Inf,0]          14.1 
2 36            -14.3  [-Inf,0]          12.6 
3 36              1.84 [-Inf,0]          17.1 
4 36             -9.15 [-Inf,0]           6.72
5 36             38.7  [-Inf,0]          14.1 
6 36            -12.8  (0,1]              6.72

Exercise 1: Feature engineering

RStudio Cloud > “Model 3 - Tidymodels” > ex-3-1.qmd

Write the recipe for the model in the exercise file. Call the recipe loans_rec. Use the appropriate step_* functions to complete the feature engineering steps shown in the dplyr pipeline.

You can find the list of step_* functions on the recipes reference page.


Fit the model using workflow()

Workflows bring together models and recipes, making them easier to apply to multiple data sets, e.g., training and test data.

Specify the model

interest_spec <- linear_reg() |>
  set_engine("lm")

Build workflow

interest_workflow <- workflow() |>
  add_model(interest_spec) |>
  add_recipe(interest_rec)

View workflow

interest_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_mutate()
• step_center()
• step_cut()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Fit model to data

interest_fit <- interest_workflow |>
  fit(data = loans_full_schema)

tidy(interest_fit)
# A tibble: 7 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        11.0      0.0584     189.   0        
2 term60              3.86     0.100       38.5  6.67e-303
3 debt_to_income      0.0429   0.00307     14.0  5.47e- 44
4 delinq_2y(0,1]      1.37     0.153        8.95 4.08e- 19
5 delinq_2y(1,2]      1.68     0.291        5.76 8.57e-  9
6 delinq_2y(2,3]      2.46     0.492        5.00 5.78e-  7
7 delinq_2y(3, Inf]   2.85     0.563        5.05 4.39e-  7

Prediction + Model evaluation

Make predictions

interest_pred <- predict(interest_fit, loans_full_schema) |> 
  bind_cols(loans_full_schema |> select(interest_rate))

interest_pred
# A tibble: 10,000 × 2
   .pred interest_rate
   <dbl>         <dbl>
 1  14.8         14.1 
 2  10.4         12.6 
 3  11.1         17.1 
 4  10.6          6.72
 5  12.7         14.1 
 6  11.8          6.72
 7  15.1         13.6 
 8  16.1         12.0 
 9  13.1         13.6 
10  11.0          6.71
# … with 9,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Model evaluation: \(R^2\)

\(R^2\) is the proportion of variability in the interest rate explained by the model.

rsq(interest_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.157
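
For students who want to see where the number comes from, \(R^2\) can be computed by hand in base R. The sketch below uses a small made-up data set (x and y are hypothetical) and compares two calculations: the squared correlation between observed and predicted values, which is how yardstick documents rsq(), and the traditional 1 - SSE/SST. For a least-squares fit with an intercept, evaluated on the data it was fit to, the two agree.

```r
# Hypothetical toy data, not the loans data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.7)

fit  <- lm(y ~ x)
pred <- fitted(fit)

# Squared correlation between observed and predicted values
r2_cor <- cor(y, pred)^2

# Traditional definition: 1 - SSE / SST
r2_trad <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

c(r2_cor = r2_cor, r2_trad = r2_trad, r2_lm = summary(fit)$r.squared)
```

The two definitions can differ once predictions are made on new data, which is worth keeping in mind when comparing training and testing results.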

Model evaluation: \(RMSE\)

\(RMSE\) is a measure of the error in the model predictions.

\[ RMSE = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{n}} \]

rmse(interest_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.59
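
The formula translates directly into one line of base R. This is a small worked example with made-up values (y and y_hat are hypothetical), not the loans data.

```r
# Hypothetical observed values and predictions
y     <- c(3, 5, 7, 9)
y_hat <- c(2, 5, 8, 11)

# Prediction errors: -1, 0, 1, 2
errors <- y_hat - y

# Square the errors, average them, then take the square root
rmse_by_hand <- sqrt(mean(errors^2))

rmse_by_hand
```

Here the squared errors are 1, 0, 1 and 4, so the mean squared error is 1.5 and the RMSE is its square root. The result is in the same units as the response, which makes it easier to explain than the mean squared error.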

Exercise 2: Fit, predict, evaluate

RStudio Cloud > “Model 3 - Tidymodels” > ex-3-2.qmd

Build a workflow and fit the model specified in the exercise. Then calculate predictions and evaluate the model using \(R^2\) and \(RMSE\).


What is a limitation of the model evaluation we’ve done thus far?

Splitting the data

Splitting the data into training and testing sets allows us to evaluate the model on new data.

Split the data.

set.seed(0725) #to get same split each run
loans_split <- initial_split(loans_full_schema, prop = 0.8) #80% training

Save the training data.

loans_train <- training(loans_split)
dim(loans_train)
[1] 8000   58

Save test data.

loans_test <- testing(loans_split)
dim(loans_test)
[1] 2000   58
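
It can help to show students what a random split amounts to. The base R sketch below illustrates the idea on a made-up data frame; it is not how rsample implements initial_split(), and the names toy, train_idx, toy_train and toy_test are hypothetical.

```r
set.seed(725)  # arbitrary seed so the split is reproducible

toy <- data.frame(id = 1:10, y = rnorm(10))  # hypothetical data

n         <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80% of the row numbers

toy_train <- toy[train_idx, ]   # rows drawn for training
toy_test  <- toy[-train_idx, ]  # the remaining rows

dim(toy_train)
dim(toy_test)
```

Every row lands in exactly one of the two sets, so the test rows play no part in fitting the model.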

Fit the model to the training data

We can fit the model specified in interest_workflow (model spec + recipe) to the training data.

interest_train_fit <- interest_workflow |>
  fit(data = loans_train)

tidy(interest_train_fit)
# A tibble: 7 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        11.0      0.0646     170.   0        
2 term60              3.91     0.112       35.1  1.05e-250
3 debt_to_income      0.0402   0.00331     12.2  1.12e- 33
4 delinq_2y(0,1]      1.33     0.171        7.79 7.77e- 15
5 delinq_2y(1,2]      1.75     0.323        5.41 6.38e-  8
6 delinq_2y(2,3]      2.48     0.536        4.63 3.79e-  6
7 delinq_2y(3, Inf]   3.01     0.617        4.88 1.09e-  6

Evaluate performance on training data

Calculate predictions

interest_train_pred <- predict(interest_train_fit, loans_train) |> 
  bind_cols(loans_train |> select(interest_rate))

Evaluate model

rsq(interest_train_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.161

rmse(interest_train_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.55

Evaluate performance on testing data

Calculate predictions

interest_test_pred <- predict(interest_train_fit, loans_test) |> 
  bind_cols(loans_test |> select(interest_rate))

Evaluate model

rsq(interest_test_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.140

rmse(interest_test_pred, truth = interest_rate, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.73

Exercise 3: Train / test data (time permitting)

RStudio Cloud > “Model 3 - Tidymodels” > ex-3-3.qmd

Complete the code below to split the data into training (80%) and testing (20%) sets. Fit the model to the training set, then evaluate the model on the training and test sets.


Putting it all together

Let’s take a look at a full analysis.

RStudio Cloud > “Model 3 - Tidymodels” > ex-3-full-analysis.qmd

Discussion

Discuss the following with your neighbor.

  • What is something from this module you can implement in your data science course?

  • Do you anticipate any challenges / is there anything that makes you hesitant about teaching modeling using tidymodels?

Any other discussion points of interest?
