Ex 3.3: Train / test data

teach_ds :: Teaching modern modeling with tidymodels

library(tidyverse)
library(tidymodels)
library(openintro)

Introduction

Some of the code below has been pre-populated. In these cases, there is a code chunk option set as #| eval: false. Make sure to remove this option prior to running the relevant code chunk to avoid any errors when rendering the document.

Model

The goal is to fit a model to predict the interest rate (interest_rate) based on the debt to income ratio (debt_to_income), type of application (application_type), and whether there are any bankruptcies listed in the public record for the individual (bankrupt). The model should allow the effect of debt to income ratio to differ based on application type.

\[ \begin{align}\widehat{interest\_rate} = b_0 &+ b_1 \times debt\_to\_income \\ &+ b_2 \times application\_type \\ &+ b_3 \times bankrupt \\ &+ b_4 \times debt\_to\_income:application\_type\end{align} \]

Recipe + Workflow

This exercise builds on Exercises 1 and 2. To get you started, the recipe created in Exercise 1 and workflow from Exercise 2 are below.

loans_rec <- recipe(interest_rate ~ debt_to_income + application_type + 
                      public_record_bankrupt, data = loans_full_schema) |>
  step_impute_mean(debt_to_income) |>
  step_mutate(bankrupt = as_factor(if_else(public_record_bankrupt == 0, 
                                           "no", "yes"))) |>
  step_rm(public_record_bankrupt) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(terms = ~ starts_with("application_type"):debt_to_income)
loans_spec <- linear_reg() |>
  set_engine("lm")

loans_workflow <- workflow() |>
  add_model(loans_spec) |>
  add_recipe(loans_rec)

Train / test data

Complete the code below to split the data into training (80%) and testing (20%) sets.

set.seed(12345)
loans_split <- initial_split(_____)
loans_train <- _____
loans_test <- ______

Fit the model to training data

Fit the model to the training data.

loans_train_fit <- ______

Evaluate model

Calculate \(R^2\) and \(RMSE\) for the training and test data.

# calculate predictions 

# calculate r-sq

# calculate rmse
# calculate predictions 

# calculate r-sq

# calculate rmse

Discussion

Does the model perform well if the primary objective is to

  • explain variability in interest rates?
  • predict the interest rate for an application?

Why or why not?