library(tidyverse)
library(tidymodels)
library(openintro)
Ex 3.3: Train / test data
teach_ds :: Teaching modern modeling with tidymodels
Introduction
Some of the code below has been pre-populated. In these cases, there is a code chunk option set as #| eval: false. Make sure to remove this option prior to running the relevant code chunk to avoid any errors when rendering the document.
Model
The goal is to fit a model to predict the interest rate (interest_rate) based on the debt to income ratio (debt_to_income), type of application (application_type), and whether there are any bankruptcies listed in the public record for the individual (bankrupt). The model should allow the effect of debt to income ratio to differ based on application type.
\[ \begin{align}\widehat{interest\_rate} = b_0 &+ b_1 \times debt\_to\_income \\ &+ b_2 \times application\_type \\ &+ b_3 \times bankrupt \\ &+ b_4 \times debt\_to\_income:application\_type\end{align} \]
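To see what the interaction term does, it can help to treat application_type as an indicator that is 0 for one application type and 1 for the other. This coding is an assumption made here for illustration; which level acts as the reference depends on the data. Substituting the two values into the model gives two lines with different intercepts and different slopes:

\[ \begin{align}application\_type = 0: \quad \widehat{interest\_rate} &= b_0 + b_1 \times debt\_to\_income + b_3 \times bankrupt \\ application\_type = 1: \quad \widehat{interest\_rate} &= (b_0 + b_2) + (b_1 + b_4) \times debt\_to\_income + b_3 \times bankrupt\end{align} \]

Under this coding, \(b_4\) is the difference in the debt to income slope between the two application types.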
Recipe + Workflow
This exercise builds on Exercises 1 and 2. To get you started, the recipe created in Exercise 1 and workflow from Exercise 2 are below.
loans_rec <- recipe(interest_rate ~ debt_to_income + application_type +
                      public_record_bankrupt, data = loans_full_schema) |>
  step_impute_mean(debt_to_income) |>
  step_mutate(bankrupt = as_factor(if_else(public_record_bankrupt == 0,
                                           "no", "yes"))) |>
  step_rm(public_record_bankrupt) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(terms = ~ starts_with("application_type"):debt_to_income)
loans_spec <- linear_reg() |>
  set_engine("lm")
loans_workflow <- workflow() |>
  add_model(loans_spec) |>
  add_recipe(loans_rec)
Train / test data
Complete the code below to split the data into training (80%) and testing (20%) sets.
set.seed(12345)
loans_split <- initial_split(_____)
loans_train <- _____
loans_test <- ______
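The splitting pattern can be sketched on a different dataset, leaving the blanks above for you to complete. The example below uses R's built-in mtcars data rather than the loans data, and the cars_* names are hypothetical. Note that initial_split() defaults to prop = 3/4, so an 80% training share has to be requested explicitly.

```r
library(tidymodels)

set.seed(12345)

# Put 80% of the rows in training; the remaining rows form the test set
cars_split <- initial_split(mtcars, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Check the sizes of the two sets
nrow(cars_train)
nrow(cars_test)
```

Setting the seed before the split makes the random assignment of rows reproducible across renders.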
Fit the model to training data
Fit the model to the training data.
loans_train_fit <- ______
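As a rough illustration of fitting a workflow to training data only, here is a minimal sketch on the built-in mtcars data. The dataset, the formula mpg ~ wt + hp, and the cars_* names are assumptions made for this example and are not part of the exercise.

```r
library(tidymodels)

set.seed(12345)
cars_split <- initial_split(mtcars, prop = 0.8)
cars_train <- training(cars_split)

# A workflow bundles the preprocessing recipe with the model specification
cars_workflow <- workflow() |>
  add_recipe(recipe(mpg ~ wt + hp, data = cars_train)) |>
  add_model(linear_reg() |> set_engine("lm"))

# fit() estimates the recipe steps and the model using the training rows only
cars_train_fit <- fit(cars_workflow, data = cars_train)

# Coefficient table for the fitted model
tidy(cars_train_fit)
```

Passing only the training rows to fit() keeps the test rows unseen, which is what makes the later test-set metrics an honest check.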
Evaluate model
Calculate \(R^2\) and \(RMSE\) for the training and test data.
# calculate predictions (training data)
# calculate r-sq (training data)
# calculate rmse (training data)
# calculate predictions (test data)
# calculate r-sq (test data)
# calculate rmse (test data)
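The steps named in the comments above can be sketched on a stand-in example. This version again uses the built-in mtcars data with hypothetical cars_* names, so it shows the pattern without completing the exercise. It relies on predict() for a fitted workflow returning a tibble with a .pred column, and on rsq() and rmse() from yardstick.

```r
library(tidymodels)

set.seed(12345)
cars_split <- initial_split(mtcars, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

cars_fit <- workflow() |>
  add_recipe(recipe(mpg ~ wt + hp, data = cars_train)) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = cars_train)

# Training data: predictions, then R-squared and RMSE
cars_train_pred <- predict(cars_fit, new_data = cars_train) |>
  bind_cols(cars_train |> select(mpg))
rsq(cars_train_pred, truth = mpg, estimate = .pred)
rmse(cars_train_pred, truth = mpg, estimate = .pred)

# Test data: same steps, using rows the model has not seen
cars_test_pred <- predict(cars_fit, new_data = cars_test) |>
  bind_cols(cars_test |> select(mpg))
rsq(cars_test_pred, truth = mpg, estimate = .pred)
rmse(cars_test_pred, truth = mpg, estimate = .pred)
```

Comparing the two sets of metrics is the point: training metrics that look much better than test metrics suggest the model is fitting noise in the training data.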
Discussion
Does the model perform well if the primary objective is to
- explain variability in interest rates?
- predict the interest rate for an application?
Why or why not?