Ex 3.1: Feature engineering

teach_ds :: Teaching modern modeling with tidymodels

library(tidyverse)
library(tidymodels)
library(openintro)

Introduction

Some of the code below has been pre-populated. In these cases, there is a code chunk option set as #| eval: false. Make sure to remove this option prior to running the relevant code chunk to avoid any errors when rendering the document.

Data

The data is the loans_full_schema data set from the openintro package, which is featured in the OpenIntro textbooks. It contains information about 10,000 loans made through the Lending Club platform.

The data has one peculiarity: the application_type variable is a factor with an empty level.

levels(loans_full_schema$application_type)

[1] ""           "individual" "joint"

Let’s remove the empty level with the droplevels() function. Applying it to the whole data set also takes care of any other factor variables with unused levels.

loans_full_schema <- droplevels(loans_full_schema)
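To see what droplevels() does in isolation, here is a minimal sketch on a made-up factor. The objects app_type and toy are hypothetical stand-ins, not part of the loans data, and the example uses base R only.

```r
# Hypothetical factor with a declared but unused empty level,
# mimicking the situation in application_type
app_type <- factor(
  c("individual", "joint", "individual"),
  levels = c("", "individual", "joint")
)
levels(app_type)

# droplevels() keeps only the levels that actually occur in the data
cleaned <- droplevels(app_type)
levels(cleaned)

# Called on a data frame, it does the same for every factor column
toy <- data.frame(app_type = app_type, amount = c(1, 2, 3))
toy_clean <- droplevels(toy)
levels(toy_clean$app_type)
```

Because droplevels() only removes levels with no observations, it is safe to apply to the full data frame: factor columns without unused levels come back unchanged.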

The variables we’ll use in this analysis are:

interest_rate: Interest rate of the loan the applicant received.
debt_to_income: Debt-to-income ratio.
public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.
application_type: The type of application: either individual or joint.

Model

The goal is to fit a model that predicts the interest rate (interest_rate) from the debt-to-income ratio (debt_to_income), the type of application (application_type), and whether any bankruptcies are listed in the public record for the applicant (bankrupt, derived from public_record_bankrupt). The model should allow the effect of the debt-to-income ratio to differ by application type.

\[ \begin{aligned}\widehat{interest\_rate} = b_0 &+ b_1 \times debt\_to\_income \\ &+ b_2 \times application\_type \\ &+ b_3 \times bankrupt \\ &+ b_4 \times debt\_to\_income:application\_type\end{aligned} \]
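One way to read the interaction term is to write the model out separately for each application type. The sketch below assumes application_type enters as a 0/1 indicator with joint coded as 1 (individual is the reference level) and that bankrupt is likewise a 0/1 indicator. These codings are assumptions made for illustration.

\[ \begin{aligned}\text{individual:}\quad \widehat{interest\_rate} &= b_0 + b_1 \times debt\_to\_income + b_3 \times bankrupt \\ \text{joint:}\quad \widehat{interest\_rate} &= (b_0 + b_2) + (b_1 + b_4) \times debt\_to\_income + b_3 \times bankrupt\end{aligned} \]

Under this coding, b_2 shifts the intercept for joint applications and b_4 is the difference in the debt-to-income slope between joint and individual applications.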

Feature engineering with dplyr

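The dplyr pipeline below performs the feature engineering: it imputes missing debt_to_income values with the mean, recodes public_record_bankrupt as a no/yes factor called bankrupt, creates a 0/1 indicator for joint applications, and multiplies that indicator by debt_to_income to form the interaction term.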

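loans_full_schema_mod <- loans_full_schema |>
  # Replace missing debt-to-income values with the mean of the observed values
  mutate(debt_to_income =
           if_else(is.na(debt_to_income), mean(debt_to_income, na.rm = TRUE), debt_to_income)) |>
  # Recode the bankruptcy count as a no/yes factor
  mutate(bankrupt =
           as_factor(if_else(public_record_bankrupt == 0, "no", "yes"))) |>
  # 0/1 indicator for joint applications
  mutate(app_type_joint = if_else(application_type == "joint", 1, 0)) |>
  # Interaction between debt-to-income ratio and application type
  mutate(debt_app_int = debt_to_income * app_type_joint) |>
  select(-public_record_bankrupt)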

loans_full_schema_mod |>
  select(debt_to_income, bankrupt, app_type_joint, debt_app_int) |>
  head()

# A tibble: 6 × 4
  debt_to_income bankrupt app_type_joint debt_app_int
           <dbl> <fct>             <dbl>        <dbl>
1          18.0  no                    0          0  
2           5.04 yes                   0          0  
3          21.2  no                    0          0  
4          10.2  no                    0          0  
5          58.0  no                    1         58.0
6           6.46 no                    0          0
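The hand-built interaction column is the same quantity that R's formula interface generates for a debt_to_income:application_type term. The sketch below checks this on a small made-up data frame (toy is hypothetical, not the loans data) using base R's model.matrix().

```r
# Hypothetical toy data with the two variables involved in the interaction
toy <- data.frame(
  debt_to_income   = c(18.0, 5.04, 58.0, 6.46),
  application_type = factor(c("individual", "individual", "joint", "individual"))
)

# Interaction built by hand, as in the dplyr pipeline
by_hand <- toy$debt_to_income * ifelse(toy$application_type == "joint", 1, 0)

# Design matrix R builds for main effects plus their interaction
mm <- model.matrix(~ debt_to_income * application_type, data = toy)
colnames(mm)

# The formula-generated interaction column matches the hand-built one
from_formula <- unname(mm[, "debt_to_income:application_typejoint"])
all.equal(from_formula, by_hand)
```

The interaction is zero for individual applications and equal to debt_to_income for joint ones, which is why the fifth row of the output above shows 58.0 in debt_app_int.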

Feature engineering using recipes

Complete the code below to write the recipe for this model. Call the recipe loans_rec. Use the appropriate step_* functions to complete the feature engineering steps shown in the dplyr pipeline above. You can find the list of step_* functions on the recipes reference page.

loans_rec <- recipe(interest_rate ~ _______,
                       data = loans_full_schema) |>

Use prep() and bake() to view the transformed and newly created variables.

loans_rec |>
  prep() |>
  bake(_____) |>
  head()
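For comparison once you have attempted the exercise, here is one possible sketch of such a recipe. It runs on a small made-up data set (toy_loans and toy_rec are hypothetical names) so that it is self-contained, and the choice of steps is one reasonable mapping of the dplyr pipeline rather than the only valid answer.

```r
library(dplyr)
library(recipes)

# Hypothetical toy data standing in for loans_full_schema
toy_loans <- tibble(
  interest_rate          = c(10.9, 7.3, 14.1, 9.4, 12.6, 6.7),
  debt_to_income         = c(18.0, NA, 21.2, 10.2, 58.0, 6.5),
  public_record_bankrupt = c(0, 1, 0, 0, 0, 2),
  application_type       = factor(c("individual", "individual", "joint",
                                    "individual", "joint", "individual"))
)

toy_rec <- recipe(
  interest_rate ~ debt_to_income + public_record_bankrupt + application_type,
  data = toy_loans
) |>
  # Mean-impute missing debt-to-income values
  step_impute_mean(debt_to_income) |>
  # Recode the bankruptcy count as a no/yes factor, then drop the original
  step_mutate(
    bankrupt = factor(if_else(public_record_bankrupt == 0, "no", "yes"),
                      levels = c("no", "yes"))
  ) |>
  step_rm(public_record_bankrupt) |>
  # Turn application_type into a 0/1 indicator column
  step_dummy(application_type) |>
  # Interaction between the indicator column(s) and debt_to_income
  step_interact(terms = ~ starts_with("application_type"):debt_to_income)

# new_data = NULL returns the processed training data
toy_baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

toy_baked
```

Unlike the dplyr version, the recipe estimates the imputation mean during prep() and stores it, so the same value can be reused when bake() is later applied to new data.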

Discussion

What was easy / straightforward about writing the recipe? What was challenging?