library(tidyverse)
library(tidymodels)
library(openintro)
Ex 3.1: Feature engineering
teach_ds :: Teaching modern modeling with tidymodels
Introduction
Some of the code below has been pre-populated. In these cases, the code chunk option #| eval: false is set. Make sure to remove this option before running the relevant code chunk to avoid errors when rendering the document.
Data
The data is the loans_full_schema data set from the openintro package, which is featured in the OpenIntro textbooks. It contains information about 10,000 loans made through the Lending Club platform.
The data has one peculiarity: the application_type variable is a factor with an empty level.
levels(loans_full_schema$application_type)
[1] "" "individual" "joint"
Let’s first clean up this variable using the droplevels() function, and let’s apply it to the whole data set in case other variables have similar issues.
loans_full_schema <- droplevels(loans_full_schema)
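To see what droplevels() does in isolation, here is a minimal sketch on a toy factor. The vector app_type is made up for illustration and is not part of the data set; it only mimics a factor that carries an unused empty level.

```r
# Toy factor with an unused empty level (hypothetical data, for illustration only)
app_type <- factor(c("individual", "joint", "individual"),
                   levels = c("", "individual", "joint"))
levels(app_type)

# droplevels() removes any level that no observation uses
app_type_clean <- droplevels(app_type)
levels(app_type_clean)
```

Applied to a data frame, droplevels() does the same for every factor column, which is why a single call on the whole data set is enough.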
The variables we’ll use in this analysis are:
interest_rate: Interest rate of the loan the applicant received.
debt_to_income: Debt-to-income ratio.
public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.
application_type: The type of application: either individual or joint.
Model
The goal is to fit a model to predict the interest rate (interest_rate) based on the debt-to-income ratio (debt_to_income), type of application (application_type), and whether there are any bankruptcies listed in the public record for the individual (bankrupt). The model should allow the effect of debt-to-income ratio to differ based on application type.
\[ \begin{aligned}\widehat{interest\_rate} = b_0 &+ b_1 \times debt\_to\_income \\ &+ b_2 \times application\_type \\ &+ b_3 \times bankrupt \\ &+ b_4 \times debt\_to\_income:application\_type\end{aligned} \]
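To see why the interaction term lets the effect of debt-to-income ratio differ by application type, it can help to write the fitted equation separately for each type. The sketch below assumes application_type enters the model as a 0/1 indicator that equals 1 for joint applications and 0 for individual ones; that coding is an assumption made here for illustration.

\[ \begin{aligned}
\text{individual:}\quad \widehat{interest\_rate} &= b_0 + b_1 \times debt\_to\_income + b_3 \times bankrupt \\
\text{joint:}\quad \widehat{interest\_rate} &= (b_0 + b_2) + (b_1 + b_4) \times debt\_to\_income + b_3 \times bankrupt
\end{aligned} \]

Under this coding, b_2 shifts the intercept for joint applications, and b_4 is the difference in the debt-to-income slope between joint and individual applications.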
Feature engineering with dplyr
The feature engineering to transform and create new variables using dplyr is shown below.
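loans_full_schema_mod <- loans_full_schema |>
  # replace missing debt-to-income values with the mean of the observed values
  mutate(debt_to_income =
    if_else(is.na(debt_to_income), mean(debt_to_income, na.rm = TRUE), debt_to_income)) |>
  # collapse the bankruptcy count into a no/yes factor
  mutate(bankrupt =
    as_factor(if_else(public_record_bankrupt == 0, "no", "yes"))) |>
  # 0/1 indicator for joint applications
  mutate(app_type_joint = if_else(application_type == "joint", 1, 0)) |>
  # interaction between debt-to-income ratio and application type
  mutate(debt_app_int = debt_to_income * app_type_joint) |>
  select(-public_record_bankrupt)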
loans_full_schema_mod |>
  select(debt_to_income, bankrupt, app_type_joint, debt_app_int) |>
  head()
# A tibble: 6 × 4
  debt_to_income bankrupt app_type_joint debt_app_int
           <dbl> <fct>             <dbl>        <dbl>
1          18.0  no                    0          0
2           5.04 yes                   0          0
3          21.2  no                    0          0
4          10.2  no                    0          0
5          58.0  no                    1         58.0
6           6.46 no                    0          0
Feature engineering using recipes
Complete the code below to write the recipe for this model. Call the recipe loans_rec. Use the appropriate step_* functions to complete the feature engineering steps shown in the dplyr pipeline above. You can find the list of step_* functions on the recipes reference page.
loans_rec <- recipe(interest_rate ~ _______,
  data = loans_full_schema) |>
Use prep() and bake() to view the transformed and newly created variables.
loans_rec |>
  prep() |>
  bake(_____) |>
  head()
Discussion
What was easy / straightforward about writing the recipe? What was challenging?