rstudio::conf(2022)
Designing the data science classroom
Mine Çetinkaya-Rundel
Assumption 1:
Teach authentic tools
Assumption 2:
Teach R as the authentic tool
The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.
Data: Thousands of loans made through Lending Club, a peer-to-peer lending platform. The data are available in the openintro package and are used here with a few modifications.
library(tidyverse)
library(openintro)
loans <- loans_full_schema |>
  mutate(
    homeownership = str_to_title(homeownership),
    bankruptcy = if_else(public_record_bankrupt >= 1, "Yes", "No")
  ) |>
  filter(annual_income >= 10) |>
  select(
    loan_amount, homeownership, bankruptcy,
    application_type, annual_income, interest_rate
  )
# A tibble: 9,976 × 6
loan_amount homeownership bankruptcy application_type annual_income interest_rate
<int> <chr> <chr> <fct> <dbl> <dbl>
1 28000 Mortgage No individual 90000 14.1
2 5000 Rent Yes individual 40000 12.6
3 2000 Rent No individual 40000 17.1
4 21600 Rent No individual 30000 6.72
5 23000 Rent No joint 35000 14.1
6 5000 Own No individual 34000 6.72
# … with 9,970 more rows
Calculate the mean loan amount.
mean(loan_amount)
Error in mean(loan_amount): object 'loan_amount' not found
How would you calculate the mean loan amount?
Add your answer to Discord.
Approach 1: With attach():
Not recommended. What if you were working concurrently with another data frame called car_loans that also had a variable called loan_amount in it?
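The masking problem described above can be sketched with two small made-up data frames. Both `loans_toy` and the contents of `car_loans` are hypothetical stand-ins for illustration, not data from the workshop.

```r
# Toy stand-ins; all values are invented for illustration.
loans_toy <- data.frame(loan_amount = c(28000, 5000, 2000))
car_loans <- data.frame(loan_amount = c(15000, 22000))

attach(loans_toy)
attach(car_loans)  # R reports that loan_amount from loans_toy is now masked

# The bare name resolves to the most recently attached data frame (car_loans),
# which is easy to miss when reading the code later.
mean_after_attach <- mean(loan_amount)
mean_after_attach

# Clean up the search path.
detach(car_loans)
detach(loans_toy)
```

Nothing in the `mean(loan_amount)` call says which data frame is being summarized, which is the source of the confusion.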
Approach 2: Using $:
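A minimal sketch of the `$` approach on a made-up data frame (`loans_toy` is a hypothetical stand-in for `loans`). The `with()` call is included only as another base R option, since the slides mention `with()` later. It is not presented as the slide's own code.

```r
# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- data.frame(loan_amount = c(28000, 5000, 2000))

# Extract the column as a vector with $, then summarize the vector.
mean_dollar <- mean(loans_toy$loan_amount)

# with() evaluates the expression using the data frame's columns as variables.
mean_with <- with(loans_toy, mean(loan_amount))

mean_dollar
mean_with
```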
Approach 4: The tidyverse approach:
# A tibble: 1 × 1
mean_loan_amount
<dbl>
1 16358.
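Output of this shape can be produced with a single `summarize()` call. The sketch below assumes dplyr is installed and uses a made-up `loans_toy` tibble, so the number it returns differs from the one shown above.

```r
library(dplyr)

# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- tibble(loan_amount = c(28000, 5000, 2000))

# Data frame in, data frame out: the result is a 1 x 1 tibble.
result <- loans_toy |>
  summarize(mean_loan_amount = mean(loan_amount))

result
```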
Tidyverse functions take a data argument that allows them to localize computations inside the specified data frame
Does not muddy the concept of what is in the current environment: variables are always accessed from within a data frame, without the use of an additional function (like with()) or quotation marks, never as a vector
RStudio Cloud > “Module 2 - Tidyverse” > ex-2-1.qmd
Based on the applicants’ home ownership status, compute the number of applicants and the average loan amount. Display the results in descending order of number of applicants.
Compare answers with your neighbor and choose an approach you would teach in an intro course. Then, type your chosen answer on Discord along with some narrative describing how you would teach it.
Homeownership | Number of applicants | Average loan amount |
---|---|---|
Mortgage | 4,778 | $18,132 |
Rent | 3,848 | $14,396 |
Own | 1,350 | $15,665 |
[input] data frame
# A tibble: 9,976 × 6
# Groups: homeownership [3]
loan_amount homeownership bankruptcy application_type annual_income interest_rate
<int> <chr> <chr> <fct> <dbl> <dbl>
1 28000 Mortgage No individual 90000 14.1
2 5000 Rent Yes individual 40000 12.6
3 2000 Rent No individual 40000 17.1
4 21600 Rent No individual 30000 6.72
5 23000 Rent No joint 35000 14.1
6 5000 Own No individual 34000 6.72
# … with 9,970 more rows
data frame [output]
[input] data frame
loans |>
  group_by(homeownership) |>
  summarize(
    n_applicants = n(),
    avg_loan_amount = mean(loan_amount)
  ) |>
  arrange(desc(n_applicants))
# A tibble: 3 × 3
homeownership n_applicants avg_loan_amount
<chr> <int> <dbl>
1 Mortgage 4778 18132.
2 Rent 3848 14396.
3 Own 1350 15665.
[output] data frame
aggregate()
formula syntax
passing functions as arguments
merging datasets
square bracket notation for accessing rows
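One possible base R version of the grouped summary using `aggregate()` is sketched below. It is an illustration on a made-up `loans_toy` data frame, not the code from the slides. It touches each concept listed above: formula syntax, passing a function as an argument, merging, and square brackets for row ordering.

```r
# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- data.frame(
  homeownership = c("Mortgage", "Rent", "Rent", "Own", "Mortgage", "Rent"),
  loan_amount   = c(28000, 5000, 2000, 5000, 21600, 23000)
)

# Formula syntax plus a function passed as an argument, once per statistic.
res_n   <- aggregate(loan_amount ~ homeownership, data = loans_toy, FUN = length)
res_avg <- aggregate(loan_amount ~ homeownership, data = loans_toy, FUN = mean)
names(res_n)[2]   <- "n_applicants"
names(res_avg)[2] <- "avg_loan_amount"

# Merge the two summaries, then reorder rows with square bracket notation.
res <- merge(res_n, res_avg, by = "homeownership")
res <- res[order(res$n_applicants, decreasing = TRUE), ]
res
```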
tapply()
Not so good:
apply() functions
array
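A `tapply()` version might look like the sketch below, again on a made-up `loans_toy` data frame rather than the workshop data. Note that `tapply()` returns an array rather than a data frame, so the results have to be reassembled by hand.

```r
# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- data.frame(
  homeownership = c("Mortgage", "Rent", "Rent", "Own", "Mortgage", "Rent"),
  loan_amount   = c(28000, 5000, 2000, 5000, 21600, 23000)
)

# Each call returns a named one-dimensional array, not a data frame.
n_by_home   <- tapply(loans_toy$loan_amount, loans_toy$homeownership, length)
avg_by_home <- tapply(loans_toy$loan_amount, loans_toy$homeownership, mean)
is.array(avg_by_home)

# Rebuild a data frame and sort it by number of applicants.
res <- data.frame(
  homeownership   = names(n_by_home),
  n_applicants    = as.vector(n_by_home),
  avg_loan_amount = as.vector(avg_by_home)
)
res <- res[order(res$n_applicants, decreasing = TRUE), ]
res
```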
RStudio Cloud > “Module 2 - Tidyverse” > ex-2-2.qmd
Using the loans data, create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.
Compare answers with your neighbor and choose an approach you would teach in an intro course. Then, type your chosen answer along with some narrative describing how you would teach it.
See next slide for desired output.
ggplot()
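One way the faceted box plots could be written with ggplot2 is sketched here. The simulated `loans_toy` data frame is a hypothetical stand-in for `loans`, and the code is a plausible answer to the exercise, not the solution shown in the workshop.

```r
library(ggplot2)

# Simulated stand-in for the loans data; values are invented for illustration.
set.seed(2022)
loans_toy <- data.frame(
  loan_amount      = round(runif(60, 1000, 40000)),
  application_type = rep(c("individual", "joint"), times = 30),
  homeownership    = rep(c("Mortgage", "Rent", "Own"), each = 20)
)

# Side-by-side box plots by application type, one panel per homeownership level.
p <- ggplot(loans_toy, aes(x = application_type, y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~ homeownership)

p
```

Adding the faceting variable takes one extra layer and leaves the rest of the plot code unchanged.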
boxplot()
Visualize the relationship between interest rate and annual income, conditioned on whether the applicant had a bankruptcy.
ggplot()
# scale_color_OkabeIto() comes from the colorblindr package
ggplot(loans,
       aes(y = interest_rate, x = annual_income,
           color = bankruptcy)) +
  geom_point(alpha = 0.1) +
  # linewidth requires ggplot2 >= 3.4.0; use size = 2 on older versions
  geom_smooth(method = "lm", linewidth = 2, se = FALSE) +
  scale_x_log10(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +
  scale_color_OkabeIto() +
  labs(x = "Annual Income", y = "Interest Rate",
       color = "Previous\nBankruptcy") +
  theme_minimal(base_size = 18)
plot()
# From the OkabeIto palette
cols = c(No = "#e6a003", Yes = "#57b4e9")
plot(
  loans$annual_income,
  loans$interest_rate,
  pch = 16,
  col = adjustcolor(cols[loans$bankruptcy], alpha.f = 0.1),
  log = "x",
  xlab = "Annual Income ($)",
  ylab = "Interest Rate (%)",
  xaxp = c(1000, 10000000, 1)
)
lm_b_no = lm(
  interest_rate ~ log10(annual_income),
  data = loans[loans$bankruptcy == "No",]
)
lm_b_yes = lm(
  interest_rate ~ log10(annual_income),
  data = loans[loans$bankruptcy == "Yes",]
)
abline(lm_b_no, col = cols["No"], lwd = 3)
abline(lm_b_yes, col = cols["Yes"], lwd = 3)
legend(
  "topright",
  legend = c("Yes", "No"),
  title = "Previous\nBankruptcy",
  col = cols[c("Yes", "No")],
  pch = 16, lwd = 1
)
Modeling and inference with tidymodels:
A unified interface to modeling functions available in a large variety of packages
Sticking to the data frame in / data frame out paradigm
Guardrails for methodology
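As a rough illustration of the data frame in / data frame out idea, the sketch below fits a linear model with tidymodels and returns the coefficients as a tibble. It assumes the tidymodels package is installed. The simulated `loans_toy` data are hypothetical, and the model is only an example, not one taken from the workshop.

```r
library(tidymodels)

# Simulated stand-in for the loans data; values are invented for illustration.
set.seed(2022)
loans_toy <- tibble(
  annual_income = round(10^runif(50, 4, 5.5)),
  interest_rate = 20 - 2 * log10(annual_income) + rnorm(50)
)

# Specify the model, then fit it to a data frame.
loans_fit <- linear_reg() |>
  fit(interest_rate ~ log10(annual_income), data = loans_toy)

# tidy() returns the coefficient table as a tibble.
coefs <- tidy(loans_fit)
coefs
```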
Next module is teaching with tidymodels!
No matter which approach or tool you use, you should strive to be consistent in the classroom whenever possible
Tidyverse offers consistency, something we believe to be of the utmost importance, allowing students to move knowledge about function arguments to their long-term memory
Challenge: Google and Stack Overflow can be less useful – demo problem solving
Counter-proposition: teach all (or multiple) syntaxes at once – trying to teach two (or more!) syntaxes at once will slow the pace of the course, introduce unnecessary syntactic confusion, and make it harder for students to complete their work.
“Disciplined in what we teach, liberal in what we accept”
Mix with base R code or code from other packages
In fact, you can’t not mix with base R code!
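A small sketch of why mixing is unavoidable: the summary functions inside `summarize()` below come from base R and its bundled stats package, not from the tidyverse. The `loans_toy` tibble is a made-up stand-in for `loans`.

```r
library(dplyr)

# Toy stand-in for the loans data; values are invented for illustration.
loans_toy <- tibble(
  homeownership = c("Mortgage", "Rent", "Rent", "Mortgage", "Rent"),
  loan_amount   = c(28000, 5000, 2000, 21600, 23000)
)

# group_by() and summarize() are dplyr verbs; mean() and round() are base R,
# and median() comes from the stats package that ships with R.
summary_tbl <- loans_toy |>
  group_by(homeownership) |>
  summarize(
    avg_loan_amount    = round(mean(loan_amount)),
    median_loan_amount = median(loan_amount)
  )

summary_tbl
```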
Adding a new variable to a visualization or a new summary statistic doesn’t require introducing numerous functions, interfaces, and data structures
Interfaces designed with user experience (and learning) in mind
Continuous feedback collection and iterative improvements based on user experiences improve functions’ and packages’ usability (and learnability)
Interfaces that are designed to produce readable code
The encouraging and inclusive tidyverse community is one of the benefits of the paradigm
Each package comes with a website; these websites are similarly laid out, display the results of example code, and include extensive vignettes that describe how to use various functions from the package together
Get SQL for free with dplyr verbs!
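One way to see the SQL that dplyr verbs translate to is with the dbplyr package. The sketch below assumes dbplyr is installed and uses a simulated SQLite backend, so no real database is needed. The `loans_db` table is a hypothetical stand-in for `loans`.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated SQLite connection; no data are queried.
loans_db <- lazy_frame(
  homeownership = "Rent",
  loan_amount   = 5000,
  con = simulate_sqlite()
)

# The same verbs used on a data frame build up a SQL query instead.
query <- loans_db |>
  group_by(homeownership) |>
  summarize(
    n_applicants    = n(),
    avg_loan_amount = mean(loan_amount, na.rm = TRUE)
  ) |>
  arrange(desc(n_applicants))

# Print the generated SQL, and keep a copy as text.
show_query(query)
sql_text <- as.character(sql_render(query))
```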
Start with library(tidyverse)
Teach by learning goals, not packages
Blog posts highlight updates, along with the reasoning behind them and worked examples
Lifecycle stages and badges
We are all converts to the tidyverse and have made a conscious choice to use it in our research and our teaching. We each learned R without the tidyverse and have all spent quite a few years teaching without it at a variety of levels from undergraduate introductory statistics courses to graduate statistical computing courses. This paper is a synthesis of the reasons supporting our tidyverse choice, along with benefits and challenges associated with teaching statistics with the tidyverse.
Do you teach with the tidyverse?
Any other discussion points of interest?
Discuss with your partner for a few minutes first, before sharing with the big group.
Let’s take a look at the source code for these slides for some of the highlighting tricks!
🔗 rstd.io/teach-ds-conf22 / Module 2