Hello #teachds

rstudio::conf(2022)
Designing the data science classroom

Mine Çetinkaya-Rundel + Maria Tackett

Welcome

Introductions

Dr. Mine Çetinkaya-Rundel

Dr. Maria Tackett

Teaching Assistants

Becky Tang - Duke University
Simon Couch - RStudio

Your turn!

Introduce yourselves:

Name
Affiliation and/or where you’re joining from (geographically)
Where you are in your teaching (or learning) journey
Your favourite thing to teach

Workshop materials

One link for all materials

🔗 https://rstd.io/teach-ds-conf22

Schedule

Day 1

Time	Activity
09:00 - 10:30	Hello #teachds
10:30 - 11:00	Coffee break
11:00 - 12:30	Teaching data science with the tidyverse
12:30 - 13:30	Lunch break
13:30 - 15:00	Teaching modern modeling with tidymodels
15:00 - 15:30	Coffee break
15:30 - 17:00	Interactivity and immediate feedback with learnr

Day 2

Time	Activity
09:00 - 10:30	Computing infrastructure with RStudio Cloud
10:30 - 11:00	Coffee break
11:00 - 12:30	Reproducible workflows: Quarto, Git, GitHub
12:30 - 13:30	Lunch break
13:30 - 15:00	Making a data package
15:00 - 15:30	Coffee break
15:30 - 17:00	Organizing teaching materials + Wrap-up / Q&A

WiFi

Username: conf22

Password: together!

Code of Conduct

All details are available at https://www.rstudio.com/conference/2022/2022-conf-code-of-conduct/. Please review them carefully.

You can report Code of Conduct violations in person (to any rstudio::conf staff member), by email (conf@rstudio.com), or by phone (844-448-1212). Please see the policy linked above for contact information.

COVID-19 specific policies:

RStudio requires that you wear a mask that fully covers your mouth and nose at all times in all public spaces.
We strongly recommend that you use a correctly fitted N95, KN95, or similar particulate filtering mask; there is a limited supply available upon request.

Other useful info

There are gender neutral bathrooms by the National Harbor rooms.
The meditation room is located at National Harbor 9. Open 8am - 5pm, Monday - Thursday. The hotel also has a dedicated room behind the reception.
The lactation room is located at Potomac Dressing Room. Open 8am - 5pm, Monday - Thursday.
Participants who do not wish to be photographed have red lanyards. Please note everyone’s lanyard color before taking a photo and respect their choices.

Asking for help (Stickies)

I’m stuck

I’m done

I have a general question

Discord

You should have received an email with an invitation and instructions for joining the conference’s Discord server.

This workshop has a private channel under Workshops:

#📚designing-the-data-science-classroom

This is a great place to ask questions, share responses to exercises, post resources, memes, or most anything else before, during, and after the workshop.

Take a minute to

Join the Discord server for conf: https://discord.gg/FRxNvG7KP9.
If you’re already in, let us know if you’re not in the channel for this workshop.

Computational Environment

RStudio Cloud

You can use the following link to join the workshop’s RStudio Cloud space:

rstd.io/teach-ds-conf22-cloud

Once you have joined, navigate to Projects on the top menu.

Using your own system

If you’d like to use your own system, please see https://rstudio-conf-2022.github.io/teach-ds/#install.

Principles for designing introductory data science curricula

Baking a cake

Imagine you’re new to baking, and you’re in a baking class. I’m going to present two options for starting the class. Which one gives you a better sense of the final product?

Baking a cake

Today we’re going to make a pineapple and coconut sandwich sponge cake with these ingredients

OK, hold on to that thought!

Design foundation 1: Backwards design

Set goals for educational curriculum before choosing instructional methods + forms of assessment

Identify desired results
Determine acceptable evidence
Plan learning experiences and instruction

Designing backwards

Identify desired data analysis results
Determine building blocks
Plan learning experiences and instruction

Design foundation 2: GAISE

2016 Guidelines for Assessment and Instruction in Statistics Education

Teach statistical thinking.
- Teach statistics as an investigative process of problem-solving and decision making.
- Give students experience with multivariable thinking […] to answer challenging questions that require them to investigate and explore relationships among many variables.

Focus on conceptual understanding.
Integrate real data with a context and purpose.
Foster active learning.
Use technology to explore concepts and analyse data.
Use assessments to improve and evaluate student learning.

NOT: teach a commonly used subset of tests and intervals and produce them with hand calculations

Multivariate analysis requires the use of computing

NOT use technology that is only applicable in the intro course or that doesn’t follow good science principles

Data analysis isn’t just inference and modelling, it’s also data importing, cleaning, preparation, exploration, and visualization

So, where do we go with all this?

Discussion

Discuss in pairs and then as a large group.

What are your first reactions to the curriculum we just described for an intro data science course?
Which components can you see yourself including (or do you already include) in an intro data science curriculum?
Which components do you have reservations about, and why?

05:00

Which kitchen would you rather bake in?

Design principle 1: Cherish day one

Install R
Install RStudio
Install the following packages:
- tidyverse
- rmarkdown
- …
Load these packages
Install git

Go to rstudio.cloud (or some other server based solution)
Log in with your ID & pass

> hello R!

Your turn: UN Votes

Go to rstd.io/teach-ds-conf22-cloud to join the RStudio Cloud workspace for this workshop > Log in > Project (top left) > Start “Module 1 - Hello” > ex-1-1.qmd

Open the Quarto document called ex-1-1.qmd, render the document, view the result. Then, change “Turkey” to another country, and render again.

15:00

How do you prefer your cake recipes? Words only, or words & pictures?

Design principle 2: Start with cake

Open today’s demo project
Knit the document and discuss the visualisation you made with your neighbor
Then, change Turkey to a different country, and plot again

x <- 8
y <- "monkey"
z <- FALSE

class(x)

[1] "numeric"

class(y)

[1] "character"

class(z)

[1] "logical"

Practically speaking…

With great examples comes a great amount of code…
So explicitly encourage students to focus on the task at hand

Open today’s demo project
Knit the document and discuss the visualization you made with your neighbor
Then, change Turkey to a different country, and plot again

Focusing on the task at hand

un_votes |>
  filter(country %in% c("United States", "Turkey")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  mutate(issue = ifelse(issue == "Nuclear weapons and nuclear material",
                        "Nuclear weapons and materials", as.character(issue))) |>
  group_by(country, year = year(date), issue) |>
  summarize(
    votes = n(),
    percent_yes = mean(vote == "yes")
    ) |>
  filter(votes > 5) |>
  ggplot(mapping = aes(x = year, y = percent_yes, color = country)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +
    facet_wrap(~ issue) +
    labs(
      title = "Percentage of Yes votes in the UN General Assembly",
      subtitle = "1946 to 2015",
      y = "% Yes",
      x = "Year",
      color = "Country"
    )

Focusing on the task at hand

un_votes |>
  filter(country %in% c("United States", "Turkey")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  group_by(country, year = year(date), issue) |>
  summarize(
    votes = n(),
    perc_yes = mean(vote == "yes")
    ) |>
  filter(votes > 5) |>
  ggplot(mapping = aes(x = year, y = perc_yes, color = country)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +
    facet_wrap(~ issue) +
    labs(
      title = "Percentage of Yes votes in the UN General Assembly",
      subtitle = "1946 to 2015",
      y = "% Yes", x = "Year", color = "Country"
    )

Focusing on the task at hand

un_votes |>
  filter(country %in% c("United States", "France")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  group_by(country, year = year(date), issue) |>
  summarize(
    votes = n(),
    perc_yes = mean(vote == "yes")
    ) |>
  filter(votes > 5) |>
  ggplot(mapping = aes(x = year, y = perc_yes, color = country)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +
    facet_wrap(~ issue) +
    labs(
      title = "Percentage of Yes votes in the UN General Assembly",
      subtitle = "1946 to 2015",
      y = "% Yes", x = "Year", color = "Country"
    )

Your turn: Exercise design

RStudio Cloud > “Module 1 - Hello” > ex-1-2.qmd

Your challenge is to go from nothing to a data visualization in 15 minutes of your first class. Don’t worry about the computing infrastructure (we’ll get to that later in the day), assume students have successfully landed in RStudio Cloud like you did earlier. Design an exercise for them to “create” their first visualization.
If you need inspiration, you can use the ggplot2::diamonds or dplyr::starwars dataset or any dataset from nycflights13 or gapminder packages.
If you already have a first day exercise you like, you’re welcome to modify it to fit the challenge: from nothing to a data visualization in 15 minutes!
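
One possible starting point, sketched under the assumption that ggplot2 is already installed in the students' environment (as it would be in a prepared RStudio Cloud project). It uses the ggplot2::diamonds dataset mentioned above; the choice of variables, the transparency setting, and the labels are illustrative rather than a prescribed solution.

```r
# A first-day sketch: from nothing to a multivariable plot in one chunk.
# Assumes ggplot2 is installed; variable choices are illustrative only.
library(ggplot2)

first_plot <- ggplot(
  data = diamonds,
  mapping = aes(x = carat, y = price, color = cut)
) +
  geom_point(alpha = 0.3) +   # semi-transparent points reduce overplotting
  labs(
    title = "Diamond price by carat and cut",
    x = "Carat", y = "Price (US dollars)", color = "Cut"
  )

first_plot
```

A natural follow-up task is to ask students to change `color = cut` to `color = clarity` and render again, mirroring the "change Turkey to another country" step.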

15:00

Which motivates you more to learn how to cook: perfectly chopped onions or ratatouille?

Design principle 3: Skip baby steps

Practically speaking…

Non-trivial examples can be motivating, but need to avoid…

Scaffold and layer in between!

Discussion

The following code is used to create the multivariate visualisation we saw earlier. How much of the code would you show/hide when you are just starting to teach ggplot2?

un_votes |>
  filter(country %in% c("United States")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  mutate(
    importantvote = ifelse(importantvote == 0, "No", "Yes"),
    issue = ifelse(issue == "Nuclear weapons and nuclear material", "Nuclear weapons and materials", as.character(issue))
    ) |>
  ggplot(aes(y = importantvote, fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN", 
    subtitle = "By issue and importance of vote",
    x = "Important vote", y = "", fill = "Vote"
    ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "E")

05:00

Designing code snippets for teaching

Write it out to your heart’s desire and polish it
Then, split into three parts:
- Pre-process: Required, but not directly connected to (or far off from) the learning goals of the current lesson
- Stash: Not required, and not directly connected to the learning goals of the current lesson
  - Likely concepts that fit better into future lessons
- Feature: Heart of the lesson (and maybe a review of a previous lesson)
Finally, decide on the pace at which to scaffold and layer

Pre-process

We’ll call the result of the highlighted lines (everything before the ggplot() call) us_votes

un_votes |>
  filter(country %in% c("United States")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  mutate(
    importantvote = ifelse(importantvote == 0, "No", "Yes"),
    issue = ifelse(issue == "Nuclear weapons and nuclear material", "Nuclear weapons and materials", as.character(issue))
    ) |>
  ggplot(aes(y = importantvote, fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN",
    subtitle = "By issue and importance of vote",
    x = "Important vote", y = "", fill = "Vote"
    ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "E")

Pre-process

us_votes

# A tibble: 5,718 × 14
    rcid country       country_code vote  session importantvote date       unres      amend  para short                       descr                                                     short_name issue
   <dbl> <chr>         <chr>        <fct>   <dbl> <chr>         <date>     <chr>      <int> <int> <chr>                       <chr>                                                     <chr>      <chr>
 1     6 United States US           no          1 No            1946-01-04 R/1/107        0     0 DECLARATION OF HUMAN RIGHTS "TO ADOPT A CUBAN PROPOSAL (A/3-C) THAT AN ITEM ON A DEC… hr         Human rights
 2     8 United States US           no          1 No            1946-01-05 R/1/297        1     0 ECOSOC POWERS               "TO ADOPT A SECOND 6TH COMM. AMENDMENT (A/14) TO THE PRO… ec         Economic development
 3    11 United States US           yes         1 No            1946-02-05 R/1/376        0     0 TRUSTEESHIP AMENDMENTS      "TO ADOPT DRAFT RESOLUTIONS I AND II AS A WHOLE, OF THE … co         Colonialism
 4    11 United States US           yes         1 No            1946-02-05 R/1/376        0     0 TRUSTEESHIP AMENDMENTS      "TO ADOPT DRAFT RESOLUTIONS I AND II AS A WHOLE, OF THE … ec         Economic development
 5    18 United States US           no          1 No            1946-02-03 R/1/532        1     0 ECOSOC CONSULTANTS          "TO ADOPT USSR (ORAL) AMENDMENT REPLACING THE 1ST COMM. … ec         Economic development
 6    19 United States US           yes         1 No            1946-02-03 R/1/534        0     0 ECOSOC CONSULTANTS          "TO ADOPT THE 1ST COMM. DRAFT RESOLUTION (A/54/REV.1) PR… ec         Economic development
 7    24 United States US           yes         1 No            1946-12-05 R/1/1229       0     0 ECOSOC ELECTIONS            "TO ADOPT BELGIAN ORAL PROPOSAL TO SURRENDER BELGIUM'S S… ec         Economic development
 8    26 United States US           no          1 No            1946-12-06 R/1/1286       0     0 TRUSTEESHIP AGREEMENTS      "TO ADOPT USSR ORAL RESOL. REJECTING 8 DRAFT TRUSTEESHIP… co         Colonialism
 9    27 United States US           yes         1 No            1946-12-06 R/1/1287/A     0     0 NEW GUINEA TRUSTEESHIP      "TO ADOPT THE TRUSTEESHIP AGREEMENT FOR NEW GUINEA SUBMI… co         Colonialism
10    28 United States US           yes         1 No            1946-12-06 R/1/1287/B     0     0 RUANDA-URUNDI TRUSTEESHIP   "TO ADOPT THE TRUSTEESHIP AGREEMENT FOR RUANDA-URUNDI SU… co         Colonialism
# … with 5,708 more rows

Stash

un_votes |>
  filter(country %in% c("United States")) |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  mutate(
    importantvote = ifelse(importantvote == 0, "No", "Yes"),
    issue = ifelse(issue == "Nuclear weapons and nuclear material", "Nuclear weapons and materials", as.character(issue))
    ) |>
  ggplot(aes(y = importantvote, fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN", 
    subtitle = "By issue and importance of vote",
    x = "Important vote", y = "", fill = "Vote"
    ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "E")

Feature

us_votes |>
  ggplot(aes(y = importantvote, fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN", 
    subtitle = "By issue and importance of vote",
    x = "Important vote", y = "", fill = "Vote"
    )

Scaffold 1

ggplot(data = us_votes)

Scaffold 2

ggplot(data = us_votes, 
  mapping = aes(y = importantvote,
                fill = vote))

Scaffold 3

ggplot(data = us_votes, 
  mapping = aes(y = importantvote,
                fill = vote)) +
  geom_bar(position = "fill")

Scaffold 4

ggplot(data = us_votes, 
  mapping = aes(y = importantvote,
                fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1)

Scaffold 5

ggplot(data = us_votes, 
  mapping = aes(y = importantvote,
                fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN",
    subtitle = "By issue and importance of vote", 
    x = "Important vote", 
    y = "" 
    )

Scaffold 6

ggplot(data = us_votes, 
  mapping = aes(y = importantvote,
                fill = vote)) +
  geom_bar(position = "fill") +
  facet_wrap(~ issue, ncol = 1) +
  labs(
    title = "How the US voted in the UN",
    subtitle = "By issue and importance of vote", 
    x = "Important vote", 
    y = "",
    fill = "Vote"
    )

Re-insert (rather than skip) baby steps

Which is more likely to appeal to someone who has never tried broccoli?

Design principle 4: Hide the veggies

Today we’re going to do web scraping

Using the rvest package
And with the help of regular expressions

Today we go from this to that

and do so in a way that is easy to replicate for another state

Practically speaking…

Students will encounter lots of new challenges along the way
Let that happen, and then provide a solution

Start with a mini-lecture

Lesson: Web scraping essentials for turning a structured table into a data frame in R.

Follow up with a hands-on exercise

Lesson: Web scraping essentials for turning a structured table into a data frame in R.
Ex 1: Scrape the table off the web and save as a data frame.
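
A minimal sketch of what Ex 1 could look like with rvest. The HTML below is a made-up stand-in for a real page so that the example runs offline; the table contents and the name `county_table` are hypothetical. For a live page, `minimal_html()` would be replaced by `read_html()` with the page's URL.

```r
# Sketch: turn a structured HTML table into a data frame with rvest.
# The HTML is a hypothetical stand-in for a real web page.
library(rvest)

page <- minimal_html('
  <table>
    <tr><th>county</th><th>population</th></tr>
    <tr><td>Alpha</td><td>1200</td></tr>
    <tr><td>Beta</td><td>850</td></tr>
  </table>
')

county_table <- page |>
  html_element("table") |>   # select the first <table> node on the page
  html_table()               # parse the node into a tibble

county_table
```

On real pages the selector usually needs to be more specific than `"table"`, which is a good prompt for the thought exercise that follows.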

And a thought exercise

Lesson: Web scraping essentials for turning a structured table into a data frame in R.
Ex 1: Scrape the table off the web and save as a data frame.
Ex 2: What other information do we need represented as variables in the data to obtain the desired facets?

And finally, the veggies!

Lesson: Web scraping essentials for turning a structured table into a data frame in R.
Ex 1: Scrape the table off the web and save as a data frame.
Ex 2: What other information do we need represented as variables in the data to obtain the desired facets?
Lesson: “Just enough” string parsing and regular expressions to go from

If you are already taking a baking class, which will be easier to venture on to?

Design principle 5: Leverage the ecosystem

Estimate the difference between the average evaluation score of male and female faculty.

evals |>
  specify(score ~ gender) |>
  generate(reps = 100, 
    type = "bootstrap") |>
  calculate(stat = "diff in means", 
    order = c("male", "female")) |>
  summarise(
    l = quantile(stat, 0.025), 
    u = quantile(stat, 0.975)
    )

# A tibble: 1 × 2
       l     u
   <dbl> <dbl>
1 0.0256 0.231

t.test(evals$score ~ evals$gender)


    Welch Two Sample t-test

data:  evals$score by evals$gender
t = -2.7507, df = 398.7, p-value = 0.006218
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.24264375 -0.04037194
sample estimates:
mean in group female   mean in group male 
            4.092821             4.234328

infer ∈ tidymodels

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

infer 1

evals |>
  specify(score ~ gender)

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 463 × 2
   score gender
   <dbl> <fct> 
 1   4.7 female
 2   4.1 female
 3   3.9 female
 4   4.8 female
 5   4.6 male  
 6   4.3 male  
 7   2.8 male  
 8   4.1 male  
 9   3.4 male  
10   4.5 female
# … with 453 more rows

infer 2

set.seed(1234)
evals |>
  specify(score ~ gender) |>
  generate(reps = 100, type = "bootstrap")

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 46,300 × 3
# Groups:   replicate [100]
   replicate score gender
       <int> <dbl> <fct> 
 1         1   4   female
 2         1   3.1 male  
 3         1   5   male  
 4         1   4.4 male  
 5         1   3.5 female
 6         1   4.5 female
 7         1   4.5 male  
 8         1   4.9 male  
 9         1   4.4 male  
10         1   3.5 male  
# … with 46,290 more rows

infer 3

set.seed(1234)
evals |>
  specify(score ~ gender) |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("male", "female"))

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 100 × 2
   replicate     stat
       <int>    <dbl>
 1         1  0.230  
 2         2  0.134  
 3         3  0.100  
 4         4  0.230  
 5         5  0.128  
 6         6  0.201  
 7         7  0.168  
 8         8  0.130  
 9         9 -0.00490
10        10  0.123  
# … with 90 more rows

infer 4

set.seed(1234)
evals |>
  specify(score ~ gender) |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("male", "female")) |>
  visualise()

infer 5

set.seed(1234)
evals |>
  specify(score ~ gender) |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("male", "female")) |>
  summarise(l = quantile(stat, 0.025), u = quantile(stat, 0.975))

# A tibble: 1 × 2
       l     u
   <dbl> <dbl>
1 0.0285 0.238
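
To make explicit what the pipeline above computes, here is a base R sketch of a percentile bootstrap interval for a difference in means. It is an approximation of the idea, not infer's implementation. The data are simulated stand-ins (the evals data are not reproduced here), and the names `scores`, `diff_in_means`, and `boot_stats` are hypothetical. Each step roughly corresponds to one verb: choosing the columns (specify), resampling rows with replacement (generate), computing the statistic (calculate), and taking quantiles (summarise).

```r
# Base R sketch of a percentile bootstrap interval for a difference in means.
# The data are simulated stand-ins, NOT the evals data used above.
set.seed(1234)
scores <- data.frame(
  gender = rep(c("female", "male"), times = c(60, 80)),
  score  = c(rnorm(60, mean = 4.1, sd = 0.5), rnorm(80, mean = 4.2, sd = 0.5))
)

# Difference in means (male minus female) for one data set
diff_in_means <- function(d) {
  mean(d$score[d$gender == "male"]) - mean(d$score[d$gender == "female"])
}

# Resample rows with replacement and recompute the statistic each time
boot_stats <- replicate(1000, {
  rows <- sample(nrow(scores), replace = TRUE)
  diff_in_means(scores[rows, ])
})

# Middle 95% of the bootstrap distribution
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

Showing the loop once can help students see that each infer verb names a step they could, in principle, write themselves.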

One other way to “leverage the ecosystem”

Do it all in R!

Slides, course, course notes / textbook with Quarto
A student dashboard with flexdashboard
Git automation with ghclass
Interactive tutorials with learnr
…