📦
Building tidy tools

Day 2 Session 1: Function Design

Emma Rand and Ian Lyttle

👋 Welcome Back

The Team

Emma Rand 🐦er13_r

Elliot Murphy

Beatriz Milz

Kailey Mulligan

Ian Lyttle 🐦ijlyttle

Haley Jeppson

Ted Laderas

Standing on the shoulders of Building Tidy Tools, rstudio::conf(2020) (C. Wickham and Wickham 2021), R Packages (H. Wickham and Bryan 2020)

Code of conduct

rstudio::conf code of conduct highlights:

  • wear a mask over your nose and mouth.
  • treat everyone with respect.
  • if there is a problem:

    • talk to an RStudio staff member (Elliot is our liaison)
    • email: conf@rstudio.com
    • phone: +1 844-448-1212

Code of conduct

The Code of Conduct and COVID policies can be found at https://www.rstudio.com/conference/2022/2022-conf-code-of-conduct/. Please review them carefully.

RStudio requires that you wear a mask that fully covers your mouth and nose at all times in all public spaces. We strongly recommend that you use a correctly fitted N95, KN95, or similar particulate filtering mask; we will have a limited supply available upon request.

You can report Code of Conduct violations in person, by email, or by phone. Please see the policy linked above for contact information.

Housekeeping

  • WiFi:
    • Username: conf22
    • Password: together!
  • There are gender neutral bathrooms by the National Harbor rooms.
  • The meditation room is located at National Harbor 9. Open 8am - 5pm, Monday - Thursday. The hotel also has a dedicated room behind the reception.
  • The lactation room is located at Potomac Dressing Room. Open 8am - 5pm, Monday - Thursday.
  • Participants who do not wish to be photographed have red lanyards. Please note everyone’s lanyard color before taking a photo, and respect their choices.

Acknowledgements

  • RStudio for putting this on (esp. Mine Çetinkaya-Rundel)
  • Emma has been an amazing collaborator in this endeavour
  • All the TAs for helping these days run smoothly
  • Hadley Wickham, Jenny Bryan, and Charlotte Wickham
  • At SE: Gilles Perry and Abhijeet Shegokar for “suffering” through a practice run
  • All of you for being here

How we will work

  • stickies

  • no stupid questions

Schedule

  • Function design

  • Managing side effects

  • Tidy eval

  • Functional & object-oriented programming

State of play

We want to concentrate on specific concepts, rather than writing entire functions.

We have created a set of checkpoints called states:

btt22::btt_state()
 [1] "2.1.1" "2.1.2" "2.2.1" "2.2.2" "2.2.3" "2.3.1" "2.3.2" "2.3.3" "2.3.4"
[10] "2.3.5" "2.3.6" "2.3.7" "2.4.1" "2.4.2" "2.4.3"

For example, "2.1.1" means day 2, session 1, task 1.

Getting new files

To get new files for a state:

# "2.1.1": day 2, session 1, task 1
btt_get("2.1.1")
  • adds files to the directories R, tests/testthat.
  • these files contain functions and tests that you will complete.

Staying on the “happy path”

One example builds on another, so it’s important to keep up.

We will do our best to help; in case you need to reset:

btt_reset_hard("2.1.1")

Overwrites:

  • directories: R, tests/testthat
  • Imports, Suggests sections of DESCRIPTION

Learning objectives

At the end of this section you will be able to:

  • order and name your function’s arguments.
  • recognize type-stable functions and their importance.
  • distinguish a pure function from a function that has or uses side effects.

But first…

Make R CMD check happy

When we finished yesterday:

> checking R code for possible problems ... NOTE
  uss_make_matches: no visible binding for global variable ‘tier’
  uss_make_matches: no visible binding for global variable ‘Season’
  uss_make_matches: no visible binding for global variable ‘Date’
  uss_make_matches: no visible binding for global variable ‘home’
  uss_make_matches: no visible binding for global variable ‘visitor’
  uss_make_matches: no visible binding for global variable ‘hgoal’
  uss_make_matches: no visible binding for global variable ‘vgoal’
  Undefined global functions or variables:
    Date Season hgoal home tier vgoal visitor

0 errors ✓ | 0 warnings ✓ | 1 note x

Where does tier, etc. come from?

  • We know it’s a column in a data frame, but R doesn’t know that.

  • How do we specify “this comes from a data frame”?

Preview of tidy eval

The {rlang} package (Henry and Wickham 2022) provides pronouns.

Interactively, we might write:

library("dplyr")

mtcars |>
  mutate(cons_lper100km = 235.215 / mpg)

In a package function, we would write:

mtcars |>
  dplyr::mutate(cons_lper100km = 235.215 / .data$mpg)
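
A minimal sketch of the package-function form, assuming {dplyr} is installed. The function name add_consumption() and its factor argument are hypothetical, made up for illustration. .data$ points at a column of the data frame; .env$ points at a variable outside the data frame (here, the function argument):

```r
# Hypothetical helper: add fuel consumption (L/100 km) to a data frame
# that has an `mpg` column.
add_consumption <- function(data, factor = 235.215) {
  dplyr::mutate(
    data,
    # .env$factor is the function argument; .data$mpg is a column of `data`
    cons_lper100km = .env$factor / .data$mpg
  )
}

result <- add_consumption(mtcars)
head(result$cons_lper100km)
```

Inside a package, importing the pronouns from {rlang} is what lets R CMD check resolve .data and .env.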

Your turn "2.1.1"

  1. Add the {rlang} package:
usethis::use_package("rlang")
  2. Import the .data, .env pronouns:
# adds to R/ussie-package.R
usethis::use_import_from("rlang", c(".data", ".env"))
  3. matches.R: use .data, .env in uss_make_matches().

  4. devtools::check() should be happy now.

API design

A thing I like about tidyverse:

  • there should be a function to do this; it should look like this
  • there already is

Because:

  • functions and arguments follow naming conventions

  • arguments are ordered according to purpose

  • we know what to expect for return values

Evolving references

The way we approach problems is always evolving; tidyverse is no exception:

  • follow the GitHub repo of your favorite tidyverse/r-lib package (mine is {usethis}):
    • issue discussions
    • pull-request reviews

Naming functions

If writing a smaller package, consider prefixing your functions:

  • {ussie}: uss_make_matches()

  • {btt22}: btt_get()

Use a noun if building up a specific type of object:

Casing

  • Tidyverse uses snake_case; Shiny prefers camelCase

  • Python prefers snake_case

  • JavaScript prefers:

    • camelCase for functions
    • PascalCase for classes, interfaces

Pick a convention according to your domain, follow it.

Arguments

Here, mtcars is an argument:

head(mtcars)

Here, data is a formal argument:

head <- function(data){
  ...
}

In R, we sometimes use these terms interchangeably; we sometimes use the term formals.

¯\_(ツ)_/¯

Naming arguments

Like naming functions, strive to be:

  • consistent
  • evocative
  • concise

There are only two hard things in Computer Science: cache invalidation and naming things.

– Phil Karlton

And off-by-one errors – Leon Bambrick

Ordering arguments

  • data: first argument, “the thing”
  • descriptors: values the user should specify
  • dots (...): stuff that gets passed to other functions
  • details: values with defaults

I have seen the order of dots and details reversed.

However, data and descriptors almost always come first.
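
The ordering above can be sketched with a small base-R function. col_mean() is a hypothetical name, not part of {ussie}; it only illustrates where each kind of argument sits:

```r
# Hypothetical function illustrating argument order.
col_mean <- function(data, col, ..., na_rm = FALSE) {
  # data:  the data frame, "the thing"
  # col:   descriptor, no default, so the user must specify it
  # ...:   dots, passed on to mean()
  # na_rm: detail, has a default
  mean(data[[col]], ..., na.rm = na_rm)
}

plain <- col_mean(mtcars, "mpg")
trimmed <- col_mean(mtcars, "mpg", trim = 0.1)  # `trim` travels through the dots
```

One consequence of putting details after the dots: they can only be matched by their full name, which keeps calls explicit.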

Discuss with neighbour

Which are: data, descriptors, details?

# there are actually more args...
pivot_longer <- function(
  data,                
  cols,                
  names_to = "name",   
  names_prefix = NULL  
) {
  ...
}

Discuss with neighbour (answer)

Which are: data, descriptors, details?

# there are actually more args...
pivot_longer <- function(
  data,                # data
  cols,                # descriptor
  names_to = "name",   # details
  names_prefix = NULL  # details
) {
  ...
}

Return value: type stability

This is a key to tidyverse.

In theory

Type of return-value depends only on the types of the inputs.

  • no return_tibble = TRUE arguments.

In practice

  • return same type as data (first) argument

  • return constant type, e.g. double
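
A base-R illustration of why this matters (not from {ussie}): the type sapply() returns depends on the values it is given, whereas vapply() returns the type you declare:

```r
inputs <- list(1:3, 1:5)

# sapply(): return type depends on the input values
unstable_full  <- sapply(inputs, length)  # integer vector
unstable_empty <- sapply(list(), length)  # empty list, not integer(0)

# vapply(): return type is declared up front
stable_full  <- vapply(inputs, length, integer(1))
stable_empty <- vapply(list(), length, integer(1))

class(unstable_empty)
class(stable_empty)
```

Code downstream of the vapply() calls can rely on getting an integer vector, even for empty input.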

Putting it all together

When I think of tidyverse functions, I can remember type for:

  • data (first) argument

  • return value

For example:

tibble -> tibble pattern makes it easy to work with the pipe: |>
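
A minimal sketch of the pattern, using base data frames rather than tibbles to stay self-contained. Both helper names are hypothetical. Each function takes a data frame first and returns a data frame, so the calls chain:

```r
# Hypothetical data-frame-in, data-frame-out helpers.
keep_cylinders <- function(data, n_cyl) {
  data[data$cyl == n_cyl, , drop = FALSE]
}

add_consumption_col <- function(data) {
  data$cons_lper100km <- 235.215 / data$mpg
  data
}

# Requires R >= 4.1 for the native pipe.
result <- mtcars |>
  keep_cylinders(4) |>
  add_consumption_col()
```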

Our turn "2.1.2"

Implement a function, uss_get_matches():

  • given a country, return a matches tibble

Only if needed, btt22::btt_reset_hard("2.1.2")

Get new files, btt22::btt_get("2.1.2"):

  • columns.R, get-matches.R
  • test-get-matches.R

usethis::use_package("engsoccerdata")

columns.R

In {ussie} we (will) have all sorts of tibbles:

  • engsoc
  • matches
  • teams_matches
  • seasons

and groupings:

  • seasons_grouping
  • accumulate

Build-time vs. run-time

  • Put this code, temporarily, into columns.R:
build_time <- Sys.time()

run_time <- function() {
  Sys.time()
}

If you need to delay evaluation, try a function.
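
The difference can be sketched in a plain script. Here "build time" is simulated by the moment the top-level assignment runs; in a package, top-level code runs when the package is built, so the stored value is frozen into the installed package:

```r
# Evaluated once, when this line runs (in a package: at build time).
build_time <- Sys.time()

# Evaluated every time the function is called.
run_time <- function() {
  Sys.time()
}

first_look <- build_time
Sys.sleep(0.2)
second_look <- build_time  # unchanged: the value was captured earlier
later <- run_time()        # fresh value
```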

get_soccer_data()

Given name of dataset in {engsoccerdata}, return dataset:

get_soccer_data <- function(data_name) {
  # create isolated environment
  e <- new.env()
  
  # put the data into environment
  name <- utils::data(
    list = data_name, 
    package = "engsoccerdata", 
    envir = e
  )[1]
  
  # return data from environment
  e[[name]]
}
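
To try the same technique without {engsoccerdata}, here is a sketch that adds a package argument and reads from the {datasets} package that ships with R. The name get_package_data() and its package argument are assumptions for illustration:

```r
# Hypothetical generalisation of get_soccer_data().
get_package_data <- function(data_name, package) {
  # isolated environment, so nothing is written to the caller's workspace
  e <- new.env()

  # utils::data() returns the names of the datasets it loaded
  name <- utils::data(list = data_name, package = package, envir = e)[1]

  e[[name]]
}

cars_data <- get_package_data("mtcars", "datasets")
```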

uss_countries()

Return set of valid values for country:

uss_countries <- function() {
  c("england", "germany", "holland", "italy", "spain")
}

Run-time vs. build-time

Safer habit: delay evaluation by wrapping code in a function

best_wins_leeds()

  • usethis::use_package("engsoccerdata")

    best_wins_leeds <- function(n = 10) {
      engsoccerdata::bestwins(
        engsoccerdata::england,
        teamname = "Leeds United",
        N = n
      )
    }

uss_get_matches()

Given country, return matches data:

uss_get_matches <- function(country = uss_countries()) {
  
  # validate country
  country <- match.arg(country)
  
  # get data for country
  data <- get_soccer_data(country)
  
  # capitalize
  substr(country, 1, 1) <- toupper(substr(country, 1, 1))
  
  # make matches data, return
  uss_make_matches(data, country) 
}
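
The validation and capitalization steps can be tried in isolation. In this sketch, valid_countries() and pick_country() are hypothetical stand-ins for uss_countries() and the first part of uss_get_matches():

```r
valid_countries <- function() {
  c("england", "germany", "holland", "italy", "spain")
}

pick_country <- function(country = valid_countries()) {
  # match.arg() checks `country` against the default's set of values
  country <- match.arg(country)

  # capitalize the first letter
  substr(country, 1, 1) <- toupper(substr(country, 1, 1))
  country
}

default_pick <- pick_country()       # no argument: first valid value
partial_pick <- pick_country("ger")  # partial matching is allowed
invalid_pick <- tryCatch(
  pick_country("france"),            # not a valid value: error
  error = function(e) conditionMessage(e)
)
```

Using a function call as the default keeps the set of valid values in one place.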

Document

Show an example (or two) of using uss_get_matches() in your package vignette.

Pure functions vs. side effects

  • I (Ian) have been programming since 1982, in some form.
  • I studied mechanical engineering, not computer science; programming on the side.
  • When I learned about this in 2016, it changed my view of programming.
  • Keeping this distinction in mind simplifies a lot of challenges.

Pure function

A function where:

  • the return value depends only on argument values

  • the only effect is the return value

Examples:

function(x, y) {
  x + y
}
cos(pi)

Side effects

A function where:

  • the return value can depend on “the outside universe”

  • there is a change in “the outside universe”

Examples:

readr::read_csv("myfile.csv")
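
A base-R sketch of a return value that depends on “the outside universe”. format_pi() is a hypothetical function that takes no arguments, yet its result changes with a global option:

```r
# Pure: result depends only on the arguments.
pure_add <- function(x, y) {
  x + y
}

# Not pure: format() consults the global option `digits`.
format_pi <- function() {
  format(pi)
}

old <- options(digits = 7)  # changing an option is itself a side effect
seven_digits <- format_pi()
options(digits = 3)
three_digits <- format_pi()
options(old)                # restore the previous setting
```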

Why is this important?

  • Pure functions are easier to test than functions with side effects.
  • Side effects (interactions with universe) take time (Shiny).
  • Functions with side effects should document the effects.
  • Side effects are not inherently bad (we do need to write to the file system), but they need extra care.

Discuss with your neighbour

Are these {ussie} functions pure?

  • uss_countries()

  • uss_make_matches()

  • get_soccer_data()

  • uss_get_matches()

Practical advice

Try to separate tasks into pure functions and side effects:

  • easier to test the pure functions and side effects separately

  • use these functions in higher-level functions

For example:

  • uss_make_matches() is a pure function.

  • get_soccer_data() uses side effects.

  • uss_get_matches() calls each of these functions.
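
The same split, sketched with hypothetical functions on a toy file (none of these names are in {ussie}). The pure function is tested with in-memory data; only the reader touches the file system:

```r
# Pure: data frame in, data frame out.
add_goal_difference <- function(matches) {
  matches$goal_diff <- matches$hgoal - matches$vgoal
  matches
}

# Side effect: result depends on a file.
read_matches <- function(path) {
  utils::read.csv(path)
}

# Higher-level function: calls each of the above.
load_matches <- function(path) {
  add_goal_difference(read_matches(path))
}

# The pure part can be tested without any file.
toy <- data.frame(hgoal = c(2, 0), vgoal = c(1, 3))
pure_result <- add_goal_difference(toy)

# The side-effect part needs a file; use a temporary one and clean up.
path <- tempfile(fileext = ".csv")
utils::write.csv(toy, path, row.names = FALSE)
file_result <- load_matches(path)
unlink(path)
```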

Summary

  • Naming: be consistent, concise, yet evocative.
  • Argument order: data, descriptors, dots, details.
  • Return value type: be consistent, predictable.
  • Easy to remember data (first) argument and return value:
    • easy to use pipe, |>.
  • Be mindful of side effects.

Additional material

Hadley’s keynote at rstudio::conf(2017):

  • not available on YouTube 😢

  • talks about tidyverse design

Joe Cheng’s talks (Part 1, Part 2) on reactivity at Shiny Developers Conference (2016), precursor to rstudio::conf():

  • these were the talks that changed my (Ian’s) perspective on programming

  • pure functions vs. side effects

References

Henry, Lionel, and Hadley Wickham. 2022. “Rlang: Functions for Base Types and Core R and ‘Tidyverse’ Features.” https://CRAN.R-project.org/package=rlang.
Team, Tidyverse. 2022. “Tidyverse Design Guide.” https://design.tidyverse.org/.
Wickham, Charlotte, and Hadley Wickham. 2021. Building Tidy Tools. rstudio::conf 2020. https://github.com/rstudio-conf-2020/build-tidy-tools.
Wickham, Hadley. 2022. “The Tidyverse Style Guide.” https://style.tidyverse.org/.
Wickham, Hadley, and Jenny Bryan. 2020. R Packages. The work-in-progress 2nd edition. Online. https://r-pkgs.org/index.html.