class: center, middle, inverse, title-slide

# Welcome to R for People Analytics
## Introduction and Preliminaries
### rstudio::conf(2022)

---
class: left, middle, rstudio-logo, bigfont

## Good morning and welcome!

What we hope you can say at the end of this 2-day workshop:

🧑‍🎓 You've learned a lot about some important methods for your work

🎉 You've had fun

👋 You've met some interesting people who work on similar things to you

---
class: left, middle, rstudio-logo

# About this workshop

---
class: left, middle, rstudio-logo

## Introducing yourselves

Find a person close to you who you haven't met and tell each other:

1. Who you are
2. Where you are from
3. What you do
4. What you hope to get from the next two days

---
class: left, middle, rstudio-logo

## What is people analytics?

> "People analytics uses behavioral data to understand how people work and change how companies are managed."
>
> <footer>--- Wikipedia</footer>

<p>

> "People analytics is a data-driven approach to understanding people related phenomena."
>
> <footer>--- Keith</footer>

---
class: left, middle, rstudio-logo

## What is people analytics?

A lot of what constitutes 'people analytics' is no different from other types of analytics.

* Processing, cleaning and structuring data
* Identifying patterns of interest
* Reporting descriptive statistics

But we will focus on quantitative techniques which are used more frequently in people analytics than in most other fields:

* Explanatory modeling
* Network analysis

---
class: left, middle, rstudio-logo

## Explanatory modeling: what is it?

Explanatory modeling uses a sample of data to draw inferences about the potential causes of an outcome of interest. It is also sometimes called *inferential modeling*.

<b>Examples:</b>

* Does working schedule and/or pay rate influence likelihood to leave an organization?
* Does academic performance in earlier years of a program influence academic performance in the final year?
* Do certain demographic factors influence choice of career?
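---
class: left, middle, rstudio-logo

## Explanatory modeling: a minimal sketch

As a rough illustration of the second example above, the sketch below fits a linear regression with `lm()` to simulated scores. The data, the names `yr1`, `final` and `sim_scores`, and the assumed effect of 0.8 are all invented for illustration; they are not workshop data or results.

```r
# simulated data (an assumption for illustration only, not workshop data)
set.seed(42)
n <- 200
yr1 <- runif(n, min = 20, max = 100)
final <- 50 + 0.8 * yr1 + rnorm(n, sd = 20)
sim_scores <- data.frame(yr1 = yr1, final = final)

# fit a simple linear regression explaining final scores using year 1 scores
model <- lm(final ~ yr1, data = sim_scores)

# coefficient estimates, standard errors and p-values
coef(summary(model))
```

The `yr1` row estimates how much the final score changes, on average, for each additional point in year 1. Its p-value helps judge whether that relationship is distinguishable from zero in this sample.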
---
class: left, middle, rstudio-logo

## Explanatory modeling: what we will learn

We will focus on regression analysis as a way to <i>explain</i> outcomes of interest using data. You will learn:

1. How to choose an appropriate model type for the problem at hand.
2. How to prepare your data for your model.
3. How to execute the model.
4. How to view a variety of outputs from the model.
5. How to interpret those outputs against the problem at hand.

---
class: left, middle, rstudio-logo

## Network analysis: what is it?

Network analysis uses graph theory to store, visualize and analyze data on relationships. This can be used to answer questions about people, groups, organizational structures and many other things.

<b>Examples:</b>

* Who are the important or influential actors in a group?
* Are there 'hidden but important' subgroups?
* What factors drive connection in an organization?

---
class: left, middle, rstudio-logo

## Network analysis: what we will learn

We will focus on creating, visualizing and analyzing network structures to draw insights about a problem. You will learn:

1. How to store data in graph structures.
2. How to visualize those structures.
3. How to understand relationships/connections within those structures.
4. How to measure importance and influence in a network.
5. How to identify community structures in networks and how to describe those structures.

---
class: left, middle, rstudio-logo, reallybigfont

## How we will learn

👩‍🏫 Talks and instruction

👨🏽‍💻 Frequent short coding exercises

🤹 Project work

😲 A few other things

---
class: left, middle, rstudio-logo, reallybigfont

## How to ask questions

📣 Ask instructors or TAs (during breaks if possible!)

💻 Post to our [GitHub Discussions](https://github.com/rstudio-conf-2022/people-analytics-rstats/discussions) page!
---
class: left, middle, rstudio-logo, reallybigfont

## How to get help

<table>
  <tr>
    <td class='box lightblue'></td>
    <td valign="middle">🥳 I'm finished - all good!</td>
  </tr>
  <tr>
    <td class='box pink'></td>
    <td valign="middle">😕 I could use some help!</td>
  </tr>
</table>

---
class: left, middle, rstudio-logo

# Foundations: Working with people data in R

---
class: left, middle, rstudio-logo

## Data types - numeric

```r
# numeric double
my_double <- 42.3

# use typeof() to find out the data type of a scalar value
typeof(my_double)
```

```
## [1] "double"
```

```r
# numeric integer
my_integer <- 42L
typeof(my_integer)
```

```
## [1] "integer"
```

---
class: left, middle, rstudio-logo

## Data types - character and logical

```r
# character is any string in quotes
my_name <- "Keith"
typeof(my_name)
```

```
## [1] "character"
```

```r
# logical is TRUE or FALSE
at_rstudio_conf <- TRUE
typeof(at_rstudio_conf)
```

```
## [1] "logical"
```

---
class: left, middle, rstudio-logo

## Data structures - numeric, character and logical vectors

```r
# vectors are 1-dimensional homogeneous structures (same data type)
first_primes <- c(2, 3, 5, 7, 11)

# use str() to get info about data structures
str(first_primes)
```

```
##  num [1:5] 2 3 5 7 11
```

```r
# character vector
faculty <- c("Alex", "Jiena", "Jordan", "Keith", "Liz", "Rachel")
str(faculty)
```

```
##  chr [1:6] "Alex" "Jiena" "Jordan" "Keith" "Liz" "Rachel"
```

```r
# logical vector
faculty_has_dog <- c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
str(faculty_has_dog)
```

```
##  logi [1:6] TRUE FALSE TRUE TRUE TRUE TRUE
```

---
class: left, middle, rstudio-logo

## Data structures - categorical (factor) vectors

```r
# categorical or factor vectors store a limited set of categorical values
faculty_factor <- as.factor(faculty)
str(faculty_factor)
```

```
##  Factor w/ 6 levels "Alex","Jiena",..: 1 2 3 4 5 6
```

```r
# if the categories have order, you can specify the order
performance <- c("Low", "High", "Medium", "High", "Low")
ordered_performance <- ordered(
  performance,
  levels = c("Low", "Medium", "High")
)
str(ordered_performance)
```

```
##  Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 3 2 3 1
```

---
class: left, middle, rstudio-logo

## Data structures - type coercion

```r
# what happens if we try to put heterogeneous data in a vector
mixed_types_1 <- c(6.75, "Keith")
str(mixed_types_1)
```

```
##  chr [1:2] "6.75" "Keith"
```

```r
# some form of type coercion occurs
mixed_types_2 <- c(TRUE, 6.3)
str(mixed_types_2)
```

```
##  num [1:2] 1 6.3
```

```r
# if you add an unknown element to a defined factor vector
new_faculty <- c(faculty_factor, "George Clooney")
str(new_faculty)
```

```
##  chr [1:7] "1" "2" "3" "4" "5" "6" "George Clooney"
```

```r
# use type conversion functions to control coercion
new_faculty <- c(as.character(faculty_factor), "George Clooney")
str(new_faculty)
```

```
##  chr [1:7] "Alex" "Jiena" "Jordan" "Keith" "Liz" "Rachel" "George Clooney"
```

---
class: left, middle, rstudio-logo

## Exercise - data type and type conversion

For our first short exercise, we will do some practice on working with and converting data types.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 01 - R Fundamentals**.

Let's work on **Exercises 1 and 2**.

---
class: left, middle, rstudio-logo

## Data structures - named lists

Named lists are the most flexible structures in R. They can contain any other structures inside them.
```r
my_list <- list(
  great_tv = c("Ozark", "Mad Men", "Breaking Bad"),
  first_primes = first_primes,
  faculty_factor = faculty_factor
)

str(my_list)
```

```
## List of 3
##  $ great_tv      : chr [1:3] "Ozark" "Mad Men" "Breaking Bad"
##  $ first_primes  : num [1:5] 2 3 5 7 11
##  $ faculty_factor: Factor w/ 6 levels "Alex","Jiena",..: 1 2 3 4 5 6
```

```r
# access specific elements
my_list$great_tv
```

```
## [1] "Ozark"        "Mad Men"      "Breaking Bad"
```

---
class: left, middle, rstudio-logo

## Data structures - dataframes

Dataframes are named lists of vectors of the same length. They are the most popular data structure in R - basically the R equivalent of a spreadsheet.

```r
(faculty_info <- data.frame(
  faculty = faculty_factor,
  has_dog = faculty_has_dog
))
```

```
##   faculty has_dog
## 1    Alex    TRUE
## 2   Jiena   FALSE
## 3  Jordan    TRUE
## 4   Keith    TRUE
## 5     Liz    TRUE
## 6  Rachel    TRUE
```

```r
str(faculty_info)
```

```
## 'data.frame': 6 obs. of 2 variables:
##  $ faculty: Factor w/ 6 levels "Alex","Jiena",..: 1 2 3 4 5 6
##  $ has_dog: logi TRUE FALSE TRUE TRUE TRUE TRUE
```

---
class: left, middle, rstudio-logo

## Loading and viewing dataframes

All of the data sets we work with will be in online CSV files, so we can load them in from a URL using `read.csv()`.

```r
url <- "https://peopleanalytics-regression-book.org/data/ugtests.csv"
ugtests <- read.csv(url)
str(ugtests)
```

```
## 'data.frame': 975 obs. of 4 variables:
##  $ Yr1  : int 27 70 27 26 46 86 40 60 49 80 ...
##  $ Yr2  : int 50 104 36 75 77 122 100 92 98 127 ...
##  $ Yr3  : int 52 126 148 115 75 119 125 78 119 67 ...
##  $ Final: int 93 207 175 125 114 159 153 84 147 80 ...
```

Often data is big, so we will use `head()` to look at the first few rows:

```r
head(ugtests)
```

```
##   Yr1 Yr2 Yr3 Final
## 1  27  50  52    93
## 2  70 104 126   207
## 3  27  36 148   175
## 4  26  75 115   125
## 5  46  77  75   114
## 6  86 122 119   159
```

---
class: left, middle, rstudio-logo

## Functions

Functions perform useful operations on objects, returning a transformed object. They usually exist because there is a task that needs to be performed repeatedly by a user or many users.

We've already seen some functions. Can you name some that we have already seen on previous slides?

We will be using a lot of functions over the next 2 days. Some of them will be built into base R, like `lm()` or `glm()`, and some will come from add-on packages, like `polr()` from `MASS` or `eigen_centrality()` from `igraph`.

```r
# example function - substr() extracts characters from a string
substr("Keith", start = 2, stop = 4)
```

```
## [1] "eit"
```

To display help on how to use the `substr` function, use `?substr` or `help(substr)` in the console.

---
class: left, middle, rstudio-logo

## Packages

A set of functions that have been created for a specific purpose can be released as a package. We will be using packages like `dplyr`, `MASS` and `igraph` over the next 2 days.

All packages have been pre-installed for you on RStudio Cloud, but installing packages is easy. For example, `install.packages("igraph")` would install the `igraph` package.

To use the functions in a package, you should load the installed package from your library. For example, to load `dplyr` you would use `library(dplyr)`.

Sometimes it makes sense to namespace functions in packages so that they are not confused with similarly named functions in other packages. For example, to use the `filter()` function in `dplyr`, you can namespace using `dplyr::filter()`.
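---
class: left, middle, rstudio-logo

## Packages - namespacing example

`filter()` is a useful case to look at: the `stats` package, which R attaches by default, also has a `filter()` function, and it does something quite different from the one in `dplyr`. The sketch below calls each one by its namespace. The `scores` dataframe and its values are hypothetical, invented for illustration.

```r
# hypothetical data, invented for illustration
scores <- data.frame(yr1 = c(27, 70, 46), final = c(93, 207, 114))

# stats::filter() applies a linear filter to a series
# (here a 3-point centered moving average)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a dataframe that match a condition
# (the guard only lets this sketch run where dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  high_yr1 <- dplyr::filter(scores, yr1 > 40)
  print(high_yr1)
}
```

Writing the namespace explicitly means the call does not depend on which packages happen to be loaded, or on the order in which they were loaded.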
---
class: left, middle, rstudio-logo

## The pipe operator

The pipe operator `|>` helps you write more readable code through avoiding deeply nested functions within functions, allowing you to see the order of operations more clearly. (**Tip:** Use `Cmd/Ctrl+Shift+M` for a shortcut to the pipe).

```r
library(dplyr)

# without pipe
round(mean(dplyr::pull(dplyr::filter(ugtests, Yr2 < 75), Yr1)), 2)
```

```
## [1] 51.75
```

```r
# with pipe (note neat coding style)
ugtests |>
  dplyr::filter(Yr2 < 75) |>
  dplyr::pull(Yr1) |>
  mean() |>
  round(2)
```

```
## [1] 51.75
```

---
class: left, middle, rstudio-logo

## Exercise - Dataframes, functions, packages and the pipe operator

For our next short exercise, we will do some practice on working with dataframes, functions, packages and the pipe operator.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and continue **Assignment 01 - R Fundamentals**.

Let's work on **Exercises 3 and 4**.

---
class: left, middle, rstudio-logo

## Plotting and graphing in base R

Plotting is a big part of any analytical work. R has a very wide range of options for this. Base R has functions like `plot()` for simple X-Y plots, and `boxplot()` or `hist()` for specific plot types.

```r
plot(ugtests$Yr1, ugtests$Final)
```

<img src="1-preliminaries_files/figure-html/unnamed-chunk-21-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, rstudio-logo

## Plotting and graphing in `ggplot2`

For those who know it, `ggplot2` is an incredibly powerful graphing package based on the Grammar of Graphics (Wilkinson, 2005).
```r
library(ggplot2)

ggplot(ugtests, aes(x = Yr1, y = Final)) +
  geom_point(color = "blue") +
  labs(x = "Year 1", y = "Final") +
  theme_minimal()
```

<img src="1-preliminaries_files/figure-html/unnamed-chunk-22-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, rstudio-logo

## Pairplots

Pairplots are very useful summary plots to understand univariate and bivariate patterns in data, and are often a useful precursor to modeling efforts. It's important for data types to be well defined for pairplots to work effectively.

```r
library(GGally)

GGally::ggpairs(ugtests)
```

<img src="1-preliminaries_files/figure-html/unnamed-chunk-23-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, rstudio-logo

## Documenting work in R Markdown

R Markdown allows you to integrate your work into a document with commentary, and is a great way to record the work you have done for future reference and reproducibility.

The assignments in this workshop are all set up in R Markdown documents. When you have finished them you can knit them into HTML documents which will remain available in your workspace after this workshop.

Feel free to add your own text commentary or notes to these documents to help remind you of important things.

When we get to our project work tomorrow, you should consider using R Markdown to record your method and code in one document.

---
class: left, middle, rstudio-logo

## Exercise - Plotting and recording your work

For our next short exercise, we will do some practice on plotting and on recording work in R Markdown.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and continue **Assignment 01 - R Fundamentals**.

Let's work on **Exercises 5 and 6**.

---
class: left, middle, rstudio-logo

# ☕ Let's have a break! 😌