class: center, middle, inverse, title-slide # Network Analysis II ###
rstudio::
conf(2022) --- class: left, middle, rstudio-logo, bigfont ## Aim of this module ✅ Review basic network structure metrics - Measures of the overall network structure - Measures of structural positions in a network ✅ Understand relationships between nodes - Community detection methods - Assortativity and Similarity --- class: left, middle, rstudio-logo ## The Structure of Organizational Networks Organizational network structure can be characterized by the pattern of relationships among people that provide both opportunities and constraints. There are two broad ways to measure network structure: 1. <b>Overall Network Structure</b>: Allows us look at structural differences between two or more organizational networks. 2. <b>Structural Position of Nodes</b>: Allows us to quantify and describe a node's relationship within a single network. --- class: left, middle, rstudio-logo ## Overall Organizational Network Structure Measures of overall network structure can answer many different questions in People Analytics: - How do differences of connectivity across offices relate to office-level turnover? - How does the size of different work groups impact their overall performance? - Does your finance department share information with your HR department? These are all questions that measures of overall network structure can help answer. --- class: left, middle, rstudio-logo ## Overall Network Structure: Path Metrics Paths can help us understand how people are connection and how information is spread. Consider the below graphs and suppose the nodes are people. Person A wants to be introduced to Person C. How can we facilitate that introduction, i.e., how can we get from A to C? <img src="6-graph_metrics_files/figure-html/unnamed-chunk-1-1.png" height="400" style="display: block; margin: auto;" /> --- class: left, middle, rstudio-logo ## Overall Network Structure: Directed and Undirected Paths There is only 1 way in the directed graph, but many in the undirected graph. <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Directed </th> <th style="text-align:left;"> Undirected </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A->B, B->C </td> <td style="text-align:left;"> A->B, B->C </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> A->D, D->C </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> A->B, B->D, D->C </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> A->D, D->B, B->C </td> </tr> </tbody> </table> --- class: left, middle, rstudio-logo ## Overall Network Structure: Distance The distance between two nodes is the sum of the weights of the edges traversed in the path. For graphs without a weight property, every edge is assumed to have weight 1. Suppose these weights are the strength of the relationship. Now which path would you choose to introduce Person A to Person C? <img src="6-graph_metrics_files/figure-html/unnamed-chunk-3-1.png" width="400" height="300" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ```r distances(d1,weights = NULL) ``` ``` ## A B C E D F ## A 0 3 4 2 4 6 ## B 3 0 1 5 6 9 ## C 4 1 0 6 5 10 ## E 2 5 6 0 6 4 ## D 4 6 5 6 0 10 ## F 6 9 10 4 10 0 ``` --- class: left, middle, rstudio-logo ## Overall Network Structure: Network size It is often useful to simply understand how large the network is. How many nodes and edges are we dealing with? <img src="6-graph_metrics_files/figure-html/unnamed-chunk-5-1.png" width="400" height="400" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ```r # Number of nodes vcount(sg) ``` ``` ## [1] 6 ``` ```r # Number of edges ecount(sg) ``` ``` ## [1] 5 ``` --- class: left, middle, rstudio-logo ## Overall Network Structure: Network density Network density is defined as the number of actual edges divided by the number of possible edges. A graph with density of 1 is a *complete* graph, and a graph with low density is considered *sparse*. A network with high density will be able to disperse information faster than a graph with network density. <img src="6-graph_metrics_files/figure-html/unnamed-chunk-7-1.png" width="400" height="400" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> - Actual Ties: 5 - Possible Ties = `\(\frac{N(N-1)}{2} = 15\)` ```r # Network density edge_density(sg) ``` ``` ## [1] 0.3333333 ``` --- class: left, middle, rstudio-logo ## Overall Network Structure: Components It can be useful to know the number of **components** in a network. A component is defined as the number of disconnected groups in a network. In a people network, is of particular interest to understand and minimize the number of small components, or "islands", of people who feel disconnected from the broader population. <img src="6-graph_metrics_files/figure-html/unnamed-chunk-9-1.png" width="400" height="400" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ```r # Number of components components(sg) ``` ``` ## $membership ## A B C E D F ## 1 1 1 2 1 2 ## ## $csize ## [1] 4 2 ## ## $no ## [1] 2 ``` --- class: left, middle, rstudio-logo ## Overall Network Structure: Network diameter Network diameter is defined as the longest of the shortest paths in a network. This metric only makes sense in a *connected* graph, because in a *disconnected* graph it is simply `\(\infty\)`. For the graph below, what is the diameter of the network? <img src="6-graph_metrics_files/figure-html/unnamed-chunk-11-1.png" width="300" height="300" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ``` ## From To Shortest.Path Distance ## 1 A B {A,E,D,B} 3 ## 2 A C {A,E,C} 2 ## 3 A D {A,E,D} 2 ## 4 A E {A,E} 1 ## 5 B C {B,D,C} 2 ## 6 B D {B,D} 1 ## 7 B E {B,D,E} 2 ## 8 C D {C,D} 1 ## 9 C E {C,E} 1 ## 10 D E {D,E} 1 ``` --- class: left, middle, rstudio-logo ## Overall Network Structure: Network diameter <img src="6-graph_metrics_files/figure-html/unnamed-chunk-13-1.png" width="400" height="300" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ```r diameter(sg2) ``` ``` ## [1] 3 ``` --- class: left, middle, rstudio-logo, bigfont ## Overall Network Structure: Cliques Cliques are subsets of vertices in an undirected graph whose induced subgraph is complete (has an edge density of 1). Cliques can help us identify tightly connected groups of people. How many cliques are in our network? <img src="6-graph_metrics_files/figure-html/unnamed-chunk-15-1.png" width="400" height="300" style="display: block; margin: auto;" /> --- class: left, middle, rstudio-logo ## Overall Network Structure: Cliques <img src="6-graph_metrics_files/figure-html/unnamed-chunk-16-1.png" width="400" height="400" style="float:left; padding-right:40px" style="display: block; margin: auto;" /> ```r # Size of the largest clique clique_num(d1) ``` ``` ## [1] 3 ``` ```r # Number of cliques between size 2 and 3 length(cliques(d1, min = 2, max = 3)) ``` ``` ## [1] 9 ``` --- class: left, middle, rstudio-logo ## Exercise - practicing overall network metrics Using a dataset on a network of dolphins, you will work through calculating various network metrics Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**. Let's work on **Exercise 1**. --- class: left, middle, rstudio-logo ## Structural Position of People In networks involving people, understanding their structural position within the network is of significant interest. We can use these metrics as a way to undestand who is important or central within a givin network. Questions you might consider: - Who serves as a "bridge" between various parts of an organization? If these people depart, you can end up with "islands" of people who feel disconnected. - Who has the highest number of connections? This might be a good person to consider for a leadership role, assuming they are viewed positively by their connections. - Who is connected to a lot of well-connected people? These people can be helpful for facilitating introductions. --- class: left, middle, rstudio-logo, bigfont ## Structural Position: Node Centrality **Degree centrality** is simply the count of connections for each node. **Closeness centrality** is a measure of how central or close a node is to other nodes. Information spreads quickly through a network if it starts with those with high closeness centrality. To calculate closeness centrality for a node `\(v\)`: 1. Take all nodes connected to `\(v\)` and calculate the distance between each node and `\(v\)`. 2. Take the average and invert it (so that higher means closer) --- class: left, middle, rstudio-logo, bigfont ## Structural Position: Node Centrality **Betweenness centrality** is a measure of how important a node is in the overall connectedness of the network. To derive the betweenness centrality of a node `\(v\)`: 1. Take any pairs of nodes that are not `\(v\)` and calculate the number of shortest paths between them. 2. Determine how many of those paths pass through `\(v\)`. 3. Divide 2 by 1 and sum across all pairs of nodes in the network. --- class: left, middle, rstudio-logo ## Back to our undirected graph Let's create the graph together and learn how to work with it in R. We'll use the `tidygraph` package for a more convenient way to work with `igraph`. ```r # our edgelist edges_gr <- data.frame( from = c("A","A","A","B","B","C","E","B"), to = c("B","D","E","C","D","D","F","A") ) # create a graph from the dataframe using igraph gr <- igraph::graph_from_data_frame(edges_gr, directed = FALSE) # the tidygraph package gives us a really simple way to work with gr_tidy <- igraph::simplify(gr) |> # remove the A->B/B->A duplication tidygraph::as_tbl_graph() ``` --- class: left, middle, rstudio-logo ## Back to our undirected graph ```r gr_tidy ``` ``` ## # A tbl_graph: 6 nodes and 7 edges ## # ## # An undirected simple graph with 1 component ## # ## # Node Data: 6 x 1 (active) ## name ## <chr> ## 1 A ## 2 B ## 3 C ## 4 E ## 5 D ## 6 F ## # ## # Edge Data: 7 x 2 ## from to ## <int> <int> ## 1 1 2 ## 2 1 4 ## 3 1 5 ## # … with 4 more rows ``` --- class: left, middle, rstudio-logo ## Centrality of our undirected graph Now we can use convenient `tidyverse` functions and create a quick table of centrality values for each node. ```r c_vals <- gr_tidy |> dplyr::rename(NODE = name) |> dplyr::mutate( DEGREE_CENT = tidygraph::centrality_degree(), BTWN_CENT = tidygraph::centrality_betweenness(), CLOSE_CENT = tidygraph::centrality_closeness() ) |> kbl() |> kable_minimal() ``` --- class: left, middle, rstudio-logo ## Centrality of our undirected graph Now we can use convenient `tidyverse` functions and create a quick table of centrality values for each node. <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> NODE </th> <th style="text-align:right;"> DEGREE_CENT </th> <th style="text-align:right;"> BTWN_CENT </th> <th style="text-align:right;"> CLOSE_CENT </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 0.1428571 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> 0.1250000 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 0.0909091 </td> </tr> <tr> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4.0 </td> <td style="text-align:right;"> 0.1111111 </td> </tr> <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> 0.1250000 </td> </tr> <tr> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 0.0769231 </td> </tr> </tbody> </table> --- class: left, middle, rstudio-logo ## Exercise - centrality measures Using the same dolphin network dataset, you will work through calculating centrality measures. Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**. Let's work on **Exercise 2**. --- class: left, middle, rstudio-logo, bigfont ## The Formation of Organizational Networks We can only measure the structure of networks that we already know exist. Oftentimes, within a formal organizational network, smaller informal networks may exist and the firm may not know about them. Network analysis provides us with techniques that not only allow us to find these hidden networks, but also to understand what led to their formation: - Community Detection - Assortativity --- class: left, middle, rstudio-logo ## Community detection Network communities are "hidden" subsets that exist within a network. Through community detection methods, we can sometimes uncover interesting similarities between nodes that we wouldn't have otherwise noticed. --- class: left, middle, rstudio-logo ## Louvain algorithm (1/2) One commonly used community detection algorithm is the Louvain algorithm, which partitions the graph into subsets of vertices by trying to maximize the *modularity* of the graph. Modularity measures how dense the connections are within subsets of vertices in a graph by comparing the density to that which would be expected by a random graph. Modularity ranges from -0.5 to 1 and any positive value indicates the vertices inside the subgroups are more densely connected than would be expected by chance. ```r # find Louvain communities communities <- cluster_louvain(sf_gr) # assign as a vertex property V(sf_gr)$community <- membership(communities) # How large are they? sizes(communities) ``` ``` ## Community sizes ## 1 2 3 4 5 6 7 8 9 10 ## 39 3 40 37 26 20 8 29 3 3 ``` --- class: left, middle, rstudio-logo ## Louvain algorithm (2/2) ```r set.seed(123) ggraph(sf_gr, layout = "fr") + geom_edge_link(color = "grey") + geom_node_point(aes(color = as.factor(community)), show.legend = FALSE) + theme_void() ``` <img src="6-graph_metrics_files/figure-html/unnamed-chunk-24-1.png" width="400" height="400" style="display: block; margin: auto;" /> --- class: left, middle, rstudio-logo ## Exercise - community detection measures Using the dolphin network dataset, you will work through finding a plotting communities. Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**. Let's work on **Exercise 3**. --- class: left, middle, rstudio-logo, bigfont ## Assortativity Homophily is the tendency for similar people to be connected to each other, i.e., "birds of a feather flock together". When people share common traits, it is sometimes easier to form relationships. Examples include - education - political beliefs - social class - hobbies - age - gender We can measure homophily in a network using assortativity. --- class: left, middle, rstudio-logo, bigfont ## Assortativity for nominal data For undirected graphs: $$ `\begin{equation} r = \frac{\sum_{i}e_{ii} - \sum_{i}a_i^2}{1-\sum_{i}a_i^2} \end{equation}` $$ where: - `\(e_{ii}\)` is the fraction of edges between nodes node of type `\(i\)` to one of type `\(i\)`. - `\(a_i\)` is the fraction of each type of edge that is connected to a node of type `\(i\)` Assortativity ranges from -1 to 1, 1 meaning people only connect with people like them, 0 meaning people connect with all sorts of people equally, and -1 meaning people only connect with people unlike them. --- class: left, middle, rstudio-logo ## Example Suppose we want to understand if dog owners tend to be connected to other dog owners, and if non-dog owners tend to be connected to other non-dog owners. <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Dog </th> <th style="text-align:right;"> No dog </th> <th style="text-align:right;"> `\(a_i\)` </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Dog </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 0.20 </td> <td style="text-align:right;"> 0.54 </td> </tr> <tr> <td style="text-align:left;"> No dog </td> <td style="text-align:right;"> 0.20 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.46 </td> </tr> <tr> <td style="text-align:left;"> `\(a_i\)` </td> <td style="text-align:right;"> 0.54 </td> <td style="text-align:right;"> 0.46 </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> $$ `\begin{equation} r = \frac{(0.34+0.26)-(0.54^2+0.46^2)}{1-(0.54^2+0.46^2)} = 0.195 \end{equation}` $$ --- class: left, middle, rstudio-logo, bigfont ## Assortativity for numeric data $$ `\begin{equation} r = \frac{\sum_{xy}xy(e_{xy}-a_xb_y)}{\sigma_a\sigma_b} \end{equation}` $$ where: - `\(e_{xy}\)` is the fraction of edges joining nodes with values `\(x\)` and `\(y\)` - `\(a_x\)` is the fraction of edges that start and end at nodes with values of `\(x\)` - `\(b_y\)` is the fraction of edges that start and end at nodes with values of `\(y\)` - `\(\sigma_a\)` is the standard deviation of the distribution of `\(a_x\)` - `\(\sigma_b\)` is the standard deviation of the distribution of `\(b_y\)` --- class: left, middle, rstudio-logo, bigfont ## Using `igraph` to calculate assortativity Let's use the workfrance edgelist. Suppose we want to know if people tend to work with people in their same department. ```r url <- "https://ona-book.org/data/workfrance_edgelist.csv" workfrance_edgelist <- read.csv(url) head(workfrance_edgelist) ``` ``` ## from to mins ## 1 3 159 8 ## 2 3 253 14 ## 3 3 447 17 ## 4 3 498 10 ## 5 3 694 7 ## 6 3 751 7 ``` --- class: left, middle, rstudio-logo, bigfont ## Adding relevant feature We'll need to pull in department in order to add this to the graph we will be creating. ```r url <- "https://ona-book.org/data/workfrance_vertices.csv" workfrance_vertices <- read.csv(url) head(workfrance_vertices) ``` ``` ## id dept ## 1 89 DCAR ## 2 97 DCAR ## 3 118 DCAR ## 4 220 DCAR ## 5 378 DCAR ## 6 656 DCAR ``` --- class: left, middle, rstudio-logo, bigfont ## Create graph and calculate assortativity ```r gr <- workfrance_edgelist[1:2] |> igraph::graph_from_data_frame( directed = F, vertices = workfrance_vertices ) assortativity_nominal(gr, factor(V(gr)$dept)) ``` ``` ## [1] 0.684411 ``` --- class: left, middle, rstudio-logo ## Exercise - Assortativity Using a dataset related to emails, we will work through calculating an assortativity example. Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**. Let's work on **Exercise 4**. --- class: left, middle, rstudio-logo # 🍱 Lunchtime! 😋