Network Analysis II

class: center, middle, inverse, title-slide

# Network Analysis II
### <b>rstudio::</b>conf(2022)

---

class: left, middle, rstudio-logo, bigfont

## Aim of this module

&#9989; Review basic network structure metrics
  - Measures of the overall network structure
  - Measures of structural positions in a network
  
&#9989; Understand relationships between nodes
  - Community detection methods
  - Assortativity and Similarity
  
  
---
class: left, middle, rstudio-logo

## The Structure of Organizational Networks

Organizational network structure can be characterized by the pattern of relationships among people that provide both opportunities and constraints.

There are two broad ways to measure network structure:
  1. <b>Overall Network Structure</b>: Allows us to look at structural differences between two or more organizational networks.
  2. <b>Structural Position of Nodes</b>: Allows us to quantify and describe a node's relationship within a single network.

---
class: left, middle, rstudio-logo

## Overall Organizational Network Structure

Measures of overall network structure can answer many different questions in People Analytics:

- How do differences in connectivity across offices relate to office-level turnover?
- How does the size of different work groups impact their overall performance?
- Does your finance department share information with your HR department?

These are all questions that measures of overall network structure can help answer.

---
class: left, middle, rstudio-logo

## Overall Network Structure: Path Metrics

Paths can help us understand how people are connected and how information spreads. Consider the graphs below and suppose the nodes are people. Person A wants to be introduced to Person C.  How can we facilitate that introduction, i.e., how can we get from A to C?

---
class: left, middle, rstudio-logo

## Overall Network Structure: Directed and Undirected Paths

There is only 1 way in the directed graph, but many in the undirected graph.

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Directed </th>
   <th style="text-align:left;"> Undirected </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A-&gt;B, B-&gt;C </td>
   <td style="text-align:left;"> A-&gt;B, B-&gt;C </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> A-&gt;D, D-&gt;C </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> A-&gt;B, B-&gt;D, D-&gt;C </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> A-&gt;D, D-&gt;B, B-&gt;C </td>
  </tr>
</tbody>
</table>
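
The path counts in the table can be checked with `igraph::all_simple_paths()`. This is a minimal sketch under assumptions: the undirected edges are the ones implied by the table, while the edge directions in `g_directed` are hypothetical, chosen so that only one route leads from A to C.

```r
library(igraph)

# Undirected edges implied by the paths listed in the table
edges <- data.frame(
  from = c("A", "B", "A", "D", "B"),
  to   = c("B", "C", "D", "C", "D")
)
g_undirected <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical directions: A -> B -> C is the only way from A to C
edges_directed <- data.frame(
  from = c("A", "B", "D", "C", "D"),
  to   = c("B", "C", "A", "D", "B")
)
g_directed <- graph_from_data_frame(edges_directed, directed = TRUE)

paths_undirected <- all_simple_paths(g_undirected, from = "A", to = "C")
paths_directed   <- all_simple_paths(g_directed, from = "A", to = "C", mode = "out")

length(paths_undirected)        # simple paths when direction is ignored
length(paths_directed)          # simple paths when direction matters
lapply(paths_undirected, as_ids)
```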

---
class: left, middle, rstudio-logo

## Overall Network Structure: Distance

The distance between two nodes is the sum of the weights of the edges traversed in the path. For graphs without a weight property, every edge is assumed to have weight 1.  Suppose these weights are the strength of the relationship.  Now which path would you choose to introduce Person A to Person C?

```r
distances(d1,weights = NULL)
```

```
##   A B  C E  D  F
## A 0 3  4 2  4  6
## B 3 0  1 5  6  9
## C 4 1  0 6  5 10
## E 2 5  6 0  6  4
## D 4 6  5 6  0 10
## F 6 9 10 4 10  0
```
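
To see how weights change the answer, here is a small sketch on a hypothetical three-person graph (the edge list and weights are made up for illustration, not taken from `d1`). It compares weighted distance with plain hop counts.

```r
library(igraph)

# Hypothetical weighted edge list; the `weight` column becomes an edge attribute
edges <- data.frame(
  from   = c("A", "B", "A"),
  to     = c("B", "C", "C"),
  weight = c(3, 1, 5)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# weights = NULL (the default) uses the `weight` attribute when it exists
d_weighted <- distances(g, weights = NULL)

# weights = NA ignores the attribute, so every edge counts as 1
d_hops <- distances(g, weights = NA)

d_weighted["A", "C"]  # A-B-C (3 + 1) is shorter than the direct edge (5)
d_hops["A", "C"]      # the direct edge is a single hop
```

Note that `distances()` always minimizes the summed weight, so whether a large weight means "far" or "strong" is an interpretation you need to settle before reading the result.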

---
class: left, middle, rstudio-logo

## Overall Network Structure: Network size

It is often useful to simply understand how large the network is.  How many nodes and edges are we dealing with?

```r
# Number of nodes
vcount(sg)
```

```
## [1] 6
```

```r
# Number of edges
ecount(sg)
```

```
## [1] 5
```

---
class: left, middle, rstudio-logo

## Overall Network Structure: Network density

Network density is defined as the number of actual edges divided by the number of possible edges. A graph with a density of 1 is a *complete* graph, and a graph with low density is considered *sparse*.  A network with high density will be able to disperse information faster than a network with low density.

- Actual Ties: 5
- Possible Ties = `$\frac{N(N-1)}{2} = 15$`

```r
# Network density
edge_density(sg)
```

```
## [1] 0.3333333
```
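
The same number can be reproduced by hand. The sketch below assumes a hypothetical 6-node, 5-edge graph standing in for `sg` (the actual edges of `sg` are not shown on the slides).

```r
library(igraph)

# Hypothetical stand-in for `sg`: 6 nodes and 5 undirected edges
edges <- data.frame(
  from = c("A", "B", "C", "D", "E"),
  to   = c("B", "C", "D", "A", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

n             <- vcount(g)
actual_ties   <- ecount(g)
possible_ties <- n * (n - 1) / 2   # undirected graph, no self-loops

actual_ties / possible_ties        # density computed by hand
edge_density(g)                    # density from igraph
```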

---
class: left, middle, rstudio-logo

## Overall Network Structure: Components

It can be useful to know the number of **components** in a network.  A component is a group of nodes that are connected to each other but disconnected from the rest of the network.  In a people network, it is of particular interest to understand and minimize the number of small components, or "islands", of people who feel disconnected from the broader population.

```r
# Component membership, component sizes, and number of components
components(sg)
```

```
## $membership
## A B C E D F 
## 1 1 1 2 1 2 
## 
## $csize
## [1] 4 2
## 
## $no
## [1] 2
```
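
One way to turn this output into a list of "islanders" is sketched below. The graph is a hypothetical stand-in for `sg`, and `island_threshold` is an assumed cut-off, since what counts as a small component is a judgment call.

```r
library(igraph)

# Hypothetical graph: one group of four and one island of two
edges <- data.frame(
  from = c("A", "B", "C", "D", "E"),
  to   = c("B", "C", "D", "A", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

comp <- components(g)
comp$no      # number of components
comp$csize   # size of each component

# Flag people who sit in components smaller than the chosen threshold
island_threshold <- 3
small_components <- which(comp$csize < island_threshold)
islanders <- names(comp$membership)[comp$membership %in% small_components]
islanders
```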

---
class: left, middle, rstudio-logo

## Overall Network Structure: Network diameter

Network diameter is defined as the longest of the shortest paths in a network.  This metric only makes sense in a *connected* graph, because in a *disconnected* graph it is simply `$\infty$`. For the graph below, what is the diameter of the network?

```
##    From To Shortest.Path Distance
## 1     A  B     {A,E,D,B}        3
## 2     A  C       {A,E,C}        2
## 3     A  D       {A,E,D}        2
## 4     A  E         {A,E}        1
## 5     B  C       {B,D,C}        2
## 6     B  D         {B,D}        1
## 7     B  E       {B,D,E}        2
## 8     C  D         {C,D}        1
## 9     C  E         {C,E}        1
## 10    D  E         {D,E}        1
```

---
class: left, middle, rstudio-logo

## Overall Network Structure: Network diameter

```r
diameter(sg2)
```

```
## [1] 3
```
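
The result can be traced back to the shortest-path table. This sketch rebuilds the graph from the distance-1 pairs in that table (an assumption about how `sg2` was built) and takes the largest entry of the distance matrix.

```r
library(igraph)

# Edges are the pairs with distance 1 in the shortest-path table
edges <- data.frame(
  from = c("A", "B", "C", "C", "D"),
  to   = c("E", "D", "D", "E", "E")
)
g <- graph_from_data_frame(edges, directed = FALSE)

d <- distances(g)   # all pairwise shortest-path lengths
max(d)              # the longest of the shortest paths
diameter(g)         # the same value from igraph
get_diameter(g)     # one vertex sequence that realizes the diameter
```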

---
class: left, middle, rstudio-logo, bigfont

## Overall Network Structure: Cliques

Cliques are subsets of vertices in an undirected graph whose induced subgraph is complete (has an edge density of 1). Cliques can help us identify tightly connected groups of people.

How many cliques are in our network?

---
class: left, middle, rstudio-logo

## Overall Network Structure: Cliques

```r
# Size of the largest clique
clique_num(d1)
```

```
## [1] 3
```

```r
# Number of cliques between size 2 and 3
length(cliques(d1, min = 2, max = 3))
```

```
## [1] 9
```
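
To see where the count comes from, the cliques can be broken down by size. The sketch assumes the seven undirected edges used for the example graph later in this module; whether `d1` has exactly these edges is an assumption.

```r
library(igraph)

# Assumed edge list: the seven undirected edges of the module's example graph
edges <- data.frame(
  from = c("A", "A", "A", "B", "B", "C", "E"),
  to   = c("B", "D", "E", "C", "D", "D", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

cl <- cliques(g, min = 2, max = 3)
clique_sizes <- sapply(cl, length)

# How many cliques of each size? Size-2 cliques are simply the edges.
table(clique_sizes)

# The largest cliques, shown as vertex names
lapply(largest_cliques(g), as_ids)
```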

---
class: left, middle, rstudio-logo

## Exercise - practicing overall network metrics

Using a dataset on a network of dolphins, you will work through calculating various network metrics.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**.

Let's work on **Exercise 1**.

---
class: left, middle, rstudio-logo

## Structural Position of People

In networks involving people, understanding their structural position within the network is of significant interest. We can use these metrics as a way to understand who is important or central within a given network.

Questions you might consider:

- Who serves as a "bridge" between various parts of an organization?  If these people depart, you can end up with "islands" of people who feel disconnected.
- Who has the highest number of connections?  This might be a good person to consider for a leadership role, assuming they are viewed positively by their connections.
- Who is connected to a lot of well-connected people?  These people can be helpful for facilitating introductions.

---
class: left, middle, rstudio-logo, bigfont

## Structural Position: Node Centrality

**Degree centrality** is simply the count of connections for each node.

**Closeness centrality** is a measure of how central or close a node is to other nodes.  Information spreads quickly through a network if it starts with those with high closeness centrality.  To calculate closeness centrality for a node `$v$`:

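1.  Take all nodes reachable from `$v$` and calculate the distance between each node and `$v$`.
2.  Take the average and invert it (so that higher means closer). By default, `igraph` inverts the sum of the distances; it inverts the average when `normalized = TRUE`.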
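
The two steps can be sketched by hand and compared with `igraph::closeness()`. The edge list below is assumed to be the module's undirected example graph.

```r
library(igraph)

# Assumed edge list: the module's undirected example graph
edges <- data.frame(
  from = c("A", "A", "A", "B", "B", "C", "E"),
  to   = c("B", "D", "E", "C", "D", "D", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

d <- distances(g)                     # step 1: distances between all pairs
n <- vcount(g)

close_sum  <- 1 / rowSums(d)          # inverse of the total distance
close_mean <- (n - 1) / rowSums(d)    # step 2: inverse of the average distance

close_sum["A"]
closeness(g)["A"]                     # igraph default: inverse of the sum
closeness(g, normalized = TRUE)["A"]  # normalized: inverse of the average
```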

---
class: left, middle, rstudio-logo, bigfont
## Structural Position: Node Centrality

**Betweenness centrality** is a measure of how important a node is in the overall connectedness of the network.  To derive the betweenness centrality of a node `$v$`:

1.  Take every pair of nodes that does not include `$v$` and count the shortest paths between them.
2.  Count how many of those shortest paths pass through `$v$`.
3.  Divide the count from step 2 by the count from step 1, then sum across all pairs of nodes in the network.
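
The three steps can be sketched for a single node and compared with `igraph::betweenness()`. The edge list is assumed to be the module's undirected example graph, and the choice of node B is arbitrary.

```r
library(igraph)

# Assumed edge list: the module's undirected example graph
edges <- data.frame(
  from = c("A", "A", "A", "B", "B", "C", "E"),
  to   = c("B", "D", "E", "C", "D", "D", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

v      <- "B"
others <- setdiff(V(g)$name, v)
pairs  <- combn(others, 2)   # every pair of nodes that excludes v

share_through_v <- apply(pairs, 2, function(p) {
  sp <- all_shortest_paths(g, from = p[1], to = p[2])
  # newer igraph releases return `vpaths`; older ones return `res`
  paths <- if (!is.null(sp$vpaths)) sp$vpaths else sp$res
  # step 2: shortest paths between the pair that pass through v
  through <- sum(vapply(paths, function(path) v %in% as_ids(path), logical(1)))
  through / length(paths)    # step 2 divided by step 1
})

btw_manual <- sum(share_through_v)   # step 3: sum over all pairs
btw_manual
betweenness(g)[v]                    # igraph's value for comparison
```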
  
  
---
class: left, middle, rstudio-logo

## Back to our undirected graph

Let's create the graph together and learn how to work with it in R. We'll use the `tidygraph` package for a more convenient way to work with `igraph`.

```r
# our edgelist
edges_gr <- data.frame(
  from = c("A","A","A","B","B","C","E","B"),
  to = c("B","D","E","C","D","D","F","A")
)

# create a graph from the dataframe using igraph
gr <- igraph::graph_from_data_frame(edges_gr, directed = FALSE)

# the tidygraph package gives us a really simple way to work with igraph objects
gr_tidy <- igraph::simplify(gr) |> # remove the A->B/B->A duplication
  tidygraph::as_tbl_graph()
```

---
class: left, middle, rstudio-logo

## Back to our undirected graph

```r
gr_tidy
```

```
## # A tbl_graph: 6 nodes and 7 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 6 x 1 (active)
##   name 
##   <chr>
## 1 A    
## 2 B    
## 3 C    
## 4 E    
## 5 D    
## 6 F    
## #
## # Edge Data: 7 x 2
##    from    to
##   <int> <int>
## 1     1     2
## 2     1     4
## 3     1     5
## # … with 4 more rows
```
---
class: left, middle, rstudio-logo

## Centrality of our undirected graph

Now we can use convenient `tidyverse` functions and create a quick table of centrality values for each node.

```r
c_vals <-
  gr_tidy |>
  dplyr::rename(NODE = name) |>
  dplyr::mutate(
    DEGREE_CENT = tidygraph::centrality_degree(),
    BTWN_CENT = tidygraph::centrality_betweenness(),
    CLOSE_CENT = tidygraph::centrality_closeness()
  ) |>
  tibble::as_tibble() |> # extract the node data as a plain table
  kableExtra::kbl() |>
  kableExtra::kable_minimal()
```

---
class: left, middle, rstudio-logo

## Centrality of our undirected graph

Now we can use convenient `tidyverse` functions and create a quick table of centrality values for each node.

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> NODE </th>
   <th style="text-align:right;"> DEGREE_CENT </th>
   <th style="text-align:right;"> BTWN_CENT </th>
   <th style="text-align:right;"> CLOSE_CENT </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 6.0 </td>
   <td style="text-align:right;"> 0.1428571 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 1.5 </td>
   <td style="text-align:right;"> 0.1250000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.0 </td>
   <td style="text-align:right;"> 0.0909091 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4.0 </td>
   <td style="text-align:right;"> 0.1111111 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 1.5 </td>
   <td style="text-align:right;"> 0.1250000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> F </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.0 </td>
   <td style="text-align:right;"> 0.0769231 </td>
  </tr>
</tbody>
</table>

---
class: left, middle, rstudio-logo

## Exercise - centrality measures

Using the same dolphin network dataset, you will work through calculating centrality measures.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**.

Let's work on **Exercise 2**.

---
class: left, middle, rstudio-logo, bigfont

## The Formation of Organizational Networks

We can only measure the structure of networks that we already know exist. Oftentimes, within a formal organizational network, smaller informal networks may exist and the firm may not know about them.

Network analysis provides us with techniques that not only allow us to find these hidden networks, but also to understand what led to their formation: 
  - Community Detection
  - Assortativity
  
---
class: left, middle, rstudio-logo

## Community detection

Network communities are "hidden" subsets that exist within a network.  Through community detection methods, we can sometimes uncover interesting similarities between nodes that we wouldn't have otherwise noticed.

---
class: left, middle, rstudio-logo

## Louvain algorithm (1/2)

One commonly used community detection algorithm is the Louvain algorithm, which partitions the graph into subsets of vertices by trying to maximize the *modularity* of the graph. Modularity measures how dense the connections are within subsets of vertices in a graph by comparing that density to what would be expected in a random graph.  Modularity ranges from -0.5 to 1, and any positive value indicates that the vertices inside the subgroups are more densely connected than would be expected by chance.

```r
# find Louvain communities
communities <- cluster_louvain(sf_gr)

# assign as a vertex property
V(sf_gr)$community <- membership(communities)

# How large are they?
sizes(communities)
```

```
## Community sizes
##  1  2  3  4  5  6  7  8  9 10 
## 39  3 40 37 26 20  8 29  3  3
```
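
Modularity is easier to read on a toy case. The sketch below uses a hypothetical graph of two triangles joined by one bridge (not `sf_gr`) and scores two hand-picked partitions before letting Louvain search on its own.

```r
library(igraph)

# Hypothetical toy graph: triangles 1-2-3 and 4-5-6 joined by the bridge 3-4
g <- make_graph(c(1,2, 2,3, 1,3, 4,5, 5,6, 4,6, 3,4), directed = FALSE)

# Modularity of the split into the two triangles
q_triangles <- modularity(g, membership = c(1, 1, 1, 2, 2, 2))
q_triangles

# Modularity when everyone is placed in a single community
q_single <- modularity(g, membership = rep(1, 6))
q_single

# Louvain searches for a high-modularity partition
set.seed(123)
lc <- cluster_louvain(g)
membership(lc)
modularity(lc)
```

The single-community partition scores 0 because it is exactly as dense as the random expectation, while the two-triangle split scores well above 0.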

---
class: left, middle, rstudio-logo

## Louvain algorithm (2/2)

```r
set.seed(123)
ggraph(sf_gr, layout = "fr") +
  geom_edge_link(color =  "grey") +
  geom_node_point(aes(color = as.factor(community)),
                  show.legend = FALSE) +
  theme_void()
```

---
class: left, middle, rstudio-logo

## Exercise - community detection measures

Using the dolphin network dataset, you will work through finding and plotting communities.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**.

Let's work on **Exercise 3**.

---
class: left, middle, rstudio-logo, bigfont

## Assortativity

Homophily is the tendency for similar people to be connected to each other, i.e., "birds of a feather flock together". When people share common traits, it is sometimes easier to form relationships. Examples include:

- education
- political beliefs
- social class
- hobbies
- age
- gender

We can measure homophily in a network using assortativity.

---
class: left, middle, rstudio-logo, bigfont

## Assortativity for nominal data

For undirected graphs:

$$
`\begin{equation}
r = \frac{\sum_{i}e_{ii} - \sum_{i}a_i^2}{1-\sum_{i}a_i^2}
\end{equation}`
$$

where:

- `$e_{ii}$` is the fraction of edges that connect a node of type `$i$` to another node of type `$i$`.
- `$a_i$` is the fraction of edge ends that are attached to nodes of type `$i$`.

Assortativity ranges from -1 to 1, 1 meaning people only connect with people like them, 0 meaning people connect with all sorts of people equally, and -1 meaning people only connect with people unlike them.

---
class: left, middle, rstudio-logo

## Example

Suppose we want to understand if dog owners tend to be connected to other dog owners, and if non-dog owners tend to be connected to other non-dog owners.

$$
`\begin{equation}
r = \frac{(0.34+0.26)-(0.54^2+0.46^2)}{1-(0.54^2+0.46^2)} = 0.195
\end{equation}`
$$
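
The arithmetic can be reproduced from a mixing matrix. In this sketch the diagonal values come from the formula above, while the off-diagonal value of 0.20 is inferred: each row must sum to its `$a_i$` (0.34 + 0.20 = 0.54 and 0.20 + 0.26 = 0.46).

```r
# Mixing matrix: rows and columns are (dog owner, non-dog owner)
# Off-diagonal entries are inferred so that the row sums equal 0.54 and 0.46
e <- matrix(c(0.34, 0.20,
              0.20, 0.26), nrow = 2, byrow = TRUE)

a <- rowSums(e)   # fraction of edge ends attached to each type

r <- (sum(diag(e)) - sum(a^2)) / (1 - sum(a^2))
round(r, 3)
```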

---
class: left, middle, rstudio-logo, bigfont

## Assortativity for numeric data

$$
`\begin{equation}
r = \frac{\sum_{xy}xy(e_{xy}-a_xb_y)}{\sigma_a\sigma_b}
\end{equation}`
$$

where:

- `$e_{xy}$` is the fraction of edges joining nodes with values `$x$` and `$y$`
- `$a_x$` is the fraction of edges that start at nodes with value `$x$`
- `$b_y$` is the fraction of edges that end at nodes with value `$y$`
- `$\sigma_a$` is the standard deviation of the distribution of `$a_x$`
- `$\sigma_b$` is the standard deviation of the distribution of `$b_y$`
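
For an undirected graph, this coefficient is the Pearson correlation between the values at the two ends of each edge, with every edge counted in both directions. The sketch below checks that against `igraph::assortativity()`; the edge list is assumed to be the module's example graph and the `tenure` values are hypothetical.

```r
library(igraph)

# Assumed edge list: the module's undirected example graph
edges <- data.frame(
  from = c("A", "A", "A", "B", "B", "C", "E"),
  to   = c("B", "D", "E", "C", "D", "D", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical numeric attribute, e.g. years of tenure per person
tenure <- c(A = 1, B = 2, C = 8, D = 7, E = 3, F = 4)
V(g)$tenure <- unname(tenure[V(g)$name])

el <- ends(g, E(g), names = FALSE)      # vertex ids at both ends of each edge
x  <- V(g)$tenure[c(el[, 1], el[, 2])]  # each edge counted in both directions
y  <- V(g)$tenure[c(el[, 2], el[, 1])]

r_manual <- cor(x, y)                   # Pearson correlation across edge ends
r_igraph <- assortativity(g, V(g)$tenure, directed = FALSE)

r_manual
r_igraph
```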

---
class: left, middle, rstudio-logo, bigfont

## Using `igraph` to calculate assortativity

Let's use the workfrance edgelist. Suppose we want to know if people tend to work with people in their same department.

```r
url <- "https://ona-book.org/data/workfrance_edgelist.csv"
workfrance_edgelist <- read.csv(url)
head(workfrance_edgelist)
```

```
##   from  to mins
## 1    3 159    8
## 2    3 253   14
## 3    3 447   17
## 4    3 498   10
## 5    3 694    7
## 6    3 751    7
```

---
class: left, middle, rstudio-logo, bigfont

## Adding relevant feature

We'll need to pull in department in order to add this to the graph we will be creating.

```r
url <- "https://ona-book.org/data/workfrance_vertices.csv"
workfrance_vertices <- read.csv(url)
head(workfrance_vertices)
```

```
##    id dept
## 1  89 DCAR
## 2  97 DCAR
## 3 118 DCAR
## 4 220 DCAR
## 5 378 DCAR
## 6 656 DCAR
```

---
class: left, middle, rstudio-logo, bigfont

## Create graph and calculate assortativity

```r
gr <- workfrance_edgelist[1:2] |>
  igraph::graph_from_data_frame(
    directed = FALSE,
    vertices = workfrance_vertices
  )

assortativity_nominal(gr, factor(V(gr)$dept))
```

```
## [1] 0.684411
```
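
To connect this result to the formula for nominal data, here is a sketch on a hypothetical six-person graph (not the workfrance data) that builds the mixing matrix by hand and compares it with `assortativity_nominal()`.

```r
library(igraph)

# Hypothetical toy data: two departments with mostly within-department ties
vertices <- data.frame(
  id   = 1:6,
  dept = c("HR", "HR", "HR", "FIN", "FIN", "FIN")
)
edges <- data.frame(
  from = c(1, 2, 1, 4, 5, 4, 3),
  to   = c(2, 3, 3, 5, 6, 6, 4)
)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = vertices)

types <- as.integer(factor(V(g)$dept))

# Mixing matrix: share of edge ends joining type i to type j
el <- ends(g, E(g), names = FALSE)
e  <- table(types[c(el[, 1], el[, 2])],
            types[c(el[, 2], el[, 1])]) / (2 * ecount(g))
a  <- rowSums(e)

r_manual <- (sum(diag(e)) - sum(a^2)) / (1 - sum(a^2))
r_igraph <- assortativity_nominal(g, types, directed = FALSE)

r_manual
r_igraph
```

Only one of the seven ties crosses departments here, so the coefficient is high, in the same spirit as the workfrance result.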

---
class: left, middle, rstudio-logo

## Exercise - Assortativity

Using a dataset related to emails, we will work through calculating an assortativity example.

Go to our [RStudio Cloud workspace](https://rstudio.cloud/spaces/230780/join?access_code=7cXJKFU1KUuuZGLwBVQpLG3dIxPUD3jak3ZQmESh) and start **Assignment 06-Graph_metrics**.

Let's work on **Exercise 4**.
---
class: left, middle, rstudio-logo

# &#127857; Lunchtime!  &#128523;