15 Non-Parametric Methods – R for Social Network Analysis

15.1 Packages Needed for this Chapter

library(statnet)
library(igraph)
library(ggraph)
library(graphlayouts)
library(patchwork)
library(networkdata)
library(tidyverse)

15.2 Conditional Uniform Graph Distributions (CUGs)

One widely used non-parametric approach is the Conditional Uniform Graph Distribution (CUG). CUGs define a reference distribution of graphs generated uniformly at random, but conditioned on one or more fixed characteristics of the observed network, such as the number of nodes, total number of edges, or degree distribution. By preserving these basic structural features, CUGs enable researchers to assess whether a more complex pattern (e.g., high reciprocity, homophily, or transitivity) is likely to have arisen by chance given the observed constraints.

Why do we do this? Well, in network analysis, statistical inference relies on some form of randomness; we assume that if a structure emerges more often than we would expect under random conditions, it likely reflects an underlying social mechanism. A CUG distribution formalizes this by defining a uniform distribution over a set of graphs, where each graph satisfies a fixed constraint, such as the same number of edges, node degrees, or dyadic relationships.

For example, if a network displays unusually high clustering, a CUG test can evaluate whether this is statistically exceptional or simply a byproduct of the network’s size and density. By comparing the observed statistic to the distribution of values from randomly generated but structurally constrained networks, researchers can determine the significance of the observed pattern. CUGs are especially useful when a full parametric model is unavailable, difficult to specify, or when the goal is to control for known network features while testing specific structural hypotheses.

Several types of CUGs can be defined:

𝒰 |E(L): All graphs with the same density as the observed graph.
𝒰 |L: All graphs with the same number of edges as the observed graph.
𝒰 |d: All graphs with the same degree distribution \(d = (d_1, d_2, ..., d_n)\).
𝒰 |MAN: All graphs with the same dyad census (i.e., counts of mutual, asymmetric, and null ties).

For example, in a CUG with fixed density (number of edges), each graph that meets the constraint is equally probable. Graphs that do not satisfy the constraint (e.g., different number of ties) are assigned a probability of zero. This results in a null model that is tightly defined and suitable for hypothesis testing.

15.3 Hypothesis Testing Procedure

The process follows classical null hypothesis testing logic:

Null Hypothesis (\(H_{0}\)): The observed network is drawn from a CUG model that preserves a given constraint (e.g., edge count).
Alternative Hypothesis (\(H_{1}\)): The observed network structure is not typical under this model and thus suggests a non-random social mechanism.

To test \(H_0\) we follow the below steps:

Define a summary statistic (e.g., degree centralization, reciprocity or transitivity).
Generate a large number of networks from the CUG distribution.
Compute the statistic for each simulated network.
Compare the observed statistic to the simulated distribution for that statistic.

If the observed value lies in the extreme tails (e.g., below the 2.5th percentile or above the 97.5th), we reject \(H_{0}\) at the corresponding 5% significance level, concluding that the observed structure likely reflects a genuine social mechanism rather than random chance. Since the tests are actually non-parametric (Monte Carlo) tests, we can estimate the \(p\)-value as the proportion of simulated test statistics that are as extreme or more extreme than the observed value. This provides a data-driven way to assess whether the observed statistic is likely under the null model, allowing for hypothesis testing without strong distributional assumptions.

The process is visualized in Figure 15.1 where the left panel shows the observed network, while the right panel displays the null distribution of the statistic based on 500 randomly generated graphs that preserve the same number of nodes and edges. The dashed red line indicates the observed value. If this value lies in the tail of the distribution, it suggests the observed structure is unlikely to have arisen by chance under the null model.

Figure 15.1: Conditional Uniform Graph (CUG) test of a network statistic of interest.

15.4 Running Example

We will use a running example, namely Coleman (Coleman 1964) which is available as graph object in the networkdata package and as an array in statnet (or more specifically sna). We will use these interchangeably depending on the function and package used for our analysis. The data consists of self-reported friendship ties among 73 boys in a small high school in Illinois over the 1957-1958 academic year. Networks of reported ties for all 73 informants are provided for two time points (fall and spring). We will only focus on the fall network here, which we load and visualize:

# Load data as graph object from networkdata and extract the fall network
coleman_g <- networkdata::coleman[[1]]

# plot using gggraph
ggraph(coleman_g, layout = "stress") + 
          geom_edge_link(edge_colour = "#666060", end_cap = circle(9,"pt"), 
                         n = 2, edge_width = 0.4, edge_alpha = 0.7, 
                         arrow = arrow(angle = 15, 
                         length = unit(0.1, "inches"), 
                         ends = "last", type = "closed"))  +
          geom_node_point(fill = "#525240",color = "#FFFFFF", 
                           size = 5, stroke = 1.1, shape = 21) + 
          theme_graph() + 
          ggtitle("fall friendship network") +
          theme(legend.position = "none")

15.5 Uniform Graph Distribution Given Expected Density

Here, we simulate graoh distributions while keeping the expected frequency from the observed network fixed. We aim to answer the follwoing quesiton: do we observed significantly many more reciporcal ties in our observed than what is expected by pure chance given random networks of the same expected density? So the statistic of interested here is number of reciprocal ties which we obtain from dyad census:

dyad_census(coleman_g)

$mut
[1] 62

$asym
[1] 119

$null
[1] 2447

# alterantiver one can also use the below
sum(which_mutual(coleman_g)) / 2

[1] 62

The number of reciprocal ties is equal to 62. In order to compute this density, we can use the igraph function edge_density().

density_obs <- edge_density(coleman_g)

Further we save number of nodes and edges of the observed network:

n_nodes <- vcount(coleman_g)

To simulate the null distribution we use the function igraph from the sample_gnp package. We simulate 1000 random networks with the parameters specified above.

# Simulate 1000 random graphs with same density
set.seed(123)
sim_g_dens <- replicate(1000, {
  sample_gnp(n = n_nodes, p = density_obs, directed = TRUE, loops = FALSE)
}, simplify = FALSE)

Note that the output consists of 1000 randomly generated networks as graph objects. While each individual simulated network may not contain exactly the same number of edges as the observed network, they are stochastically equivalent in terms of overall density.

We then compute the number of reciprocal (mutual) ties in each simulated network and visualize the resulting distribution under the null model of random tie allocation in Figure 15.2. A red dashed line is included in the plot to indicate the number of mutual ties observed in the actual network (62), highlighting how it compares to the distribution expected by chance.

Figure 15.2: Empirical distribution of reciprocal ties obtained from 1,000 random directed networks generated under the Conditional Uniform Graph (CUG) null model, holding constant the number of nodes and density (or expected number of edges). The vertical dashed line shows the number of mutual ties observed in the actual network, highlighting its deviation from the null model.

As shown in the plot, the observed number of mutual ties falls far in the right tail of the distribution. This indicates a substantial deviation from what would be expected under random tie formation. This can be interpreted as follows:

If ties in the network were allocated completely at random—while preserving the overall density—it would be highly unlikely to observe as many reciprocal ties as we do in the actual network.

Stated more formally, this analysis serves as a test of the following null hypothesis:

\(H_0\): The number of mutual ties in the observed network does not differ from what would be expected under random tie formation, given the observed network density.

\(H_1\): The observed number of mutual ties is significantly greater than expected by chance, indicating a tendency toward reciprocity beyond what density alone would predict.

Since the observed number of mutual ties falls far in the right tail of the simulated distribution, we see a clear and substantial deviation from the null model. Therefore, we reject the null hypothesis, concluding that the observed network exhibits significantly more reciprocity than would be expected by random chance alone. This indicates a strong tendency for mutual connections in the network that is not explained by density alone.

In order to compute the \(p\)-value, we compare the observed value to the distribution from the simulated null model:

# Empirical p-value (proportion of simulated mutual counts >= observed)
p_value <- mean(mutual_counts >= dyad_census(coleman_g)$mut)
p_value

[1] 0

Unsurprisingly, this value is equal to zero, as we do not have any part of the distribution ranging over the observed count. However, \(p\)-values from Monte Carlo tests are never exactly 0, but rather “less than 1 / number of simulations”, so a more correct statement would be that

\(p\) < 0.001 (based on 1000 simulations) providing very strong evidence against the null hypothesis.

Note

The rgraph() function from the sna (or statnet) package can also be used to simulate random networks with a specified expected density. However, note that its output consists of adjacency matrices (as a 3D array or list), rather than igraph graph objects. This requires additional conversion before applying igraph-based analyses.

15.6 Uniform Graph Distribution Given Number of Edges

In this section, we perform the same test but condition the random networks generated on the exact number of edges. So out null world is now stated as

\(H_0\): The number of mutual ties in the observed network is consistent with what would be expected under random edge assignment, given a fixed number of nodes and edges.

To simulate this null distribution under this new constraint, we generate 1000 random networks using the sample_gnm() function from the igraph package, which samples graphs with a fixed number of edges.

set.seed(123)  
# Parameters from the observed network
n_nodes <- vcount(coleman_g)
n_edges <- ecount(coleman_g)

# Simulate 1000 random directed graphs with fixed number of edges
sim_g_edges <- replicate(1000, {
  g <- sample_gnm(n = n_nodes, m = n_edges, directed = TRUE, loops = FALSE)
}, simplify = FALSE)

Note that each simulated network has exactly the same number of edges as the observed network, ensuring that any difference in mutual ties arises from the pattern of connections, not their quantity. We then compute the number of mutual ties in each simulated network and visualize the distribution, shown in Figure 15.3:

Figure 15.3: Empirical distribution of reciprocal ties obtained from 1000 random directed networks generated under the Conditional Uniform Graph (CUG) null model, holding constant the number of nodes and edges. The vertical dashed line shows the number of mutual ties observed in the actual network, highlighting its deviation from the null model.

The results are very similar to earlier; the observed number of mutual ties lies in the extreme right tail of the simulated distribution. This suggests a strong deviation from what would be expected under the null model of randomly assigned edges with no structural bias toward reciprocity.

Note

The rgnm() function from the sna (or statnet) package can also be used to simulate random networks with a specified expected density. However, note again that its output consists of adjacency matrices (as a 3D array or list).

15.7 Uniform Graph Distribution Given Dyad Census

Given the results above where we tested for reciprocity using CUG null models that preserved either network density or the total number of edges—we now move to a more constrained and informative baseline: null models conditioned on dyadic structure. Specifically, we consider CUG tests based on the dyad census, which preserve the number of mutual, asymmetric, and null dyads observed in the original network. This means that the basic dyadic processes, such as the overall tendency toward reciprocity, are held constant in the simulated networks.

Because mutual ties are fixed across all networks under this null model, the number of reciprocal ties can no longer be used as a test statistic (it will be identical in every simulation). As a result, we shift our focus to higher-order structural properties that emerge from patterns of connected dyads. One such property is transitivity (or triadic closure), which captures the tendency for actors who share a common partner to also be directly connected.

By conditioning on the dyad census and evaluating statistics like transitivity, we can assess whether the observed network exhibits more complex structural organization than would be expected by chance, even after accounting for baseline dyadic tendencies.

Our null hypothesis is now defined as:

\(H_0\): The number of complete triangles (transitive triads) observed in the network does not differ from what would be expected under random tie allocation, conditional on the observed dyadic processes (i.e., mutual, asymmetric, and null dyad frequencies).

Can we then say that there are more complete triangles than we expect by chance? Just how likely or unlikely is it to observe this many triangles? In order to answer this we again need to produce the world of hypothetical networks by simulation.

Note

Since igraph does not currently provide a function to generate random graphs with a fixed dyad census, we make use of rguman() from the sna package, and then convert them to graph objects, and use igraph::triad_census() to compute the triad-level statistics.

In directed netowrks, there are two triad types cprresponding to closed triads MAN-030 and MAN-300.

set.seed(123) 
# Generate 1000 graphs with fixed dyad census using rguman (from sna)
sim_nets_man <- rguman(n = 1000, nv = n_nodes, 
                        mut =  dyad_census(coleman_g)$mut, 
                        asym = dyad_census(coleman_g)$asym, 
                        null = dyad_census(coleman_g)$null, 
                        method = "exact")

# Convert to igraph objects
sim_g_man <- lapply(1:dim(sim_nets_man)[1], function(i) {
  graph_from_adjacency_matrix(sim_nets_man[i,,], mode = "directed")
})

# Get observed triad counts by index
obs_triad <- triad_census(coleman_g)
obs_030T <- obs_triad[9]   # position of 030T
obs_300  <- obs_triad[16]  # position of 300

# sim_graphs: list of igraph directed networks
sim_g_tc <- t(sapply(sim_g_man, triad_census))
sim_g_tc_df <- as.data.frame(sim_g_tc)

# Extract simulated triad counts 
triad_df <- sim_g_tc_df %>%
  select(9, 16) %>%
  rename(`Triad 030T` = 1, `Triad 300` = 2) %>%
  pivot_longer(cols = everything(), names_to = "Triad_Type", values_to = "Count")

# Create observed values for vertical lines
obs_df <- data.frame(
  Triad_Type = c("Triad 030T", "Triad 300"),
  Count = c(obs_030T, obs_300)
)

Now we have everything set to plot out the results as before to test out hypothesis. The results are shown in Figure 15.4.

Figure 15.4: Distributions of two types of closed triads, 030T (one mutual and two asymmetric ties) and 300 (three mutual ties), across 1000 randomly generated directed networks with a fixed dyad census. The red dashed lines indicate the counts of each triad type observed in the actual network. These plots assess whether the observed levels of triadic closure (23 and 22) deviate significantly from what would be expected under a null model that preserves mutual, asymmetric, and null dyads.

15.8 Uniform Graph Distribution Given Fixed Degree

Is the average geodesic distance in the observed network significantly shorter (or longer) than would be expected in random networks with the same in- and out-degree sequence? To answer this we turn to uniform graph distribution given fixed degree. The null world correspond now to the following:

\(H_0\): The average geodesic distance observed in the network is consistent with what would be expected under random graphs that preserve the in- and out-degree sequence.

This CUG test here evaluates whether the observed network is more (or less) efficiently connected than expected under degree-preserving randomization. The results are run and visualized below in Figure 15.5.

Figure 15.5: Distribution of average geodesic distance across 1000 random directed graphs preserving the in- and out-degree sequences of the observed network. The red dashed line indicates the observed value.

This indicates that the observed network is significantly more efficiently connected (in terms of shortest paths) than what would be expected under random edge arrangement, given the same degree distribution. In other words, the network’s structure enables actors to reach one another more quickly than random networks with the same node-level connectivity.

What’s the \(p\)-value? We compare the observed value to the distribution from the simulated null model:

# Empirical p-value (proportion of simulated mutual counts >= observed)
p_value <- mean(sim_geodist <= obs_dist)
p_value

[1] 0.002

Thus, we would reject the null. This example also highlights that you need to pay attention to which tail the observed value is located when computing the \(p\)-value (here left).

15.9 Analysing Homophily Using Non-Parametric Null Distribution

To illustrate how we can test for homophily using a non-paramtric approach, we turn to another data set with available node attributes, namely the one by Lazega (2001). This data set comes from a network study of corporate law partnership that was carried out in a Northeastern US corporate law firm, referred to as SG&R, 1988-1991 in New England. It includes (among others) measurements of networks among the 71 attorneys (partners and associates) of this firm, i.e. their coworker network, advice network, friendship network, and indirect control networks. Various members’ attributes are also part of the dataset, including seniority, formal status, office in which they work, gender, lawschool attended.

Two tests will be performed:

Test 1: Friendship based on gender
Test 2: Cowork among partners based on law practice

These two networks that are used in the following are visualized in Figure 15.6.

Figure 15.6: Visualization of two relational networks among lawyers in the Lazega dataset. Top: The friendship network, with nodes colored by gender. Bottom: The symmetrized coworking network among the 36 law firm partners, with nodes colored by practice area.

Test 1: Friendship based on gender

First, we examine gender-based homophily in a friendship network among corporate lawyers from the Lazega dataset. We load the data as a graph object from the package networkdata and vizualize it, with node color representing gender:

law_friends_g <-  networkdata::law_friends

We are here testing the following

\(H_0\): The number of same-gender ties in the network is consistent with a random distribution of ties (given network size and density).

\(H_1\): The observed network has significantly more (or fewer) same-gender ties than expected by chance.

To answer this, we use a non-parametric test based on randomly generated networks that match the observed network in size and density. This is effectively a CUG test under the null model 𝒰 |L, where ties are randomly distributed.

We start by converting the friendship network to an adjacency matrix and extracting the gender attribute from the nodes:

law_mat <- as.matrix(as_adjacency_matrix(law_friends_g, sparse = FALSE))
law_nodes <- vcount(law_friends_g)
law_edges <- sum(law_mat)
law_gender <- V(law_friends_g)$gender # gender extracted

Then, we compute the number of observed homophilous ties (same-gender friendships) in the observed network:

# 1 = male, 2 = female
homoph_obs <- sum(law_mat[law_gender == 1, law_gender == 1]) +
              sum(law_mat[law_gender == 2, law_gender == 2])

To generate networks from the null world, we simulate 1000 random directed graphs with the same number of nodes and edges as the observed network and then compute the number of homophilous ties in each simulated network (assuming the same ordering of gender assignments to nodes in each simulated network):

set.seed(123)
law_sim_g <- replicate(1000, {
  g <- sample_gnm(n = law_nodes, m = law_edges, directed = TRUE, loops = FALSE)
  as_adjacency_matrix(g, sparse = FALSE)
}, simplify = FALSE)

homoph_sim <- sapply(law_sim_g, function(mat) {
  sum(mat[law_gender == 1, law_gender == 1]) +
  sum(mat[law_gender == 2, law_gender == 2])
})

As before, we can create a histogram of the simulated homophilous tie counts and mark the observed value, as shown in Figure 15.7:

Figure 15.7: Distribution of same-gender friendship ties in 1000 random directed graphs preserving the number of nodes and total number of ties. The red dashed line marks the number of same-gender ties in the observed network.

The empirical \(p\)-value is computed as

mean(homoph_sim >= homoph_obs)

[1] 0

which implies that we reject the null hypothesis, concluding that gender-based homophily exists in the friendship network.

Test 2: Cowork among partners based on law practice

Our second test is whether partners in the law firm are more likely to collaborate (cowork) with others who share the same practice area (litigation or corporate) than we would expect by chance. This is a test of homophily based on professional specialization.

We begin by loading the law_cowork network from the networkdata package:

law_cw_g <- networkdata::law_cowork

We restrict the analysis to the first 36 lawyers, corresponding to the partners of the firm (as indicated by their “status” attribute).

The follwing hypotheses can be stated:

\(H_0\): The number of same-practice coworking ties is consistent with a random distribution of ties, given number of ties and edges of the network.

\(H_1\): The observed number of same-practice ties is significantly greater than expected under random conditions; suggesting homophily based on professional specialization.

Since coworking is a reciprocal relationship, we treat the network as undirected by symmetrizing the adjacency matrix:

# Create an adjacency matrix
law_mat_cwdir <- as.matrix(as_adjacency_matrix(law_cw_g, sparse = FALSE))
law_mat_cwdir <- law_mat_cwdir[1:36, 1:36]

# Symmetrize to create undirected matrix (co-ties must be mutual)
law_mat_cw <- (law_mat_cwdir == 1 & t(law_mat_cwdir) == 1) * 1

law_nodes_cw <- nrow(law_mat_cw)
law_ties_cw <- sum(law_mat_cw) / 2  # undirected: each tie counted twice

We now extract the binary attribute practice for each partner and store it as a vector:

law_attr_pract <- V(law_cw_g)$practice[1:36]

We define homophilous ties as coworking ties between two lawyers of the same practice area. We count both litigation–litigation and corporate–corporate ties:

homoph_obs_cw <- sum(
  law_mat_cw[law_attr_pract == 1, law_attr_pract == 1]) / 2 +
  sum(law_mat_cw[law_attr_pract == 2, law_attr_pract == 2]) / 2

To construct the null model, we generate 1000 random undirected graphs with the same number of nodes and total number of ties:

set.seed(77)
law_sim_cw <- replicate(1000, {
  g <- sample_gnm(n = law_nodes_cw, m = law_ties_cw, directed = FALSE, loops = FALSE)
  as_adjacency_matrix(g, sparse = FALSE)
}, simplify = FALSE)

We calculate the number of same-practice ties in each simulated graph:

homoph_sim_cw <- sapply(law_sim_cw, function(mat) {
  sum(mat[law_attr_pract == 1, law_attr_pract == 1]) / 2 +
  sum(mat[law_attr_pract == 2, law_attr_pract == 2]) / 2
})

and finally, plot the null distribution and compare with observed value:

Figure 15.8: Distribution of same-practice coworking ties in 1000 random undirected graphs preserving the number of nodes and ties. The red dashed line indicates the observed number of homophilous ties among law firm partners. This test evaluates whether the level of professional homophily (litigation or corporate) is different than what is expected under random tie allocation.

As seen in Figure 15.8, we are in the right tail, indicating we arfe observing more homophilous ties than expected by chance. To formnally test the hypothesis above we can compute the \(p\)-value:

mean(homoph_sim_cw >= homoph_obs_cw)

[1] 0.002

Thus, there is statistically significant evidence of practice-based homophily among the firm’s partners.

15.1 Packages Needed for this Chapter

15.2 Conditional Uniform Graph Distributions (CUGs)

15.3 Hypothesis Testing Procedure

15.4 Running Example

15.5 Uniform Graph Distribution Given Expected Density

15.6 Uniform Graph Distribution Given Number of Edges

15.7 Uniform Graph Distribution Given Dyad Census

15.8 Uniform Graph Distribution Given Fixed Degree

15.9 Analysing Homophily Using Non-Parametric Null Distribution

Test 1: Friendship based on gender

Test 2: Cowork among partners based on law practice

References