29 Tutorial
To demonstrate how you can use EpiModel with data inputs from network census and egocentrically sampled designs we will return to the built-in adolescent friendship network data “faux.mesa.high” (“Mesa”) from the ergm
package.
Mesa is a network census based on one of the “saturation” sample schools from the Add Health study (wave 1). Questionnaires were completed by all of the students from the 7th-12th grades, and they were asked to nominate up to 5 close male and 5 close female friends from a roster that included the names of all students in their school, and a code for friends who were not enrolled in the school. More information on the study design can be found here. The network included in the ergm
package is a simplified and anonymized version of the original data, using a model-based simulation of the mutual in-school nominations only, so it is an undirected network. More information on the details and construction of the simulated dataset can be found by typing ?faux.mesa.high
.
This is a static network, and the Add Health study did not include questions about partnership duration. We will use a hypothetical average friendship length to parameterize the duration.
Epimodel’s network methods have the flexibility to be entirely driven by a single dataset (when all necessary data have been collected), to combine data from different studies, or to introduce parameters for exploring “what-if” scenarios.
Download the R script to follow along with this tutorial here.
Let’s start by loading the EpiModel
package. Optionally, also install the Rglpk
package to use the preferred solver during estimation.
29.1 Using Network Census Data
29.1.1 Load the data
We’ll load it, and look at the summary. Note that the class of the built-in Mesa data object is network
. If you have data from a network census study design, the methods for creating network
objects from your data can be found in one of the Statnet intro workshops.
[1] "network"
Network attributes:
vertices = 205
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 203
missing edges= 0
non-missing edges= 203
Vertex attribute names:
Grade Race Sex
No edge attributes
29.1.2 Explore attribute effects
We’ll focus on two of the attributes here – Grade and Sex – and use some basic R utilities to explore whether these are associated with heterogeneity in behavior (degree) and network structure (mixing patterns).
Degree distributions:
Code
degree0 degree1 degree2 degree3 degree4 degree5 degree6 degree7
57 51 30 28 18 10 2 4
degree8 degree9 degree10 degree13
1 2 1 1
degree0 degree1 degree2 degree3 degree4 degree5 degree6 degree7
57 51 30 28 18 10 2 4
degree8 degree9 degree10 degree13
1 2 1 1
degree0 degree1 degree2 degree3 degree4 degree5 degree6 degree7
57 51 30 28 18 10 2 4
degree8 degree9 degree10 degree13
1 2 1 1
Code
Code
par(mfrow=c(1,2))
# by Grade
barplot(summary(mesa ~ nodefactor("Grade", levels=T)) / table(mesa %v% "Grade"),
horiz = T,
yaxt = "n", col = fill.col,
main = "Mean degree by Grade", xlab = "Mean degree", ylab = "Grade")
grades <- c(7:12)
pos <- barplot(summary(mesa ~ meandeg:nodefactor("Grade", levels=TRUE)),
horiz = T, plot=F)
axis(2, at = pos, labels = grades)
# by sex
barplot(summary(mesa ~ nodefactor("Sex", levels=T)) / table(mesa %v% "Sex"), horiz = T,
yaxt = "n", col=fill.col[7:8],
main = "Mean degree by Sex", xlab = "Mean degree", ylab = "Sex")
sex <- c("F", "M")
pos <- barplot(summary(mesa ~ meandeg:nodefactor("Sex", levels=TRUE)),
horiz = T, plot=F)
axis(2, at = pos, labels=sex)
F M
23.2 32.1
Mixing patterns:
7 8 9 10 11 12
7 75 0 0 1 1 1
8 0 33 2 4 2 1
9 0 2 23 7 6 4
10 1 4 7 9 1 5
11 1 2 6 1 17 5
12 1 1 4 5 5 6
Note: Marginal totals can be misleading for undirected mixing matrices.
F M
F 82 71
M 71 50
Note: Marginal totals can be misleading for undirected mixing matrices.
Code
par(mfrow = c(1,2), oma=c(0,0,3,0))
plot(mesa, coord=coords,
vertex.col='Grade', main="By Grade", cex.main = 1)
legend('bottomleft',fill=7:12,
legend=paste('Grade',7:12),cex=0.5)
plot(mesa, coord=coords,
vertex.col='Sex', main="By Sex", cex.main = 1)
legend('bottomleft',fill=1:2,
legend=paste('Sex',c("F","M")),cex=0.5)
mtext("Mesa friendship network", side=3, outer=T, cex=1.5)
The EDA replicates what we saw in this network using statnetWeb: both the sex and grade attributes influence friendship prevalence and selection, producing a network with some clear heterogeneities in structure.
29.1.3 Specify the model
We explored several models in statnetWeb, but here we want to focus on demonstrating how network data are used in an EpiModel workflow. So we won’t spend time testing and comparing models for the network, but the lab will give you a chance to do some of that.
We are going to fit a specific model for attribute heterogeneities that is (loosely) informed by the EDA above:
- “main effects” for sex and grade (
nodefactor
) – to represent the difference in mean degree by levels of each attribute; - a mixing term for grade (
nodematch
) – to represent homophily by grade; - a degree(0) term by sex – to represent the differential prevalence of isolates for boys and girls.
29.1.4 Fitting the Network Model
Following the terminology and syntax demonstrated previously, we would like to fit a TERGM where the formula includes terms for overall mean degree (edges
), mean degree by grade and sex (nodefactor
terms) mixing by grade (a nodematch
term), and isolates by sex (having a degree of exactly zero). A total of 11 terms is specified by this model.
Since we are using network data input we can use ergm
to calculate the summary statistics for these model terms directly from the data using the summary()
function. These are the statistics used as the target stats in the fitting process.
Code
edges nodefactor.Grade.8 nodefactor.Grade.9 nodefactor.Grade.10
203 75 65 36
nodefactor.Grade.11 nodefactor.Grade.12 nodematch.Grade nodefactor.Sex.M
49 28 163 171
nodematch.Sex deg0.Sex.F deg0.Sex.M
132 23 34
We then fit the TERGM with the above formula and data, adding a hypothetical parameter for the dissolution model that produces a mean relational duration of 60 time steps.
Next, we run fit diagnostics to verify that the target stats are reproduced in expectation.
Note, we can also run dx for network statistics we didn’t include in the model, and this can be a quick way to assess the model GOF to features that were not explicitly represented. Let’s take a look at the observed values of some and compare them to the netstats produced by this model.
meandeg gwdeg.fixed.0.5 kstar2 triangle
1.980488 199.565738 659.000000 62.000000
Code
Network Diagnostics
-----------------------
- Simulating 25 networks
- Calculating formation statistics
Which of these netstats are reproduced by the model? Which are not?
You can also extract the raw data that goes into producing these numerical summaries and plots to work with them later.
time sim meandeg gwdeg.fixed.0.5 kstar2 triangle
1 1 1 1.902439 194.6103 520 9
2 2 1 1.902439 195.2788 514 9
3 3 1 1.921951 194.8129 531 10
4 4 1 1.912195 192.0841 532 11
5 5 1 1.902439 192.0766 520 11
6 6 1 1.921951 192.1490 539 11
7 7 1 1.902439 191.8629 526 11
8 8 1 1.882927 191.1496 514 9
9 9 1 1.912195 192.0039 531 9
10 10 1 1.912195 191.6711 536 9
29.1.5 Epidemic Simulation
To get the epidemic simulation started, we need to set the epidemic model parameters, initial conditions, and control settings (details provided here).
Code
myparam <- param.net(inf.prob = 0.2,
act.rate = 1.8,
rec.rate = 0.1)
myinit <- init.net(i.num = 10)
# Here we set which attribute to use for epi stats breakdowns
# (only one can be specified in base EpiModel)
# and also which stats we want to monitor
mycontrol <- control.net("SIS", nsteps = 100, nsims = 10,
epi.by = "Grade",
nwstats.formula = ~edges + nodematch("Grade") +
degree(0:2, by = "Sex"),
verbose = TRUE)
Run the epidemic simulation with netsim
.
Code
How does prevalence vary by grade? Since the grade sizes are unequal, it’s best to compare rates rather than raw numbers. We construct and plot those below. Note that despite clear degree differences and homophily by grade, and some early prevalence differentials, the prevalence eventually stabilizes at similar levels for all grades. This happens because the transmission probability is relatively high, the network is relatively well connected over time and there are no groups that are completely isolated from the others.
Code
mySIS <- mutate_epi(mySIS,
prev.7 = i.num.Grade7 / num.Grade7,
prev.8 = i.num.Grade8 / num.Grade8,
prev.9 = i.num.Grade9 / num.Grade9,
prev.10 = i.num.Grade10 / num.Grade10,
prev.11 = i.num.Grade11 / num.Grade11,
prev.12 = i.num.Grade12 / num.Grade12)
plot(mySIS, y = c("prev.7", "prev.8", "prev.9", "prev.10",
"prev.11", "prev.12"),
qnts.alpha = 0.2, qnts = 0.5, legend = TRUE)
It’s also a good idea to plot the network stats time series from the epidemic simulations to ensure that the epidemic simulations did not have unintended consequences on the network structure.
29.1.6 Dynamic Visualization
The simulation object you created can be passed to the ndtv
package for animation, allowing you to see how the infection is moving through the network of students. Recall that the while the network data we started with was a network census, it was a static network. So we passed a hyptothetical average partnership duration to EpiModel for calculating the dissolution model and parameterizing the TERGM. So the network structure is coming from the observed data, and the dynamics of tie formation and dissolution from the hypothetical.
We’ll vizualize the first 20 steps of the epidemic simulation to get a sense of what the network, and the spread, look like.
Code
myEpiSim <- get_network(mySIS)
myEpiSim <- color_tea(myEpiSim, verbose = FALSE)
slice.par <- list(start = 1, end = 20, interval = 1,
aggregate.dur = 1, rule = "any")
render.par <- list(tween.frames = 10, show.time = FALSE)
plot.par <- list(mar = c(0, 0, 0, 0))
compute.animation(myEpiSim, slice.par = slice.par, verbose = TRUE)
render.d3movie(
myEpiSim,
render.par = render.par,
plot.par = plot.par,
vertex.cex = 0.9,
vertex.col = "ndtvcol",
edge.col = "darkgrey",
vertex.border = "lightgrey",
displaylabels = FALSE,
output.mode = "htmlWidget")
Given what you know about this network, how well do you think this simulated model captures the heterogeneity in friendship structure? Does the animated network seem to have the same structure over time? If not, how would you modify the workflow above to address this?
So that’s the workflow when working with a network census dataset and a hypothetical edge duration parameter. Behind the curtain, this is the breakdown of how the packages are used:
netest()
the
ergm
package calculates the target statistics from the network data and estimates the formation model that you specified for the faux.mesa.high nodeset.the
EpiModel
package calculates the dissolution coefficient and adjusts the coefficient on theedges
term of the formation model to transform it to an incidence rate.
netdx()
- the
tergm
package simulates a complete dynamic network over time from the fitted model and keeps a list of the monitored statistics for inspection/plotting
netsim()
the
tergm
package simulates a complete dynamic network over time from the fitted model (as above).the
EpiModel
package simulates the transmission process over the dynamic network
- animation
EpiModel
extracts the network and colors the nodes by infection status (get_network
,color_tea
)ndtv
animates the colored dynamic network (withcompute.animation
andrender.d3movie
)
29.2 Using egocentrically sampled data
Now we will extract the ego networks from Mesa run the same EpiModel simulation using egocentrically sampled data as the input, and show that the results are the same.
29.2.1 Load the ergm.ego package
29.2.2 Extract the egocentric data
For this example we will extract an egocentric census – that is, we will pull the egonets for every node in the Mesa network. We do this to highlight the similarities and differences in the data format (rather than the content of the network data) and the downstream impacts.
The object we extract from Mesa is an egor
class object (egor
is the name of the package that handles egodata), and it’s a list that contains 3 tables: egos (with their attributes), alters (with their attributes and the egoID they are linked to) and aaties (if these are present). The format of the summary information is different than the format used with network
objects. For more details about the structure of egor
objects and how to work with them please see ?as.egor
and the Statnet ergm.ego
workshop.
[1] "egor" "list"
# EGO data (active): 205 × 4
.egoID Grade Race Sex
* <int> <dbl> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
# ℹ 200 more rows
# ALTER data: 406 × 5
.altID .egoID Grade Race Sex
* <int> <int> <dbl> <chr> <chr>
1 174 1 7 Hisp F
2 161 1 7 Hisp F
3 151 1 7 Hisp F
# ℹ 403 more rows
# AATIE data: 372 × 3
.egoID .srcID .tgtID
* <int> <int> <int>
1 1 151 127
2 1 127 52
3 1 127 87
# ℹ 369 more rows
205 Egos/ Ego Networks
406 Alters
Min. Netsize 1
Average Netsize 2.74324324324324
Max. Netsize 13
Average Density 0.750903163274297
Alter survey design:
Maximum nominations: Inf
Let’s verify that the ego attribute distribution is the same as in the Mesa network object.
29.2.3 Explore the data
The simple structure of the egor
object makes it relatively easy to conduct EDA using the traditional tools in R, and examples can be found in the Statnet workshop for ergm-ego
. We’ll just highlight one nice plotting function built into ergm-ego
for looking at degree distributions by attributes, with the BRG distribution overlay.
Otherwise, we’ll look at one important thing here: the summary statistics for the model terms from the two types of network data. These stats become the target stats in the fitting process.
Code
network egocensus
edges 203 203
nodefactor.Grade.8 75 75
nodefactor.Grade.9 65 65
nodefactor.Grade.10 36 36
nodefactor.Grade.11 49 49
nodefactor.Grade.12 28 28
nodematch.Grade 163 163
nodefactor.Sex.M 171 171
nodematch.Sex 132 132
deg0.Sex.F 23 23
deg0.Sex.M 34 34
They’re identical. So, net of stochastic variation in the fitting algorithm, the ERGM and TERGM fits from these two different network data inputs should be the same.
29.2.4 Fit the network model
We fit the TERGM with netest
using the same formula we used before. This time, netest
will automatically detect that the network object passed on the LHS is an egor
object, and it will use ergm.ego
for the estimation.
When fit without specific population size controls, the
ergm.ego
package default assumption is that the true network size is equal to the sample size.
We’ll use this default here, since we took a census of our true population network, Mesa, so our egocentric “sample” actually is the same size as our population network. But we’ll show an example of scaling the fit to a different population network size below.
Compare the coefficients from the two fits (the same, net of stochastic variations in the fitting algorithm):
Code
network egocensus diff
edges -6.123 -6.129 0.006
nodefactor.Grade.8 0.198 0.187 0.010
nodefactor.Grade.9 0.090 0.094 -0.004
nodefactor.Grade.10 0.339 0.337 0.002
nodefactor.Grade.11 0.452 0.454 -0.002
nodefactor.Grade.12 0.779 0.768 0.011
nodematch.Grade 3.005 3.005 0.000
nodefactor.Sex.M -0.275 -0.275 0.001
nodematch.Sex 0.609 0.619 -0.010
deg0.Sex.F 1.661 1.661 0.000
deg0.Sex.M 1.220 1.216 0.004
Run model diagnostics to check the model goodness of fit with the egodata and plot the results:
As with the network census data, the simulations from the egodata census match the target statistics in expectation.
We won’t go on to simulate the epidemic, since the inputs to and syntax for EpiModel from here forward are the same as those used for the Mesa network census above.
Bottom line: The basic netest
syntax is similar for all types of network data inputs – target statistics, a network census dataset or egocentrically sampled network data.
Since ERGM/TERGM fits depend only on the model’s sufficient statistics, when the values of these statistics are the same, the estimated coefficients, network simulations and epidemic simulations will be the same (up to stochastic variation), regardless of the type of data that you pass to EpiModel.
29.2.5 Scale the fit to a different network size
Using the netest
function to estimate the model for a larger (or smaller) network is straightforward, and useful in two contexts:
Your egocentrically sampled dataset is fairly small.
Egocentric study designs are used just like a sample survey: to collect data from a tiny fraction of the whole population, and use the sample to estimate models and make statistical inferences to the whole population. Sampling is efficient (when it’s done right!). But if your sampled network is too small, this may compromise the estimation. The basic principle is analagous to the general statistical rule of thumb that you want a minimum cell size (of, say, 5) in order to estimate model terms that depend on those cells. In this context you want a large enough “pseudo-population network” size during the MCMC run to provide sufficient statistics that can support estimation. A simple rule of thumb is a pseudo-population network size of 1,000, but the appropriate size depends on the complexity of the model and whether there are sample weight, as in other statistical contexts.
Simulation experiments (“what-ifs” and counterfactuals)
Say you want to examine whether and how network size influences infection dynamics. One way to do that is to fit the network model with a range of explicit pseudo-population sizes, simulate epidemics on these differently sized networks, and compare the results.
It is helpful to learn a bit more about how the package ergm.ego
works if you want to explore model-based network scaling, but the basic netest
syntax itself is very simple – you pass the argument ppopsize
to ergm.ego
via the argument set.control.ergm.ego
in netest
. Here we demonstrate estimation with pseudo-population network sizes of 1000 and 2000.
Code
set.seed(0)
myfit.ego.pp1000 <- netest(mesa.ego,
formation = formation,
coef.diss = dissolution_coefs(~offset(edges), 60),
set.control.ergm.ego = ergm.ego::control.ergm.ego(
ppopsize = 1000))
set.seed(0)
myfit.ego.pp2000 <- netest(mesa.ego,
formation = formation,
coef.diss = dissolution_coefs(~offset(edges), 60),
set.control.ergm.ego = ergm.ego::control.ergm.ego(
ppopsize = 2000))
The control parameter tells ergm.ego
to construct a complete network of size ppopsize
with the same (proportional) distribution of nodal attributes, same mean degree and appropriately scaled target statistics as those observed in the egocentric sample (or as close as possible if the ppopsize
is not a multiple of the sample size). The MCMC fitting algorithm then uses this scaled pseudo-network during estimation.
If you want to inspect the scaled pseudo-network, it can be found in the fit object component newnetwork
. This is technically the network used in the last chain of the MCMC estimation process, so the ties are a stochastic draw from the model distribution, but the target stats and nodeset (with attributes) are the same throughout the fitting process.
Compare the number of nodes in each of the networks, their target stats, and the preservation of mean degree across network size (net of stochasticity):
Network sizes:
Code
network egocensus egopp1000 egopp2000
1 205 205 1025 2050
Network size scaling factor:
Code
network egocensus egopp1000 egopp2000
1 1 1 5 10
Target stats:
Code
network egocensus egopp1000 egopp2000
edges 203 203 1015 2030
nodefactor.Grade.8 75 75 375 750
nodefactor.Grade.9 65 65 325 650
nodefactor.Grade.10 36 36 180 360
nodefactor.Grade.11 49 49 245 490
nodefactor.Grade.12 28 28 140 280
nodematch.Grade 163 163 815 1630
nodefactor.Sex.M 171 171 855 1710
nodematch.Sex 132 132 660 1320
deg0.Sex.F 23 23 115 230
deg0.Sex.M 34 34 170 340
Target stat scaling:
Code
network egocensus egopp1000 egopp2000
edges 1 1 5 10
nodefactor.Grade.8 1 1 5 10
nodefactor.Grade.9 1 1 5 10
nodefactor.Grade.10 1 1 5 10
nodefactor.Grade.11 1 1 5 10
nodefactor.Grade.12 1 1 5 10
nodematch.Grade 1 1 5 10
nodefactor.Sex.M 1 1 5 10
nodematch.Sex 1 1 5 10
deg0.Sex.F 1 1 5 10
deg0.Sex.M 1 1 5 10
Mean degree:
Code
network egocensus egopp1000 egopp2000
meandeg 2.17561 2.234146 1.976585 2.044878
Compare the estimated network model coefficients:
Code
network egocensus egopp1000 egopp2000 pct.diff2000
edges -6.123 -6.129 -7.669 -8.356 -36.467
nodefactor.Grade.8 0.198 0.187 0.175 0.173 12.650
nodefactor.Grade.9 0.090 0.094 0.079 0.078 14.057
nodefactor.Grade.10 0.339 0.337 0.305 0.304 10.454
nodefactor.Grade.11 0.452 0.454 0.410 0.407 9.979
nodefactor.Grade.12 0.779 0.768 0.697 0.683 12.264
nodematch.Grade 3.005 3.005 2.898 2.889 3.877
nodefactor.Sex.M -0.275 -0.275 -0.261 -0.260 5.235
nodematch.Sex 0.609 0.619 0.558 0.547 10.263
deg0.Sex.F 1.661 1.661 1.594 1.584 4.666
deg0.Sex.M 1.220 1.216 1.168 1.156 5.224
The key difference is the value of the “edges” coefficient, reflecting a scaling assumption built in to the ergm-ego
package.
29.2.5.1 The scaling assumption: mean degree? or density?
The canonical edges coefficient in an ERGM controls the density of the network: the probability of a tie across all possible dyads. Density is related to mean degree as:
\[ Mean Degree = Density * (N-1) \] We can verify this in the original Mesa network:
Code
meandeg
204
As we scale up the network size, \(N' = k*N\), where \(k\) is 5 and 10 in our scaleup examples above, the number of dyads grows (as the square of the net size). If we use the fitted ERGM from the original Mesa network to simulate a network that is 5 or 10 times higher, the canonical edges coefficient would preserve the network density, but that would force the mean degree to rise, by a factor of \((k-1)*Density\). Using the Mesa network, for example, the mean degree would rise from the observed value of about 2 to about 8 and 20 respectively for our 1000 and 2000 sized networks.
This “density-dependent” behavior does not seem realistic in the context of most human social networks. Instead, it seems more likely that the number of ties a person has are likely to level out at a certain number, regardless of how much the population size increases (or decreases, as long as network size doesn’t drop below mean degree).
The ergm-ego
package has a built-in assumption that preserves the mean degree rather than the density when network size is scaled up using the ppopsize
argument.
Mean degree is preserved by the scaling the edges coefficient. The adjustment happens automatically in ergm.ego
and is a straightforward difference on the log scale: log(orig net size) - log(new net size).
Note that the same assumption, that mean degree is the scale invariant parameter, is used in the EpiModel
package, and is invoked when using open-population models.