## Install and load the following packages
# install.packages("relevent")
library(relevent)
library(sna)
library(ggplot2)
library(scales)
library(igraph)
library(RColorBrewer)
Welcome to part 1 of the practical for social network analysis. In this practical we will get hands-on experience with the materials in the lectures by applying Relational Event Models (REMs) to real relational event history (REH) data, the Class data and the World Trade Center (WTC) police data, and doing a bit of programming in R. These data sets are stored in UUsummerschool.rdata. Later in this practical we explain these data sets briefly.
As mentioned in the lecture, dyadic REMs are intended to capture the behavior of systems in which individual social units (persons, organizations, animals, companies, countries, etc.) direct discrete actions towards other individuals in their environment, together forming a social network.
First, we need to load UUsummerschool.rdata and see what is inside of it.
# Set your own working directory
# setwd()
# Load data using load():
load("UUsummerschool.rdata")
# This R-data file contains the following objects:
ls()
## [1] "as.sociomatrix.eventlist" "Class"
## [3] "ClassIntercept" "ClassIsFemale"
## [5] "ClassIsTeacher" "PartOfApollo_13"
## [7] "Twitter_data_rem3" "WTCPoliceCalls"
## [9] "WTCPoliceIsICR"
What is stored in UUsummerschool.rdata? A function as.sociomatrix.eventlist that outputs a sociomatrix, and the rest are data sets such as PartOfApollo_13, Class, Twitter_data_rem3 and WTCPoliceCalls.
Probably you would ask: “What is a sociomatrix?” Before you let the computer do the social network analysis, it must have some basic data to work with, such as which pairs of individuals are connected. Each such connection involves exactly two individuals, that is, two nodes. This property suggests representing the relationships mathematically as a two-dimensional matrix. If we list the individuals in the same order along the rows and the columns, and use 1 and 0 to indicate whether or not there is a connection between a pair, we obtain the matrix we need. Since this matrix describes the social network, it gets an exclusive name: the sociomatrix.
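For instance, here is a minimal sketch (toy data, not part of UUsummerschool.rdata) of how two events among three actors translate into a sociomatrix:
# Toy example: actor 1 contacts actor 2, and actor 2 contacts actor 3
toy_events <- data.frame(number = 1:2, source = c(1, 2), recipient = c(2, 3))
toy_matrix <- matrix(0, 3, 3)
toy_matrix[cbind(toy_events$source, toy_events$recipient)] <- 1
toy_matrix # rows = senders, columns = recipients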
Secondly, let’s check the objects that we will work with in this practical.
as.sociomatrix.eventlist
## function (eventlist, n)
## {
## g <- matrix(0, n, n)
## if (NROW(eventlist) > 0) {
## tabmat <- table(eventlist[, -1, drop = FALSE])
## g[as.numeric(dimnames(tabmat)[[1]]), as.numeric(dimnames(tabmat)[[2]])] <- tabmat
## }
## g
## }
## attr(,"source")
## [1] "function(eventlist,n){"
## [2] " g<-matrix(0,n,n)"
## [3] " if(NROW(eventlist)>0){"
## [4] " tabmat<-table(eventlist[,-1,drop=FALSE])"
## [5] " g[as.numeric(dimnames(tabmat)[[1]]), as.numeric(dimnames(tabmat)[[2]])]<-tabmat"
## [6] " }"
## [7] " g"
## [8] "}"
head(WTCPoliceCalls)
## number source recipient
## 1 1 16 32
## 2 2 32 16
## 3 3 16 32
## 4 4 16 32
## 5 5 11 32
## 6 6 11 32
head(Class)
## StartTime FromId ToId
## 1 0.135 14 12
## 2 0.270 12 14
## 3 0.405 18 12
## 4 0.540 12 18
## 5 0.675 1 12
## 6 0.810 12 1
## The relevent package and the rem.dyad function
The relevent package (Butts, 2015) in R provides an implementation of the relational event framework, with specialized functionality for easily fitting dyadic relational event models. Within the relevent package, the rem.dyad() function is the primary workhorse for modeling dyadic data. As you can see in the related documentation, there are a bunch of arguments that can be used for different purposes. Here, in this practical, we only use a few of them (please see the pdf file rem.dyad.args). For more explanation please check the help file of rem.dyad as follows.
## You can access the help file in three different ways:
# 1)
??rem.dyad
# 2)
help(rem.dyad)
# 3)
?relevent::rem.dyad
There is also rem() for modeling dyadic data. The main difference is that rem() needs the statistics to be computed in advance, whereas rem.dyad() computes the statistics internally. However, if we have the values of the statistics in advance, rem() tends to be faster.
As when applying a regression model, we should first decide which variables/statistics should be in our model. This can be done in several ways: expert knowledge, prior information, and statistic/variable selection, e.g., using the BIC. Note that not all effects may lead to identified models in all cases. It is up to the user to ensure that the postulated model makes sense.
This data set contains information about the 9/11 terrorist attacks at the WTC in New York City in 2001. It should be noted that radio communication was essential in coordinating activities during the crisis. Butts et al. (2007) coded radio communication events between officers responding to 9/11 from transcripts of communications recorded during the event. It is important to note that the WTC radio data was coded from transcripts that lacked detailed timing information; we therefore do not know precisely when these calls were made, but we do know the order of events. Since the case of ordinal timing is somewhat simpler than that of exact timing, we consider the WTC data first in this practical. We will illustrate ordinal time REMs using the 481 communication events among 37 named communicants in this data set.
The eventlist is stored in an object called WTCPoliceCalls.
Question 1: Check the data using the head and tail functions. How many events does the data contain? How many actors? Without testing anything, can you see any trend? If so, which statistics can capture that trend?
Hint: Use table() to generate frequency tables for the source and recipient attributes. Use ggplot(data, aes(attribute))+geom_bar()+labs(y="", x="") to visualize the data.
# Check the data.
head(WTCPoliceCalls)
## number source recipient
## 1 1 16 32
## 2 2 32 16
## 3 3 16 32
## 4 4 16 32
## 5 5 11 32
## 6 6 11 32
table(WTCPoliceCalls$source)
##
## 1 2 3 4 5 6 8 9 10 11 12 15 16 17 18 19 20 21 22 23
## 5 1 5 4 2 1 5 1 2 7 3 18 109 1 9 5 2 17 22 5
## 24 25 26 27 28 29 30 31 32 33 34 36 37
## 10 9 3 6 4 6 8 6 171 3 5 22 4
table(WTCPoliceCalls$recipient)
##
## 1 2 3 4 5 8 10 11 12 14 15 16 17 18 19 20 21 22 23 24
## 4 1 4 5 2 1 1 5 2 1 19 103 2 9 4 2 13 21 5 8
## 25 26 27 28 29 30 31 32 33 34 35 36 37
## 6 5 1 2 4 7 5 201 2 5 2 26 3
ggplot(WTCPoliceCalls, aes(source))+geom_bar()+labs(y="Frequency", x="Source")
ggplot(WTCPoliceCalls, aes(recipient))+geom_bar()+labs(y="Frequency", x="Recipient")
You can see the reciprocation and recency patterns in the first few events: responding officer 16 called officer 32 in the first event, and officer 32 then called 16 back in the second, which might be characterized as a local reciprocity effect or \(AB \rightarrow BA\) participation shift. This was followed by 32 being the target of the next four calls, which is probably due either to a role of 32, such as a coordinator, or to the presence of a recency mechanism.
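As a quick descriptive check (a sketch, using only the source and recipient columns of the eventlist), we can count how often an event reciprocates the immediately preceding one:
# Proportion of events that reciprocate the directly preceding event (AB -> BA)
n_events <- nrow(WTCPoliceCalls)
recip <- WTCPoliceCalls$source[-1] == WTCPoliceCalls$recipient[-n_events] &
  WTCPoliceCalls$recipient[-1] == WTCPoliceCalls$source[-n_events]
mean(recip)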
Question 2: Use the included sna function as.sociomatrix.eventlist() to convert the eventlist into a valued sociomatrix. For the definition of a sociomatrix, see the previous section.
Hint: as.sociomatrix.eventlist(data, number of actors).
WTCPoliceNet <- as.sociomatrix.eventlist(WTCPoliceCalls, 37)
Question 3: Once you have the sociomatrix object, plot the aggregated network using gplot().
Hint: gplot(created sociomatrix, the arguments such as the size of nodes, width of edges, colors of nodes, etc.).
gplot creates a node-and-edge visualization of a network. It can accept an adjacency matrix (N × N), an array of adjacency matrices (m × N × N), network objects, and other forms of network data (check the sna library to see what else is possible). gplot has many arguments with which you can tweak your network visualization, for example:
- vertex.col: sets the node color.
- edge.col: color for edge border and fill.
- vertex.sides: number of sides for the node polygon, e.g., triangle = 3, square = 4 (default = 8).
- label.bg: color for the background of the node labels.
Check the help file of gplot to find other features.
to find other features.gplot(WTCPoliceNet, edge.lwd=WTCPoliceNet^0.75, arrowhead.cex=
log(as.edgelist.sna(WTCPoliceNet)[,3])+.25,
vertex.col= ifelse(WTCPoliceIsICR ,"black","gray"),
vertex.cex=1.25, vertex.sides=ifelse(WTCPoliceIsICR ,4,100))
The plot above is the time-aggregated WTC Police communication network.
Question 4: igraph is the most commonly used R package for network analysis. It provides tools to import, transform, and visualize graphs. The documentation for igraph can be found here: https://igraph.org/r/doc/. igraph functions operate on igraph objects, therefore we need to transform our sociomatrix (also called an adjacency matrix) into an igraph object. Use graph_from_adjacency_matrix() to convert the WTC Police Calls sociomatrix into a graph object.
Hint: An igraph object is a special class of graphs used by igraph. igraph graphs store information about vertices, edges, and attributes in the format of an edge list. You can read more about the igraph package using ?igraph.
wtc_graph_fromAM <- graph_from_adjacency_matrix(WTCPoliceNet)
Question 5: Adding relevant data (attributes) to a graph is called ‘decorating’ a graph. In igraph it can be done using the $ operator. Assign corresponding numbers to nodes as vertex names. This will make it easier to index nodes while calculating different statistics.
Hint: You can set vertex attributes by using V(graph)$name_of_attribute.
V(wtc_graph_fromAM)$name <- seq(1,37)
Question 6: Use the WTCPoliceIsICR vector to assign a vertex attribute that shows whether an actor occupies an institutionalized coordinator role (ICR).
Hint: V(graph)$name_of_attribute.
V(wtc_graph_fromAM)$IsICR <- WTCPoliceIsICR
Question 7: Using the igraph package, visualize the WTC Police Calls network using different layouts to see which layout is the most beneficial. First, use a circular layout.
Hint: Use plot(graph, layout=layout_in_circle, vertex.color=...) to plot the graph. You can see different layout options using ?layout_().
# Circular layout
igraph_options(vertex.size=12, edge.arrow.size=0.5)
plot(wtc_graph_fromAM, layout=layout_in_circle,
vertex.color=ifelse(V(wtc_graph_fromAM)$IsICR, "red", "pink"))
The circular layout is the simplest layout: the vertices are arranged around the circumference of a circle (typically ordered by degree) and edges are then drawn across the circle.
Now, try the Fruchterman and Reingold method.
Hint: plot(graph, layout=layout_with_fr, vertex.color=...).
# Fruchterman and Reingold method
plot(wtc_graph_fromAM, layout=layout_with_fr,
vertex.color=ifelse(V(wtc_graph_fromAM)$IsICR, "red", "pink"))
The Fruchterman and Reingold method is a spring-embedder method of graph drawing wherein the nodes are analogous to balls and the edges to springs between them. The goal is to minimize the energy of the system by moving the nodes and changing the forces between them.
Also try the Kamada and Kawai method.
Hint: plot(graph, layout=layout_with_kk, vertex.color=...).
# Kamada and Kawai method
plot(wtc_graph_fromAM, layout=layout_with_kk,
vertex.color=ifelse(V(wtc_graph_fromAM)$IsICR, "red", "pink"))
The Kamada and Kawai method is an energy-placement method which simulates a spring between each pair of vertices, where the length of the spring is determined by the shortest path between them. The idea is to move the nodes so as to minimize the global energy of the layout.
Question 8: What are the size and order of this network? A network’s size refers to the total number of its edges, and its order refers to the number of vertices in the network.
Hint: gsize(graph), gorder(graph).
gsize(wtc_graph_fromAM)
## [1] 481
gorder(wtc_graph_fromAM)
## [1] 37
Question 9: The degree of a vertex is the number of edges incident upon that vertex. A degree distribution is a collection of degree frequencies over the whole network. The degree distribution shows how connected the graph is and whether some nodes have a notably distinct degree compared to others. Using ggplot2, create degree distributions of the WTC Police Calls data.
Hint: degree(graph, mode=c("in", "out", "total")); ggplot(data, aes()) + geom_histogram() + labs(title="", x="", y="").
# Since we are working with directed graph we can create indegree, outdegree, and total distributions:
# calculating degrees of the network
G.indegrees <- igraph::degree(wtc_graph_fromAM, mode = 'in')
G.outdegrees <- igraph::degree(wtc_graph_fromAM, mode = 'out')
G.total_degree <- igraph::degree(wtc_graph_fromAM, mode = 'total')
# creating a dataframe to be able to work with ggplot
G.degree.histogram <- data.frame(G.indegrees, G.outdegrees, G.total_degree)
# plotting indegree distribution
ggplot(G.degree.histogram, aes(x = G.indegrees)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  scale_y_continuous(breaks=breaks_pretty()) +
  labs(title="Indegree Distribution", y="Frequency", x="Vertex Degree")
# plotting outdegree distribution
ggplot(G.degree.histogram, aes(x = G.outdegrees)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  scale_y_continuous(breaks=breaks_pretty()) +
  labs(title="Outdegree Distribution", y="Frequency", x="Vertex Degree")
# plotting total degree distribution
ggplot(G.degree.histogram, aes(x = G.total_degree)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  scale_y_continuous(breaks=breaks_pretty()) +
  labs(title="Total Degree Distribution", y="Frequency", x="Vertex Degree")
# you can also use hist() function to visualize the distributions.
#hist(degree(wtc_graph_fromAM, mode = 'in'), col='lightblue', xlab='Vertex Degree',
#ylab='Frequency', main='')
#hist(degree(wtc_graph_fromAM, mode = 'out'), col='blue', xlab='Vertex Degree',
#ylab='Frequency', main='')
After examining the degree distribution plots, we can see that the majority of the nodes have quite low in-, out-, and total degrees. However, there are two highly connected nodes.
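To see which nodes these are (a quick check; the formal centrality analysis follows in the next question):
sort(G.total_degree, decreasing = TRUE)[1:2]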
Question 10: Centrality measures help identify ‘important’ nodes in a network. Degree centrality measures how ‘important’ a node is by counting the number of edges incident upon that node. Betweenness measures how often a certain node appears on a shortest path between pairs of other nodes. Closeness measures how close a certain vertex is to all other vertices. Eigenvector centrality captures the ‘importance’ of a node through its neighbors: the more connected the vertex’s neighbors are, the more ‘important’ the vertex itself. Calculate the centrality measures degree, betweenness, and eigenvector (closeness can’t be calculated for this graph due to the presence of isolated vertices). Extract the nodes that score the highest on each centrality measure.
Hint: degree(graph, ..., mode=c("all", "out", "in", "total")); betweenness(graph, directed = TRUE, ...); eigen_centrality(graph, directed = TRUE, ...).
a) highest degree
in_degree <- igraph::degree(wtc_graph_fromAM, mode='in')
out_degree <- igraph::degree(wtc_graph_fromAM, mode='out')
total_degree <- igraph::degree(wtc_graph_fromAM, mode='total')
in_degree[in_degree == max(in_degree)]
## 32
## 201
out_degree[out_degree == max(out_degree)]
## 32
## 171
total_degree[total_degree == max(total_degree)]
## 32
## 372
We can see that in this network, node 32 is the most central node with regard to its in-, out-, and total degree.
b) highest betweenness
between_wtc <- betweenness(wtc_graph_fromAM, directed = TRUE, normalized = TRUE)
between_wtc[between_wtc == max(between_wtc)]
## 32
## 0.5936907
Node 32 also appears on the shortest path between other nodes the most.
c) highest eigenvector
eigen_wtc <- eigen_centrality(wtc_graph_fromAM, directed = TRUE)
eigen_values <- eigen_wtc[1]
eigen_values <- unlist(eigen_values)
eigen_values[eigen_values == max(eigen_values)]
## vector.32
## 1
Node 32 also has the most well-connected neighbors. These results show us that node 32 is the most ‘important’ node in this network with regard to degree, betweenness, and eigenvector centralities.
Question 11: Community detection identifies subsets (clusters) of vertices that demonstrate ‘cohesiveness’ with respect to their relationship patterns. Clusters correspond to sets of nodes with more edges within the cluster than to the rest of the graph. One way of approaching community detection in directed graphs, called the naive approach, is to convert the directed network into an undirected one and assume edge symmetry. Apply a community detection algorithm on the WTCPoliceCalls data.
Hint: as.undirected(graph, mode = c("collapse", "each", "mutual")); cluster_fast_greedy(graph). Note that the function cluster_fast_greedy applies the fast and greedy algorithm for community detection. It is a hierarchical clustering algorithm used to find densely connected subgraphs (communities) in the graph.
# Converting the digraph into an undirected graph without multi-edges:
undirected.wtc <- as.undirected(
wtc_graph_fromAM,
mode = c("collapse"))
# use fast and greedy algorithm for hierarchical clustering
wuc <- cluster_fast_greedy(undirected.wtc)
Once you’ve applied the algorithm, you can check how many communities the algorithm identified using length().
length(wuc)
## [1] 9
Additionally, you can see the number of members of each community using sizes().
sizes(wuc)
## Community sizes
## 1 2 3 4 5 6 7 8 9
## 9 14 3 3 2 2 2 1 1
igraph has a convenient way of plotting the results of community detection. You can use plot(communities, graph) to visually represent the results.
# plot clusters
plot(wuc, undirected.wtc)
After plotting the results of community detection, we can see that there are two major communities, centered around node 32 and node 16. In addition, all isolated nodes have been assigned to their own clusters.
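The strength of this community structure can be quantified with the modularity score (a quick check; values well above 0 indicate more within-community edges than expected by chance):
# modularity of the detected partition
modularity(wuc)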
Hierarchical clustering methods produce a hierarchy of nested partitions of the graph. We can represent this hierarchy in the form of a dendrogram: a tree diagram that represents the similarity among groups of objects. The heights of the branches signify how similar the objects are to each other. igraph has a function dendPlot(communities, mode='phylo') specifically for this.
# plot a dendrogram
dendPlot(wuc, mode='phylo')
## Warning: `dendPlot()` was deprecated in igraph 2.0.0.
## ℹ Please use `plot_dendrogram()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Different communities are color-coded in the dendrogram, which makes it clear how the nodes are partitioned into different clusters. Looking at this dendrogram, we can see how similar the nodes are to the other members of their communities. It appears that the nodes in the same cluster are quite similar in this network.
The question is whether the propensity of individuals to send and receive calls depends on whether they occupy institutionalized coordinator roles (ICR) or not. WTCPoliceIsICR is a vector that shows the ICR status.
One of the arguments of the rem.dyad() function is ordinal, which defaults to TRUE and treats the data as ordinal. Because the WTCPoliceCalls data does not contain exact timing but does contain the order of events, we do not need to change anything. However, later in the practical we will consider data with exact time stamps, which will require us to change the value of this argument.
Question 12: Use rem.dyad() by passing appropriate arguments.
Hint: rem.dyad(data..., n = number of actors, effects = c("CovInt"), covar = list(CovInt = WTCPoliceIsICR), hessian = TRUE).
Note that CovInt is an effect of the exogenous covariate(s) on the propensity of actors to initiate and receive communication events. CovInt requires exogenous covariate(s) to be passed as covariate arguments. Please check the description of the arguments (?rem.dyad).
wtcfit1 <- rem.dyad(WTCPoliceCalls, n=37, effects=c("CovInt"),
                    covar=list(CovInt=WTCPoliceIsICR), hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
Once you have fit the model, you can summarize the model fit using summary().
Question 13: Check the model summary. Then see how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)).
predicted.match provides a two-column matrix of TRUE and FALSE values corresponding to the sender and receiver columns in the main data frame; it shows whether the model predicted the sender and receiver correctly. apply(your model$predicted.match,1,all) applies the function all over every row of this matrix. The function all checks whether all elements of a logical vector are TRUE (i.e., whether both sender and receiver were correctly predicted). mean() calculates the average of the apply() output, where TRUE is treated as 1 and FALSE as 0. Therefore, the result falls between 0 and 1: the higher the number, the better the predictions of our model.
Hint: summary(your model name).
summary(wtcfit1)
## Relational Event Model (Ordinal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovInt.1 2.104464 0.069817 30.142 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 6921.048 on 481 degrees of freedom
## Residual deviance: 6193.998 on 480 degrees of freedom
## Chi-square: 727.0499 on 1 degrees of freedom, asymptotic p-value 0
## AIC: 6195.998 AICC: 6196.007 BIC: 6200.174
mean(apply(wtcfit1$predicted.match,1,all))
## [1] 0
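A value of 0 means that no event had both its sender and receiver predicted correctly. As a toy illustration of how this number is computed (hypothetical values, not from the fitted model):
# Two events: in the first, both sender and receiver are predicted correctly;
# in the second, only the sender is.
pm <- matrix(c(TRUE, TRUE, TRUE, FALSE), ncol = 2, byrow = TRUE)
apply(pm, 1, all) # TRUE FALSE
mean(apply(pm, 1, all)) # 0.5: half of the events fully predicted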
The output gives us the covariate effect, some measures of uncertainty, and goodness-of-fit information. The format is like the output of a regression model summary, but note that the coefficients should be interpreted within the relational event framework.
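For instance, because the model is log-linear in the statistics, exponentiating a coefficient gives a hazard multiplier. A small interpretation sketch, under the assumption that the CovInt statistic counts the number of ICR actors involved in a dyad:
exp(wtcfit1$coef)   # an event involving one ICR actor: ~8.2 times the baseline pace
exp(2*wtcfit1$coef) # an event between two ICR actors: ~67 times the baseline pace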
Next, we evaluate whether it is worth treating these effects separately with respect to ICR status. To do so, we enter the ICR covariate as a sender and receiver covariate, respectively.
Question 14: Fit another model adding these covariates, and check the model summary as you did above. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)) and compare it with the previous model’s prediction.
Hint: It is the same as what we did for Q12, but here we should consider the effects for the sender and for the receiver separately. The structure is the same, but we need to select CovSnd and CovRec. CovRec is an actor-level covariate effect for receiving interactions and CovSnd is an actor-level covariate effect for initiating interactions with others. Both effects require exogenous covariate(s) to be passed as covariate arguments. Check the description of the arguments to see what you need to specify for effects.
wtcfit2 <- rem.dyad(WTCPoliceCalls, n=37, effects=c("CovSnd","CovRec"),
                    covar=list(CovSnd=WTCPoliceIsICR, CovRec=WTCPoliceIsICR), hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(wtcfit2) # check the model output
## Relational Event Model (Ordinal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovSnd.1 1.979172 0.095745 20.671 < 2.2e-16 ***
## CovRec.1 2.225715 0.092862 23.968 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 6921.048 on 481 degrees of freedom
## Residual deviance: 6190.175 on 479 degrees of freedom
## Chi-square: 730.8731 on 2 degrees of freedom, asymptotic p-value 0
## AIC: 6194.175 AICC: 6194.2 BIC: 6202.527
mean(apply(wtcfit2$predicted.match,1,all))-mean(apply(wtcfit1$predicted.match,1,all))
## [1] 0
The difference of 0 means that our second model did not predict any more sender–receiver pairs correctly than the first model.
Question 15: Now, evaluate which model is preferred by BIC (lower is better) and explain what the BIC tells us.
Hint: You can extract the BIC value from each model by using the $ operator (e.g., yourmodel$BIC).
wtcfit1$BIC - wtcfit2$BIC
## [1] -2.352663
The difference is negative, so the first model (with a single ICR interaction effect) has the lower BIC and is preferred; separating the sender and receiver effects is not worth the extra parameter. Next, add the reciprocity term \(AB \rightarrow BA\). For the definition of reciprocity, see the slides. In particular, the AB-BA shift, which captures the tendency for B to call A given that A has just called B, is likely at play in radio communication.
Question 16: Fit another model adding the reciprocity term, and check the model summary. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)) and compare it with the previous model’s prediction.
Hint: Add “PSAB-BA” as another statistic to the best model so far. Check the description of the arguments to see what you need to specify for effects.
wtcfit3 <- rem.dyad(WTCPoliceCalls,n=37,effects=c("CovInt","PSAB-BA"),
covar=list(CovInt=WTCPoliceIsICR),hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(wtcfit3)
## Relational Event Model (Ordinal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovInt.1 1.60405 0.11500 13.949 < 2.2e-16 ***
## PSAB-BA 7.32695 0.10552 69.436 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 6921.048 on 481 degrees of freedom
## Residual deviance: 2619.115 on 479 degrees of freedom
## Chi-square: 4301.933 on 2 degrees of freedom, asymptotic p-value 0
## AIC: 2623.115 AICC: 2623.14 BIC: 2631.467
mean(apply(wtcfit3$predicted.match,1,all))-mean(apply(wtcfit2$predicted.match,1,all))
## [1] 0.6839917
Question 17: Again, evaluate which model is preferred by BIC.
# Compare the BIC of this model with the first model.
wtcfit1$BIC - wtcfit3$BIC
## [1] 3568.707
Note that the lower the BIC, the better the model. It appears that there is a very strong reciprocity effect.
We may expect that the current participants in a communication are likely to initiate the next call, and that one’s most recent communications may be the most likely to be returned. These processes can be captured with the participation shifts for dyadic turn receiving/continuing and with recency effects, respectively.
Question 18: Fit another model adding the p-shift effects to the previous model, and check the model summary. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)) and compare it with the previous model’s prediction.
Hint: Add “PSAB-BY” and “PSAB-AY” to the best model. PSAB-BY is a turn-receiving effect: the participant receiving a communication event initiates the next communication event (\(AB \rightarrow BY\)). PSAB-AY is a turn-continuing effect: the participant that initiated a communication event also initiates the next one (\(AB \rightarrow AY\)). For more information, check the description of the arguments to see what you need to specify for effects.
## Adding p-shift effects to model 3
wtcfit4<-rem.dyad(WTCPoliceCalls,n=37,effects=c("CovInt","PSAB-BA","PSAB-BY","PSAB-AY"),
covar=list(CovInt=WTCPoliceIsICR),hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(wtcfit4)
## Relational Event Model (Ordinal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovInt.1 1.54283 0.11818 13.0549 < 2.2e-16 ***
## PSAB-BA 7.49955 0.11418 65.6831 < 2.2e-16 ***
## PSAB-BY 1.25941 0.25131 5.0115 5.401e-07 ***
## PSAB-AY 0.87216 0.30611 2.8491 0.004384 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 6921.048 on 481 degrees of freedom
## Residual deviance: 2595.135 on 477 degrees of freedom
## Chi-square: 4325.913 on 4 degrees of freedom, asymptotic p-value 0
## AIC: 2603.135 AICC: 2603.219 BIC: 2619.839
mean(apply(wtcfit4$predicted.match,1,all))-mean(apply(wtcfit3$predicted.match,1,all))
## [1] 0
Question 19: Evaluate which model is preferred.
# Compare the BIC of this model with the previous model (third model).
wtcfit3$BIC - wtcfit4$BIC
## [1] 11.62806
Question 20: Lastly, fit a model adding the recency effects to the previous model, and check the model summary. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)) and compare it with the previous model’s prediction.
Hint: RRecSnd is a recency effect: the recency of the communication event \(A \rightarrow B\) affects the future rate of the event \(B \rightarrow A\). RSndSnd is also a recency effect: the recency of the event \(A \rightarrow B\) affects the future rate of the event \(A \rightarrow B\). Are these the statistics that we need for this purpose?
## Adding recency effects to model 4
wtcfit5<-rem.dyad(WTCPoliceCalls,n=37,
effects=c("CovInt","PSAB-BA","PSAB-BY","PSAB-AY","RRecSnd","RSndSnd"),
covar=list(CovInt=WTCPoliceIsICR),hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(wtcfit5)
## Relational Event Model (Ordinal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## RRecSnd 2.38496 0.27447 8.6892 < 2.2e-16 ***
## RSndSnd 1.34622 0.22307 6.0349 1.590e-09 ***
## CovInt.1 1.07058 0.14244 7.5160 5.640e-14 ***
## PSAB-BA 4.88714 0.15293 31.9569 < 2.2e-16 ***
## PSAB-BY 1.67938 0.26116 6.4304 1.273e-10 ***
## PSAB-AY 1.39016 0.31057 4.4762 7.600e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 6921.048 on 481 degrees of freedom
## Residual deviance: 2307.413 on 475 degrees of freedom
## Chi-square: 4613.635 on 6 degrees of freedom, asymptotic p-value 0
## AIC: 2319.413 AICC: 2319.591 BIC: 2344.469
mean(apply(wtcfit5$predicted.match,1,all))-mean(apply(wtcfit4$predicted.match,1,all))
## [1] 0
Question 21: Evaluate which model is preferred.
# Compare the BIC of this model with the previous model (fourth model).
wtcfit4$BIC - wtcfit5$BIC
## [1] 275.3701
The results indicate that turn-receiving, turn-continuing, and recency effects are all at play in the relational event process.
Question 22: Can you think of adding more statistics? Please add a few more and see how the BIC changes. To this end, check the definitions of the statistics precisely.
# Try it out yourself.
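For example, one possible extension (a sketch, not the only answer; check ?rem.dyad for the full list of effect names and their exact definitions) is to add a normalized degree effect such as NIDSnd (normalized indegree of the sender):
wtcfit6 <- rem.dyad(WTCPoliceCalls, n=37,
                    effects=c("CovInt","PSAB-BA","PSAB-BY","PSAB-AY","RRecSnd","RSndSnd","NIDSnd"),
                    covar=list(CovInt=WTCPoliceIsICR), hessian=TRUE)
wtcfit5$BIC - wtcfit6$BIC # positive if the extended model is preferred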
Question 23: As there are a lot of statistics that we can use, can you think of some ways for statistics selection?
# Try it out yourself.
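One simple strategy is greedy forward selection by BIC: starting from the current best model, try each candidate statistic one at a time and keep an addition only if it lowers the BIC. A sketch (the candidate effect names here are only examples, to be checked against ?rem.dyad):
candidates <- c("NIDSnd", "NODRec", "OTPSnd")
for (ef in candidates) {
  fit <- rem.dyad(WTCPoliceCalls, n=37,
                  effects=c("CovInt","PSAB-BA",ef),
                  covar=list(CovInt=WTCPoliceIsICR), hessian=TRUE)
  cat(ef, ": BIC =", fit$BIC, "\n")
}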
Model adequacy is a crucial consideration: even if the model fits sufficiently well, you need to think about whether this model is good enough for our purposes, for example, the ability of the relational event model to predict the next event in the sequence, given those that have come before. Note that we do not discuss this further in this practical, but interested students can check Butts (2008).
The classroom dataset includes timed interactions between students and instructors within a high school classroom (McFarland, 2001; Bender-deMoll and McFarland, 2006). Sender and receiver events (n = 691) were recorded between 20 actors (18 students and 2 teachers), along with the time of the events in increments of minutes. Note that the data employed here were modified slightly to increase the amount of time between very closely recorded events, ensuring that no events are simultaneous, as assumed by the relational event framework. Two actor-level covariates are also at hand in the dataset used here: whether the actor was a teacher and whether the actor was female.
Let’s start by examining the raw temporal data and the time-aggregated network. In this case, event time is given in increments of minutes from the onset of observation.
Question 24: Check the data again using head(). Can you see any trend?
Hint: Use table() to generate frequency tables for the sender and receiver attributes. Use ggplot(data, aes(attribute))+geom_bar()+labs(y="", x="") to visualize the data.
## The edgelist is stored in an object called `Class`.
head(Class)
## StartTime FromId ToId
## 1 0.135 14 12
## 2 0.270 12 14
## 3 0.405 18 12
## 4 0.540 12 18
## 5 0.675 1 12
## 6 0.810 12 1
table(Class$FromId)
##
## 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20
## 29 27 63 25 23 164 19 18 32 19 39 20 81 20 26 32 29 25
table(Class$ToId)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 38 10 37 55 37 32 32 29 28 47 29 49 30 46 30 36 42 40 10 34
ggplot(Class, aes(FromId))+geom_bar()+labs(y="Frequency", x="Sender")
ggplot(Class, aes(ToId))+geom_bar()+labs(y="Frequency", x="Receiver")
Question 25: Convert the Class eventlist into a valued sociomatrix using as.sociomatrix.eventlist() as you did above.
Hint: as.sociomatrix.eventlist(data, number of actors).
ClassNet <- as.sociomatrix.eventlist(Class, 20)
Question 26: Plot the network using gplot().
Here we color the female nodes black and the male nodes gray, and represent teachers as square-shaped nodes and students as triangle-shaped nodes. Edges between nodes are likewise scaled proportional to the number of communication events transpiring between actors.
Hint: Please check the hint of Q3 and consult the help file of gplot.
gplot(ClassNet, vertex.col=ifelse(ClassIsFemale, "black", "gray"),
      vertex.sides=3+ClassIsTeacher, vertex.cex=2, edge.lwd=ClassNet^0.75)
Question 27: Use graph_from_adjacency_matrix() to convert the Classroom sociomatrix into a graph object.
Hint: graph_from_adjacency_matrix(sociomatrix).
# transforming class sociomatrix into igraph object
class_graph <- graph_from_adjacency_matrix(ClassNet)
Question 28: Assign corresponding numbers as vertex names.
Hint: You can set vertex attributes by using V(graph)$name_of_attribute.
V(class_graph)$name <- seq(1,20)
Question 29: Use the ClassIsFemale and ClassIsTeacher vectors to assign vertex attributes.
Hint: V(graph)$name_of_attribute.
V(class_graph)$IsFemale <- ClassIsFemale
V(class_graph)$IsTeacher <- ClassIsTeacher
Question 30: Using the igraph package, visualize the Classroom data using different layouts to see which layout is the most beneficial. First, use a circular layout.
Hint: plot(graph, layout=layout_in_circle, vertex.color=...).
# Circular layout
igraph_options(vertex.size=12, edge.arrow.size=0.5)
plot(class_graph, layout=layout_in_circle,
vertex.shape=ifelse(V(class_graph)$IsTeacher, "rectangle", "circle"),
vertex.color=ifelse(V(class_graph)$IsFemale, "pink", "lightblue"))
Now, try the Fruchterman and Reingold method.
Hint: plot(graph, layout=layout_with_fr, vertex.color=...).
# Fruchterman and Reingold method
plot(class_graph, layout=layout_with_fr,
vertex.shape=ifelse(V(class_graph)$IsTeacher, "rectangle", "circle"),
vertex.color=ifelse(V(class_graph)$IsFemale, "pink", "lightblue"))
Also try the Kamada and Kawai method.
Hint: plot(graph, layout=layout_with_kk, vertex.color=...).
# Kamada and Kawai method
plot(class_graph, layout=layout_with_kk,
vertex.shape=ifelse(V(class_graph)$IsTeacher, "rectangle", "circle"),
vertex.color=ifelse(V(class_graph)$IsFemale, "pink", "lightblue"))
Question 31: What are the size and order of this network?
Hint: gsize(graph), gorder(graph).
gsize(class_graph)
## [1] 691
gorder(class_graph)
## [1] 20
Question 32: Create degree distributions of the Classroom data.
Hint: ggplot(data, aes()) + geom_histogram() + labs(title="", x="", y="").
# Since we are working with directed graph we can create indegree and outdegree distributions:
# calculate indegree and outdegree
G.indegrees <- igraph::degree(class_graph, mode = 'in')
G.outdegrees <- igraph::degree(class_graph, mode = 'out')
G.total_degrees <- igraph::degree(class_graph, mode = 'total')
# creating a dataframe for working with ggplot
G.degree.histogram <- data.frame(G.indegrees, G.outdegrees, G.total_degrees)
# plotting indegree distribution
ggplot(G.degree.histogram, aes(x = G.indegrees)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  labs(title="Indegree Distribution", x="Vertex Degree", y="Frequency")
# plotting outdegree distribution
ggplot(G.degree.histogram, aes(x = G.outdegrees)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  labs(title="Outdegree Distribution", x="Vertex Degree", y="Frequency")
# plotting total degree distribution
ggplot(G.degree.histogram, aes(x = G.total_degrees)) +
  geom_histogram(binwidth = 3, colour = "black", fill = "white") +
  labs(title="Total Degree Distribution", x="Vertex Degree", y="Frequency")
#hist(degree(class_graph, mode = 'in'), col='lightblue', xlab='Vertex Degree',
#ylab='Frequency', main='')
#hist(degree(class_graph, mode = 'out'), col='blue', xlab='Vertex Degree',
#ylab='Frequency', main='')
According to the degree distributions, the Classroom network is well connected, with a few nodes having noticeably higher degrees compared to the rest.
Question 33: Calculate the centrality measures degree, betweenness, closeness, and eigenvector. Extract the nodes that score the highest on each centrality measure.
Hint: degree(graph, ..., mode=c("all", "out", "in", "total")); betweenness(graph, directed = TRUE, ...); closeness(graph, ...); eigen_centrality(graph, directed = TRUE, ...).
a) highest degree
in_degree_class <- igraph::degree(class_graph, mode='in')
out_degree_class <- igraph::degree(class_graph, mode='out')
total_degree_class <- igraph::degree(class_graph, mode='total')
in_degree_class[in_degree_class == max(in_degree_class)]
## 4
## 55
out_degree_class[out_degree_class == max(out_degree_class)]
## 7
## 164
total_degree_class[total_degree_class == max(total_degree_class)]
## 7
## 196
The node with the highest number of incoming edges is node 4, and the node with the highest out- and total degree is node 7.
b) highest betweenness
between_class <- betweenness(class_graph, directed = TRUE, normalized = TRUE)
between_class[between_class == max(between_class)]
## 7
## 0.3656151
The node that occurs most often on the shortest path between other pairs of nodes is node 7.
c) highest closeness
closeness_class <- closeness(class_graph, normalized = TRUE)
closeness_class[closeness_class == max(closeness_class, na.rm = TRUE) & !is.na(closeness_class)]
## 4 7 14
## 1 1 1
The nodes that are located closest to all other nodes are 4, 7, and 14.
d) highest eigenvector
eigen_class <- eigen_centrality(class_graph, directed = TRUE)
eigen_values <- eigen_class[1]
eigen_values <- unlist(eigen_values)
eigen_values[eigen_values == max(eigen_values)]
## vector.4
## 1
The node with the most connected neighbours is node 4.
Question 34: Apply a community detection algorithm on the Classroom data. First, use the naive approach.
Hint: as.undirected(graph, mode = c("collapse", "each", "mutual")); cluster_fast_greedy(graph).
# Converting the digraph into an undirected graph without multi-edges:
undirected.class <- as.undirected(
class_graph,
mode = c("collapse"))
# use fast and greedy algorithm for hierarchical clustering
cuc <- cluster_fast_greedy(undirected.class)
Use length() to check how many communities the algorithm has identified.
length(cuc)
## [1] 5
Use sizes() to see how large those communities are.
sizes(cuc)
## Community sizes
## 1 2 3 4 5
## 5 5 3 4 3
Plot the results of the community detection using plot(communities, graph).
# plot clusters
plot(cuc, undirected.class)
Interestingly, the community sizes do not vary a lot. In addition, we can see that there are quite a lot of edges between the communities as well as within them, which is not surprising given how well connected this network is.
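This impression can be checked numerically: crossing() flags the edges that run between communities (a quick check, using the communities object and the undirected graph):
# proportion of edges running between different communities
mean(crossing(cuc, undirected.class))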
Create a dendrogram using dendPlot(communities, mode='phylo').
# plot a dendrogram
dendPlot(cuc, mode='phylo')
According to the dendrogram, even though node 19 is assigned to a cluster, it is quite different from the other nodes in its cluster. Similarly, node 2 is quite different from the other members of its cluster. This is also seen in the graph representation above: those two nodes are less connected with the other members of their clusters.
We are going to use REMs for event histories with exact timing information.
Question 35: Fit a simple intercept model, containing only a vector of 1s as an actor-level sending effect, and check the model summary. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)).
Hint: This vector is saved as ClassIntercept, which we can pass to the respective covariate argument in rem.dyad(). Also note that we do not want to discard the timing information (ordinal=FALSE).
classfit1 <- rem.dyad(Class, n=20, effects=c("CovSnd"),
                      covar=list(CovSnd=ClassIntercept), ordinal=FALSE, hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(classfit1)
## Relational Event Model (Temporal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovSnd.1 -3.332287 0.038042 -87.596 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 5987.221 on 691 degrees of freedom
## Residual deviance: 5987.221 on 691 degrees of freedom
## Chi-square: -7.730705e-11 on 0 degrees of freedom, asymptotic p-value 1
## AIC: 5989.221 AICC: 5989.227 BIC: 5993.759
mean(apply(classfit1$predicted.match,1,all))
## [1] 0
The model does not fit any better than the null model, because it is equivalent to the null model (as indicated by the absence of a difference between the null and residual deviance).
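As a sanity check (a sketch, under the assumption that the intercept-only model implies one constant hazard shared by all 20 × 19 directed pairs), the exponentiated intercept should roughly match the observed event rate per pair per minute:
exp(classfit1$coef) # hazard implied by the intercept
nrow(Class) / (20 * 19 * max(Class$StartTime)) # observed events per pair-minute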
Question 36: Now fit a more interesting covariate model that considers gender (male/female) and role (teacher/student) for senders and receivers, and evaluate whether there is any improvement over the intercept-only model by BIC. See how well the events were predicted by your model using mean(apply(your model$predicted.match,1,all)) and compare it with the previous model’s prediction.
classfit2 <- rem.dyad(Class, n=20, effects=c("CovSnd","CovRec"),
                      covar=list(CovSnd=cbind(ClassIntercept, ClassIsTeacher, ClassIsFemale),
                                 CovRec=cbind(ClassIsTeacher, ClassIsFemale)),
                      ordinal=FALSE, hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(classfit2)
## Relational Event Model (Temporal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovSnd.1 -3.834055 0.078836 -48.6333 < 2.2e-16 ***
## CovSnd.2 1.672786 0.091672 18.2475 < 2.2e-16 ***
## CovSnd.3 0.123985 0.094926 1.3061 0.191510
## CovRec.1 0.373994 0.126999 2.9449 0.003231 **
## CovRec.2 0.165380 0.080892 2.0445 0.040909 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 5987.221 on 691 degrees of freedom
## Residual deviance: 5652.318 on 687 degrees of freedom
## Chi-square: 334.9033 on 4 degrees of freedom, asymptotic p-value 0
## AIC: 5662.318 AICC: 5662.405 BIC: 5685.008
# Compare the BIC of this model with the previous model.
classfit1$BIC - classfit2$BIC
## [1] 308.7508
mean(apply(classfit2$predicted.match,1,all))-mean(apply(classfit1$predicted.match,1,all))
## [1] 0.01013025
This is a good improvement over the null model. But note that gender does not appear to be predictive of sending communication; a better model may be one without that specific term included.
Question 37: Fit another model excluding the non-significant term, and again compare to the previous model by BIC and prediction.
Hint: Based on the above output, which statistics should be included/excluded?
classfit3 <- rem.dyad(Class, n=20, effects=c("CovSnd","CovRec"),
                      covar=list(CovSnd=cbind(ClassIntercept, ClassIsTeacher),
                                 CovRec=cbind(ClassIsTeacher, ClassIsFemale)),
                      ordinal=FALSE, hessian=TRUE)
## Prepping edgelist.
## Checking/prepping covariates.
## Computing preliminary statistics
## Fitting model
## Obtaining goodness-of-fit statistics
summary(classfit3)
## Relational Event Model (Temporal Likelihood)
##
## Estimate Std.Err Z value Pr(>|z|)
## CovSnd.1 -3.775222 0.063622 -59.3380 < 2.2e-16 ***
## CovSnd.2 1.615759 0.079933 20.2139 < 2.2e-16 ***
## CovRec.1 0.371765 0.127019 2.9269 0.003424 **
## CovRec.2 0.161158 0.080815 1.9942 0.046135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Null deviance: 5987.221 on 691 degrees of freedom
## Residual deviance: 5654.016 on 688 degrees of freedom
## Chi-square: 333.2049 on 3 degrees of freedom, asymptotic p-value 0
## AIC: 5662.016 AICC: 5662.074 BIC: 5680.169
classfit2$BIC - classfit3$BIC
## [1] 4.839715
mean(apply(classfit3$predicted.match,1,all))-mean(apply(classfit2$predicted.match,1,all))
## [1] 0
Question 38: Is there any improvement?
Hint: Remember that a positive BIC difference means the second (reduced) model has the lower BIC and is therefore preferred.