For the current practical, you will need to install a few packages: ‘glasso’, ‘colorRamps’, ‘igraph’, and ‘RColorBrewer’. This can be done by uncommenting the following lines:
# required.packages <- c('glasso', 'colorRamps', 'igraph', 'RColorBrewer')
# install.packages(required.packages)
library(glasso)
library(colorRamps)
library(igraph)
library(RColorBrewer)
In the first part, we’ll use the ‘glasso’ package, which implements
the Graphical Lasso algorithm, and the ‘igraph’ package, which contains
tools for building network graphs.
First, we need to tell R to import the data, in this case a txt file. Here, we will use the cropdata2.txt data.
Suppose that we are interested in the analysis of a particular plant, which we will model in a very simplistic way by considering the following six continuous variables:
- G: the initial status of the plant (its genetic potential).
- E: the environmental potential of the location and the season it is grown in.
- V: a single vegetative variable summarising all the information available on constituted reserves. V is directly influenced by the values of G and E: the greater they are, the greater the possibility of a large vegetative mass V.
- N: the number of seeds.
- W: the seeds’ mean weight.
- C: the crop, which depends directly on N and W.
As a result, the behavior of a plant can be described by experts or prior knowledge as follows:
{G, E} \(\rightarrow\) V, V \(\rightarrow\) N, V \(\rightarrow\) W, {N, W} \(\rightarrow\) C
In this practical we will not consider the directions of the edges; directed structures will be covered later with Bayesian Networks.
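Since we only keep the undirected skeleton, here is a minimal sketch (using the igraph package loaded above; not part of the questions) that draws the expert structure as an undirected graph, for later comparison with the inferred network:
# Expert / prior-knowledge structure, drawn without edge directions.
expert_graph <- graph_from_literal(G - V, E - V, V - N, V - W, N - C, W - C)
plot.igraph(expert_graph)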
Question 1: Complete the following steps.
Step 1) import the data (the txt file).
Hint: Use read.table(yourdata, header=T)
.
# Import the data.
cropdata2 <- read.table("cropdata2.txt", header=TRUE)
Step 2) Create the covariance matrix and assign it to a
variable named “S”.
Hint: Use cov()
.
S <- cov(cropdata2)
The values along the diagonal of the matrix are simply the variances of each variable. The other values in the matrix represent the covariances between pairs of variables.
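As a quick sanity check (an illustrative sketch, not part of the questions), the diagonal of S should match the per-column sample variances:
# Each diagonal entry of S is the sample variance of the corresponding column.
all.equal(diag(S), sapply(cropdata2, var))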
Question 2: Apply the Graphical Lasso using glasso() to estimate the precision matrix (inverse covariance matrix).
To apply the Graphical Lasso, we choose a value for rho, which is the
regularization parameter that controls the degree of sparsity in the
resulting inverse covariance matrix (in the lecture slides the
regularization parameter is referred to as lambda). Higher values lead
to greater sparsity. As mentioned in the lecture, sparse networks are easier to interpret. In our application, there is no “correct” value
of rho, but it can be tuned for your use case. For instance, if you want
to isolate the strongest relationships in your data you would choose a
higher rho value. If you are interested in preserving more tenuous
connections, you would choose a lower rho value. Finding a sensible
value requires experimentation.
Hint: Use
glasso(your covariance matrix, rho = your specified value)
.
rho <- 20 # after you've plotted the network (Q7), come back to this question and play around with this variable and see how the resulting graph changes.
invcov <- glasso(S, rho=rho)
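To get a feeling for the effect of rho, here is a small sketch (the rho values below are arbitrary) that counts how many off-diagonal entries of the estimated precision matrix remain non-zero:
# Larger rho values push more off-diagonal entries of the precision matrix to zero.
for (r in c(1, 10, 20, 50)) {
  fit <- glasso(S, rho = r)
  cat("rho =", r, "-> non-zero off-diagonal entries:",
      sum(fit$wi[upper.tri(fit$wi)] != 0), "\n")
}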
Question 3: Check the symmetry of the inverse covariance
matrix.
It is also not a bad idea to check for symmetry in the resulting inverse
covariance matrix. Asymmetry can arise due to numerical computation and
rounding errors, which can cause problems later depending on what you
want to do with the matrix.
Hint: wi is the inverse covariance matrix; you can extract it using the $ operator. Use
isSymmetric(your precision matrix$wi)
.
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check the symmetry
isSymmetric(P)
## [1] FALSE
If your inverse covariance matrix is not symmetric, make it symmetric by ensuring the values below the diagonal are equal to the values above the diagonal. Hint:
your matrix[lower.tri(your matrix)] = t(your matrix)[lower.tri(your matrix)]
.
if(!isSymmetric(P)) { # if it is not symmetric,
P[lower.tri(P)] = t(P)[lower.tri(P)] # make it symmetric
}
isSymmetric(P)
## [1] TRUE
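An equivalent, commonly used alternative (shown here only as a sketch) is to average the matrix with its transpose; for asymmetry caused by rounding errors, both approaches give essentially the same result:
# Averaging a matrix with its transpose always yields an exactly symmetric matrix.
P_avg <- (P + t(P)) / 2
isSymmetric(P_avg)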
Question 4: Next, we calculate the partial correlation matrix
and set the terms on the diagonal to zero.
This prevents variables having connections with themselves in the
network graph we will be shortly constructing.
Hint: partial correlation: \(\rho_{xy.z} = -\frac{p_{ij}}{\sqrt{p_{ii}p_{jj}}}\), where \(p_{ij}\) is the \((i,j)^{th}\) element of the precision (inverse covariance) matrix. First, create an empty matrix with the same number
of rows and columns as in your inverse covariance matrix. You can use
matrix(nrow=nrow(your cov. matrix), ncol=ncol(your cov. matrix))
.
Then, use a for-loop to iterate over every row and column:
for(j in 1:nrow(your empty matrix))
,
for(i in 1:ncol(your empty matrix))
to assign a value to
each matrix cell:
your empty matrix[i,j] <- partial correlation equation
.
Lastly, you’ll need to assign the same column and row names as in your
inverse covariance matrix. For this, you can use colnames()
, and rownames()
.
# compute the partial correlation matrix
parr.corr <- matrix(nrow=nrow(P), ncol=ncol(P))
for(j in 1:nrow(parr.corr)) {
for(i in 1:ncol(parr.corr)) {
parr.corr[i, j] <- -P[i,j]/sqrt(P[i,i]*P[j,j])
}
}
# define the column & row names
colnames(parr.corr) <- colnames(P)
rownames(parr.corr) <- colnames(P)
# set the diagonal terms to zero
diag(parr.corr) <- 0
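If you prefer to avoid loops, the same matrix can be obtained in one vectorised step (a sketch using base R's outer()):
# Vectorised equivalent of the double loop above.
parr.corr.vec <- -P / sqrt(outer(diag(P), diag(P)))
diag(parr.corr.vec) <- 0
all.equal(parr.corr, parr.corr.vec, check.attributes = FALSE)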
Question 5: There is another way to compute the partial
correlation matrix. Compute the partial correlation matrix using
cov2cor()
.
Note that we set the terms on the diagonal to zero as explained
before.
Hint: partial correlation matrix =
-1*cov2cor(Estimated inverse covariance matrix)
# using cov2cor()
parr.corr2 <- -1*cov2cor(P)
# define the column & row names
colnames(parr.corr2) <- colnames(P)
rownames(parr.corr2) <- colnames(P)
# set the diagonal terms to zero
diag(parr.corr2) <- 0
Question 6: Do both approaches produce the same partial correlation matrix?
Hint: use all.equal()
to check the equality of the two
matrices you have obtained above.
all.equal(parr.corr,parr.corr2)
## [1] TRUE
Now if you run View(your partial correlation matrix) in RStudio, you’ll see a sparse partial correlation matrix. In fact, the
non-zero elements represent a connection between two variables with the
strength of the connection determined by the magnitude of the partial
correlation.
The partial correlation matrix can be used to build a network graph,
where variables are represented as nodes and non-zero elements are
represented as edges between two nodes.
# Uncomment to check the sparse partial correlation matrix below.
# View(parr.corr)
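To quantify this sparsity, a one-line sketch that counts the edges implied by the matrix (each non-zero entry above the diagonal corresponds to one undirected edge):
# Count the undirected edges encoded in the sparse partial correlation matrix.
sum(parr.corr[upper.tri(parr.corr)] != 0)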
Now the second step is to display the inferred network. We will use
igraph
for this purpose. The igraph package has some
fantastic tools for building, manipulating and displaying graphs. We
will only use a fraction of the package’s features here, but if you’re
interested in learning more about it, check out its documentation (e.g.,
https://igraph.org/r/doc/).
Question 7: Construct the network graph. First, use graph_from_adjacency_matrix() to build a graph object from the partial correlation matrix, treated as a weighted adjacency matrix. Then, use plot.igraph() to plot it.
Hint: Note that the mode should be undirected and weighted should be TRUE:
graph_from_adjacency_matrix(your partial correlation matrix, mode="undirected", weighted=TRUE)
infered_graph <- graph_from_adjacency_matrix(parr.corr2, mode="undirected", weighted=TRUE)
## Warning: The `adjmatrix` argument of `graph_from_adjacency_matrix()` must be symmetric
## with mode = "undirected" as of igraph 1.6.0.
## ℹ Use mode = "max" to achieve the original behavior.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot.igraph(infered_graph)
Question 8: You can also use plot()
to see the
graph. In this case, choose a different layout.
Hint: For instance, you can choose layout = layout_with_lgl(your graph object) or layout = layout_on_grid(your graph object); these are the current igraph names for the deprecated layout.lgl() and layout.grid().
plot(infered_graph)
plot(infered_graph, vertex.size = 35, edge.width = 1, vertex.color = "purple",vertex.label.color="white")
plot(infered_graph, edge.color = "black", vertex.color = "lightgreen", vertex.size = 50)
plot(infered_graph, layout = layout_with_lgl(infered_graph))
plot(infered_graph, layout = layout_on_grid(infered_graph))
Question 9: Now without considering the penalized likelihood
(glasso), compute the partial correlation matrix and estimate the
network based on that. What is the difference between this network and
the previous network?
Hint:
- step 1. Use cov() to get the covariance matrix.
- step 2. Use solve() to compute the inverse matrix.
- step 3. Use cov2cor() to compute the partial correlation matrix. Then, use the plot() function as explained above.
S2 <- cov(cropdata2)
invS2 <- solve(S2)
parr.corr3 <- -1*cov2cor(invS2)
diag(parr.corr3) <- 0
infered_graph2 <- graph_from_adjacency_matrix(parr.corr3, mode="undirected", weighted=TRUE)
plot(infered_graph2, edge.color = "black", vertex.color ="lightgreen", vertex.size = 50)
## Warning in v(graph): Non-positive edge weight found, ignoring all weights
## during graph layout.
plot(infered_graph2, layout = layout_on_grid(infered_graph2)) # using a grid layout
Yes, you are right: the second graph is fully connected and really difficult to interpret. This is exactly where we should start thinking about sparsity.
Question 10: Compare the corresponding graph in the lecture slides with the sparse graph that you inferred already.
For the second part of the practical we will use the ‘ggplot2’ and ‘bootnet’ packages.
Question 11: First install and load these two packages.
#install.packages("ggplot2")
#install.packages("bootnet")
library("ggplot2")
library("bootnet")
## This is bootnet 1.6
## For questions and issues, please see github.com/SachaEpskamp/bootnet.
The next step is to tell R to estimate the network model using EBICglasso (Foygel and Drton, 2010) to produce an interpretable network. This function also uses the glasso package (Friedman, Hastie and Tibshirani, 2011) to compute a sparse Gaussian graphical model with the graphical lasso (Friedman, Hastie & Tibshirani, 2008). The tuning parameter is chosen using the Extended Bayesian Information Criterion (EBIC). See the documentation for more information.
For the problem of recovering the graphical structure, information criteria provide useful optimization objectives: they give the objective to be minimized in (heuristic) searches over the space of graphs, and they can be used to select the tuning parameters of other methods such as the graphical lasso (which is essentially a likelihood penalization technique). In many applications, particularly the analysis of gene expression data, inference of the graph itself is of significant interest.
An extended Bayesian information criterion can be used for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. It allows for growth in the number of non-zero parameters in the true model that is necessary to cover connected graphs (Foygel and Drton, 2010):
\[BIC_{\gamma}(E) = -2\, l_n(\hat{\theta}(E)) + |E| \log(n) + 4 \gamma |E| \log(p),\]
where \(E\) is the edge set of a candidate graph and \(l_n(\hat{\theta}(E))\) denotes the maximized log-likelihood function of the associated model.
If \(\gamma = 0\), the classical BIC is recovered, which yields (asymptotically) consistent model selection in the setting of a fixed number of variables p and a growing sample size n. A positive \(\gamma\) leads to stronger penalization of large graphs.
The ordinary Bayesian information criterion is too liberal for model selection when the model space is large. The extended criteria take into account both the number of unknown parameters and the complexity of the model space; their consistency has been established, in particular allowing the number of covariates to increase to infinity with the sample size (Chen and Chen, 2008).
Question 12: Estimate the network model using the
EBICglasso()
and compare it with the sparse network obtained in the previous part.
Hint: Use
estimateNetwork(your data, default = "EBICglasso")
. After
that, plot the graph using plot()
function. Note that the
red color shows negative correlation and the blue color shows positive
correlation. The thickness of the edge shows how strong the relationship
is.
Network <- estimateNetwork(cropdata2, default = "EBICglasso")
plot(Network, layout = "spring", labels = colnames(cropdata2))
results <- estimateNetwork(
cropdata2,
default = "EBICglasso",
corMethod = "cor")
plot(results)
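To see the effect of \(\gamma\) discussed above, you can vary bootnet's tuning argument, which sets the EBIC \(\gamma\) (a sketch; tuning = 0 reduces to the ordinary BIC, and 0.5 is the package default):
# A smaller gamma penalizes large graphs less, so it tends to keep more edges.
net_bic  <- estimateNetwork(cropdata2, default = "EBICglasso", tuning = 0)
net_ebic <- estimateNetwork(cropdata2, default = "EBICglasso", tuning = 0.5)
plot(net_bic)
plot(net_ebic)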
By means of synthetic biology, Cantone et al. (2009) designed a network in S. cerevisiae (yeast). With Real-Time Polymerase Chain Reaction (RT-PCR), Cantone et al. (2009) measured in vivo gene expression data: first under galactose- and then under glucose-metabolism. For both carbon sources, the network structure is identical, but the strengths of the regulatory processes change with the carbon source.
<Project 1> Considering the yeast data, estimate the
network model twice: first using glasso
, and second using
EBICglasso
. What is the difference? Is the result a fully connected network or a very sparse one?
yeast <- read.table("yeast.txt", header=T)
colnames(yeast)<- c("CBF1", "GAL4", "SWI5", "GAL80", "ASH1")
S <- cov(yeast)
rho <- 20 # play around with this variable and see how the resulting graph changes.
invcov <- glasso(S, rho=rho)
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check symmetry
isSymmetric(P)
## [1] TRUE
parr.corr_yeast <- -1*cov2cor(P)
colnames(parr.corr_yeast) <- colnames(P)
rownames(parr.corr_yeast) <- colnames(P)
diag(parr.corr_yeast) <- 0
infered_graph_yeast <- graph_from_adjacency_matrix(parr.corr_yeast, mode="undirected", weighted=TRUE)
plot(infered_graph_yeast, edge.color = "black", vertex.color ="lightgreen", vertex.size = 50)
Network <- estimateNetwork(yeast, default = "pcor") # "pcor" stands for a partial correlation network.
plot(Network, layout = "spring", labels = colnames(yeast))
results <- estimateNetwork(
yeast,
default = "EBICglasso",
corMethod = "cor")
plot(results)
<Project 2> First import the data (sachs.data.txt) into RStudio. The sachs.data.txt file contains a flow cytometry dataset with p = 11 proteins measured on N = 7466 cells (Sachs et al., 2003). Then,
estimate the network model: first using glasso
(rho= 10,
20, 30) and second using EBICglasso
. What is the
difference? Play around with the tuning parameter (rho) using
glasso
and see how the network changes.
Protein_data <- read.table("sachs.data.txt", header = TRUE)
rho <- 40 # play around with this variable and see how the resulting graph changes.
S <- cov(Protein_data)
invcov <- glasso(S, rho=rho)
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check the symmetry
isSymmetric(P)
## [1] FALSE
if(!isSymmetric(P)) { # if it is not symmetric,
P[lower.tri(P)] = t(P)[lower.tri(P)] # make it symmetric
}
# check the symmetry again
isSymmetric(P)
## [1] TRUE
parr.corr_sachs <- -1*cov2cor(P)
colnames(parr.corr_sachs) <- colnames(P)
rownames(parr.corr_sachs) <- colnames(P)
diag(parr.corr_sachs) <- 0
infered_graph_sachs <- graph_from_adjacency_matrix(parr.corr_sachs, mode="undirected", weighted=TRUE)
plot(infered_graph_sachs, edge.color = "black", vertex.color ="lightgreen", vertex.size = 50)
## Warning in v(graph): Non-positive edge weight found, ignoring all weights
## during graph layout.
plot(infered_graph_sachs, layout = layout_on_grid(infered_graph_sachs), vertex.size = 60)