R can be obtained via the https://www.r-project.org/ webpage. To download R, you need to select your preferred CRAN (Comprehensive R Archive Network) mirror (https://cran.r-project.org/mirrors.html). On the Mirrors webpage, you will find listings of countries that have identical versions of R and should select a location geographically close to your computer’s location. R can be downloaded for Linux, Windows, and Mac OS. The pages are regularly updated and you need to check with releases are supported for your platform. R as a base package can perform many statistical analyses but most importantly, R’s functionality can be expanded by downloading specific packages. After installing R (https://www.r-project.org/), it is quite useful to also install R Studio (https://www.rstudio.com/), which provides a convenient interface to R. Once both are installed, opening up R Studio will give a window that is split into 4 panes:
install.packages()
). R is a command line
driven programme and you can enter commands at the prompt
(>
by default) and each command is executed one at a
time.For the current practical, you will need to install a few packages as follows: ‘glasso’, ‘colorRamps’, ‘igraph’, ‘RColorBrewer’, ‘ggplot2’, and ‘bootnet’. This can be done by uncommenting the following comments:
# required.packages <- c('glasso', 'colorRamps', 'igraph', 'RColorBrewer',"ggplot2","bootnet")
# install.packages(required.packages)
library(glasso)
library(colorRamps)
library(igraph)
library(RColorBrewer)
library(ggplot2)
library(bootnet)
In the first part, we’ll use the ‘glasso’ package, which implements
the Graphical Lasso algorithm, and the ‘igraph’ package, which contains
tools for building network graphs.
First, we need to tell R to import the data, in this case a txt
file.
Here, we will use the data (cropdata2.txt
).
Suppose that we are interested in the analysis of a particular plant, which we will model in a very simplistic way by considering:
We will analyze the following six continuous variables:
G
: the initial status of the plant (its genetic
potential)E
: the environmental potential of the location and the
season it is grown in.V
: a single vegetative variable summarising all the
information available on constituted reserves. V
is
directly influenced by the values of G
and E
.
The greater they are, the greater is the possibility of a large
vegetative mass V
.N
: the number of seedsW
: the mean weightC
: the crop, depends directly on N
and
W
.As a result, the behaviour of a plant can be described by experts or
the prior knowledge as follows:
{G, E}
\(\rightarrow\)
V
, V
\(\rightarrow\) N
,
V
\(\rightarrow\)
W
, {N, W}
\(\rightarrow\) C
Question 1: Do the following steps.
Step 1) import the data (the txt file).
Hint: Use read.table(yourdata, header=T)
.
# Import the data.
cropdata2 <- read.table("cropdata2.txt", header=TRUE)
Step 2) Create the covariance matrix and assign it to a
variable named “S”.
Hint: Use cov()
.
S <- cov(cropdata2)
The values along the diagonal of the matrix are simply the variances
of each product.
The other values in the matrix represent the covariances between the
various products.
Question 2: Apply Graphical lasso using glasso()
to estimate precision matrix.
To apply the Graphical Lasso, we choose a value for rho, which is the
regularization parameter that controls the degree of sparsity in the
resulting inverse covariance matrix. Higher values lead to greater
sparsity. As it was mentioned in the lecture sparse networks are easier
to interpret. In our application, there is no “correct” value of rho,
but it can be tuned for your use case. For instance, if you want to
isolate the strongest relationships in your data you would choose a
higher value rho. If you are interested in preserving more tenuous
connections, you would choose a lower value of rho. Finding a sensible
value requires experimentation.
Hint: Use
glasso(your covariance matrix, rho = your specified value)
.
rho <- 20 # play around with this variable and see how the resulting graph changes.
invcov <- glasso(S, rho=rho)
Question 3: Check the symmetry of the inverse covariance
matrix.
It is also not a bad idea to check for symmetry in the resulting inverse
covariance matrix. Asymmetry can arise due to numerical computation and
rounding errors, which can cause problems later depending on what you
want to do with the matrix.
Hint: wi
is the inverse covariance matrix. Use
isSymmetric(your inverse covariance matrix)
.
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check the symmetry
isSymmetric(P)
## [1] FALSE
if(!isSymmetric(P)) { # if it is not symmetric,
P[lower.tri(P)] = t(P)[lower.tri(P)] # make it symmetric
}
Question 4: Next, we calculate the partial correlation matrix
and set the terms on the diagonal to zero.
This prevents variables having connections with themselves in the
network graph we will be shortly constructing.
Hint: partial correlation: \(\rho_{xy.z}=
-\frac {p_{ij}} {\sqrt{p_{ii}p_{jj}}}\), where \(p_{ij}\) is the \((i,j)^{th}\) element of the inverted
correlation matrix.
# compute the partial correlation matrix
parr.corr <- matrix(nrow=nrow(P), ncol=ncol(P))
for(k in 1:nrow(parr.corr)) {
for(j in 1:ncol(parr.corr)) {
parr.corr[j, k] <- -P[j,k]/sqrt(P[j,j]*P[k,k])
}
}
# define the column & row names
colnames(parr.corr) <- colnames(P)
rownames(parr.corr) <- colnames(P)
# set the diagonal terms to zero
diag(parr.corr) <- 0
Question 5: There is another way to compute the partial
correlation matrix. Compute the partial correlation matrix using
cov2cor()
.
Note that we set the terms on the diagonal to zero as explained
before.
Hint: partial correlation matrix =
-1*cov2cor(Estimated inverse covariance matrix)
# using cov2cor()
parr.corr2 <- -1*cov2cor(P)
# define the column & row names
colnames(parr.corr2) <- colnames(P)
rownames(parr.corr2) <- colnames(P)
# set the diagonal terms to zero
diag(parr.corr2) <- 0
Question 6: Do both ways produce the same correlation
matrix?
Hint: use all.equal()
to check the equality of the two
matrices you have obtained above.
all.equal(parr.corr,parr.corr2)
## [1] TRUE
Now if you run View(partial correlation matrix)
in R
Studio, you’ll see a sparse partial correlation matrix. In fact, the
non-zero elements represent a connection between two variables with the
strength of the connection determined by the magnitude of the partial
correlation.
The partial correlation matrix can be used to build a network graph,
where variables are represented as nodes and non-zero elements are
represented as edges between two nodes.
# Uncomment to check the sparse partial correlation matrix below.
# View(parr.corr)
Now the second step is to display the inferred network. We will use
igraph
for this purpose. The igraph package has some
fantastic tools for building, manipulating and displaying graphs. We
will only use a fraction of the package’s features here, but if you’re
interested in learning more about it, check out its documentation (e.g.,
https://igraph.org/r/doc/)
Question 7: Construct the network graph. First, use
graph_from_adjacency_matrix()
to make adjacency matrix.
Then, use plot.igraph()
to plot it.
Hint: Note that the mode
should be undirected
and the weighted
should be True
.
“graph_from_adjacency_matrix(your partial correlation matrix, mode="undirected", weighted=TRUE)
”
infered_graph <- graph_from_adjacency_matrix(parr.corr2, mode="undirected", weighted=TRUE)
plot.igraph(infered_graph)
Question 8: You can also use plot()
to see the
graph. In this case, choose a different layout.
Hint: For instance, you can choose
layout = layout.lgl(your graph object)
or
layout = layout.grid(your graph object)
.
plot(infered_graph)
plot(infered_graph, vertex.size = 35, edge.width = 1, vertex.color = "purple",vertex.label.color="white")
plot(infered_graph, edge.color = "black", vertex.color =
"lightgreen", vertex.size = 50)
plot(infered_graph, layout = layout.lgl(infered_graph))
plot(infered_graph, layout = layout.lgl(infered_graph))
plot(infered_graph, layout = layout.grid(infered_graph))
Question 9: Now without considering the penalized likelihood,
compute the partial correlation matrix and estimate the network based on
that. What is the difference between this network and the previous
network?
Hint:
- step 1. Use cov()
to get the covariance matrix.
- step 2. Use solve()
to compute the inverse matrix.
- step 3. Use cov2cor()
to compute the partial covariance
matrix. Then, use the plot()
function as explained
above.
S2 <- cov(cropdata2)
invS2 <- solve(S2)
parr.corr3 <- -1*cov2cor(invS2)
diag(parr.corr3) <- 0
infered_graph2 <- graph_from_adjacency_matrix(parr.corr3, mode="undirected", weighted=TRUE)
plot(infered_graph2, edge.color = "black", vertex.color =
"lightgreen", vertex.size = 50)
plot(infered_graph2, layout = layout.grid(infered_graph)) # using layout.grid
Yes, you are right. The second graph is a fully connected graph and really difficult to interpret.
Question 10: Compare the corresponding graph in the lecture slides with the sparse graph that you inferred already.
For the second part of the practical we use ‘ggplot2’ and ‘bootnet’ packages.
Question 11: First install and load these two packages.
#install.packages("ggplot2")
#install.packages("bootnet")
library("ggplot2")
library("bootnet")
The next step is to tell R to estimate the network model using the
EBICglasso
(Foygeland Drton, 2010) to produce an
interpretable network. This function uses the glasso package (Friedman,
Hastie and Tibshirani, 2011) as well in order to compute a sparse
Gaussian graphical model with the graphical lasso (Friedman, Hastie
& Tibshirani, 2008). The tuning parameter is chosen using the
Extended Bayesian Information Criterion (EBIC). See the
documentation for more information.
For the problem of recovering the graphical structure, information
criteria provide useful optimization objectives for algorithms searching
through sets of graphs or for selection of tuning parameters of other
methods (such as the graphical lasso, which is basically a likelihood
penalization technique).
In many applications, particularly in the analysis of gene expression
data, inference of the graph is of significant interest. Information
criteria provide an important tool for this problem. They provide the
objective to be minimized in (heuristic) searches over the space of
graphs and they are sometimes used to select tuning parameters in other
methods such as the graphical lasso.
An extended Bayesian information criterion can be used for Gaussian
graphical models in a scenario where both the number of variables
p and the sample size n grow. It allows for growth in
the number of non-zero parameters in the true model that is necessary to
cover connected graphs (Foygel R., and Drton M. (2010)).
\[BIC_{\gamma}(E) = −2ln(\theta(E)) + |E| log(n) + 4.{\gamma} |E|_{\gamma} log(p),\]
where \(E\) is the edge set of a
candidate graph and \(ln(\theta(E))\)
denotes the maximized log-likelihood function of the associated
model.
If \(\gamma = 0\), then the classical
BIC is recovered, which leads to the (asymptotically) consistent model
selection in the setting of fixed number of variables p and
growing sample size n. Positive \(\gamma\) leads to the stronger penalization
of large graphs.
The ordinary Bayes information criterion is too liberal for model selection when the model space is large. The new criteria take into account both the number of unknown parameters and the complexity of the model space. Their consistency is established, in particular allowing the number of covariates to increase to infinity with the sample size (Chen J. and Chen Z. 2008).
Question 12: Estimate the network model using the
EBICglasso()
and compare it with the sparse network
resulted from the previous part.
Hint: Use
estimateNetwork(your data,default = "EBICglasso")
. After
that, plot the graph using plot()
function. Note that the
red color shows negative correlation and the blue color shows positive
correlation. The size of the edge shows how strong the relationship
is.
Network <- estimateNetwork(cropdata2, default = "EBICglasso")
plot(Network, layout = "spring", labels = colnames(cropdata2))
results <- estimateNetwork(
cropdata2,
default = "EBICglasso",
corMethod = "cor")
plot(results)
By means of synthetic biology, Cantone et al. (2009) designed a network in S.cerevisiae (yeast). With Real-Time Polymerase Chain Reaction (RT-PCR), Cantone et al.(2009) measured in vivo gene expression data: first under galactose- and then under glucose-metabolism. For both carbon sources, the network structure is identical, but the strengths of the regulatory processes change with the carbon source.
<Project 1> Considering the yeast data, estimate the
network model twice: first using glasso
, and second using
EBICglasso
. What is the difference? Is that a fully
connected network or a very sparse network?
yeast <- read.table("yeast.txt", header=T)
colnames(yeast)<- c("CBF1", "GAL4", "SWI5", "GAL80", "ASH1")
S <- cov(yeast)
rho <- 20 # play around with this variable and see how the resulting graph changes.
invcov <- glasso(S, rho=rho)
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check symmetry
isSymmetric(P)
## [1] TRUE
parr.corr_yeast <- -1*cov2cor(P)
colnames(parr.corr_yeast) <- colnames(P)
rownames(parr.corr_yeast) <- colnames(P)
diag(parr.corr_yeast) <- 0
infered_graph_yeast <- graph_from_adjacency_matrix(parr.corr_yeast, mode="undirected", weighted=TRUE)
plot(infered_graph_yeast, edge.color = "black", vertex.color ="lightgreen", vertex.size = 50)
Network <- estimateNetwork(yeast, default = "EBICglasso")
plot(Network, layout = "spring", labels = colnames(yeast))
results <- estimateNetwork(
yeast,
default = "EBICglasso",
corMethod = "cor")
plot(results)
<Project 2> First import the data
(sachs.data.txt
) into your Rstudio.
sachs.data.txt
file contains a flow cytometry dataset with
p = 11 proteins measured on N = 7466 cells (Sachs et al., 2003). Then,
estimate the network model: first using glasso
(rho=
10,20,30) and second using EBICglasso
. What is the
difference? Play around with the tuning parameter (rho) using
glasso
and see how the network changes.
Protein_data <- read.table("sachs.data.txt", header = TRUE)
rho <- 40 # play around with this variable and see how the resulting graph changes.
S <- cov(Protein_data)
invcov <- glasso(S, rho=rho)
P <- invcov$wi
colnames(P) <- colnames(S)
rownames(P) <- rownames(S)
# check the symmetry
isSymmetric(P)
## [1] FALSE
if(!isSymmetric(P)) { # if it is not symmetric,
P[lower.tri(P)] = t(P)[lower.tri(P)] # make it symmetric
}
# check the symmetry again
isSymmetric(P)
## [1] TRUE
parr.corr_sachs <- -1*cov2cor(P)
colnames(parr.corr_sachs) <- colnames(P)
rownames(parr.corr_sachs) <- colnames(P)
diag(parr.corr_sachs) <- 0
infered_graph_sachs <- graph_from_adjacency_matrix(parr.corr_sachs, mode="undirected", weighted=TRUE)
plot(infered_graph_sachs, edge.color = "black", vertex.color ="lightgreen", vertex.size = 50)
plot(infered_graph_sachs, layout = layout.grid(infered_graph_sachs),vertex.size = 60)