Welcome back to the practical. Today we will focus on the following topics:
You will need to install the following packages for working on this practical.
# Installing the BiocManager package when necessary
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install("Rgraphviz")
## Loading the following packages into the current session. You can load packages after you've installed them using install.packages("name of package").
library(gRain)
library(Rgraphviz)
library(lattice)
library(gridExtra)
## To create BNs and inferring DAGs and estimating parameters,
## we will use mainly the bnlearn package (short for "Bayesian network learning").
## install bnlearn with install.packages("bnlearn")
library(bnlearn)
In this section, we will see how discrete BN works. To this end, first, we assume that an expert has an opinion about the dependencies between variables. That is, the structure of the network (network topology) is known.
We would like to investigate the usage patterns of different means of transport with a focus on cars and trains based on a survey. This is done to assess the customer satisfaction across different social groups to evaluate public policies or urban planning. To do so, we will examine the following six discrete variables for each individual:
A
): Recorded as young (‘young’)
for individuals below 30 years old, adult (‘adult’) for individuals
between 30 and 60 years old, and old (‘old’) for people older than
60.S
): Recorded as male (‘M’) or
female (‘F’).E
): The highest level of
education or training completed by an individual, recorded either as up
to high school (‘high’) or university degree (‘uni’).O
): whether the individual
is an employee (‘emp’) or a self- employed (‘self’) worker.R
): The size of the city
the individual lives in, recorded as either small (‘small’) or big
(‘big’).T
): The means of transport
favored by the individual, recorded either as car (‘car’), train
(‘train’) or other (‘other’).Now, let’s get started. First, we are very interested in creating a
DAG based on the expert knowledge. This DAG should show the dependencies
between the aforementioned variables. We will do this here step by step.
Question 1: First step, create a DAG with one node for each
variable in the survey and no arcs.
Use empty.graph()
function. You can define the nodes
through a vector via “nodes
” argument.
dag <- empty.graph(nodes = c("...", "..." , "..."))
# Write your code here
As mentioned above, one of the sources for getting information about the nodes is the expert’s opinion, and another source is the prior knowledge, for instance, from the literature. The expert’s opinions are as follows:
There is an edge from = “A”, to = “E”; that is, “E” depends on
“A”
There is an edge from = “S”, to = “E”; that is, “E” depends on “S”
There is an edge from = “E”, to = “O”; that is, “O” depends on “E”
There is an edge from = “E”, to = “R”; that is, “R” depends on “E”
There is an edge from = “O”, to = “T”; that is, “T” depends on “O”
There is an edge from = “R”, to = “T”; that is, “T” depends on “R”
Question 2: Now, based on expert’s opinion, start adding
the arcs that encode the direct dependencies between the variables in
the survey.
You can check the slides for the whole graph as well as the direction of
arcs.
To this end, one option is to assign arcs to the dag by using:
name of DAG that you created above <- set.arc(name of DAG that you created above, from= "name of parent", to = "name of child")
.
You should use this function for each edge separately. It should be
noted that from
is the parent and to
is the
child node.
# Write your code here
## The dag object encodes the desired direct dependencies.
Note that in the output, e.g. [E|A:S]
, means that A
\(\rightarrow\) E and S \(\rightarrow\) E. That is, in this case, E
has two parents (‘A’ and ‘S’ are co-parents of E) and[A]
means that there is no arc pointing towards A. This representation stems
from a product of conditional probabilities. To see all dependencies,
use the modelstring
function:
modelstring(name of the dag)
.
# Check the output yourself!
# modelstring(name of your dag)
Look at the the documentation included in the package for a
comprehensive overview of all functions in bnlearn
.
Two basic functions are nodes(name of the dag)
and
arcs(name of the dag)
.
Question 3: Use these two functions to see which nodes and edges
you have.
# Write your code here
The latter function also provides a way to add arcs to a DAG via
a matrix.
This is faster than setting them one at a time. See below:
## We can create a matrix with the same structure as that returned by arcs
## and set the whole arc set at once.
dag2 <- empty.graph(nodes = c("A", "S", "E", "O", "R", "T"))
arc.set <- matrix(c("A", "E",
"S", "E",
"E", "O",
"E", "R",
"O", "T",
"R", "T"),
byrow = TRUE, ncol = 2,
dimnames = list(NULL, c("from", "to")))
arcs(dag2) <- arc.set
Compare dag2
to the previous graph that you created
with all.equal()
.
all.equal(x, y)
is a utility to compare R objects x and y
testing the ‘near equality’. If they are different, the comparison is
still made to some extent, and a report of the differences is returned.
Look at the documentation for more explanation.
Question 4: What is the difference? Which approach do you prefer?
# Write your code here
Note that both approaches for creating a DAG guarantee that the
DAG will be acyclic.
Question 5: Use try()
to check the acyclicity
constraint when we add another edge from = “T”, to = “E”.
Hint:
try(set.arc(name of dag, from = name of parent, to = name of child))
# Write your code here
Question 6: Can you try to add another edge and check
the acyclicity constraint?
# Try it out your own!
Previously, we assumed the DAG structure is known, which is rarely the case. In real life situation often the DAG structures are unknown and should be inferred from the data. In this section, we will estimate the DAG structure using tests and scores.
Question 7: First, import the data survey
into R
using read.table()
. The data is already as a .txt file.
Check the first/last 10 rows using
head(data, number of rows)
and
tail(data, number of rows)
.
Hint:
read.table("datafile", header = TRUE, colClasses = "factor")
## Load data
## Check the data
Generally Conditional Independence Tests (CITs) test whether some
specific edges should be included in the DAG using statistical tests.
The ci.test()
from bnlearn
is the main
function used for this aim and it implements both \(G^2\) and \(X^2\). The log-likelihood ratio \(G^2\), is used when
test = "mi"
.
(G–test, also known as the likelihood ratio test, is used when you want
to assess the goodness of fit of two competing statistical models based
on the ratio of their likelihoods).
For example, we can test whether there is an edge from T to E:
ci.test("T", "E", c("O", "R"), test = "mi", data = survey)
In bnlearn
, we can compute the classic frequentist and
maximum likelihood estimates. This can be done with the
bn.fit
function. bn.fit
estimates the
parameters from the data.
Question 8: Given the piece of code above, can you guess what
c("O", "R")
is exactly? Look at the slides for answering
this.
Pearson’s \(X^2\) test is used when
test = "x2"
.
(Pearson’s chi-squared test is a statistical test applied to sets of
categorical data to evaluate how likely it is that any observed
difference between the sets arose by chance).
Question 9: Replace the test = "mi"
with
test = "x2"
, and see whether the result
changes.
# Write your code here
Note that c("O", "R")
are parents of
T
.
Question 10: What do the p-values show?
Unlike conditional independence tests, network scores focus on the
DAG as a whole.
They are goodness-of-fit statistics measuring how well the DAG mirrors
the dependence structure of the data.
Bayesian Information criterion (BIC) is used when
type = "bic"
. Bayesian Dirichlet equivalent uniform (BDeu)
(posterior probability of the DAG) is used when
type = "bde"
, associated with a uniform prior over both the
space of the DAGs and of the parameters. For example:
score(dag, data = survey, type = "bic")
Question 11: Use type = "bde"
and
iss=10
, and see whether the score changes.
Note: The argument iss
, is the imaginary sample
size (equivalent sample size), which is optional. iss
determines how much weight is assigned to the prior distribution
compared to the data when computing the posterior probabilities, as in
Bayesian setting we aim to compute posterior probabilities based on
prior knowledge and the data. The weight is specified as the size of an
imaginary sample supporting the prior distribution. We do not discuss
the mathematical details of this topic here. Just keep in mind that the
larger the iss, the stronger the prior. The value of iss is typically
chosen to be small (usually between 1 and 15) to allow the prior
distribution to be easily dominated by the data.
# Write your code here
Scores can also be used to compare completely different
networks, unlike conditional independence tests.
## New dag by adding the arc E --> T
dag3 <- set.arc(dag, from = "E", to = "T")
score(dag3, data = survey, type = "bic")
We can even generate a DAG at random with
random.graph
and compare it to the previous DAGs through
its score.
Question 12: Use random.graph()
to generate a DAG
(same nodes, random arcs) and compute its score with score
function using BIC and BDeu.
Hint:
random.graph(nodes = c("...", "...", "...", "..."))
# Write your code here
As expected, the randomly generated DAG is worse than the
previous DAGs (e.g., dag
, dag3
).
Learning the DAG from survey
yields a much better
network.
There are several algorithms that tackle this problem by searching for
the DAG that maximizes a given network score. Look at the slides for
more explanation. A simple one is hill-climbing: starting from
a DAG with no arcs, it adds, removes and reverses one arc at a time and
picks the change that increases the network score the most.
Question 13: Implement the hill-climbing algorithm using
hc
function.
Hint: Use hc()
. It takes the data (survey
) as
the only argument (it defaults to the BIC score). Check the help file
(?hc
).
# Write your code here
Question 14: Now compute the scores using
score()
.
# Write your code here
Question 15: Implement the hill-climbing algorithm but now
with BDeu instead of BIC score and compare the results.
Hint: specify score = "bde"
.
# Write your code here
Check some other functions below:
## d-separation checks
dsep(dag, x = "S", y = "R")
dsep(dag, x = "O", y = "R")
## path checks
path.exists(dag, from = "S", to = "R")
## If we condition on E, that path is blocked, and S and R become independent.
dsep(dag, x = "S", y = "R", z = "E")
The ability to plot a BN effectively is a key tool in BN
inference.
bnlearn
uses the functionality implemented in the
Rgraphviz
package to plot graph structures through the
graphviz.plot
function.
## specify the DAG
dag <- empty.graph(nodes = c("A", "S", "E", "O", "R", "T"))
dag <- set.arc(dag, from = "A", to = "E")
dag <- set.arc(dag, from = "S", to = "E")
dag <- set.arc(dag, from = "E", to = "O")
dag <- set.arc(dag, from = "E", to = "R")
dag <- set.arc(dag, from = "O", to = "T")
dag <- set.arc(dag, from = "R", to = "T")
## plot the dag using graphviz.plot()
graphviz.plot(dag)
This layout is called dot
. Other layouts can be
specified with the layout
argument.
Highlighting particular nodes and arcs (path) in a DAG can be
achieved either with the highlight
argument of
graphviz.plot
or using Rgraphviz
directly. The
former is easier to use, while the latter is more versatile.
## using highlight argument in graphviz.plot()
hlight <- list(nodes = nodes(dag), arcs = arcs(dag), col = "grey",
textCol = "grey")
graphviz.plot(dag, highlight = hlight, render = T)
## save the return value (to pp) to make further changes to the plot
pp <- graphviz.plot(dag, highlight = hlight, render = FALSE)
The pp
object is an object of class graph, and it can be
manipulated with the functions provided by the graph
and
Rgraphviz
packages. The look of the arcs can be customized
by using the edgeRenderInfo
function.
Once having made modifications, we can plot the DAG again with the
renderGraph
function.
edgeRenderInfo(pp) <- list(col = c("S~E" = "black", "E~R" = "black"),
lwd = c("S~E" = 3, "E~R" = 3))
renderGraph(pp)
We can highlight nodes with nodeRenderInfo
.
nodeRenderInfo(pp) <-
list(col = c("S" = "black", "E" = "black", "R" = "black"),
textCol = c("S" = "black", "E" = "black", "R" = "black"),
fill = c("E" = "grey"))
renderGraph(pp)
Question 16: Can you highlight a v-structure with a different color?
# Write your code here.
In this section, we will see how continuous BN works. To this end, first, we assume that an expert has an opinion about the dependencies between variables. That is, the structure of the network (network topology) is known.
We are going to use the same crop data that we saw in the graphical model practical.
Suppose that we are interested in the analysis of a particular plant, which we will model in a very simplistic way by considering:
We will analyze the following six continuous variables:
G
: the initial status of the plant: its genetic
potential.E
: the environmental potential of the location and the
season it is grown in.V
: a single vegetative variable summarizing all the
information available on constituted reserves. V
is
directly influenced by the values of G
and E
.
The greater they are, the greater is the possibility of a large
vegetative mass V
.N
: the number of seeds.W
: the seeds’ mean weight.C
: the crop, depends directly on N and W.As a result, the behavior of a plant can be described by experts or
the prior knowledge as follows:
{G, E}
\(\rightarrow\)
V
, V
\(\rightarrow\) N
,
V
\(\rightarrow\)
W
, {N, W}
\(\rightarrow\) C
We need to create an R object describing the aforementioned relationships between variables. Unlike in the previous practical, we use another function.
Question 17: Use model2network
function to
define the DAG structure above.
Hint:
model2network("[node][node|parent of node:parent of node] ...")
.
Note that the order of the nodes and their corresponding parents is not
important.
# Write your code here
Then, we can investigate the independencies. First, we investigate how we can detect the marginal independencies. That is whether two variables are independent marginally.
\[P(A,B) = P(A) \cdot P(B)\]
Question 18: Use dsep
function to find marginal
independencies: between E and G, between E and N, between G and W,
between E and C, between N and W and between C and E.
Hint:
dsep(object which contains the dag structure, "node1", "node2")
.
Check the help file of dsep
(can use ?dsep
).
Note that “node1” and “node2” are the two nodes for which we would like
the marginal independencies.
# Write your code here
The next step would be investigating the conditional independencies.
\[P(A,B|C)=P(A|C) \cdot P(B|C)\]
Question 19: Again, use dsep
function to find
the conditional independencies: between C and E conditional on V,
between E and N conditional on V, and between G and W conditional on
V.
Hint:
dsep(object which contains the dag structure, "node1", "node2", "node that is conditioned on")
.
See the slides for more information.
# Write your code here
As it was mentioned before, conditional independence tests can be used when we want to test whether some specific edges should be in the DAG. To put it in another way, we use this approach when we are interested in investigating/testing the existence of some edges, which indicate the dependencies between them.
Consider the hypothesis: C
is independent from
W
given N
.
In order to investigate the hypothesis, first we need to compute the
correlation matrix for C
, W
, and
N
.
Question 20: Use read.table()
to load the
cropdata1
and check the data using dim()
and
head()
.
Hint: Note that header = TRUE
when you import the data.
## load the crop data
## get the dimension
## check the data how it looks
Question 21: Use cor()
to compute the
correlation matrix for C
, W
, and
N
.
# Write your code here
Then, we compute the partial correlation matrix using
corpcor
package.
Question 22: Use cor2pcor
function from the
corpcor
package to compute the partial correlation
matrix.
Hint: cor2pcor(correlation matrix)
.
# Load the package
library(corpcor)
# Write your code here
You can then find the partial correlation between “C” and “W” from
the partial correlation matrix.
Question 23: First, assign the row & column names the same
names as in the correlation matrix so that you can easily search for the
partial correlation between “C” and “W”. Next, find the partial
correlation between “C” and “W”.
Hint: Use
dimnames(partial corr. matrix) <- dimnames(corr. matrix)
.
# Write your code here
## check "parcor"
## find the partial correlation
We can also find the partial correlation using ci.test
from bn.learn
.
Question 24: Find the partial correlation using
ci.test
.
Hint:
ci.test("node1", "node2", "node3", test = "cor", data = data)
.
# Write your code here
Given the result above, we can say that C
has a
significant positive correlation with W
given
N
, and reject the null hypothesis of independence based on
the extremely small p-value.
The cropdata1
data set is not very large, and
therefore it is not likely to contain enough information to learn the
true structure of the DAG. In order to learn a better DAG, we can use a
bigger sample. Suppose that a sample containing \(2 \times 10^4\) observations is available
in a data frame called cropdata2
.
Question 25: Load cropdata2
using
read.table()
and check the data structure using
dim()
. Then, use iamb()
(a constraint-based
structure learning algorithm) to infer the network structure. What does
test = "cor"
show?
Hint: header = TRUE
when loading data.
iamb(data, test = "cor")
when inferring the network
structure.
## load the crop data
## get the dimension
## infer the network structure
Network scores for Gaussian Bayesian Networks (GBNs) work in a
similar way to the scores for the discrete BNs. Both scores can be
computed by calling the score
function from
bnlearn
.
Bayesian Information criterion (BIC) is used when
type = "bic-g"
.
Bayesian Gaussian equivalent score (BGe) is used when
type = "bge"
.
Question 26: Use score()
to obtain the network
scores based on BIC and BGe. And compare the results.
# Write your code here
Question 27: As we did in the previous question for discrete
BNs, can you infer the network from
cropdata2
?
Hint: Note the score that you use as input.
# Write your code here
Question 28: Can you infer the network from
cropdata1
?
# Write your code here
Question 29: Is there any difference between the inferred
networks in Question 27 and Question 28.
Hint: check nodes()
, arcs()
.
# Write your code here
Question 30: Compare the inferred DAG from Question 28 with the inferred DAG from the introductory example DAG (expert’s opinion).
# Write your code here
In this section, we will focus on igraph
(since we
learned how to use bnlearn
and Rgraphviz
previously).
The arguments provided to the graph.formula
function
identify the nodes at the tail and at the head of each arc in the
graph.
For instance, E-+V
indicates that there is an arc going
from node E to node V. The “-” sign means that E
is at the
tail of the arc, while the “+” means that V
is at the head
of the arc. Therefore, E-+V
and V+-E
identify
the same arc.
# Load the igraph package
library(igraph)
igraph.options(print.full = TRUE)
## Warning: `igraph.options()` was deprecated in igraph 2.0.0.
## ℹ Please use `igraph_options()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
dag0.igraph <- graph.formula(G-+V, E-+V, V-+N, V-+W, N-+C, W-+C)
dag0.igraph
## IGRAPH 3cacfea DN-- 6 6 --
## + attr: name (v/c)
## + edges from 3cacfea (vertex names):
## [1] G->V V->N V->W E->V N->C W->C
It is convenient to convert a bn
or bn.fit
object into an igraph
graph object.
Question 31: Use igraph.from.graphNEL()
and
as.graphNEL()
to convert your bn
object
constructed based on the expert’s opinion to igraph
object.
Hint:
igraph.from.graphNEL(as.graphNEL(your bn object))
.
# Write your code here
There are a number of functions available with igraph
.
Explore using the help file of
igraph
(?igraph
).
For example, function V()
and E()
display
the vertices and edges of a graph, which are synonymous with the nodes
and arcs when working with DAGs.
# Check the output yourself!
# V(your igraph object)
# E(your igraph object)
We can now plot our DAG in different ways just applying some of the
layout algorithms available in igraph.
Here are some examples:
## adjust the plotting layout
par(mfrow = c(2, 2), mar = rep(0.5,4), cex.main = 1)
## 1: default plot
plot(dag.igraph, main = "1: default")
## 2: Specifying the labels
dag2 <- dag.igraph
V(dag2)$label <- V(dag2)$name
plot(dag2, main = "2: with labels")
## 3: Specifying the positions (on x-axis and y-axis)
ly <- matrix(c(2, 3, 1, 1, 2, 3,
1, 4, 4, 2, 3, 2), 6)
plot(dag2, layout = ly, main = "3: positioning")
## 4: Adjusting node and edge style
colo <- c("black", "darkgrey", "darkgrey", rep(NA, 3))
lcolo <- c(rep("white", 3), rep(NA, 3))
par(mar = rep(0, 4), lwd = 1)
plot(dag2, layout = ly, frame = TRUE,
main = "\n4: final",
vertex.color = colo, vertex.label.color = lcolo,
vertex.label.cex = 1.5, vertex.size = 30,
edge.arrow.size = 0.8, edge.color = "black")
Question 32: Can you plot using your own igraph object? Try to play around with the various formatting options available in the igraph package.
# Write your code here
<Project 1> Import the yeast gene expression data from
the previous lab and infer the corresponding network using
hc()
. What is its score? Compare this directed graph with
the undirected graph inferred from the previous lab.
# Write your code here
<Project 2> Import the flow cytometry dataset (Sachs et
al., 2003) from previous lab and infer the corresponding network using
hc()
. What is its score? Compare this directed graph with
the undirected graph inferred from previous lab.
# Write your code here