# PAFway: Pairwise Associations Between Functional Annotations in Networks and Pathways

The purpose of this package is to allow the user to find pairs of annotations that are enriched in a network. For instance, the network might be a gene regulatory network, where the nodes represent genes and the edges represent regulatory interactions between pairs of genes. Each gene might have one or more functional annotation labels (such as GO terms) associated with them. The user might be interested in learning whether a specific GO term is enriched upstream of a second GO term. This function works with directed networks, with and without edge weights. The results can be depicted as either a network or a heatmap. More complete information is available here: http://eprints.whiterose.ac.uk/154294/

## Statistics explanation:

### No edge weights

When no edge weights are considered, enrichment between pairs of genes is determined by the binomial test.

### Edge weights

When a gene network contains edge weights, we calculate the sum of the edge weights of each edge type, and we would like to know whether this value is higher than would be expected by chance. For two functional annotations $$a$$ and $$b$$, let us define $$z_{a,b}$$ as the sum of the edge weights of edge type $$(a,b)$$ in the network. Let us say that $$c_{a,b}$$ is the count of the number of edges of that type. $$P(c_{a,b}=i)$$ is the probability of observing exactly $$i$$ edges of type $$(a,b)$$ and $$P(x \geq z_{a,b} | c_{a,b}=i)$$ is the probability of observing a sum of edge weights greater than $$z_{a.b}$$ given that $$c_{a,b}=i$$. The probability of observing at least $$z_{a,b}$$ is: % a weighted average of the probability of observing at least $$z_{a,b}$$ if there were exactly $$c_{a,b}$$ edges of edge type $$(a,b)$$ for all $$c_{a,b}$$ from $$1$$ to $$N$$ (where $$N$$ is the number of edges of the network). $\begin{equation} \label{eq:summation} P(x \geq z_{a,b})=\sum_{i=1}^N P(c_{a,b}=i) P(x \geq z_{a,b} | c_{a,b}=i) \end{equation}$

where $$N$$ is the number of edges in the network. Note that $$P(x \geq z_{a,b})$$ is the p-value.

## Example with a random network

First, let us construct a random network with 300 nodes, 1000 edges and 14 GO terms.

library(PAFway)

set.seed(123)
#Make 300 nodes
nodes=paste("node", c(1:300))
#First 3 node names:
print(nodes[1:3])
#>  "node 1" "node 2" "node 3"

#Assign them random GO terms
randomGO=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N")[sample(c(1:14), 300, replace=T)]
names(randomGO)=nodes
#First 3 nodes and associated GO terms:
print(randomGO[1:3])
#> node 1 node 2 node 3
#>    "C"    "N"    "C"

#Make 1000 edges
edgesRandom=t(sapply(c(1:1000), function(i){
nodes[sample(300, 2)]
}))
#First 3 edges:
print(edgesRandom[1:3,])
#>      [,1]       [,2]
#> [1,] "node 55"  "node 217"
#> [2,] "node 85"  "node 45"
#> [3,] "node 146" "node 170"

We can also consider a random network, where each gene can have more than one functional annotation:

#Assign each node a second GO term, separated by a '_' symbol
randomGO2=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N")[sample(c(1:14), 300, replace=T)]
randomGO_multiple=sapply(c(1:300), function(i){paste(randomGO[i], randomGO2[i], sep="_")})
names(randomGO_multiple)=nodes

#print first 5 elements, as a demo
print(randomGO_multiple[1:5])
#> node 1 node 2 node 3 node 4 node 5
#>  "C_B"  "N_H"  "C_J"  "J_A"  "B_J"

We can also select a sub-set of GO terms which we consider ‘interesting’:

#Interesting GO terms:
GO_interesting=c("B", "C", "D", "F")

PAFway works with and without edge weights. Here we generate random edge weights:

random_edge_weights=rnorm(length(edgesRandom[,1]), 1, 0.001)
print(random_edge_weights[1:5])
#>  0.9999511 0.9992959 1.0006808 1.0001300 1.0011097
print(length(random_edge_weights))
#>  1000

Now that we’ve set up some networks, we can see whether there are any enriched pairwise relationships between edges.

## Enrichment of pairwise associations without edge weights

First, we will use the main pafway function to find p-values for each function annotation pair. We will start off with the simplest example, which doesn’t use any optional parameters:

#This will run pafway, with no edge weights, for all the GO terms
a=pafway(randomGO, edgesRandom, unique(randomGO))
print(a[1:5, 1:5])
#>           C         N         J         B         F
#> C 0.6228151 0.8682946 0.6396453 0.3951836 0.6313025
#> N 0.7229578 0.2242200 0.7316659 0.2681023 0.7929657
#> J 0.6396453 0.4602521 0.8240874 0.8985991 0.7497069
#> B 0.6617244 0.6734212 0.1940617 0.2165586 0.1958142
#> F 0.2705119 0.4964210 0.7497069 0.9536428 0.9809709

This can be displayed as either a heatmap or a network:

draw_network(a)
#> Registered S3 method overwritten by 'GGally':
#>   method from
#>   +.gg   ggplot2 draw_heatmap(a)
#> Warning in plot.window(...): "fig.height" is not a graphical parameter
#> Warning in plot.window(...): "fig.width" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.height" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.width" is not a graphical parameter
#> Warning in title(...): "fig.height" is not a graphical parameter
#> Warning in title(...): "fig.width" is not a graphical parameter However, in this example, we do not correct for multiple hypothesis testing, so we get quite a few false positives, even though we know that we have started with a random network. Let us re-draw these, after adjustment for multiple hypotheses. We now see that no edges have a p-value<0.05.

draw_network(a, adjMethod = "bonferroni")
#>  "no edges are significant"
#> Warning in plot.window(...): "fig.height" is not a graphical parameter
#> Warning in plot.window(...): "fig.width" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.height" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.width" is not a graphical parameter
#> Warning in title(...): "fig.height" is not a graphical parameter
#> Warning in title(...): "fig.width" is not a graphical parameter ## Enrichment of pairwise associations with edge weights

Next, we’ll repeat the previous analysis, but with edge weights. We will start off with the simplest example, which doesn’t use any optional parameters:

#This will run pafway, with no edge weights, for all the GO terms
b=pafway_edge_weights(randomGO, cbind(edgesRandom, random_edge_weights), unique(randomGO))
print(b[1:5, 1:5])
#>           C         N         J         B         F
#> C 0.7650604 0.9190635 0.7208724 0.6295337 0.7313706
#> N 0.8016517 0.2812259 0.7861320 0.3916431 0.8460953
#> J 0.7208724 0.5286691 0.8637871 0.9588849 0.8056119
#> B 0.8802256 0.7966224 0.2890411 0.5452866 0.3266573
#> F 0.3658062 0.5742984 0.8056119 1.0000000 0.9917003

This can be displayed as either a heatmap or a network:

draw_network(b) draw_heatmap(b)
#> Warning in plot.window(...): "fig.height" is not a graphical parameter
#> Warning in plot.window(...): "fig.width" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.height" is not a graphical parameter
#> Warning in plot.xy(xy, type, ...): "fig.width" is not a graphical parameter
#> Warning in title(...): "fig.height" is not a graphical parameter
#> Warning in title(...): "fig.width" is not a graphical parameter However, in this example, we do not correct for multiple hypothesis testing, so we get quite a few false positives, even though we know that we have started with a random network. Let us re-draw these, after adjustment for multiple hypotheses. We now see that no edges have a p-value<0.05.

draw_network(b, adjMethod = "bonferroni")
#>  "no edges are significant"
#>  "nothing is even close to being significant"

## More complex scenarios:

### More than one GO annotation per gene

If you use a larger database of GO terms, there may be more than one annotation per gene. These can be appended together, separated by a "_" symbol:

#Multiple GO terms are in: randomGO_multiple
b=pafway(randomGO_multiple, edgesRandom, unique(randomGO), exact=F)
print(b[1:5], b[1:5])
#>  1 1 1 1 0

### Only a subset of GO annotations are interesting

It is usually recommended that you only perform this analysis on GO terms that are of particular interest to you, because otherwise you will lose a lot of statistical power by comparing pairs of GO terms that you don’t care about.

#GO terms of interest are in GO_interesting
b=pafway(randomGO, edgesRandom, GO_interesting)
print(b[1:5], b[1:5])
#>  0 0 1 1 1

### Code is running too slowly (on a really big network)

If you are running pafway (specifically with edge weights) on a really big network, it can take a while. It might be worth decreasing accuracy and increasing speed. One way to do this is to change the threshold at which a value in a pdf is rounded to zero. Another way is to change the step size.

#This will run pafway, with no edge weights, for all the GO terms
b=pafway_edge_weights(randomGO, cbind(edgesRandom, random_edge_weights), unique(randomGO), step=0.001, thresholdZero=0.0001)
print(b[1:5, 1:5])
#>           C         N         J         B         F
#> C 0.7650604 0.9190635 0.7208724 0.6295337 0.7313706
#> N 0.8016517 0.2812259 0.7861320 0.3916431 0.8460953
#> J 0.7208724 0.5286691 0.8637871 0.9588849 0.8056119
#> B 0.8802256 0.7966224 0.2890411 0.5452866 0.3266573
#> F 0.3658062 0.5742984 0.8056119 1.0000000 0.9917003

### Explanation of pafway_meta

Genes may have multiple functional annotations associated with them. For instance, it is often the case that genes that have one GO term will also have a different specific GO term, because GO terms have hierarchical arrangements (e.g. all genes involved in blue light sensing’ are also ‘light sensing’).

This does not directly impact the calculation of p-values in PAFway, but it would potentially impact how we interpret the results from a biological perspective. The aim of PAFway is to evaluate whether an edge between two functional annotations occurs more frequently than would be expected by a null model in which edges are randomly distributed in the network. In this scenario, it does not matter how much overlap there is between the functional annotations. Intuitively this makes sense: Let’s say you randomly select two words from a book and you want to calculate the probability that the first word you select contains the letter a’ and the second word you select contains the letterb’. The expected probability of this happening is $$p_a * p_b$$, and it does not depend on whether the letters a’ andb’ tend to appear together in the same word. This means that the calculation in PAFway is correct, as is.

However, this also means that the frequencies of pairs of edges are correlated with one another. For instance, if GO term a' is enriched upstream ofb’, and many genes that have GO term a' are also labelled with GO termc’, then there may also be an enrichment for edges between genes containing the GO terms c' andb’. If there is an overlap between GO terms b' andd’, then there may also be an enrichment for edges between c' andd’. Biologists may be less interested in an enrichment between GO terms c' andd’ if this enrichment is completely explained by the enrichment between GO terms a' andb’.

For this reason, we may want to know whether we observe more edges between genes containing the GO terms c' andd’, given (i) the co-occurrence of functional annotations a' andc’ (ii) the co-occurrence of functional annotations b' andd’ (iii) the frequency of edges containing a' and`b’. In particular, let $$f_{i,j}$$ be the number of times we observe an edge between genes with functional annotations $$i$$ and $$j$$. Let $$p(j|i)$$ be the probability of a gene having the functional annotation $$j$$ given that it has the functional annotation $$i$$. We use the notation $$!i$$ to signify i. We can calculate the expected value of $$f_{c,d}$$ using the following formula:

\begin{align*} \mathbb{E}(f_{c,d})=& f_{a,b}p(c|a)p(d|b) + \\ & f_{!a,b}p(c|!a)p(d|b) + \\ & f_{a,!b}p(c|a)p(d|!b) + \\ & f_{!a,!b}p(c|!a)p(d|!b) \end{align*}

Then, we can compare the observed count with the expected count: $$f_{c,d}/\mathbb{E}(f_{c,d})$$.