Gene Expression Analysis with myTAI

Introduction

In the Introduction vignette we introduced and discussed how phylotranscriptomics can be applied to capture evolutionary signals in (developmental) transcriptomes. Furthermore, in the Enrichment Analyses vignette we provide a use case to correlate specific groups or sets of genes with their predicted evolutionary origin. Here, we aim to combine previously introduced techniques with classic gene expression analyses to detect possible functional causes for the observed transcriptome conservation.

In other words, phylotranscriptomics allows us to detect stages or periods of evolutionary conservation and is able to predict the evolutionary origin of process or trait specific genes based on enrichment analyses. By combining evolutionary enrichment analyses with the functional annotation of process or trait specific genes (see Functional Annotation for details) the detection of evolutionary signals can be correlated with functional processes. Then, performing gene expression analyses on corresponding process or trait specific genes allows users to detect potential causes of stage/period specific evolutionary transcriptome conservation.

The following sections introduce main gene expression data analysis techniques implemented in myTAI:

• Detection of Differentially Expressed Genes (DEGs)
• Fold-Change
• Welch t-test
• Wilcoxon Rank Sum Test (Mann-Whitney U test)
• Negative Binomial (Exact Tests)

• Collapsing Replicate Samples

• Filter for Expressed Genes

• Compute the Statistical Significance of Each Replicate Combination

Detection of Differenentially Expressed Genes (DEGs)

A variety of methods have been published to detect differentially expressed genes. Some methods are based on non-statistical quantification of expression differences (e.g. fold-change and log-fold-change), but most methods are based on statistical tests to quantify the significance of differences in gene expression between samples. These statistical methods can furthermore be divided into two methodological categories: parametric tests and non-parametric tests. The DiffGenes() function available in myTAI implements the most popular and useful methods to detect differentially expressed genes. In the literature, different methods have been introduced and discussed for microarray technologies versus RNA-Seq technologies.

In this section we will introduce all methods implemented in DiffGenes() using small examples and will furthermore, discuss published advantages and disadvantages of each method and each mRNA quantification technology.

Note that when using DiffGenes() it is assumed that your input dataset has been normalized before passing it to DiffGenes(). For RNA-Seq data DiffGenes() assumes that the libraries have been normalized to have the same size, i.e., to have the same expected column sum under the null hypothesis (or the lib.size argument in DiffGenes() is specified accordingly).

Fold-Changes

A fold change in gene expression is simply the ratio of the gene expression level of one sample against a second sample: $$\frac{e_{i1}}{e_{i2}}$$, where $$e_{i1}$$ is the expression level of gene $$i$$ in sample one and $$e_{i2}$$ is the expression level of gene $$i$$ in sample two. In case replicate expression levels are present for each sample the ratio of means of the corresponding replicates is computed: $$\frac{\bar{e}_{i1}}{\bar{e}_{i2}}$$, where $$\bar{e}_{i1}$$ is the mean of replicate expression levels of gene $$i$$ in sample one and $$\bar{e}_{i2}$$ is the mean of replicate expression levels of gene $$i$$ in sample two.

• Advantages: Given a small number of replicate values the statistical evaluation of differentially expressed genes might be biased (depending on the statistical test chosen) by underlying sample distributions which are not fulfilled or because a small number of replicate values is not sufficient enough to perform non-parametric tests. Here, fold-changes provide a simple way to quantify gene expression differences between samples by $$n$$-fold enrichment. In our opinion, although the process of choosing a threshold for defining genes as being differentially expressed or not based on fold-change values is purely subjective and relies on common sense, in some cases this procedure will be more suitable than defining differentially expressed genes based on p-values obtained from a test statistic with violated test assumptions.

• Disadvantages: If used appropriately, statistical tests not only systematically quantify the significance of the observed gene-by-gene differences of expression, but furthermore, accounts the variance of replicate expression levels when comparing the mean difference of replicate expression levels between samples. Hence, the gene specific variance between replicates is also quantified by the p-value returned by the sufficient test statistic which is not quantified by a simple fold-change measure.

Example: Fold-Change

For the following example we assume that PhyloExpressionSetExample[1:5,1:8] stores 5 genes and 3 developmental stages with 2 replicate expression levels per stage.

data("PhyloExpressionSetExample")

# Detection of DEGs using the fold-change measure
DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep          = 2,
method        = "foldchange",
stage.names   = c("S1","S2","S3"))

head(DEGs)
   Phylostratum      GeneID    S1->S2    S1->S3    S2->S1    S2->S3    S3->S1    S3->S2
1            1 at1g01040.2 1.6713881 2.0806706 0.5983051 1.2448758 0.4806143 0.8032930
2            1 at1g01050.1 1.0273222 1.2709185 0.9734045 1.2371177 0.7868325 0.8083305
3            1 at1g01070.1 1.3087379 1.4044799 0.7640949 1.0731560 0.7120073 0.9318310
4            1 at1g01080.2 0.7779572 0.7286769 1.2854177 0.9366542 1.3723503 1.0676299
5            1 at1g01090.1 0.3803866 0.2288961 2.6289042 0.6017460 4.3687939 1.6618307

The resulting output shows all combinations of fold-changes between samples (developmental stages). Here, S1->S2 denotes that the fold-change was computed for expression levels of stage S1 against stage S2.

Example: Log-Fold-Change

When selecting method = "log-foldchange" it is assumed that the input ExpressionSet stores log2 expression levels. Here, we transform absolute expression levels stored in PhyloExpressionSetExample to log2 expression levels using the tf() function before log-fold-changes are computed.

data("PhyloExpressionSetExample")

# Detection of DEGs using the logfold-change measure
log.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep          = 2,
method        = "log-foldchange",
stage.names   = c("S1","S2","S3"))

head(log.DEGs)
  Phylostratum      GeneID      S1->S2     S1->S3      S2->S1      S2->S3     S3->S1      S3->S2
1            1 at1g01040.2  0.74104679  1.0570486 -0.74104679  0.31600182 -1.0570486 -0.31600182
2            1 at1g01050.1  0.03888868  0.3458715 -0.03888868  0.30698280 -0.3458715 -0.30698280
3            1 at1g01070.1  0.38817621  0.4900360 -0.38817621  0.10185975 -0.4900360 -0.10185975
4            1 at1g01080.2 -0.36223724 -0.4566488  0.36223724 -0.09441158  0.4566488  0.09441158
5            1 at1g01090.1 -1.39446159 -2.1272350  1.39446159 -0.73277345  2.1272350  0.73277345

The resulting output stores all combinations of log fold-changes between samples (developmental stages).

Welch t-test

The Welch t-test is a parametric test to statistically quantify the difference of sample means in cases where the assumption of homogeneity of variance (equal variances in the two populations) is violated (Boslaugh, 2013). The Welch t-test is a sufficient parameter test for small sample sizes and thus, has been used to detect differentially expressed genes based on p-values returned by the test statistic (Hahne et al., 2008).

In detail, the test statistic is computed as follows:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where $$\bar{x}_1$$ and $$\bar{x}_2$$ are sample means, $$s_1^2$$ and $$s_2^2$$ are the sample variances, and $$n_1$$ and $$n_2$$ are the sample sizes.

The degrees of freedom for Welch’s t-test are then computed as follows:

$$df = \frac{\big(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\big)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}$$

To perform a sufficient Welch t-test the following assumptions about the input data need to be fulfilled to test whether two samples come from populations with equal means:

• independent samples
• continuous data
• (approximate) normality

Nevertheless, although in most cases log2 expression levels are used to perform the Welch t-test assuming that expression levels are log-normal distributed which approximates a normal distribution in infinity, in most cases the small number of replicates is not sufficient enough to fulfill the (approximate) normality assumption of the Welch t-test.

Due to this fact, non-parametric, sampling based, or generalized linear model based methods have been proposed to quantify p-values of differential expression. Nevertheless, the DiffGenes() function implements the Welch t-test for the detection of differentially expressed genes, allowing users to compare the results with more recent DEG detection methods/methodologies also implemented in DiffGenes().

• DEG detection based on statistical quantification
• Parametric test resulting in a strong test statistic
• Can handle small sample sizes
• Test assumptions must be fulfilled to return sufficient p-values
• Can hardly assure normality with very sample sizes of $$n = 3,4,5,..$$ (replicates)
• Pairwise comparisons between different stages or experiments

Example: Welch t-test

Performing Welch t-test with DiffGenes() can be done by specifying method = "t.test". Internally DiffGenes() performs a two sided Welch t-test. This means that the Welch t-test quantifies only whether or not a gene is significantly differentially expressed, but not the direction of enrichment (over-expressed or under-expressed).

The PhyloExpressionSetExample we use in the following example stores absolute expression levels. In case your ExpressionSet also stores absolute expression levels (which is likely due to the ExpressionSet standard for Phylotranscriptomics analyses), you can use the tf() function implemented in myTAI to transform absolute expression levels to log2 expression levels before performing DiffGenes() with a Welch t-test, e.g. tf(PhyloExpressionSetExample[1:5,1:8],log2). In general, using log2 transformed expression levels as input ExpressionSet of DiffGenes() allows us to (at least) assume that samples (replicate expression levels) used to perform the Welch t-test are log-normal distributed and therefore, somewhat approximate normal distributed.

Please notice however, that RNA-Seq data can include count values of 0. So when transforming absolute counts to log2 counts infinity values of log2(0) = -Inf will be produced and therefore, p-value computations will not be possible. To avoid this case you could either remove RNA-Seq count values of 0 from the input dataset using the Expressed() function (see section Filter for Expressed Genes), e.g. pass tf(Expressed(PhyloExpressionSetExample[1:5,1:8], cut.off = 1),log2) as ExpressionSet argument to DiffGenes() or shift all count values by a constant value, e.g. pass tf(PhyloExpressionSetExample[1:5,1:8], function(x) log2(x + 1)) as ExpressionSet argument to DiffGenes().

Internally, DiffGenes() will also check for 0 values in input data and will automatically shift all expression levels by +1 in case 0 values are included.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
ttest.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep          = 2,
method        = "t.test",
stage.names   = c("S1","S2","S3"))

# look at the results
ttest.DEGs
  Phylostratum      GeneID     S1<->S2    S1<->S3    S2<->S3
1            1 at1g01040.2 0.027832572 0.04020203 0.13481563
2            1 at1g01050.1 0.852379466 0.31471871 0.36326955
3            1 at1g01070.1 0.003200692 0.00113536 0.02236621
4            1 at1g01080.2 0.086426813 0.03092924 0.45999438
5            1 at1g01090.1 0.090387087 0.04638872 0.04978092

The resulting data.frame stores the p-values of stage-wise comparisons for each gene. To adjust p-values for multiple testing of stage-wise comparisons you can specify the p.adjust.method argument with one of the p-value adjustment methods implemented in DiffGenes().

In detail, correcting for multiple testing allows to appropriately choose selection cut-offs for p-values fulfilling the differential expression criteria. Hahne et al., 2008 (p. 87) give a nice example of correcting for multiple testing to determine appropriate selection cut-offs.

Please consult the documentation of ?p.adjust to see which p-value adjustment methods are implemented in DiffGenes().

Please also consult these reviews (Biostatistics Handbook, Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
nrep            = 2,
method          = "t.test",
stage.names     = c("S1","S2","S3"))

ttest.DEGs.p_adjust
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.06958143 0.0579859 0.2246927
2            1 at1g01050.1 0.85237947 0.3147187 0.4540869
3            1 at1g01070.1 0.01600346 0.0056768 0.1118311
4            1 at1g01080.2 0.11298386 0.0579859 0.4599944
5            1 at1g01090.1 0.11298386 0.0579859 0.1244523

The resulting p-value adjusted data.frame can be used to filter for differentially expressed genes. Here, specifying the arguments: comparison, alpha, and filter.method in DiffGenes() allows users to obtain only significant differentially expressed genes.

# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
ttest.DEGs.p_adjust.filtered <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:10 ,1:8],log2),
nrep            = 2,
method          = "t.test",
stage.names     = c("S1","S2","S3"),
comparison      = "above",
alpha           = 0.05,
filter.method   = "n-set",
n               = 1)

# look at the genes fulfilling the filter criteria
ttest.DEGs.p_adjust.filtered
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
3            1 at1g01070.1 0.03200692 0.0113536 0.2192432

In this example, only 1 out of 10 genes fulfills the p-value criteria (alpha = 0.05) in at least one stage comparison.

Rank top p-values

Finally, users can rank genes in increasing p-value order for each stage comparison by typing:

ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:500,1:8],log2),
nrep            = 2,
method          = "t.test",
stage.names     = c("S1","S2","S3"))

head(ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3])
    Phylostratum      GeneID  S1<->S2
54             1 at1g02400.1 0.151388
119            1 at1g03870.1 0.151388
137            1 at1g04380.1 0.151388
289            1 at1g08110.4 0.151388
383            1 at1g10360.1 0.151388
413            1 at1g11040.1 0.151388

Here the line ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3] will sort p-values of stage comparison "S1<->S2" in increasing order.

Wilcoxon-Mann-Whitney test (Mann-Whitney U test)

The Wilcoxon-Mann-Whitney test is a nonparametric test to quantify the shift in empirical distribution parameters. Nonparametric tests are useful when sample populations do not meet the test assumptions of parametric tests.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
Wilcox.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep          = 2,
method        = "wilcox.test",
stage.names   = c("S1","S2","S3"))

# look at the results
Wilcox.DEGs
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.3333333 0.3333333 0.3333333
2            1 at1g01050.1 1.0000000 0.3333333 0.3333333
3            1 at1g01070.1 0.3333333 0.3333333 0.3333333
4            1 at1g01080.2 0.3333333 0.3333333 0.6666667
5            1 at1g01090.1 0.3333333 0.3333333 0.3333333

Again, users can adjust p-values by specifying the p.adjust.method argument.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
nrep            = 2,
method          = "wilcox.test",
stage.names     = c("S1","S2","S3"),

# look at the results
Wilcox.DEGs.adj
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.4166667 0.3333333 0.4166667
2            1 at1g01050.1 1.0000000 0.3333333 0.4166667
3            1 at1g01070.1 0.4166667 0.3333333 0.4166667
4            1 at1g01080.2 0.4166667 0.3333333 0.6666667
5            1 at1g01090.1 0.4166667 0.3333333 0.4166667

Negative Binomial (Exact Tests)

Exact Tests for Differences between two groups of negative-binomial counts implemented in DiffGenes() are based on the edgeR function exactTest(). Please consult the edgeR Users Guide for mathematical details.

Install edgeR Package

The detection of DEGs using negative binomial models is based on the powerful implementations provided by the edgeR package. Hence, before using the negative binomial models in DiffGenes() users need to install the edgeR package.

# install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

Double Tail Method

This method computes two-sided p-values by doubling the smaller tail probability (see ?exactTestByDeviance for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes() argument method = "doubletail", the number of replicates per stage nrep, and lib.size quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes).

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
DoubleTail.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep          = 2,
method        = "doubletail",
lib.size      = 1000,
stage.names   = c("S1","S2","S3"))

# look at the results
DoubleTail.DEGs
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537

Again, users can adjust p-values by specifying the p.adjust.method argument.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
nrep            = 2,
method          = "doubletail",
lib.size        = 1000,
stage.names     = c("S1","S2","S3"),

# look at the results
DoubleTail.DEGs.adj
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005

Small-P Method

This method performs the method of small probabilities as proposed by Robinson and Smyth (2008) (see exactTestBySmallP for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes() argument method = "doubletail", the number of replicates per stage nrep, and lib.size quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes).

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
SmallP.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep          = 2,
method        = "smallp",
lib.size      = 1000,
stage.names   = c("S1","S2","S3"))

# look at the results
SmallP.DEGs
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537

Again, users can adjust p-values by specifying the p.adjust.method argument.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
nrep            = 2,
method          = "smallp",
lib.size        = 1000,
stage.names     = c("S1","S2","S3"),

# look at the results
SmallP.DEGs.adj
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005

Deviance Method

This method uses the deviance goodness of fit statistics to define the rejection region, and is therefore equivalent to a conditional likelihood ratio test (see exactTestByDeviance for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes() argument method = "doubletail", the number of replicates per stage nrep, and lib.size quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes).

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance
Deviance.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep          = 2,
method        = "deviance",
lib.size      = 1000,
stage.names   = c("S1","S2","S3"))

# look at the results
Deviance.DEGs
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537

Again, users can adjust p-values by specifying the p.adjust.method argument.

data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
nrep            = 2,
method          = "deviance",
lib.size        = 1000,
stage.names     = c("S1","S2","S3"),

# look at the results
Deviance.DEGs.adj
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005

Replicate Quality Check

Users can also perform replicate quality checks to quantify the variability between replicate expression levels fo each stage separately.

The PlotReplicateQuality() is designed to perform customized replicate variablity checks for any ExpressionSet object storing replicates.

data(PhyloExpressionSetExample)

# visualize the sd() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep          = 2,
legend.pos   = "topright",
ylim          = c(0,0.2),
lwd           = 6)

The resulting plot visualizes the kernel density estimates for the variance (log variance) between replicates. Each curve represents the density function for the replicate variation within one stage or experiment. In this case the variance between replicates of Stage 1 to Stage 3 (each including 2 replicates) seem to deviate from each other allowing the conclusion that each stage has a different expression level variability between replicates.

The FUN argument implemented in PlotReplicateQuality() allows users to furthermore, specify customized criteria quantifying replicate varibility. Please notice that the function specified in FUN will be performed separately on each gene and stage.

In the following example the median bbsolute deviation function mad() is used to quantify replicate variability.

data(PhyloExpressionSetExample)

# visualize the mad() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep          = 2,
legend.pos    = "topright",
ylim          = c(0,0.015),
lwd           = 6)

In general, users are not limited to specific functions implememnted in R. By writing customized functions such as FUN = function(x) return((x - mean(x))^2) users can define their own criteria to quantify replicate variability and can then apply this criteria to PlotReplicateQuality() by specifying the FUN argument.

Collapsing Replicate Samples

After performing differential gene expression analyses, replicate expression levels are collapsed to a single stage specific expression level. For this purpose, myTAI implements the CollapseReplicates() function, allowing users to combine replicate expression levels stored in a standard PhyloExpressionSet or DivergenceExpressionSet object to a stage specific expression level using a specified window function.

library(myTAI)

data(PhyloExpressionSetExample)

# genrate an example PhyloExpressionSet with replicates
ExampleReplicateExpressionSet <- PhyloExpressionSetExample[ ,1:8]

# rename stages
names(ExampleReplicateExpressionSet)[3:8] <- c("Stage_1_Repl_1","Stage_1_Repl_2",
"Stage_2_Repl_1","Stage_2_Repl_2",
"Stage_3_Repl_1","Stage_3_Repl_2")
# have a look at the example dataset
head(ExampleReplicateExpressionSet, 5)
  Phylostratum      GeneID Stage_1_Repl_1 Stage_1_Repl_2 Stage_2_Repl_1 Stage_2_Repl_2 Stage_3_Repl_1 Stage_3_Repl_2
1            1 at1g01040.2       2173.635      1911.2001       1152.555      1291.4224       1000.253       962.9772
2            1 at1g01050.1       1501.014      1817.3086       1665.309      1564.7612       1496.321      1114.6435
3            1 at1g01070.1       1212.793      1233.0023        939.200       929.6195        864.218       877.2060
4            1 at1g01080.2       1016.920       936.3837       1181.338      1329.4734       1392.643      1287.9746
5            1 at1g01090.1      11424.567     16778.1685      34366.649     39775.6405      56231.569     66980.3673

Now, assume that this example PhyloExpressionSet stores three developmental stages and 2 biological replicates for each developmental stage. Of course, we could now compute and visualize the TAI profile by typing:

# visualize the TAI profile over 3 stages of development
# and 2 replicates per stage
PlotPattern(ExpressionSet = ExampleReplicateExpressionSet,
type          = "l",
lwd           = 6)

Usually, one would expect that variations in replicate values are smaller than variations between developmental stages. In this example however, we constructed replicate values that vary larger than expression levels between developmental stages. For many applications it might be useful to visualize TAI/TDI values of replicates as well, but normally replicate values are collapsed to one gene and stage specific value after differential gene expression analyses and replicate quality control have been performed.

The following example illustrates how to collapse replicates with CollapseReplicates():

# combine the expression levels of the 2 replicates (const) per stage
# using geom.mean as window function and rename new stages to: "S1","S2","S3"
CollapssedPhyloExpressionSet <- CollapseReplicates(
ExpressionSet = ExampleReplicateExpressionSet,
nrep          = 2,
FUN           = geom.mean,
stage.names   = c("S1","S2","S3"))

# have a look at the collpased PhyloExpressionSet
head(CollapssedPhyloExpressionSet)
   Phylostratum      GeneID         S1         S2         S3
1            1 at1g01040.2  2038.1982  1220.0147   981.4381
2            1 at1g01050.1  1651.6070  1614.2524  1291.4582
3            1 at1g01070.1  1222.8557   934.3975   870.6878
4            1 at1g01080.2   975.8215  1253.2189  1339.2866
5            1 at1g01090.1 13844.9740 36972.3612 61371.0937
6            1 at1g01120.1   815.3288   894.8987   905.8272

The nrep argument specifies either a constant number of replicates per stage or a numeric vector storing variable numbers of replicates for each developmental stage. In our example, each developmental stage had a constant (equal) number of replicates per developmental stage (nrep = 2). In case a variable stage specific number of replicates is present, one could specify nrep = c(2,3,2) defining the case that developmental stage 1 stores 2 replicates, stage 2 stores 3 replicates, and stage 3 again, stores 2 replicates.

The argument FUN specifies the window function to collapse replicate expression levels to a single stage specific value. In this example, we chose the geom.mean() (geometric mean) function implemented in myTAI, because our example PhyloExpressionSet stores absolute expression levels. Notice that the mathematical equivalent of performing arithmetic mean (mean()) computations on log expression levels is to perform the geometric mean (geom.mean()) on absolute expression levels.

The stage.names argument then specifies the new names of collapsed stages.

Filter for Expressed Genes

After differential gene expression analyses and replicate aggregation have been performed, some studies filter gene expression levels in RNA-Seq count tables or microarray expression matrices for non-expressed or outlier genes. For example, in most studies performing RNA-Seq experiments FPKM/RPKM values < 1 are remove from the processed (final) count table.

For this purpose myTAI implements the Expressed() function to filter (remove) expression levels in RNA-Seq count tables or microarray expression matrices which do not pass a defined expression threshold.

The Expressed() function takes a standard PhyloExpressionSet or DivergenceExpressionSet object storing a RNA-Seq count table (CT) or microarray gene expression matrix and removes genes from this count table or gene expression matrix that have an expression level below a defined cut.off value.

Expressed() allows users to choose from several gene extraction methods (see ?Expressed for details):

• const: all genes that have at least one stage that undercuts or exceeds the expression cut.off will be excluded from the ExpressionSet. Hence, for a 7 stage ExpressionSet genes passing the expression level cut.off in 6 stages will be retained in the ExpressionSet.

• min-set: genes passing the expression level cut.off in ceiling(n/2) stages will be retained in the ExpressionSet, where n is the number of stages in the ExpressionSet.

• n-set: genes passing the expression level cut.off in n stages will be retained in the ExpressionSet. Here, the argument n is defining the number of stages for which the threshold criteria should be fulfilled.

# check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000
# in at least one developmental stage
FilterConst <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off       = 8000,
comparison    = "below",
method        = "const")

nrow(FilterConst) # check number of retained genes
#> [1] 449

Users will observe that only 449 out of 25260 genes in PhyloExpressionSetExample have an absolute expression level above 8000 when omitting genes using method = 'const'. The argument comparison specifies whether genes having expression levels below, above, or below AND above (both) the cut.off value should be removed from the dataset.

The following comparison methods can be selected:

• comparison = "below": define genes as not expressed which undercut the cut-off threshold.
• comparison = "above": define genes as outliers which exceed the cut-off threshold.
• comparison = "both": remove genes fulfilling the comparison = "below" AND comparison = "above" criteria.
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level above 12000
# in at least one developmental stage (outlier removal)
FilterConst.above <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off       = 12000,
comparison    = "above",
method        = "const")

nrow(FilterConst.above) # check number of retained genes
#> [1] 23547

For this example 25260 - 23547 = 1713 have been classified as outliers (expression levels above 12000) and were removed from the dataset.

# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000 AND above 12000
# in at least one developmental stage (non-expressed genes AND outlier removal)
FilterConst.both <-  Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off       = c(8000,12000),
comparison    = "both",
method        = "const")

nrow(FilterConst.both) # check number of retained genes
#> [1] 2

When selecting comparison = 'both', the cut.off argument receives 2 threshold values: the below cut.off as first element and the above cut.off as second element. In this case cut.off = c(8000,12000). Here, only 2 genes fulfill these criteria.

Analogously, users can specify the number of stages that should fulfill the threshold criteria using the n-set method.

# remove genes that have an expression level below 8000
# in at least 5 developmental stages (in this case: n = 2 stages fulfilling the criteria)
FilterNSet <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off       = 8000,
method        = "n-set",
comparison    = "below",
n             = 2)

nrow(FilterMinSet) # check number of retained genes
#> [1] 20

Here, 20 genes are fulfilling these criteria.

Compute the Statistical Significance of Each Replicate Combination

In some cases (high variability of replicates) it might be useful to verify that there is no sequence of replicates (for all possible combination of replicates) that results in a non-significant TAI or TDI pattern, when the initial pattern with combined replicates was shown to be significant.

The CombinatorialSignificance() function implemented in myTAI allows users to compute the p-values quantifying the statistical significance of the underlying pattern for all combinations of replicates.

A small Example:

Assume a PhyloExpressionSet stores 3 developmental stages with 3 replicates measured for each stage. The 9 replicates in total are denoted as: $$1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3$$. Now the function computes the statistical significance of each pattern derived by the corresponding combination of replicates, e.g.

• 1.1, 2.1, 3.1 : p-value for combination 1

• 1.1, 2.2, 3.1 : p-value for combination 2

• 1.1, 2.3, 3.1 : p-value for combination 3

• 1.2, 2.1, 3.1 : p-value for combination 4

• 1.2, 2.1, 3.1 : p-value for combination 5

• 1.2, 2.1, 3.1 : p-value for combination 6

• 1.3, 2.1, 3.1 : p-value for combination 7

• 1.3, 2.2, 3.1 : p-value for combination 8

• 1.3, 2.3, 3.1 : p-value for combination 9

This procedure yields 27 p-values for the $$3^3$$ ($$n^m$$) replicate combinations, where $$n$$ denotes the number of developmental stages and $$m$$ denotes the number of replicates per stage.

Note that in case users have a large amount of stages/experiments and a large amount of replicates the computation time will increase by $$n^m$$. For 11 stages and 4 replicates, $$4^{11}$$ = 4194304 p-values have to be computed. Each p-value computation itself is based on a permutation test running with $$1,000, 10,000, ...$$ or more permutations. Be aware that this might take some time.

The p-value vector returned by this function can then be used to plot the p-values to see whether an critical value $$\alpha$$ is exceeded or not (e.g. $$\alpha = 0.05$$).

# load a standard PhyloExpressionSet
data(PhyloExpressionSetExample)

# we assume that the PhyloExpressionSetExample
# consists of 3 developmental stages
# and 2 replicates for stage 1, 3 replicates for stage 2,
# and 2 replicates for stage 3
# FOR REAL ANALYSES PLEASE USE: permutations = 1000 or 10000
# BUT NOTE THAT THIS TAKES MUCH MORE COMPUTATION TIME
p.vector <- CombinatorialSignificance(ExpressionSet = PhyloExpressionSetExample,
replicates    = c(2,3,2),
TestStatistic = "FlatLineTest",
permutations  = 100,
parallel      = FALSE)
 [1] 2.436296e-03 2.288593e-02 1.608399e-03 1.185615e-02 1.835306e-06 1.077012e-05
[7] 2.025515e-07 5.148342e-07 1.654885e-07 6.251145e-06 9.265520e-10 1.047479e-06

Users will observe that none of the replicate combinations resulted in p-values > 0.05 and thus we can assume that the phylotranscriptomic pattern computed based on collapsed replicates is not biased by insignificant replicate combinations.

any(p.vector > 0.05)
#> FALSE

CombinatorialSignificance() can perform all significance tests introduced in the Introduction and Intermediate vignettes.

Furthermore, the parallel argument allows users to perform significance computations in parallel on a multicore machine. This will speed up p-value computations for a large number of combinations.