Genomic Annotation in Livestock for positional candidate LOci (GALLO) is an R package for the accurate annotation of genes and Quantitative Trait Loci (QTLs) located within candidate markers and/or regions (haplotypes, windows, CNVs, etc.) identified in the most common genomic analyses performed in livestock, such as Genome-Wide Association Studies or transcriptomics. Moreover, GALLO allows the graphical visualization of gene and QTL annotation results, data comparison among different grouping factors (e.g., methods, breeds, tissues, statistical models, studies, etc.), and QTL enrichment in different livestock species including cattle, pigs, sheep, and chickens, among others.
The example datasets used in this tutorial are subsets of Genome-Wide Association Studies (GWAS) for male fertility traits in cattle, which are summarized in Fonseca et al. (2018) and Cánovas et al. (2014). Additionally, the respective databases for gene and QTL annotation for these subsets are available as internal data in the GALLO package. The datasets can be accessed using the following code:
#Installation
#devtools::install_github("pablobio/GALLO")
#Loading the package
library(GALLO)
#Importing QTL markers from example dataset
data("QTLmarkers")
DT::datatable(QTLmarkers, rownames = FALSE, extensions = 'FixedColumns',
              options = list(scrollX = TRUE))
dim(QTLmarkers)
[1] 141 7
#Importing QTL windows from example dataset
data("QTLwindows")
DT::datatable(QTLwindows, rownames = FALSE, extensions = 'FixedColumns',
              options = list(scrollX = TRUE))
dim(QTLwindows)
[1] 50 8
Note that two datasets are available: QTLwindows and QTLmarkers. The QTLmarkers dataset is composed of 141 markers significantly associated with male fertility traits in cattle, while the QTLwindows dataset is composed of 50 candidate genomic windows.
#Importing QTL annotation database
data("gffQTLs")
#Printing the first 100 rows
DT::datatable(gffQTLs[1:100,], rownames = FALSE, extensions = 'FixedColumns',
              options = list(scrollX = TRUE))
dim(gffQTLs)
[1] 59600 6
#Importing gene annotation database
data("gtfGenes")
#Printing the first 100 rows
DT::datatable(gtfGenes[1:100,], rownames = FALSE, extensions = 'FixedColumns',
              options = list(scrollX = TRUE))
dim(gtfGenes)
[1] 17831 8
Note that two databases are available: gffQTLs and gtfGenes. The gffQTLs dataset is composed of 59600 annotated QTLs in the bovine genome, while the gtfGenes dataset is composed of 17831 genes.
import_gff_gtf: Takes a .gtf or .gff file and imports it into a dataframe
find_genes_qtls_around_markers: Takes a dataframe with candidate markers and/or regions (haplotypes, windows, CNVs, etc.) and searches for genes or QTLs in a specified interval
overlapping_among_groups: Takes a dataframe with a column for genes, QTLs (or any other data) and a grouping column, and creates matrices with the overlapping information
plot_overlapping: Takes the output from the overlapping_among_groups function and creates a heatmap with the overlapping between groups
plot_qtl_info: Takes the output from find_genes_qtls_around_markers and creates plots for the frequency of each QTL type and trait
qtl_enrich: Takes the output from find_genes_qtls_around_markers and performs a QTL enrichment analysis
QTLenrich_plot: Takes the output from the qtl_enrich function and creates a bubble plot with the enrichment results
relationship_plot: Takes the output from the find_genes_qtls_around_markers function and creates a chord plot with the relationship between groups
In a conventional routine analysis, both .gff and .gtf files can be imported using the import_gff_gtf() function from the GALLO package.
Arguments from import_gff_gtf
import_gff_gtf(db_file, file_type)
db_file: File with the gene mapping or QTL information. For gene mapping, use the .gtf file downloaded from the Ensembl database. For the QTL search, provide the .gff file, which can be downloaded from the Animal QTLdb. Both files must use the same reference annotation used in the original study.
file_type: gff (for QTL annotation) or gtf (for gene annotation).
#An example of how to import a QTL annotation file
#qtl.inp <- import_gff_gtf(db_file="QTL_db.gff",file_type="gff")
#An example of how to import a gene annotation file
#gtf.inp <- import_gff_gtf(db_file="Gene_db.gtf",file_type="gtf")
The main function of GALLO, find_genes_qtls_around_markers(), annotates genes and/or co-localized QTLs within or near candidate markers or genomic regions (using a user-defined interval/window). This function uses the information provided in the .gtf file (for gene annotation) or the .gff file (for QTL annotation) to retrieve the requested information. The .gtf files can be downloaded from the Ensembl database and the .gff files from the Animal QTLdb.
find_genes_qtls_around_markers(db_file, marker_file, method = c("gene", "qtl"), marker = c("snp", "haplotype"), interval = 0, nThreads = NULL)
db_file: The dataframe created using the import_gff_gtf() function.
marker_file: The dataframe with the SNP or haplotype positions. For SNPs, it must have a column called "CHR" and a column called "BP" with the chromosome and base pair position, respectively. For haplotypes, it must have three columns: "CHR", "BP1" and "BP2". All column names must be in uppercase.
method: "gene" or "qtl". If the "gene" method is selected, a .gtf file must be provided for the db_file argument. If the "qtl" method is selected, a .gff file from the Animal QTLdb must be provided for the db_file argument.
marker: "snp" or "haplotype". If the "snp" option is selected, a dataframe with at least two mandatory columns (CHR and BP) must be provided for the marker_file argument. If the "haplotype" option is selected, a dataframe with at least three mandatory columns (CHR, BP1 and BP2) must be provided for the marker_file argument. Any additional column can be included in the dataframe provided for the marker_file argument, for example, a column indicating the study, model, breed, etc. from which the results were obtained.
interval: The interval, in base pairs, to be included upstream and downstream of the marker or haplotype coordinates.
nThreads: Number of threads to be used in the analysis.
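The column requirements for marker_file described above can be checked before running the annotation. The following is a minimal sketch in base R. The helper check_marker_columns() and the toy dataframes are hypothetical and are not part of GALLO.

```r
# Hypothetical helper (not part of GALLO): checks that a dataframe has the
# columns required for the chosen marker type.
check_marker_columns <- function(marker_df, marker = c("snp", "haplotype")) {
  marker <- match.arg(marker)
  required <- if (marker == "snp") c("CHR", "BP") else c("CHR", "BP1", "BP2")
  missing_cols <- setdiff(required, colnames(marker_df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy SNP input: one row per marker, plus an optional grouping column
snp_df <- data.frame(CHR = c(1, 5), BP = c(1500000, 23000000),
                     Study = c("A", "B"))

# Toy haplotype/window input: start (BP1) and end (BP2) coordinates
hap_df <- data.frame(CHR = c(1, 5), BP1 = c(1000000, 22500000),
                     BP2 = c(2000000, 23500000))

check_marker_columns(snp_df, marker = "snp")
check_marker_columns(hap_df, marker = "haplotype")
```

Passing snp_df with marker = "haplotype" would stop with an error naming BP1 and BP2 as missing, which is the kind of mismatch worth catching before a long annotation run.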
For example, let's run a gene and QTL annotation using the QTLmarkers dataset, including an interval of 500 kb upstream and downstream of each marker (using the interval = 500000 argument):
#Running gene annotation
out.genes <- find_genes_qtls_around_markers(db_file = gtfGenes,
    marker_file = QTLmarkers, method = "gene",
    marker = "snp", interval = 500000, nThreads = 1)
You are using the method: gene with snp
## Warning: executing %dopar% sequentially: no parallel backend registered
#Checking the first rows from the output file
DT::datatable(out.genes, rownames = FALSE,
              extensions = 'FixedColumns',
              options = list(scrollX = TRUE))
#Checking the dimensions of the output file
dim(out.genes)
[1] 652 15
The gene annotation resulted in 652 genes within the 1 Mb interval (500 Kb upstream and 500 Kb downstream) from the candidate markers.
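To make the effect of the interval argument concrete, the sketch below reproduces the idea in base R with toy data: each marker position is extended by interval base pairs on both sides, and any gene overlapping that range on the same chromosome is reported. The toy values, the gene table column names (chr, start_pos, end_pos, gene_id) and the function find_overlaps_sketch() are hypothetical. This illustrates the concept only and is not GALLO's internal implementation.

```r
# Toy markers and genes (hypothetical values)
markers <- data.frame(CHR = c(1, 2), BP = c(1000000, 5000000))
genes <- data.frame(chr = c(1, 1, 2),
                    start_pos = c(600000, 1600000, 5200000),
                    end_pos = c(700000, 1700000, 5300000),
                    gene_id = c("geneA", "geneB", "geneC"))

# Hypothetical helper: returns one row per marker-gene pair in which the gene
# overlaps the range [BP - interval, BP + interval] on the same chromosome
find_overlaps_sketch <- function(markers, genes, interval = 0) {
  hits <- lapply(seq_len(nrow(markers)), function(i) {
    lower <- markers$BP[i] - interval
    upper <- markers$BP[i] + interval
    keep <- genes$chr == markers$CHR[i] &
      genes$start_pos <= upper & genes$end_pos >= lower
    if (!any(keep)) return(NULL)  # no gene within reach of this marker
    data.frame(CHR = markers$CHR[i], BP = markers$BP[i],
               gene_id = genes$gene_id[keep],
               start_pos = genes$start_pos[keep],
               end_pos = genes$end_pos[keep])
  })
  do.call(rbind, hits)
}

res <- find_overlaps_sketch(markers, genes, interval = 500000)
res$gene_id
```

With interval = 500000, geneA and geneC are returned, while geneB, which starts 600 kb away from the first marker, is excluded. With interval = 0, none of the toy genes overlap a marker position and the function returns NULL.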
#Running QTL annotation
out.qtls <- find_genes_qtls_around_markers(db_file = gffQTLs,
    marker_file = QTLmarkers, method = "qtl",
    marker = "snp", interval = 500000, nThreads = 1)
You are using the method: qtl with snp
Starting QTL searching using 5e+05 bp as interval
Preparing output file for QTL annotation
#Checking the first rows from the output file
DT::datatable(out.qtls, rownames = FALSE,
              extensions = 'FixedColumns',
              options = list(scrollX = TRUE))