# genotypeR vignette

#### 2018-05-21

This vignette describes the parts of/and how to use the genotypeR package for analysing genotyping data. This package provides an integrated interface from designing genotyping markers to final analysis. I will describe the general work flow related to using the package.

## Installing genotypeR

At present, the development version of the package can be installed from https://github.com/StevisonLab/genotypeR, and the stable version can be installed from (Bioconductor) or (CRAN) (depending on where it is released).

### install genotypeR

library(devtools)
devtools::install_git("https://github.com/StevisonLab/genotypeR")

library(genotypeR)
##
## Attaching package: 'genotypeR'
## The following object is masked from 'package:methods':
##
##     show

## Marker design

genotypeR provides SequenomMarkers to run a perl/vcftools pipeline to design genotyping markers. As a result, vcftools, perl, and a *nix environment are needed for marker design. There are 3 ways to develop markers with genotypeR. The example data is located in the SequenomMarkers directory in the package install directory, and should be useful for exploring the use of the package. Importantly, there are 2 different marker platforms (i.e., argument type) 100 bp flanking reference bases (platform=“sq”), and 50 bp flanking reference bases (platform=“gg”).

### R console

SequenomMarkers can be run from the R console.

## not run, but does work
## uncomment and run
## example_files <- system.file("SequenomMarkers_v2/two_sample/test_files", package = "genotypeR")

## vcf1 <- paste(example_files, "Sample1.vcf", sep="/")
## vcf2 <- paste(example_files, "Sample2.vcf", sep="/")

## look in outdir to look at the results in Master_SNPs.sorted.txt.
## outdir <- paste(example_files, "test_dir", sep="/")

## SequenomMarkers(vcf1, vcf2, outdir, platform="sq")

### R script

SequenomMarkers can be run using the Rscript interface. This interface is useful when designing marker on a high performace computing platform.

## not run, but does work
## uncomment and run
## #!/usr/bin/env Rscript

## library(genotypeR)

## example_files <- system.file("SequenomMarkers_v2/two_sample/test_files", package = "genotypeR")

## vcf1 <- paste(example_files, "Sample1.vcf", sep="/")
## vcf2 <- paste(example_files, "Sample2.vcf", sep="/")
## outdir <- paste(example_files, "test_dir", sep="/")

## SequenomMarkers(vcf1, vcf2, outdir, platform="sq")

### Shell/Perl/vcftools pipline

The original code for marker design software can be run from the commandline. This code is contained in the SequenomMarkers folder in the genotypeR installation directory (R/SequenomMarkers/R/R_Pipeline_Rapper.sh), or directly from github (https://github.com/StevisonLab/genotypeR/tree/master/inst/SequenomMarkers).

## Post marker design/Pre-genotyping

After markers are designed, and before genotyping is conducted on the sequenom, marker names need to be generated from SequenomMarkers. The marker output from SequenomMarkers needs to be read into R with read_in_Master_SNPs_data and converted to a data frame with make_marker_names with marker names suitable for genotypeR. We have provied example data in the extdata directory of the package installation directory. 2_sequenom_plate_2_chip_8_dpseudo_plasticity_1_10_2017.csv is a raw data file that is the same data as the genotypes_data provided with the package and filtered_markers.txt is the same as the data provided by markers. For simplicity I will use the data provided with the package, but the raw data is provided for the user’s reference.

dir(system.file("extdata", package = "genotypeR"))
## [1] "filtered_markers.txt"   "sequenom_test_data.csv"
## read in marker data
data(markers)
## make marker names
marker_names <- make_marker_names(markers)

These can be written as a csv for use in the Sequenom as a csv file.

write.csv(marker_names, "/your/dir/file.csv")

## Genotyping

This will be conducted on the platform the markers were designed for. Our example will revolve around the Sequenom platform, but we have implemented these the work flow for other genotyping platforms as well.

## Post-genotyping QA/QC

The QA/QC of genotyping data is an important step in any genotyping experiment. First, genotyping data will need to be read into R with the appropriate function read_in_sequenom_data. These data are from the GENOTYPES tab of the excel sheet that is produced from the Sequenom machine, and will be used once converted to a csv. A genotype table is made from make_marker_names output with Ref_Alt_Table. Generally, you will want to pre-filter the sequenom markers that were used in the genotyping assay in order to be able to use these in further analysis.

  data(genotypes_data)

## genotype table
GT_table <- Ref_Alt_Table(marker_names)

## remove those markers that did not work
genotypes_data_filtered <- genotypes_data[,c(1, 2, grep("TRUE", colnames(genotypes_data)%in%GT_table\$marker_names))]

We designed this package from a backcross experiment and, as a result, initialize_genotypeR_data is the name of this function. This is where a genotypeR object is made, and processing data with this function is required for all other downstream analyses with genotypeR. Backcross warnings, those having to do with an impossible genotype, are produced by this function with the output = “warnings”.

## warnings for a backcross design
##Note warning_allele is Ref allele
warnings_out <- initialize_genotypeR_data(seq_data = genotypes_data_filtered, genotype_table = GT_table, output="warnings")

Once, the warnings have been inspected for errors, and then all of these can then be changed to NAs if we are sure that they are incorrect (i.e., bad spectra, etc.).

## warnings for a backcross design
##Note warning_allele is Ref allele
warnings_out <- initialize_genotypeR_data(seq_data = genotypes_data_filtered, genotype_table = GT_table, output="warnings2NA")

If the user is satisfied that all the data is correct, or the data was generated by another process (different cross, SNP genotyping in wild populations, etc.) the user can use the “pass_through” option for the output argument of initialize_genotypeR_data.

## warnings for a backcross design
##Note warning_allele is Ref allele
warnings_out <- initialize_genotypeR_data(seq_data = genotypes_data_filtered, genotype_table = GT_table, output="pass_through")

Let’s return to our example. We can use the “warnings2NA” argument to output because we are confident that we have correct genotypes.

##We are confident that the warngins should be turned into no Calls
warnings_out2NA <- initialize_genotypeR_data(seq_data = genotypes_data_filtered, genotype_table = GT_table, output = "warnings2NA")

We can now binary code this table as 0 (Homozygous) and 1 (Heterozygous) with binary_coding.

binary_coding_genotypes <- binary_coding(warnings_out2NA, genotype_table = GT_table)

This would be the typical use for a backcross design. zero_one_two_coding if data is used where 3 states would be able to be assigned.

binary_coding_genotypes <- zero_one_two_coding(warnings_out2NA, genotype_table = GT_table)

After binary coding, we can count crossovers with count_CO, but first the user should subset based on chromosome.

chr2 <- subsetChromosome(binary_coding_genotypes, chromosome="chr2")

crossover_count <- count_CO(chr2)

## Other import/export functions

### export

Our backcross example can be utilized to show how our data can be output to the rqtl package.

##writes out the data
convert2qtl_table(chr2, chromosome_vect=rep(2, length(binary_genotypes(chr2))-2))

##reads in the data to rqtl
library(qtl)
rqtl_object <- read.cross("csv", ".", "temp_cross_for_qtl.csv", na.strings="NA")
##  --Read the following data:
##   96  individuals
##   25  markers
##   1  phenotypes
##  --Estimating genetic map
##  --Cross type: bc

This allows genotyping data to undergo QA/QC, and then be utilized for other things like qualitative trait mapping among other things. The convert2qtl_table will also work with data that has been 0/1/2 coded with zero_one_two_coding.

### import

Another popular genotyping platform, Illumina’s GoldenGate, is also supported by the genotypeR package. A/B coded data can be read in with read_in_illumina_GoldenGate, and a marker table can be created with illumina_Genoype_Table. This data can then be utilized in further genotypeR analysis.

test_data <- read_in_illumina_GoldenGate(tab_delimited_file="path_to_goldengate_file", flanking_region_length=50, chromosome=rep("chr2", length.out=length(552960)))

illumina_table <- illumina_Genoype_Table(tab_delimited_file="path_to_goldengate_file", flanking_region_length=50, chromosome=rep("chr2", length.out=length(552960)))

illumina_cross <- initialize_genotypeR_data(seq_data = test_data, genotype_table = illumina_table, output="warnings")

## accessor functions

genotypeR is written in the S4 Object Oriented style. As a result we have provided methods for both accessing slots, and to access slots

##accessor functions
impossible_genotype
genotypes
impossible_genotype
binary_genotypes
counted_crossovers

##replacement methods for:
genotypes
impossible_genotype
binary_genotypes
counted_crossovers

## Internal functions

genotypeR also provides functions that have not been described yet. CO will count crossovers per individual, and can be used for further QA/QC, or can be used for specific research questions.

library(doBy)
to_count_CO <- binary_genotypes(chr2)
counted_per_individuals <- lapply(splitBy(~SAMPLE_NAME+WELL, data=to_count_CO), CO)

grep_df_subset is used to subset a dataframe based on column names. sort_sequenom sorts a dataframe based on marker columns produced by genotypeR (i.e., with make_marker_names).

## Conclusion

genotypeR should be a useful package to facilitate a genotyping work flow for both natual and experimental genotyping results.