# 1. Introduction

In the precision medicine era, the phenotype-gene-variant database is very important for a special Mendelian disorder or phenotypes. The information of phenotype-gene-variant relationships is continually increasing in the public databases and the literatures. Thus, recurrent updating of the phenotype-gene-variant database is essential. However, it is time-consuming and error-prone to capture the information from the databases one by one and the literatures directly. Thus, how to capture this information rapidly and accurately to translate into practice is a bottleneck to be solved.

Fortunately, some databases focusing on the relationships among human variants/genes and phenotypes are public and freely accessible. These include Human Phenotype Ontology (HPO), Orphanet, Online Mendelian Inheritance in Man (OMIM), ClinVar, and Universal Protein Resource (UniProt) etc. The public databases collect not only the phenotypes including names and aliases of diseases and clinical features, such as HPO, but also names and aliases of the genes and variants. The databases may be compiled from the literature and other databases automatically, even manually, or submitted by the researchers directly and updated regularly. Additionally, PubMed provides the primary and latest source of the information. However, manually parsing and searching for information from these databases and PubMed is time-consuming and error-prone.

VarfromPDB is an easy-to-use R package for capturing the genes and variants from the public databases and PubMed abstracts automatically. It can be very valuable for R programmers or anyone who is interested in disease-related genes/variants in precision medicine based on the target sequencing strategy using automated scripting.

# 2 Getting started

The VarfromPDB package captures the genes and variants related to a Mendelian disorder from the public databases and PubMed abstracts.

## 2.1 Localize the public databases

localPDB() performs the localization of the necessary files in several databases, including HPO, Orphanet, OMIM, ClinVar, Uniprot and UCSC. All the files can be freely accessed except for those from OMIM. An OMIM account and an API key should be applied from OMIM website in advance. Each database can be specified flexibly, which can be selected depending on the database update frequency.

## 2.2 Getting the alias of a genetic disease

VarfromPDB gets the aliases of a genetic disease from HPO database for a given keyword, which can be a disease name or a clinical feature. The aliases of a genetic disease can be used in the capturing process from other databases to make sure that this information is complete.

## 2.5 Capture the genes and variants from PubMed

The reported genes and variants related to a disease can be extracted from the PubMed abstracts based on text mining methods. The genes were extracted based on a dictionary-based method. To identify mutations, mutation nomenclature recommendations at the DNA level and protein-level followed by HGVS are searched for by regular expression and the names of amino acids. Each gene-variant relationship pairs need to be resolved one by one based on the bipartite graph theory and sentence-level concurrence.

## 2.6 Compile the genes and variants

The union of the gene sets from different databases by the approved gene symbols are integrated, and then the related phenotypes from different databases and physical positions of the genes are annotated. A VarfromPDB score was assigned based on the evidences from the curated databases and literatures.

# 3 Start up

## 3.1 set the environment and parameters

Assuming that you have installed VarfromPDB package, and other dependent packages including RISmed and stringi. You first need to load them

library(VarfromPDB)
library(RISmed)
library(stringi)

The only one parameter, keywords, is necessary. You can specify the working path or get from the current working directory. Then the output directory will be created automatically.

keywords = "Joubert syndrome"
work.path = getwd()
setwd(work.path)
prefix = ifelse(length(grep("\\|",keywords)) >0,str_trim(unlist(strsplit(keywords,"\\|")))[1],keywords)
prefix = ifelse(length(grep(" ",prefix)) >0,paste(unlist(strsplit(prefix," ")),collapse="."),prefix)
out.path = paste(work.path,prefix,sep="/")
dir.create(out.path)

## 3.2 Creating a local database

We strongly recommend that you download the files from HPO, HGNC, ClinVar, OMIM, Ophanet and UniProt before you start a job for the first time, which maybe more efficient. All the databases can be free accessed except the OMIM, so you should apply for the FTP URL and API key from OMIM before your first job. Suppose you already have the OMIM FTP URL and API key, omim.url and omim.api. However, the OMIM database is optional for the process. Suppose you have no available OMIM FTP URL and API key, so the OMIM database will be ignored.You can just type

localPDB()

In general, it needs several minutes to localize the database files. However, the run time of localPDB step mainly depends on internet speed. If the network is not very satisfying, the step will interrupt. Then you should have a check of the internet and try again.

## 3.3.3 Extract genes and variants from OMIM

Suppose you have no available OMIM API key, omim.api, so we skip this step here.

## 3.3.4 Extract the genes and variants from ClinVar

The function extract_clinvar extracts the genes and variants from the ClinVar database.

clinvar.phenotype = extract_clinvar(keyword= keywords, HPO.disease = phenoID.hpo.omim, genelist = genes.merge)
genes.clinvar = clinvar.phenotype [[1]]
variants.clinvar = clinvar.phenotype [[2]]
write.table(genes.clinvar,paste(out.path,"/","Clinvar_genes_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)
write.table(variants.clinvar,paste(out.path,"/","Clinvar_variants_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)

### 3.3.5 Extract the genes and variants from UniProt

The function extract_UniProt extracts the genes and variants from the UniProt database, which focus on the amino acid-altering variants which are manually curated human polymorphisms and disease mutations from UniProtKB/Swiss-Prot. The variants in the genes from HPO and Orphanet or other information of interest can also be added in the function too.

UniProt.phenotype = extract_uniprot(keyword= keywords,HPO.disease = phenoID.hpo.omim, genelist = genes.merge)
genes.UniProt = UniProt.phenotype [[1]]
variants.UniProt = UniProt.phenotype [[2]]
write.table(genes.UniProt,paste(out.path,"/","Uniprot_genes_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)
write.table(variants.UniProt,paste(out.path,"/","Uniprot_variants_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)

### 3.3.6 Compile the gene and variant database

Then we integrate the genes captured from multiple databases. The function genes.compile compiles the gene sets from different databases into a union set according to the approved gene symbols, and then the related phenotypes from different databases and physical positions of the genes are annotated.

genesPDB = genes_compile(HPO = HPO.phenotype, orphanet = orphanet.phenotype, clinvar = genes.clinvar, uniprot = genes.UniProt)
write.table(genesPDB,paste(out.path,"/","PublicDB_genes_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)
variantsPDB <- variants_compile(clinvar = variants.clinvar, uniprot = variants.UniProt)
write.table(variantsPDB,paste(out.path,"/","PublicDB_variants_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)

### 3.3.7 Capture the genes and variants from PubMed abstracts

The function extract_PubMed performs an equery in PubMed E-utilities using the search strategy similar to that of the web, and then captures the information from disease-related abstracts based on text mining. The information of phenotypes, genes, variants, article titles, first authors, PubMed IDs, publication years, and publication journals will be captured. In the text mining process, the phenotype information and the genes were extracted based on a dictionary-based method. To identify mutations, mutation nomenclature recommendations at the DNA level and protein-level followed by HGVS are searched for by regular expression and the names of amino acids. When there are multiple genes and variants reported in one article, each gene-variant relationship pairs need to be resolved one by one based on the bipartite graph theory and sentence-level concurrence.

pubmed.phenotype <- extract_pubmed(query = "(Joubert syndrome[Title\\/Abstract]) AND (gene[Title\\/Abstract] OR genes[Title\\/Abstract] OR mutation[Title\\/Abstract] OR mutations[Title\\/Abstract] OR polymorphisms[Title\\/Abstract] OR genotype[Title\\/Abstract] OR SNP[Title\\/Abstract] OR SNPs[Title\\/Abstract] OR associated[Title\\/Abstract] OR translocation[Title\\/Abstract])", keyword=keywords);
genes.pubmed <- pubmed.phenotype[[1]]
write.table(genes.pubmed,paste(out.path,"/","PubMed_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)
write.table(pubmed.phenotype[[2]],paste(out.path,"/","PubMed_abstract_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F,col.names=F)

genes.pubmed.sel <- genes.pubmed[genes.pubmed[,"cdna.change.HGVS"] != ""|genes.pubmed[,"p.change.HGVS"] != "",]
write.table(geneAll,paste(out.path,"/","Genes_all_",prefix,".txt",sep=""),sep="\t",row.names=F,quote=F)