SSHAARP: Searching Shared HLA Amino Acid Residue Prevalence

Livia Tran - [email protected]

September 13, 2021

Overview

The Searching Shared HLA Amino Acid Residue Prevalence (SSHAARP) Package processes amino acid alignments produced by the IPD-IMGT/HLA Database to identify user-defined amino acid residue motifs shared across HLA alleles, and calculates the frequencies of those motifs based on HLA allele frequency data. SSHAARP’s functions interact with protein alignments from the ANHIG/IMGTHLA Github repository (https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments) and the Solberg dataset, both of which are included in the package, and the Generic Mapping Tools (gmt) R package. Details about the Solberg dataset can be found at doi: 10.1016/j.humimm.2008.05.001. The bundled solberg_dataset object is the 1-locus-alleles.dat file in the results.zip archive at http://pypop.org/popdata/. The gmt R package requires operating system installations of GMT software (version 5 or 6; available at: https://www.soest.hawaii.edu/gmt/) and Ghostscript (version 9.6 or higher) to generate maps. Earlier versions of Ghostscript may result in the generation of incorrect maps. To determine if maps are being generated correctly, compare maps for the DRB1*26F~28E~30Y motif to the examples in this vignette.

Functions

BLAASD()

BLAASD() extracts alignment sequence information for a given locus from the ANHIG/IMGTHLA GitHub repository to produce a data frame of individual amino acid data for each amino acid position for all alleles. The first 4 columns are locus, allele, trimmed allele, and allele_name, as shown below.

Most DQ-Beta proteins (e.g., DQB1*05:01) lack the eight amino-acids encoded by exon 5 due to an intron 4 splice-acceptor mutation. Some DQB1 alleles (e.g., DQB1*05:03 and *06:01) do not have this intron mutation, and include these eight amino acids. However, because DQB1*05:01 is the DQB1 reference allele, these amino acid positions are represented as insertion positions in the protein alignment generated by the IPD-IMGT/HLA Database. It is not clear how these exon 5-encoded amino acid positions are numbered in the IPD-IMGT/HLA Database; BLAASD() assigns amino-acid position numbers 227-234 to these eight amino-acid indel positions, and the C-terminal DQB1 amino-acid position in the DQB1 protein alignment generated by BLAASD() is position 237.

The output for HLA-DRB3/4/5 will contain DRB1 information for the first row since the reference sequence derived for DRB3/4/5 is from DRB1 in the DRB alignment.

Example output of BLAASD() for HLA-A, where the selected row outputs are 2 and 8 to illustrate different HLA-A alleles, and selected columns are the first 19 columns of the dataframe, where the first 4 columns consist of the locus, allele, trimmed_allele of 2 fields, and the full allele name. The rest of the columns consist of corresponding amino acids for each amino acid position for a given allele:

>BLAASD("A")[[1]][c(2,8),1:19]

  locus       allele trimmed_allele    allele_name -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10
2     A 01:01:01:02N        A*01:01 A*01:01:01:02N   M   A   V   M   A   P   R   T   L   L   L   L   L   S   G
8     A  01:01:01:08        A*01:01  A*01:01:01:08   M   A   V   M   A   P   R   T   L   L   L   L   L   S   G

findMotif()

findMotif() parses the provided amino-acid motif, which should be written as locus*##$~##$~##$, where ## identifies a peptide position, and $ identifies an amino acid residue. Motifs can include any number of amino acids positions. The following output is an example of the first three DRB1 alleles that have the DRB1*26F~28E~30Y motif.

>findMotif("DRB1*26F~28E~30Y")[1:3,1:65]

 locus   allele trimmed_allele   allele_name -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19 -18 -17
204  DRB1 03:02:01     DRB1*03:02 DRB1*03:02:01   M   V   C   L   R   L   P   G   G   S   C   M   A
205  DRB1 03:02:02     DRB1*03:02 DRB1*03:02:02   *   *   *   *   *   *   *   *   *   *   *   *   *
206  DRB1 03:02:03     DRB1*03:02 DRB1*03:02:03   *   *   *   *   *   *   *   *   *   *   *   *   *
    -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
204   V   L   T   V   T   L   M  V  L  S  S  P  L  A  L  A G D T R P R F L E  Y  S  T  S  E  C  H  F
205   *   *   *   *   *   *   *  *  *  *  *  *  *  *  *  * G D T R P R F L E  Y  S  T  S  E  C  H  F
206   *   *   *   *   *   *   *  *  *  *  *  *  *  *  *  * * * * * * R F L E  Y  S  T  S  E  C  H  F
    18 19 20 21 22 23 24 25 26 INDEL-1 27 28 29 30 31
204  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F
205  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F
206  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F

If a motif is not found, a warning message is output:

>findMotif("DRB1*26F~28E~30Z")

Warning message:
In findMotif("DRB1*26F~28E~30Z") :
  DRB1*26F~28E~30Z: No alleles possess this motif

If a motif has formatting errors, a warning message is output:

>findMotif("DRB1**27F")

Warning message:
In findMotif("DRB1**27F") :
  Your motif is formatted incorrectly. Please use the Locus*##$~##$~##$ format, where ## identifies a peptide position, and $ identifies an amino acid residue.

If a motif contains an amino acid position not present in the 3.39.0 ANHIG/IMGTHLA alignments, a warning message is output:

> findMotif("DRB1*27999F")

Warning message:
In findMotif("DRB1*27999F") :
  One or more of your amino acid positions is not present in the alignment. Please make sure amino acid positions of interest are present in the current release of ANHIG/IMGTHLA alignments.

PALM()

PALM() generates a frequency heatmap for an HLA amino-acid motif, based on the allele frequency data in the Solberg dataset, using the gmt R package and the Generic Mapping Tools (GMT) map-making software. GMT software is required for this function and can be downloaded at https://www.soest.hawaii.edu/gmt/.

JPEG heatmap files named “‘motif’.jpg”, where ‘motif’ is the value of the motif parameter, are written to a directory identified by the direct parameter, with the working directory set as the default. The file names of maps generated on Microsoft Windows systems have a different delimiter between the locus name and the amino acid positions; instead of an asterisk (*), a dash (-) is used to accomodate for Windows naming conventions (e.g. DRB1-26F~28E~30Y).

The filename parameter defaults to the solberg_dataset list object. The color and filter_migrant parameters default to TRUE. When color = TRUE, PALM generates a color heatmap; when color = FALSE, a greyscale heatmap is generated. When filter_migrant = TRUE, populations from the OTH region and populations with complexity values suffixed with ‘mig’ are excluded from the heatmap plot. When filter_migrant = FALSE, these populations are included in the heatmap plot.

The Ghostscript software suite (>= version 9.26) is required to generate the heatmap file. Versions of Ghostscript < 9.26 introduce a layering error during heatmap generation. To determine if maps are being generated correctly, compare heatmaps of the DRB1*26F~28E~30Y motif (with filter_migrant = TRUE) against the examples in this vignette.

Note: While the map legend identifies the highest frequency value, values in this range may not be represented on the map due to frequency averaging over neighboring populations.

#color heatmap plot
>PALM("DRB1*26F~28E~30Y")
...

#greyscale heatmap plot
>PALM("DRB1*26F~28E~30Y", color=FALSE)