# Overview

gemma2 is an R implementation of an EM algorithm (Zhou and Stephens 2014) to fit a multivariate linear mixed effects model. Specifically, it fits, via restricted maximum likelihood (REML), the model:

$vec(Y) = X vec(B) + vec(G) + vec(E)$

where $$Y$$ is a n by 2 phenotypes matrix, with each row being a single subject and each column a distinct phenotype. $$X$$ is a 2n by 2f matrix of genotypes, where $$n$$ is the sample size and $$f$$ is the number of founder lines. $$X$$ has a block-diagonal structure, so that it contains, on the diagonal, two $$n$$ by $$f$$ blocks of genotype data. The two off-diagonal blocks are all zeros. $$B$$ is a $$f$$ by $$2$$ matrix of founder allele effect sizes. $$G$$ is a $$n$$ by $$2$$ matrix of polygenic random effects that follows the distribution:

$G \sim MN_{n x 2}(0, K, V_g)$

where $$MN$$ denotes the matrix-variate normal distribution with mean being the n by 2 zero matrix, $$K$$ - a kinship matrix - being the n by n covariance among rows, and $$V_g$$ being the 2 by 2 covariance matrix among columns.

$$E$$ is a n by 2 matrix of random errors that follows the distribution:

$E \sim MN_{n x 2}(0, I_n, V_e)$ where $$I_n$$ is the n by n identity matrix.

Additionally, we have the usual assumption that $$E$$ and $$G$$ are independent.

By applying the vectorization operator, “vec”, we see that:

$vec(Y) \sim N(Xvec(B), V_g \otimes K + V_e \otimes I_n)$ where $$\otimes$$ denotes the Kronecker product.

## Writing gemma2

I wrote gemma2 by translating, line by line, the C++ code for version 0.96 of GEMMA. Specifically, I drew heavily on the file mvlmm.cpp when writing the update functions for the EM algorithm.

## Algorithmic differences from Zhou’s GEMMA

When fitting the above multivariate linear mixed effects model, Zhou’s GEMMA uses first an EM algorithm to get reasonable starting values for input to a Newton-Raphson algorithm. The model fit is taken as the output of the Newton-Raphson algorithm. They explain that the reason for doing this is that the Newton-Raphson algorithm - with reasonable starting values - converges much faster than does the EM algorithm. I elected to not implement the Newton-Raphson algorithm. Instead, I use exclusively the EM algorithm output. For that reason, I can’t merely look at the output of Zhou’s GEMMA and compare it directly to my output from gemma2. Zhou’s GEMMA doesn’t print the (intermediate) output of its EM algorithm.

## Comparing Zhou’s GEMMA (v0.96) EM output with that of gemma2

To compare the output of Zhou’s GEMMA with that of gemma2, I decided to insert print statements in the C++ code of GEMMA. That way, when I run GEMMA, I get verbose output of intermediate results. That is, I can see exactly how each component is updated in the EM algorithm because I print the numerical values at each iteration of the GEMMA EM algorithm.

I can then ask the question:

for a set of starting values, does the first iteration of EM updates by GEMMA give the same values as the first iteration of EM updates by gemma2?

There is an additional complication. GEMMA is set up so that you can’t merely specify a chosen set of starting values for EM algorithm. Rather, it takes a data set as input. GEMMA then fit univariate LMMs to get the starting values for the variance components, $$V_g$$ and $$V_e$$. Note that the off-diagonal elements of $$V_g$$ and $$V_e$$ start at zero for GEMMA’s methods. In other words, GEMMA takes the inputted data and fits a univariate LMM for each phenotype. It then uses those fitted values for input to the EM algorithm for a multivariate LMM, before, ultimately, using a NR algorithm to get the final mvlmm fit.

Because of how I’ve added the print statements to GEMMA’s C++ code, I’ll need to look at the printed output from GEMMA to ensure that I use the same starting values for gemma2. In other words, I can’t pre-specify starting values and use them in both GEMMA and gemma2. Instead, I need to specify a data set, then look to see where GEMMA starts its EM algorithm for mvLMM, and use input those same starting values into gemma2.

## Running gemma2 with example data

First, we read the data into R.

readr::read_csv(system.file("extdata", "mouse100.geno.txt", package = "gemma2"), col_names = FALSE) -> geno
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   X1 = col_character(),
#>   X2 = col_character(),
#>   X3 = col_character()
#> )
#> See spec(...) for full column specifications.
#> Parsed with column specification:
#> cols(
#>   X1 = col_double(),
#>   X2 = col_double(),
#>   X3 = col_double(),
#>   X4 = col_double(),
#>   X5 = col_double(),
#>   X6 = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   X101 = col_logical()
#> )
#> See spec(...) for full column specifications.

We then isolate the genotypes for a single marker and convert it to a matrix.

# isolate first SNP
g1 <- geno[1, - c(1:3)] # first 3 columns are SNP annotations!
t(as.matrix(g1)) -> g1m
as.matrix(pheno[, c(1, 6)]) -> phe16

We load the gemma2 R package before we decompose the kinship matrix into eigenvalues and eigenvectors.

library(gemma2)
e_out <- eigen2(kinship)

We then multiply the genotypes matrix by the eigenvectors:

t(g1m) %*% e_out$vectors -> X1U We then call the function - MphEM - that uses the EM algorithm: MphEM(eval = e_out$values,
X = X1U,
Y = t(phe16) %*% e_out$vectors, V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2), V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2) ) -> foo The output of MphEM has a complicated structure. It’s a list of lists. length(foo) #> [1] 1 class(foo) #> [1] "list" There is one list for every iteration of the EM algorithm. For each iteration of the EM algorithm, the list contains 17 components. sapply(FUN = length, X = foo) #> [1] 17 We can examine the code for MphEM to understand the outputs: MphEM #> function (max_iter = 10000, max_prec = 1/1e+06, eval, X, Y, V_g, #> V_e, verbose_output = FALSE) #> { #> n_size <- length(eval) #> c_size <- nrow(X) #> d_size <- nrow(Y) #> dc_size <- c_size * d_size #> XXt <- X %*% t(X) #> logl_const <- -(n_size - c_size) * d_size * log(2 * pi)/2 + #> d_size * log(det(XXt))/2 #> out <- list() #> for (t in 1:max_iter) { #> ep_out <- eigen_proc(V_g, V_e) #> logdet_Ve <- ep_out[[1]] #> UltVeh <- ep_out[[2]] #> UltVehi <- ep_out[[3]] #> D_l <- ep_out[[4]] #> cq_out <- calc_qi(eval, D_l, X) #> Qi <- cq_out[[1]] #> lndetQ <- cq_out[[2]] #> UltVehiY <- UltVehi %*% Y #> xHiy <- calc_XHiY(eval, D_l, X, UltVehiY) #> logl_new <- logl_const + MphCalcLogL(eval = eval, xHiy = xHiy, #> D_l = D_l, UltVehiY = UltVehiY, Qi = Qi) - 0.5 * #> n_size * logdet_Ve #> logl_new <- logl_new - 0.5 * (lndetQ - c_size * logdet_Ve) #> if (t > 1) { #> if (logl_new - logl_old < max_prec) { #> break #> } #> } #> logl_old <- logl_new #> co_out <- calc_omega(eval, D_l) #> OmegaU <- co_out[[1]] #> OmegaE <- co_out[[2]] #> UltVehiB <- UpdateRL_B(xHiy, Qi, d_size = nrow(Y)) #> UltVehiBX <- UltVehiB %*% X #> UltVehiU <- update_u(OmegaE, UltVehiY, UltVehiBX) #> UltVehiE <- update_e(UltVehiY, UltVehiBX, UltVehiU) #> U_hat <- t(UltVeh) %*% UltVehiU #> E_hat <- t(UltVeh) %*% UltVehiE #> B <- t(UltVeh) %*% UltVehiB #> cs_out <- calc_sigma(eval = eval, D_l = D_l, X = X, OmegaU = OmegaU, #> OmegaE = OmegaE, UltVeh = UltVeh, Qi = Qi) #> Sigma_ee <- cs_out[[1]] #> Sigma_uu <- cs_out[[2]] #> uv_out <- update_v(eval, U_hat, E_hat, Sigma_uu, Sigma_ee) #> V_g <- uv_out[[1]] #> V_e <- uv_out[[2]] #> if (verbose_output) { #> out[[t]] <- list(logl_new = logl_new, Vg = V_g, Ve = V_e, #> Sigma_uu = Sigma_uu, Sigma_ee = Sigma_ee, B = B, #> U_hat = U_hat, E_hat = E_hat, OmegaU = OmegaU, #> OmegaE = OmegaE, logdet_Ve = logdet_Ve, UltVeh = UltVeh, #> UltVehi = UltVehi, Dl = D_l, xHiy = xHiy, logl_const = logl_const, #> UltVehiU = UltVehiU) #> } #> else { #> out[[1]] <- list(logl_new = logl_new, Vg = V_g, Ve = V_e, #> Sigma_uu = Sigma_uu, Sigma_ee = Sigma_ee, B = B, #> U_hat = U_hat, E_hat = E_hat, OmegaU = OmegaU, #> OmegaE = OmegaE, logdet_Ve = logdet_Ve, UltVeh = UltVeh, #> UltVehi = UltVehi, Dl = D_l, xHiy = xHiy, logl_const = logl_const, #> UltVehiU = UltVehiU) #> } #> } #> if (length(out) == max_iter) { #> warning("EM algorithm didn't converge.") #> } #> return(out) #> } #> <bytecode: 0x55e7dbfb9fe0> #> <environment: namespace:gemma2> We see that the first component is the log-likelihood. We can verify that the log-likelihood uniformly increases (over EM iterations) with the following code: sapply(FUN = function(x)x[[1]], X = foo) -> loglik plot(loglik) ## Direct comparison of GEMMA results with those of gemma2 We can run GEMMA (v0.96) with a collection of data in text files that I distribute with this package. They are a subset of the data that come with GEMMA. According to GEMMA v0.96, we should first be updating Vg from its initial value: $\left(\begin{array}{cc} 1.91352 & 0\\ 0 & 0.530827 \end{array}\right)$ to $\left(\begin{array}{cc} 1.91352 & 0.0700492\\ 0.0700492 & 0.530827 \end{array}\right)$ in the first EM iteration. We then ask, when we give gemma2::MphEM() the same starting values and same data, do we see that the first update of $$V_g$$ coincides with that from GEMMA (above)? e_out <- eigen2(as.matrix(kinship)) center_kinship(as.matrix(kinship)) -> kinship_centered ec_out <- eigen2(kinship_centered) ## first two MphEM(eval = ec_out$values,
X = t(g1m) %*% ec_out$vectors, Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar00
MphEM(eval = e_out$values, X = t(g1m) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors, V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2), V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2) ) -> bar10 ## second two MphEM(eval = ec_out$values,
X = t(cbind(1, g1m)) %*% ec_out$vectors, Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2),
V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2)
) -> bar01

MphEM(eval = e_out$values, X = t(cbind(1, g1m)) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors, V_g = matrix(c(1.91352, 0, 0, 0.530827), nrow = 2), V_e = matrix(c(0.320028, 0, 0, 0.561589), nrow = 2) ) -> bar11 We see that the centering of the kinship matrix doesn’t change the results here. Look also at the convergence points: bar00[[length(bar00)]][[2]] #> [,1] [,2] #> [1,] 2.8112632 -0.4202874 #> [2,] -0.4202874 0.5839737 bar01[[length(bar01)]][[2]] #> [,1] [,2] #> [1,] 2.6853565 -0.3978194 #> [2,] -0.3978194 0.7337082 bar10[[length(bar10)]][[2]] #> [,1] [,2] #> [1,] 2.8112632 -0.4202874 #> [2,] -0.4202874 0.5839737 bar11[[length(bar11)]][[2]] #> [,1] [,2] #> [1,] 2.6853565 -0.3978194 #> [2,] -0.3978194 0.7337082 I see now that the X, as inputted to MphEM is printed 3 times in each of my output files. I’m not sure why this is. t(cbind(1, g1m)) %*% e_out$vectors
#>       [,1]         [,2]         [,3]         [,4]          [,5]
#> [1,] -10.0 8.209571e-10 3.423880e-10 8.799904e-10  3.741043e-10
#> [2,]  -8.2 5.214029e-01 1.593502e-01 2.047141e-01 -2.119088e-01
#>              [,6]          [,7]          [,8]         [,9]         [,10]
#> [1,] 5.193286e-10 -9.694780e-10 -3.196897e-10 6.163396e-10  5.822616e-10
#> [2,] 1.434733e-01 -2.042852e-01 -1.109870e-01 6.902534e-02 -2.025183e-01
#>              [,11]         [,12]         [,13]         [,14]         [,15]
#> [1,]  3.875483e-10 -5.354989e-10  4.053215e-10 -1.139380e-09 -1.024021e-09
#> [2,] -1.264110e-01  3.671612e-01 -3.176738e-01 -2.499061e-01 -4.861219e-01
#>              [,16]         [,17]        [,18]         [,19]         [,20]
#> [1,] -2.905492e-10 -4.402925e-10 3.027143e-10  4.031493e-10 -2.617206e-10
#> [2,]  5.697363e-02  1.413093e-01 4.235653e-01 -9.595364e-02  2.091495e-01
#>             [,21]         [,22]         [,23]         [,24]         [,25]
#> [1,] 9.958686e-10 -4.930074e-10 -1.710612e-10 -3.388599e-10 -7.418881e-10
#> [2,] 5.928482e-02  5.366786e-01 -2.983362e-01 -3.451517e-01  2.110054e-01
#>             [,26]         [,27]         [,28]        [,29]         [,30]
#> [1,] 5.581811e-10 -9.200053e-10 -9.621139e-11 4.626981e-10 -6.783181e-10
#> [2,] 1.807756e-01  2.627168e-01  1.685653e-01 7.195078e-01  4.955564e-02
#>             [,31]         [,32]         [,33]         [,34]         [,35]
#> [1,] 6.249647e-11 -2.818461e-10  4.676824e-10  2.764171e-10  1.328437e-10
#> [2,] 1.142125e-01 -9.182112e-01 -1.214265e+00 -1.380134e-01 -5.150178e-01
#>              [,36]        [,37]         [,38]         [,39]        [,40]
#> [1,]  5.698039e-10 8.583911e-11 -2.430915e-10 -5.453552e-10 1.853685e-10
#> [2,] -5.743965e-01 3.627758e-01  1.455956e-01 -3.543663e-01 4.206109e-01
#>              [,41]        [,42]        [,43]        [,44]        [,45]
#> [1,]  2.150999e-10 2.055196e-10 3.165112e-10 4.230766e-11 2.304351e-10
#> [2,] -1.125620e+00 4.366374e-03 4.839497e-01 5.627990e-01 1.881103e-02
#>              [,46]         [,47]        [,48]         [,49]         [,50]
#> [1,] -6.131784e-11  1.198367e-10 4.710556e-10  5.443409e-10  2.393824e-10
#> [2,]  1.534764e-03 -9.599920e-02 2.913756e-01 -1.670098e-01 -3.429212e-01
#>              [,51]        [,52]         [,53]         [,54]         [,55]
#> [1,] -8.476331e-12 1.665049e-10 -2.681996e-11 -2.718945e-10 -3.359916e-10
#> [2,]  9.796257e-02 8.098477e-01 -1.309711e+00 -5.100377e-02 -4.045755e-01
#>              [,56]         [,57]        [,58]         [,59]        [,60]
#> [1,]  2.576899e-10 -2.166084e-11 2.015501e-10 -8.589462e-12 3.397737e-10
#> [2,] -4.973091e-01  6.244735e-01 2.753963e-01  3.056504e-01 1.027420e+00
#>             [,61]         [,62]         [,63]         [,64]         [,65]
#> [1,] 1.629878e-10 -1.125520e-10 -9.604466e-11 -1.782041e-10  6.516876e-11
#> [2,] 5.171210e-01 -9.679433e-01  9.252804e-01 -2.548765e-01 -6.264708e-01
#>              [,66]         [,67]         [,68]        [,69]         [,70]
#> [1,] -2.114665e-10 -4.046121e-10  1.220681e-10 1.596400e-10 -4.175104e-10
#> [2,] -8.376359e-01 -6.859459e-01 -8.372176e-01 8.521224e-01  1.009573e-02
#>              [,71]         [,72]         [,73]         [,74]        [,75]
#> [1,] -2.283430e-10 -3.944047e-10 -1.824441e-10 -2.485945e-10 2.511755e-11
#> [2,]  2.286194e-01 -1.347910e+00  5.240299e-01  5.949179e-01 4.958164e-01
#>              [,76]         [,77]        [,78]        [,79]         [,80]
#> [1,] -2.587797e-11  1.795558e-10 4.519914e-12 2.866099e-11 -2.266703e-10
#> [2,]  1.359179e+00 -1.211559e-01 6.212741e-01 7.158448e-01 -9.853067e-01
#>              [,81]         [,82]         [,83]         [,84]         [,85]
#> [1,] -9.518052e-11 -7.734569e-11 -4.454652e-11 -5.285478e-11 -1.012893e-10
#> [2,]  1.570249e-01  7.387338e-01  1.681314e-02 -1.937415e+00  1.512452e-01
#>             [,86]         [,87]         [,88]         [,89]        [,90]
#> [1,] 3.718089e-11 -1.109072e-10  1.729715e-10 -4.520380e-11 2.238981e-10
#> [2,] 8.376756e-01  8.955207e-01 -1.930832e+00 -8.796979e-01 2.219885e-01
#>              [,91]        [,92]         [,93]        [,94]         [,95]
#> [1,] -3.103531e-12 1.040781e-10 -7.791080e-11 2.907737e-11  1.831045e-11
#> [2,]  5.667558e-01 1.251593e-01 -9.984738e-01 6.009364e-02 -1.296282e-01
#>              [,96]         [,97]         [,98]         [,99]        [,100]
#> [1,] -7.721866e-11 -8.969529e-11 -2.186182e-11 -3.175127e-11  2.421263e-11
#> [2,] -1.681463e+00  2.364884e-01  3.328094e-01 -5.143277e-01 -1.334516e+00

I see that the third instance of “X,” - in the file of GEMMA output - specifies that X has two rows, while the first two seem to say that X has only one row. Furthermore, it seems that the third instance of “X,” has the same entries that I get when calculating X %*% U - namely, the first column is -10 and -8.2, the remaining entries in the first row are near zero, and the remaining entries in the second row begin with 0.52.

I see now that I misread the output file from GEMMA. Within it, there are actually 3 calls (ie, per SNP) to MphEM. It’s the third call that we want to reproduce.

From the above output, we see that the initial values, for the third call to MphEM() are:

$V_g = \left(\begin{array}{cc} 2.73 & -0.39 \\ -0.39 & 0.74 \end{array} \right)$

and

$V_e = \left(\begin{array}{cc} 0.15 & 0.27 \\ 0.27 & 0.50 \end{array} \right)$

and, after a single update, the matrices are:

$V_g = \left(\begin{array}{cc} 2.74 & -0.40 \\ -0.40 & 0.74 \end{array} \right)$

and

$V_e = \left(\begin{array}{cc} 0.15 & 0.27 \\ 0.27 & 0.51 \end{array} \right)$

Can we get our gemma2 R code to do this?

MphEM(eval = e_out$values, X = t(cbind(1, g1m)) %*% e_out$vectors,
Y = t(phe16) %*% e_out$vectors, V_g = matrix(c(2.73, - 0.39, - 0.39, 0.74), nrow = 2), V_e = matrix(c(0.15, 0.27, 0.27, 0.50), nrow = 2) ) -> out1 out1[[1]][c(2:3)] #>$Vg
#>           [,1]       [,2]
#> [1,]  2.706773 -0.4119620
#> [2,] -0.411962  0.7428087
#>
#> $Ve #> [,1] [,2] #> [1,] 0.1531700 0.2761671 #> [2,] 0.2761671 0.5119295 Now try with ec_out MphEM(eval = ec_out$values,
X = t(cbind(1, g1m)) %*% ec_out$vectors, Y = t(phe16) %*% ec_out$vectors,
V_g = matrix(c(2.731324, - 0.394865, - 0.394865, 0.737116), nrow = 2),
V_e = matrix(c(0.147467, 0.272090, 0.272090, 0.502031), nrow = 2)
) -> out1
out1[[1]][c(2:3)]
#> $Vg #> [,1] [,2] #> [1,] 2.7372540 -0.4041481 #> [2,] -0.4041481 0.7434534 #> #>$Ve
#>           [,1]      [,2]
#> [1,] 0.1495667 0.2759642
#> [2,] 0.2759642 0.5091793

The results, above, are accurate to within rounding.

## References

Zhou, Xiang, and Matthew Stephens. 2014. “Efficient Multivariate Linear Mixed Model Algorithms for Genome-Wide Association Studies.” Nature Methods 11 (4). Nature Research: 407–9.