[LMM] literature overview: approximate methods

March 29, 2015
[LMM] literature overview: performance

March 27, 2015
[LMM] literature overview: approximate methods

March 15, 2015
[FaST-LMM] Proximal contamination

March 13, 2015
[FaST-LMM] REML estimate

March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)

March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)

March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)

March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2

February 27, 2015
[FaST-LMM] high-level overview, part 2

February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1

February 18, 2015
[FaST-LMM] fastlmm/inference/lmm.py

February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

Let’s now look at the big picture. There are two recent reviews of various methods:

‘Advantages and pitfalls in the application of mixed-model association methods’ — January 2014
- the authors strongly suggest excluding the tested marker from GRM, either by correcting a single global GRM as it’s done in FaST-LMM, or simply by using the LOCO approach (leave-one-chromosome-out), i.e. computing as many kinship matrices as there are chromosomes
- a table is provided giving computational cost of EMMAX, FaST-LMM, GEMMA, GRAMMAR-Gamma, and GCTA methods (the second column is a bit incorrect, though - in some places there should be $\mathcal{O}(MN^2)$ instead of $\mathcal{O}(N^3)$)
- GRAMMAR-Gamma is an approximate method that gives $\mathcal{O}(MN)$ time of SNP testing instead of $\mathcal{O}(MN^2)\ $
  ($M$ is the number of SNPs, $N$ - that of individuals)
‘Accounting for relatedness in family-based association studies: application to Genetic Analysis Workshop 18 data’ — June 2014
- tested methods: EMMAX, FaST-LMM, GEMMA, and GenABEL (FASTA/GRAMMAR-Gamma)
- the authors conclude that ‘from a practical point of view <…> it makes little difference to the results which method/software package is used, and the user can make the choice of package on the basis of personal taste or computational speed/convenience.’

Thus we shouldn’t disregard approximate methods if speed is important. At the very least, they can filter out SNPs that are definitely not significant. Also one shouldn’t forget that LMM is itself an approximation, typically the model is built with assumptions of infinitesimal genetic architecture and gaussian noise.

Approximate methods

They are based on the score test statistic instead of the LRT statistic considered gold-standard. The two statistics are asymptotically equivalent (read about the score test on Wikipedia).

Score test

Applying the general theory to our case goes as follows:

the parameter of interest is $\beta$, and we test the hypothesis $\mathcal{H}_0: \beta = 0$
$\mathcal{U}(0)$, the partial derivative of LL at $\beta = 0$, equals to $X^TV^{-1}y\ $ (here $V = \sigma^2(h^2K + (1-h^2)I)\,$)
Fisher information, $\mathcal{I}(0)$, is $X^TV^{-1}X$
thus the score test statistic is $(X^TV^{-1}y)^T(X^TV^{-1}X)^{-1}(X^TV^{-1}y)$

In the simplest and most common case $X$ is a single tested SNP, so that the multipliers in the above expression are scalars, and

$T^2_{\mathrm{score}} = \frac{(X^TV^{-1}y)^2}{X^TV^{-1}X}$

GRAMMAR-Gamma approximation

The computation of $T^2_{\mathrm{score}}$ requires $\mathcal{O}(N^2)$ operations - for calculating $V^{-1}y$ and $V^{-1}X$. The first, $V^{-1}y$, can be computed once for all SNPs, but $V^{-1}X$ must be computed each time.

Authors of GRAMMAR-Gamma paper (2012) suggest that the ratio of $X^TV^{-1}X$ and $X^TX$ is nearly constant in practice, and therefore suggest another statistic:

$T^2_{\mathrm{new}} = \frac{(X^TV^{-1}y)^2}{X^TX},$

and $T^2_{\mathrm{score}}$ is approximated as $T^2_{\mathrm{new}}/\gamma$ , where $\gamma$ is a correction factor, for which the analytical expression is derived:

$\gamma = \frac{1}{\sigma^2h^2}\left(1 - \frac{\sigma^2(1-h^2)}{N-1}\mathrm{tr}(V^{-1})\right)$

This approach allows to approximate $T^2_{\mathrm{score}}$ statistic in $\mathcal{O}(N)$ time for each SNP instead of $\mathcal{O}(N^2)$ needed for exact computation.

BOLT-LMM

Borrowing some ideas from earlier methods, BOLT-LMM avoids matrix decomposition (or inversion) that other methods require, which has complexity $\mathcal{O}(MN^2)$ The approach is described in a recent paper published in Nature and freely available on bioarxiv.

Computing $V^{-1}y$ and $V^{-1}X$ is done through conjugate gradient method, taking around $\mathcal{O}(\sqrt{N})$ iterations to converge (empirical estimate made by the authors), on each iteration computing matrix-vector products ($\mathcal{O}(MN)$), leading to total runtime of $\mathcal{O}(MN^{1.5})$ for each such computation. The trick is to compute $V^{-1}X$ only for a few (pseudo-)randomly taken non-causal SNPs, and the suggested statistic is

$T^2_{\mathrm{BOLT-LMM-inf}} = \frac{(X^TV^{-1}y)^2}{c_{\mathrm{inf}}},$

where $c_{\mathrm{inf}}$ is taken to be the following ratio

$\frac{\mathrm{Mean} (X_{\mathrm{test}}^TV^{-1}y)^2}{\mathrm{Mean}\dfrac{(X_{\mathrm{test}}^TV^{-1}y)^2}{X_{\mathrm{test}}^TV^{-1}X_{\mathrm{test}}}}$

with the averages taken over a few (authors suggests 30) random SNPs.

Another trick is in estimating the variance parameter ($h^2$), also allowing to avoid decomposition/inversion of $V$.

In short, a few smart approximations allow to perform GWAS in $\mathcal{O}(MN^{1.5})$ time instead of the usual $\mathcal{O}(MN^2)$, retaining almost the same power. The method is highly promising, I will look into it further.