March 29, 2015
Let’s now look at the big picture. There are two recent reviews of various methods:
Thus we shouldn't disregard approximate methods if speed is important. At the very least, they can filter out SNPs that are definitely not significant. Also, one shouldn't forget that the LMM is itself an approximation: the model is typically built on the assumptions of an infinitesimal genetic architecture and Gaussian noise.

Approximate methods

They are based on the score test statistic instead of the LRT statistic, which is considered the gold standard. The two statistics are asymptotically equivalent (see the Wikipedia article on the score test).

Score test

Applying the general theory to our case, we test $\beta = 0$ in the model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, V)$, where $V$ is the covariance matrix of the fitted null model. The score statistic is

$$T^2_{\mathrm{score}} = y^\top V^{-1} X \,(X^\top V^{-1} X)^{-1}\, X^\top V^{-1} y.$$

In the simplest and most common case $X$ is a single tested SNP, so the multipliers in the above expression are scalars, and

$$T^2_{\mathrm{score}} = \frac{(X^\top V^{-1} y)^2}{X^\top V^{-1} X}.$$
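To make the bookkeeping concrete, here is a minimal numpy sketch of the exact statistic. This is not FaST-LMM code: the sizes, the toy covariance $V$ and all variable names are made up for illustration, and in practice $V$ would come from the fitted null model rather than being built by hand.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
N, M = 500, 2000                                  # individuals, SNPs (toy sizes)
G = rng.standard_normal((N, M))                   # standardized genotypes
V = 0.5 * (G @ G.T) / M + 0.5 * np.eye(N)         # toy V = sigma_g^2 * GRM + sigma_e^2 * I
y = rng.standard_normal(N)                        # mean-centered phenotype

VF = cho_factor(V)                                # factorize V once
Vinv_y = cho_solve(VF, y)                         # V^{-1} y, shared by all SNPs

def t2_score(x):
    """Exact score statistic for one SNP; the V^{-1} x solve is the O(N^2) part."""
    Vinv_x = cho_solve(VF, x)
    return (x @ Vinv_y) ** 2 / (x @ Vinv_x)

T2_exact = np.array([t2_score(G[:, j]) for j in range(M)])
```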
GRAMMAR-Gamma approximation

The computation of $T^2_{\mathrm{score}}$ requires $\mathcal{O}(N^2)$ operations per SNP, for calculating $V^{-1}y$ and $V^{-1}X$. The first, $V^{-1}y$, can be computed once and shared by all SNPs, but $V^{-1}X$ must be recomputed for every SNP. The authors of the GRAMMAR-Gamma paper (2012) observe that the ratio of $X^\top V^{-1}X$ to $X^\top X$ is nearly constant across SNPs in practice, and therefore suggest another statistic in which $X^\top V^{-1}X$ is replaced by $\gamma\, X^\top X$:

$$T^2_{\mathrm{score}} \approx \frac{(X^\top V^{-1} y)^2}{\gamma\, X^\top X},$$

where $\gamma$ is a correction factor for which an analytical expression is derived in the paper. This approach approximates the $T^2_{\mathrm{score}}$ statistic in $\mathcal{O}(N)$ time per SNP instead of the $\mathcal{O}(N^2)$ needed for the exact computation. (Small numerical sketches of this idea and of the BOLT-LMM solve are given at the end of the post.)

BOLT-LMM

Borrowing some ideas from earlier methods, BOLT-LMM avoids the matrix decomposition (or inversion) that other methods require, which has complexity $\mathcal{O}(MN^2)$. The approach is described in a recent paper published in Nature Genetics and freely available on bioRxiv. Computing $V^{-1}y$ and $V^{-1}X$ is done with the conjugate gradient method, which takes around $\mathcal{O}(\sqrt{N})$ iterations to converge (an empirical estimate made by the authors) and computes matrix-vector products ($\mathcal{O}(MN)$) on each iteration, leading to a total runtime of $\mathcal{O}(MN^{1.5})$ for each such solve. The trick is to compute $V^{-1}X$ only for a few (pseudo-)randomly chosen non-causal SNPs, and the suggested statistic is

$$\frac{(X^\top V^{-1} y)^2}{c_{\mathrm{inf}}\, X^\top X},$$

where $c_{\mathrm{inf}}$ is taken to be the ratio of the averages of $X^\top V^{-1} X$ and $X^\top X$, with the averages taken over a few (the authors suggest 30) random SNPs. Another trick is in estimating the variance parameter ($h^2$), which also avoids decomposition/inversion of $V$. In short, a few smart approximations allow GWAS to be performed in $\mathcal{O}(MN^{1.5})$ time instead of the usual $\mathcal{O}(MN^2)$, while retaining almost the same power. The method is highly promising; I will look into it further.
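To make the two approximations above concrete, here are two small sketches. First, the GRAMMAR-Gamma idea of a nearly constant correction factor. It continues the toy example from the score-test sketch above (reusing `G`, `M`, `VF`, `Vinv_y`, `T2_exact`, `rng` and `cho_solve`) and is purely illustrative: the paper derives an analytical expression for $\gamma$, whereas here it is simply estimated empirically from a handful of SNPs.

```python
idx = rng.choice(M, size=30, replace=False)                       # a few random SNPs
num = np.mean([G[:, j] @ cho_solve(VF, G[:, j]) for j in idx])    # average of x^T V^{-1} x
den = np.mean([G[:, j] @ G[:, j] for j in idx])                   # average of x^T x
gamma = num / den                                                 # empirical correction factor

# O(N) per SNP: only dot products with the shared V^{-1} y, no further solves with V
T2_approx = np.array([(G[:, j] @ Vinv_y) ** 2 / (gamma * (G[:, j] @ G[:, j]))
                      for j in range(M)])
print(np.corrcoef(T2_exact, T2_approx)[0, 1])                     # should be high on this toy data
```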
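Second, the conjugate-gradient part of the BOLT-LMM idea: solving for $V^{-1}y$ without ever forming the $N \times N$ matrix $V$, using only $\mathcal{O}(MN)$ matrix-vector products with the genotype matrix. This is a rough, self-contained illustration, not BOLT-LMM code: it assumes the infinitesimal model with already-estimated variance components (the values of `sigma_g2` and `sigma_e2` are made up) and shows none of the method's other machinery (the non-infinitesimal prior, variance-component estimation, calibration).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(1)
N, M = 500, 2000
G = rng.standard_normal((N, M))                    # standardized genotypes
y = rng.standard_normal(N)                         # phenotype
sigma_g2, sigma_e2 = 0.5, 0.5                      # variance components, assumed already estimated

def apply_V(v):
    # V v = sigma_g^2 * (G G^T / M) v + sigma_e^2 * v, via two O(MN) products; V is never formed
    return sigma_g2 * (G @ (G.T @ v)) / M + sigma_e2 * v

V_op = LinearOperator((N, N), matvec=apply_V)
Vinv_y, info = cg(V_op, y)                         # iterative solve; info == 0 means converged

# sanity check against the dense solve (feasible only because this toy N is small)
V_dense = sigma_g2 * (G @ G.T) / M + sigma_e2 * np.eye(N)
print(info, np.max(np.abs(Vinv_y - np.linalg.solve(V_dense, y))))
```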