March 29, 2015
[LMM] literature overview: performance
March 27, 2015
[LMM] literature overview: approximate methods
March 15, 2015
[FaST-LMM] Proximal contamination
March 13, 2015
[FaST-LMM] REML estimate
March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)
March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)
March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)
March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2
February 27, 2015
[FaST-LMM] high-level overview, part 2
February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1
February 18, 2015
February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1
Now let’s focus on performance optimizations on the computing side. The methods discussed below still use exact formulas, but organize computations in a smart way. They happen to be implemented in
I start with High-Performance Mixed Models Based Genome-Wide Association Analysis with omicABEL software (August 2014), where two methods are compared, one using eigendecomposition and the other Cholesky decomposition.
To speed up the operations, tested SNPs are packed into batches, so that instead of relatively slow matrix-vector operations (BLAS-2) matrix-matrix are employed (BLAS-3). This suggestion first appeared in the paper Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures (2014), also there’s a less polished version of it on arxiv.org (2012).
It turns out that the eigendecomposition-based method is faster in case of multiple trait analysis, but Cholesky beats it when a single trait is analysed.
The authors test the method against FaST-LMM, and announce the victory:
They also provide an excellent tutorial here that I followed along. It’s slightly outdated, though:
The results are indeed well correlated, even for large values of $\chi^2$, and FaST-LMM is indeed beaten (timings on my laptop are similar to those in the tutorial).
Extremely large datasets
It goes without saying that there is no need to load all SNPs at once, and out-of-core algorithm is provided, featuring asynchronous I/O so that I/O and computations overlap.
But the authors go further and consider how to process huge number of individuals. All algorithms require storing $N\times N$ kinship matrix (or its eigenvectors, or Cholesky factors), which might not fit into memory of a single machine. The authors suggest use of Elemental library and provide a description of how to distribute matrix blocks across nodes.
Finally, there actually IS a tool allowing to speed up GWAS with GPU!
In 2012, Lucas Beyer from RWTH Aachen University defended his thesis, Exploiting Graphics Accelerators for Computational Biology. (It should not be surprising that the papers discussed above also include people from the same university.) Later, a paper appeared on
The work is available on Github in the form of fork of