# [LMM] literature overview: performance

March 29, 2015
[LMM] literature overview: performance
March 27, 2015
[LMM] literature overview: approximate methods
March 15, 2015
[FaST-LMM] Proximal contamination
March 13, 2015
[FaST-LMM] REML estimate
March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)
March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)
March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)
March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2
February 27, 2015
[FaST-LMM] high-level overview, part 2
February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1
February 18, 2015
[FaST-LMM] fastlmm/inference/lmm.py
February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

Now let’s focus on performance optimizations on the computing side. The methods discussed below still use exact formulas, but organize computations in a smart way. They happen to be implemented in omicABEL package (GPLv3) and its fork, cuOmicABEL.

## omicABEL

I start with High-Performance Mixed Models Based Genome-Wide Association Analysis with omicABEL software (August 2014), where two methods are compared, one using eigendecomposition and the other Cholesky decomposition.

To speed up the operations, tested SNPs are packed into batches, so that instead of relatively slow matrix-vector operations (BLAS-2) matrix-matrix are employed (BLAS-3). This suggestion first appeared in the paper Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures (2014), also there’s a less polished version of it on arxiv.org (2012).

It turns out that the eigendecomposition-based method is faster in case of multiple trait analysis, but Cholesky beats it when a single trait is analysed.

The authors test the method against FaST-LMM, and announce the victory:

In case of single-trait analysis, our results are somewhat less impressive, and our CLAK-Chol solution outperforms advanced current methods (e.g. FaST-LMM) by about one order of magnitude for large-sample-size problems. It is worth mentioning that the latter speed-up becomes possible because we show that for single-trait GWAS problems our CLAK-Chol algorithm is superior to CLAK-Eig, but other current methods are actually implementing solutions similar to our CLAK-Eig algorithm to address the single-trait GWAS problem.

They also provide an excellent tutorial here that I followed along. It’s slightly outdated, though:

• The program name was changed from CLAK-GWAS to OmicABEL, and the links to binaries are found here
• The name of output file produced by reshuffle tool has been reshuffled into data_chi.txt from chi_data.txt, and the option has been renamed from --chi2 to --chi and now requires a (positive integer) threshold, so I used --chi2=1. I’m a bit annoyed that I can’t specify a float or zero, but OK, nobody cares about non-significant SNPs.

The results are indeed well correlated, even for large values of $\chi^2$, and FaST-LMM is indeed beaten (timings on my laptop are similar to those in the tutorial).

### Extremely large datasets

There is another related paper (almost the same set of authors) focusing more on the algorithmic & computational aspects:
High performance solutions for big-data GWAS; on arxiv.org since March 2014.

It goes without saying that there is no need to load all SNPs at once, and out-of-core algorithm is provided, featuring asynchronous I/O so that I/O and computations overlap.

But the authors go further and consider how to process huge number of individuals. All algorithms require storing $N\times N$ kinship matrix (or its eigenvectors, or Cholesky factors), which might not fit into memory of a single machine. The authors suggest use of Elemental library and provide a description of how to distribute matrix blocks across nodes.

## cuOmicABEL

Finally, there actually IS a tool allowing to speed up GWAS with GPU! In 2012, Lucas Beyer from RWTH Aachen University defended his thesis, Exploiting Graphics Accelerators for Computational Biology. (It should not be surprising that the papers discussed above also include people from the same university.) Later, a paper appeared on arxiv.org, Streaming Data from HDD to GPUs for Sustained Peak Performance. As the name suggests, a clever buffering scheme is proposed, and the testing shows that it leads to near-ideal scaling.

The work is available on Github in the form of fork of omicABEL: https://github.com/lucasb-eyer/cuOmicABEL; it doesn’t seem to be merged, and there’s been no progress made since spring 2013. Hopefully it works.