[FaST-LMM] high-level overview, part 2

LMM(select) + PCs

The authors claim that, under some circumstances, this method gives results as good as those of LMM(all + select) while being much faster.

The algorithm is described in a supplementary note, which also contains a brief definition of probabilistic PCA. A more accessible description can be found at http://www.robots.ox.ac.uk/~cvrg/hilary2006/ppca.pdf.
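For reference, here is the standard probabilistic PCA model as formulated by Tipping and Bishop. The notation ($x$, $z$, $W$, $\mu$, $\sigma^2$, $d$, $q$) is my own choice and is not taken from the supplementary note. Each $d$-dimensional observation $x$ is generated from a $q$-dimensional latent variable $z$:

$$ x = W z + \mu + \varepsilon, \qquad z \sim N(0, I_q), \qquad \varepsilon \sim N(0, \sigma^2 I_d) $$

so that marginally

$$ x \sim N(\mu, \; W W^T + \sigma^2 I_d). $$

If $\lambda_1 \ge \dots \ge \lambda_d$ are the eigenvalues of the sample covariance matrix, $U_q$ holds the first $q$ eigenvectors and $\Lambda_q = \mathrm{diag}(\lambda_1, \dots, \lambda_q)$, the maximum-likelihood estimates are

$$ \sigma^2_{ML} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j, \qquad W_{ML} = U_q \, (\Lambda_q - \sigma^2_{ML} I_q)^{1/2} R $$

where $R$ is an arbitrary $q \times q$ rotation matrix. Because the model defines a proper likelihood, held-out data can be scored under it, which is what makes cross-validation over $q$ possible.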

The components are computed by the fastlmm.util.compute_auto_pcs function; the algorithm is referred to as ‘PCgeno’. To choose the best PCs, cross-validation is run (scikit-learn is used to generate the multiple splits into training and test sets).
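To make the idea of cross-validating the number of PCs concrete, here is a minimal sketch. It is not the code of compute_auto_pcs: the function name choose_num_pcs, the simulated data and the scoring criterion (held-out log-likelihood under the probabilistic PCA model, as returned by scikit-learn's PCA.score) are my assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import ShuffleSplit


def choose_num_pcs(X, candidates, n_splits=5, seed=0):
    """Pick the number of PCs with the best held-out PPCA log-likelihood.

    Hypothetical helper, not part of FaST-LMM.
    """
    # A fixed random_state makes every candidate see the same splits.
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    mean_scores = {}
    for k in candidates:
        fold_scores = []
        for train_idx, test_idx in splitter.split(X):
            model = PCA(n_components=k).fit(X[train_idx])
            # PCA.score is the average log-likelihood of the samples
            # under the probabilistic PCA model.
            fold_scores.append(model.score(X[test_idx]))
        mean_scores[k] = float(np.mean(fold_scores))
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores


# Simulated data with three strong latent directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
loadings = 3.0 * rng.normal(size=(3, 15))
X = latent @ loadings + 0.5 * rng.normal(size=(300, 15))

best_k, scores = choose_num_pcs(X, candidates=range(1, 7))
print(best_k, scores)
```

On data like this, the held-out score improves sharply until the number of components reaches the number of real latent directions and is roughly flat afterwards.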

The rest is not implemented in Python yet. Instead, the C++ executable fastlmmc is called to run GWAS.


Another available procedure is a search for pairs of SNPs whose interaction affects the phenotype (epistasis). It is implemented in the module fastlmm.association.epistasis.

The main function there is do_work, which takes an LMM object with a precomputed decomposition of $K$ and two sets of SNPs, $S_1$ and $S_2$.

The function works on two sets of SNPs rather than on a single pair because these jobs are meant to be distributed over a cluster, so each job has to be reasonably coarse-grained.
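One way to obtain such coarse-grained jobs is to cut the SNP list into blocks and create one job per pair of blocks. The sketch below only illustrates that idea; make_jobs and pairs_in_job are hypothetical names, and the actual partitioning used by FaST-LMM may differ.

```python
from itertools import combinations, combinations_with_replacement, product


def make_jobs(snp_ids, block_size):
    """Split SNP ids into blocks and return one job per unordered block pair."""
    blocks = [snp_ids[i:i + block_size]
              for i in range(0, len(snp_ids), block_size)]
    return list(combinations_with_replacement(blocks, 2))


def pairs_in_job(set_1, set_2):
    """List the SNP pairs a single job has to test (assumes unique SNP ids)."""
    if set_1 == set_2:
        # Same block twice: test each unordered pair once.
        return list(combinations(set_1, 2))
    return list(product(set_1, set_2))


snp_ids = ["snp%d" % i for i in range(10)]
jobs = make_jobs(snp_ids, block_size=4)
all_pairs = [pair for s1, s2 in jobs for pair in pairs_in_job(s1, s2)]
print(len(jobs), len(all_pairs))
```

With 10 SNPs and blocks of 4 there are 3 blocks and 6 jobs, which together cover each of the 45 unordered SNP pairs exactly once.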

The job first computes the matrix $X$, containing

  • covariate columns
  • SNP values encoded as 0/1/2, for all SNPs from the union of two provided sets
  • finally, products of SNP values from the two sets
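The layout of $X$ described in the list above can be sketched with NumPy as follows. The function build_design_matrix is a hypothetical helper, not a FaST-LMM function, and it assumes the two SNP sets are disjoint, so the union is simply their concatenation.

```python
import numpy as np


def build_design_matrix(covariates, snps_1, snps_2):
    """Stack covariates, the SNP columns of both sets, and all pairwise products.

    covariates: (n, c) array; snps_1: (n, m1) array; snps_2: (n, m2) array.
    """
    n = covariates.shape[0]
    # products[:, i * m2 + j] == snps_1[:, i] * snps_2[:, j]
    products = (snps_1[:, :, None] * snps_2[:, None, :]).reshape(n, -1)
    return np.hstack([covariates, snps_1, snps_2, products])


covariates = np.ones((4, 1))  # intercept only
snps_1 = np.array([[0, 1], [1, 2], [2, 0], [1, 1]], dtype=float)
snps_2 = np.array([[2], [1], [0], [1]], dtype=float)

X = build_design_matrix(covariates, snps_1, snps_2)
print(X.shape)  # 1 covariate + 2 + 1 SNP columns + 2 product columns
```

Computing the whole matrix once per job means the per-pair models only need to select columns from it.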

Then, for each pair, two models are compared.

In the null model, the $X$ matrix contains only the covariates and the SNP values of the pair.

The alternative model adds one more column, the product of the two SNP values. (If the SNP values are standardized to zero mean and unit variance, the mean of this column is the sample correlation of the two SNPs, so it is close to zero when they are uncorrelated.)

For both models, the log-likelihood is computed, and a likelihood-ratio test is performed to obtain the p-value.
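The comparison can be sketched as follows. This is a simplified illustration, not the FaST-LMM implementation: the names lmm_loglik and interaction_lrt are hypothetical, the data are simulated, and the variance ratio delta is held fixed at an arbitrary value, whereas in practice it would be estimated. The log-likelihood is that of $y \sim N(X\beta, \sigma^2 (K + \delta I))$ evaluated via the eigendecomposition $K = U \, \mathrm{diag}(S) \, U^T$.

```python
import numpy as np
from scipy.stats import chi2


def lmm_loglik(y, X, U, S, delta):
    """Maximized log-likelihood over beta and sigma^2 for a fixed delta."""
    n = y.shape[0]
    w = 1.0 / (S + delta)            # inverse eigenvalues of K + delta * I
    y_rot, X_rot = U.T @ y, U.T @ X  # rotate so the covariance is diagonal
    sw = np.sqrt(w)
    # Weighted least squares for beta in the rotated space.
    beta, *_ = np.linalg.lstsq(X_rot * sw[:, None], y_rot * sw, rcond=None)
    resid = y_rot - X_rot @ beta
    sigma2 = np.sum(w * resid ** 2) / n
    return -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(S + delta)) + n)


def interaction_lrt(y, covariates, snp_a, snp_b, U, S, delta):
    """Likelihood-ratio test of the product column (1 degree of freedom)."""
    X_null = np.column_stack([covariates, snp_a, snp_b])
    X_alt = np.column_stack([X_null, snp_a * snp_b])
    stat = 2.0 * (lmm_loglik(y, X_alt, U, S, delta)
                  - lmm_loglik(y, X_null, U, S, delta))
    return stat, chi2.sf(stat, df=1)


# Simulated example: kinship from 50 standardized SNPs, strong interaction.
rng = np.random.default_rng(1)
n, m = 500, 50
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)
S, U = np.linalg.eigh(G @ G.T / m)
S = np.clip(S, 0.0, None)            # remove tiny negative round-off values

snp_a = rng.binomial(2, 0.3, size=n).astype(float)
snp_b = rng.binomial(2, 0.4, size=n).astype(float)
covariates = np.ones((n, 1))
y = 1.0 * snp_a * snp_b + rng.normal(size=n)

stat, p_value = interaction_lrt(y, covariates, snp_a, snp_b, U, S, delta=1.0)
print(stat, p_value)
```

Since the two models are nested and differ by a single column, the test statistic is compared against a chi-squared distribution with one degree of freedom.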

Testing sets of SNPs

Not only pairs but also sets of SNPs can be tested for association. This functionality is available in the module fastlmm.association.snp_set. The implementation is in fastlmm.association.FastLmmSet.

TODO: find description of the algorithm