[FaST-LMM] comparison with PyLMM (theory)

March 29, 2015
[LMM] literature overview: performance
March 27, 2015
[LMM] literature overview: approximate methods
March 15, 2015
[FaST-LMM] Proximal contamination
March 13, 2015
[FaST-LMM] REML estimate
March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)
March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)
March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)
March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2
February 27, 2015
[FaST-LMM] high-level overview, part 2
February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1
February 18, 2015
[FaST-LMM] fastlmm/inference/lmm.py
February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

[NEWS]

During the last month, the github repository of FaST-LMM has seen some updates, including much awaited addition of comments in lmm_cov.py, so that it is now not as puzzling as before. In particular, the filename now makes sense for me: ‘cov’ stays for ‘covariates regressed out’.

I’ve downloaded source code of PyLMM which looks extremely simplistic after so much time spent on FaST-LMM, and below are my notes about the differences between the two.

The next step will be running the relevant pieces of code and comparing their results and runtime.

Kinship matrix calculation

PyLMM (lmm.py, calculateKinship)

Given $W$, “an n x m matrix encoding SNP minor alleles”, it performs matrix multiplication to get $\frac1m W W^T$. Nothing fancy.

FaST-LMM

The distinguishing feature of the method is that it makes a distinction between low-rank case and full-rank case.

In full-rank case, when the number of SNPs is equal or greater than the number of individuals, it computes $K$ as $G G^T$, as usual.

However, if the number of individuals exceeds the number of SNPs (i.e. the matrix $K$ is not full-rank), the method computes SVD of $G$ instead of taking eigendecomposition of $K$. All the remaining computations are then performed with the $U$ and $S$, where $K = USU^T$.

Handling XKX matrix

This difference is going to be seen only when the number of covariates is large, and that’s, I guess, extremely rare.

PyLMM (lmm.py, getMLsoln)

Straightforwardly computes inverse of $(U^T X)^T S (U^T X)$, and then does two matrix multiplications to obtain $\hat{\beta} = ((U^T X)^T S (U^T X))^{-1}(U^TX)^T S U^T y = (X^T K X)^{-1} XKy$

FaST-LMM (full-rank case)

Instead of inverting $X K X$, computes its eigendecomposition. Part of the rationale is that in REML log-likelihood calculation, we need the determinant of this matrix, and where PyLMM computes it as numpy.log(scipy.linalg.det(XX)) (which internally takes a decomposition), FaST-LMM simply calculates scipy.log(SxKx).sum()

Refitting

PyLMM

If refit parameter set to True, each SNP is included in the X matrix, and the mixing parameter is reestimated before computing the p-value. Otherwise, log-likelihood is computed with the mixing coefficient taken from the null model.

Default value is false, presumably the effect of a single SNP is not large to cause a significant change in $\delta$

FaST-LMM

I couldn’t find such an option. Seemingly FaST-LMM aims to be used for larger datasets, where this approach is unaffordable.

Fitting a model with two variance components

PyLMM: misc.py

With fixed $n$ (wgrids, default 100), for multiple LMMs with kinship matrices of form $\frac{k}{n}K_1 + \frac{n-k}{n}K_2$ likelihood is computed, and the model with the largest value is selected.

Although it’s possible to use two kernels, there’s no way to select which SNPs will go into the foreground kernel.

FaST-LMM: feature_selection/feature_selection_two_kernel.py

This much more computation-intensive procedure does the following by searching on a 2D grid (the first dimension is the number of SNPs, and the second one is the mixing coefficient):

This method selects SNPs for a foreground kernel, given a background kernel. The background kernel is obtained from all SNPs and is intended to capture population structure, as well as family structure.

Selection is performed by first ranking SNPs based on a LMM using the full kernel according to their univariate association with the phenotype and then choosing a number to cut off using cross-validation.

Cross-validation also adds to the run time.

What is missing in PyLMM (compared to FaST-LMM)

Two-kernel case

As mentioned above, feature selection and estimating the mixing parameters are not implemented.

Proximal contamination handling

FaST-LMM applies smart updates to many involved matrices so as to virtually eliminate the tested SNP (and a few nearby) from the kinship matrix.

Efficient evaluation of various expressions

The thesis devotes a section (3.3.1) to the issue of efficient evaluation.