[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

February 16, 2015

This post starts a series devoted to studying the FaST-LMM source code, located here.

In this series, ‘thesis’ refers to the publication ‘Linear mixed models for genome-wide association studies’ (C. Lippert).

First, some definitions:

  • $\gamma$ is the ratio $\sigma^2_g / \sigma^2$ of the variance parameters (genetic variance and level of environmental noise, respectively; see thesis 2.2);
  • $K$ is the genetic similarity matrix, ‘that quantifies the genetic relationship between individuals based on the causal loci’ (also thesis 2.2);
  • $S = I - XX^\dagger$ is the orthogonal projection onto the complement of the column space of $X$ (thesis 3.1.1).

The workhorse is the LMM class, versions of which can be found in as many as three files - lmm2k.py, lmm.py, and lmm_cov.py. They are all similar; the latter is the most recently updated, but some of its docstrings are out of date.

Nevertheless, let’s start with lmm_cov.py. One of the differences from lmm.py is that the LMM.setK method signature doesn’t include the mixing parameter - its optimal value is searched for automatically.

Linreg class

Linreg(X).regress(Y) returns residuals, i.e. $ Y - X\beta = Y - XX^\dagger Y $. Here $ X^\dagger $ is a pseudo-inverse of $ X $ computed by numpy.linalg.pinv.

With our definition of the matrix $S$, its effect can be written simply as $SY$.
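To make this concrete, here is a minimal self-contained sketch of such a class (illustrative only - the real Linreg keeps more state), checking that regress(Y) is exactly $SY$ and that $S$ is idempotent:

```python
import numpy as np

class Linreg:
    """Minimal stand-in for the Linreg class: projects out fixed effects X."""
    def __init__(self, X):
        self.X = X
        self.Xdagger = np.linalg.pinv(X)           # pseudo-inverse X^+
        self.D = np.linalg.matrix_rank(X)          # dimension of the column space

    def regress(self, Y):
        # residuals Y - X beta = Y - X X^+ Y = S Y
        return Y - self.X.dot(self.Xdagger.dot(Y))

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
Y = rng.randn(10, 4)
S = np.eye(10) - X.dot(np.linalg.pinv(X))
resid = Linreg(X).regress(Y)
assert np.allclose(resid, S.dot(Y))                # regress(Y) == S Y
assert np.allclose(S.dot(S), S)                    # S is idempotent
```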

LMM class


Let’s look at the setSU_fromK method first:

def setSU_fromK(self):
    N = self.K.shape[0]
    D = self.linreg.D
    ar = np.arange(self.K.shape[0])
    self.K[ar, ar] += 1.0                  # add I to K in place
    K_ = self.linreg.regress(Y=self.K)     # S(K + I)
    K_ = self.linreg.regress(Y=K_.T)       # S(K + I)S
    [self.S, self.U] = la.eigh(K_)
    self.U = self.U[:, D:N]                # drop the D zero eigenpairs
    self.S = self.S[D:N] - 1.0             # eigenvalues of SKS

Here we see that K_ is first set to $S(K+I)$, and then to $S(S(K+I))^T = S(K+I)S$ (note that $K$ and $S$ are both symmetric).

Thesis 3.1.1 (Maximum likelihood estimation) contains a reference to Lemma C.15, because the expression for the likelihood function includes $S(\gamma K + I)S$.

Lemma C.15 allows one to quickly compute the eigenpairs of a matrix of this form for any $\gamma$ by considering the matrix $S(K + I)S$:

Let $ H_1 $ be defined as $ K + I $ and $ H_\gamma $ be defined as $ \gamma K + I $, where $ K $ is a positive semi-definite matrix and $ \gamma \geq 0 $. Further, let the economy spectral decomposition of $ SH_1S $ be $ U_S(\Sigma + I_{N-D})U_S^T $. Then, the economy spectral decomposition of $ SH_\gamma S $ is given by $ U_S(\gamma\Sigma + I_{N-D})U_S^T $, where $ (\gamma\Sigma + I_{N-D}) $ is a diagonal matrix holding the $ N - D $ non-zero eigenvalues of $ SH_\gamma S $, and the first $ N - D $ eigenvectors, given as columns of $ U_S $, are unchanged.

Note that although the documentation of numpy.linalg.eigh suggests that the returned eigenvalues are not necessarily ordered, a quick check shows that at least in the real-valued case they are sorted in ascending order. That’s why the last $ N - D $ eigenpairs are taken (the first $D$ are zeros).
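That quick check is easy to reproduce (later versions of the NumPy documentation do state that the eigenvalues are returned in ascending order):

```python
import numpy as np

rng = np.random.RandomState(1)
A = rng.randn(6, 6)
A = A + A.T                       # a real symmetric matrix
w, V = np.linalg.eigh(A)
assert np.all(np.diff(w) >= 0)    # eigenvalues come back in ascending order
assert np.allclose(V.dot(np.diag(w)).dot(V.T), A)
```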

Thus we conclude that

  • self.S corresponds to $\Sigma$;
  • self.U holds eigenvectors of $S(K + I)S$ corresponding to non-zero eigenvalues.
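The whole construction can be verified numerically. The following sketch mimics setSU_fromK with explicit matrices (all variable names here are illustrative, not from the codebase):

```python
import numpy as np

rng = np.random.RandomState(2)
N, D = 12, 3
X = rng.randn(N, D)                            # fixed effects
G = rng.randn(N, 5)
K = G.dot(G.T)                                 # a PSD similarity matrix
S_proj = np.eye(N) - X.dot(np.linalg.pinv(X))  # the projection S

# eigendecomposition of S(K + I)S, as in setSU_fromK
w, U = np.linalg.eigh(S_proj.dot(K + np.eye(N)).dot(S_proj))
U = U[:, D:]                                   # drop the D zero eigenpairs
Sigma = w[D:] - 1.0                            # eigenvalues of SKS

# Lemma C.15: S(gamma*K + I)S = U diag(gamma*Sigma + 1) U^T for any gamma
for gamma in (0.0, 0.5, 3.7):
    lhs = S_proj.dot(gamma * K + np.eye(N)).dot(S_proj)
    rhs = U.dot(np.diag(gamma * Sigma + 1.0)).dot(U.T)
    assert np.allclose(lhs, rhs)
```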

Update from 04 May 2015:

  • Although the lemma is stated in terms of $ \gamma $, the same can be said about matrices of the form $ K + \delta I $, from which it is easily seen that $ \Sigma $ holds the eigenvalues of $ SKS $ on its diagonal (set $ \delta = 0 $). And since the lemma states that the $ N - D $ eigenvectors stay the same no matter what $ \delta $ is, we obtain the eigendecomposition of $ SKS $ this way.

  • The projection onto the subspace orthogonal to the fixed effects ($X$) is there because lmm_cov always computes REML (restricted maximum likelihood) estimates. Another sign of this is the absence of a REML parameter in the method signatures, which is still mentioned in comments copy-pasted from old code.

The rationale and approach are described in thesis 2.2.3.

  • It’s illustrative to see how the lemma is proved for this simple case:

$$ SH_\gamma S = S(\gamma K + I)S = \gamma SKS + S^2 = \gamma U_S\Sigma U_S^T + U_SU_S^T = U_S(\gamma\Sigma + I_{N-D})U_S^T $$

Here the idempotency of $S$ was used, together with the fact that $S = U_SU_S^T$, where the columns of $U_S$ are eigenvectors of $S(K+I)S$. The latter fact is less trivial; its proof is found in the thesis.


Some branches in the body of the companion method setSU_fromG are trivial, so let’s skip them and look at the most interesting piece of code:

PxG = self.linreg.regress(Y=self.G)     # SG
[self.U, self.S, V] = la.svd(PxG, False, True)
inonzero = self.S > 1E-10               # keep non-zero singular values
self.S = self.S[inonzero]
self.S = self.S * self.S                # square them to get eigenvalues
self.U = self.U[:, inonzero]

The $SG$ matrix is decomposed as $U\Sigma^{1/2}V^T$. This corresponds to decomposing $SKS = SGG^TS$ as $U\Sigma U^T$.
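A numerical sanity check of this correspondence (variable names are illustrative, mirroring the snippet above):

```python
import numpy as np

rng = np.random.RandomState(3)
N, D = 12, 3
X = rng.randn(N, D)
G = rng.randn(N, 5)
S_proj = np.eye(N) - X.dot(np.linalg.pinv(X))   # the projection S
SG = S_proj.dot(G)

# thin SVD of SG, as in the snippet above
U, s, Vt = np.linalg.svd(SG, full_matrices=False)
nonzero = s > 1e-10
U, Sigma = U[:, nonzero], s[nonzero] ** 2       # squared singular values

# SKS = SG (SG)^T = U diag(Sigma) U^T
assert np.allclose(S_proj.dot(G).dot(G.T).dot(S_proj),
                   U.dot(np.diag(Sigma)).dot(U.T))
```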