[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

February 16, 2015

This post starts a series devoted to studying the FaST-LMM source code, located here.

In this series, ‘thesis’ refers to the publication ‘Linear mixed models for genome-wide association studies’ (C. Lippert).

First of all, a few definitions:

  • $\gamma$ is the ratio $\sigma^2_g / \sigma^2_e$ of the variance parameters (genetic variance and level of environmental noise, respectively; see thesis 2.2);
  • $K$ is the genetic similarity matrix ‘that quantifies the genetic relationship between individuals based on the causal loci’ (also thesis 2.2);
  • $S = I - XX^\dagger$ (thesis 3.1.1).
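
Three facts about $S$ will be used repeatedly below: it is symmetric, idempotent ($S^2 = S$), and it annihilates the columns of $X$. A quick sanity check with random data (a sketch, not from the codebase):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 3)                       # fixed-effects design matrix
S = np.eye(10) - X.dot(np.linalg.pinv(X))  # S = I - X X^+

assert np.allclose(S, S.T)                 # symmetric
assert np.allclose(S.dot(S), S)            # idempotent: S^2 = S
assert np.allclose(S.dot(X), 0.0)          # projects out the fixed effects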

The workhorse is the LMM class, versions of which can be found in as many as three files: lmm2k.py, lmm.py, and lmm_cov.py. They are all similar; the latter is the most recently updated, but some of its docstrings are not up to date.

Nevertheless, let’s start with lmm_cov.py. One of its differences from lmm.py is that the LMM.setK method signature doesn’t include the mixing parameter: its optimal value is found automatically.


Linreg class

Linreg(X).regress(Y) returns the residuals, i.e. $Y - X\beta = Y - XX^\dagger Y$. Here $X^\dagger$ is the pseudo-inverse of $X$, computed by numpy.linalg.pinv.

With our definition of the matrix $S$, its effect can be written simply as $SY$.
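
For reference, here is a minimal standalone sketch of what regress computes (a hypothetical version; the real Linreg class in lmm_cov.py is more involved):

import numpy as np

def regress(X, Y):
    # residuals of Y after regressing out X: Y - X X^+ Y = S Y
    return Y - X.dot(np.linalg.pinv(X).dot(Y))

rng = np.random.RandomState(0)
X, Y = rng.randn(10, 3), rng.randn(10, 4)
S = np.eye(10) - X.dot(np.linalg.pinv(X))
assert np.allclose(regress(X, Y), S.dot(Y))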

LMM class

setSU_fromK

def setSU_fromK(self):
  N = self.K.shape[0]
  D = self.linreg.D
  ar = np.arange(self.K.shape[0])
  self.K[ar,ar] += 1.0                  # K := K + I, in place
  K_ = self.linreg.regress(Y=self.K)    # S(K + I)
  K_ = self.linreg.regress(Y=K_.T)      # S(K + I)S
  [self.S,self.U] = la.eigh(K_)         # eigenvalues in ascending order
  self.U = self.U[:,D:N]                # drop the D zero eigenpairs
  self.S = self.S[D:N] - 1.0            # non-zero eigenvalues, shifted by -1

Here we see that K_ is first set to $S(K+I)$, and then to $S(S(K+I))^T = S(K+I)S$ (note that $K$ and $S$ are both symmetric).


Thesis section 3.1.1 (maximum likelihood estimation) references Lemma C.15, because the expression for the likelihood function includes $S(\gamma K + I)S$.

Lemma C.15 allows one to quickly compute the eigenpairs of a matrix of this form for any $\gamma$ from the decomposition of $S(K + I)S$:

Let $H_1$ be defined as $K + I$ and $H_\gamma$ be defined as $\gamma K + I$, where $K$ is a positive semi-definite matrix and $\gamma \geq 0$. Further, let the economy spectral decomposition of $SH_1S$ be $U_S(\Sigma + I_{N-D})U_S^T$. Then, the economy spectral decomposition of $SH_\gamma S$ is given by $U_S(\gamma\Sigma + I_{N-D})U_S^T$, where $(\gamma\Sigma + I_{N-D})$ is a diagonal matrix holding the $N - D$ non-zero eigenvalues of $SH_\gamma S$, and the first $N - D$ eigenvectors, given as columns of $U_S$, are unchanged.

Note that although the documentation of numpy.linalg.eigh suggests that the returned eigenvalues are not necessarily ordered, a quick check shows that, at least in the real-valued case, they are sorted in ascending order. That’s why the last $N - D$ eigenpairs are taken (the first $D$ are zeros).
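
The ordering is easy to check empirically (a quick sketch with a random symmetric matrix):

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 6)
A = A + A.T                      # symmetrize
w, V = np.linalg.eigh(A)
assert np.all(np.diff(w) >= 0)   # eigenvalues come out in ascending order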


Thus we conclude that

  • self.S corresponds to $\Sigma$;
  • self.U holds the eigenvectors of $S(K + I)S$ corresponding to the non-zero eigenvalues.

Update from May 4, 2015:

  • Although the lemma is stated in terms of $\gamma$, the same holds for matrices of the form $K + \delta I$, from which it is easy to see that $\Sigma$ holds the eigenvalues of $SKS$ on the diagonal (set $\delta = 0$). And since the lemma states that the $N - D$ eigenvectors stay the same no matter what $\delta$ is, we get the eigendecomposition of $SKS$ this way (see the numerical check after this list).

  • The projection onto the subspace orthogonal to the fixed effects ($X$) is there because lmm_cov always computes REML (restricted maximum likelihood) estimates; another sign of this is the absence of a REML parameter in the method signatures, although one still appears in comments copy-pasted from old code. The rationale and approach are described in thesis section 2.2.3.

  • It’s illustrative to see how the lemma is proved for this simple case:

$$ S(K + \delta I)S = SKS + \delta S^2 = U_S \Sigma U_S^T + \delta U_S U_S^T = U_S (\Sigma + \delta I_{N-D}) U_S^T $$

Here the idempotency of $S$ was used, along with the fact that $S = U_S U_S^T$, where the columns of $U_S$ are the eigenvectors of $S(K+I)S$. The latter fact is less trivial; its proof can be found in the thesis.
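
The whole argument can be checked numerically; below is a sketch with random data (the shapes and $\gamma$ are arbitrary), mirroring the steps of setSU_fromK:

import numpy as np

rng = np.random.RandomState(1)
N, D, gamma = 12, 3, 2.5
X = rng.randn(N, D)
G = rng.randn(N, N)
K = G.dot(G.T)                               # positive semi-definite
S = np.eye(N) - X.dot(np.linalg.pinv(X))     # rank N - D projection

w, U = np.linalg.eigh(S.dot(K + np.eye(N)).dot(S))
U, Sigma = U[:, D:], w[D:] - 1.0             # as in setSU_fromK

# Lemma C.15: the non-zero eigenvalues of S(gamma*K + I)S are gamma*Sigma + 1,
wg = np.linalg.eigvalsh(S.dot(gamma * K + np.eye(N)).dot(S))
assert np.allclose(wg[D:], gamma * Sigma + 1.0)
# and Sigma holds the non-zero eigenvalues of SKS (the delta = 0 case):
assert np.allclose(np.linalg.eigvalsh(S.dot(K).dot(S))[D:], Sigma)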

setSU_fromG

Some branches in the function body are trivial, so let’s skip them and look at the most interesting piece of code:

PxG = self.linreg.regress(Y=self.G)            # SG
try:
   [self.U,self.S,V] = la.svd(PxG,False,True)  # economy SVD of SG
   inonzero = self.S > 1E-10                   # keep non-zero singular values
   self.S = self.S[inonzero]
   self.S = self.S * self.S                    # square the singular values
   self.U = self.U[:,inonzero]
except la.LinAlgError:                         # fallback branch omitted in this excerpt
   ...

The matrix $SG$ is decomposed as $U\Sigma^{1/2}V^T$. This corresponds to decomposing $SKS = SGG^TS$ as $U\Sigma U^T$: indeed, $SG(SG)^T = U\Sigma^{1/2}V^T V\Sigma^{1/2}U^T = U\Sigma U^T$.
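
A quick check of this correspondence on random data (a sketch; the shapes are arbitrary):

import numpy as np

rng = np.random.RandomState(2)
N, D = 12, 3
X, G = rng.randn(N, D), rng.randn(N, 5)
S = np.eye(N) - X.dot(np.linalg.pinv(X))

PxG = S.dot(G)                                # what linreg.regress(Y=G) returns
U, s, Vt = np.linalg.svd(PxG, full_matrices=False)
inonzero = s > 1e-10
U, Sigma = U[:, inonzero], s[inonzero] ** 2   # squared singular values

# U diag(Sigma) U^T reproduces SKS = S G G^T S:
assert np.allclose(U.dot(np.diag(Sigma)).dot(U.T), S.dot(G.dot(G.T)).dot(S))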