# [FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

March 29, 2015
[LMM] literature overview: performance
March 27, 2015
[LMM] literature overview: approximate methods
March 15, 2015
[FaST-LMM] Proximal contamination
March 13, 2015
[FaST-LMM] REML estimate
March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)
March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)
March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)
March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2
February 27, 2015
[FaST-LMM] high-level overview, part 2
February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1
February 18, 2015
[FaST-LMM] fastlmm/inference/lmm.py
February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

This post starts the series devoted to studying FaST-LMM source code located here.

In these series, ‘thesis’ refers to the publication ‘Linear mixed models for genome-wide association studies’ (C. Lippert)

First of all, some definitions are to be provided:

• $\gamma$ is the ratio $\sigma^2_g / \sigma^2$ of the variance parameters (genetic variance and level of environmental noise, respectively; see thesis 2.2);
• $K$ - genetic similarity matrix, ‘that quantifies the genetic relationship between individuals based on the causal loci’ (also thesis 2.2);
• $S = I - XX^\dagger$ (thesis 3.1.1)

The working horse is LMM class, versions of which can be found in as many as three files - lmm2k.py, lmm.py, and lmm_cov.py. They are all similar. The latter is more recently updated, but some of docstrings are not up-to-date.

Nevertheless, let’s start from lmm_cov.py. One of the differences it has with lmm.py is that LMM.setK method signature doesn’t include the mixing parameter - its optimal value is searched for automatically.

## Linreg class

Linreg(X).regress(Y) returns residuals, i.e. $Y - X\beta = Y - XX^\dagger Y$. Here $X^\dagger$ is a pseudo-inverse of $X$ computed by numpy.linalg.pinv.

With our definition of the matrix $S$, its effect can be written simply as $SY$.

## LMM class

### setSU_fromK

def setSU_fromK(self):
N = self.K.shape[0]
D = self.linreg.D
ar = np.arange(self.K.shape[0])
self.K[ar,ar]+=1.0
K_ = self.linreg.regress(Y=self.K)
K_ = self.linreg.regress(Y=K_.T)
[self.S,self.U] = la.eigh(K_)
self.U = self.U[:,D:N]
self.S = self.S[D:N] - 1.0

Here we see that K_ is first set to $S(K+I)$, and then to $S(S(K+I))^T = S(K+I)S\$ (note that $K$ and $S$ are both symmetric).

Thesis 3.1.1. (Maximum likelihood estimation) contains a reference to the lemma C.15, because the expression for the likelihood function includes $S(\gamma K + I) S$.

Lemma C.15 allows to quickly compute eigenpairs of a matrix of this form with any $\gamma$ by considering the matrix $S(K + I)S$:

Let $H_1$ be defined as $K + I$ and $H_γ$ be defined as $γK + I$, where $K$ is a positive semi-definite matrix and $γ \geq 0$. Further, let the economy spectral decomposition of $SH_1S$ be $U_S ( \Sigma + I_{N-D} ) U_S^T$ . Then, the economy spectral decomposition of $SH_γ S$ is given by $U (γΣ + I_{N-D} ) U^T$ , where $(γΣ + I_{N −D} )$ is a diagonal matrix holding the $N − D$ non-zero eigenvalues of $SH_γ S$ and the first $N − D$ eigenvectors given as columns of $U_S$ are unchanged.

Note that although documentation of numpy.linalg.eigh suggests that returned eigenvalues are not necessarily ordered, quick check shows that at least in the real-valued case they are sorted in ascending order. That’s why last $N - D$ eigenpairs are taken (first D are zeros).

Thus we conclude that

• self.S corresponds to $\Sigma$;
• self.U holds eigenvectors of $S(K + I)S$ corresponding to non-zero eigenvalues.

Update from 04 May 2015:

• Although the lemma is stated in terms of $\gamma$, the same can be said about matrices of form $K + \delta I$, and from that it is easily seen that $\Sigma$ holds eigenvalues of $SKS$ on the diagonal (set $\delta = 0$). And since the lemma states that $N - D$ eigenvectors stay the same no matter what $\delta$ is, we get the eigendecomposition of $SKS$ in this way.

• The projection onto the subspace orthogonal to fixed effects ($X$) is because REML (restricted maximum likelihood) estimates are calculated in lmm_cov (always! another sign of it is absence of REML parameter in method signatures, which is still seen in copy-pasted comments from old code)

The rationale and approach are described in the thesis 2.2.3

• It’s illustrative to see how the lemma is proved for this simple case: $SKS = S(K+I)S - S = SH_1S - S = U_S(\Sigma + I)U_S^T - U_SU_S^T = U_S\left(\Sigma + I - I\right)U_S^T = U_S\Sigma U_S^T$

Here idempotency of $S$ was used and the fact that $S = U_SU_S^T$ where columns of $U_S$ are eigenvectors of $S(K+I)S$. The latter fact is less trivial, its proof is found in the thesis.

## setSU_fromG

Some branches in the function body are trivial, so let’s skip them and look at the most interesting piece of code:

PxG = self.linreg.regress(Y=self.G)
try:
[self.U,self.S,V] = la.svd(PxG,False,True)
inonzero = self.S > 1E-10
self.S = self.S[inonzero]
self.S = self.S * self.S
self.U = self.U[:,inonzero]

The $SG$ matrix is decomposed as $U\Sigma^{1/2}V^T$. This corresponds to decomposing $SKS = SGG^TS$ as $U\Sigma U^T$.