[FaST-LMM] fastlmm/inference/lmm.py

March 29, 2015
[LMM] literature overview: performance

March 27, 2015
[LMM] literature overview: approximate methods

March 15, 2015
[FaST-LMM] Proximal contamination

March 13, 2015
[FaST-LMM] REML estimate

March 11, 2015
[FaST-LMM] comparison with PyLMM (continued)

March 10, 2015
[FaST-LMM] comparison with PyLMM (practice)

March 9, 2015
[FaST-LMM] comparison with PyLMM (theory)

March 3, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 2

February 27, 2015
[FaST-LMM] high-level overview, part 2

February 25, 2015
[FaST-LMM] high-level overview of the codebase, part 1

February 18, 2015
[FaST-LMM] fastlmm/inference/lmm.py

February 16, 2015
[FaST-LMM] fastlmm/inference/lmm_cov.py, part 1

It’s better to start from the old lmm.py version in order to gain understanding.

Its setG and setK methods are easy to understand, because they simply compute the economy decomposition $ K = U_1^T \Lambda_1 U_1 $, either by taking SVD of the matrix $ G $ (recall that $ K = GG^T $) or by spectral decomposition of $K$. Both methods take the a2 parameter, which is the mixing parameter, i.e. the kinship matrix $K $ equals $ (1-a_2)K_0 + a_2 K_1 $.

The next step is setting matrix $X$ (covariates) and vector $y$ (phenotypes). The methods setX and sety also compute the rotated versions $U_1^TX$ and $U_1^Ty$. Additionally, if $k$ (the rank of $K$) is less than the number of individuals, $N$, the matrices $(I-U_1U_1^T)X$ and $(I-U_1U_1^T)y$ are computed. The rationale can be found in thesis 3.4.3 under the subheading ‘Finding the maximum likelihood and parameters efficiently’ - the formula (3.27) includes these matrices.

The class also can find optimal a2 and h2 parameters for the model, where the total variance is given by $ h_2 ((1-a_2)K_0+a_2 K_1) + (1-h_2)I $. This task is handled by the findA2 method which searches over a 2d grid. In the inner loop it calls findH2 which is however not recommended to use - instead it’s suggested to use another method find_log_delta, which optimizes the $\delta = \sigma^2 / \sigma^2_g$. This, in fact, is one of the crucial ideas. The thesis section 3.3 talks about this algorithmic optimization in terms of $\gamma = 1/\delta$. The formulation of the same idea in terms of $\delta$ can be found in the supplementary note for the paper ‘FaST linear mixed models for genome-wide association studies’.

Thus let’s now peek into find_log_delta method. Unfortunately, its docstring is not as polished as the ones for the previous methods but honestly states ‘#Need comments’.
The function body includes the following relationship: $ h_2 = 1/(\delta s_c + 1) $. It is not made clear what is meant by $s_c$ (sid_count) here, but at least setting it to 1 makes sense because $\sigma^2_g(K + \delta I) = h_2 K + (1-h_2) I \Rightarrow \delta = \frac{1-h_2}{h_2} \Rightarrow 1-h_2 = h_2\delta \Rightarrow h_2 = 1/(1 + \delta)\,.$ So findH2 and find_log_delta essentially differ only in the grid spacing, but it’s implied that this difference can boost the computations.

Ultimately, find_log_delta calls nLLeval method, passing h2 to it. In it we suddenly see a commented out reference to lmm_cov.py:

#if REML == True:
        #    # this needs to be fixed, please see test_gwas.py for details
        #    raise NotImplementedError("this feature is not ready to use at this time, please use lmm_cov.py instead")

But let’s not digress and move on to consider the most important method in this module.

nLLeval

The function computes log-likelihood, cleverly using the spectral decomposition of $K$.

Since $K+\delta I = U_1^T\Lambda_1 U_1 + \delta I= U_1^T(\Lambda_1 + \delta I)U_1$ and all we need from this matrix is the value of its determinant, the first step is computing $\Lambda_1 + \delta I$. These are the following lines in the code:

if logdelta is not None:
    delta = SP.exp(logdelta)

if delta is not None:
    Sd = (self.S+delta)*scale
else:
    Sd = (h2*self.S + (1.0-h2))*scale

logdetK = SP.log(Sd).sum()

Now it depends whether we’re dealing with low-rank case ($k < N$) or the general case. For simplicity, let’s start with the general case and also skip the optional proximal contamination part.
A bunch of matrices is computed:

UXS = self.UX / NP.lib.stride_tricks.as_strided(Sd, (Sd.size,self.UX.shape[1]), (Sd.itemsize,0))
UyS = self.Uy / Sd

XKX = UXS.T.dot(self.UX)
XKy = UXS.T.dot(self.Uy)
yKy = UyS.T.dot(self.Uy)

The estimate $ \hat\beta$ is calculated as follows:

[SxKx,UxKx]= LA.eigh(XKX)
i_pos = SxKx>1E-10
beta = SP.dot(UxKx[:,i_pos],(SP.dot(UxKx[:,i_pos].T,XKy)/SxKx[i_pos]))

Comparing these expressions with the thesis equations 3.26-3.27 reveals that XKX corresponds to $C_{X,X}$ , and XKy to $c_{X,y}$ . But instead of inverting the matrix its eigendecomposition is computed: $C_{X,X} = U\Lambda U^T \Rightarrow C_{X,X}^{-1}c_{X,y} = U(U^T c_{X,y})\Lambda^{-1}\,.$

The next step is estimating $\sigma^2_g$. Depending on the type of likelihood, its denominator is either $ N $ (ML) or $ N-D $ (REML).
Finally, these estimates are put into the formula for the log-likelihood:

r2 = yKy-XKy.dot(beta)

if dof is None:#Use the Multivariate Gaussian
    if REML:
        XX = self.X.T.dot(self.X)
        [Sxx,Uxx]= LA.eigh(XX)
        logdetXX  = SP.log(Sxx).sum()
        logdetXKX = SP.log(SxKx).sum()
        sigma2 = r2 / (N - D)
        nLL =  0.5 * ( logdetK + logdetXKX - logdetXX + (N-D) * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )
    else:
        sigma2 = r2 / (N)
        nLL =  0.5 * ( logdetK + N * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )

The ML formula is found in section 1.4 of the mentioned supplementary note. Chapter 3 of the note covers the case of low-rank, which just adds a few extra terms into the equations. Finally, chapter 4 provides the log-likelihood formula for the REML case. The skipped piece of code dealing with proximal contamination is covered in yet another supplementary note as well as in the thesis (section 4.1). The difference between the two sources is minor, the former is closer to the code (i.e. $\delta$ is used instead of $\gamma=1/\delta$).