It’s better to start from the old lmm.py
version in order to gain understanding.
Its setG
and setK
methods are easy to understand, because they simply compute the economy decomposition $ K = U_1^T \Lambda_1 U_1 $, either by taking SVD of the matrix $ G $ (recall that $ K = GG^T $) or by spectral decomposition of $K$. Both methods take the a2
parameter, which is the mixing parameter, i.e. the kinship matrix $K $ equals $ (1-a_2)K_0 + a_2 K_1 $.
The next step is setting matrix $X$ (covariates) and vector $y$ (phenotypes). The methods setX
and sety
also compute the rotated versions $U_1^TX$ and $U_1^Ty$. Additionally, if $k$ (the rank of $K$) is less than the number of individuals, $N$, the matrices $(I-U_1U_1^T)X$ and $(I-U_1U_1^T)y$ are computed. The rationale can be found in thesis 3.4.3 under the subheading ‘Finding the maximum likelihood and parameters efficiently’ - the formula (3.27) includes these matrices.
The class also can find optimal a2
and h2
parameters for the model, where the total variance is given by $ h_2 ((1-a_2)K_0+a_2 K_1) + (1-h_2)I $. This task is handled by the findA2
method which searches over a 2d grid. In the inner loop it calls findH2
which is however not recommended to use - instead it’s suggested to use another method find_log_delta
, which optimizes the $\delta = \sigma^2 / \sigma^2_g$. This, in fact, is one of the crucial ideas. The thesis section 3.3 talks about this algorithmic optimization in terms of $\gamma = 1/\delta$. The formulation of the same idea in terms of $\delta$ can be found in the supplementary note for the paper ‘FaST linear mixed models for genome-wide association studies’.
Thus let’s now peek into find_log_delta
method.
Unfortunately, its docstring is not as polished as the ones for the previous methods but honestly states ‘#Need comments’.
The function body includes the following relationship: $ h_2 = 1/(\delta s_c + 1) $. It is not made clear what is meant by $s_c$ (sid_count
) here, but at least setting it to 1 makes sense because
So findH2
and find_log_delta
essentially differ only in the grid spacing, but it’s implied that this difference can boost the computations.
Ultimately, find_log_delta
calls nLLeval
method, passing h2
to it.
In it we suddenly see a commented out reference to lmm_cov.py
:
#if REML == True:
# # this needs to be fixed, please see test_gwas.py for details
# raise NotImplementedError("this feature is not ready to use at this time, please use lmm_cov.py instead")
But let’s not digress and move on to consider the most important method in this module.
nLLeval
The function computes log-likelihood, cleverly using the spectral decomposition of $K$.
- Since and all we need from this matrix is the value of its determinant, the first step is computing $\Lambda_1 + \delta I$. These are the following lines in the code:
if logdelta is not None:
delta = SP.exp(logdelta)
if delta is not None:
Sd = (self.S+delta)*scale
else:
Sd = (h2*self.S + (1.0-h2))*scale
logdetK = SP.log(Sd).sum()
-
Now it depends whether we’re dealing with low-rank case ($k < N$) or the general case. For simplicity, let’s start with the general case and also skip the optional proximal contamination part.
-
A bunch of matrices is computed:
UXS = self.UX / NP.lib.stride_tricks.as_strided(Sd, (Sd.size,self.UX.shape[1]), (Sd.itemsize,0))
UyS = self.Uy / Sd
XKX = UXS.T.dot(self.UX)
XKy = UXS.T.dot(self.Uy)
yKy = UyS.T.dot(self.Uy)
- The estimate $ \hat\beta$ is calculated as follows:
[SxKx,UxKx]= LA.eigh(XKX)
i_pos = SxKx>1E-10
beta = SP.dot(UxKx[:,i_pos],(SP.dot(UxKx[:,i_pos].T,XKy)/SxKx[i_pos]))
Comparing these expressions with the thesis equations 3.26-3.27 reveals that XKX
corresponds to , and XKy
to .
But instead of inverting the matrix its eigendecomposition is computed:
- The next step is estimating $\sigma^2_g$. Depending on the type of likelihood, its denominator is either $ N $ (ML) or $ N-D $ (REML).
- Finally, these estimates are put into the formula for the log-likelihood:
r2 = yKy-XKy.dot(beta)
if dof is None:#Use the Multivariate Gaussian
if REML:
XX = self.X.T.dot(self.X)
[Sxx,Uxx]= LA.eigh(XX)
logdetXX = SP.log(Sxx).sum()
logdetXKX = SP.log(SxKx).sum()
sigma2 = r2 / (N - D)
nLL = 0.5 * ( logdetK + logdetXKX - logdetXX + (N-D) * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )
else:
sigma2 = r2 / (N)
nLL = 0.5 * ( logdetK + N * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )
The ML formula is found in section 1.4 of the mentioned supplementary note.
Chapter 3 of the note covers the case of low-rank, which just adds a few extra terms into the equations.
Finally, chapter 4 provides the log-likelihood formula for the REML case.
The skipped piece of code dealing with proximal contamination is covered in yet another supplementary note as well as in the thesis (section 4.1). The difference between the two sources is minor, the former is closer to the code (i.e. $\delta$ is used instead of $\gamma=1/\delta$).