It’s better to start from the old
lmm.py version in order to gain understanding.
setK methods are easy to understand, because they simply compute the economy decomposition $ K = U_1^T \Lambda_1 U_1 $, either by taking SVD of the matrix $ G $ (recall that $ K = GG^T $) or by spectral decomposition of $K$. Both methods take the
a2 parameter, which is the mixing parameter, i.e. the kinship matrix $K $ equals $ (1-a_2)K_0 + a_2 K_1 $.
The next step is setting matrix $X$ (covariates) and vector $y$ (phenotypes). The methods
sety also compute the rotated versions $U_1^TX$ and $U_1^Ty$. Additionally, if $k$ (the rank of $K$) is less than the number of individuals, $N$, the matrices $(I-U_1U_1^T)X$ and $(I-U_1U_1^T)y$ are computed. The rationale can be found in thesis 3.4.3 under the subheading ‘Finding the maximum likelihood and parameters efficiently’ - the formula (3.27) includes these matrices.
The class also can find optimal
h2 parameters for the model, where the total variance is given by $ h_2 ((1-a_2)K_0+a_2 K_1) + (1-h_2)I $. This task is handled by the
findA2 method which searches over a 2d grid. In the inner loop it calls
findH2 which is however not recommended to use - instead it’s suggested to use another method
find_log_delta, which optimizes the $\delta = \sigma^2 / \sigma^2_g$. This, in fact, is one of the crucial ideas. The thesis section 3.3 talks about this algorithmic optimization in terms of $\gamma = 1/\delta$. The formulation of the same idea in terms of $\delta$ can be found in the supplementary note for the paper ‘FaST linear mixed models for genome-wide association studies’.
Thus let’s now peek into
Unfortunately, its docstring is not as polished as the ones for the previous methods but honestly states ‘#Need comments’.
The function body includes the following relationship: $ h_2 = 1/(\delta s_c + 1) $. It is not made clear what is meant by $s_c$ (
sid_count) here, but at least setting it to 1 makes sense because
find_log_delta essentially differ only in the grid spacing, but it’s implied that this difference can boost the computations.
nLLeval method, passing
h2 to it.
In it we suddenly see a commented out reference to
#if REML == True:
# # this needs to be fixed, please see test_gwas.py for details
# raise NotImplementedError("this feature is not ready to use at this time, please use lmm_cov.py instead")
But let’s not digress and move on to consider the most important method in this module.
The function computes log-likelihood, cleverly using the spectral decomposition of $K$.
- Since and all we need from this matrix is the value of its determinant, the first step is computing $\Lambda_1 + \delta I$. These are the following lines in the code:
if logdelta is not None:
delta = SP.exp(logdelta)
if delta is not None:
Sd = (self.S+delta)*scale
Sd = (h2*self.S + (1.0-h2))*scale
logdetK = SP.log(Sd).sum()
Now it depends whether we’re dealing with low-rank case ($k < N$) or the general case. For simplicity, let’s start with the general case and also skip the optional proximal contamination part.
A bunch of matrices is computed:
UXS = self.UX / NP.lib.stride_tricks.as_strided(Sd, (Sd.size,self.UX.shape), (Sd.itemsize,0))
UyS = self.Uy / Sd
XKX = UXS.T.dot(self.UX)
XKy = UXS.T.dot(self.Uy)
yKy = UyS.T.dot(self.Uy)
- The estimate $ \hat\beta$ is calculated as follows:
i_pos = SxKx>1E-10
beta = SP.dot(UxKx[:,i_pos],(SP.dot(UxKx[:,i_pos].T,XKy)/SxKx[i_pos]))
Comparing these expressions with the thesis equations 3.26-3.27 reveals that
XKX corresponds to , and
XKy to .
But instead of inverting the matrix its eigendecomposition is computed:
- The next step is estimating $\sigma^2_g$. Depending on the type of likelihood, its denominator is either $ N $ (ML) or $ N-D $ (REML).
- Finally, these estimates are put into the formula for the log-likelihood:
r2 = yKy-XKy.dot(beta)
if dof is None:#Use the Multivariate Gaussian
XX = self.X.T.dot(self.X)
logdetXX = SP.log(Sxx).sum()
logdetXKX = SP.log(SxKx).sum()
sigma2 = r2 / (N - D)
nLL = 0.5 * ( logdetK + logdetXKX - logdetXX + (N-D) * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )
sigma2 = r2 / (N)
nLL = 0.5 * ( logdetK + N * ( SP.log(2.0*SP.pi*sigma2) + 1 ) )
The ML formula is found in section 1.4 of the mentioned supplementary note.
Chapter 3 of the note covers the case of low-rank, which just adds a few extra terms into the equations.
Finally, chapter 4 provides the log-likelihood formula for the REML case.
The skipped piece of code dealing with proximal contamination is covered in yet another supplementary note as well as in the thesis (section 4.1). The difference between the two sources is minor, the former is closer to the code (i.e. $\delta$ is used instead of $\gamma=1/\delta$).