[FaST-LMM] Proximal contamination

March 29, 2015

Today I decided to do some practical testing of removing nearby SNPs: does it really help, and if so, by how much?

Simulating phenotypes

I decided to try a different, larger dataset; one such dataset ships with the pylmm distribution. However, I ran into an issue with it. I don't know how the phenotype values in pylmm/data/snps.132k.clean.noX.fake.phenos were generated, but the Shapiro-Wilk test suggests they follow a plain normal distribution. More disturbingly, h2 is estimated to be almost zero. It seems that there are no causal SNPs at all.
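Here is a quick way to check this; a minimal sketch that assumes the phenotype value sits in the last whitespace-separated column of that file:

import numpy as np
from scipy import stats

data = np.genfromtxt("pylmm/data/snps.132k.clean.noX.fake.phenos")
y = data[:, -1] if data.ndim > 1 else data   # last column holds the phenotype
y = y[~np.isnan(y)]                          # drop missing values, if any
print stats.shapiro(y)                       # (W statistic, p-value)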

So I decided to simulate phenotype values on my own.

First of all, I downloaded the GCTA software, which can simulate phenotypes according to an LMM:

$ wget http://www.complextraitgenomics.com/software/gcta/gcta_1.24.4.zip
$ unzip gcta_1.24.4.zip

Then I moved to the ‘data/’ directory of PyLMM, selected a few SNP IDs from the snps.132k.clean.noX.bim file, and simulated a phenotype:

$ cat causal.snps.fake 
mm37-4-133676920
mm37-8-65761568
mm37-14-67815624
mm37-17-79322387
mm37-2-102530353
mm37-4-63910807
mm37-16-74943366
mm37-14-107893320
mm37-15-70067681
mm37-7-13055310
mm37-14-88557274
mm37-18-77348115
$ ./gcta64 --bfile snps.132k.clean.noX --simu-qt  --out simulated.phenos --simu-causal-loci causal.snps.fake

It generated a file with phenotype values and a .par file listing the selected SNPs’ effects:

$ cat simulated.phenos.par 
QTL	RefAllele	Frequency	Effect
mm37-2-102530353	1	0.467511	0.476187
mm37-4-63910807	1	0.325677	-2.4254
mm37-4-133676920	1	0.412418	0.874929
mm37-7-13055310	1	0.487124	-0.349036
mm37-8-65761568	1	0.498354	-1.22581
mm37-14-67815624	1	0.240277	-0.655073
mm37-14-88557274	1	0.488778	0.56847
mm37-14-107893320	1	0.283429	1.04148
mm37-15-70067681	1	0.419196	0.0930597
mm37-16-74943366	1	0.388843	-0.484555
mm37-17-79322387	1	0.465428	0.282487
mm37-18-77348115	1	0.365039	1.03661

The values in the phenotype file don’t seem to follow a normal distribution (Shapiro-Wilk p-value is 0.03), which is good. The distribution looks like this:

summary(read.table("simulated.phenos.phen")$V3)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-42.37000  -8.70700   0.27810   0.03152   8.77100  40.03000

The effects are small compared to the noise, but that only makes it more interesting: let’s see how well the procedure handles this. The relatively large number of causal SNPs also adds complexity.

The estimated mixing parameter h2 is around 0.075 and doesn’t change much when the SNPs are included in $X$ and the model is refit.
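As a reminder from the earlier posts, h2 is the mixing weight of the genetic variance component in the model

$$ y = X\beta + g + e, \qquad g \sim \mathcal{N}(0,\ \sigma_g^2 K), \quad e \sim \mathcal{N}(0,\ \sigma_e^2 I), \qquad h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}, $$

so h2 ≈ 0.075 means the GSM explains only a small share of the phenotypic variance here.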

Assessing p-values for causal SNPs

Let’s read the SNPs and collect the causal ones into arrays.

import numpy as np
from pylmm import input   # pylmm's PLINK reader

causal_ids = open("causal.snps.fake").read().split("\n")
causal_ids = set(causal_ids[:-1]) # last line is empty

# using pylmm to read files
bfile = "snps.132k.clean.noX"
pheno = "simulated.phenos.phen"
IN = input.plink(bfile, type='b', phenoFile=pheno, normGenotype=True)
i = 0
snps = []
causal_snps = []

for snp, id in IN:
    if np.var(snp) == 0.0:   # skip monomorphic SNPs
        continue
    if id in causal_ids:
        causal_snps.append((snp, id, i)) # indexes will be helpful later
    snps.append(snp)
    i += 1

snps = np.array(snps).T   # individuals x SNPs

import fastlmm.util.standardizer as stdizer
stdizer.Unit().standardize(snps) # set mean=0, var=1 per SNP
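The lmm_cov object used below is an instance of the LMM class from fastlmm/inference/lmm_cov.py covered in the earlier posts. A minimal sketch of how it can be set up; the constructor arguments X, Y, G are assumptions based on those posts, and the phenotype path and intercept-only covariates are my choices:

import numpy as np
import fastlmm.inference.lmm_cov as lmm_cov_module

# phenotype: third column of the GCTA output (FID, IID, value)
y = np.loadtxt("simulated.phenos.phen", usecols=[2]).reshape((-1, 1))

# covariates: just an intercept column
X0 = np.ones((snps.shape[0], 1))

# GSM built from all standardized SNPs, scaled so that K = G G^T
# carries the usual 1/#SNPs normalization
G = snps / np.sqrt(snps.shape[1])

lmm_cov = lmm_cov_module.LMM(X=X0, Y=y, G=G)   # assumed constructor signature
h2 = lmm_cov.findH2()['h2']                    # ~0.075 on this data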

And now let’s compute p-values for the causal SNPs using the LMM built above (here I use lmm_cov.py):

import scipy.stats as st

h2 = lmm_cov.findH2()['h2']

for snp, id, i in causal_snps:
    x = snp.reshape((snp.shape[0], 1))
    X = np.hstack((x, X0))       # tested SNP plus covariates

    r2 = lmm_cov.nLLeval_2K(h2, snps=X, penalty=1e-8)
    b2 = r2['beta'][:,0]
    varbeta = r2['variance_beta'][:,0]
    pv = st.chi2.sf(b2*b2/varbeta, 1)   # Wald test, 1 degree of freedom
    print "| %s\t| %.5f |" % (id, pv[0])
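For reference, the test used here is a Wald test: under the null hypothesis $\beta = 0$,

$$ \frac{\hat{\beta}^2}{\widehat{\operatorname{Var}}(\hat{\beta})} \sim \chi^2_1, $$

which is what st.chi2.sf(b2*b2/varbeta, 1) turns into a p-value.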

The output shows that only three of them are identified (p < 0.05):

SNP                  p-value
mm37-2-102530353     0.33208
mm37-4-63910807      0.00010
mm37-4-133676920     0.06684
mm37-7-13055310      0.29550
mm37-8-65761568      0.00119
mm37-14-67815624     0.68446
mm37-14-88557274     0.48970
mm37-14-107893320    0.00561
mm37-15-70067681     0.58416
mm37-16-74943366     0.74048
mm37-17-79322387     0.56762
mm37-18-77348115     0.43114

Now let’s apply the code for avoiding proximal contamination and see how it influences the p-values. In lmm_cov.py, the matrix WW is computed in a strange way; I had to change it to

WW = np.eye(num_exclude) - computeAKA(Sd=Sd, denom=denom, UA=UW, UUA=UUW)

and replace every -= with += for the computation to make sense.
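The sign flip is consistent with the Woodbury identity for a low-rank downdate of the GSM: removing the excluded SNPs $W$ from $K$ gives (up to the variance-component scaling and SNP weights)

$$ (K - WW^\top)^{-1} = K^{-1} + K^{-1} W \,(I - W^\top K^{-1} W)^{-1}\, W^\top K^{-1}, $$

so the inner matrix is $I - W^\top K^{-1} W$ (hence np.eye(num_exclude) minus the computeAKA(...) term), and the correction enters with a plus rather than a minus.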

Then I couldn’t figure out how to call nLLeval_2K with the right parameters, so I simply called the low-level method nLLcore:

for snp, id, i in causal_snps:
    x = snp.reshape((snp.shape[0], 1))
    X = np.hstack((x, X0))

    # usual results (no nearby SNPs excluded)
    r2_1 = lmm_cov.nLLeval_2K(h2, snps=X, penalty=1e-8)

    b2 = r2_1['beta'][:,0]
    varbeta = r2_1['variance_beta'][:,0]
    pv1 = st.chi2.sf(b2*b2/varbeta, 1)[0]
    print "| %s\t | %.5f | " % (id, pv1),

    for n in [1, 2, 3, 5, 10, 20, 30]:
        # exclude a window of 2n+1 SNPs centered on the tested SNP,
        # scaled by 1/sqrt(#SNPs) just like in the GSM
        UW, UUW = lmm_cov.rotate(snps[:,i-n:i+n+1]/np.sqrt(snps.shape[1]))
        Usnps, UUsnps = lmm_cov.rotate(X)
        Sd = h2*lmm_cov.S + (1-h2)

        # results with nearby SNPs removed
        r2_2 = lmm_cov.nLLcore(Sd=Sd, denom=1-h2, Usnps=Usnps, UUsnps=UUsnps,
                               UW=UW, UUW=UUW, weightW=np.ones(UW.shape[1]),
                               penalty=1e-8)

        b2 = r2_2['beta'][:,0]
        varbeta = r2_2['variance_beta'][:,0]
        pv2 = st.chi2.sf(b2*b2/varbeta, 1)[0]
        print "%.5f |" % pv2,
    print

Most p-values moved closer to zero, but not enough to call it a success, and that is with 2*30+1 = 61 SNPs removed, which is quite a lot.

Number of SNPs removed:

SNP                  0        3        5        7        11       21       41       61
mm37-2-102530353     0.33208  0.32986  0.32827  0.32912  0.32674  0.31521  0.30549  0.29103
mm37-4-63910807      0.00010  0.00009  0.00009  0.00009  0.00008  0.00006  0.00003  0.00001
mm37-4-133676920     0.06684  0.06464  0.06311  0.06176  0.05882  0.05218  0.04023  0.02979
mm37-7-13055310      0.29550  0.29241  0.29009  0.28927  0.28496  0.27669  0.26859  0.24872
mm37-8-65761568      0.00119  0.00111  0.00106  0.00102  0.00093  0.00075  0.00055  0.00038
mm37-14-67815624     0.68446  0.68521  0.68596  0.68656  0.68755  0.68705  0.68727  0.69435
mm37-14-88557274     0.48970  0.48670  0.48425  0.48245  0.47828  0.46994  0.44329  0.41584
mm37-14-107893320    0.00561  0.00526  0.00516  0.00501  0.00452  0.00324  0.00158  0.00075
mm37-15-70067681     0.58416  0.58259  0.58186  0.57981  0.57646  0.57024  0.56077  0.54728
mm37-16-74943366     0.74048  0.73834  0.73689  0.73542  0.73253  0.72131  0.69880  0.68521
mm37-17-79322387     0.56762  0.56415  0.56313  0.56259  0.56054  0.55554  0.54652  0.53043
mm37-18-77348115     0.43114  0.42660  0.42330  0.42007  0.41291  0.39242  0.38557  0.38033

Comparing $\hat{\beta}$ with $\beta$

Since we know the real coefficients, it’s interesting to plot the true values against the estimated ones (error bars are $\pm 2\,$SD).
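A sketch of how such a plot can be produced. It assumes the true effects are read back from the GCTA .par file shown above, and that the p-value loop also stored its estimates in a list estimates of (id, beta, varbeta) tuples (a hypothetical name; the loop above only printed them):

import numpy as np
import matplotlib.pyplot as plt

# true effects from the GCTA .par file: columns are QTL, RefAllele, Frequency, Effect
true_effect = {}
with open("simulated.phenos.par") as f:
    next(f)                              # skip the header line
    for line in f:
        fields = line.split()
        true_effect[fields[0]] = float(fields[3])

true = np.array([true_effect[id] for id, b, v in estimates])
est  = np.array([b for id, b, v in estimates])
err  = 2 * np.sqrt(np.array([v for id, b, v in estimates]))   # +/- 2 SD

plt.errorbar(true, est, yerr=err, fmt='o')
plt.plot([true.min(), true.max()], [true.min(), true.max()], '--')  # y = x reference
plt.xlabel(r'true $\beta$')
plt.ylabel(r'estimated $\hat{\beta}$')
plt.savefig("beta_vs_betahat.png")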

No wonder that even the largest effects won’t pass the Bonferroni correction: with on the order of 132k SNPs, the threshold is roughly $0.05/132{,}000 \approx 4\times 10^{-7}$, far below even the best p-value above.

Simpler input

Let’s now keep only three causal SNPs and specify the effect sizes in the file fed to the GCTA tool:

$ cat causal.snps.fake2
mm37-2-102530353  3
mm37-4-63910807 4
mm37-4-133676920  5
$ ./gcta64 --bfile snps.132k.clean.noX --simu-qt  --out simulated2.phenos --simu-causal-loci causal.snps.fake2

Running the same analysis with the new phenotypes produces the following p-values:

Number of SNPs removed:

SNP                  0          3          5          7          11         21         41         61
mm37-2-102530353     0.0006150  0.0005883  0.0005624  0.0005629  0.0005381  0.0004462  0.0003483  0.0002527
mm37-4-63910807      0.0017976  0.0016955  0.0016545  0.0015902  0.0015148  0.0012178  0.0007377  0.0003833
mm37-4-133676920     0.0000017  0.0000014  0.0000012  0.0000011  0.0000008  0.0000004  0.0000001  0.0000000

Although the p-values are better now, the estimated effect sizes are 5.0, 5.6, and 6.8 instead of 3, 4, and 5, with SDs of 1.5–1.8. Also, a large effect size doesn’t guarantee a small p-value, since its variance can be large as well.

Conclusions

  • the simple synthetic tests above show that GWAS is not a simple business
  • removing nearby SNPs does help, but at best it reduces a p-value by one or two orders of magnitude