5 Association Testing¶
After QC and calculation of MDS components, the data are ready for subsequent association tests. Depending on the expected genetic model of the trait or disease of interest and the nature of the phenotypic trait studied, the appropriate statistical test can be selected. In this tutorial, we provide scripts for various types of association tests that are suitable for binary traits (e.g., alcohol-dependent patients vs. healthy controls) or quantitative traits (e.g., the number of alcoholic beverages consumed per week).
Biometrical model¶
The theory of genetic association is based on the biometrical model. Here's an example (taken from the Biometrical Genetics chapter in Methodology of Genetic Studies of Twins and Families):
Imagine that we are talking about genes that influence adult stature. Let us assume that the normal range for males is from 4 feet 10 inches to 6 feet 8 inches: that is, about 22 inches. And let us assume that each somatic chromosome has one gene of roughly equivalent effect. Then, roughly speaking, we are thinking in terms of loci for which the homozygotes contribute ±1/2 (from the midpoint), depending on whether they are AA, the increasing homozygote, or aa, the decreasing homozygote. In reality, although some loci may contribute greater effects than this, others will almost certainly contribute less; thus, we are talking about the kind of model in which any particular polygene is having an effect that would be difficult to detect by the methods of classical genetics.
Typically in GWAS, we use linear models when the phenotype is continuous (BMI, height, blood pressure) and logistic models when the phenotype is binary (whether you have the disease or not). Both models include a residual error term $\epsilon$ and allow covariates to be included as fixed effects to account for confounding. For example, in the stature example above, the same height genes might affect females in the same way they do males, but the range might be smaller; including sex as a covariate accounts for this difference.
6.1 Modelling¶
The linear/logistic model¶
The basic linear regression model is written as:
$$y = G\beta_G + X\beta_X + \epsilon$$
Here, $y$ is the phenotype vector, $G$ is the genotype/dosage matrix (the allele counts) for the current variant, $X$ is the fixed-covariate matrix, and $\epsilon$ is the error term, which is minimized in the least-squares sense.
For binary phenotypes, --logistic fits a logistic or Firth regression model instead, with the same $G\beta_G + X\beta_X$ terms. More details about PLINK's association methods can be found in the PLINK documentation.
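For a binary trait, the quantity being modelled linearly is the log-odds of being a case; written with the same symbols as above, the fitted relationship is:
$$\log \frac{P(y = 1)}{1 - P(y = 1)} = G\beta_G + X\beta_X$$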
Within PLINK, the association between SNPs and quantitative outcome measures can be tested with the options --assoc and --linear. When PLINK detects a quantitative outcome measure (i.e., values other than 1, 2, 0, or NA), the --assoc option will automatically treat it as such by performing an asymptotic version of the usual Student's t-test to compare two means. This option does not allow the use of covariates. The --linear option in PLINK performs a linear regression analysis with each individual SNP as a predictor. Similar to the --logistic option, the --linear option enables the use of covariates and is somewhat slower than the --assoc option.
PLINK in association testing¶
PLINK offers one degree of freedom (1 df) allelic tests in which the trait value, or the log-odds of a binary trait, increases or decreases linearly as a function of the number of risk alleles (minor allele a against major allele A). In addition, non-additive tests are available, for instance the genotypic association test (2 df: aa vs. Aa vs. AA), the dominant gene action test (1 df: aa & Aa vs. AA), and the recessive gene action test (1 df: aa vs. Aa & AA). However, non-additive tests are not widely applied, because the statistical power to detect non-additivity is low in practice. More complex analyses (e.g., Cox regression analysis) can be performed by using R-based "plug-in" functions in PLINK.
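As an aside, if you do want to inspect the non-additive tests, PLINK's --model flag reports the allelic, genotypic, dominant, and recessive case/control tests for each SNP in a single output file. A minimal sketch on this tutorial's data (the output prefix here is chosen for illustration):
# Run the allelic, genotypic (2 df), dominant, and recessive case/control tests for every SNP
plink --bfile HapMap_3_r3_13 --model --out model_results > /dev/null
# Each SNP appears on several rows, one per test (see the TEST column of model_results.model)
head model_results.model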
6.1.1 Commands¶
We create a link to the data folder so we can access it easily:
ln -sf ../Data
Recall that our dataset contains a binary (case/control) phenotype. In this tutorial, we will apply both --assoc and --logistic to our data. As a reminder, the --assoc option does not allow correction for covariates such as principal components (PCs)/MDS components, which makes it less suited for association analyses, but it is worth seeing its results to make this clear.
We start with --assoc:
# --assoc
plink --bfile HapMap_3_r3_13 --assoc --out assoc_results > /dev/null
This is how the association results look:
head -n 10 assoc_results.assoc
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 rs3131972 742584 A 0.1944 0.1455 G 0.9281 0.3354 1.418
1 rs3131969 744045 A 0.1759 0.1 G 2.647 0.1037 1.921
1 rs1048488 750775 C 0.1981 0.1455 T 1.054 0.3045 1.451
1 rs12562034 758311 A 0.06481 0.1182 G 1.863 0.1723 0.5171
1 rs12124819 766409 G 0.287 0.3 A 0.04416 0.8336 0.9394
1 rs4040617 769185 G 0.1667 0.1 A 2.1 0.1473 1.8
1 rs4970383 828418 A 0.2222 0.1727 C 0.8431 0.3585 1.368
1 rs4475691 836671 T 0.1574 0.1273 C 0.4057 0.5242 1.281
1 rs1806509 843817 C 0.4259 0.3455 A 1.49 0.2222 1.406
where the columns represent:
CHR: Chromosome code
SNP: Variant identifier
BP: Base-pair coordinate
A1: Allele 1 (usually minor)
F_A: Allele 1 frequency among cases
F_U: Allele 1 frequency among controls
A2: Allele 2
CHISQ: Allelic test chi-square statistic
P: Allelic test p-value
OR: Odds ratio, odds(allele 1 | case) / odds(allele 1 | control)
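A quick way to see the strongest signals directly in this file is to sort on the p-value column (column 9, per the header above); for example:
# Skip the header line, sort numerically by p-value, and show the 10 strongest associations
awk 'NR > 1' assoc_results.assoc | sort -g -k9,9 | head -n 10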
Applying --logistic, we can now include principal components as covariates. We will use the MDS components in covar_mds.txt from the previous tutorial:
# --logistic
# Note: we use the --hide-covar option to show only the additive SNP results in the output file.
plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic --hide-covar --out logistic_results > /dev/null
# Remove lines containing NA values, as they may cause problems when generating plots in later steps.
awk '!/NA/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic
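As a quick sanity check before moving on, you can peek at the first few rows of the filtered logistic results (the exact columns depend on your PLINK version):
# Inspect the covariate-adjusted logistic results after removing NA rows
head -n 5 logistic_results.assoc_2.logistic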
The results obtained from these GWAS analyses will be visualized in the last step. This will also show if the data set contains any genome-wide significant SNPs.
Note that in the case of a quantitative outcome measure, the --logistic option should be replaced by --linear. The --assoc option can also be used for quantitative outcome measures (as mentioned previously, this option does not allow the use of covariates).
6.3 Correction for multiple testing¶
Because modern genotyping arrays can genotype up to 4 million markers, multiple testing becomes a problem in our analyses. Three widely applied methods for determining genome‐wide significance are the use of Bonferroni correction, Benjamini–Hochberg false discovery rate (FDR), and permutation testing.
Bonferroni correction¶
The Bonferroni correction, which aims to control the probability of having at least one false-positive finding, calculates the adjusted p-value threshold with the formula $0.05/n$, where $n$ is the number of SNPs tested. However, many SNPs are correlated due to linkage disequilibrium (LD), whereas Bonferroni assumes that all tests are independent; the effective number of independent tests is therefore smaller than $n$, so this method is often too conservative and leads to an increase in the proportion of false-negative findings.
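For this dataset, the Bonferroni threshold can be computed directly from the number of variants in the .bim file (which is essentially the number of tests performed):
# Count the variants and compute the Bonferroni-adjusted significance threshold 0.05/n
n_snps=$(wc -l < HapMap_3_r3_13.bim)
awk -v n="$n_snps" 'BEGIN { printf "n = %d SNPs, Bonferroni threshold = %.3g\n", n, 0.05 / n }'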
False discovery rate¶
FDR controls the expected proportion of false positives among all signals with an FDR value below a fixed threshold. While this method is less conservative than the Bonferroni correction, controlling the FDR does not imply any notion of statistical significance for individual SNPs; it is merely a method to control the expected proportion of false positives. Moreover, like Bonferroni, it assumes that the SNPs (and thus the p-values) are independent, which is not the case because of LD.
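For reference, the Benjamini-Hochberg procedure sorts the $m$ p-values in ascending order, $p_{(1)} \le \dots \le p_{(m)}$, and for a chosen FDR level $q$ (e.g., 0.05) declares the $k$ smallest p-values discoveries, where:
$$k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\, q \right\}$$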
Permutation methods¶
The outcome measure labels are randomly permuted many times (say, on the order of 1 million), which effectively removes any true association between the outcome measure and the genotype. The statistical tests are then repeated for each permuted data set, yielding the empirical distribution of the test statistic and the p-values under the null hypothesis of no association. The original test statistic or p-value obtained from the observed data is subsequently compared to this empirical distribution to determine an empirically adjusted p-value.
6.3.1 Commands¶
Both the Bonferroni correction and FDR are straightforward to compute with PLINK, using the --adjust option:
# --adjust
plink --bfile HapMap_3_r3_13 --assoc --adjust --out adjusted_assoc_results > /dev/null
This outputs the following file:
head -n 10 adjusted_assoc_results.assoc.adjusted
CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY
3 rs1097157 1.66e-06 1.861e-06 1 1 0.8316 0.8316 0.2871 1
3 rs1840290 1.66e-06 1.861e-06 1 1 0.8316 0.8316 0.2871 1
8 rs279466 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs279460 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs279453 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs2623297 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
3 rs1598120 2.527e-06 2.821e-06 1 1 0.9336 0.9336 0.2871 1
3 rs2085079 3.162e-06 3.523e-06 1 1 0.9664 0.9664 0.2871 1
3 rs834858 3.162e-06 3.523e-06 1 1 0.9664 0.9664 0.2871 1
The Bonferroni-corrected p-value (BONF) is 1 for every SNP, meaning that no SNP shows a significant association with the phenotype at the $0.05/n$ threshold. As stated above, this method can be too conservative, so some of these may well be false negatives (Type II errors).
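You can confirm this directly by counting how many SNPs have a Bonferroni-adjusted p-value (the BONF column, column 5) below 0.05:
# Count SNPs passing the Bonferroni-corrected threshold (none are expected here)
awk 'NR > 1 && $5 < 0.05' adjusted_assoc_results.assoc.adjusted | wc -l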
The same is not true for the Benjamini & Hochberg FDR values (FDR_BH). We can plot their distribution to see this:
The black line indicates that very few values have an FDR-adjusted p-value below ~0.94, with the lowest being ~0.28. Thus, in this particular case, no single variant is significant at the 0.05 level after genome-wide correction.
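The same numbers can be pulled straight from the adjusted output file without plotting, for example by summarizing the FDR_BH column (column 9):
# Report the number of SNPs and the minimum and mean FDR_BH value
awk 'NR > 1 { p = $9 + 0; if (n == 0 || p < min) min = p; sum += p; n++ }
     END { printf "n = %d, min FDR_BH = %g, mean FDR_BH = %g\n", n, min, sum / n }' adjusted_assoc_results.assoc.adjusted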
Finally, we can look at permutation testing:
## Permutation
# This is a computationally intensive step. Further pros and cons of this method, which can be used for
# association testing and dealing with multiple testing, are described in the article corresponding to this tutorial
# (https://www.ncbi.nlm.nih.gov/pubmed/29484742).
# To reduce computational time, we only perform this test on a subset of the SNPs, from chromosome 22.
# The EMP2 column provides the p-value corrected for multiple testing.
# Generate subset of SNPs
awk '{ if ($1 == 22) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
# Filter your bfile based on the subset of SNPs generated in the step above.
plink --bfile HapMap_3_r3_13 --extract subset_snp_chr_22.txt --make-bed --out HapMap_subset_for_perm > /dev/null
# Perform 1000000 permutations.
plink --bfile HapMap_subset_for_perm --assoc --mperm 1000000 --out subset_1M_perm_result >/dev/null
# Order your data, from lowest to highest p-value.
sort -gk 4 subset_1M_perm_result.assoc.mperm > sorted_subset.txt
# Check ordered permutation results
head sorted_subset.txt
CHR SNP EMP1 EMP2
22 rs4821137 0.00012 0.5382
22 rs11704699 5.75e-05 0.7744
22 rs910541 0.0002495 0.8238
22 rs4821138 0.0001915 0.9046
22 rs5749022 0.000933 0.9696
22 rs5748561 0.000956 0.973
22 rs1647712 0.0004915 0.9735
22 rs1669114 0.0004915 0.9735
22 rs1669125 0.0004915 0.9735
# For reference, the tail of the full .bim file shows the chromosome 22 SNPs that were extracted above.
tail -n 10 HapMap_3_r3_13.bim
22 rs715586 0 49510004 T C
22 rs8137951 0 49512530 A G
22 rs2301584 0 49518363 A G
22 rs756638 0 49518559 A G
22 rs3810648 0 49522492 G A
22 rs2285395 0 49524956 A G
22 rs13056621 0 49528625 A G
22 rs3888396 0 49558258 C T
22 rs2238837 0 49559741 C A
22 rs28729663 0 49565872 A G
The EMP2 column provides the p-value corrected for multiple testing. Having sorted the results, we can see that the lowest EMP2 value is about 0.54, again implying that no SNP reaches significance.
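As a final check on the permutation results, we can count how many SNPs have an empirically corrected p-value (EMP2, column 4) below 0.05:
# Count SNPs significant after permutation-based correction (none are expected here)
awk 'NR > 1 && $4 < 0.05' sorted_subset.txt | wc -l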
All in all, when correcting for multiple testing, it is important to keep in mind that different correction measures have different properties, and it is up to the investigator to decide which to use and how to interpret them.
Manhattan and QQ-plots¶
We can make suggestions for the presence of high-association alleles using a visual approach. The first is called a Manhattan plot. The idea is straightforward: for each SNP, we plot its genomic position against the negative log of its p-value from the association test (Gibson 2010). We have provided a script for generating these plots, which uses the R package qqman.
# These scripts assume R >= 3.0.0.
# If you changed the name of the .assoc file or the .assoc.logistic file, please also use those
# names in the R scripts for the Manhattan and QQ plots, otherwise the scripts will not run.
Rscript --no-save Data/Manhattan_plot.R > /dev/null
Rscript --no-save Data/QQ_plot.R > /dev/null
For example usage please run: vignette('qqman')
Citation appreciated but not required:
Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
Let's first have a look at the Manhattan plot generated from the --logistic command:
The blue line represents the significance threshold used here ($10^{-5}$). We can see that there are no significant SNPs associated with the phenotype. However, using the --assoc command, we obtain the following plot:
Why is there a difference? Recall that the --assoc command does not correct for covariates. So even though we have promising (and hopefully publishable!) results, this form of analysis may be flawed.
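If you want to list the SNPs behind the peaks in the --assoc Manhattan plot, you can filter the results on the threshold used for the blue line ($10^{-5}$, with P in column 9 of the .assoc file):
# List every SNP from the uncorrected --assoc analysis with p < 1e-5
awk 'NR > 1 && $9 < 1e-5' assoc_results.assoc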
The second method of visually assessing significance is a QQ-plot, which plots the expected $-\log_{10}p$ values against the observed $-\log_{10}p$ values. It is a good way not only to spot outliers that could represent significant associations, but also to check whether there are peculiarities in our data. For example, if the plot shows an extreme deviation between the x- and y-axes, there is probably an error in our analyses or data.
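A complementary numeric check on inflation is the genomic inflation factor (lambda); PLINK estimates it when --adjust is used, so, assuming the log file from the earlier --adjust run is still present, it can be retrieved with:
# Look up the genomic inflation estimate reported by the --adjust run
grep -i "lambda" adjusted_assoc_results.log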
Again, we will start with our results from the --logistic command:
We can see that towards the upper half of the plot, the observed values are lower than their expected counterparts. This is consistent with the corresponding Manhattan plot.
Now, looking at the Q-Q plot obtained from the --assoc command:
Now, we have SNPs whose observed p-values are smaller than their expected p-values (they lie above the diagonal).
We should remind ourselves at this point that even though these results suggest an association between these SNPs and the studied phenotype, they do not tell us which variant is causal; indeed, there could even be multiple causal variants. That would have to be established with follow-up biological methods. But at least the set of SNPs that need to be studied is substantially reduced.