5 Association Testing¶
After QC and calculation of MDS components, the data are ready for subsequent association tests. Depending on the expected genetic model of the trait or disease of interest and the nature of the phenotypic trait studied, the appropriate statistical test can be selected. In this tutorial, we provide scripts for various types of association tests that are suitable for binary traits (e.g., alcohol-dependent patients vs. healthy controls) or quantitative traits (e.g., the number of alcoholic beverages consumed per week).
Biometrical model¶
The theory of genetic association is based on the biometrical model. Here's an example (taken from the Biometrical Genetics chapter in Methodology of Genetic Studies of Twins and Families):
Imagine that we are talking about genes that influence adult stature. Let us assume that the normal range for males is from 4 feet 10 inches to 6 feet 8 inches: that is, about 22 inches. And let us assume that each somatic chromosome has one gene of roughly equivalent effect. Then, roughly speaking, we are thinking in terms of loci for which the homozygotes contribute ±1/2 (from the midpoint), depending on whether they are AA, the increasing homozygote, or aa, the decreasing homozygote. In reality, although some loci may contribute greater effects than this, others will almost certainly contribute less; thus, we are talking about the kind of model in which any particular polygene is having an effect that would be difficult to detect by the methods of classical genetics.
Typically in GWAS, we use linear models when the phenotype is continuous (BMI, height, blood pressure) and logistic models when the phenotype is binary (whether you have the disease or not). Both models include a residual error term $\epsilon$ and allow covariates to be included as fixed effects to account for confounding. For example, in the stature example above, the same height genes might affect females in the same way they do males, but the range might be smaller; including sex as a covariate accounts for this difference.
6.1 Modelling¶
The linear/logistic model¶
The basic linear regression model is written as:
$$y = G\beta_G + X\beta_X + \epsilon$$
Here, $y$ is the phenotype vector, $G$ is the genotype/dosage matrix (the allele counts) for the current variant, $X$ is the fixed-covariate matrix, and $\epsilon$ is the error term, which is minimized in the least-squares sense.
For binary phenotypes, --logistic fits a logistic or Firth regression model instead, with the same $G\beta_G + X\beta_X$ terms. More details about PLINK's association methods can be found in the PLINK documentation.
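For a binary trait, the quantity being modelled linearly is the log-odds of being a case; written with the same symbols as above, the fitted relationship is:
$$\log \frac{P(y = 1)}{1 - P(y = 1)} = G\beta_G + X\beta_X$$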
Within PLINK, the association between SNPs and quantitative outcome measures can be tested with the options --assoc and --linear. When PLINK detects a quantitative outcome measure (i.e., values other than 1, 2, 0, or NA), the --assoc option will automatically treat it as such by performing an asymptotic version of the usual Student's t-test to compare two means. This option does not allow the use of covariates. The --linear option in PLINK performs a linear regression analysis with each individual SNP as a predictor. Similar to the --logistic option, the --linear option enables the use of covariates and is somewhat slower than the --assoc option.
PLINK in association testing¶
PLINK offers one degree of freedom (1 df) allelic tests in which the trait value, or the log-odds of a binary trait, increases or decreases linearly as a function of the number of risk alleles (minor allele a against major allele A). In addition, non-additive tests are available, for instance the genotypic association test (2 df: aa vs. Aa vs. AA), the dominant gene action test (1 df: aa & Aa vs. AA), and the recessive gene action test (1 df: aa vs. Aa & AA). However, non-additive tests are not widely applied, because the statistical power to detect non-additivity is low in practice. More complex analyses (e.g., Cox regression analysis) can be performed by using R-based "plug-in" functions in PLINK.
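As an aside, if you do want to inspect the non-additive tests, PLINK's --model flag reports the allelic, genotypic, dominant, and recessive case/control tests for each SNP in a single output file. A minimal sketch on this tutorial's data (the output prefix here is chosen for illustration):
# Run the allelic, genotypic (2 df), dominant, and recessive case/control tests for every SNP
plink --bfile HapMap_3_r3_13 --model --out model_results > /dev/null
# Each SNP appears on several rows, one per test (see the TEST column of model_results.model)
head model_results.model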
6.1.1 Commands¶
We create a link to the data folder so we can access it easily:
ln -sf ../Data
Recall that our dataset contains a binary (case/control) phenotype. In this tutorial, we will apply both --assoc and --logistic to our data. As a reminder, the --assoc option does not allow correction for covariates such as principal components (PCs)/MDS components, which makes it less suited for association analyses, but it is worth seeing its results to make this clear.
We start with --assoc:
# --assoc
plink --bfile HapMap_3_r3_13 --assoc --out assoc_results > /dev/null
This is how the association results look:
head -n 10 assoc_results.assoc
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 rs3131972 742584 A 0.1944 0.1455 G 0.9281 0.3354 1.418
1 rs3131969 744045 A 0.1759 0.1 G 2.647 0.1037 1.921
1 rs1048488 750775 C 0.1981 0.1455 T 1.054 0.3045 1.451
1 rs12562034 758311 A 0.06481 0.1182 G 1.863 0.1723 0.5171
1 rs12124819 766409 G 0.287 0.3 A 0.04416 0.8336 0.9394
1 rs4040617 769185 G 0.1667 0.1 A 2.1 0.1473 1.8
1 rs4970383 828418 A 0.2222 0.1727 C 0.8431 0.3585 1.368
1 rs4475691 836671 T 0.1574 0.1273 C 0.4057 0.5242 1.281
1 rs1806509 843817 C 0.4259 0.3455 A 1.49 0.2222 1.406
where the columns represent:
CHR: Chromosome code
SNP: Variant identifier
BP: Base-pair coordinate
A1: Allele 1 (usually minor)
F_A: Allele 1 frequency among cases
F_U: Allele 1 frequency among controls
A2: Allele 2
CHISQ: Allelic test chi-square statistic
P: Allelic test p-value
OR: Odds ratio, odds(allele 1 | case) / odds(allele 1 | control)
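A quick way to see the strongest signals directly in this file is to sort on the p-value column (column 9, per the header above); for example:
# Skip the header line, sort numerically by p-value, and show the 10 strongest associations
awk 'NR > 1' assoc_results.assoc | sort -g -k9,9 | head -n 10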
Applying --logistic, we can now include principal components as covariates. We will use the MDS components in covar_mds.txt from the previous tutorial:
# --logistic
# Note: we use the --hide-covar option to show only the additive SNP results in the output file.
plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic --hide-covar --out logistic_results > /dev/null
# Remove lines containing NA values, as they may cause problems when generating plots in later steps.
awk '!/NA/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic
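As a quick sanity check before moving on, you can peek at the first few rows of the filtered logistic results (the exact columns depend on your PLINK version):
# Inspect the covariate-adjusted logistic results after removing NA rows
head -n 5 logistic_results.assoc_2.logistic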
The results obtained from these GWAS analyses will be visualized in the last step. This will also show if the data set contains any genome-wide significant SNPs.
Note that in the case of a quantitative outcome measure, the --logistic option should be replaced by --linear. The --assoc option can also be used for quantitative outcome measures (as mentioned previously, this option does not allow the use of covariates).
6.3 Correction for multiple testing¶
Because modern genotyping arrays can genotype up to 4 million markers, multiple testing becomes a problem in our analyses. Three widely applied methods for determining genome‐wide significance are the use of Bonferroni correction, Benjamini–Hochberg false discovery rate (FDR), and permutation testing.
Bonferroni correction¶
The Bonferroni correction, which aims to control the probability of having at least one false-positive finding, calculates the adjusted p-value threshold with the formula $0.05/n$, where $n$ is the number of SNPs tested. However, many SNPs are correlated due to linkage disequilibrium (LD), whereas Bonferroni assumes that all tests are independent; the effective number of independent tests is therefore smaller than $n$, so this method is often too conservative and leads to an increase in the proportion of false-negative findings.
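For this dataset, the Bonferroni threshold can be computed directly from the number of variants in the .bim file (which is essentially the number of tests performed):
# Count the variants and compute the Bonferroni-adjusted significance threshold 0.05/n
n_snps=$(wc -l < HapMap_3_r3_13.bim)
awk -v n="$n_snps" 'BEGIN { printf "n = %d SNPs, Bonferroni threshold = %.3g\n", n, 0.05 / n }'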
False discovery rate¶
FDR controls the expected proportion of false positives among all signals with an FDR value below a fixed threshold. While this method is less conservative than the Bonferroni correction, controlling the FDR does not imply any notion of statistical significance for individual SNPs; it is merely a method to control the expected proportion of false positives. Moreover, like Bonferroni, it assumes that the SNPs (and thus the p-values) are independent, which is not the case because of LD.
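For reference, the Benjamini-Hochberg procedure sorts the $m$ p-values in ascending order, $p_{(1)} \le \dots \le p_{(m)}$, and for a chosen FDR level $q$ (e.g., 0.05) declares the $k$ smallest p-values discoveries, where:
$$k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\, q \right\}$$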
Permutation methods¶
The outcome measure labels are randomly permuted many times (say, on the order of 1 million), which effectively removes any true association between the outcome measure and the genotype. The statistical tests are then repeated for each permuted data set, yielding the empirical distribution of the test statistic and the p-values under the null hypothesis of no association. The original test statistic or p-value obtained from the observed data is subsequently compared to this empirical distribution to determine an empirically adjusted p-value.
6.3.1 Commands¶
Both the Bonferroni correction and FDR are straightforward to compute with PLINK, using the --adjust option:
# --adjust
plink --bfile HapMap_3_r3_13 --assoc --adjust --out adjusted_assoc_results > /dev/null
This outputs the following file:
head -n 10 adjusted_assoc_results.assoc.adjusted
CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY
3 rs1097157 1.66e-06 1.861e-06 1 1 0.8316 0.8316 0.2871 1
3 rs1840290 1.66e-06 1.861e-06 1 1 0.8316 0.8316 0.2871 1
8 rs279466 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs279460 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs279453 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
8 rs2623297 2.441e-06 2.727e-06 1 1 0.9272 0.9272 0.2871 1
3 rs1598120 2.527e-06 2.821e-06 1 1 0.9336 0.9336 0.2871 1
3 rs2085079 3.162e-06 3.523e-06 1 1 0.9664 0.9664 0.2871 1
3 rs834858 3.162e-06 3.523e-06 1 1 0.9664 0.9664 0.2871 1
The Bonferroni-corrected p-value (BONF) is 1 for every SNP, meaning that no SNP shows a significant association with the phenotype at the $0.05/n$ threshold. As stated above, this method can be too conservative, so some of these may well be false negatives (Type II errors).
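You can confirm this directly by counting how many SNPs have a Bonferroni-adjusted p-value (the BONF column, column 5) below 0.05:
# Count SNPs passing the Bonferroni-corrected threshold (none are expected here)
awk 'NR > 1 && $5 < 0.05' adjusted_assoc_results.assoc.adjusted | wc -l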
The same is not true for the Benjamini & Hochberg FDR values (FDR_BH). We can plot their distribution to see this:
The black line indicates that very few values have an FDR-adjusted p-value below ~0.94, with the lowest being ~0.28. Thus, in this particular case, no single variant is significant at the 0.05 level after genome-wide correction.
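The same numbers can be pulled straight from the adjusted output file without plotting, for example by summarizing the FDR_BH column (column 9):
# Report the number of SNPs and the minimum and mean FDR_BH value
awk 'NR > 1 { p = $9 + 0; if (n == 0 || p < min) min = p; sum += p; n++ }
     END { printf "n = %d, min FDR_BH = %g, mean FDR_BH = %g\n", n, min, sum / n }' adjusted_assoc_results.assoc.adjusted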
Finally, we can look at permutation testing:
## Permutation
# This is a computationally intensive step. Further pros and cons of this method, which can be used for
# association testing and dealing with multiple testing, are described in the article corresponding to this tutorial
# (https://www.ncbi.nlm.nih.gov/pubmed/29484742).
# To reduce computational time, we only perform this test on a subset of the SNPs, from chromosome 22.
# The EMP2 column provides the p-value corrected for multiple testing.
# Generate subset of SNPs
awk '{ if ($1 == 22) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
# Filter your bfile based on the subset of SNPs generated in the step above.
plink --bfile HapMap_3_r3_13 --extract subset_snp_chr_22.txt --make-bed --out HapMap_subset_for_perm > /dev/null
# Perform 1000000 permutations.
plink --bfile HapMap_subset_for_perm --assoc --mperm 1000000 --out subset_1M_perm_result >/dev/null
# Order your data, from lowest to highest p-value.
sort -gk 4 subset_1M_perm_result.assoc.mperm > sorted_subset.txt
# Check ordered permutation results
head sorted_subset.txt
CHR SNP EMP1 EMP2
22 rs4821137 0.00012 0.5382
22 rs11704699 5.75e-05 0.7744
22 rs910541 0.0002495 0.8238
22 rs4821138 0.0001915 0.9046
22 rs5749022 0.000933 0.9696
22 rs5748561 0.000956 0.973
22 rs1647712 0.0004915 0.9735
22 rs1669114 0.0004915 0.9735
22 rs1669125 0.0004915 0.9735
# For reference, the tail of the full .bim file shows the chromosome 22 SNPs that were extracted above.
tail -n 10 HapMap_3_r3_13.bim
22 rs715586 0 49510004 T C
22 rs8137951 0 49512530 A G
22 rs2301584 0 49518363 A G
22 rs756638 0 49518559 A G
22 rs3810648 0 49522492 G A
22 rs2285395 0 49524956 A G
22 rs13056621 0 49528625 A G
22 rs3888396 0 49558258 C T
22 rs2238837 0 49559741 C A
22 rs28729663 0 49565872 A G
The EMP2 column provides the p-value corrected for multiple testing. Having sorted the results, we can see that the lowest EMP2 value is about 0.54, again implying that no SNP reaches significance.
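As a final check on the permutation results, we can count how many SNPs have an empirically corrected p-value (EMP2, column 4) below 0.05:
# Count SNPs significant after permutation-based correction (none are expected here)
awk 'NR > 1 && $4 < 0.05' sorted_subset.txt | wc -l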
All in all, when correcting for multiple testing, it is important to keep in mind that different correction measures have different properties, and it is up to the investigator to decide which to use and how to interpret them.
Manhattan and QQ-plots¶
We can make suggestions for the presence of high-association alleles using a visual approach. The first is called a Manhattan plot. The idea is straightforward: for each SNP, we plot its genomic position against the negative log of its p-value from the association test (Gibson 2010). We have provided a script for generating these plots, which uses the R package qqman.
# These scripts assume R >= 3.0.0.
# If you changed the name of the .assoc file or the .assoc.logistic file, please also use those
# names in the R scripts for the Manhattan and QQ plots, otherwise the scripts will not run.
Rscript --no-save Data/Manhattan_plot.R > /dev/null
Rscript --no-save Data/QQ_plot.R > /dev/null
For example usage please run: vignette('qqman')
Citation appreciated but not required:
Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
Let's first have a look at the Manhattan plot generated from the --logistic command:
The blue line represents the significance threshold used here ($10^{-5}$). We can see that there are no significant SNPs associated with the phenotype. However, using the --assoc command, we obtain the following plot:
Why is there a difference? Recall that the --assoc command does not correct for covariates. So even though we have promising (and hopefully publishable!) results, this form of analysis may be flawed.
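If you want to list the SNPs behind the peaks in the --assoc Manhattan plot, you can filter the results on the threshold used for the blue line ($10^{-5}$, with P in column 9 of the .assoc file):
# List every SNP from the uncorrected --assoc analysis with p < 1e-5
awk 'NR > 1 && $9 < 1e-5' assoc_results.assoc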
The second method of visually assessing significance is a QQ-plot, which plots the expected $-\log_{10}p$ values against the observed $-\log_{10}p$ values. It is a good way not only to spot outliers that could represent significant associations, but also to check whether there are peculiarities in our data. For example, if the plot shows an extreme deviation between the x- and y-axes, there is probably an error in our analyses or data.
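A complementary numeric check on inflation is the genomic inflation factor (lambda); PLINK estimates it when --adjust is used, so, assuming the log file from the earlier --adjust run is still present, it can be retrieved with:
# Look up the genomic inflation estimate reported by the --adjust run
grep -i "lambda" adjusted_assoc_results.log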
Again, we will start with our results from the --logistic command:
We can see that towards the upper half of the plot, the observed values are lower than their expected counterparts. This is consistent with the corresponding Manhattan plot.
Now, looking at the Q-Q plot obtained from the --assoc command:
Now, we have SNPs whose observed p-values are smaller than their expected p-values (they lie above the diagonal).
We should remind ourselves at this point that even though these results suggest an association between these SNPs and the studied phenotype, they do not tell us which variant is causal; indeed, there could even be multiple causal variants. That would have to be established with follow-up biological methods. But at least the set of SNPs that need to be studied is substantially reduced.