Association testing

Important notes for this notebook

After QC and generating MDS components as covariates to address population structure, the data is ready for association tests. In this tutorial, we provide scripts for association tests suitable for binary traits (e.g., alcohol-dependent patients vs. healthy controls) or quantitative traits (e.g., number of alcoholic beverages consumed per week).

Learning outcomes

  • Discuss different types of association tests
  • Identify the suitable association test for your data

How to make this notebook work

  • In this notebook, we will use both R and Bash. Remember to change the kernel whenever you switch from one language to the other (Kernel --> Change Kernel), as indicated by the language icons. We will first run Bash commands.

Choose the Bash kernel

Some theory

Biometric model

The theory behind genetic association is rooted in the biometrical model, first established by Fisher. According to this model, each genetic variant has an additive effect on a trait's value, and at a given variant each genotype contributes differently to the trait. For example, in a two-allele system (A and a), the effects of the three genotypes (AA, Aa, aa) are described by two parameters: d, half the difference between the two homozygotes AA and aa (i.e. the deviation of each homozygote from their midpoint), and h, the deviation of the heterozygote Aa from that midpoint. The midpoint of the two homozygotes is denoted m. The parameters d and h are called genotypic effects (Neale and Cardon 2013).
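Using the same symbols, the expected trait values of the three genotypes can be written relative to the homozygote midpoint \(m\):

\[\text{AA}: \; m + d, \qquad \text{Aa}: \; m + h, \qquad \text{aa}: \; m - d\]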

Figure 3.1. Biometrical genetics. The d and h increments of the gene difference A – a. Aa may lie on either side of m and the sign of h will vary accordingly; in the case illustrated h would be negative (adapted from Mather and Jinks, 1977, p. 32).

Example

Imagine that we are talking about genes that influence adult stature. Let us assume that the normal range for males is from 1.47 m to 2.03 m, that is, about 0.56 m, and that each somatic chromosome carries one gene of roughly equivalent effect. Then, at each locus, the homozygotes contribute ±d = ±0.28 cm (measured from the midpoint, m), depending on whether they are AA, the increasing homozygote, or aa, the decreasing homozygote. While some loci may have larger effects, others are likely to have smaller contributions. This model suggests that the effect of any given gene is subtle and difficult to detect using classical genetic methods.

In GWAS, linear models have typically been used for continuous phenotypes (e.g. BMI, height, blood pressure) and logistic models for binary traits (e.g. disease presence). These models include fixed effects, namely the genotype and covariates such as sex or components of population structure (included to reduce their confounding influence), together with a random error term, \(\epsilon\), that captures the residual variation.

Linear mixed models (LMMs) are increasingly popular as an alternative to standard linear models and have proven to be effective for analyzing complex traits. They adjust for confounding factors such as population stratification, family structure, and cryptic relatedness, resulting in more reliable test statistics. However, they are usually more computationally demanding.
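As a brief sketch of the idea (the notation below is assumed for illustration and is not taken from any particular tool), an LMM keeps the fixed effects of the standard regression described in the next section and adds a random genetic effect \(u\) whose covariance is given by a relatedness (kinship) matrix \(K\) estimated from the genome-wide genotypes:

\[y = G\beta_G + X\beta_X + u + \epsilon, \qquad u \sim \mathcal{N}(0, \sigma_g^2 K), \qquad \epsilon \sim \mathcal{N}(0, \sigma_e^2 I)\]

Modelling this shared genetic covariance is what makes the resulting test statistics more robust to stratification and cryptic relatedness.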

Modelling

The linear regression model

The basic linear regression model is written as:

\[y = G\beta_G + X\beta_X + \epsilon\]

Here, \(y\) is the phenotype vector, \(G\) is the genotype/dosage matrix (the allele counts) for the current variant, \(X\) is the fixed-covariate matrix, and \(\epsilon\) is the error term subject to least-squares minimization.
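As a minimal sketch of this model for a single variant (all data below are simulated for illustration; in a real analysis PLINK builds \(G\) from the genotype files and \(X\) from the covariate file, so lm() here is only a stand-in for the per-SNP regression that --linear performs):

# Toy example: test one SNP against a quantitative trait with a linear model
set.seed(1)
n    <- 500
g    <- rbinom(n, size = 2, prob = 0.3)      # allele counts (0/1/2) at one SNP
sex  <- rbinom(n, size = 1, prob = 0.5)      # example covariate
mds1 <- rnorm(n)                             # example MDS component (population structure)
y    <- 0.5 * g + 0.3 * sex + rnorm(n)       # simulated quantitative phenotype

fit <- lm(y ~ g + sex + mds1)                # y = G*beta_G + X*beta_X + epsilon
summary(fit)$coefficients["g", ]             # estimate, SE, t value and p-value for the SNP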

PLINK supports SNP-trait association testing for both quantitative and binary traits:

  • --assoc: Performs chi-square tests for binary traits. For quantitative traits, it uses asymptotic methods (likelihood-ratio or Wald tests) or empirical significance values. Covariates are not supported. PLINK automatically treats the outcome as quantitative when the phenotype column contains values other than 1, 2, 0, or missing.
  • --linear: Fits linear regression models for quantitative traits, allowing covariates.
  • --logistic: Fits logistic (or Firth) regression models for binary traits, supporting covariates and SNP-covariate interactions.

Both --linear and --logistic are more flexible than --assoc but may run slower. For further details on PLINK's association methods, see the Association Analysis documentation for PLINK 1.9 or PLINK 2.0; note that command options differ between versions.

Correction for multiple testing

Modern genotyping arrays can test up to 4 million markers, increasing the risk of false positives due to multiple testing. While a single comparison has a low error rate, analyzing millions of markers amplifies this risk. Common methods to address this include:

  • Bonferroni Correction: Adjusts the significance threshold by dividing \(0.05\) by the number of tests, controlling the probability of at least one false positive (the family-wise error rate). Because SNPs are correlated through linkage disequilibrium, it can be overly strict and increase false negatives.
  • False Discovery Rate (FDR): Controls the expected proportion of false positives among the results declared significant (at a specified threshold) and is less conservative than Bonferroni. However, it assumes independence between SNPs, which may not hold under LD (both this and the Bonferroni correction are illustrated in the short R sketch after this list).
  • Permutation Testing: Randomly shuffles the phenotype labels to break any true genotype-phenotype association, generating an empirical null distribution against which the observed test statistics are compared. The procedure is repeated many times (often thousands to millions of permutations) to obtain robust empirical p-values.
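As a small illustration of the first two corrections, R's built-in p.adjust() can be applied to any vector of p-values (the values below are simulated under the null and are not the tutorial's results):

# Toy example: multiple-testing correction of 10,000 simulated p-values
set.seed(42)
pvals  <- runif(10000)                                # pretend these are GWAS p-values
0.05 / length(pvals)                                  # Bonferroni-adjusted significance threshold
p_bonf <- p.adjust(pvals, method = "bonferroni")      # family-wise error rate control
p_fdr  <- p.adjust(pvals, method = "BH")              # Benjamini-Hochberg FDR control
sum(p_bonf < 0.05)                                    # SNPs still significant (expect ~0 here)
sum(p_fdr < 0.05)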

To learn more about statistical testing and false positives, look at this online book chapter.

Manhattan and QQ-plots

A common approach to identifying strongly associated variants is to plot the association results and visually look for peaks. One such method is the Manhattan plot, where each SNP is plotted along the genome against the negative \(\log_{10}\) of its p-value from the association test (Gibson 2010). This can be done using the manhattan() function from the qqman package in R. To customize the plots (adjust colors, font sizes, etc.), check the package vignette here.

# Setup to avoid long messages and plot on screen
options(warn=-1)
options(jupyter.plot_mimetypes = 'image/png')

# Load GWAS package qqman
suppressMessages(library("qqman"))

# Manhattan plot using --logistic results
results_log <- read.table("Results/GWAS5/logistic_results.assoc_2.logistic", head=TRUE)
manhattan(results_log, main = "Manhattan plot: logistic", cex.axis=1.1)

# Manhattan plot using --assoc
results_as <- read.table("Results/GWAS5/assoc_results.assoc", head=TRUE)
manhattan(results_as, main = "Manhattan plot: assoc", cex.axis=1.1)  

# Zoom-in version in a significant region in chromosome 3
manhattan(subset(results_as, CHR==3), xlim = c(8.1e7, 8.5e7), chr="CHR", 
          main = "Manhattan plot: assoc, chr 3", cex.axis=1.1)

Stop - Read - Solve

The blue line represents the threshold for significance (in the two plots, \(10^{-5}\)). We see no significant SNPs associated with the phenotype when we use the --logistic command (first plot). However, when we use the --assoc command (second plot), we obtain significant SNPs.

Why is there a difference?

Recall from earlier in this chapter that the --assoc command does not correct for covariates. So even though we have promising (and hopefully publishable!) results, this form of analysis may be confounded by the underlying population stratification, which is accounted for by the MDS covariates in the --logistic model.

The second method of visually assessing significance is the QQ-plot, which plots the observed \(-\log_{10}(p)\) values against those expected under the null hypothesis. It is a good way to spot not only outliers that could represent significant associations but also peculiarities within our data. For example, if the observed values deviate strongly from the expected (diagonal) line across the whole range, there might be an error or confounding in our analyses or data.

We will create these plots using the qq() function from the qqman package in R.

# Setup to avoid long messages and plot on-screen
options(warn=-1)
options(jupyter.plot_mimetypes = 'image/png')

# Load GWAS package qqman
suppressMessages(library("qqman")) 

# QQ plot for --logistic
results_log <- read.table("Results/GWAS5/logistic_results.assoc_2.logistic", head=TRUE)
qq(results_log$P, main = "Q-Q plot of GWAS p-values (log) using --logistic")

# QQ plot for --assoc
results_as <- read.table("Results/GWAS5/assoc_results.assoc", head=TRUE)
qq(results_as$P, main = "Q-Q plot of GWAS p-values (log) using --assoc")
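As an optional aside (not part of the original exercise), the overall deviation seen in a QQ-plot is often summarised by the genomic inflation factor \(\lambda\): the median observed association test statistic divided by the median expected under the null. Values close to 1 are reassuring, while values well above 1 hint at inflation, for example from population stratification. A minimal sketch using the data frames loaded above:

# Optional: genomic inflation factor (lambda) as a numeric summary of QQ-plot deviation
lambda_gc <- function(p) {
  chisq <- qchisq(1 - p, df = 1)                      # convert p-values to 1-df chi-square statistics
  median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)   # ~1 under the null; >>1 suggests inflation
}
lambda_gc(results_log$P)   # --logistic results
lambda_gc(results_as$P)    # --assoc results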

Stop - Read - Solve

Given the two Q-Q plots (based on logistic regression and the basic association test), how do the observed p-values compare to the expected p-values under the null hypothesis? Write a short interpretation based on your observations.

Warning

Because of the relatively small sample size of the HapMap data, the genetic effect sizes in these simulations were set higher than what is typically observed in genetic studies of complex traits. In reality, detecting genetic risk factors for complex traits requires much larger sample sizes—typically in the thousands, and often in the tens or even hundreds of thousands. Finally, it is important to remember that this is a simulated dataset designed to mimic real data but remains very limited in size.

Solution
  • In the first plot (--logistic), the observed \(-\log_{10}(p)\) values fall below the expected ones (consistent with what we saw in the Manhattan plot). This suggests an unexpected lack of significant findings, which could be due to overly conservative corrections, biases in the dataset, or poor study power (small sample size or very small effect sizes).
  • On the other hand, in the --assoc QQ-plot, some SNPs show stronger associations than expected (observed p-values smaller than expected, i.e. points above the diagonal), which suggests potential true associations. However, whereas in a well-behaved GWAS only the top-right tail keeps deviating upward, here the pattern is atypical. Before trusting any result, we must therefore conduct further analyses, particularly to assess population stratification. Even if we have confidence in our association test and sufficient power to detect associated variants, careful evaluation is necessary to rule out confounding factors.

Moreover, it is important to remember that, while this suggests an association between these SNPs and the studied phenotype, there isn’t sufficient information here to determine the causal variant. In fact, there could potentially be multiple causal variants and the causal variants could be in LD with some of the significant variants. Identifying the causal variants would require further investigation using biological methods. However, this analysis has significantly reduced the number of SNPs that need to be studied.

References

Gibson, Greg. 2010. “Hints of Hidden Heritability in GWAS.” Nature Genetics 42 (7): 558–60.
Joo, Jong Wha J., Farhad Hormozdiari, Buhm Han, and Eleazar Eskin. 2016. “Multiple Testing Correction in Linear Mixed Models.” Genome Biology 17 (1): 62. https://doi.org/10.1186/s13059-016-0903-6.
Marees, Andries T, Hilde De Kluiver, Sven Stringer, Florence Vorspan, Emmanuel Curis, Cynthia Marie-Claire, and Eske M Derks. 2018. “A Tutorial on Conducting Genome-Wide Association Studies: Quality Control and Statistical Analysis.” International Journal of Methods in Psychiatric Research 27 (2): e1608.
Neale, MCCL, and Lon R Cardon. 2013. Methodology for Genetic Studies of Twins and Families. Vol. 67. Springer Science & Business Media.
Uffelmann, Emil, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. 2021. “Genome-Wide Association Studies.” Nature Reviews Methods Primers 1 (1): 1–21. https://doi.org/10.1038/s43586-021-00056-9.