Quality Control: initial steps

Important notes for this notebook

After gathering your data and genotyping, there are many checks to do to determine the quality of the data. It is crucial to perform quality control before carrying out any GWAS, otherwise, there is the risk that some of the associations are spurious.

Learning outcomes

  • Distinguish the various QC steps
  • Discuss and choose thresholds on plink
  • Implement basic QC in plink and statistical plots in R
  • Hypothesize the effects of various plink commands and verify your hypothesis using the analysis in R

How to make this notebook work

  • In this notebook, we will use both R and bash command line programming languages. Remember to change the kernel whenever you transition from one language to the other (Kernel --> Change Kernel) indicated by the languages’ images. We will first run Bash commands.

Bash Choose the Bash kernel

The next critical step in any GWAS involves scrutinizing the data for potential issues. It’s crucial to ensure that the results are not merely artifacts of poor data quality discovered after the analysis.

After gathering your data and genotyping, there are many checks one can do to determine the quality of the data. PLINK provides several summary statistics for quality control (e.g. missing genotype rate, minor allele frequency, and Hardy-Weinberg equilibrium failures) which will serve as thresholds for subsequent analyses.

For this tutorial, we will look at the following topics:

  1. Individual Missingness
  2. Sex discrepancy
  3. Minor Allele Frequency (MAF)
  4. Hardy-Weinberg Equilibrium (HWE)
  5. Heterozygosity Rate

Individual and SNP Missingness

Missingness per SNP

Overall, the SNP genotyping platform is very reliable and delivers stable results when it comes to determining genotypes. Of course, it is not flawless. One of the most frequent problems is that some of the SNPs are just not well-genotyped in the entire population. These should be removed to improve the overall data quality.

Certainly, we can’t remove every SNP with any missing values, as that would result in losing a significant portion of our data. Instead, we adopt some thresholds. However, determining these thresholds isn’t governed by strict rules. You have the flexibility to set them within “reasonable” limits. To find out what constitutes “reasonable,” consulting relevant literature for your species of interest is advisable. For humans, a common starting point is a threshold of 0.2, meaning we exclude SNPs where 20% or more of the genotypes are missing in our population.

Missingness per individual

The reliability of SNP chips is also high when it comes to individual genotypes. In some cases, however, some of the individuals contain a large number of missing SNPs. The reason could be low DNA sample quality, the wrong chip type used (e.g. cattle chip for deer samples), or other technical issues. Regardless of the reason, you should remove the worst offenders from your data set, to not compromise the overall quality of your results.

Stop - Read - Solve

Have a look at the following toy example:

/ SNP1 SNP2 SNP3 SNP4 SNP5
IND1 22 00 11 12 22
IND2 22 00 11 12 22
IND3 11 12 11 22 21
IND4 00 00 11 11 00
IND5 22 00 11 22 22

Here, we have a data set of five individuals, each of them genotyped for five SNPs. The genotypes themselves are in numerical coding, 11 and 22 being the two homozygous, 12 the heterozygous, and 00 coded as missing. We want to apply two filters: one for the variants, and one for the individuals.

  • First remove variants with >= 40% missingness
  • Then remove individuals with >= 40% missingness

Which variants and individuals are we removing?

In practice, you want to remove the SNPs based on missingness before the individuals. This is simply because we generally have a lot more SNPs than individuals, and would thus lose less information by removing SNPs than individuals. This will remove “bad” SNPs first, leaving a lower rate of missingness for all the individuals.

Variant-level threshold: In this example, we would only remove SNP2 since it exceeds this threshold, with 80% missingness (4 out of 5 missing values).

Individual-level threshold: IND4 is removed from the data set, as there is missing genotype information for 2/4 = 50% SNPs (after filtering out SNP2).

Sex Discrepancy

One useful check is verifying if the indicated sex is correct. Using PLINK, you can calculate the inbreeding coefficient on the X chromosome under the assumption that it is an autosomal chromosome. This approach is insightful because PLINK treats haploid chromosomes (like the X chromosome in males) as homozygotes due to technical reasons. Consequently, assuming the X chromosome is autosomal makes males appear highly inbred on the X, whereas females do not (since they have two X chromosomes). As a result, the inbreeding coefficient estimates will be close to 1 for males and 0 for females.

Minor Allele Frequency (MAF)

Excluding SNPs based on minor allele frequency (MAF) is somewhat controversial. In a sense, it has little to do with quality control – there is no reason to think there are any errors in the data. The main justification is statistical:

  • If MAF is low, and the SNP is rare, then the power is low (i.e. don’t spend multiple testing corrections on tests that are unlikely to find anything anyway).
  • Some statistical methods perform badly with low MAF (e.g. the chi-squared-test).

An appropriate cutoff definitely depends on sample size – the larger the sample, the greater your ability to include rare SNPs. Typically, researchers utilize thresholds of 0.1 or 0.05 (Kanaka et al. (2023)).

Hardy-Weinberg Equilibrium (HWE)

The Hardy-Weinberg rule from population genetics states that genetic variation (thus, allele and genotype frequencies) in a population will remain constant unless certain disturbing factors are introduced. his implies that once we know the allele frequencies for \(p\) and \(q\), the genotype frequencies will be defined as \(p^2\), \(2pq\), and \(q^2\).

Let’s say the frequency of allele A (\(p\) in the equation) is 0.4, and that of allele B (\(q\)) is 0.6. This means for the H-W scenario the genotype frequencies will be 0.16 for AA, 0.48 for AB, and 0.36 for BB. In a population of e.g. 1000 individuals with the mentioned allele frequencies we expect to see 160 AA, 480 AB, and 360 BB individuals. Of course, we rarely see exact H-W distributions in real populations. The question then becomes, what is the difference between the expected H-W and observed proportions? There are typically two reasons why a SNP is not in HWE:

  • Genotyping error for this SNP
  • Mating is not random

In reality, mating is not random, which complicates excluding SNPs based on HWE. It is generally recommended to exclude an SNP only when HWE is significantly violated, such as when the p-value for a statistical test (e.g., assessing whether the data follow a binomial distribution) is extremely low, like p-value p<10e−10

HWE and binomial distribution

Why is HWE connected to the binomial distribution? What is the ground theory behind HWE?

You can find a clear explanation of HWE at (Lachance (2016)).

Heterozygosity Rate

We filter individuals based on their heterozygosity rate, which is similar to HWE but applied at the individual level. If an individual has an unusually high number of heterozygous calls (A/B) and no homozygous calls (A/A or B/B), or vice versa, it could indicate data issues.

Bibliography

Kanaka, K. K., Nidhi Sukhija, Rangasai Chandra Goli, Sanjeev Singh, Indrajit Ganguly, S. P. Dixit, Aishwarya Dash, and Anoop Anand Malik. 2023. “On the Concepts and Measures of Diversity in the Genomics Era.” Current Plant Biology 33 (January): 100278. https://doi.org/10.1016/j.cpb.2023.100278.
Lachance, J. 2016. “Hardy–Weinberg Equilibrium and Random Mating.” In Encyclopedia of Evolutionary Biology, edited by Richard M. Kliman, 208–11. Oxford: Academic Press. https://doi.org/10.1016/B978-0-12-800049-6.00022-6.