Quality Control: Relatedness

Important notes for this notebook

As we have already mentioned, quality control (QC) analyses are crucial to ensure the reliability and validity of the results. In this section, we will look at relatedness, some ways of estimating it, and how it can be used to check the quality of the data.

Learning outcomes

  • Discuss identity by descent (IBD) and its estimators in PLINK and PLINK2
  • Detect and categorize related individuals
  • Identify unusual relatedness and reason about what can cause it

How to make this notebook work

In this notebook, we will use both R and Bash. Remember to change the kernel whenever you switch from one language to the other (Kernel --> Change Kernel), as indicated by the language icons. We will first run Bash commands.

Choose the Bash kernel

Relatedness refers to the presence of genetically related individuals in a study sample, which can introduce confounding factors into association analyses. Since most statistical methods assume that samples are independent, including related individuals, like siblings, violates this assumption. Addressing relatedness is crucial as a quality control step to ensure that genomes in the dataset are no more similar than expected for unrelated individuals. This process is essential to preserve the validity of the study’s findings.

While all humans share some genetic relatedness, the degree of relatedness varies among individuals. In genome-wide association studies (GWAS), these variations must be accounted for to avoid biasing the statistical associations between genotypes and phenotypes. Cryptic relatedness (genetic relatedness of individuals not expected to be in a family relationship) can interfere with the association analysis (Voight and Pritchard 2005).

In a family-based study (e.g., parent-offspring), you do not need to remove related pairs, but the statistical analysis should account for family relatedness. For a population-based study, however, we suggest estimating kinship with the KING estimator (Manichaikul et al. 2010) and excluding one individual from each pair whose kinship coefficient exceeds a threshold of 0.06.
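A minimal sketch of how such a threshold could be applied with PLINK2. The file prefixes Data/example and Results/GWAS4/example are hypothetical placeholders, not the dataset used in this notebook. The options --make-king-table (writes a pairwise kinship table with the extension .kin0) and --king-cutoff (writes the samples to keep and to remove as .king.cutoff.in.id and .king.cutoff.out.id) are PLINK2 options; the commands are skipped if plink2 or the input file is not found.

```bash
# Hypothetical file prefixes; replace them with the dataset you are analysing.
bfile="Data/example"
out="Results/GWAS4/example"
cutoff=0.06   # kinship threshold suggested in the text

if command -v plink2 >/dev/null 2>&1 && [ -f "$bfile.bed" ]; then
  # Pairwise KING kinship estimates, one row per pair of samples
  plink2 --bfile "$bfile" --make-king-table --out "$out"
  # Keep a subset of samples in which no pair has kinship above the cutoff
  plink2 --bfile "$bfile" --king-cutoff "$cutoff" --out "$out"
else
  echo "plink2 or $bfile.bed not found; nothing was run" >&2
fi
```

The resulting .king.cutoff.in.id file could then be passed to --keep in later steps.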

Unlike the IBD estimator \(\hat{\pi}\) (pi_hat), which assumes that individuals come from a homogeneous population, KING derives a kinship estimate for a pair of individuals without reference to population allele frequencies. Instead, it takes the number of loci where both individuals are heterozygous, subtracts twice the number of loci where they are opposite homozygotes, and normalizes the result by the total number of heterozygous loci in the two individuals. This makes KING robust to population structure and useful for checking genotyping quality.
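The calculation described above can be sketched on toy data. Following Manichaikul et al. (2010), the estimate is \(\hat{\phi}_{ij} = (N_{Aa,Aa} - 2N_{AA,aa}) / (N_{Aa}^{(i)} + N_{Aa}^{(j)})\), where \(N_{Aa,Aa}\) counts loci where both individuals are heterozygous, \(N_{AA,aa}\) counts loci where they are opposite homozygotes, and the denominator sums the heterozygous loci of each individual. The function name king_kinship and the genotype encoding (one character per SNP, 0, 1 or 2 copies of an allele) are assumptions made for illustration; this is not how PLINK2 implements the estimator.

```bash
# Hypothetical helper: KING-robust kinship for two samples.
# $1 and $2 are genotype strings (characters 0, 1, 2), same SNP order in both.
king_kinship() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    for (k = 1; k <= n; k++) {
      x = substr(a, k, 1); y = substr(b, k, 1)
      if (x == 1) het_a++                      # heterozygous loci in sample 1
      if (y == 1) het_b++                      # heterozygous loci in sample 2
      if (x == 1 && y == 1) both_het++         # both heterozygous
      if ((x == 0 && y == 2) || (x == 2 && y == 0)) opp_hom++   # opposite homozygotes
    }
    if (het_a + het_b == 0) { print "NA"; exit }   # no heterozygous loci: undefined
    printf "%.4f\n", (both_het - 2 * opp_hom) / (het_a + het_b)
  }'
}

king_kinship 1111 1111   # identical genotypes, as for a duplicated sample
king_kinship 1102 1120   # many opposite homozygotes give a negative estimate
```

A duplicated sample gives the maximum value of 0.5, while pairs with many opposite homozygotes give negative estimates.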

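Thus, the expected values for KING are 0.5 for duplicates or monozygotic twins, 0.25 for first-degree relatives (parent-offspring, full siblings), 0.125 for second-degree relatives, and 0.0625 for third-degree relatives.

KING values need to be multiplied by 2 to match the corresponding IBD proportion \(\hat{\pi}\). For example, a kinship of 0.25 corresponds to \(\hat{\pi} = 0.5\).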
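As an illustration of this conversion, the sketch below labels a kinship value and prints the matching \(\hat{\pi}\). The function name classify_kinship is hypothetical, and the cutoffs 0.354, 0.177, 0.0884 and 0.0442 are the conventional boundaries between relationship degrees given in the KING and PLINK2 documentation, not values taken from this notebook.

```bash
# Hypothetical helper: label a KING kinship value and convert it to pi_hat.
classify_kinship() {
  awk -v k="$1" 'BEGIN {
    # Conventional KING boundaries between relationship degrees (assumption)
    if      (k > 0.354)  label = "duplicate/MZ twin"
    else if (k > 0.177)  label = "1st-degree"
    else if (k > 0.0884) label = "2nd-degree"
    else if (k > 0.0442) label = "3rd-degree"
    else                 label = "unrelated"
    printf "%s\tpi_hat=%.3f\n", label, 2 * k   # pi_hat is twice the kinship
  }'
}

classify_kinship 0.25    # 1st-degree, pi_hat=0.500
classify_kinship 0.125   # 2nd-degree, pi_hat=0.250
```

Estimated values scatter around the expected ones, which is why ranges rather than exact values are used to assign a degree.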

Bash

We use ln -sf to link the data folder and mkdir -p to create a directory for the output files.

ln -sf ../Data
mkdir -p Results/GWAS4

References

Manichaikul, Ani, Josyf C. Mychaleckyj, Stephen S. Rich, Kathy Daly, Michèle Sale, and Wei-Min Chen. 2010. “Robust Relationship Inference in Genome-Wide Association Studies.” Bioinformatics 26 (22): 2867–73. https://doi.org/10.1093/bioinformatics/btq559.
Voight, Benjamin F., and Jonathan K. Pritchard. 2005. “Confounding from Cryptic Relatedness in Case-Control Association Studies.” PLoS Genetics 1 (3): e32.