ln -sf ../Data
mkdir -p Results/GWAS4
Quality Control: Relatedness & Population Stratification
B. Population Stratification
Population stratification presents a significant source of systematic bias in GWAS, arising when subpopulations exhibit systematic differences in allele frequencies. Research indicates that even subtle degrees of population stratification can exist within a single ethnic population (Abdellaoui et al. 2013). Thus, testing and controlling for the presence of population stratification is an essential QC step.
The population structure (or in other words, the ancestral relationship of the populations) is a so-called confounding factor. This means that it affects both the dependent and independent variables, as shown in the figure below, where both the genotype and traits are influenced by population structure (e.g., the distribution of north and south European individuals in the PCA space and the height of those individuals).
{fig-align=“center”, width=400px}
Why is a bias introduced? Population structure can influence allele frequencies and produce false positives/negatives when doing association testing. Graphically, consider the example in the figure below. Case and control have minor allele frequencies of 1/6 and 1/8 (population 1) and 1/2, 7/12 (population 2). If you remove population structure, case and control have MAFs of 3/10, and 2/5, and those new values depend on how many samples you have from each population in the two conditions, and the MAFs of each population.
{fig-align=“center”, width=400px}
The same problem arises in population studies without Case-control categories. Imagine having a population of randomly sampled individuals, each from a different ethnicity (the blue and red minor alleles in the example below). The final group of individuals will have a different proportion of MAFs depending on the sampling of various ethnicities.
{fig-align=“center”, width=400px}
There are several methods to correct for population stratification (Price et al. 2010, price_principal_2006). Here, we illustrate a method integrated into PLINK: the multidimensional scaling (MDS) approach. MDS calculates the genome-wide average proportion of shared alleles between any pair of individuals to generate quantitative indices (components) of the genetic variation for each individual. The individual component scores can be visualized to identify groups of genetically similar individuals. For instance, in a genetic study including subjects from Asia and Europe, MDS analysis would reveal that Asians are genetically more similar to each other than to Europeans and Africans. The figure below shows another example of MDS using HapMap, Genome diversity project, and authors’ data:
To investigate which individuals the generated component scores deviate from in the target population, plotting the scores of the dataset under investigation and a population of known ethnic structure (e.g., HapMap/1KG data) is helpful: this step is called anchoring (Rietveld et al. 2013). This enables the researcher to obtain ethnic information on their data and to determine possible ethnic outliers. For example, in the figure above, if TSI (Tuscans from Italy) is the anchor population, one can hypothesize that the yellow dots might be ethnically similar (as in the example).
Outliers identified based on MDS analysis should be excluded from further analyses. Following their removal, a new MDS analysis must be conducted, and its primary components are utilized as covariates in association tests to correct for any residual population stratification within the population. The number of components to include depends on the population structure and sample size (usually 10-20).
The MDS from Cortellari et al. (2021) shows a distinct goat population outlier. The second axis is dominated by this outlier, obscuring structure in the other populations. Removing the outlier reveals a clearer structure among the remaining populations.
It is also possible to correct for relatedness (family structure). Should we also do it?
Analysis
We aim to merge the HapMap and 1000GP datasets, using 1000GP Phase I as the anchor for HapMap. Our goal is to check if we can identify the ethnicity of the HapMap data based on the ethnicities in the 1000GP dataset. There are several steps to ensure compatibility between the datasets, so stay with us!
1000GP data download
Here are some commands to download and convert the 1000GP data for GWAS analysis. You don’t need to run them, as we’ve already processed the data.
- 1000 Genomes Project - Phase I: genetic information for 629 individuals from various ethnic groups (>60GB). Phase III is now available, and we recommend using it for research purposes.
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
The data was converted from the vcf
file (Variant Call Format) to plink format (bim
, fam
, bed
):
plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes
The 1000 Genomes data downloaded above is rather large so the commands are not executable and are shown for reference only. To save time, we’ve provided the .bed
, .bim
and .fam
files in the Data
folder.
Let’s unzip the files and see how many samples we have.
unzip -o Data/1000genomes.zip -d Results/GWAS4
# count lines in fam
wc -l Results/GWAS4/1000genomes.genotypesA.fam
Archive: Data/1000genomes.zip
inflating: Results/GWAS4/1000genomes.genotypesA.bed
inflating: Results/GWAS4/1000genomes.genotypesA.bim
inflating: Results/GWAS4/1000genomes.genotypesA.fam
inflating: Results/GWAS4/1000genomes.genotypesA.log
inflating: Results/GWAS4/1000genomes.genotypesA.nosex
37 Results/GWAS4/1000genomes.genotypesA.fam
We have a subset of 37 individuals. Now, let’s explore the bim
file.
cat Results/GWAS4/1000genomes.genotypesA.bim | head -5
1 rs112750067 0 10327 C T
1 . 0 11508 A G
1 . 0 12783 G A
1 . 0 13116 G T
1 . 0 14933 A G
One should note that the file 1000genomes.genotypes.bim
contains SNPs without an rs-identifier (or Reference SNP cluster ID). The missing rs-identifiers (noted as .
) are not a problem for this tutorial. However, for good practice, we will assign unique identifiers to the SNPs (using available information):
plink --bfile Results/GWAS4/1000genomes.genotypesA --set-missing-var-ids @:#[b37]\$1,\$2 --make-bed --out Results/GWAS4/1000genomes.genotypesA_no_missing_IDs --silent
Now, let’s visualize the data to check the SNP names assigned. These are derived from the format @:#[b37]\$1,\$2
in the command above, which PLINK interprets as chromosome:locus[b37]Allele1,Allele2
:
# Show changes on the bim file
cat Results/GWAS4/1000genomes.genotypesA_no_missing_IDs.bim | head -5
1 rs112750067 0 10327 C T
1 1:11508[b37]A,G 0 11508 A G
1 1:12783[b37]A,G 0 12783 G A
1 1:13116[b37]G,T 0 13116 G T
1 1:14933[b37]A,G 0 14933 A G
Pretty neat, right?
QC on 1000GP data
As we covered in the GWAS3 notebook, it’s important to account for missingness, sex discrepancies, and minor allele frequency. We’ll apply standard QC thresholds to the 1000 GP data before merging it with HapMap data.
plink --bfile Results/GWAS4/1000genomes.genotypesA_no_missing_IDs --geno 0.02 \
--allow-no-sex --make-bed --out Results/GWAS4/1kG_MDS --silent
plink --bfile Results/GWAS4/1kG_MDS --mind 0.02 \
--allow-no-sex --make-bed --out Results/GWAS4/1kG_MDS2 --silent
plink --bfile Results/GWAS4/1kG_MDS2 --maf 0.05 --allow-no-sex --make-bed \
--out Results/GWAS4/1kG_MDS3 --silent
SNPs matching between datasets
N.B: Ensure that the datasets you want to merge share the same genomic build! Otherwise, you’ll need to include a liftover step.
We want to only consider SNPs that both datasets have in common. First, extract SNP names from the HapMap data and filter the 1000GP data to include only matching SNPs.
#Print out SNPs from the HapMap data
awk '{print$2}' Results/GWAS4/HapMap_3_r3_9.bim > Results/GWAS4/HapMap_SNPs.txt
#Extract the HapMap SNPs from the 1000GP data
plink --bfile Results/GWAS4/1kG_MDS3 --extract Results/GWAS4/HapMap_SNPs.txt \
--out Results/GWAS4/1kG_MDS4 --silent --make-bed
This is how part of the list of SNP names looks like:
cat Results/GWAS4/HapMap_SNPs.txt | head -5
rs3131972
rs3131969
rs1048488
rs12562034
rs12124819
Now we take the variants from the reduced 1000GP data, and go the other way around. We extract 1000GP variants from the HapMap data. In other words, the two extraction passages will intersect the SNPs. Below is the code to use the SNPs of the 1000GP data to reduce the HapMap data.
#Print out SNPs from the HapMap data
awk '{print$2}' Results/GWAS4/1kG_MDS4.bim > Results/GWAS4/1kG_MDS4_SNPs.txt
#Extract the HapMap SNPs from the 1000GP data
plink --bfile Results/GWAS4/HapMap_3_r3_9 --extract Results/GWAS4/1kG_MDS4_SNPs.txt --recode \
--out Results/GWAS4/HapMap_MDS --silent --make-bed
Look at the SNP names. Now, they are matching between the two bim
files.
Look at the two outputs a bit more carefully. Is there any problem?
Hint:
- compare the chromosome and position across the 2 datasets.
- compare the two alleles
head Results/GWAS4/HapMap_MDS.bim
1 rs3131969 0 744045 A G
1 rs12562034 0 758311 A G
1 rs4970383 0 828418 A C
1 rs4475691 0 836671 T C
1 rs1806509 0 843817 C A
1 rs28576697 0 860508 C T
1 rs3748595 0 877423 A C
1 rs13303118 0 908247 G T
1 rs1891910 0 922320 A G
1 rs3128097 0 970323 G A
head Results/GWAS4/1kG_MDS4.bim
1 rs3131969 0 754182 A G
1 rs12562034 0 768448 A G
1 rs4970383 0 838555 A C
1 rs4475691 0 846808 T C
1 rs1806509 0 853954 A C
1 rs28576697 0 870645 C T
1 rs3748595 0 887560 A C
1 rs13303118 0 918384 T G
1 rs1891910 0 932457 A G
1 rs3128097 0 980460 G A
Build matching
Genomic data is based on a reference genome, and our datasets use different human reference versions. Since the reference genome improves over time, SNP positions may differ between datasets from different versions.
We extract SNP names and positions from the HapMap data and align the 1000GP data to match these SNPs using the --update-map
option in PLINK.
#Extract the HapMap variant coordinates
awk '{print$2,$4}' Results/GWAS4/HapMap_MDS.map > Results/GWAS4/buildhapmap.txt
This is how the list of SNPs look like:
cat Results/GWAS4/buildhapmap.txt | head -5
rs3131969 744045
rs12562034 758311
rs4970383 828418
rs4475691 836671
rs1806509 843817
We run PLINK to update the 1000GP variant coordinates based on HapMap, ignoring the warning about non-ascending positions:
plink --bfile Results/GWAS4/1kG_MDS4 --update-map Results/GWAS4/buildhapmap.txt --make-bed \
--silent --out Results/GWAS4/1kG_MDS5
Warning: Base-pair positions are now unsorted!
Merging datasets and performing MDS
Before merging the HapMap and 1000 Genomes datasets, we ensure compatibility through 3 steps:
- Verify the reference genome is compatible in both datasets.
- Align SNP orientations (strand) across datasets.
- Remove SNPs that still differ after these steps.
The next steps are technical but ensure the datasets correspond correctly.
1. We’ve matched SNP positions, but we also need to ensure the reference alleles align. Remember that most PLINK analyses consider the A1 allele (typically the minor allele) as the reference allele, which is logical when dealing exclusively with biallelic variants.
Below, we generate a list of SNPs ID and ‘reference alleles’ (corresponding to A1, column 5) from 1000GP.
#Extract variant coordinates and reference alleles from 1000GP data
awk '{print$2,$5}' Results/GWAS4/1kG_MDS5.bim > Results/GWAS4/1kg_ref-list.txt
How the list looks like:
head -5 Results/GWAS4/1kg_ref-list.txt
rs3131969 A
rs12562034 A
rs4970383 A
rs4475691 T
rs1806509 A
Then, we assign them to the HapMap data --reference-allele
option (aliases for --a1-allele
). We use &> /dev/null
to redirect many warning messages away from the screen (Warning: Impossible A1 allele assignment for variant rs11488462). This warning flags variants with genotype mismatches. To address these warnings, we will first check for strand orientation issues before excluding any problematic variants.
plink --bfile Results/GWAS4/HapMap_MDS --reference-allele Results/GWAS4/1kg_ref-list.txt --make-bed \
&> /dev/null --out Results/GWAS4/HapMap-adj
2. To resolve strand issues, we flip SNPs found in both datasets with complementary alleles (i.e. they were reported in opposite strands). We generate SNP lists (ID and alleles) for both datasets, identify unique SNPs, and visualize differences in allele reporting. If a SNP is unique but reports alleles differently, it will appear twice. Below are examples of SNPs with strand issues from the 1000GP and HapMap data:
awk '{print$2,$5,$6}' Results/GWAS4/1kG_MDS5.bim > Results/GWAS4/1kGMDS5_tmp
awk '{print$2,$5,$6}' Results/GWAS4/HapMap-adj.bim > Results/GWAS4/HapMap-adj_tmp
sort Results/GWAS4/1kGMDS5_tmp Results/GWAS4/HapMap-adj_tmp |uniq -u > Results/GWAS4/all_differences.txt
head -6 Results/GWAS4/all_differences.txt
rs10006274 C T
rs10006274 G A
rs1008660 A G
rs1008660 T C
rs10088098 C T
rs10088098 G A
How many of these differences are there?
wc -l Results/GWAS4/all_differences.txt
604 Results/GWAS4/all_differences.txt
Some of these differences might be might be due to strand issues.
Let’s look at this variant rs10006274
. Will it be flipped in the HapMap dataset?
The answer is yes! If we look at the reference allele in 1kg_ref-list.txt
, it shows C
. This means the SNP is on the forward strand in 1000GP (C/T) and on the reverse strand in HapMap (G/A).
grep rs10006274 Results/GWAS4/1kg_ref-list.txt
rs10006274 C
grep rs10006274 Results/GWAS4/all_differences.txt
rs10006274 C T
rs10006274 G A
Look at these other SNPs rs9614750
and rs10088098
.
- Which ones will have to be flipped?
- Is it always the same dataset that must be flipped?
# Write your code here
We will first print out the SNPs from the reference file to know which line corresponds to each dataset (since we know we used the 1000 Genomes Project as the reference). If there are strand issues, the SNP will need to be flipped in the dataset that wasn’t used as the reference.
grep rs9614750 Results/GWAS4/1kg_ref-list.txt
rs9614750 A
grep rs9614750 Results/GWAS4/all_differences.txt
rs9614750 A G
rs9614750 C G
For rs9614750
, the genotype is reported as A/G
in the 1000GP data, while in HapMap, it is C/G
. This discrepancy between the two datasets means that the SNP will need to be removed later.
grep rs10088098 Results/GWAS4/1kg_ref-list.txt
rs10088098 C
grep rs10088098 Results/GWAS4/all_differences.txt
rs10088098 C T
rs10088098 G A
For rs10088098
, the genotype is reported as C/T
in the 1000GP data, while in HapMap, it is G/A
. This means that PLINK will flip the allele, as they are complementary.
Now we take only the SNP names and give them to PLINK (option --flip
), together with the reference genome (option --reference-allele
):
## Flip SNPs for resolving strand issues.
# Print SNP-identifier and remove duplicates.
awk '{print$1}' Results/GWAS4/all_differences.txt | sort -u > Results/GWAS4/flip_list.txt
wc -l Results/GWAS4/flip_list.txt
302 Results/GWAS4/flip_list.txt
These are the SNP ID of non-corresponding SNPs (N=302) between the two files.
head -5 Results/GWAS4/flip_list.txt
rs10006274
rs1008660
rs10088098
rs1011297
rs1023098
302 Results/GWAS4/flip_list.txt
Apply the flipping option:
plink --bfile Results/GWAS4/HapMap-adj --flip Results/GWAS4/flip_list.txt --reference-allele Results/GWAS4/1kg_ref-list.txt \
--out Results/GWAS4/corrected_hapmap --silent --make-bed
Warning: Impossible A1 allele assignment for variant rs2581195.
Warning: Impossible A1 allele assignment for variant rs9614750.
There might still be problematic SNPs after flipping.
- Check if the expected allele flip occurred.
Hint: use grep
to find the rs10006274
and rs9614750
variants; then compare the alleles assignments in the HapMap file before and after flipping.
# Write your code here
PLINK attempts to flip all SNPs in the list, but an error occurs when A1 does not match the one in the reference.
Notice how this SNP has changed as we predicted:
grep rs10006274 Results/GWAS4/corrected_hapmap.bim
4 rs10006274 0 124165369 C T
grep rs10006274 Results/GWAS4/HapMap-adj.bim
4 rs10006274 0 124165369 G A
rs10006274
was flipped and has the same strand orientation in both datasets (same alleles in A1 and A2).
What happened to this one? PLINK attempts to resolve the mismatch by flipping the alleles but throws an error because the complementary alleles do not match the reference!
grep rs9614750 Results/GWAS4/corrected_hapmap.bim
22 rs9614750 0 44436371 G C
grep rs9614750 Results/GWAS4/HapMap-adj.bim
22 rs9614750 0 44436371 C G
You don’t need to flip the 1000GP data because the reference allele (A1) in the 1000GP data already matches the strand orientation used in the HapMap data.
3. After flipping SNPs, some differ in their alleles when comparing datasets to each other (e.g. SNP rs9614750
) and such SNPs must be removed.
We extract the SNPs from the corrected HapMap data and search for unique SNP (ID, A1, and A2), comparing them with those from the 1000GP data.
awk '{print$2,$5,$6}' Results/GWAS4/corrected_hapmap.bim > Results/GWAS4/corrected_hapmap_tmp
sort Results/GWAS4/1kGMDS5_tmp Results/GWAS4/corrected_hapmap_tmp | uniq -u > Results/GWAS4/uncorresponding_SNPs.txt
How many SNP missmatches are there?
wc -l Results/GWAS4/uncorresponding_SNPs.txt
24 Results/GWAS4/uncorresponding_SNPs.txt
This corresponds to 12 unique SNP IDs with mismatched information.
head Results/GWAS4/uncorresponding_SNPs.txt
rs11524965 T C
rs11524965 T G
rs12646999 G A
rs12646999 G T
rs17114359 C A
rs17114359 C T
rs17269854 C A
rs17269854 C T
rs2060424 G A
rs2060424 G C
We extract again the SNP IDs from the file above, and exclude them using the PLINK option --exclude
in both datasets
awk '{print$1}' Results/GWAS4/uncorresponding_SNPs.txt | sort -u > Results/GWAS4/SNPs_for_exclusion.txt
plink --bfile Results/GWAS4/corrected_hapmap --exclude Results/GWAS4/SNPs_for_exclusion.txt \
--out Results/GWAS4/HapMap_MDS3 --silent
--make-bed plink --bfile Results/GWAS4/1kG_MDS5 --exclude Results/GWAS4/SNPs_for_exclusion.txt \
--out Results/GWAS4/1kG_MDS6 --silent --make-bed
5. We can finally merge the data! We provide our dataset (-bfile
) and the one to add (--bmerge
option):
plink --bfile Results/GWAS4/corrected_hapmap --bmerge Results/GWAS4/1kG_MDS6.bed Results/GWAS4/1kG_MDS6.bim Results/GWAS4/1kG_MDS6.fam \
--make-bed --out Results/GWAS4/MDS_merge --silent --allow-no-sex
Perform MDS on HapMap-CEU data anchored by 1000 Genomes data.
MDS is typically performed on independent SNPs (pruned SNPs). We have previously identified such SNPs in this course, so we will extract only these SNPs for the analysis.
plink --bfile Results/GWAS4/MDS_merge --extract Results/GWAS3/indepSNP.prune.in \
--genome --out Results/GWAS4/MDS_merge --silent
Now, we can use PLINK to run MDS with the option ---mds-plot
specifying the number of components to calculate.
# mds-plot
plink --bfile Results/GWAS4/MDS_merge --read-genome Results/GWAS4/MDS_merge.genome \
--cluster --mds-plot 10 --out Results/GWAS4/MDS_merge --silent
For visualization purposes, we downloaded the 1000 Genomes Project (1000GP) panel, which includes individual names and their corresponding population information.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel -P Results/GWAS4 -q
To determine the population origins of HapMap individuals, we created a new file that combines the 1000GP panel information with the HapMap data, labeling the population HapMap entries as ‘OWN’.
awk '{print$1,$2,"OWN"}' Results/GWAS4/HapMap_MDS.fam > Results/GWAS4/popfile_own.txt
awk '{print$1,$1,$2}' Results/GWAS4/20100804.ALL.panel > Results/GWAS4/20100804.ALL.panel.txt
cat <(echo "FID IID POP") Results/GWAS4/20100804.ALL.panel.txt Results/GWAS4/popfile_own.txt > Results/GWAS4/popfile.txt
The 1000 Genomes Project (1000GP) categorizes individuals into major continental groups—such as Europeans (EUR), Africans (AFR), Americans (AMR), East Asians (EAS), and South Asians (SAS)—each comprising various subpopulations. We will use this population structure information to visualize and determine the clusters our samples belong to.
Switch to the R kernel.
Let’s visualize population stratification using the multidimensional scaling (MDS) results.
options(repr.plot.width = 12, repr.plot.height = 6)
suppressMessages(suppressWarnings(library(ggplot2)))
# Read data into R
data <- read.table(file="Results/GWAS4/MDS_merge.mds",header=TRUE)
pop <- read.table(file="Results/GWAS4/popfile.txt",header=TRUE)
datafile <- merge(data,pop,by=c("FID","IID"))
# Metapopulation information for the population in the 1000GP dataset
superpop <- c(
"JPT" = "ASN",
"ASW" = "AFR",
"CEU" = "EUR",
"CHB" = "ASN",
"CHD" = "ASN",
"YRI" = "AFR",
"LWK" = "AFR",
"TSI" = "EUR",
"MXL" = "AMR",
"GBR" = "EUR",
"FIN" = "EUR",
"CHS" = "ASN",
"PUR" = "AMR",
"OWN" = "UN"
)
# add metapopulation info to the table
datafile$SUPERPOP <- superpop[datafile$POP]
# Plotting
scatter.mds <- ggplot(datafile, aes(x=C1, y=C2, color=SUPERPOP)) +
geom_point(size=5, alpha=.4) +
scale_color_manual(values=c("AFR" = "red", "AMR" = "springgreen4", "ASN" = "gold", "EUR" = "blue", "UN" = "grey" )) +
xlab("MD Component 1") +
ylab("MD Component 2") +
labs(color="Superpop") +
theme_bw() +
theme(axis.title = element_text(size = 14), legend.text = element_text(size = 15),
axis.text = element_text(size = 14), legend.title=element_text(size=15))
show(scatter.mds)
The HapMap data clusters closely with European populations such as CEU, TSI, IBS, GBR, and FIN, confirming its European composition. Additionally, the absence of distant points indicates no outliers in the HapMap dataset (grey datapoints cluster together).
Exclude ethnic outliers
Let’s run the scripts to filter population stratification outliers for educational purposes (e.g. imagine a HapMap individual clusters in the lower-right corner with African populations).
Switch to the Bash kernel.
To identify and exclude ethnic outliers in the HapMap dataset, select individuals falling within specific cut-off thresholds. These thresholds should be determined based on the visualization of the first two dimensions from the multidimensional scaling (MDS) analysis.
What values would you select or apply in this context, and why? Hint: look at the plot above.
Preview of the MDS results:
head -3 Results/GWAS4/MDS_merge.mds | cut -f1-6 -d$'\t'
FID IID SOL C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1328 NA06984 0 -0.0218368 -0.00620683 -0.00464884 0.00121512 -0.000413235 0.00221093 0.0104188 -0.00199382 0.0173257 -0.0131241
1328 NA06989 0 -0.0208157 -0.00581126 -0.00638572 0.00503345 -0.00116741 -0.00369152 0.0120005 -0.0229035 -0.00238946 0.0381386
Based on the position of the target population’s cluster in the MDS plot, we will exclude individuals with MDS component 1 values less than 0 and MDS component 2 values less than -0.05.
The selection has to be done for the individuals we want to keep, so the area to be chosen has to be mirrored into >0 and >-0.05. We provide the MDS component values in columns 4 and 5 of the file (corresponding to the first 2 components), and extract the individuals using --keep
awk '{ if ($4 >0 || $5 >-0.05) print $1,$2 }' Results/GWAS4/MDS_merge.mds > Results/GWAS4/EUR_MDS_merge
plink --bfile Results/GWAS4/HapMap_3_r3_9 --keep Results/GWAS4/EUR_MDS_merge --make-bed \
> /dev/null --out Results/GWAS4/HapMap_3_r3_10
Creating covariates for GWAS analysis
The 10 MDS dimensions will be used as covariates in the association analysis in the next tutorial to correct for population stratification. The covariate file is created by removing column 3 (SOL
, optional metadata) from the MDS output file.
Why are we computing the covariates again?
plink --bfile Results/GWAS4/HapMap_3_r3_10 --extract Results/GWAS3/indepSNP.prune.in --genome \
> /dev/null
--out Results/GWAS4/HapMap_3_r3_10 plink --bfile Results/GWAS4/HapMap_3_r3_10 --read-genome Results/GWAS4/HapMap_3_r3_10.genome \
--mds-plot 10 --out Results/GWAS4/HapMap_3_r3_10_mds > /dev/null
--cluster
# Change the format of the .mds file into a plink covariate file.
awk '{print $1, $2, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' Results/GWAS4/HapMap_3_r3_10_mds.mds > Results/GWAS4/covar_mds.txt
The covar_mds.txt
file contains the covariates to adjust for residual population stratification.
head -5 Results/GWAS4/covar_mds.txt
FID IID C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1328 NA06989 0.0193503 -0.0481539 0.053743 -8.56966e-05 0.0124037 -0.00872553 0.0146577 -0.0266013 0.0064429 -0.0149862
1377 NA11891 0.0105154 -0.0241653 -0.00713715 -0.0276522 0.0272997 0.0044621 -0.00373271 -0.00561896 -0.0317101 0.0188078
1349 NA11843 -0.00217144 0.0120046 -0.00508388 0.0193558 -0.00738542 0.0285001 -0.00512723 0.00271516 0.0029428 0.000145191
1330 NA12341 -0.0124555 -0.0114929 0.0019162 -0.0252928 -0.0300722 -0.00411769 -0.0207568 -0.0122574 -0.017065 0.0353729
You have now successfully checked your data for relatedness population stratification. You filtered out the individuals with high relatedness and produced a summary of the population structure using the MDS projection. You will use the MDS coordinates as a proxy for the population structure you want your association testing to be corrected for.
In the next notebook on Association Testing, you will need the following files from the folder Results/GWAS4/
: - HapMap_3_r3_10
(the bfile, i.e., HapMap_3_r3_10.bed
, HapMap_3_r3_10.bim
, and HapMap_3_r3_10.fam
) - covar_mds.txt
which are the HapMap data and the MDS covariates highlighting the population stratification. Those are already available once you have been running this notebook.
Below is a cheat sheet of our new methods of QC. Again, it is important to remember that each method of QC should be justified, which will depend on the nature of the feature you are trying to analyze.
Step | Command | Function | Thresholds and explanation |
---|---|---|---|
6: Relatedness | –genome | Calculates identity by descent (IBD) of all sample pairs. | Use independent SNPs ( pruning) for this analysis and limit it to autosomal chromosomes only. |
- | –min | Sets threshold and creates a list of individuals with relatedness above the chosen threshold. This means that subjects who are related at, for example, pi‐hat >0.2 (i.e., second-degree relatives) can be detected. | Cryptic relatedness can interfere with the association analysis. If you have a family‐based sample (e.g., parent‐offspring), you do not need to remove related pairs but the statistical analysis should take family relatedness into account. However, for a population-based sample, we suggest using a pi‐hat threshold of 0.2, which is in line with the literature (Anderson et al., 2010; Guo et al., 2014). |
7: Population Stratification | –genome | Calculates identity by descent (IBD) of all sample pairs. | Use independent SNPs ( pruning) for this analysis and limit it to autosomal chromosomes only. |
- | –cluster –mds-plot k | Produces a k‐dimensional representation of any substructure in the data, based on IBS. | K is the number of dimensions, which needs to be defined (typically 10). This is an important step of the QC that consists of multiple proceedings but for reasons of completeness, we briefly refer to this step in the table. This step will be described in more detail in the section “Controlling for population stratification.” |