--bfile Results/GWAS4/1kG_MDS6 --extract Results/GWAS3/indepSNP.prune.in --pca 10 var-wts --out Results/GWAS4/PCA_1kg --silent plink
Population stratification
Correlation between PC loadings and effect sizes
We followed a similar approach from (Sohail et al. 2019) to examine GWAS stratification along different PCA axes of population structure. We began by performing a PCA on the genotype data from the 1000 Genomes Project used in the previous notebook. Then, we computed the correlation between the first 10 PCA loadings and the effect size estimates from the GWAS conducted on the Hapmap data. We visualized these PC-specific correlations to explore patterns of stratification along different axes of genetic variation. Additionally, it would be very useful to consider allele frequency differences across different populations when analyzing such stratification.
We use PLINK’s --pca var-wts
command on the set of independent variants (indepSNP.prune.in
) to obtain the PC loadings.
Switch to the R kernel.
Now we will plot the correlation between the loadings and the effect sizes for both GWAS results from GWAS5.
A. Summary statistics from --assoc
test
library(ggplot2)
# load the data
<- read.table("Results/GWAS5/assoc_results.assoc", head=TRUE)
results_as <- read.table(file="Results/GWAS4/PCA_1kg.eigenvec.var",header=FALSE)
data colnames(data) <- c("CHROM", "ID", "a1", "a2",
paste0("PC", 1:10)) # PC1 to PC10
# merge data
<- merge(results_as, data, by=2)
merged_tab # compute correlations
<- apply(merged_tab[, 14:23], 2, function(x) cor(merged_tab$OR, x, method="pearson", use="complete.obs"))
correlations
<- data.frame(
dataCor pc = 1:10,
corr = correlations)
ggplot(as.data.frame(dataCor), aes(pc, corr)) +
ylab(expression(rho(PC, OR))) +
geom_bar(stat='identity', position='dodge', col='black')+
xlab('PC') +
ylim(c(-0.1,0.1)) +
scale_x_continuous(breaks=seq(1,10,1), expand=c(0,0.1))+
theme(panel.grid.minor=element_blank(), panel.background = element_blank(),
axis.line = element_line(), axis.text= element_text(size=12),
axis.title= element_text(size=12))
B. Summary statistics from --logistic
test
Let’s calculate Pearson correlations using the logistic model results.
# load results logistic model
<- read.table("Results/GWAS5/logistic_results.assoc_2.logistic", head=TRUE)
results_log
<- merge(results_log, data, by=2)
merged_log # compute correlations
<- apply(merged_log[, 13:22], 2, function(x) cor(merged_log$OR, x, method="pearson", use="complete.obs"))
correlationsLog
<- data.frame(
dataCorLog pc = 1:10,
corr = correlationsLog)
ggplot(as.data.frame(dataCorLog), aes(pc, corr)) +
ylab(expression(rho(PC, OR))) +
geom_bar(stat='identity', position='dodge', col='black')+
xlab('PC') +
ylim(c(-0.1,0.1)) +
scale_x_continuous(breaks=seq(1,10,1), expand=c(0,0.1))+
theme(panel.grid.minor=element_blank(), panel.background = element_blank(),
axis.line = element_line(), axis.text= element_text(size=12),
axis.title= element_text(size=12))
There should be no correlation if we had corrected for population structure, as the effect size estimates wouldn’t be influenced by that. In contrast, for the model that doesn’t account for covariates, the correlations are higher than those observed in the logistic model. These correlations may help explain why the logistic model, which accounts for covariates, did not produce significant peaks in the Manhattan plot. Therefore, the lack of significant associations might not be due to being overly cautious but could rather indicate the true absence of strong signals. Further investigation is needed to better understand these results.
For interpretation, coloring the bar plot results based on the correlation between the PCs and the allele frequency difference between the populations represented in your target sample would help highlight any patterns or trends.
- Population structure can be addressed using principal components (PCs) and mixed models, though never perfectly. If your effect size estimates correlate with a PC, it is advisable to run a linear mixed model (e.g., BOLT-LMM). However, be cautious—these approaches mitigate, but do not entirely eliminate, the risk of residual confounding, so results should be interpreted carefully.
- Controlling for population structure is crucial because residual population stratification can introduce spurious associations. Researchers might perform meta-analyses to increase statistical power by combining data from multiple studies. However, population structure bias is unlikely to be consistent across studies. If not properly accounted for, this bias can distort results, leading to false associations rather than true genetic signals.