Polygenic scores II

Important notes for this notebook

This notebook provides data for a toy quantitative trait. You will run a basic polygenic risk score (PRS) analysis and explore the results, much as we did in the previous notebook.

Learning outcomes

Discuss and choose the PRS equation
Discuss polygenic scores and their biases

How to make this notebook work

We will use both R and Bash. Remember to change the kernel (Kernel --> Change Kernel) whenever you switch from one language to the other; each switch is indicated by the language’s image. We will run Bash commands first.

Choose the Bash kernel

PRSice analysis II

We will work with a new, preprocessed, simulated dataset that has already undergone quality control. The analysis uses summary statistics from a well-powered base GWAS (in this case, for height) and a target dataset of European individuals in PLINK format. In this tutorial, we will include covariates and principal components (PCs) in the polygenic score calculation.

Let’s create a folder for the output files.

mkdir -p Results/GWAS7

# Create a symbolic link to the data folder
ln -sf ../Data

Stop - Read - Solve

You have already run the PRSice software for binary traits. Now, it’s your turn to do the same for height. What type of phenotype is this?

The data:

Height.QC.gz: post-QC summary statistics
EUR.QC: prefix of the PLINK files for the target sample
EUR.height: file containing the height measurements (the phenotype)
EUR.covariate: this file contains the principal components and sex as covariates. Since PRSice only accepts a single covariate file, you may need to merge the .cov and .eigenvec files if you used PLINK for quality control.
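
The merge mentioned for EUR.covariate can be sketched in Bash on toy data. This is only an illustration: the file names, the sample IDs, the values, and the assumption that the eigenvector file has no header and carries two PCs are all hypothetical, so adapt the column handling to your own files.

```shell
# Toy inputs in a temporary folder (all names and values are hypothetical).
cov_dir=$(mktemp -d)

# Covariate file with a header line.
cat > "$cov_dir/toy.cov" <<'EOF'
FID IID Sex
F1 I1 1
F2 I2 2
F3 I3 1
EOF

# Eigenvector file, assumed here to have no header: FID, IID, then two PCs.
cat > "$cov_dir/toy.eigenvec" <<'EOF'
F2 I2 0.02 0.11
F1 I1 -0.01 0.05
EOF

awk '
    # First file (eigenvectors): store the PCs for each FID/IID pair.
    NR == FNR { pcs[$1 FS $2] = $3 FS $4; next }
    # Second file (covariates): extend the header line.
    FNR == 1 { print $0, "PC1", "PC2"; next }
    # Keep only the samples that also have PCs.
    ($1 FS $2) in pcs { print $0, pcs[$1 FS $2] }
' "$cov_dir/toy.eigenvec" "$cov_dir/toy.cov" > "$cov_dir/toy.covariate"

cat "$cov_dir/toy.covariate"
```

Matching on both FID and IID keeps the rows aligned even when the two files list the samples in a different order; sample F3 is dropped because it has no PCs.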

Please apply the following filters to the base GWAS:

Filter out SNPs with MAF < 0.01 in the GWAS summary statistics, using the information in the MAF column
Filter out SNPs with INFO < 0.8 in the GWAS summary statistics, using the information in the INFO column
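
To see what these two filters do, here is a minimal Bash sketch on a toy table. Only the MAF and INFO column names come from the exercise; the file name, the other columns, and every value are hypothetical. The real Height.QC.gz file would first need to be decompressed (for example with gunzip -c) and may have a different layout.

```shell
# Toy summary statistics in a temporary folder (hypothetical values).
gwas_dir=$(mktemp -d)
cat > "$gwas_dir/toy.sumstats" <<'EOF'
SNP P INFO MAF
rs1 0.01 0.95 0.20
rs2 0.20 0.70 0.30
rs3 0.50 0.90 0.005
rs4 0.03 0.80 0.01
EOF

awk '
    # Header line: remember the position of each column name, then print it.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    # Data lines: keep SNPs with MAF >= 0.01 and INFO >= 0.8.
    $(col["MAF"]) >= 0.01 && $(col["INFO"]) >= 0.8
' "$gwas_dir/toy.sumstats" > "$gwas_dir/toy.filtered"

cat "$gwas_dir/toy.filtered"
```

Looking the columns up by name rather than by position means the same command keeps working if the columns appear in a different order. In this toy table, rs2 is removed for low INFO and rs3 for low MAF, while rs4 sits exactly on both thresholds and is kept.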

Adjust the code from the previous notebooks to run PRSice software on the new dataset. Check out the user manual if you need extra help: https://choishingwan.github.io/PRSice/.
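
As a starting point, the following Bash sketch assembles one possible PRSice call and prints it without running it. The binary name, the Data/ paths, and the output prefix are assumptions about your setup, not values given in this notebook. The flags for the effect-size column are deliberately left out, because the notebook does not state which statistic the summary file carries; check the header of the file and the manual before adding them.

```shell
# Hypothetical locations: adjust them to your own setup.
prsice_bin="PRSice_linux"
data_dir="Data"
out_prefix="Results/GWAS7/EUR"

# Print the command; execute it only when RUN_PRSICE=1 is set.
prsice_preview() {
    echo "$@"
    if [ "${RUN_PRSICE:-0}" = "1" ]; then
        "$@"
    fi
}

# --binary-target F marks the phenotype as quantitative.
# --base-maf and --base-info apply the MAF and INFO filters to the base GWAS.
prsice_preview "$prsice_bin" \
    --base "$data_dir/Height.QC.gz" \
    --target "$data_dir/EUR.QC" \
    --binary-target F \
    --pheno "$data_dir/EUR.height" \
    --cov "$data_dir/EUR.covariate" \
    --base-maf MAF:0.01 \
    --base-info INFO:0.8 \
    --out "$out_prefix"
```

Printing the command first makes it easy to check the paths before committing to a run; setting RUN_PRSICE=1 in the same cell executes it.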

Before computing the PRS, we recommend using the qqman library in R to draw a Manhattan plot and a QQ plot of the base GWAS results, so that you can assess the distribution of the association signals.

# Write R code here
# Setup to avoid long messages and plot on screen
options(warn=-1)
options(jupyter.plot_mimetypes = 'image/png')

# Load GWAS package qqman
suppressMessages(library("qqman"))

# Write PRSice command here

Stop - Read - Solve

Once you have the PRS results, answer the following questions:

Which P-value threshold generated the “best-fit” PRS?
How much phenotypic variation does the “best-fit” PRS explain?

Hint: Check the <PREFIX>.summary file.
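
One way to read such a file from Bash is sketched below on a toy table. The column names Threshold and PRS.R2 are assumed to match the header of your .summary file, and the file name and values are hypothetical; check the first line of your own file before relying on them.

```shell
# Toy tab-separated summary table (hypothetical values).
sum_dir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    Phenotype Threshold PRS.R2 Full.R2 \
    Height 0.05 0.10 0.30 > "$sum_dir/toy.summary"

awk -F '\t' '
    # Header line: remember the position of each column name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data lines: report the best threshold and the variance explained by the PRS.
    { printf "threshold=%s prs_r2=%s\n", $(col["Threshold"]), $(col["PRS.R2"]) }
' "$sum_dir/toy.summary"
```

In this toy table, the reported PRS.R2 of 0.10 would mean that the PRS alone explains 10% of the phenotypic variance.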

# Write your answer here

Stop - Read - Solve

Since height differs between the sexes, let’s visualize the relationship between the “best-fit” PRS and the phenotype of interest, coloring the points by sex.

# Write your code for plotting here

Click to view answers

prs2-solutions.ipynb

Do you want to explore other post-GWAS analyses? Visit this GitHub repository for a step-by-step guide on eMAGMA, a framework that converts GWAS summary statistics into gene-level statistics by assigning risk variants to putative genes using tissue-specific eQTL information.

Copyright

CC-BY-SA 4.0 license