Estimating heritability using GCTA¶
Despite the great success of genome-wide association studies (GWAS), which have identified hundreds of SNPs conferring the genetic variation of human complex diseases and traits, the genetic architecture of human complex traits still remains largely unexplained. For most traits, the associated SNPs from GWAS only explain a small fraction of the heritability. There has not been any consensus on the explanation of the “missing heritability.” Possible explanations include a large number of common variants with small effects, rare variants with large effects, and DNA structural variation. In this tutorial we will apply GCTA
, a method of estimating the total amount of phenotypic variance captured by all SNPs.
Software:¶
In this exercise we will be using GCTA. You can see the documentation here. We will be also using plink
, you can see the documentation here.
How to make this notebook work¶
- In this notebook we will use both the
command line bash
commands andR
to setup the file folders. - Having to shift between two languages, you need to choose a kernel every time we shift from one language to another. A kernel contains a programming language and the necessary packages to run the course material. To choose a kernel, go on the menu on the top of the page and select
Kernel --> Change Kernel
, and then select the preferred one. We will shift between two kernels, and along the code in this notebook you will see a picture telling you to change kernel. The two pictures are below:
Shift to the Bash
kernel
Shift to the popgen course
kernel
- You can run the code in each cell by clicking on the run cell sign in the toolbar, or simply by pressing Shift+Enter. When the code is done running, a small green check sign will appear on the left side.
- You need to run the cells in sequential order to execute the analysis. Please do not run a cell until the one above is done running, and do not skip any cells
- The code goes along with textual descriptions to help you understand what happens. Please try not to focus on understanding the code itself in too much detail - but rather try to focus on the explanations and on the output of the commands
- You can create new code cells by pressing
+
in the Menu bar above or by pressing B after you select a cell.
Background material¶
- Yang et al 2011, the article about GCTA
- A Reference for GCTA
Data:¶
We will estimate the amount of variance explained by the SNPs in a GWAS dataset. You can find the data here:
GWAS_heritability/gwa.bim
GWAS_heritability/gwa.bed
GWAS_heritability/gwa.fam
GWAS_heritability/gwa.phen
We create the Results
folder and a link to the data
Shift to the Bash
kernel
ln -s ../../Data/GWAS/GWAS_heritability/
mkdir -p Results
Calculating the genetic relationship matrix¶
We will use plink to calculate the genetic relationship matrix (GRM) since it is faster than gcta. At the shell prompt, type:
plink --make-grm-gz --bfile GWAS_heritability/gwa --out Results/gwa
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/ (C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to Results/gwa.log. Options in effect: --bfile GWAS_heritability/gwa --make-grm-gz --out Results/gwa 385583 MB RAM detected; reserving 192791 MB for main workspace. 301676 variants loaded from .bim file. 1941 people (966 males, 975 females) loaded from .fam. 1941 phenotype values loaded from .fam. Using up to 63 threads (change this with --threads). Before main variant filters, 1941 founders and 0 nonfounders present. Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done. Total genotyping rate is 0.997015. 301676 variants and 1941 people pass filters and QC. Among remaining phenotypes, 988 are cases and 953 are controls. Relationship matrix calculation complete. Relationship matrix written to Results/gwa.grm.gz , and IDs written to Results/gwa.grm.id .
This will save the genetic relationship matrix in the zipped file gwa.grm.gz
. Try to read it into R
:
Shift to the popgen course
kernel
d <- read.table(gzfile('Results/gwa.grm.gz'))
names(d) <- c("sample1","sample2","variants","statistic")
head(d)
sample1 | sample2 | variants | statistic | |
---|---|---|---|---|
<int> | <int> | <int> | <dbl> | |
1 | 1 | 1 | 300386 | 1.0111770000 |
2 | 2 | 1 | 299996 | -0.0017383650 |
3 | 2 | 2 | 301282 | 0.9968653000 |
4 | 3 | 1 | 299965 | 0.0009196462 |
5 | 3 | 2 | 300861 | 0.0004230620 |
6 | 3 | 3 | 301251 | 1.0038210000 |
Q1) If you exclude the lines where an individual is compared to itself (column 1 is equal to column 2) what is the highest value in the GRM then?
Estimating variance components¶
Shift to the Bash
kernel
We can use gcta to estimate how much of the variance in the phenotype in gwa.phen is explained by the SNPs:
gcta64 --grm-gz Results/gwa --pheno GWAS_heritability/gwa.phen --reml --out Results/test
bash: gcta64: command not found
Q2) How much of the phenotypic variance (Vp) is explained by the genetic variance (V(G))?
Q3) Is this number larger or smaller than the narrow-sense heritability (h^2) of the trait?
Estimating variance components for groups of SNPs¶
The estimation of variance components can be used to answer questions about how much of the heritability is explained by different parts of the genome (for example different chromosomes or different functional annotations).
Q4) How much of the phenotypic variance can be explained by the genetic variants on chromosome 6? (You can use the --chr
flag in plink to build a GRM only using variants from a particular chromosome)
Q5) Does chromosome 6 contribute more to the heritability than would be expected? How many of the genetic variants in the data set are located on chr 6? (you can use the genetic map in gwa.bim to see the location of the variants).