1 What is a GWAS?¶

During the past two decades, there has been a growing interest in investigating the influence of genetic risk factors on variation in human behaviour. The technical and analytic tools needed to conduct genetic studies have become increasingly accessible. This increased accessibility offers great promise as researchers outside the field of genetics may bring new expertise to the field (e.g., more in‐depth knowledge of the nosology of psychiatric traits). However, performing genetic association studies in a correct manner requires specific knowledge of genetics, statistics, and (bio)informatics. This course aims to provide a guideline for conducting genetic analyses by introducing key concepts and by sharing scripts that can be used for data analysis.

1.1 Quick Review of Genetic Theory¶

We all carry two nuclear genomes (i.e. genomes located in the cell nucleus), one inherited from each of our two parents. Additionally, we have a small mitochondrial genome, assumed to be inherited exclusively from the mother, but on this course the term 'genome' refers to the nuclear genome.

The human genome is a sequence of about 3.2 billion nucleotides (also called base pairs or DNA letters: A, C, G, T) (see yourgenome.org) that is divided into separate physical pieces called chromosomes (see yourgenome.org). There are 22 autosomal (non-sex-related) chromosomes and two sex chromosomes (X chromosome and Y chromosome). Normally, humans have two copies of each autosome; individuals with one copy of X and one of Y are males, whereas individuals with two copies of X are females. Abnormal numbers of chromosomes (called aneuploidies) typically cause severe consequences or an early death if present in all cells of an individual. The most common non-lethal exception is Down syndrome (3 copies of chr 21). Mosaicism, where only some cells have an abnormal chromosome number, also exists, and abnormal chromosome numbers are often present in cancer cells.

There are three types of pairings that come up when we analyse genomes.

First, the DNA is most of the time a double-stranded molecule whose two strands (i.e. the two DNA molecules) are glued together by the chemical base pairings A-T and C-G. This base pairing is key to the copying mechanism of the DNA that is needed before any cell division (see yourgenome.org). The DNA molecules that are connected through base pairing carry exactly the same information, just written in complementary letters, i.e. A <-> T and C <-> G. To distinguish between the two DNA molecules, it has been agreed that one of the two DNA strands is called the forward strand (or positive strand) and the other the reverse strand (or negative strand). Thus, e.g., when the + strand contains base A, the corresponding base on the - strand is T, and vice versa.

Second, the two homologous chromosomes of an individual (e.g. paternal chr 13 with maternal chr 13, or in a male, maternal X and paternal Y) can be thought of as a pair. Thus, we say that the human genome consists of 22 autosomes + X + Y, but each individual has two copies of each homologous chromosome, so has 46 unique chromosomes that are divided into 23 pairs of homologous chromosomes.

Third, before any cell division each of the 46 unique chromosomes of an individual copies itself, and the two copies (called sister chromatids) are paired with each other physically to make the X-like shape that is often used to show chromosomes in pictures. Such a picture actually has 92 chromosomes, since each unique chromosome is duplicated in it (but we typically say that there are 46 replicated chromosomes rather than that there are 92 chromosomes). This pairing after copying is important in cell division so that the resulting cells get the correct set of chromosomes. In mitosis (ordinary cell division), each of the two new cells has one set of the 46 unique chromosomes. In meiosis, the gametes (sperm and eggs) are formed to have only one copy of each homologous chromosome and thus have 23 unique chromosomes. During meiosis, the process of recombination shuffles the homologous copies of the paternal and maternal chromosomes in such a way that each of the offspring's chromosomes will be a mixture of its grandparental chromosome segments.

Other important terms¶

Gene: The most obvious way in which genetic variation can affect phenotypes is through variation in how genes function. Genes are segments of DNA that code for proteins (see yourgenome.org), and variation in the physical structure of the protein or in the time and place where the protein is made can have phenotypic consequences. Therefore, we are very interested in how genetic variation can affect the function of genes, and a lot of this is still unknown. Protein-coding genes cover less than 2% of the whole human genome, but the remaining 98% affects the regulation of genes in many ways.

Locus (pl. loci): A continuous region of the genome is called a locus (plural loci). It can be of any size (e.g. a single nucleotide site of length 1 bp or a region of 10 million base pairs, 10 Mbp).

GWAS loci: Regions that include a clear statistical association with the phenotype of interest.

1.1.1 Genetic variants¶

At any one position of the genome (e.g. the nucleotide site at position 13,475,383 of chromosome 1, denoted by chr1:13,475,383), variation can exist between the genomes in the population. For example, my paternal chromosome can have a base A and my maternal chromosome can have a base G (on the + strand of the DNA) at that position. Such a one-nucleotide variation is called a single-nucleotide variant (SNV) and the two versions are called alleles. So in the example case, I would be carrying both an allele A and an allele G at that SNV, whereas you might be carrying two copies of allele A at the same SNV. My genotype would be AG and yours AA. An individual having different alleles on his/her two genomes is heterozygous at that locus, and an individual having two copies of the same allele is homozygous at that locus. If neither of the alleles is very rare in the population, say, the minor allele frequency (MAF) is > 1% in the population, the variant is called a polymorphism, specifically a single-nucleotide polymorphism (SNP). There are over 10 million SNPs in the human genome. More complex genetic variation includes structural variation (SV), such as copy number variants (CNVs), which include duplications or deletions of genomic regions, and rearrangements of the genome, such as inversions or translocations of DNA segments (see yourgenome.org).
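As a small illustration of these terms, the sketch below classifies a handful of made-up genotypes at one biallelic SNV as homozygous or heterozygous and computes the allele frequencies and the MAF. The genotype strings and function names are hypothetical and chosen only for this example.

```python
from collections import Counter


def zygosity(genotype):
    """Return 'homozygous' if both alleles are the same, else 'heterozygous'."""
    allele_1, allele_2 = genotype
    return "homozygous" if allele_1 == allele_2 else "heterozygous"


def allele_frequencies(genotypes):
    """Return {allele: frequency} over all 2 * n chromosomes in the sample."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele (0.0 if only one allele is seen)."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())


# Hypothetical genotypes of five individuals at one SNV (illustrative data only).
genotypes = ["AG", "AA", "AA", "GG", "AG"]

for genotype in genotypes:
    print(genotype, zygosity(genotype))
print(allele_frequencies(genotypes))
print(minor_allele_frequency(genotypes))
```

In this toy sample there are 10 chromosomes, 6 carrying A and 4 carrying G, so G is the minor allele.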

Figure 1: SNPs are DNA differences at a specific location (image source)

A predefined set of 500,000 - 1,000,000 SNPs can be measured reliably and fairly cheaply (< 50 euros/sample) by DNA microarrays, which has been the single most important factor making GWAS possible. On this course, we consider SNPs as the canonical type of genetic variation. Typically, the SNPs are biallelic, i.e., there are only two alleles present in the population and this is what we assume in the following. In principle, however, all four possible alleles of a SNP could be present in the population.

Ambiguous SNPs¶

If the two alleles of a SNP are either (C,G) or (A,T), we call the SNP ambiguous because the strand information must be available (and correct) in order to make sense of the genotypes at this SNP. This is because allele C on the + strand would be called allele G on the - strand, and if this SNP is reported with respect to different strands in different studies, the results get mixed up. The same problem does not happen with the other SNPs, e.g., a SNP with alleles A,C, because this SNP has alleles T,G on the opposite strand and we can unambiguously match A to T and C to G between the studies. Note that we can resolve most ambiguous SNPs reliably based on the allele frequencies, as long as the minor allele frequency is not close to 50%. If we are combining several studies, we should always start by aligning the alleles across the studies and then plotting the allele frequencies of the studies against each other, in order to check that the frequencies indeed match.
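The strand logic described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any particular tool; the names `COMPLEMENT`, `flip_strand` and `is_ambiguous` are hypothetical.

```python
# Base pairing between the + and - strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Return the alleles as they would be reported on the opposite strand."""
    return tuple(COMPLEMENT[allele] for allele in alleles)


def is_ambiguous(allele_1, allele_2):
    """True for A/T and C/G SNPs, whose two alleles are complements of each other."""
    return COMPLEMENT[allele_1] == allele_2


print(is_ambiguous("C", "G"))   # True: flipping the strand gives (G, C), the same allele set
print(is_ambiguous("A", "C"))   # False: flipping the strand gives (T, G), a different allele set
print(flip_strand(("A", "C")))  # ('T', 'G')
```

For an ambiguous SNP, flipping the strand yields the same set of letters, which is why the strand cannot be inferred from the allele names alone.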

1.1.2 Some catalogues of genetic variation¶

A large part of genetics research over the last 30 years has been driven by international projects aiming to catalogue genetic variation in the public domain.

Database	Year	Description
The Human Genome Project	1990-2003	Established a first draft of a human genome sequence
The HapMap project	2002-2009	Studied the correlation structure of the common SNPs
The 1000 Genomes project	2008-2015	Expanded HapMap to genome sequence information across the globe and currently remains a widely used reference for global allele frequency information. The 1000G project was able to characterize common variation well in different populations, but missed many rare variants of single individuals because the costs of very accurate sequencing were too high. The tremendous impact of the 1000G project stems from the fact that everyone can download the individual-level genome data of the 1000G samples from the project's website and use it in their own research.
Exome Aggregation Consortium (ExAC)	2014-2016	Concentrated only on the protein-coding parts of the genome, the so-called exons, which make up less than 2% of the genome, and was able to provide accurate sequence data for the exomes of over 60,000 individuals. This effort has been particularly important for the medical interpretation of rare variants seen in clinics that diagnose patients with severe disease. ExAC provides summary-level information through a browser and downloads, but individual-level data cannot be downloaded.
Genome Aggregation Database (gnomAD)	2016-2020	Expands the ExAC database and also includes additional whole-genome sequencing information. It is the current state of the art among the public genome variation databases.

1.2 What is a genome-wide association study?¶

Let's look at some recent examples of GWAS. The two main types of GWAS study either quantitative traits or disease phenotypes.

Example 1 QT-GWAS¶

GWAS on body-mass index (BMI) by Locke et al. (2015) combined data of 339,000 individuals from 125 studies around the world to study the association of SNPs and BMI. It highlighted 97 regions of the genome with convincing statistical association with BMI. Pathway analyses provided support for a role of the central nervous system in obesity susceptibility and implicated new genes and pathways related to synaptic function, glutamate signalling, insulin secretion/action, energy metabolism, lipid biology and adipogenesis.

Figure 2: A Manhattan plot that shows the -log10 P-value of each SNP tested in the GWAS of BMI (Locke et al. 2015). Manhattan plots will be explained later in the course; for now, the idea is that after establishing a genome-wide significance level at P=5e-8 (which is equivalent to -log10(P) = 7.3), we can determine which variants are associated with the studied phenotype. Here, previously known loci are in blue, new findings are in red, and each locus is named by a nearby gene (but that gene is not necessarily causal).
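The correspondence between P=5e-8 and -log10(P) = 7.3 in the caption above can be checked directly. The sketch below also shows one common way to motivate the 5e-8 level, as a Bonferroni correction of a 0.05 error rate for roughly one million independent tests; that number of tests is a conventional approximation and an assumption of this example, not something stated in the caption.

```python
import math

alpha = 0.05                     # desired family-wise error rate
n_independent_tests = 1_000_000  # assumed, conventional approximation

# Bonferroni-style genome-wide significance threshold.
threshold = alpha / n_independent_tests
print(threshold)

# Height of the significance line on the -log10(P) axis of a Manhattan plot.
print(round(-math.log10(threshold), 2))


def is_genome_wide_significant(p_value, threshold=5e-8):
    """True if the P-value passes the genome-wide significance level."""
    return p_value < threshold


print(is_genome_wide_significant(3e-9))
print(is_genome_wide_significant(1e-5))
```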

Example 2 Disease GWAS¶

GWAS on migraine by Gormley et al. (2016) combined genetic data on 60,000 cases (individuals with migraine) and 315,000 controls (individuals with no known migraine) originating from 22 studies. Genetic data were available on millions of genetic variants. At each variant, the genotype distributions of cases and controls were compared. 38 regions of the genome showed a convincing statistical association with migraine. Downstream analyses combined the genes into pathways and cell types and highlighted enrichment of signals near genes that are active in the vascular system.
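To make the idea of comparing cases and controls at a single variant concrete, here is a minimal sketch of a basic allelic chi-square test. It is not the analysis used by Gormley et al., which is not described here; the genotype counts are invented, the function names are hypothetical, and real GWAS typically use regression models that also adjust for covariates.

```python
import math


def allele_counts(n_aa, n_ab, n_bb):
    """Convert genotype counts (AA, AB, BB) into allele counts (A, B)."""
    return 2 * n_aa + n_ab, 2 * n_bb + n_ab


def chi_square_2x2(table):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical genotype counts (AA, AB, BB) at one variant, for illustration only.
cases = allele_counts(300, 500, 200)
controls = allele_counts(400, 450, 150)

statistic = chi_square_2x2([cases, controls])
print(cases, controls)
print(statistic, p_value_1df(statistic))
```

In a GWAS, a test of this kind is repeated at every variant, which is why the significance threshold has to account for the number of tests.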

Figure 3: GWAS study on migraines (Gormley et al. 2016)

Important GWAS Terms:¶

Monogenic phenotype is determined by a single gene/locus.
Oligogenic phenotype is influenced by a handful of genes/loci.
Polygenic phenotype is influenced by many genes/loci.
Complex trait is a (quantitative) phenotype that is not monogenic. Typically polygenic and also influenced by many environmental factors.
Common disease is a disease/condition that is common in the population (say, prevalence of 0.1% or more). Examples: multiple sclerosis (MS; prevalence in the order of 0.1%), schizophrenia ($\sim 1\%$) or Type 2 diabetes ($\sim 10\%$).
Common variant has frequency of at least 1% (also 5% is used as the threshold).
Low-frequency variant has frequency of at least 0.1% and lower than a common variant.
Rare variant has frequency lower than a low-frequency variant.
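The three frequency classes defined above can be written as a small helper. This is only a sketch of the definitions in the list; the function name and parameter names are hypothetical, and the default thresholds (1% and 0.1%) are the ones given above, with 5% available as an alternative threshold for common variants.

```python
def classify_variant(maf, common_threshold=0.01, low_frequency_threshold=0.001):
    """Classify a variant as 'common', 'low-frequency' or 'rare' by its MAF."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie between 0 and 0.5")
    if maf >= common_threshold:
        return "common"
    if maf >= low_frequency_threshold:
        return "low-frequency"
    return "rare"


print(classify_variant(0.20))    # common
print(classify_variant(0.005))   # low-frequency
print(classify_variant(0.0001))  # rare

# With the stricter 5% definition of a common variant:
print(classify_variant(0.02, common_threshold=0.05))  # low-frequency
```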

GWAS have shown us that, very generally, complex traits and common diseases are highly polygenic, and many common variants with only small effects influence these phenotypes. We don't yet know which are the exact causal variants for each phenotype because of the correlation structure among genetic variants. We also don't yet know very accurately how rare variants affect each phenotype because that requires very large sample sizes interrogated by genome sequencing techniques, not only by SNP arrays.

1.3 Overview of GWAS Steps¶

The aim of genome‐wide association studies (GWAS) is to identify single nucleotide polymorphisms (SNPs) whose allele frequencies vary systematically as a function of phenotypic trait values (e.g., between cases with schizophrenia and healthy controls, or between individuals with high vs. low scores on neuroticism). Identification of trait‐associated SNPs may subsequently reveal new insights into the biological mechanisms underlying these phenotypes. Technological advancements allow investigation of the impact of large numbers of SNPs distributed throughout the genome. With the key biological concepts reviewed in Section 1.1 in mind, the typical steps of a GWAS are given below:

Figure 4: Overview of GWAS Steps (Uffelmann, E., Huang, Q.Q., Munung, N.S. et al. Genome-wide association studies. Nat Rev Methods Primers 1, 59 (2021). https://doi.org/10.1038/s43586-021-00056-9)

a) Data Collection¶

Data can be collected from study cohorts or available genetic and phenotypic information can be used from biobanks or repositories. Confounders need to be carefully considered and recruitment strategies must not introduce biases such as collider bias.

b) Genotyping¶

Genotypic data can be collected using microarrays to capture common variants, or next-generation sequencing methods for whole-genome sequencing (WGS) or whole-exome sequencing (WES). The data is saved in specific file formats that can be used for downstream analyses.

c) Quality control¶

Quality control includes steps at the wet-laboratory stage, such as genotype calling and DNA switches, and dry-laboratory stages on called genotypes, such as deletion of bad single-nucleotide polymorphisms (SNPs) and individuals, detection of population strata in the sample and calculation of principal components. The figure depicts clustering of individuals according to genetic substrata.
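As a rough illustration of what "deletion of bad SNPs" can mean in the dry-laboratory stage, the sketch below filters SNPs on call rate and minor allele frequency. The data, the function names and the thresholds (98% call rate, 1% MAF) are assumptions made for this example only; appropriate thresholds depend on the study.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls (missing calls are None)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def maf_from_dosages(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    frequency = sum(called) / (2 * len(called))
    return min(frequency, 1 - frequency)


def passes_qc(genotypes, min_call_rate=0.98, min_maf=0.01):
    """Keep a SNP only if it is well genotyped and not (nearly) monomorphic."""
    return (call_rate(genotypes) >= min_call_rate
            and maf_from_dosages(genotypes) >= min_maf)


# Hypothetical genotypes of ten individuals at three SNPs (illustrative only).
snps = {
    "snp_good": [0, 1, 2, 1, 0, 1, 0, 0, 1, 2],
    "snp_missing": [0, None, 2, None, 0, 1, None, 0, 1, 2],
    "snp_monomorphic": [0] * 10,
}

kept = [name for name, genotypes in snps.items() if passes_qc(genotypes)]
print(kept)
```

Here only the first SNP is kept: the second has too many missing calls and the third shows no variation in the sample.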

d) Imputation¶

Genotypic data can be phased, and untyped genotypes imputed using information from matched reference populations from repositories such as the 1000 Genomes Project or TOPMed. In the example given in the above figure, genotypes of SNP1 and SNP3 are imputed based on the directly assayed genotypes of other SNPs.

e) Association testing¶

Genetic association tests are run for each genetic variant, using an appropriate model (for example, additive, non-additive, linear or logistic regression). Confounders are corrected for, including population strata, and multiple testing needs to be controlled. Output is inspected for unusual patterns and summary statistics are generated.
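For a quantitative trait, the additive model mentioned above amounts to regressing the phenotype on the number of copies (0, 1 or 2) of one allele. The following is a minimal sketch of that test for a single variant without covariates; the data are invented, the function name is hypothetical, and the P-value uses a normal approximation that is only reasonable for large samples. Dedicated software such as PLINK should be used for real analyses.

```python
import math


def additive_linear_test(dosages, phenotypes):
    """Simple linear regression of phenotype on allele dosage (0/1/2).

    Returns the effect size per allele copy, its standard error and
    an approximate two-sided P-value (normal approximation).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residual_ss = sum((y - intercept - beta * x) ** 2
                      for x, y in zip(dosages, phenotypes))
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    z = beta / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, standard_error, p_value


# Hypothetical dosages and trait values for six individuals (illustrative only).
dosages = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]

beta, standard_error, p_value = additive_linear_test(dosages, phenotypes)
print(beta, standard_error, p_value)
```

In practice the same model is fitted at every variant, with confounders such as principal components added as covariates.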

f) Meta-analysis¶

To increase sample size, GWAS is typically carried out in the context of a consortium such as the Psychiatric Genomics Consortium, the Genetic Investigation of Anthropometric Traits (GIANT) consortium or the Global Lipids Genetics Consortium where data from multiple cohorts are analysed together using tools such as METAL.
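One common way to combine per-cohort results for a single variant is a fixed-effects, inverse-variance weighted meta-analysis. The sketch below illustrates that calculation in general terms; it is not METAL's code, the function name is hypothetical, and the effect sizes and standard errors are invented.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects inverse-variance weighted meta-analysis of one variant.

    Each cohort is weighted by 1 / SE^2, so more precise estimates count more.
    Returns the pooled effect, its standard error and the Z-score.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Hypothetical per-cohort effect estimates for the same allele (illustrative only).
betas = [0.10, 0.20]
standard_errors = [0.05, 0.05]

pooled_beta, pooled_se, z = inverse_variance_meta(betas, standard_errors)
print(pooled_beta, pooled_se, z)
```

The pooled standard error is smaller than that of either cohort, which is how combining cohorts increases power. The effect estimates must refer to the same allele in every cohort before they are combined.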

g) Replication¶

Results can be replicated using internal replication or external replication in an independent cohort. For external replication, the independent cohort must be ancestrally matched and not share individuals or family members with the discovery cohort.

h) Post-GWAS analyses¶

Post-GWAS analyses are in silico analyses of genome-wide association study (GWAS) results that use information from external resources. They can include in silico fine-mapping, SNP to gene mapping, gene to function mapping, pathway analysis, genetic correlation analysis, Mendelian randomization and polygenic risk prediction. After GWAS, functional hypotheses can be tested using experimental techniques such as CRISPR or massively parallel reporter assays, or results can be validated in a human trait/disease model (not shown).

Scope of Course¶

In this course, we will give an introduction to data collection & genotyping, quality control (which will look at factors such as relatedness, population structure and summary statistics), and association testing.

1.4 GWAS software¶

As current GWAS consider tens of thousands of individuals and millions of variants, the analyses are done with specialised software that reads specific file formats. The most popular software is PLINK. Its most recent version is 2.0, but because that version is still under development, we will be using version 1.9 for the duration of this course. Earlier versions had their own GUI, whereas later versions are used via the command line.

How to install PLINK¶

You can download PLINK here and install it on your device. It is best to install the most up-to-date stable version. The download is a zip file, which you will need to unzip. It contains several files, of which only plink.exe is relevant for this course. Move this file to the directory one above the Course Notebooks. If you have cloned the course repo, then it will already be installed.

NOTE: If you are running this course on our virtual environment on uCloud, then this will already have been installed.

To check if it is installed, we run the following command:

In [1]:

                
plink

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3

  plink <input flag(s)...> [command flag(s)...] [other flag(s)...]
  plink --help [flag name(s)...]

Commands include --make-bed, --recode, --flip-scan, --merge-list,
--write-snplist, --list-duplicate-vars, --freqx, --missing, --test-mishap,
--hardy, --mendel, --ibc, --impute-sex, --indep-pairphase, --r2, --show-tags,
--blocks, --distance, --genome, --homozyg, --make-rel, --make-grm-gz,
--rel-cutoff, --cluster, --pca, --neighbour, --ibs-test, --regress-distance,
--model, --bd, --gxe, --logistic, --dosage, --lasso, --test-missing,
--make-perm-pheno, --tdt, --qfam, --annotate, --clump, --gene-report,
--meta-analysis, --epistasis, --fast-epistasis, and --score.

"plink --help | more" describes all functions (warning: long).

NOTE: Some machines may not have initial permission to run PLINK. If this applies to you, please consult the steps found here. Depending on what solution works for you, you might also have to apply this every time you start your computer.
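The installation check above can also be scripted. The following is a minimal sketch in Python, assuming that the executable is called `plink` and is reachable through your PATH; the function name `find_plink` is hypothetical, and you may need to pass a different name or a full path depending on where you placed the file.

```python
import shutil
import subprocess


def find_plink(executable="plink"):
    """Return the full path of the PLINK executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_plink()
if path is None:
    print("PLINK was not found on PATH; check the installation steps above.")
else:
    # Ask PLINK for its version string to confirm that it can be executed.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    print(result.stdout.strip())
```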