1 What is a GWAS?
During the past two decades, there has been a growing interest in investigating the influence of genetic risk factors on variation in human behaviour. The technical and analytic tools needed to conduct genetic studies have become increasingly accessible. This increased accessibility offers great promise as researchers outside the field of genetics may bring new expertise to the field (e.g., more in‐depth knowledge of the nosology of psychiatric traits). However, performing genetic association studies in a correct manner requires specific knowledge of genetics, statistics, and (bio)informatics. This course aims to provide a guideline for conducting genetic analyses by introducing key concepts and by sharing scripts that can be used for data analysis.
We all carry two nuclear genomes (i.e. genomes located in the cell nucleus), one inherited from each of our two parents. Additionally, we have a small mitochondrial genome, assumed to be inherited exclusively from the mother, but in this course the term 'genome' refers to the nuclear genome.
The human genome is a sequence of about 3.2 billion nucleotides (also called base pairs or DNA letters: A, C, G, T) (see yourgenome.org) that is divided into separate physical pieces called chromosomes (see yourgenome.org). There are 22 autosomal (non-sex related) chromosomes and two sex chromosomes (X chromosome and Y chromosome). Normally, humans have two copies of each autosome; individuals with one copy of X and one of Y are males, whereas individuals with two copies of X are females. An abnormal number of chromosomes (called aneuploidy) typically causes severe consequences or an early death if present in all cells of an individual. The most common non-lethal exception is Down syndrome (3 copies of chr 21). Mosaicism, where only some cells have abnormal chromosome numbers, also exists, and such abnormal numbers are often present in cancer cells.
There are three types of pairings that come up when we analyse genomes.
- First, DNA is most of the time a double-stranded molecule whose two strands (i.e. the two DNA molecules) are held together by the chemical base pairings A-T and C-G. This base pairing is key to the copying mechanism of DNA that is needed before any cell division (see yourgenome.org), and the two DNA molecules connected through base pairing carry exactly the same information, just written in complementary letters, i.e. A <-> T and C <-> G. To distinguish between the two DNA molecules, it has been agreed that one of the two strands is called the forward strand (or positive strand) and the other the reverse strand (or negative strand). Thus, e.g., when the + strand contains base A, the corresponding base on the - strand is T, and vice versa.
- Second, the two homologous chromosomes of an individual (e.g. paternal chr 13 with maternal chr 13, or in a male, maternal X and paternal Y) can be thought of as a pair. Thus, we say that the human genome consists of 22 autosomes + X + Y, but each individual has two copies of each homologous chromosome, and so has 46 unique chromosomes that are divided into 23 pairs of homologous chromosomes.
- Third, before any cell division each of the 46 unique chromosomes of an individual copies itself, and the two copies (called sister chromatids) are physically paired with each other to make the X-like shape that is often used to show chromosomes in pictures. Such a picture actually has 92 chromosomes, since each unique chromosome is duplicated in it (but we typically say that there are 46 replicated chromosomes rather than 92 chromosomes). This pairing after copying is important in cell division, so that the resulting cells get the correct set of chromosomes. In mitosis (ordinary cell division), each of the two new cells has one set of the 46 unique chromosomes. In meiosis, the gametes (sperm and eggs) are formed to have only one copy of each homologous chromosome and thus have 23 unique chromosomes. During meiosis, the process of recombination shuffles the homologous paternal and maternal chromosomes in such a way that each of the offspring's chromosomes will be a mixture of segments of its grandparents' chromosomes.
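The first pairing, between the two DNA strands, can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration. The sketch also reverses the complemented sequence, because the two strands are read in opposite directions, a detail not covered in the text above.

```python
# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Return, position by position, the base found on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the opposite strand written in its own reading direction."""
    return complement(seq)[::-1]


forward = "ATGCCG"  # hypothetical stretch of the + strand
print(complement(forward))          # TACGGC
print(reverse_complement(forward))  # CGGCAT
```

Applying `complement` twice returns the original sequence, which mirrors the statement that both strands carry the same information.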
Other important terms
- Gene: The most obvious way in which genetic variation can affect phenotypes is through variation in how genes function. Genes are segments of DNA that code for proteins (see yourgenome.org), and variation in the physical structure of the protein or in the time and place where the protein is made can have phenotypic consequences. Therefore, we are very interested in how genetic variation can affect the function of genes, and a lot of this is still unknown. Protein-coding genes cover less than 2% of the whole human genome, but the remaining 98% affects the regulation of genes in many ways.
- Locus (pl. loci): A continuous region of the genome is called a locus. It can be of any size (e.g. a single nucleotide site of length 1 bp or a region of 10 million base pairs, 10 Mbp).
- GWAS loci: Regions that include a clear statistical association with the phenotype of interest.
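As a rough illustration of the locus definition above, a locus can be represented as a chromosome name plus a start and end position. This is only a sketch: the class name `Locus`, the 1-based inclusive coordinate convention, and the example coordinates are assumptions made for illustration, not something prescribed by the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A continuous region of the genome (hypothetical helper class)."""

    chrom: str  # chromosome name, e.g. "1", "13" or "X"
    start: int  # first position (assumed 1-based, inclusive)
    end: int    # last position (assumed 1-based, inclusive)

    def __post_init__(self) -> None:
        if self.start < 1 or self.end < self.start:
            raise ValueError("need 1 <= start <= end")

    @property
    def length(self) -> int:
        """Number of base pairs covered by the locus."""
        return self.end - self.start + 1

    def contains(self, chrom: str, pos: int) -> bool:
        """True if the given position lies inside this locus."""
        return chrom == self.chrom and self.start <= pos <= self.end


# Made-up coordinates: a single-nucleotide locus and a 10 Mbp locus.
single_site = Locus("1", 15_000_000, 15_000_000)
region = Locus("1", 10_000_001, 20_000_000)

print(single_site.length)                # 1
print(region.length)                     # 10000000
print(region.contains("1", 15_000_000))  # True
```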
1.1.1 Genetic variants
At any one position of the genome (e.g. the nucleotide site at position 13,475,383 of chromosome 1, denoted by chr1:13,475,383), variation can exist between the genomes in the population. For example, my paternal chromosome can have a base A and my maternal chromosome a base G (on the + strand of the DNA) at that position. Such a one-nucleotide variation is called a single-nucleotide variant (SNV), and the two versions are called alleles. So in the example case, I would be carrying both an allele A and an allele G at that SNV, whereas you might be carrying two copies of allele A at the same SNV. My genotype would be AG and yours AA. An individual having different alleles on their two genomes is heterozygous at that locus, and an individual having two copies of the same allele is homozygous at that locus. If neither of the alleles is very rare in the population, say, the minor allele frequency (MAF) is > 1%, the variant is called a polymorphism, in this case a single-nucleotide polymorphism (SNP). There are over 10 million SNPs in the human genome. More complex genetic variation includes structural variation (SV), such as copy number variants (CNVs), which are duplications or deletions of genomic regions, and rearrangements of the genome, such as inversions or translocations of DNA segments (see yourgenome.org).
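The terms genotype, heterozygous, homozygous and MAF can be made concrete with a small Python sketch. The five genotypes below are made-up data, the function names are hypothetical, and the code assumes a variant with exactly two alleles; it is meant as an illustration rather than as a replacement for dedicated GWAS software.

```python
from collections import Counter

# Hypothetical genotypes of five individuals at one two-allele SNV.
genotypes = ["AG", "AA", "AA", "AG", "GG"]


def zygosity(genotype: str) -> str:
    """Classify a two-letter genotype as homozygous or heterozygous."""
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


def allele_frequencies(genotypes: list[str]) -> dict[str, float]:
    """Count every allele copy (two per individual) and normalise."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Frequency of the rarer allele; 0.0 if only one allele is observed."""
    freqs = allele_frequencies(genotypes)
    return min(freqs.values()) if len(freqs) > 1 else 0.0


maf = minor_allele_frequency(genotypes)
print([zygosity(g) for g in genotypes])
print(allele_frequencies(genotypes))  # A: 6 of 10 copies, G: 4 of 10 copies
print(maf > 0.01)                     # True: meets the MAF > 1% criterion
```

In this toy sample there are 10 allele copies in total (two per individual), of which 4 are G, so G is the minor allele. A sample of five individuals is far too small to estimate a population frequency; it is used here only to keep the arithmetic easy to follow.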