2 Data Collection¶

3.1 Data Collection¶

GWAS often require very large sample sizes to identify reproducible genome-wide significant associations, and one would need to conduct power calculations in order to determine the appropriate sample size. Generally speaking, there are two types of study design:

1) the inclusion of cases and controls when the trait of interest is dichotomous; or
2) quantitative measurements on the whole study sample when the trait is quantitative

Figure 2.1: Potential study designs

The outcome of collecting data should give you all the necessary information you need for downstream analyses. These areas of information could include:

Blood samples or similar (for later genotyping)
Information about the traits of interest
disease status e.g. have diabetes (case) vs do not have diabetes (control)
quantitative e.g. cholesterol, blood pressure or blood sugar
Additional potentially relevant info, like sex and age

In addition, one can choose between population-based and family-based designs. The choice generally depends on the data you have available - that is, if you have less samples, you can control for close relationships in your analyses rather than removing related individuals.

GWAS can be conducted using data from resources such as biobanks or cohorts with disease-focused or population-based recruitment, or through direct to consumer studies. There are several excellent public resources available that provide access to large cohorts with both genotypic and phenotypic information, and the majority of GWAS are conducted using these pre-existing resources. Two of the most prominent research initiatives which contains the relevant resources include the International HapMap Project and the 1000 Genomes project. The International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of common SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www.1000genomes.org/; Altshuler et al., 2012) provided a map of both common and rare SNPs.

For all study designs, recruitment strategies must be carefully considered as these can induce various forms of bias. For example, widely used research cohorts such as the UK Biobank recruit participants through a volunteer-based strategy. What this means in practice is that participants are, on average, healthier, wealthier and more educated than the general population. Furthermore, cohorts that enrol participants from hospitals based on their disease status (such as BioBank Japan) will have different selection biases to cohorts recruited from the general population. Different ethnicities, for example, can be included in the same study, as long as the population substructure is considered to avoid false positive results.

To emphasise the point, it is important to know that whatever method you use to obtain the required data, it will require informed consentand ethical approval. This can be very difficult to do in practice, which is why publicly available banks are so widely used.

Finally, study annotation is conducive to sound scientific practice. This involves laying out a description of the chosen study design, a thorough description of the study, and other specifics such as sample inclusion criteria (e.g. ancestry, clinical features), and the phenotype you are interested in in your association analysis.

3.2 Genotyping¶

The rule of thumb for genotyping method is that you want to collect relevant SNP data as cheaply as possible. Breaking this down, "relevant" SNPs might suggest that your SNP sampling does not contain those in high LD, as this makes your analyses more computationally expensive. Of course, selecting one SNP out of a high-LD group is not a trivial problem, and further investigation will need to be performed if it is this SNP that is flagged.

Microarray-based genotyping is the most commonly used method for obtaining genotypes for GWAS owing to the current cost of next-generation sequencing. But this cost is decreasing overtime, as is shown in Figure 3.2. This is why WGS (Whole Genome Seqeuncing) is expected to become the method of choice over the next couple of years with the increasing availability of low-cost WGS technology over WES (Whole Exome Sequencing) or microarrays. However, the choice of genotyping platform depends on many factors and tends to be guided by the purpose of the GWAS. For example, in a consortium-led GWAS, it is usually wise to have all individual cohorts genotyped on the same genotyping platform. This avoids problems with batch bias.

$Cost.jpg$

Figure 3.2 - The trend of cost of sequencing an entire genome through the years. This shows a development faster than what would be predicted by Moore's Law, which suggests that every two years, our computational power should double.

3.3 Data files¶

When raw or genotype level data are available, the genotyping parser extracts the following information for each sample from study annotation files:

Family ID
Sample ID
Paternal ID (for family-based studies)
Maternal ID (for family-based studies)
Sex
Phenotype (discrete or quantitative)
Group/cluster (eg, geographical region) to assess possible effects of population stratification

This might come in the form of a VCF file, which includes information about the location and quality of the variant, as well as some descriptions of the sample. We will be looking at similar formats, known as .bim, .bedand .fam file formats, though this is based on personal preference.

Description of the Data¶

To be able to illustrate all analysis steps using realistic genetic data, we took a simulated dataset (N = 165) with a binary outcome measure using the publicly available data from the International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-05_phaseIII/plink_format/; Gibbs et al., 2003). For this tutorial, in order to create an ethnically homogenous dataset, we only included Utah residents with ancestry from Northern and Western Europe (CEU). Because of the relatively small sample size of the HapMap data, genetic effect sizes in these simulations were set at values larger than usually observed in genetic studies of complex traits. It is important to note that larger sample sizes (e.g., at least in the order of thousands but likely even tens or hundreds of thousands) will be required to detect genetic risk factors of complex traits.

We link the data folder to easily access it

In [1]:

                
                    Copied!
                    
ln -sf ../Data
ln -sf ../Data

3.2.1 Fam file - Info on individuals¶

We being by looking at HapMap_3_r3_1.fam, which you can do by the following command:

In [2]:

                
                    Copied!
                    
less Data/HapMap_3_r3_1.fam | head -10
less Data/HapMap_3_r3_1.fam | head -10

1328 NA06989 0 0 2 2
1377 NA11891 0 0 1 2
1349 NA11843 0 0 1 1
1330 NA12341 0 0 2 2
1444 NA12739 NA12748 NA12749 1 -9
1344 NA10850 0 NA12058 2 -9
1328 NA06984 0 0 1 2
1463 NA12877 NA12889 NA12890 1 -9
1418 NA12275 0 0 2 1
13291 NA06986 0 0 1 1

It is just a plain text file with six columns and no header. Each column has a very specific function. What you need to know about the respective columns is the following:

The first column is the family identification abbreviated as FID. As the PLINK software was primarily developed for genomic analyses in humans, which is reflected in the naming terminology and default settings. In the AdaptMap data set, the use of this column is to specify breed identity. If you look into Table 1 of the linked publication, you will find that the ABR abbreviation belongs to the Abregelle goats from Ethiopia. Using the FID column to denote breed information is certainly popular, but you are not limited to it. You can use any meaningful grouping of your data as you see fit.

The second column is the "within family ID", which is just simply called "individual ID" and abbreviated as IID. Although this phrasing might suggest that the same IID could be used with different FID, it is a very good idea to have unique IIDs for each individual in your data set. For any specification of individuals within PLINK however the IID always goes together with FID, so you have to devote sufficient attention to both anyway.

The third column is the ID of the father and the fourth column is the ID of the mother of the animal whose IID is denoted in the second column. If any of the parents are genotyped their ID in the third and fourth columns and their IID should correspond. It is often the case that the parents are not known, or at the time of genotyping, there is not enough time to fill out the parent information and it is set to 0. It is not a problem in most cases, but in some types of analyses, we have to be aware of this and include appropriate options (e.g. when looking at the minor allele frequency - don't worry about this now though).

Column five contains the sex information of the individual in the IID column. According to the built-in coding, 1 is for males, 2 for females, and 0 is unknown. Similar to parent information, this is also many times missing, usually not a problem, but specific options need to be included in case of any issues come up.

The sixth column is to denote the phenotype of the individuals in the IID column. Because this is only a single column, only a single phenotype could be specified. In a case-control study the numerical value 1 is reserved for controls, and 2 for the cases. Alternatively, you can use any other number, including decimals for any other phenotype. In the case of missing values, you can use zero or minus nine.

We start by confirming the amount of individuals in our population. We can do this simply by counting the number of lines in the .fam file:

In [3]:

                
                    Copied!
                    
wc -l Data/HapMap_3_r3_1.fam
wc -l Data/HapMap_3_r3_1.fam

165 Data/HapMap_3_r3_1.fam

3.2.2 Bim file - SNP location info¶

The .bim file contains the locations of all SNPs in the data, and looks like this:

In [4]:

                
                    Copied!
                    
less Data/HapMap_3_r3_1.bim | head -10
less Data/HapMap_3_r3_1.bim | head -10

1	rs2185539	0	556738	T	C
1	rs11510103	0	557616	G	A
1	rs11240767	0	718814	T	C
1	rs3131972	0	742584	A	G
1	rs3131969	0	744045	A	G
1	rs1048488	0	750775	C	T
1	rs12562034	0	758311	A	G
1	rs12124819	0	766409	G	A
1	rs4040617	0	769185	G	A
1	rs2905036	0	782343	C	T

Similar to the .fam file, the bim file has six columns.

The first column contains the chromosome number where the SNP is located. You can see that the first 10 SNPs are located on Chromosome 1.

The second column is the SNP name. This name is predefined during the construction of the SNP chip and fixed. If you ever want to compare versions of different SNP chips for the same species, the SNP overlaps based on their names is an excellent way to start.

The third column is the position of the SNP in Morgans or centimorgans, with zero value if you do not know or do not care. For most of the analyses, this could be kept as zero.

The fourth column is the base pair coordinate of the SNP. In other words, you start to count from the beginning of the chromosome and for each SNP write down its exact location. At the beginning of each new chromosome, the counter resets and starts from one again.

In the remaining two columns five and six are the alleles for respective SNPs. All SNPs on the chips are biallelic by design, so you always see only two alleles in each row. So this means that for each SNP there could be only these two characters representing the genotype and the character representing the missing genotypes (not shown in the .bim file, coded as zero by default). This is a very important concept that will be relevant later on when it comes to merging SNP data sets. Genotypes in column five usually denote the minor allele and in column six the major allele (more about the allele frequency in the quality control in chapter Genotype data quality control). In some cases you also see the number 0 appearing. This means that in this case, all known genotypes are homozygous GG, i.e. fixed in the sample with no alternative allele.

Similar to .fam files, you can extract one more fairly useful piece of information just by looking at the file. Here again, the number of rows in the .bim file shows how many SNPs are in the data set you are looking at. For the HapMap data, this is 1457897 SNPs:

In [5]:

                
                    Copied!
                    
wc -l Data/HapMap_3_r3_1.bim
wc -l Data/HapMap_3_r3_1.bim

1242712 Data/HapMap_3_r3_1.bim

3.2.3 Bed file - Individual genotypes¶

Right now, we know that the files you downloaded contain genotypes for 165 individuals, each of them genotyped for 1457897 SNPs. But where are the genotypes for the individual animals?

These are located in the HapMap_3_r3_1.bed file, but unfortunately, opening it with a text editor looks the this:

In [7]:

                
                    Copied!
                    
xxd -b Data/HapMap_3_r3_1.bed | head -10
xxd -b Data/HapMap_3_r3_1.bed | head -10

00000000: 01101100 00011011 00000001 11111111 11111111 11111111  l.....
00000006: 11111111 11111111 11111111 11111111 11111111 11111111  ......
0000000c: 11111111 11111111 11111111 11111111 11111111 11111111  ......
00000012: 11111111 11111111 11111111 11111111 11111111 11111111  ......
00000018: 11111111 11111111 11111111 11111111 11111111 11111111  ......
0000001e: 11111111 11111111 11111111 11111111 11111111 11111111  ......
00000024: 11111111 11111111 11111111 11111111 11111111 11111111  ......
0000002a: 11111111 11111111 00000011 11111111 11111111 11111111  ......
00000030: 11111111 11111111 11111111 11111111 11111111 11111111  ......
00000036: 11110011 11111111 11111111 11111111 11111111 11111111  ......
xxd: Broken pipe

This is because the genotypes are coded in a so-called binary format. This not just takes a very small hard disk space, but it is processed quicker by the computer since it's already translated from human-readable to a machine-readable format. Of course, quite often it is very useful to check out the individual genotypes, so to have them in a text format is required. The non-binary file format for the genotype is stored in the so-called .ped and .map files. These are also some very well-known formats and widely used in various programs.