Modern humans and their archaic ancestry¶
In this session we will work on a dataset of called archaic fragments in modern humans. The modern humans are from the Simons genome diversity project. We will replicate some of the plots in the paper originating the data and answer some analytical questions.
Background material¶
The dataset used in this session is provided by Laurits Skov, which published the paper Skov et al 2018. Please read it before running this exercise, if you want to know more about archaic introgression and the method used.
How to make this notebook work¶
- In this notebook we will use only 'R'. You need to choose a kernel, that contains a programming language and the necessary packages to run the course material. To choose the right kernel, go on the menu on the top of the page and select
Kernel --> Change Kernel
, and then choosepopgen course
. - You can run the code in each cell by clicking on the run cell sign in the toolbar, or simply by pressing Shift+Enter. When the code is done running, a small green check sign will appear on the left side.
- You need to run the cells in sequential order to execute the analysis. Please do not run a cell until the one above is done running, and do not skip any cells
- The code goes along with textual descriptions to help you understand what happens. Please try not to focus on understanding the code itself in too much detail - but rather try to focus on the explanations and on the output of the commands
- You can create new code cells by pressing
+
in the Menu bar above or by pressing B after you select a cell.
Learning outcomes¶
At the end of this tutorial you will be able to
- Perform and discuss simple operations on dataframes such as grouping, averaging and identifying specific elements in the dataset
- Identify the technique used in the article to detect archaic fragments
- Answer to hypothesis using data analysis on a dataframe
Detecting Archaic ancestry¶
We load the libraries.
suppressMessages({ #removes warnings and other messages when loading
library(dplyr)
library(ggplot2)
library(data.table) })
We use the function fread
to read in the data. fread
is very fast and surpasses the R function read.table
when using large datasets.
#archaic_df = read.table('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T) slower
archaic_df = fread('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T, nThread=2, verbose=FALSE)
Some example commands¶
Some example commands that are quite useful when exploring data.
Look at the which columns there are in the table and print just the first few lines with head
to see some values for each column
colnames(archaic_df)
- 'name'
- 'chrom'
- 'start'
- 'end'
- 'length'
- 'snps'
- 'pop'
- 'country'
- 'region'
- 'MeanProb'
- 'Shared_with_Altai'
- 'Shared_with_Denisova'
- 'Shared_with_Vindija'
- 'outgroup'
head(archaic_df)
name | chrom | start | end | length | snps | pop | country | region | MeanProb | Shared_with_Altai | Shared_with_Denisova | Shared_with_Vindija | outgroup | closest_to |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | <int> | <chr> | <chr> | <chr> | <dbl> | <int> | <int> | <int> | <chr> | <chr> |
ERS699811 | 1 | 2791000 | 2815000 | 24000 | 12 | Toto | India | EastAsia | 0.8304250 | 7 | 0 | 5 | SubAfricans | Neanderthal |
ERS699811 | 1 | 3058000 | 3113000 | 55000 | 36 | Toto | India | EastAsia | 0.9733212 | 10 | 6 | 12 | SubAfricans | Neanderthal |
ERS699811 | 1 | 3341000 | 3384000 | 43000 | 21 | Toto | India | EastAsia | 0.9102532 | 9 | 1 | 8 | SubAfricans | Neanderthal |
ERS699811 | 1 | 3433000 | 3453000 | 20000 | 13 | Toto | India | EastAsia | 0.9437119 | 8 | 0 | 7 | SubAfricans | Neanderthal |
ERS699811 | 1 | 3684000 | 3784000 | 100000 | 35 | Toto | India | EastAsia | 0.9745021 | 17 | 5 | 15 | SubAfricans | Neanderthal |
ERS699811 | 1 | 3984000 | 4091000 | 107000 | 19 | Toto | India | EastAsia | 0.8731331 | 5 | 4 | 5 | SubAfricans | Neanderthal |
How many individuals do we have? dim
gives you number of rows and columns>
dim(archaic_df)
- 293971
- 14
Count the different names of individuals using unique
, since the same name is repeated more than once
length(unique(archaic_df$name))
The same for the number of populations
length(unique(archaic_df$pop))
And to see the name of the regions
unique(archaic_df$region)
- 'EastAsia'
- 'WestEurasia'
- 'CentralAsiaSiberia'
- 'SouthAsia'
- 'Melanesia'
How to find the average archaic segment length by population. When using a dataframe, remember that this is piped through various commands. For example, mean_seg_pop
is created by piping archaic_df
(with the symbol %>%
) into grouping the table by region and population, which result is then piped into the command summarising through the mean of the segment length. Similarly, mean_seg_pop
gets its column arranged by region and then the result is piped into the plotting commands
#this resize your plot changing width and height
options(repr.plot.width=30, repr.plot.height=10)
# Average archaic segment length by population
mean_seg_pop <- archaic_df %>%
group_by(pop, region) %>%
summarise(`Mean segment length` = mean(length))
#arrange by region (color blocks in the plot) and do a barplot
mean_seg_pop %>%
ungroup() %>%
arrange(region) %>%
mutate(pop = factor(pop, pop)) %>%
ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20))
`summarise()` has grouped output by 'pop'. You can override using the `.groups` argument.
What is the population with the highest average segment length?
mean_seg_pop[ which.max(mean_seg_pop$`Mean segment length`), ]
pop | region | Mean segment length |
---|---|---|
<chr> | <chr> | <dbl> |
Papuans | Melanesia | 131784.3 |
What is the average length of segments by region?
mean_seg_region <- archaic_df %>%
group_by(region) %>%
summarise(`Mean segment length` = mean(length))
ggplot(mean_seg_region, aes(x = region, y = `Mean segment length`)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
coord_flip() +
theme(text = element_text(size=30))
Q1. Find the total lengths of Arcahic fragments in each individual¶
Calculate total archaic segment length by ind and grouped by population and region:
total_seg_ind <- archaic_df %>%
group_by(name, pop, region) %>%
summarise(`Total segment length` = sum(length))
head(total_seg_ind)
`summarise()` has grouped output by 'name', 'pop'. You can override using the `.groups` argument.
name | pop | region | Total segment length |
---|---|---|---|
<chr> | <chr> | <chr> | <int> |
1-Mar | Papuans | Melanesia | 159627000 |
13733_8 | Papuans | Melanesia | 168490000 |
13748_1 | Papuans | Melanesia | 175569000 |
13748_2 | Papuans | Melanesia | 172182000 |
13748_3 | Papuans | Melanesia | 165722000 |
13748_4 | Papuans | Melanesia | 160407000 |
Q2. Summarise the total length per population and per region¶
total_seg_ind %>%
group_by(pop,region) %>%
summarise(`Mean total segment length`=mean(`Total segment length`)) %>%
ungroup() %>%
arrange(region) %>%
mutate(pop = factor(pop, pop)) %>%
ggplot(aes(x = pop, y = `Mean total segment length`,
fill = region)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size=20), text=element_text(size=30))
`summarise()` has grouped output by 'pop'. You can override using the `.groups` argument.
Q3. Which population has most archaic ancestry? Why?¶
Papuans. Denisovans.
Q4. What is the length distribution of fragments for the five different regions (hint: you can use facet_wrap
to plot all at once)?¶
we group archaic_df
by region and use ggplot
to make histograms of the fragment lengths. We use facet_wrap(~region)
to make a separate histogram for each region
options(repr.plot.width=20, repr.plot.height=20)
archaic_df %>%
group_by(region) %>%
ggplot(aes(length)) +
theme_bw() +
geom_histogram(bins = 50) +
facet_wrap(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=15))
Similarly, you can create a boxplot for each region. In this way you do not need the additional command facet_wrap
archaic_df %>%
group_by(region) %>%
ggplot(aes(y=length,x=region)) +
theme_bw() +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))
Q5. What is the average length of fragments for each population and each region?¶
We calculate the average length of fragments in each population and then plot the table we obtained
options(repr.plot.width=30, repr.plot.height=10)
mean_seg_pop <- archaic_df %>%
group_by(pop, region) %>%
summarise(`Mean segment length` = mean(length))
mean_seg_pop %>%
ungroup() %>%
arrange(region) %>%
mutate(pop = factor(pop, pop)) %>%
ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
`summarise()` has grouped output by 'pop'. You can override using the `.groups` argument.
Q6. What can cause different mean fragment lengths?¶
Differences in recombination rate, more than one wave of introgression in one of the regions. Recent gene flow/introgression from denisovans/neanderthal after migration out of Africa. Generation time.
The origin of archaic fragments¶
We now try to distinguish which populations fragments come from, and in which proportions.
Q1. For each individual, assign the archaic segments to origin and reconstruct a Figure in the same style as Figure 5 of the Cell paper (plot below).¶
We use mutate
to create three origins of the fragments in the new column closest_to
, by looking at where the highest proportion of fragments come from. Then we group by the columns name, closest_to, population
and average the total fragment lengths before plotting.
options(repr.plot.width=20, repr.plot.height=20)
archaic_df <- archaic_df %>%
mutate(closest_to = case_when(
Shared_with_Altai > Shared_with_Denisova | Shared_with_Vindija > Shared_with_Denisova ~ "Neanderthal",
Shared_with_Altai < Shared_with_Denisova & Shared_with_Vindija < Shared_with_Denisova ~ "Denisovan",
Shared_with_Altai + Shared_with_Vindija + Shared_with_Denisova == 0 ~ "Unclassified"
))
archaic_df %>%
group_by(name, closest_to, pop) %>%
summarise(total = sum(length)) %>%
ungroup() %>%
group_by(pop, closest_to) %>%
summarise(mean_sequences = mean(total)) %>%
ungroup() %>%
mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
ggplot(aes(x = pop, y = mean_sequences/10^6, fill = closest_to)) +
geom_bar(stat="identity") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab('Mean detected archaic sequence per region (Mb)') +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20))
`summarise()` has grouped output by 'name', 'closest_to'. You can override using the `.groups` argument. `summarise()` has grouped output by 'pop'. You can override using the `.groups` argument.
Q2. What are major differences? What can explain these differences?¶
Higher archaic proportions for Papuans. Differences in proportions and lengths between methods.
Q3. Summarize the results by averaging over region and plot these.¶
# Plotting by region
options(repr.plot.width=30, repr.plot.height=10)
archaic_df %>%
group_by(name, closest_to, region) %>%
summarise(total = sum(length)) %>%
ungroup() %>%
group_by(region, closest_to) %>%
summarise(mean_sequences = mean(total)) %>%
ungroup() %>%
mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
ggplot(aes(x = region, y = mean_sequences/10^6, fill = closest_to)) +
geom_bar(stat="identity") + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab('Mean detected archaic sequence per region (Mb)') +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))
`summarise()` has grouped output by 'name', 'closest_to'. You can override using the `.groups` argument. `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
East Asians have been shown to contain more Neanderthal admixture (maybe because of a second admixture event after the European-East Asian split or maybe because of dilution of the Neanderthal component in the European population because of migrations from non-admixed populations). Additionally, it has been shown that East Asians contain higher admixture proportions of a different Denisovan origin, than South Asians and Papuans. The component most related to the Altai genome is the one present in a higher proportion in East Asians, compared to South Asians and Papuans. East Asians contain both components with similar proportions (two waves of admixture).
Q6. Determine the fragment length distribution of segment of Neanderthal and Denisova origin separately for each region. Compare the median of the distributions.¶
options(repr.plot.width=30, repr.plot.height=10)
archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
ggplot() +
geom_density(aes(length, fill=closest_to),alpha=0.2) +
theme_bw() + facet_grid(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
When interpreting these values, we should also take into account the number of fragments and samples for each population
archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
group_by(closest_to,region) %>%
summarise(median_fragment_length = median(length), count = n(), samples = length(unique(name)))
`summarise()` has grouped output by 'closest_to'. You can override using the `.groups` argument.
closest_to | region | median_fragment_length | count | samples |
---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> |
Denisovan | CentralAsiaSiberia | 76000 | 1301 | 27 |
Denisovan | EastAsia | 74000 | 8455 | 132 |
Denisovan | Melanesia | 102000 | 45153 | 89 |
Denisovan | SouthAsia | 70000 | 2103 | 39 |
Denisovan | WestEurasia | 61000 | 1733 | 71 |
Neanderthal | CentralAsiaSiberia | 91000 | 12619 | 27 |
Neanderthal | EastAsia | 89000 | 67005 | 132 |
Neanderthal | Melanesia | 96000 | 44713 | 89 |
Neanderthal | SouthAsia | 86000 | 16767 | 39 |
Neanderthal | WestEurasia | 86000 | 26101 | 71 |
We can see that the median fragment length (and the number of fragments) for the Denisovans is higher in Melanesian samples, consistent with a more recent admixture with Denisovan archaics. The Denisovan fragment lengths in EastAsians, CentralAsiaSiberia aand South Asians, are longer than WestEurasians, but not markedly longer, we also have less Denisovan fragments than for Papuans. Finally, we can see that the median length of Neanderthal fragments is similar, consistent with a common Neanderthal admixture event.
quantile((archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")))$length)
- 0%
- 3000
- 25%
- 52000
- 50%
- 91000
- 75%
- 164000
- 100%
- 2375000
quantile((archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija))$total_snps)
- 0%
- 1
- 25%
- 9
- 50%
- 19
- 75%
- 36
- 100%
- 561
So as to clean a bit the fragments, we can apply a constraint on the minimum number of overlapping SNPs. We'll choose as a theshold median number of overlapping SNPs.
options(repr.plot.width=30, repr.plot.height=10)
archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
filter(total_snps > 19) %>%
ggplot() +
geom_density(aes(length, fill=closest_to),alpha=0.2) +
theme_bw() + facet_grid(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>%
filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
filter(total_snps > 50) %>%
ggplot(aes(x=closest_to,y=length)) +
geom_boxplot() +
ggtitle('Denisovan fragment length per region') +
theme_bw() +
facet_grid(~region) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
b. How does this compare to the fragment size distribution of the fragments that could not be assigned to archaic origin (these are removed from the Cell paper analyses)? Discuss reasons for the differences.¶
archaic_df %>%
filter(closest_to == "Neanderthal") %>%
group_by(region) %>%
ggplot(aes(length)) +
geom_histogram(bins = 50) +
ggtitle('Neanderthal fragment length per region') +
theme_bw() + facet_grid(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot1
archaic_df %>%
filter(closest_to == "Denisovan") %>%
ggplot(aes(length)) +
geom_histogram(bins = 50) +
ggtitle('Denisovan fragment length per region') +
theme_bw() + facet_grid(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot2
archaic_df %>%
filter(closest_to == "Unclassified") %>%
group_by(region) %>%
ggplot(aes(length)) +
geom_histogram(bins = 50) +
ggtitle('Unclassified fragment length per region') +
theme_bw() + facet_grid(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot3
plot1
plot2
plot3
Comparison of chromosomes¶
Q1. Determine the amount of archaic introgression on each chromosome for each of the five regions.¶
archaic_df %>%
group_by(chrom, region) %>%
summarise(Frequency = sum(as.numeric(length))) %>%
ggplot(aes(x = chrom, y = Frequency, fill = region)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22 +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
`summarise()` has grouped output by 'chrom'. You can override using the `.groups` argument.
Q2. Repeat this with assignment of regions to archaic species.¶
archaic_df %>%
group_by(chrom, closest_to) %>%
summarise(Frequency = sum(as.numeric(length))) %>%
ggplot(aes(x = chrom, y = Frequency, fill = closest_to)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
`summarise()` has grouped output by 'chrom'. You can override using the `.groups` argument.
Q3. Combine the Neanderthal fragments for all individuals and plot all the fragments on top of each other along chromosomes (hint use alpha = 0.1). Can you find “deserts” of archaic admixture and/or evidence for places where Neanderthal ancestry has reached very high frequency?¶
archaic_df %>%
arrange(region) %>%
ggplot() +
theme_bw() +
geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~chrom) +
theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none",text = element_text(size=50)) +
xlab("Position")
Q4. You will find that the X chromosome is an outlier (compare to a chormosome of a similar size - chr8). How and why?¶
archaic_df %>%
group_by(chrom, region) %>%
filter(chrom %in% c('8', 'X')) %>%
summarise(Frequency = sum(as.numeric(length))) %>%
ggplot(aes(x = chrom, y = Frequency, fill = region)) +
geom_bar(position = "dodge", stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
`summarise()` has grouped output by 'chrom'. You can override using the `.groups` argument.
archaic_df %>%
group_by(chrom, region) %>%
filter(chrom %in% c('8', 'X')) %>%
summarise(Frequency = sum(as.numeric(length))) %>%
ggplot(aes(x = chrom, y = Frequency)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
`summarise()` has grouped output by 'chrom'. You can override using the `.groups` argument.
Sexual antagonistic effects acting on this chromosome, with possible effects on hybrid incompatibilities and speciation.
Q6. Do you find regions that are devoid of introgression for both the Neanderthal and the Denisovan admixture events?¶
Looking at the EPAS1 as an example of adaptive introgression. This gene maps to chr2:46,520,806-46,613,842 in GRCh37 coordinates. Tibetan individuals have an increased amount of archaic polymorphism in this region.
archaic_df %>%
filter(closest_to == "Denisovan") %>%
filter(chrom == 2 & region == 'EastAsia' & start > 46500000 & end < 46700000 ) %>%
ggplot((aes(x = length))) + theme_bw() +
geom_histogram(bins = 50) + facet_grid(~country) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
Contrary, FOXP2 is an example of maladaptive introgression, consistent with the presence of deserts of archaic origin in around this gene.
archaic_df %>%
filter(chrom == "7") %>%
arrange(region) %>%
ggplot() +
theme_bw() +
geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~region) +
theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none") +
geom_vline(xintercept = 113726365, alpha=0.5) +
xlab("Position") + ggtitle(paste("Chromosome 7")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))