Modern humans and their archaic ancestry¶

In this session we will work on a dataset of called archaic fragments in modern humans. The modern humans are from the Simons genome diversity project. We will replicate some of the plots in the paper originating the data and answer some analytical questions.

Background material¶

The dataset used in this session is provided by Laurits Skov, which published the paper Skov et al 2018. Please read it before running this exercise, if you want to know more about archaic introgression and the method used.

How to make this notebook work¶

In this notebook we will use only 'R'. You need to choose a kernel, that contains a programming language and the necessary packages to run the course material. To choose the right kernel, go on the menu on the top of the page and select Kernel --> Change Kernel, and then choose popgen course.
You can run the code in each cell by clicking on the run cell sign in the toolbar, or simply by pressing Shift+Enter. When the code is done running, a small green check sign will appear on the left side.
You need to run the cells in sequential order to execute the analysis. Please do not run a cell until the one above is done running, and do not skip any cells
The code goes along with textual descriptions to help you understand what happens. Please try not to focus on understanding the code itself in too much detail - but rather try to focus on the explanations and on the output of the commands
You can create new code cells by pressing + in the Menu bar above or by pressing B after you select a cell.

Learning outcomes¶

At the end of this tutorial you will be able to

Perform and discuss simple operations on dataframes such as grouping, averaging and identifying specific elements in the dataset
Identify the technique used in the article to detect archaic fragments
Answer to hypothesis using data analysis on a dataframe

Detecting Archaic ancestry¶

We load the libraries.

In [42]:

            
                Copied!
                
suppressMessages({ #removes warnings and other messages when loading
    library(dplyr)
    library(ggplot2)
    library(data.table) })
suppressMessages({ #removes warnings and other messages when loading
    library(dplyr)
    library(ggplot2)
    library(data.table) })

We use the function fread to read in the data. fread is very fast and surpasses the R function read.table when using large datasets.

In [4]:

            
                Copied!
                
#archaic_df = read.table('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T) slower
archaic_df = fread('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T, nThread=2, verbose=FALSE)
#archaic_df = read.table('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T) slower
archaic_df = fread('../../Data/archaic/ArchaicSegments2.txt',sep='\t', header = T, nThread=2, verbose=FALSE)

Some example commands¶

Some example commands that are quite useful when exploring data.

Look at the which columns there are in the table and print just the first few lines with head to see some values for each column

In [6]:

            
                Copied!
                
colnames(archaic_df)
colnames(archaic_df)

'name'
'chrom'
'start'
'end'
'length'
'snps'
'pop'
'country'
'region'
'MeanProb'
'Shared_with_Altai'
'Shared_with_Denisova'
'Shared_with_Vindija'
'outgroup'

In [43]:

            
                Copied!
                
head(archaic_df)
head(archaic_df)

A data.table: 6 × 15
name	chrom	start	end	length	snps	pop	country	region	MeanProb	Shared_with_Altai	Shared_with_Denisova	Shared_with_Vindija	outgroup	closest_to
<chr>	<chr>	<int>	<int>	<int>	<int>	<chr>	<chr>	<chr>	<dbl>	<int>	<int>	<int>	<chr>	<chr>
ERS699811	1	2791000	2815000	24000	12	Toto	India	EastAsia	0.8304250	7	0	5	SubAfricans	Neanderthal
ERS699811	1	3058000	3113000	55000	36	Toto	India	EastAsia	0.9733212	10	6	12	SubAfricans	Neanderthal
ERS699811	1	3341000	3384000	43000	21	Toto	India	EastAsia	0.9102532	9	1	8	SubAfricans	Neanderthal
ERS699811	1	3433000	3453000	20000	13	Toto	India	EastAsia	0.9437119	8	0	7	SubAfricans	Neanderthal
ERS699811	1	3684000	3784000	100000	35	Toto	India	EastAsia	0.9745021	17	5	15	SubAfricans	Neanderthal
ERS699811	1	3984000	4091000	107000	19	Toto	India	EastAsia	0.8731331	5	4	5	SubAfricans	Neanderthal

How many individuals do we have? dim gives you number of rows and columns>

In [7]:

            
                Copied!
                
dim(archaic_df)
dim(archaic_df)

293971
14

Count the different names of individuals using unique, since the same name is repeated more than once

In [8]:

            
                Copied!
                
length(unique(archaic_df$name))
length(unique(archaic_df$name))

358

The same for the number of populations

In [9]:

            
                Copied!
                
length(unique(archaic_df$pop))
length(unique(archaic_df$pop))

110

And to see the name of the regions

In [10]:

            
                Copied!
                
unique(archaic_df$region)
unique(archaic_df$region)

'EastAsia'
'WestEurasia'
'CentralAsiaSiberia'
'SouthAsia'
'Melanesia'

How to find the average archaic segment length by population. When using a dataframe, remember that this is piped through various commands. For example, mean_seg_pop is created by piping archaic_df (with the symbol %>%) into grouping the table by region and population, which result is then piped into the command summarising through the mean of the segment length. Similarly, mean_seg_pop gets its column arranged by region and then the result is piped into the plotting commands

In [44]:

            
                Copied!
                
                    
                    
                
                

        
#this resize your plot changing width and height
options(repr.plot.width=30, repr.plot.height=10)

# Average archaic segment length by population
mean_seg_pop <- archaic_df %>%
        group_by(pop, region) %>%
        summarise(`Mean segment length` = mean(length))

#arrange by region (color blocks in the plot) and do a barplot
mean_seg_pop %>%
 ungroup() %>%
  arrange(region) %>% 
    mutate(pop = factor(pop, pop)) %>%
        ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) + 
            geom_bar(position = "dodge", stat="identity") + 
            theme_bw() +
            theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20))
#this resize your plot changing width and height
options(repr.plot.width=30, repr.plot.height=10)

# Average archaic segment length by population
mean_seg_pop <- archaic_df %>%
        group_by(pop, region) %>%
        summarise(`Mean segment length` = mean(length))

#arrange by region (color blocks in the plot) and do a barplot
mean_seg_pop %>%
 ungroup() %>%
  arrange(region) %>% 
    mutate(pop = factor(pop, pop)) %>%
        ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) + 
            geom_bar(position = "dodge", stat="identity") + 
            theme_bw() +
            theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20)) 

`summarise()` has grouped output by 'pop'. You can override using the `.groups`
argument.

What is the population with the highest average segment length?

In [12]:

            
                Copied!
                
mean_seg_pop[ which.max(mean_seg_pop$`Mean segment length`), ]
mean_seg_pop[ which.max(mean_seg_pop$`Mean segment length`), ]

A grouped_df: 1 × 3
pop	region	Mean segment length
<chr>	<chr>	<dbl>
Papuans	Melanesia	131784.3

What is the average length of segments by region?

In [46]:

            
                Copied!
                
                    
                    
                
                

        
mean_seg_region <- archaic_df %>%
        group_by(region) %>%
        summarise(`Mean segment length` = mean(length)) 

ggplot(mean_seg_region, aes(x = region, y = `Mean segment length`)) +  
    geom_bar(position = "dodge", stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(text = element_text(size=30))
mean_seg_region <- archaic_df %>%
        group_by(region) %>%
        summarise(`Mean segment length` = mean(length)) 

ggplot(mean_seg_region, aes(x = region, y = `Mean segment length`)) +  
    geom_bar(position = "dodge", stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(text = element_text(size=30)) 

Q1. Find the total lengths of Arcahic fragments in each individual¶

Calculate total archaic segment length by ind and grouped by population and region:

In [47]:

            
                Copied!
                
total_seg_ind <- archaic_df %>%
        group_by(name, pop, region) %>%
        summarise(`Total segment length` = sum(length))
 
head(total_seg_ind)
total_seg_ind <- archaic_df %>%
        group_by(name, pop, region) %>%
        summarise(`Total segment length` = sum(length))
 
head(total_seg_ind)

`summarise()` has grouped output by 'name', 'pop'. You can override using the
`.groups` argument.

A grouped_df: 6 × 4
name	pop	region	Total segment length
<chr>	<chr>	<chr>	<int>
1-Mar	Papuans	Melanesia	159627000
13733_8	Papuans	Melanesia	168490000
13748_1	Papuans	Melanesia	175569000
13748_2	Papuans	Melanesia	172182000
13748_3	Papuans	Melanesia	165722000
13748_4	Papuans	Melanesia	160407000

Q2. Summarise the total length per population and per region¶

In [54]:

            
                Copied!
                
                    
                    
                
                

        
total_seg_ind %>%
  group_by(pop,region) %>%
  summarise(`Mean total segment length`=mean(`Total segment length`)) %>%
  ungroup() %>%
  arrange(region) %>% 
  mutate(pop = factor(pop, pop)) %>%
  ggplot(aes(x = pop, y = `Mean total segment length`,
        fill = region)) + 
        geom_bar(position = "dodge", stat="identity") + 
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size=20), text=element_text(size=30))
total_seg_ind %>%
  group_by(pop,region) %>%
  summarise(`Mean total segment length`=mean(`Total segment length`)) %>%
  ungroup() %>%
  arrange(region) %>% 
  mutate(pop = factor(pop, pop)) %>%
  ggplot(aes(x = pop, y = `Mean total segment length`,
        fill = region)) + 
        geom_bar(position = "dodge", stat="identity") + 
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size=20), text=element_text(size=30))

`summarise()` has grouped output by 'pop'. You can override using the `.groups`
argument.

Q3. Which population has most archaic ancestry? Why?¶

Papuans. Denisovans.

Q4. What is the length distribution of fragments for the five different regions (hint: you can use `facet_wrap` to plot all at once)?¶

we group archaic_df by region and use ggplot to make histograms of the fragment lengths. We use facet_wrap(~region) to make a separate histogram for each region

In [20]:

            
                Copied!
                
                    
                    
                
                

        
options(repr.plot.width=20, repr.plot.height=20)

archaic_df %>%
        group_by(region) %>%
        ggplot(aes(length)) + 
        theme_bw() +
        geom_histogram(bins = 50) + 
        facet_wrap(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=15))
options(repr.plot.width=20, repr.plot.height=20)

archaic_df %>%
        group_by(region) %>%
        ggplot(aes(length)) + 
        theme_bw() +
        geom_histogram(bins = 50) + 
        facet_wrap(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=15))

Similarly, you can create a boxplot for each region. In this way you do not need the additional command facet_wrap

In [55]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
group_by(region) %>%
        ggplot(aes(y=length,x=region)) + 
        theme_bw() +
        geom_boxplot() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))
archaic_df %>%
group_by(region) %>%
        ggplot(aes(y=length,x=region)) + 
        theme_bw() +
        geom_boxplot() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))

Q5. What is the average length of fragments for each population and each region?¶

We calculate the average length of fragments in each population and then plot the table we obtained

In [56]:

            
                Copied!
                
                    
                    
                
                

        
options(repr.plot.width=30, repr.plot.height=10)

mean_seg_pop <- archaic_df %>%
        group_by(pop, region) %>%
        summarise(`Mean segment length` = mean(length))
  
mean_seg_pop %>%
  ungroup() %>%
  arrange(region) %>% 
  mutate(pop = factor(pop, pop)) %>%
  ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) + 
        geom_bar(position = "dodge", stat="identity") + 
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
options(repr.plot.width=30, repr.plot.height=10)

mean_seg_pop <- archaic_df %>%
        group_by(pop, region) %>%
        summarise(`Mean segment length` = mean(length))
  
mean_seg_pop %>%
  ungroup() %>%
  arrange(region) %>% 
  mutate(pop = factor(pop, pop)) %>%
  ggplot(aes(x = pop, y = `Mean segment length`, fill = region)) + 
        geom_bar(position = "dodge", stat="identity") + 
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

`summarise()` has grouped output by 'pop'. You can override using the `.groups`
argument.

Q6. What can cause different mean fragment lengths?¶

Differences in recombination rate, more than one wave of introgression in one of the regions. Recent gene flow/introgression from denisovans/neanderthal after migration out of Africa. Generation time.

The origin of archaic fragments¶

We now try to distinguish which populations fragments come from, and in which proportions.

Q1. For each individual, assign the archaic segments to origin and reconstruct a Figure in the same style as Figure 5 of the Cell paper (plot below).¶

We use mutate to create three origins of the fragments in the new column closest_to, by looking at where the highest proportion of fragments come from. Then we group by the columns name, closest_to, population and average the total fragment lengths before plotting.

In [24]:

            
                Copied!
                
                    
                    
                
                

        
options(repr.plot.width=20, repr.plot.height=20)

archaic_df <- archaic_df %>% 
    mutate(closest_to = case_when(
        Shared_with_Altai > Shared_with_Denisova | Shared_with_Vindija > Shared_with_Denisova ~ "Neanderthal",
        Shared_with_Altai < Shared_with_Denisova & Shared_with_Vindija < Shared_with_Denisova ~ "Denisovan",
        Shared_with_Altai + Shared_with_Vindija + Shared_with_Denisova == 0 ~ "Unclassified"
    ))

archaic_df %>%
    group_by(name, closest_to, pop) %>%
    summarise(total = sum(length)) %>%
    ungroup() %>%
    group_by(pop, closest_to) %>%
    summarise(mean_sequences = mean(total)) %>% 
    ungroup() %>%
    mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
    ggplot(aes(x = pop, y = mean_sequences/10^6, fill = closest_to)) +
        geom_bar(stat="identity") + theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab('Mean detected archaic sequence per region (Mb)') + 
        coord_flip() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20))
options(repr.plot.width=20, repr.plot.height=20)

archaic_df <- archaic_df %>% 
    mutate(closest_to = case_when(
        Shared_with_Altai > Shared_with_Denisova | Shared_with_Vindija > Shared_with_Denisova ~ "Neanderthal",
        Shared_with_Altai < Shared_with_Denisova & Shared_with_Vindija < Shared_with_Denisova ~ "Denisovan",
        Shared_with_Altai + Shared_with_Vindija + Shared_with_Denisova == 0 ~ "Unclassified"
    ))

archaic_df %>%
    group_by(name, closest_to, pop) %>%
    summarise(total = sum(length)) %>%
    ungroup() %>%
    group_by(pop, closest_to) %>%
    summarise(mean_sequences = mean(total)) %>% 
    ungroup() %>%
    mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
    ggplot(aes(x = pop, y = mean_sequences/10^6, fill = closest_to)) +
        geom_bar(stat="identity") + theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab('Mean detected archaic sequence per region (Mb)') + 
        coord_flip() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=20))

`summarise()` has grouped output by 'name', 'closest_to'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'pop'. You can override using the `.groups`
argument.

Q2. What are major differences? What can explain these differences?¶

Higher archaic proportions for Papuans. Differences in proportions and lengths between methods.

Q3. Summarize the results by averaging over region and plot these.¶

In [57]:

            
                Copied!
                
                    
                    
                
                

        
# Plotting by region
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>% 
    group_by(name, closest_to, region) %>%
    summarise(total = sum(length)) %>%
    ungroup() %>%
    group_by(region, closest_to) %>%
    summarise(mean_sequences = mean(total)) %>% 
    ungroup() %>%
    mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
    ggplot(aes(x = region, y = mean_sequences/10^6, fill = closest_to)) +
        geom_bar(stat="identity") + theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab('Mean detected archaic sequence per region (Mb)') + 
        coord_flip() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))
# Plotting by region
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>% 
    group_by(name, closest_to, region) %>%
    summarise(total = sum(length)) %>%
    ungroup() %>%
    group_by(region, closest_to) %>%
    summarise(mean_sequences = mean(total)) %>% 
    ungroup() %>%
    mutate(closest_to = factor(closest_to, c('Denisovan', 'Unclassified', 'Neanderthal'))) %>%
    ggplot(aes(x = region, y = mean_sequences/10^6, fill = closest_to)) +
        geom_bar(stat="identity") + theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab('Mean detected archaic sequence per region (Mb)') + 
        coord_flip() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=30))

`summarise()` has grouped output by 'name', 'closest_to'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.

East Asians have been shown to contain more Neanderthal admixture (maybe because of a second admixture event after the European-East Asian split or maybe because of dilution of the Neanderthal component in the European population because of migrations from non-admixed populations). Additionally, it has been shown that East Asians contain higher admixture proportions of a different Denisovan origin, than South Asians and Papuans. The component most related to the Altai genome is the one present in a higher proportion in East Asians, compared to South Asians and Papuans. East Asians contain both components with similar proportions (two waves of admixture).

Q6. Determine the fragment length distribution of segment of Neanderthal and Denisova origin separately for each region. Compare the median of the distributions.¶

In [26]:

            
                Copied!
                
                    
                    
                
                

        
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    ggplot() +
        geom_density(aes(length, fill=closest_to),alpha=0.2) +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    ggplot() +
        geom_density(aes(length, fill=closest_to),alpha=0.2) +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

When interpreting these values, we should also take into account the number of fragments and samples for each population

In [27]:

            
                Copied!
                
archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    group_by(closest_to,region) %>%
    summarise(median_fragment_length = median(length), count = n(), samples = length(unique(name)))
archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    group_by(closest_to,region) %>%
    summarise(median_fragment_length = median(length), count = n(), samples = length(unique(name)))

`summarise()` has grouped output by 'closest_to'. You can override using the
`.groups` argument.

A grouped_df: 10 × 5
closest_to	region	median_fragment_length	count	samples
<chr>	<chr>	<int>	<int>	<int>
Denisovan	CentralAsiaSiberia	76000	1301	27
Denisovan	EastAsia	74000	8455	132
Denisovan	Melanesia	102000	45153	89
Denisovan	SouthAsia	70000	2103	39
Denisovan	WestEurasia	61000	1733	71
Neanderthal	CentralAsiaSiberia	91000	12619	27
Neanderthal	EastAsia	89000	67005	132
Neanderthal	Melanesia	96000	44713	89
Neanderthal	SouthAsia	86000	16767	39
Neanderthal	WestEurasia	86000	26101	71

We can see that the median fragment length (and the number of fragments) for the Denisovans is higher in Melanesian samples, consistent with a more recent admixture with Denisovan archaics. The Denisovan fragment lengths in EastAsians, CentralAsiaSiberia aand South Asians, are longer than WestEurasians, but not markedly longer, we also have less Denisovan fragments than for Papuans. Finally, we can see that the median length of Neanderthal fragments is similar, consistent with a common Neanderthal admixture event.

In [28]:

            
                Copied!
                
quantile((archaic_df %>% 
         filter(closest_to %in% c("Neanderthal","Denisovan")))$length)
quantile((archaic_df %>% 
         filter(closest_to %in% c("Neanderthal","Denisovan")))$length)

0%: 3000
25%: 52000
50%: 91000
75%: 164000
100%: 2375000

In [20]:

            
                Copied!
                
quantile((archaic_df %>% 
         filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
         mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija))$total_snps)
quantile((archaic_df %>% 
         filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
         mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija))$total_snps)

0%: 1
25%: 9
50%: 19
75%: 36
100%: 561

So as to clean a bit the fragments, we can apply a constraint on the minimum number of overlapping SNPs. We'll choose as a theshold median number of overlapping SNPs.

In [29]:

            
                Copied!
                
                    
                    
                
                

        
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
    filter(total_snps > 19) %>%
    ggplot() +
        geom_density(aes(length, fill=closest_to),alpha=0.2) +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
options(repr.plot.width=30, repr.plot.height=10)

archaic_df %>%
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
    filter(total_snps > 19) %>%
    ggplot() +
        geom_density(aes(length, fill=closest_to),alpha=0.2) +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

In [30]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>% 
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
    filter(total_snps > 50) %>%
    ggplot(aes(x=closest_to,y=length)) +
        geom_boxplot() +
        ggtitle('Denisovan fragment length per region') +
        theme_bw() + 
        facet_grid(~region) +
        scale_x_discrete() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>% 
    filter(closest_to %in% c("Neanderthal","Denisovan")) %>%
    mutate(total_snps = Shared_with_Altai + Shared_with_Denisova + Shared_with_Vindija) %>%
    filter(total_snps > 50) %>%
    ggplot(aes(x=closest_to,y=length)) +
        geom_boxplot() +
        ggtitle('Denisovan fragment length per region') +
        theme_bw() + 
        facet_grid(~region) +
        scale_x_discrete() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

b. How does this compare to the fragment size distribution of the fragments that could not be assigned to archaic origin (these are removed from the Cell paper analyses)? Discuss reasons for the differences.¶

In [58]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df  %>%
    filter(closest_to == "Neanderthal") %>%
    group_by(region) %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Neanderthal fragment length per region') +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot1

archaic_df  %>%
    filter(closest_to == "Denisovan") %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Denisovan fragment length per region') +
        theme_bw() + facet_grid(~region)  +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot2

archaic_df  %>%
    filter(closest_to == "Unclassified") %>%
    group_by(region) %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Unclassified fragment length per region') +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot3
archaic_df  %>%
    filter(closest_to == "Neanderthal") %>%
    group_by(region) %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Neanderthal fragment length per region') +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot1

archaic_df  %>%
    filter(closest_to == "Denisovan") %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Denisovan fragment length per region') +
        theme_bw() + facet_grid(~region)  +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot2

archaic_df  %>%
    filter(closest_to == "Unclassified") %>%
    group_by(region) %>%
    ggplot(aes(length)) +
        geom_histogram(bins = 50) +
        ggtitle('Unclassified fragment length per region') +
        theme_bw() + facet_grid(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50)) -> plot3

In [59]:

            
                Copied!
                
plot1
plot1

In [60]:

            
                Copied!
                
plot2
plot2

In [61]:

            
                Copied!
                
plot3
plot3

Comparison of chromosomes¶

Q1. Determine the amount of archaic introgression on each chromosome for each of the five regions.¶

In [62]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df  %>%
    group_by(chrom, region) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = region)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22 +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df  %>%
    group_by(chrom, region) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = region)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22 +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

`summarise()` has grouped output by 'chrom'. You can override using the
`.groups` argument.

Q2. Repeat this with assignment of regions to archaic species.¶

In [63]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df  %>%
    group_by(chrom, closest_to) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = closest_to)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df  %>%
    group_by(chrom, closest_to) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = closest_to)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        scale_x_discrete(limits= c(seq(1, 22, 1),'X')) + # ordering from 1 to 22
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

`summarise()` has grouped output by 'chrom'. You can override using the
`.groups` argument.

Q3. Combine the Neanderthal fragments for all individuals and plot all the fragments on top of each other along chromosomes (hint use alpha = 0.1). Can you find “deserts” of archaic admixture and/or evidence for places where Neanderthal ancestry has reached very high frequency?¶

In [64]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
    arrange(region) %>%
    ggplot() +
      theme_bw() + 
      geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~chrom) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1),  axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none",text = element_text(size=50)) +
      xlab("Position")
archaic_df %>%
    arrange(region) %>%
    ggplot() +
      theme_bw() + 
      geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~chrom) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1),  axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none",text = element_text(size=50)) +
      xlab("Position")

Q4. You will find that the X chromosome is an outlier (compare to a chormosome of a similar size - chr8). How and why?¶

In [65]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
    group_by(chrom, region) %>%
    filter(chrom %in% c('8', 'X')) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = region)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>%
    group_by(chrom, region) %>%
    filter(chrom %in% c('8', 'X')) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency, fill = region)) +  
        geom_bar(position = "dodge", stat="identity") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

`summarise()` has grouped output by 'chrom'. You can override using the
`.groups` argument.

In [66]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
    group_by(chrom, region) %>%
    filter(chrom %in% c('8', 'X')) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency)) +  
        geom_boxplot() +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>%
    group_by(chrom, region) %>%
    filter(chrom %in% c('8', 'X')) %>%
    summarise(Frequency = sum(as.numeric(length))) %>%
    ggplot(aes(x = chrom, y = Frequency)) +  
        geom_boxplot() +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

`summarise()` has grouped output by 'chrom'. You can override using the
`.groups` argument.

Sexual antagonistic effects acting on this chromosome, with possible effects on hybrid incompatibilities and speciation.

Q6. Do you find regions that are devoid of introgression for both the Neanderthal and the Denisovan admixture events?¶

Looking at the EPAS1 as an example of adaptive introgression. This gene maps to chr2:46,520,806-46,613,842 in GRCh37 coordinates. Tibetan individuals have an increased amount of archaic polymorphism in this region.

In [67]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
    filter(closest_to == "Denisovan") %>%
    filter(chrom == 2 & region == 'EastAsia' & start > 46500000 & end < 46700000 ) %>%
    ggplot((aes(x = length))) + theme_bw() +
        geom_histogram(bins = 50) + facet_grid(~country) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>%
    filter(closest_to == "Denisovan") %>%
    filter(chrom == 2 & region == 'EastAsia' & start > 46500000 & end < 46700000 ) %>%
    ggplot((aes(x = length))) + theme_bw() +
        geom_histogram(bins = 50) + facet_grid(~country) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))

Contrary, FOXP2 is an example of maladaptive introgression, consistent with the presence of deserts of archaic origin in around this gene.

In [ ]:

            
                Copied!
                
                    
                    
                
                

        
archaic_df %>%
    filter(chrom == "7") %>%
    arrange(region) %>%
    ggplot() +
        theme_bw() + 
        geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),  axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none") +
        geom_vline(xintercept = 113726365, alpha=0.5) +
        xlab("Position") + ggtitle(paste("Chromosome 7")) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))
archaic_df %>%
    filter(chrom == "7") %>%
    arrange(region) %>%
    ggplot() +
        theme_bw() + 
        geom_segment(aes(x=start,xend=end,y=0, yend=1, col=closest_to), alpha=0.02) + facet_wrap(~region) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),  axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),legend.position="none") +
        geom_vline(xintercept = 113726365, alpha=0.5) +
        xlab("Position") + ggtitle(paste("Chromosome 7")) +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),text = element_text(size=50))