Count normalization with DESeq2
Approximate time: 40 minutes
Learning Objectives
- Become familiar with the DESeqDataSet object
- Understand how to normalize counts using DESeq2
Normalization
The first step in the DE analysis workflow is count normalization, which is necessary to make accurate comparisons of gene expression between samples. Let’s try to run an easy example!
Exercise 1
Determine the normalized (median of ratios) counts for your gene of interest, PD1, given the raw counts and size factors below.
NOTE: You will need to run the code below to generate the raw counts dataframe (PD1) and the size factor vector (size_factors), then use these objects to determine the normalized counts values:
# Raw counts for PD1
PD1 <- t(c(21, 58, 17, 97, 83, 10)) %>%
  as_tibble() %>%
  rename_all(~paste0("Sample", 1:6))

# Size factors for each sample
size_factors <- c(1.32, 0.70, 1.04, 1.27, 1.11, 0.85)
Your code here:
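One possible solution, shown only as a sketch (your own approach may differ): since the normalized count of each sample is its raw count divided by that sample's size factor, we can divide PD1 by size_factors directly.
# Possible solution: divide each sample's raw count by its size factor
PD1_normalized <- PD1 / size_factors
PD1_normalized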
Count normalization of the Vampirium dataset using DESeq2
Now that we know the theory of count normalization, we will normalize the counts for the Vampirium dataset using DESeq2. This requires a few steps:
- Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.
- Create a DESeqDataSet object
- Generate the normalized counts
1. Match the metadata and counts data
We should always make sure that we have sample names that match between the two files, and that the samples are in the right order. DESeq2 will output an error if this is not the case. Since we built our txi
object from our metadata, everything should be OK.
### Check that sample names match in both files
all(colnames(txi$counts) %in% meta$sample)
[1] TRUE
all(colnames(txi$counts) == meta$sample)
[1] TRUE
If your data did not match, you could use the match() function to rearrange them so that they do. match() takes two arguments and returns, for each element of the first argument, the position at which it is found in the second argument.
<- c("a","b","c")
a <- c("b","c","a")
b
<- match(a,b)
reorder reorder
[1] 3 1 2
b[reorder]
[1] "a" "b" "c"
Exercise 2
Suppose we had sample names matching between the txi object and the metadata file, but they were out of order. Write the line(s) of code required to reorder the rows of the meta_random dataframe so that they are identical to the column names of the txi object.
# randomize metadata rownames
meta_random <- meta[sample(1:nrow(meta)),]
Your code here:
#your code here
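One possible solution, as a sketch (assuming, as in meta, that the sample names of meta_random are stored in its sample column): use match() to get the row order corresponding to the columns of txi$counts, then subset meta_random with it.
# Possible solution: reorder meta_random rows to match the txi count columns
reorder_idx <- match(colnames(txi$counts), meta_random$sample)
meta_random <- meta_random[reorder_idx, ]
all(colnames(txi$counts) == meta_random$sample)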
2. Create DESeq2 object
Let’s start by creating the DESeqDataSet
object, and then we can talk a bit more about what is stored inside it. To create the object, we will need the txi object and the metadata table as input (colData
argument). We will also need to specify a design formula. The design formula specifies which column(s) of our metadata we want to use for statistical testing and modeling (more about that later!). For our dataset we only have one column we are interested in, which is condition
. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.
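As a purely hypothetical illustration (our metadata has no such column), if we also had a batch column we wanted to account for, the design formula could list it before the variable of interest:
# Hypothetical example only: "batch" is not a column in our metadata
design_formula <- ~ batch + condition
design_formula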
It is very important to establish beforehand which sample type will be our “base” or “reference” level. If nothing is changed, DESeq2 will assume that our reference samples will be the first sample type (in alphabetical order). You can check this using the factor()
function.
factor(meta$condition)
[1] control control control vampirium vampirium vampirium garlicum
[8] garlicum
Levels: control garlicum vampirium
While in a normal experiment we would use the control samples as our reference, in our case we are interested in checking the differences of both control vs. vampirium and garlicum vs. vampirium. Thus, it is much more convenient to set our base factor level to vampirium. We can also do this with the factor() function, using the levels = argument.
meta$condition = factor(meta$condition, levels = c("vampirium", "control", "garlicum"))
factor(meta$condition)
[1] control control control vampirium vampirium vampirium garlicum
[8] garlicum
Levels: vampirium control garlicum
We can now see that vampirium is the first factor level, meaning that DESeq2 will interpret it as our reference sample type.
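As a side note, an equivalent way to set the reference level is the relevel() function, which moves one level to the front and keeps the order of the others; this is just an alternative to the factor(levels = ...) call above.
# Alternative (equivalent) way to set the reference level
meta$condition <- relevel(factor(meta$condition), ref = "vampirium")
levels(meta$condition)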
Our count matrix input is stored in the txi
list object. So we need to specify that using the DESeqDataSetFromTximport()
function, which will extract the counts component and round the values to the nearest whole number.
# colData argument requires rownames in order to assess matching sample names
# meta is a tibble object from tidyverse, so we need to add rownames.
# If you do not do this and the samples do not match, you will add wrong info!
dds <- DESeqDataSetFromTximport(txi,
                                colData = meta %>% column_to_rownames("sample"),
                                design = ~ condition)
NOTE: The warning from the previous chunk is telling us that we have set up our vampirium samples as the reference instead of control. This is exactly what we wanted!
NOTE: If you did not create pseudocounts, but instead generated a count matrix from aligned BAM files with tools such as featureCounts, you would want to use the DESeqDataSetFromMatrix() function.
## DO NOT RUN!
## Create DESeq2Dataset object from traditional count matrix
dds <- DESeqDataSetFromMatrix(countData = "../Data/Vampirium_counts_traditional.tsv",
                              colData = meta %>% column_to_rownames("sample"),
                              design = ~ condition)
You can use DESeq2-specific functions to access the different slots and retrieve information, if you wish. For example, if we wanted the original count matrix, we would use counts():
head(counts(dds))
control_3 control_2 control_1 vampirium_3 vampirium_2
ENSG00000000005 24 28 28 22 38
ENSG00000000419 1009 1421 1634 1311 1896
ENSG00000000457 341 400 525 342 584
ENSG00000000938 0 1 1 0 0
ENSG00000000971 10 21 1 4 22
ENSG00000001036 1887 2550 3078 1659 2850
vampirium_1 garlicum_3 garlicum_2
ENSG00000000005 24 13 34
ENSG00000000419 2108 1484 2544
ENSG00000000457 630 549 920
ENSG00000000938 0 0 2
ENSG00000000971 1 2 4
ENSG00000001036 3013 2624 4353
As we go through the workflow we will use the relevant functions to check what information gets stored inside our object.
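For instance, you could already peek at the sample information and the design formula stored in the object using the colData() and design() accessors:
# A couple of accessor examples
colData(dds)   # the sample metadata we provided
design(dds)    # the design formula we specified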
Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful:
- By removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2.
- It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
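If you are curious how many genes pass this filter, a quick check could be:
# Number of genes kept (TRUE) vs. removed (FALSE) by the filter
table(keep)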
3. Generate the Vampirium normalized counts
The next step is to normalize the count data in order to be able to make fair gene comparisons between samples.
To perform the median of ratios method of normalization, DESeq2 has a single estimateSizeFactors()
function that will generate size factors for us. We will use the function in the example below, but in a typical RNA-seq analysis this step is automatically performed by the DESeq()
function, which we will see later.
dds <- estimateSizeFactors(dds)
By assigning the results back to the dds
object we are filling in the slots of the DESeqDataSet
object with the appropriate information. We can take a look at the normalization factor applied to each sample using:
sizeFactors(dds)
control_3 control_2 control_1 vampirium_3 vampirium_2 vampirium_1
0.7500980 0.9580056 1.1134270 0.6565847 1.1418466 1.2185305
garlicum_3 garlicum_2
0.9327493 1.5527685
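To connect these numbers back to the median of ratios method, below is a rough sketch of how such size factors could be computed by hand from the raw counts. This is for illustration only; it may not reproduce sizeFactors(dds) exactly, because DESeq2 can also incorporate the average transcript lengths imported via tximport.
# Illustration only: median-of-ratios size factors computed by hand
raw_counts <- counts(dds)
# log of each gene's geometric mean across samples (the pseudo-reference)
log_geo_means <- rowMeans(log(raw_counts))
# per sample: the median ratio of its counts to the pseudo-reference,
# considering only genes with non-zero counts in every sample
manual_sf <- apply(raw_counts, 2, function(cnts) {
  exp(median((log(cnts) - log_geo_means)[is.finite(log_geo_means) & cnts > 0]))
})
manual_sf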
Now, to retrieve the normalized counts matrix from dds
, we use the counts()
function and add the argument normalized=TRUE
.
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
control_3 control_2 control_1 vampirium_3 vampirium_2
ENSG00000000005 31.99582 29.22739 25.1475854 33.506721 33.27943
ENSG00000000419 1345.15751 1483.28994 1467.5412362 1996.695993 1660.46823
ENSG00000000457 454.60725 417.53411 471.5172271 520.877216 511.45224
ENSG00000000971 13.33159 21.92054 0.8981281 6.092131 19.26704
ENSG00000001036 2515.67119 2661.77997 2764.4381426 2526.711406 2495.95699
ENSG00000001084 2852.96044 2820.44293 3023.9971495 2613.524275 2751.68311
vampirium_1 garlicum_3 garlicum_2
ENSG00000000005 19.6958544 13.937293 21.896374
ENSG00000000419 1729.9525449 1590.995601 1638.363998
ENSG00000000457 517.0161780 588.582605 592.490125
ENSG00000000971 0.8206606 2.144199 2.576044
ENSG00000001036 2472.6503879 2813.188988 2803.379907
ENSG00000001084 2516.1453997 3677.301154 3390.717944
We can save this normalized data matrix to file for later use:
write.table(normalized_counts, file="../Results/normalized_counts.txt", sep="\t", quote=F)
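If you need these values again in a later session, the file can be read back in, for example, with read.table(); because write.table() also wrote the row names, the gene IDs in the first column are picked up as row names again when header = TRUE.
# Example: reading the saved normalized counts back in later
normalized_counts <- read.table("../Results/normalized_counts.txt", sep = "\t", header = TRUE)
head(normalized_counts)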