Count normalization with DESeq2

Author

You!

Published

January 29, 2025

Approximate time: 40 minutes

Learning Objectives

  • Become familiar with the DESeqDataSet object
  • Understand how to normalize counts using DESeq2

Normalization

The first step in the DE analysis workflow is count normalization, which is necessary to make accurate comparisons of gene expression between samples. DESeq2 normalizes counts using the median of ratios method, which accounts for differences in sequencing depth and RNA composition between samples. Let’s try an easy example!


Exercise 1

Determine the normalized (median of ratios) counts for your gene of interest, PD1, given the raw counts and size factors below.

NOTE: You will need to run the code below to generate the raw counts dataframe (PD1) and the size factor vector (size_factors), then use these objects to determine the normalized counts values:

# Raw counts for PD1
# (the pipe, as_tibble() and rename_all() come from the tidyverse)
library(tidyverse)

PD1 <- t(c(21, 58, 17, 97, 83, 10)) %>% 
  as_tibble() %>%
  rename_all(~paste0("Sample", 1:6))


# Size factors for each sample
size_factors <- c(1.32, 0.70, 1.04, 1.27, 1.11, 0.85)

Your code here:


Count normalization of the Vampirium dataset using DESeq2

Now that we know the theory of count normalization, we will normalize the counts for the Vampirium dataset using DESeq2. This requires a few steps:

  1. Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.
  2. Create a DESeqDataSet object.
  3. Generate the normalized counts.

1. Match the metadata and counts data

We should always make sure that we have sample names that match between the two files, and that the samples are in the right order. DESeq2 will output an error if this is not the case. Since we built our txi object from our metadata, everything should be OK.

### Check that sample names match in both files
all(colnames(txi$counts) %in% meta$sample)
[1] TRUE
all(colnames(txi$counts) == meta$sample)
[1] TRUE

If your data did not match, you could use the match() function to rearrange them so that they do. match() takes two arguments and returns, for each element of its first argument, the position at which that element is found in the second argument.

a <- c("a","b","c")
b <- c("b","c","a")

reorder <- match(a,b)
reorder
[1] 3 1 2
b[reorder]
[1] "a" "b" "c"

Exercise 2

Suppose we had sample names matching in the txi object and metadata file, but they were out of order. Write the line(s) of code required to reorder the rows of the meta_random dataframe so that they are identical in order to the column names of txi$counts.

# randomize metadata rownames
meta_random <- meta[sample(1:nrow(meta)),]

Your code here:

#your code here

2. Create DESeq2 object

Let’s start by creating the DESeqDataSet object, and then we can talk a bit more about what is stored inside it. To create the object, we will need the txi object and the metadata table as input (colData argument). We will also need to specify a design formula. The design formula specifies which column(s) of our metadata we want to use for statistical testing and modeling (more about that later!). For our dataset we only have one column we are interested in, which is condition. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.

It is very important to establish beforehand which sample type will be our “base” or “reference” level. If nothing is changed, DESeq2 will use the first level of the factor (alphabetical order) as the reference. You can check the current levels using the factor() function.

factor(meta$condition)
[1] control   control   control   vampirium vampirium vampirium garlicum 
[8] garlicum 
Levels: control garlicum vampirium

While in a normal experiment we would use control samples as our reference, in our case we are interested in the differences for both control vs. vampirium and garlicum vs. vampirium. Thus, it is much more convenient to set vampirium as the base level of our factor. We can do this with the factor() function as well, using its levels = argument.

meta$condition <- factor(meta$condition, levels = c("vampirium", "control", "garlicum"))
factor(meta$condition)
[1] control   control   control   vampirium vampirium vampirium garlicum 
[8] garlicum 
Levels: vampirium control garlicum

We can see now that vampirium is the first level, meaning that DESeq2 will interpret it as our reference sample type.
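
Alternatively, the base R relevel() function can be used to move an existing level to the front. A minimal sketch, equivalent to the factor() call above:

# Equivalent approach: move "vampirium" to the front of the existing levels
meta$condition <- relevel(factor(meta$condition), ref = "vampirium")
levels(meta$condition)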

Our count matrix input is stored in the txi list object, so we need to use the DESeqDataSetFromTximport() function, which will extract the counts component and round the values to the nearest whole number.

# colData argument requires rownames in order to assess matching sample names
# meta is a tibble object from tidyverse, so we need to add rownames.
# If you do not do this and the samples do not match, you will add wrong info!

dds <- DESeqDataSetFromTximport(txi,
                                colData = meta %>% column_to_rownames("sample"),
                                design = ~ condition)

NOTE: The warning from the previous chunk tells us that our vampirium samples have been set up as the reference instead of control. This is exactly what we wanted!

NOTE: If you did not create pseudocounts, but instead built a count matrix from aligned BAM files with a tool such as featureCounts, you would want to use the DESeqDataSetFromMatrix() function.

## DO NOT RUN!
## Create DESeq2Dataset object from a traditional count matrix
## countData must be a matrix/data.frame, so read the counts file in first
## (header/row-name settings assume a typical tab-separated counts file)
counts_traditional <- read.table("../Data/Vampirium_counts_traditional.tsv",
                                 header = TRUE, row.names = 1, sep = "\t")
dds <- DESeqDataSetFromMatrix(countData = counts_traditional, 
                              colData = meta %>% column_to_rownames("sample"), 
                              design = ~ condition)

You can use DESeq2-specific functions to access the different slots and retrieve information, if you wish. For example, to retrieve the original count matrix we would use counts():

head(counts(dds))
                control_3 control_2 control_1 vampirium_3 vampirium_2
ENSG00000000005        24        28        28          22          38
ENSG00000000419      1009      1421      1634        1311        1896
ENSG00000000457       341       400       525         342         584
ENSG00000000938         0         1         1           0           0
ENSG00000000971        10        21         1           4          22
ENSG00000001036      1887      2550      3078        1659        2850
                vampirium_1 garlicum_3 garlicum_2
ENSG00000000005          24         13         34
ENSG00000000419        2108       1484       2544
ENSG00000000457         630        549        920
ENSG00000000938           0          0          2
ENSG00000000971           1          2          4
ENSG00000001036        3013       2624       4353

As we go through the workflow we will use the relevant functions to check what information gets stored inside our object.

Pre-filtering

While it is not necessary to pre-filter low-count genes before running the DESeq2 functions, there are two reasons that make pre-filtering useful:

  • By removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2.
  • It can also improve visualizations, as features with no information for differential expression are not plotted.

Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
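
A slightly stricter filter, suggested in the DESeq2 vignette, keeps only genes with at least 10 reads in a minimum number of samples. A sketch for this dataset, where the smallest group (garlicum) has two samples, is shown below; use it instead of, not in addition to, the total-count filter above:

# Stricter alternative: keep genes with >= 10 reads in at least 2 samples
# (2 = size of the smallest sample group, garlicum)
smallestGroupSize <- 2
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep, ]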

3. Generate the Vampirium normalized counts

The next step is to normalize the count data in order to be able to make fair gene comparisons between samples.

To perform the median of ratios method of normalization, DESeq2 has a single estimateSizeFactors() function that will generate size factors for us. We will use the function in the example below, but in a typical RNA-seq analysis this step is automatically performed by the DESeq() function, which we will see later.

dds <- estimateSizeFactors(dds)

By assigning the results back to the dds object we are filling in the slots of the DESeqDataSet object with the appropriate information. We can take a look at the normalization factor applied to each sample using:

sizeFactors(dds)
  control_3   control_2   control_1 vampirium_3 vampirium_2 vampirium_1 
  0.7500980   0.9580056   1.1134270   0.6565847   1.1418466   1.2185305 
 garlicum_3  garlicum_2 
  0.9327493   1.5527685 
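
For intuition, the median of ratios calculation can be reproduced by hand for genes with no zero counts. This is only a sketch; estimateSizeFactors() handles the details (including any tximport length offsets) more carefully:

# Manual median-of-ratios sketch:
# 1. per-gene (log) geometric mean across samples = pseudo-reference sample
# 2. per-sample size factor = median of the count/pseudo-reference ratios
raw_counts <- counts(dds)
log_geo_means <- rowMeans(log(raw_counts))   # log of each gene's geometric mean
use <- is.finite(log_geo_means)              # exclude genes with any zero count
manual_sf <- apply(raw_counts, 2, function(cnts) {
  exp(median((log(cnts) - log_geo_means)[use]))
})
manual_sf                                    # should be close to sizeFactors(dds)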

Now, to retrieve the normalized counts matrix from dds, we use the counts() function and add the argument normalized=TRUE.

normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
                 control_3  control_2    control_1 vampirium_3 vampirium_2
ENSG00000000005   31.99582   29.22739   25.1475854   33.506721    33.27943
ENSG00000000419 1345.15751 1483.28994 1467.5412362 1996.695993  1660.46823
ENSG00000000457  454.60725  417.53411  471.5172271  520.877216   511.45224
ENSG00000000971   13.33159   21.92054    0.8981281    6.092131    19.26704
ENSG00000001036 2515.67119 2661.77997 2764.4381426 2526.711406  2495.95699
ENSG00000001084 2852.96044 2820.44293 3023.9971495 2613.524275  2751.68311
                 vampirium_1  garlicum_3  garlicum_2
ENSG00000000005   19.6958544   13.937293   21.896374
ENSG00000000419 1729.9525449 1590.995601 1638.363998
ENSG00000000457  517.0161780  588.582605  592.490125
ENSG00000000971    0.8206606    2.144199    2.576044
ENSG00000001036 2472.6503879 2813.188988 2803.379907
ENSG00000001084 2516.1453997 3677.301154 3390.717944
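
As a quick sanity check, each column of the normalized matrix should simply be the raw counts divided by that sample's size factor (assuming plain size-factor normalization, as used here):

# The first few normalized values for control_3, reproduced manually
head(counts(dds)[, "control_3"] / sizeFactors(dds)["control_3"])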

We can save this normalized data matrix to file for later use:

write.table(normalized_counts, file="../Results/normalized_counts.txt", sep="\t", quote=FALSE)
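
If you need these normalized counts in a later session, something like the following should read them back in (because write.table stored row names, read.table detects the shifted header line and uses the first column as row names):

# Read the normalized counts back in from file
normalized_counts <- read.table("../Results/normalized_counts.txt",
                                header = TRUE, sep = "\t")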