Demonstrate the use of the design formula with simple and complex designs
Construct R code to execute the differential expression analysis workflow with DESeq2
Design formula
A design formula tells the statistical software the known sources of variation to control for, as well as, the factor of interest to test for during differential expression testing. For example, if you know that sex is a significant source of variation in your data, then sex should be included in your model. The design formula should have all of the factors in your metadata that account for major sources of variation in your data. The last factor entered in the formula should be the condition of interest.
For example, suppose you have the following metadata:
condition bloodtype patient
sample1 control bloodO patient1
sample2 control bloodO patient2
sample3 control bloodO patient3
sample4 treat bloodO patient1
sample5 treat bloodO patient2
sample6 treat bloodO patient3
sample7 control bloodA patient1
sample8 control bloodA patient2
sample9 control bloodA patient3
sample10 treat bloodA patient1
sample11 treat bloodA patient2
sample12 treat bloodA patient3
If you want to examine the expression differences between condition, and you know that major sources of variation include bloodtype and patient, then your design formula would be:
design = ~ bloodtype + patient + condition
The tilde (~) should always precede your factors and tells DESeq2 to model the counts using the following formula. Note the factors included in the design formula need to match the column names in the metadata.
In this tutorial we show a general and flexible way to define contrasts, and is often useful for more complex contrasts or when the design of the experiment is imbalanced (e.g. different number of replicates in each group). Although we focus on DESeq2, the approach can also be used with the other popular package edgeR.
Each section below covers a particular experimental design, from simpler to more complex ones. The first chunk of code in each section is to simulate data, which has no particular meaning and is only done in order to have a DESeqDataSet object with the right kind of variables for each example. In practice, users can ignore this step as they should have created a DESeqDataSet object from their own data following the instructions in the vignette.
One factor, two levels
Code
# simulate datadds <-makeExampleDESeqDataSet(n =1000, m =6, betaSD =2)dds$condition <-factor(rep(c("control", "treat"), each =3))
First we can look at our sample information:
Code
colData(dds)
DataFrame with 6 rows and 1 column
condition
<factor>
sample1 control
sample2 control
sample3 control
sample4 treat
sample5 treat
sample6 treat
Our factor of interest is condition and so we define our design and run the DESeq model fitting routine:
Code
design(dds) <-~1+ condition # or just `~ condition`dds <-DESeq(dds) # equivalent to edgeR::glmFit()
Then check what coefficients DESeq estimated:
Code
resultsNames(dds)
[1] "Intercept" "condition_treat_vs_control"
We can see that we have a coefficient for our intercept and coefficient for the effect of treat (i.e. differences between treat versus control).
Using the more standard syntax, we can obtain the results for the effect of treat as such:
The above is a simple way to obtain the results of interest. But it is worth understanding how DESeq is getting to these results by looking at the model’s matrix. DESeq defines the model matrix using base R functionality:
We can see that R coded condition as a dummy variable, with an intercept (common to all samples) and a “conditiontreat” variable, which adds the effect of treat to samples 4-6.
We can actually set our contrasts in DESeq2::results() using a numeric vector. The way it works is to define a vector of “weights” for the coefficient(s) we want to test for. In this case, we have (Intercept) and conditiontreat as our coefficients (see model matrix above), and we want to test for the effect of treat, so our contrast vector would be c(0, 1). In other words, we don’t care about the value of (Intercept) (so it has a weight of 0), and we’re only interested in the effect of treat (so we give it a weight of 1).
In this case the design is very simple, so we could define our contrast vector “manually”. But for complex designs this can get more difficult to do, so it’s worth mentioning the general way in which we can define this. For any contrast of interest, we can follow three steps:
Get the model matrix
Subset the matrix for each group of interest and calculate its column means - this results in a vector of coefficients for each group
Subtract the group vectors from each other according to the comparison we’re interested in
Let’s see this example in action:
Code
# get the model matrixmod_mat <-model.matrix(design(dds), colData(dds))mod_mat
# calculate the vector of coefficient weights in the treattreat <-colMeans(mod_mat[dds$condition =="treat", ])treat
(Intercept) conditiontreat
1 1
Code
# calculate the vector of coefficient weights in the controlcontrol <-colMeans(mod_mat[dds$condition =="control", ])control
(Intercept) conditiontreat
1 0
Code
# The contrast we are interested in is the difference between treat and controltreat - control
(Intercept) conditiontreat
0 1
That last step is where we define our contrast vector, and we can give this directly to the results function:
Code
# get the results for this contrastres2 <-results(dds, contrast = treat - control)
This gives us exactly the same results as before, which we can check for example by plotting the log-fold-changes between the first and second approach:
Code
plot(res1$log2FoldChange, res2$log2FoldChange)
Recoding the design
Often, we can use different model matrices that essentially correspond to the same design. For example, we could recode our design above by removing the intercept:
In this case we get a coefficient corresponding to the average expression in control and the average expression in the treat (rather than the difference between treat and control).
If we use the same contrast trick as before (using the model matrix), we can see the result is the same:
Code
# get the model matrixmod_mat <-model.matrix(design(dds), colData(dds))mod_mat
# calculate weights for coefficients in each conditiontreat <-colMeans(mod_mat[which(dds$condition =="treat"), ])control <-colMeans(mod_mat[which(dds$condition =="control"), ])# get the results for our contrastres3 <-results(dds, contrast = treat - control)
Again, the results are essentially the same:
Code
plot(res1$log2FoldChange, res3$log2FoldChange)
In theory there’s no difference between these two ways of defining our design. The design with an intercept is more common, but for the purposes of understanding what’s going on, it’s sometimes easier to look at models without intercept.
One factor, three levels
Code
# simulate datadds <-makeExampleDESeqDataSet(n =1000, m =9, betaSD =2)dds$condition <-NULLdds$bloodtype <-factor(rep(c("bloodA", "bloodB", "bloodO"), each =3))dds$bloodtype <-relevel(dds$bloodtype, "bloodO")
For comparing bloodA vs bloodB, however, we need to compare two coefficients with each other to check whether they are themselves different (check the slide to see the illustration). This is how the standard DESeq syntax would be:
# calculate coefficient vectors for each groupbloodA <-colMeans(mod_mat[dds$bloodtype =="bloodA", ])bloodB <-colMeans(mod_mat[dds$bloodtype =="bloodB", ])bloodO <-colMeans(mod_mat[dds$bloodtype =="bloodO", ])
And we can now define any contrasts we want:
Code
# obtain results for each pairwise contrastres2_bloodA_bloodO <-results(dds, contrast = bloodA - bloodO)res2_bloodB_bloodO <-results(dds, contrast = bloodB - bloodO)res2_bloodA_bloodB <-results(dds, contrast = bloodA - bloodB)# plot the results from the two approaches to check that they are identicalplot(res1_bloodA_bloodO$log2FoldChange, res2_bloodA_bloodO$log2FoldChange)
Notice the contrast vector in this case assigns a “weight” of 0.5 to each of bloodtypebloodA and bloodtypebloodB. This is equivalent to saying that we want to consider the average of bloodA and bloodB expression. In fact, we could have also defined our contrast vector like this:
Code
# average of bloodA and bloodB minus bloodO(bloodA + bloodB)/2- bloodO
However, in this model the gene dispersion is estimated together for bloodA and bloodB samples as if they were replicates of each other, which may result in inflated/deflated estimates. Instead, our approach above estimates the error within each of those groups.
To check the difference one could compare the two approaches visually:
Code
# compare the log-fold-changes between the two approachesplot(res1_A_B$log2FoldChange, res2_AB$log2FoldChange)abline(0, 1, col ="brown", lwd =2)
Code
# compare the errors between the two approachesplot(res1_A_B$lfcSE, res2_AB$lfcSE)abline(0, 1, col ="brown", lwd =2)
DataFrame with 12 rows and 2 columns
condition bloodtype
<factor> <factor>
sample1 control bloodO
sample2 control bloodO
sample3 control bloodO
sample4 treat bloodO
sample5 treat bloodO
... ... ...
sample8 control bloodA
sample9 control bloodA
sample10 treat bloodA
sample11 treat bloodA
sample12 treat bloodA
This time we have two factors of interest, and we want to model both with an interaction (i.e. we assume that bloodA and bloodO samples may respond differently to treat/control). We define our design accordingly and fit the model:
Because we have two factors and an interaction, the number of comparisons we can do is larger. Using our three-step approach from the model matrix, we do things exactly as we’ve been doing so far:
Code
# get the model matrixmod_mat <-model.matrix(design(dds), colData(dds))# Define coefficient vectors for each conditionbloodO_control <-colMeans(mod_mat[dds$bloodtype =="bloodO"& dds$condition =="control", ])bloodO_treat <-colMeans(mod_mat[dds$bloodtype =="bloodO"& dds$condition =="treat", ])bloodA_control <-colMeans(mod_mat[dds$bloodtype =="bloodA"& dds$condition =="control", ])bloodA_treat <-colMeans(mod_mat[dds$bloodtype =="bloodA"& dds$condition =="treat", ])
We are now ready to define any contrast of interest from these vectors (for completeness we show the equivalent syntax using the coefficient’s names from DESeq).
In conclusion, although we can define these contrasts using DESeq coefficient names, it is somewhat more explicit (and perhaps intuitive?) what it is we’re comparing using matrix-based contrasts.
Three factors, with nesting
Code
# simulate datadds <-makeExampleDESeqDataSet(n =1000, m =24, betaSD =2)dds$bloodtype <-factor(rep(c("bloodA", "bloodO"), each =12))dds$bloodtype <-relevel(dds$bloodtype, "bloodO")dds$patient <-factor(rep(LETTERS[1:4], each =6))dds$condition <-factor(rep(c("treat", "control"), 12))dds <- dds[, order(dds$bloodtype, dds$patient, dds$condition)]colnames(dds) <-paste0("sample", 1:ncol(dds))
First let’s look at our sample information:
Code
colData(dds)
DataFrame with 24 rows and 3 columns
condition bloodtype patient
<factor> <factor> <factor>
sample1 control bloodO C
sample2 control bloodO C
sample3 control bloodO C
sample4 treat bloodO C
sample5 treat bloodO C
... ... ... ...
sample20 control bloodA B
sample21 control bloodA B
sample22 treat bloodA B
sample23 treat bloodA B
sample24 treat bloodA B
Now we have three factors, but patient is nested within bloodtype (i.e. a patient is either bloodA or bloodO, it cannot be both). Therefore, bloodtype is a linear combination with patient (or, another way to think about it is that bloodtype is redundant with patient). Because of this, we will define our design without including “bloodtype”, although later we can compare groups of patient of the same bloodtype with each other.
Now it’s harder to define contrasts between groups of patient of the same bloodtype using DESeq’s coefficient names (although still possible). But using the model matrix approach, we do it in exactly the same way we have done so far!
Again, let’s define our groups from the model matrix:
Code
# get the model matrixmod_mat <-model.matrix(design(dds), colData(dds))# define coefficient vectors for each groupbloodO_control <-colMeans(mod_mat[dds$bloodtype =="bloodO"& dds$condition =="control", ])bloodA_control <-colMeans(mod_mat[dds$bloodtype =="bloodA"& dds$condition =="control", ])bloodO_treat <-colMeans(mod_mat[dds$bloodtype =="bloodO"& dds$condition =="treat", ])bloodA_treat <-colMeans(mod_mat[dds$bloodtype =="bloodA"& dds$condition =="treat", ])
It’s worth looking at some of these vectors, to see that they are composed of weighted coefficients from different patient. For example, for “bloodO” patient, we have equal contribution from “patientC” and “patientD”:
We can set our contrasts in exactly the same way as we did in the previous section (for completeness, we also give the contrasts using DESeq’s named coefficients).
bloodA vs bloodO (in the control):
Code
res1_bloodA_bloodO_control <-results(dds, contrast = bloodA_control - bloodO_control)# or equivalentlyres2_bloodA_bloodO_control <-results(dds, contrast =list(c("patient_B_vs_A"), # Blood type Ac("patient_C_vs_A", # Blood type O"patient_D_vs_A"))) # Blood type O
bloodA vs bloodO (in the treat):
Code
res1_bloodO_bloodA_treat <-results(dds, contrast = bloodO_treat - bloodA_treat)# or equivalentlyres2_bloodO_bloodA_treat <-results(dds, contrast =list(c("patient_B_vs_A", # Blood type A"patientB.conditiontreat"), # Interaction of patient B with treatmentc("patient_C_vs_A", # Blood type O"patient_D_vs_A", # Blood type O"patientC.conditiontreat", # Interaction of patient C with treatment"patientD.conditiontreat"))) # Interaction of patient B with treatment
And so on, for other contrasts of interest…
Extra: imbalanced design
Let’s take our previous example, but drop one of the samples from the data, so that we only have 2 replicates for it.
Code
dds <- dds[, -1] # drop one of the patient C samplesdds <-DESeq(dds)resultsNames(dds)
Notice that whereas before “patientC” and “patientD” had each a weight of 0.5, now they have different weights. That’s because for patientC there’s only 2 replicates. So, we have a total of 5 bloodtype O individuals in the control (2 from patient C and 3 from D). Therefore, when we calculate the average coefficients for bloodOs, we need to do it as 0.4 x patientC + 0.6 x patientD.
The nice thing about this approach is that we do not need to worry about any of this, the weights come from our colMeans() call automatically. And now, any contrasts that we make will take these weights into account:
Code
# bloodA vs bloodO (in the control)bloodA_control - bloodO_control