DESeq samples comparison: contrast designs

Author

You!

Published

December 20, 2024

Approximate time: 40 minutes

Learning Objectives

  • Demonstrate the use of the design formula with simple and complex designs
  • Construct R code to execute the differential expression analysis workflow with DESeq2

Design formula

A design formula tells the statistical software the known sources of variation to control for, as well as, the factor of interest to test for during differential expression testing. For example, if you know that sex is a significant source of variation in your data, then sex should be included in your model. The design formula should have all of the factors in your metadata that account for major sources of variation in your data. The last factor entered in the formula should be the condition of interest.

For example, suppose you have the following metadata:

Code
metadata <- data.frame(row.names = paste0("sample", c(1:12)), 
           condition = rep(c(rep("control",3), rep("treat",3)), 2),
           bloodtype = c(rep("bloodO",6), rep("bloodA",6)),
           patient = rep(paste0("patient",c(1:3)), 4)
)

metadata
         condition bloodtype  patient
sample1    control    bloodO patient1
sample2    control    bloodO patient2
sample3    control    bloodO patient3
sample4      treat    bloodO patient1
sample5      treat    bloodO patient2
sample6      treat    bloodO patient3
sample7    control    bloodA patient1
sample8    control    bloodA patient2
sample9    control    bloodA patient3
sample10     treat    bloodA patient1
sample11     treat    bloodA patient2
sample12     treat    bloodA patient3

If you want to examine the expression differences between condition, and you know that major sources of variation include bloodtype and patient, then your design formula would be:

design = ~ bloodtype + patient + condition

The tilde (~) should always precede your factors and tells DESeq2 to model the counts using the following formula. Note the factors included in the design formula need to match the column names in the metadata.

In this tutorial we show a general and flexible way to define contrasts, and is often useful for more complex contrasts or when the design of the experiment is imbalanced (e.g. different number of replicates in each group). Although we focus on DESeq2, the approach can also be used with the other popular package edgeR.

Each section below covers a particular experimental design, from simpler to more complex ones. The first chunk of code in each section is to simulate data, which has no particular meaning and is only done in order to have a DESeqDataSet object with the right kind of variables for each example. In practice, users can ignore this step as they should have created a DESeqDataSet object from their own data following the instructions in the vignette.

One factor, two levels

Code
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 2)
dds$condition <- factor(rep(c("control", "treat"), each = 3))

First we can look at our sample information:

Code
colData(dds)
DataFrame with 6 rows and 1 column
        condition
         <factor>
sample1   control
sample2   control
sample3   control
sample4   treat  
sample5   treat  
sample6   treat  

Our factor of interest is condition and so we define our design and run the DESeq model fitting routine:

Code
design(dds) <- ~ 1 + condition # or just `~ condition`
dds <- DESeq(dds) # equivalent to edgeR::glmFit()

Then check what coefficients DESeq estimated:

Code
resultsNames(dds)
[1] "Intercept"                  "condition_treat_vs_control"

We can see that we have a coefficient for our intercept and coefficient for the effect of treat (i.e. differences between treat versus control).

Using the more standard syntax, we can obtain the results for the effect of treat as such:

Code
res1 <- results(dds, contrast = list("condition_treat_vs_control"))
res1
log2 fold change (MLE): condition_treat_vs_control effect 
Wald test p-value: condition_treat_vs_control effect 
DataFrame with 1000 rows and 6 columns
          baseMean log2FoldChange     lfcSE       stat      pvalue        padj
         <numeric>      <numeric> <numeric>  <numeric>   <numeric>   <numeric>
gene1     67.06433    -0.00689895  0.507917 -0.0135828 0.989162808  0.99609846
gene2     32.46454     0.88426700  0.803057  1.1011264 0.270841627  0.42995565
gene3     41.70107    -2.46894026  0.710799 -3.4734721 0.000513771  0.00243119
gene4     48.52745    -2.35826158  0.672725 -3.5055340 0.000455692  0.00222010
gene5      6.70551    -3.80902735  1.360556 -2.7996107 0.005116428  0.01794378
...            ...            ...       ...        ...         ...         ...
gene996   10.50074       1.079985  0.961352   1.123403 2.61266e-01 4.17435e-01
gene997   59.50013      -0.319877  0.616176  -0.519132 6.03668e-01 7.41125e-01
gene998  160.34766       4.105860  0.619911   6.623306 3.51252e-11 8.90738e-10
gene999   39.29958       1.066834  0.615995   1.731886 8.32938e-02 1.73062e-01
gene1000   6.07362      -2.555036  1.291543  -1.978282 4.78970e-02 1.09907e-01

The above is a simple way to obtain the results of interest. But it is worth understanding how DESeq is getting to these results by looking at the model’s matrix. DESeq defines the model matrix using base R functionality:

Code
model.matrix(design(dds), colData(dds))
        (Intercept) conditiontreat
sample1           1              0
sample2           1              0
sample3           1              0
sample4           1              1
sample5           1              1
sample6           1              1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"

We can see that R coded condition as a dummy variable, with an intercept (common to all samples) and a “conditiontreat” variable, which adds the effect of treat to samples 4-6.

We can actually set our contrasts in DESeq2::results() using a numeric vector. The way it works is to define a vector of “weights” for the coefficient(s) we want to test for. In this case, we have (Intercept) and conditiontreat as our coefficients (see model matrix above), and we want to test for the effect of treat, so our contrast vector would be c(0, 1). In other words, we don’t care about the value of (Intercept) (so it has a weight of 0), and we’re only interested in the effect of treat (so we give it a weight of 1).

In this case the design is very simple, so we could define our contrast vector “manually”. But for complex designs this can get more difficult to do, so it’s worth mentioning the general way in which we can define this. For any contrast of interest, we can follow three steps:

  • Get the model matrix
  • Subset the matrix for each group of interest and calculate its column means - this results in a vector of coefficients for each group
  • Subtract the group vectors from each other according to the comparison we’re interested in

Let’s see this example in action:

Code
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
        (Intercept) conditiontreat
sample1           1              0
sample2           1              0
sample3           1              0
sample4           1              1
sample5           1              1
sample6           1              1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"
Code
# calculate the vector of coefficient weights in the treat
treat <- colMeans(mod_mat[dds$condition == "treat", ])
treat
   (Intercept) conditiontreat 
             1              1 
Code
# calculate the vector of coefficient weights in the control
control <- colMeans(mod_mat[dds$condition == "control", ])
control
   (Intercept) conditiontreat 
             1              0 
Code
# The contrast we are interested in is the difference between treat and control
treat - control
   (Intercept) conditiontreat 
             0              1 

That last step is where we define our contrast vector, and we can give this directly to the results function:

Code
# get the results for this contrast
res2 <- results(dds, contrast = treat - control)

This gives us exactly the same results as before, which we can check for example by plotting the log-fold-changes between the first and second approach:

Code
plot(res1$log2FoldChange, res2$log2FoldChange)

Recoding the design

Often, we can use different model matrices that essentially correspond to the same design. For example, we could recode our design above by removing the intercept:

Code
design(dds) <- ~ 0 + condition
dds <- DESeq(dds)
resultsNames(dds)
[1] "conditioncontrol" "conditiontreat"  

In this case we get a coefficient corresponding to the average expression in control and the average expression in the treat (rather than the difference between treat and control).

If we use the same contrast trick as before (using the model matrix), we can see the result is the same:

Code
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
        conditioncontrol conditiontreat
sample1                1              0
sample2                1              0
sample3                1              0
sample4                0              1
sample5                0              1
sample6                0              1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"
Code
# calculate weights for coefficients in each condition
treat <- colMeans(mod_mat[which(dds$condition == "treat"), ])
control <- colMeans(mod_mat[which(dds$condition == "control"), ])
# get the results for our contrast
res3 <- results(dds, contrast = treat - control)

Again, the results are essentially the same:

Code
plot(res1$log2FoldChange, res3$log2FoldChange)

In theory there’s no difference between these two ways of defining our design. The design with an intercept is more common, but for the purposes of understanding what’s going on, it’s sometimes easier to look at models without intercept.

One factor, three levels

Code
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 9, betaSD = 2)
dds$condition <- NULL
dds$bloodtype <- factor(rep(c("bloodA", "bloodB", "bloodO"), each = 3))
dds$bloodtype <- relevel(dds$bloodtype, "bloodO")

First we can look at our sample information:

Code
colData(dds)
DataFrame with 9 rows and 1 column
        bloodtype
         <factor>
sample1    bloodA
sample2    bloodA
sample3    bloodA
sample4    bloodB
sample5    bloodB
sample6    bloodB
sample7    bloodO
sample8    bloodO
sample9    bloodO

As in the previous example, we only have one factor of interest, bloodtype, and so we define our design and run the DESeq as before:

Code
design(dds) <- ~ 1 + bloodtype
dds <- DESeq(dds)
# check the coefficients estimated by DEseq
resultsNames(dds)
[1] "Intercept"                  "bloodtype_bloodA_vs_bloodO"
[3] "bloodtype_bloodB_vs_bloodO"

We see that now we have 3 coefficients:

  • “Intercept” corresponds to bloodO bloodtype (our reference level)
  • “bloodtype_bloodA_vs_bloodO” corresponds to the difference between the reference level and bloodA
  • “bloodtype_bloodB_vs_bloodO” corresponds to the difference between the reference level and bloodB

We could obtain the difference between bloodO and any of the two bloodtypes easily:

Code
res1_bloodA_bloodO <- results(dds, contrast = list("bloodtype_bloodA_vs_bloodO"))
res1_bloodB_bloodO <- results(dds, contrast = list("bloodtype_bloodB_vs_bloodO"))

For comparing bloodA vs bloodB, however, we need to compare two coefficients with each other to check whether they are themselves different (check the slide to see the illustration). This is how the standard DESeq syntax would be:

Code
res1_bloodA_bloodB <- results(dds, contrast = list("bloodtype_bloodA_vs_bloodO", 
                                                 "bloodtype_bloodB_vs_bloodO"))

However, following our three steps detailed in the first section, we can define our comparisons from the design matrix:

Code
# define the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
        (Intercept) bloodtypebloodA bloodtypebloodB
sample1           1               1               0
sample2           1               1               0
sample3           1               1               0
sample4           1               0               1
sample5           1               0               1
sample6           1               0               1
sample7           1               0               0
sample8           1               0               0
sample9           1               0               0
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$bloodtype
[1] "contr.treatment"
Code
# calculate coefficient vectors for each group
bloodA <- colMeans(mod_mat[dds$bloodtype == "bloodA", ])
bloodB <- colMeans(mod_mat[dds$bloodtype == "bloodB", ])
bloodO <- colMeans(mod_mat[dds$bloodtype == "bloodO", ])

And we can now define any contrasts we want:

Code
# obtain results for each pairwise contrast
res2_bloodA_bloodO <- results(dds, contrast = bloodA - bloodO)
res2_bloodB_bloodO <- results(dds, contrast = bloodB - bloodO)
res2_bloodA_bloodB <- results(dds, contrast = bloodA - bloodB)
# plot the results from the two approaches to check that they are identical
plot(res1_bloodA_bloodO$log2FoldChange, res2_bloodA_bloodO$log2FoldChange)

Code
plot(res1_bloodB_bloodO$log2FoldChange, res2_bloodB_bloodO$log2FoldChange)

Code
plot(res1_bloodA_bloodB$log2FoldChange, res2_bloodA_bloodB$log2FoldChange)

A and B against O

With this approach, we could even define a more unusual contrast, for example to find genes that differ between A and B against and O samples:

Code
# define vector of coefficients for A_B samples
A_B <- colMeans(mod_mat[dds$bloodtype %in% c("bloodA", "bloodB"),])
# Our contrast of interest is
A_B - bloodO
    (Intercept) bloodtypebloodA bloodtypebloodB 
            0.0             0.5             0.5 

Notice the contrast vector in this case assigns a “weight” of 0.5 to each of bloodtypebloodA and bloodtypebloodB. This is equivalent to saying that we want to consider the average of bloodA and bloodB expression. In fact, we could have also defined our contrast vector like this:

Code
# average of bloodA and bloodB minus bloodO
(bloodA + bloodB)/2 - bloodO
    (Intercept) bloodtypebloodA bloodtypebloodB 
            0.0             0.5             0.5 

To obtain our results, we use the results() function as before:

Code
# get the results between A_B and bloodA
res2_AB <- results(dds, contrast = A_B - bloodO)

Extra: why not define a new group in our design matrix?

For this last example (A_B vs bloodO), we may have considered creating a new variable in our column data:

Code
dds$A_B <- factor(dds$bloodtype %in% c("bloodA", "bloodB"))
colData(dds)
DataFrame with 9 rows and 3 columns
        bloodtype sizeFactor      A_B
         <factor>  <numeric> <factor>
sample1    bloodA   0.933042    TRUE 
sample2    bloodA   0.945626    TRUE 
sample3    bloodA   1.003623    TRUE 
sample4    bloodB   1.015760    TRUE 
sample5    bloodB   0.956143    TRUE 
sample6    bloodB   1.116913    TRUE 
sample7    bloodO   1.082968    FALSE
sample8    bloodO   1.144347    FALSE
sample9    bloodO   1.118454    FALSE

and then re-run DESeq with a new design:

Code
design(dds) <- ~ 1 + A_B
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"         "A_B_TRUE_vs_FALSE"
Code
res1_A_B <- results(dds, contrast = list("A_B_TRUE_vs_FALSE"))

However, in this model the gene dispersion is estimated together for bloodA and bloodB samples as if they were replicates of each other, which may result in inflated/deflated estimates. Instead, our approach above estimates the error within each of those groups.

To check the difference one could compare the two approaches visually:

Code
# compare the log-fold-changes between the two approaches
plot(res1_A_B$log2FoldChange, res2_AB$log2FoldChange)
abline(0, 1, col = "brown", lwd = 2)

Code
# compare the errors between the two approaches
plot(res1_A_B$lfcSE, res2_AB$lfcSE)
abline(0, 1, col = "brown", lwd = 2)

Two factors with interaction

Code
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)
dds$bloodtype <- factor(rep(c("bloodO", "bloodA"), each = 6))
dds$bloodtype <- relevel(dds$bloodtype, "bloodO")
dds$condition <- factor(rep(c("treat", "control"), 6))
dds <- dds[, order(dds$bloodtype, dds$condition)]
colnames(dds) <- paste0("sample", 1:ncol(dds))

First let’s look at our sample information:

Code
colData(dds)
DataFrame with 12 rows and 2 columns
         condition bloodtype
          <factor>  <factor>
sample1    control    bloodO
sample2    control    bloodO
sample3    control    bloodO
sample4    treat      bloodO
sample5    treat      bloodO
...            ...       ...
sample8    control    bloodA
sample9    control    bloodA
sample10   treat      bloodA
sample11   treat      bloodA
sample12   treat      bloodA

This time we have two factors of interest, and we want to model both with an interaction (i.e. we assume that bloodA and bloodO samples may respond differently to treat/control). We define our design accordingly and fit the model:

Code
design(dds) <- ~ 1 + bloodtype + condition + bloodtype:condition
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"                      "bloodtype_bloodA_vs_bloodO"    
[3] "condition_treat_vs_control"     "bloodtypebloodA.conditiontreat"

Because we have two factors and an interaction, the number of comparisons we can do is larger. Using our three-step approach from the model matrix, we do things exactly as we’ve been doing so far:

Code
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
# Define coefficient vectors for each condition
bloodO_control <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "control", ])
bloodO_treat <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "treat", ])
bloodA_control <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "control", ])
bloodA_treat <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "treat", ])

We are now ready to define any contrast of interest from these vectors (for completeness we show the equivalent syntax using the coefficient’s names from DESeq).

bloodA vs bloodO (in the control):

Code
res1 <- results(dds, contrast = bloodA_control - bloodO_control)
# or equivalently
res2 <- results(dds, contrast = list("bloodtype_bloodA_vs_bloodO"))

bloodA vs bloodO (in the treatment):

Code
res1 <- results(dds, contrast = bloodO_treat - bloodA_treat)
# or equivalently
res2 <- results(dds, contrast = list(c("bloodtype_bloodA_vs_bloodO",
                                       "bloodtypebloodA.conditiontreat")))

treat vs control (for bloodtypes O):

Code
res1 <- results(dds, contrast = bloodO_treat - bloodO_control)
# or equivalently
res2 <- results(dds, contrast = list(c("condition_treat_vs_control")))

treat vs control (for bloodtypes A):

Code
res1 <- results(dds, contrast = bloodA_treat - bloodA_control)
# or equivalently
res2 <- results(dds, contrast = list(c("condition_treat_vs_control", 
                                       "bloodtypebloodA.conditiontreat")))

Interaction between bloodtype and condition

I.e. do bloodAs and bloodOs respond differently to the treatment?

Code
res1 <- results(dds, 
                contrast = (bloodA_treat - bloodA_control) - (bloodO_treat - bloodO_control))
# or equivalently
res2 <- results(dds, contrast = list("bloodtypebloodA.conditiontreat"))

In conclusion, although we can define these contrasts using DESeq coefficient names, it is somewhat more explicit (and perhaps intuitive?) what it is we’re comparing using matrix-based contrasts.

Three factors, with nesting

Code
# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 24, betaSD = 2)
dds$bloodtype <- factor(rep(c("bloodA", "bloodO"), each = 12))
dds$bloodtype <- relevel(dds$bloodtype, "bloodO")
dds$patient <- factor(rep(LETTERS[1:4], each = 6))
dds$condition <- factor(rep(c("treat", "control"), 12))
dds <- dds[, order(dds$bloodtype, dds$patient, dds$condition)]
colnames(dds) <- paste0("sample", 1:ncol(dds))

First let’s look at our sample information:

Code
colData(dds)
DataFrame with 24 rows and 3 columns
         condition bloodtype  patient
          <factor>  <factor> <factor>
sample1    control    bloodO        C
sample2    control    bloodO        C
sample3    control    bloodO        C
sample4    treat      bloodO        C
sample5    treat      bloodO        C
...            ...       ...      ...
sample20   control    bloodA        B
sample21   control    bloodA        B
sample22   treat      bloodA        B
sample23   treat      bloodA        B
sample24   treat      bloodA        B

Now we have three factors, but patient is nested within bloodtype (i.e. a patient is either bloodA or bloodO, it cannot be both). Therefore, bloodtype is a linear combination with patient (or, another way to think about it is that bloodtype is redundant with patient). Because of this, we will define our design without including “bloodtype”, although later we can compare groups of patient of the same bloodtype with each other.

Code
design(dds) <- ~ 1 + patient + condition + patient:condition
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"                  "patient_B_vs_A"            
[3] "patient_C_vs_A"             "patient_D_vs_A"            
[5] "condition_treat_vs_control" "patientB.conditiontreat"   
[7] "patientC.conditiontreat"    "patientD.conditiontreat"   

Now it’s harder to define contrasts between groups of patient of the same bloodtype using DESeq’s coefficient names (although still possible). But using the model matrix approach, we do it in exactly the same way we have done so far!

Again, let’s define our groups from the model matrix:

Code
# get the model matrix
mod_mat <- model.matrix(design(dds), colData(dds))
# define coefficient vectors for each group
bloodO_control <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "control", ])
bloodA_control <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "control", ])
bloodO_treat <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "treat", ])
bloodA_treat <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "treat", ])

It’s worth looking at some of these vectors, to see that they are composed of weighted coefficients from different patient. For example, for “bloodO” patient, we have equal contribution from “patientC” and “patientD”:

Code
bloodO_control
            (Intercept)                patientB                patientC 
                    1.0                     0.0                     0.5 
               patientD          conditiontreat patientB:conditiontreat 
                    0.5                     0.0                     0.0 
patientC:conditiontreat patientD:conditiontreat 
                    0.0                     0.0 

And so, when we define our contrasts, each patient will be correctly weighted:

Code
bloodO_treat - bloodO_control
            (Intercept)                patientB                patientC 
                    0.0                     0.0                     0.0 
               patientD          conditiontreat patientB:conditiontreat 
                    0.0                     1.0                     0.0 
patientC:conditiontreat patientD:conditiontreat 
                    0.5                     0.5 

We can set our contrasts in exactly the same way as we did in the previous section (for completeness, we also give the contrasts using DESeq’s named coefficients).

bloodA vs bloodO (in the control):

Code
res1_bloodA_bloodO_control <- results(dds, contrast = bloodA_control - bloodO_control)
# or equivalently
res2_bloodA_bloodO_control <- results(dds, 
                                 contrast = list(c("patient_B_vs_A"), # Blood type A
                                                 c("patient_C_vs_A", # Blood type O
                                                   "patient_D_vs_A"))) # Blood type O

bloodA vs bloodO (in the treat):

Code
res1_bloodO_bloodA_treat <- results(dds, contrast = bloodO_treat - bloodA_treat)
# or equivalently
res2_bloodO_bloodA_treat <- results(dds, 
                           contrast = list(c("patient_B_vs_A", # Blood type A
                                             "patientB.conditiontreat"), # Interaction of patient B with treatment
                                           c("patient_C_vs_A", # Blood type O
                                             "patient_D_vs_A", # Blood type O
                                             "patientC.conditiontreat", # Interaction of patient C with treatment
                                             "patientD.conditiontreat"))) # Interaction of patient B with treatment

And so on, for other contrasts of interest…


Extra: imbalanced design

Let’s take our previous example, but drop one of the samples from the data, so that we only have 2 replicates for it.

Code
dds <- dds[, -1] # drop one of the patient C samples
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"                  "patient_B_vs_A"            
[3] "patient_C_vs_A"             "patient_D_vs_A"            
[5] "condition_treat_vs_control" "patientB.conditiontreat"   
[7] "patientC.conditiontreat"    "patientD.conditiontreat"   

Define our model matrix and coefficient vectors:

Code
mod_mat <- model.matrix(design(dds), colData(dds))
mod_mat
         (Intercept) patientB patientC patientD conditiontreat
sample2            1        0        1        0              0
sample3            1        0        1        0              0
sample4            1        0        1        0              1
sample5            1        0        1        0              1
sample6            1        0        1        0              1
sample7            1        0        0        1              0
sample8            1        0        0        1              0
sample9            1        0        0        1              0
sample10           1        0        0        1              1
sample11           1        0        0        1              1
sample12           1        0        0        1              1
sample13           1        0        0        0              0
sample14           1        0        0        0              0
sample15           1        0        0        0              0
sample16           1        0        0        0              1
sample17           1        0        0        0              1
sample18           1        0        0        0              1
sample19           1        1        0        0              0
sample20           1        1        0        0              0
sample21           1        1        0        0              0
sample22           1        1        0        0              1
sample23           1        1        0        0              1
sample24           1        1        0        0              1
         patientB:conditiontreat patientC:conditiontreat
sample2                        0                       0
sample3                        0                       0
sample4                        0                       1
sample5                        0                       1
sample6                        0                       1
sample7                        0                       0
sample8                        0                       0
sample9                        0                       0
sample10                       0                       0
sample11                       0                       0
sample12                       0                       0
sample13                       0                       0
sample14                       0                       0
sample15                       0                       0
sample16                       0                       0
sample17                       0                       0
sample18                       0                       0
sample19                       0                       0
sample20                       0                       0
sample21                       0                       0
sample22                       1                       0
sample23                       1                       0
sample24                       1                       0
         patientD:conditiontreat
sample2                        0
sample3                        0
sample4                        0
sample5                        0
sample6                        0
sample7                        0
sample8                        0
sample9                        0
sample10                       1
sample11                       1
sample12                       1
sample13                       0
sample14                       0
sample15                       0
sample16                       0
sample17                       0
sample18                       0
sample19                       0
sample20                       0
sample21                       0
sample22                       0
sample23                       0
sample24                       0
attr(,"assign")
[1] 0 1 1 1 2 3 3 3
attr(,"contrasts")
attr(,"contrasts")$patient
[1] "contr.treatment"

attr(,"contrasts")$condition
[1] "contr.treatment"
Code
# define coefficient vectors for each group
bloodO_control <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "control", ])
bloodA_control <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "control", ])
bloodO_treat <- colMeans(mod_mat[dds$bloodtype == "bloodO" & dds$condition == "treat", ])
bloodA_treat <- colMeans(mod_mat[dds$bloodtype == "bloodA" & dds$condition == "treat", ])

Now let’s check what happens to the bloodO_control group:

Code
bloodO_control
            (Intercept)                patientB                patientC 
                    1.0                     0.0                     0.4 
               patientD          conditiontreat patientB:conditiontreat 
                    0.6                     0.0                     0.0 
patientC:conditiontreat patientD:conditiontreat 
                    0.0                     0.0 

Notice that whereas before “patientC” and “patientD” had each a weight of 0.5, now they have different weights. That’s because for patientC there’s only 2 replicates. So, we have a total of 5 bloodtype O individuals in the control (2 from patient C and 3 from D). Therefore, when we calculate the average coefficients for bloodOs, we need to do it as 0.4 x patientC + 0.6 x patientD.

The nice thing about this approach is that we do not need to worry about any of this, the weights come from our colMeans() call automatically. And now, any contrasts that we make will take these weights into account:

Code
# bloodA vs bloodO (in the control)
bloodA_control - bloodO_control
            (Intercept)                patientB                patientC 
                    0.0                     0.5                    -0.4 
               patientD          conditiontreat patientB:conditiontreat 
                   -0.6                     0.0                     0.0 
patientC:conditiontreat patientD:conditiontreat 
                    0.0                     0.0 
Code
# interaction
(bloodA_treat - bloodA_control) - (bloodO_treat - bloodO_control)
            (Intercept)                patientB                patientC 
                    0.0                     0.0                    -0.1 
               patientD          conditiontreat patientB:conditiontreat 
                    0.1                     0.0                     0.5 
patientC:conditiontreat patientD:conditiontreat 
                   -0.5                    -0.5