Modelling experimental data
A completely randomised design (CRD) as a linear model
As we saw in the previous module, we can write a linear model with a single explanatory variable as
\[Y_i = \alpha + \beta_1x_i + \epsilon_i\]
When dealing with factor variables we use dummy variables and can write the above as
\[Y_{ik} = \alpha + \tau_k + \epsilon_{ik}\] where \(\tau_k\) is called an effect and represents the difference between the overall average, \(\alpha\), and the average at the \(k^{th}\) treatment level. The errors \(\epsilon_{ik}\) are again assumed to be normally distributed and, due to the randomisation, independent (i.e., \(\epsilon_{ik} \sim N(0, \sigma^2)\)).
Or you might think of the model as
\[Y_{ik} = \mu_k + \epsilon_{ik}\]
where \(Y_{ik}\) is the response (i.e., observed coffee opacity) for the \(i^{th}\) experimental unit (i.e., coffee cup) subjected to the \(k^{th}\) level of the treatment factor (i.e., coffee type). Here the \(\mu_k\) are the different (cell) means for each level of the treatment factor. See below for an illustration of this for three treatment factor levels (as in the coffee example above).
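As a quick aside, here is a toy sketch (using a hypothetical three-level factor trt, not the data above) of how R builds the dummy variables behind these two parametrisations; note that R’s default treatment contrasts set the first factor level as the baseline rather than the overall average.
## A hypothetical three-level factor with two observations per level
trt <- factor(rep(c("A", "B", "C"), each = 2))
## Baseline (dummy-variable) parametrisation: intercept plus dummies for B and C
model.matrix(~ trt)
## Cell-means parametrisation: one indicator column per level (the mu_k form)
model.matrix(~ 0 + trt)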
Analysis of a CRD in R
In this section we will again consider the rats data containing logAUC values for 12 rats subjected to three different treatments (Surgery): C, P, and S:
## Rows: 12
## Columns: 3
## $ Surgery <fct> C, C, C, C, P, P, P, P, S, S, S, S
## $ Rat <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
## $ logAUC <dbl> 8.49, 8.20, 9.08, 8.07, 10.24, 7.72, 9.34, 8.50, 11.31, 12.69,…
We could analyse these data using aov():
rats_aov <- aov(logAUC ~ Surgery, data = rats)
summary(rats_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 22.026 11.013 16.36 0.00101 **
## Residuals 9 6.059 0.673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis: We test the null hypothesis, \(H_0\), that the population (Surgery) means are the same on average versus the alternative hypothesis, \(H_1\), that at least one differs from the others!
The p-value, Pr(>F) = 0.001, is the probability of getting an F-statistic at least as extreme as the one we observe (think of the area under the tails of the curve below); it tells us we have sufficient evidence to reject \(H_0\) at the 1% level of significance.
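As a quick check, this tail area can be recovered directly from the F distribution with 2 and 9 degrees of freedom:
## Upper-tail area of the F(2, 9) distribution beyond the observed F-statistic
pf(16.359, df1 = 2, df2 = 9, lower.tail = FALSE)  # ~0.001, matching Pr(>F) above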
Alternatively, we could use lm():
rats_lm <- lm(logAUC ~ Surgery, data = rats)
summary(rats_lm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4600 0.4102531 20.6214144 6.930903e-09
## SurgeryP 0.4900 0.5801856 0.8445574 4.202408e-01
## SurgeryS 3.0875 0.5801856 5.3215734 4.799872e-04
So, what does this tell us about which pairs of means are different?
To carry out pair-wise comparisons of means we can use two-sample t-tests, calculating our observed t-value where \(\text{t-value} = \frac{\text{Sample Difference}_{ij} - \text{Difference assuming } H_0 \text{ is true}_{ij}}{\text{SE of } \text{Sample Difference}_{ij}}\). Here, \(\text{Sample Difference}_{ij}\) is the difference between a pair of sample means. We can then compute the p-value for the observed t-value.
The output above has already done this for us:
(Intercept) = \(\text{mean}_C\) = 8.46
SE of (Intercept) = SE of \(\text{mean}_C\) = SEM = 0.4102531
\(\text{Surgery}_P\) = \(\text{mean}_P\) – \(\text{mean}_C\) = 0.49
SE of \(\text{Surgery}_P\) = SE of (\(\text{mean}_P\) - \(\text{mean}_C\) ) = SED = 0.5801856
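To see where these numbers come from, here is a minimal sketch assuming equal replication (n = 4 rats per treatment) and the residual mean square from the ANOVA table above:
## Residual mean square (MSE) from the ANOVA table above
mse <- 6.0591 / 9           # ~0.673
n <- 4                      # rats per treatment
sqrt(mse / n)               # SEM: SE of a single treatment mean, ~0.410
sqrt(mse * (1/n + 1/n))     # SED: SE of a difference between two means, ~0.580
## Observed t-values as reported by summary(rats_lm)$coef
0.4900 / sqrt(mse * 2 / n)  # SurgeryP, ~0.84
3.0875 / sqrt(mse * 2 / n)  # SurgeryS, ~5.32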
Hypotheses being tested
The t value and Pr(>|t|) are the t-statistic and p-value for testing the null hypotheses listed below:
1. The mean logAUC is zero for the C population
2. No difference between the population means of P and C
3. No difference between the population means of S and C
We’re interested in 2 and 3, but not necessarily 1!
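Note that neither output above directly compares P with S. One common follow-up, sketched here, is Tukey’s honest significant difference on the aov() fit, which tests every pairwise difference between the Surgery means while controlling the family-wise error rate:
## All pairwise comparisons of Surgery means, adjusted for multiple testing
TukeyHSD(rats_aov, "Surgery")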
F-test:
anova(rats_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 22.0263 11.0132 16.359 0.001006 **
## Residuals 9 6.0591 0.6732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The same as aov(); in fact, aov() calls lm() in the background.
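You can see this relationship directly: an aov object also inherits from class lm.
class(rats_aov)
## [1] "aov" "lm"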
A Factorial experiment (as a CRD)
Equal replications (balanced design)
The factorial data contain observations of global metabolic profiling and comparison of relative abundances of proteins (logAUC) in the inner and outer left ventricle wall (innerLV and outerLV) of diabetic and healthy male Wistar rats:
Organ | Diabetic | Healthy |
---|---|---|
innerLV | n = 3 | n = 3 |
outerLV | n = 3 | n = 3 |
Fitting models with interactions using lm()
## change to factors (saves errors later on)
factorial$Disease <- as.factor(factorial$Disease)
factorial$Organ <- as.factor(factorial$Organ)
fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
summary(fac_lm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.873333 0.5405835 14.564508 4.841215e-07
## DiseaseHealthy 1.950000 0.7645006 2.550685 3.413826e-02
## OrganouterLV 2.116667 0.7645006 2.768692 2.434579e-02
## DiseaseHealthy:OrganouterLV -1.840000 1.0811671 -1.701865 1.271934e-01
So, the full model is
\[ \begin{aligned} \operatorname{\widehat{logAUC}} &= 7.87 + 1.95(\operatorname{Disease}_{\operatorname{Healthy}}) + 2.12(\operatorname{Organ}_{\operatorname{outerLV}}) - 1.84(\operatorname{Disease}_{\operatorname{Healthy}} \times \operatorname{Organ}_{\operatorname{outerLV}}) \end{aligned} \]
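To see what these coefficients represent, the sketch below adds them up to recover the four (cell) means; for example, the fitted mean for a healthy outerLV sample is \(7.87 + 1.95 + 2.12 - 1.84 \approx 10.10\).
## Recover the four cell means from the fitted coefficients
b <- coef(fac_lm)
b["(Intercept)"]                       # Diabetic, innerLV (baseline), ~7.87
b["(Intercept)"] + b["OrganouterLV"]   # Diabetic, outerLV, ~9.99
b["(Intercept)"] + b["DiseaseHealthy"] # Healthy, innerLV, ~9.82
sum(b)                                 # Healthy, outerLV (all four terms), ~10.10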
The three global null hypotheses being tested are
- \(H_0: \mu_{\text{Diabetic}} = \mu_{\text{Healthy}}\)
- \(H_0: \mu_{\text{innerLV}} = \mu_{\text{outerLV}}\)
- \(H_0: \mu_{\text{Diabetic,innerLV}} = \mu_{\text{Diabetic,outerLV}} = \mu_{\text{Healthy,innerLV}} = \mu_{\text{Healthy,outerLV}}\)
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease:Organ 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TASK: Using the outputs above, state your inference regarding the listed hypotheses.
Note that with a balanced design the ordering of terms doesn’t matter. For example,
fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease:Organ 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = factorial)
anova(fac_lm_2)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ:Disease 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unequal replications (unbalanced design)
Here, we will consider a subset of the data above (i.e., an artificially unbalanced study).
unbalanced <- factorial[- c(1:3,10), ]
glimpse(unbalanced)
## Rows: 8
## Columns: 5
## $ Disease <fct> Healthy, Healthy, Healthy, Diabetic, Diabetic, Diabetic, Diabe…
## $ Organ <fct> outerLV, innerLV, outerLV, innerLV, outerLV, innerLV, innerLV,…
## $ Animal <dbl> 4, 5, 6, 7, 8, 9, 11, 12
## $ Sample <dbl> 2, 1, 2, 1, 2, 1, 1, 2
## $ logAUC <dbl> 10.49, 9.74, 10.98, 7.92, 9.37, 8.69, 7.01, 9.29
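A quick tabulation shows the unequal replication across the treatment combinations:
## Replication per treatment combination is now unequal:
## 3 Diabetic innerLV, 2 Diabetic outerLV, 1 Healthy innerLV, 2 Healthy outerLV
with(unbalanced, table(Disease, Organ))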
Fitting models with interactions using lm()
Note that with an unbalanced design the ordering of the terms DOES matter. For example,
fac_lm <- lm(logAUC ~ Disease*Organ, data = unbalanced)
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 7.1102 7.1102 18.4955 0.01264 *
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## Disease:Organ 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = unbalanced)
anova(fac_lm_2)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 5.7291 5.7291 14.9029 0.01814 *
## Disease 1 4.4960 4.4960 11.6953 0.02678 *
## Organ:Disease 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The three global null hypotheses being tested are (essentially) the same
- \(H_0: \mu_{\text{Diabetic}} = \mu_{\text{Healthy}}\)
- \(H_0: \mu_{\text{innerLV}} = \mu_{\text{outerLV}}\)
- \(H_0: \mu_{\text{Diabetic,innerLV}} = \mu_{\text{Diabetic,outerLV}} = \mu_{\text{Healthy,innerLV}} = \mu_{\text{Healthy,outerLV}}\)
However, now the order of the terms affects the estimation. Look carefully at the anova() outputs above; what do you notice about the sums of squares (Sum Sq) values?
For a balanced two-factor experiment (e.g., in the previous section) partitioning of the sums of squares (\(SS\)) is additive:
\[SS_\text{Treatment} = SS_\text{Disease} + SS_\text{Organ} + SS_\text{Disease:Organ}.\]
The order in which the main effects are added to the model does not matter. However, for an unbalanced design this is not the case as the sums of squares are calculated sequentially. Note that sequentially calculated \(SS\) are known as Type I \(SS\).
Sums of squares (\(SS\))
Sequential (Type I \(SS\))
As a term enters the model its \(SS\) is calculated, which is then subtracted from the total \(SS\). This then reduces the available \(SS\) for the next term entering the model. When treatment combinations in a factorial experiment are unequally replicated, their effects are not mutually independent, so that the order in which terms enter the model matters.
Considering the data above, if we include Disease as a main effect first (i.e., Disease*Organ) then \(SS_\text{Disease}\) is calculated first, ignoring the Organ main effect. Here, some Organ information is confounded with Disease information (i.e., variation due to Organ is confounded with variation due to Disease). Then \(SS_\text{Organ}\) is calculated having been adjusted for the Disease main effect. This now contains only Organ information (i.e., variation due to the Organ effect) since all the Disease information was eliminated in the previous step. Finally, \(SS_\text{Disease:Organ}\) is calculated adjusted for both \(SS_\text{Disease}\) and \(SS_\text{Organ}\). Here, there is no information left relating to the Disease or Organ main effects.
What does this look like?
For Disease*Organ we calculate \(SS_\text{Disease}\) ignoring any Organ effect (if you are unsure what each line of code is doing I suggest you run it line by line to see what’s being done at each step).
unbalanced %>%
mutate(grand_mean = mean(logAUC)) %>%
group_by(Disease) %>%
summarise(n = n(),
treatment_mean = mean(logAUC),
grand_mean = mean(grand_mean)) %>%
mutate(ss_disease = n * (treatment_mean - grand_mean)^2) %>%
pull(ss_disease) %>%
sum()
## [1] 7.110201
This matches, as we’d expect, the \(SS_\text{Disease}\) calculated from the Disease*Organ model above.
But what about the sequential \(SS\)? A simple way to think about this is in terms of sequential models:
## Null model.
fit.null <- lm(logAUC ~ 1, data = unbalanced)
## Only Factor Disease.
fit.a <- lm(logAUC ~ Disease, data = unbalanced)
## Factors Disease and Organ.
fit.ab <- lm(logAUC ~ Disease + Organ, data = unbalanced)
## Factors Disease and Organ with interaction.
fit.abi <- lm(logAUC ~ Disease*Organ, data = unbalanced)
## ANOVA table, as above
anova(fit.abi)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 7.1102 7.1102 18.4955 0.01264 *
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## Disease:Organ 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## First line.
sum(residuals(fit.null)^2) - sum(residuals(fit.a)^2)
## [1] 7.110201
## Second line.
sum(residuals(fit.a)^2) - sum(residuals(fit.ab)^2)
## [1] 3.114926
## Third line.
sum(residuals(fit.ab)^2) - sum(residuals(fit.abi)^2)
## [1] 0.09134405
## Final line.
sum(residuals(fit.abi)^2)
## [1] 1.537717
Type II \(SS\)
Rather than calculating \(SS\) sequentially we can calculate the \(SS\) for a given effect adjusting for all other effects listed in the model. This means that the \(SS_\text{A}\) and \(SS_\text{B}\) main effects will both be adjusted for each other (since neither contains the other), but will not be adjusted for \(SS_\text{A:B}\) (since it contains both A and B). \(SS_\text{A:B}\) will be adjusted for both main effects.
In our example above the \(SS_\text{Disease}\) and \(SS_\text{Organ}\) main effects will both be adjusted for each other, but will not be adjusted for \(SS_\text{Disease:Organ}\); \(SS_\text{Disease:Organ}\) will be adjusted for both main effects.
We can calculate the Type II \(SS\) table in R by either
- Extracting the main effect rows that have been adjusted for the other model terms from two Type I \(SS\) tables (each one having the terms specified in a different order):
## Type II Organ SS
anova(fac_lm)[2, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Disease SS
anova(fac_lm_2)[2, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 4.496 4.496 11.695 0.02678 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Organ:Disease/Disease:Organ
anova(fac_lm_2)[3, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ:Disease 1 0.091344 0.091344 0.2376 0.6514
or,
- Using an inbuilt R function (it does not matter in which order the model terms are specified):
car::Anova(fac_lm, type = 2)
## Anova Table (Type II tests)
##
## Response: logAUC
## Sum Sq Df F value Pr(>F)
## Disease 4.4960 1 11.6953 0.02678 *
## Organ 3.1149 1 8.1027 0.04656 *
## Disease:Organ 0.0913 1 0.2376 0.65145
## Residuals 1.5377 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
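Equivalently, mirroring the sequential-model sketch used earlier, the Type II \(SS\) for Disease is the reduction in the residual \(SS\) when Disease is added to a model that already contains Organ:
## Model without Disease, then with Disease added (no interaction in either)
fit.b  <- lm(logAUC ~ Organ, data = unbalanced)
fit.ba <- lm(logAUC ~ Organ + Disease, data = unbalanced)
sum(residuals(fit.b)^2) - sum(residuals(fit.ba)^2)  # ~4.496, the Type II SS for Disease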
Note: Type III \(SS\) and beyond are not covered in this course but I would recommend reading this recent paper for an intuitive overview.
Marginal means
Balanced design
Disease | Organ | logAUC |
---|---|---|
Healthy | innerLV | 9.40 |
Healthy | outerLV | 8.83 |
Healthy | innerLV | 10.33 |
Healthy | outerLV | 10.49 |
Healthy | innerLV | 9.74 |
Healthy | outerLV | 10.98 |
Diabetic | innerLV | 7.92 |
Diabetic | outerLV | 9.37 |
Diabetic | innerLV | 8.69 |
Diabetic | outerLV | 11.31 |
Diabetic | innerLV | 7.01 |
Diabetic | outerLV | 9.29 |
Grand mean |
---|
9.447 |
Disease | Organ | log AUC mean |
---|---|---|
Diabetic | innerLV | 7.873 |
Diabetic | outerLV | 9.990 |
Healthy | innerLV | 9.823 |
Healthy | outerLV | 10.100 |
Organ | log AUC mean |
---|---|
innerLV | 8.848 |
outerLV | 10.045 |
Disease | log AUC mean |
---|---|
Diabetic | 8.932 |
Healthy | 9.962 |
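For reference, these cell and marginal means can be reproduced directly; a sketch assuming the factorial data frame and the dplyr functions used above:
## Cell means for each Disease x Organ combination
factorial %>%
  group_by(Disease, Organ) %>%
  summarise(mean_logAUC = mean(logAUC))
## Marginal means for Organ and for Disease
factorial %>% group_by(Organ) %>% summarise(mean_logAUC = mean(logAUC))
factorial %>% group_by(Disease) %>% summarise(mean_logAUC = mean(logAUC))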
Unbalanced design (Unequal replication due to missing data)
Disease | Organ | logAUC |
---|---|---|
Healthy | outerLV | 10.49 |
Healthy | innerLV | 9.74 |
Healthy | outerLV | 10.98 |
Diabetic | innerLV | 7.92 |
Diabetic | outerLV | 9.37 |
Diabetic | innerLV | 8.69 |
Diabetic | innerLV | 7.01 |
Diabetic | outerLV | 9.29 |
Grand mean |
---|
9.186 |
Disease | Organ | log AUC mean |
---|---|---|
Diabetic | innerLV | 7.873 |
Diabetic | outerLV | 9.330 |
Healthy | innerLV | 9.740 |
Healthy | outerLV | 10.735 |
Everything is as we’d expect up until now, but what about the marginal means? Perhaps the most obvious approach would be the following (i.e., ignore the subgroups, and hence give all observations equal weight):
unbalanced %>%
dplyr::select(c(Disease, logAUC)) %>%
group_by(Disease) %>%
summarise(Mean = mean(logAUC))
## # A tibble: 2 × 2
## Disease Mean
## <fct> <dbl>
## 1 Diabetic 8.46
## 2 Healthy 10.4
However, as a result the means are biased towards groups with greater replication! To avoid this we give the subgroup (Organ) cell means equal weight (this is sometimes called the least squares mean):
unbalanced %>%
dplyr::select(c(Disease, Organ, logAUC)) %>%
group_by(Organ, Disease) %>%
mutate(n = n()) %>% ## count observations in each group
## then calculate weighted mean based on the no. observations
summarise(weighted_mean = weighted.mean(logAUC, w = 1/n)) %>%
group_by(Disease) %>% ## calculate mean of weighted means
summarise(mean = mean(weighted_mean))
## # A tibble: 2 × 2
## Disease mean
## <fct> <dbl>
## 1 Diabetic 8.60
## 2 Healthy 10.2
Now see if you can do the same across Organ to get
Organ | mean |
---|---|
innerLV | 8.807 |
outerLV | 10.032 |
What does all of this tell us? When calculating marginal means for unbalanced designs we need to be careful! Luckily there are inbuilt functions in R that do this correctly for us! See the next section for an example using the predictmeans function from the predictmeans package.