1 Read a dataset of animal coagulation time vs diet

coag diet
62 A
60 A
63 A
59 A
63 B
67 B
71 B
64 B
65 B
66 B
68 C
66 C
71 C
67 C
68 C
68 C
56 D
62 D
60 D
61 D
63 D
64 D
63 D
59 D

2 Visualize the data using scatterplot and boxplot

3 Fit a Linear Model

3.1 Fit a linear model with cell means

coag dietA dietB dietC dietD
62 1 0 0 0
60 1 0 0 0
63 1 0 0 0
59 1 0 0 0
63 0 1 0 0
67 0 1 0 0
71 0 1 0 0
64 0 1 0 0
65 0 1 0 0
66 0 1 0 0
68 0 0 1 0
66 0 0 1 0
71 0 0 1 0
67 0 0 1 0
68 0 0 1 0
68 0 0 1 0
56 0 0 0 1
62 0 0 0 1
60 0 0 0 1
61 0 0 0 1
63 0 0 0 1
64 0 0 0 1
63 0 0 0 1
59 0 0 0 1

This model is written as: \[ y_k = \mu_A x_k^A+\mu_B x_k^B+\mu_C x_k^C+\mu_D x_k^D+\epsilon_k \] where \(x_k^i=I(x_k=i)\) for \(i=A,B,C,D\).

# 
# Call:
# lm(formula = coag ~ 0 + diet)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -5.00  -1.25   0.00   1.25   5.00 
# 
# Coefficients:
#       Estimate Std. Error t value Pr(>|t|)    
# dietA   61.000      1.183    51.5   <2e-16 ***
# dietB   66.000      0.966    68.3   <2e-16 ***
# dietC   68.000      0.966    70.4   <2e-16 ***
# dietD   61.000      0.837    72.9   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.37 on 20 degrees of freedom
# Multiple R-squared:  0.999,   Adjusted R-squared:  0.999 
# F-statistic: 4.4e+03 on 4 and 20 DF,  p-value: <2e-16

3.2 Fit a linear model with centralized effects

coag X.Intercept. diet1 diet2 diet3
62 1 1 0 0
60 1 1 0 0
63 1 1 0 0
59 1 1 0 0
63 1 0 1 0
67 1 0 1 0
71 1 0 1 0
64 1 0 1 0
65 1 0 1 0
66 1 0 1 0
68 1 0 0 1
66 1 0 0 1
71 1 0 0 1
67 1 0 0 1
68 1 0 0 1
68 1 0 0 1
56 1 -1 -1 -1
62 1 -1 -1 -1
60 1 -1 -1 -1
61 1 -1 -1 -1
63 1 -1 -1 -1
64 1 -1 -1 -1
63 1 -1 -1 -1
59 1 -1 -1 -1

This model is written as: \[ y_k = \mu+\tau_A (x_k^A-x_k^D)+\tau_B (x_k^B-x_k^D)+\tau_C (x_k^C-x_k^D)+\epsilon_k \] where \[\mu=(\mu_A+\ldots+\mu_D)/4, \tau_A=\mu_A-\mu,\ldots,\tau_D=\mu_D-\mu\]

# 
# Call:
# lm(formula = coag ~ diet, contrasts = list(diet = contr.sum))
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -5.00  -1.25   0.00   1.25   5.00 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   64.000      0.498  128.54  < 2e-16 ***
# diet1         -3.000      0.974   -3.08  0.00589 ** 
# diet2          2.000      0.845    2.37  0.02819 *  
# diet3          4.000      0.845    4.73  0.00013 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.37 on 20 degrees of freedom
# Multiple R-squared:  0.671,   Adjusted R-squared:  0.621 
# F-statistic: 13.6 on 3 and 20 DF,  p-value: 4.66e-05

3.3 Fit a linear model with treatment differences

X.Intercept. diet2 diet3 diet4
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1

This model is written as: \[ y_k = \beta_0+\beta_B x_k^B+\beta_Cx_k^C+\beta_D x_k^D+\epsilon_k \] where, \[\beta_0=\mu_A, \beta_B=\mu_B-\mu_A, \beta_C=\mu_C-\mu_A, \beta_D=\mu_D-\mu_A.\] Note that \[\beta_B=\tau_B-\tau_A,\beta_C=\tau_C-\tau_A,\beta_D=\tau_D-\tau_A.\]

# 
# Call:
# lm(formula = coag ~ diet, contrasts = list(diet = contr.treatment(n = 4, 
#     base = 1)))
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -5.00  -1.25   0.00   1.25   5.00 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 6.10e+01   1.18e+00   51.55  < 2e-16 ***
# diet2       5.00e+00   1.53e+00    3.27  0.00380 ** 
# diet3       7.00e+00   1.53e+00    4.58  0.00018 ***
# diet4       2.99e-15   1.45e+00    0.00  1.00000    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.37 on 20 degrees of freedom
# Multiple R-squared:  0.671,   Adjusted R-squared:  0.621 
# F-statistic: 13.6 on 3 and 20 DF,  p-value: 4.66e-05
# 
# Call:
# lm(formula = coag ~ diet)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -5.00  -1.25   0.00   1.25   5.00 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 6.10e+01   1.18e+00   51.55  < 2e-16 ***
# dietB       5.00e+00   1.53e+00    3.27  0.00380 ** 
# dietC       7.00e+00   1.53e+00    4.58  0.00018 ***
# dietD       2.99e-15   1.45e+00    0.00  1.00000    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.37 on 20 degrees of freedom
# Multiple R-squared:  0.671,   Adjusted R-squared:  0.621 
# F-statistic: 13.6 on 3 and 20 DF,  p-value: 4.66e-05

4 Analysis of Variance

4.1 Fitted values, Residuals, and Sum Squares

This section is for illustrating prediction and residuals

#  [1] 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
# 61 61 61 61 66 66 66 66 66 66 68 68 68 68 68 68 61 61 61 61 61 61 61 61
diet y y_hat_h0 resid_h0 y_hat_h1 resid_h1 diff.residual
A 62 64 -2 61 1 -3
A 60 64 -4 61 -1 -3
A 63 64 -1 61 2 -3
A 59 64 -5 61 -2 -3
B 63 64 -1 66 -3 2
B 67 64 3 66 1 2
B 71 64 7 66 5 2
B 64 64 0 66 -2 2
B 65 64 1 66 -1 2
B 66 64 2 66 0 2
C 68 64 4 68 0 4
C 66 64 2 68 -2 4
C 71 64 7 68 3 4
C 67 64 3 68 -1 4
C 68 64 4 68 0 4
C 68 64 4 68 0 4
D 56 64 -8 61 -5 -3
D 62 64 -2 61 1 -3
D 60 64 -4 61 -1 -3
D 61 64 -3 61 0 -3
D 63 64 -1 61 2 -3
D 64 64 0 61 3 -3
D 63 64 -1 61 2 -3
D 59 64 -5 61 -2 -3

Visulize the predictive values and residuals

# [1] 340
# [1] 112
# [1] 228
# [1] 228

Plot of Sum Square Against Degree Freedom

4.2 ANOVA by Hand (Short-cut Formula)

Let \(n=n_1+\ldots,n_a\) with \(n_i\) is the number of observation in group \(i\). Formulae for computing SS: \[RSS_0 = SST = \sum_{i,j} y_{ij}^2-n\bar{y_{..}}^2 \] \[RSS_0-RSS_1 = SS_{tr} = \sum_i^a n_i\bar{y}_{i.}^2-n\bar{y_{..}}^2\] \[RSS_1 = SSE=SST-SS_{tr}=\sum_{i,j} y_{ij}^2 - \sum_i^a n_i\bar{y}_{i.}^2 = \sum_i^a SS_i\]

#  A  B  C  D 
# 61 66 68 61
# [1] 64
# diet
# A B C D 
# 4 6 6 8
# [1] 340
# diet
#     A     B     C     D 
# 14884 26136 27744 29768
# [1] 228
# [1] 112
# [1] 4
# [1] 24
# [1] 4.658e-05
Df SS MS F pvalue
diet 3 228 76.0 13.57 0
Residuals 20 112 5.6 NA NA

4.3 ANOVA with R Function

Compute ANOVA with the fitting result with intercept

Df Sum Sq Mean Sq F value Pr(>F)
diet 3 228 76.0 13.57 0
Residuals 20 112 5.6 NA NA

4.4 Understanding Sampling Distribution of SS and F

We will use simulation to understand the sampling distributions of sum squares and F statistics in ANOVA.

When treat difference is 0

When treat difference is non-zero

5 Model Checking

# 
# Call:
# lm(formula = coag ~ diet)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -5.00  -1.25   0.00   1.25   5.00 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 6.10e+01   1.18e+00   51.55  < 2e-16 ***
# dietB       5.00e+00   1.53e+00    3.27  0.00380 ** 
# dietC       7.00e+00   1.53e+00    4.58  0.00018 ***
# dietD       2.99e-15   1.45e+00    0.00  1.00000    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.37 on 20 degrees of freedom
# Multiple R-squared:  0.671,   Adjusted R-squared:  0.621 
# F-statistic: 13.6 on 3 and 20 DF,  p-value: 4.66e-05

# 
#   Shapiro-Wilk normality test
# 
# data:  g$residuals
# W = 0.98, p-value = 0.9

Df Sum Sq Mean Sq F value Pr(>F)
diet 3 4.333 1.444 0.7046 0.5604
Residuals 20 41.000 2.050 NA NA
# 
#   Classical Levene's test based on the absolute deviations from the mean
#   ( none not applied because the location is not set to median )
# 
# data:  coagulation$coag
# Test Statistic = 0.7, p-value = 0.6
# 
#   Modified robust Brown-Forsythe Levene-type test based on the absolute
#   deviations from the median
# 
# data:  coagulation$coag
# Test Statistic = 0.65, p-value = 0.6
# 
#   Bartlett test of homogeneity of variances
# 
# data:  coag by diet
# Bartlett's K-squared = 1.7, df = 3, p-value = 0.6

6 Comparing Effects

6.1 Estimate Treatment Means

#       Estimate Std. Error t value  Pr(>|t|) 2.5 % 97.5 %
# dietA       61     1.1832   51.55 9.548e-23 58.53  63.47
# dietB       66     0.9661   68.32 3.532e-25 63.98  68.02
# dietC       68     0.9661   70.39 1.949e-25 65.98  70.02
# dietD       61     0.8367   72.91 9.663e-26 59.25  62.75

Computing formular for standard error: \[\hat\sigma=\sqrt{MSE}\] \[se(\hat \mu_i)=\hat\sigma\times \sqrt{1/n_i}\]

Df Sum Sq Mean Sq F value Pr(>F)
diet 4 98532 24633.0 4399 0
Residuals 20 112 5.6 NA NA
# diet
# A B C D 
# 4 6 6 8
# [1] 2.366
# [1] 58.53 63.47
# [1] 63.98 68.02

6.2 Effect Comparison with Least Significant Difference

Simply using confidence intervals from lm model, which uses t quantile without using multiple comparison adjustment, and use “A” as the baseline level (intercept)

#              Estimate Std. Error   t value  Pr(>|t|)  2.5 % 97.5 %
# (Intercept) 6.100e+01      1.183 5.155e+01 9.548e-23 58.532 63.468
# dietB       5.000e+00      1.528 3.273e+00 3.803e-03  1.814  8.186
# dietC       7.000e+00      1.528 4.583e+00 1.805e-04  3.814 10.186
# dietD       2.991e-15      1.449 2.064e-15 1.000e+00 -3.023  3.023

Computing formula for standard error: \[se(\hat \mu_i-\hat \mu_j)=\hat\sigma\times \sqrt{1/n_i+1/n_j}\]

Df Sum Sq Mean Sq F value Pr(>F)
diet 3 228 76.0 13.57 0
Residuals 20 112 5.6 NA NA
# diet
# A B C D 
# 4 6 6 8
# [1] 2.366
# [1] 1.814 8.186
# [1] -3.023  3.023

6.3 A Simulation Illustration for Multiple Comparison Bias

#           [,1]    [,2]      [,3]     [,4]    [,5]     [,6]
#  [1,]  0.07936 -1.1155 -0.345271  0.52322  1.0142 -1.40514
#  [2,]  0.73488  1.7361  0.007894  0.24220  0.5333  1.17966
#  [3,] -0.89140 -0.7483 -0.201237 -0.85505 -0.3195 -0.02718
#  [4,] -0.31502 -1.5270 -0.097061  0.67787 -0.3831 -0.70941
#  [5,]  0.52082  1.3273  0.075378  1.46701  0.7047 -0.13058
#  [6,] -0.21127  1.2140 -1.113065  0.69077 -0.7042  1.80166
#  [7,]  0.32672 -2.3090  0.892029  1.25368 -0.1084  1.57682
#  [8,] -0.24332 -0.8896  0.350715 -3.32672 -0.2094 -0.03835
#  [9,]  0.30136  1.1280  0.429686 -0.42425  0.5074 -0.20134
# [10,]  0.48439  1.4437 -1.053625  0.03308  0.7985 -0.97384

6.5 Procedure for Computing Tukey Confidence Intervals

# [1] 2.366
# diet
# A B C D 
# 4 6 6 8

Formula for Tukey Margin Error with confidence level \(1-\alpha\):

\[\hat\sigma \times \sqrt {1/n_i+1/n_j} \times q_{\alpha, k, df}/\sqrt{2}\] where \(q_{\alpha, k, df}\) is the upper tukey quantile with \(k\) means and \(df\) degrees freedom.

#  lower  upper 
# 0.7246 9.2754
#   Tukey multiple comparisons of means
#     95% family-wise confidence level
# 
# Fit: aov(formula = coag ~ diet)
# 
# $diet
#     diff      lwr    upr  p adj
# B-A    5   0.7246  9.275 0.0183
# C-A    7   2.7246 11.275 0.0010
# D-A    0  -4.0560  4.056 1.0000
# C-B    2  -1.8241  5.824 0.4766
# D-B   -5  -8.5771 -1.423 0.0044
# D-C   -7 -10.5771 -3.423 0.0001
#              Estimate Std. Error   t value  Pr(>|t|)  2.5 % 97.5 %
# (Intercept) 6.100e+01      1.183 5.155e+01 9.548e-23 58.532 63.468
# dietB       5.000e+00      1.528 3.273e+00 3.803e-03  1.814  8.186
# dietC       7.000e+00      1.528 4.583e+00 1.805e-04  3.814 10.186
# dietD       2.991e-15      1.449 2.064e-15 1.000e+00 -3.023  3.023

7 A complete R Procedure for Example 3.1 from Textbook

etch power
575 160
542 160
530 160
539 160
570 160
565 180
593 180
590 180
579 180
610 180
600 200
651 200
610 200
637 200
629 200
725 220
700 220
715 220
685 220
710 220

# 
# Call:
# lm(formula = etch ~ power, data = poweretch)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -25.4  -13.0    2.8   13.2   25.6 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   551.20       8.17   67.47  < 2e-16 ***
# power180       36.20      11.55    3.13   0.0064 ** 
# power200       74.20      11.55    6.42  8.4e-06 ***
# power220      155.80      11.55   13.49  3.7e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 18.3 on 16 degrees of freedom
# Multiple R-squared:  0.926,   Adjusted R-squared:  0.912 
# F-statistic: 66.8 on 3 and 16 DF,  p-value: 2.88e-09
Df Sum Sq Mean Sq F value Pr(>F)
power 3 66871 22290.2 66.8 0
Residuals 16 5339 333.7 NA NA

# 
#   Shapiro-Wilk normality test
# 
# data:  fit.etch$residuals
# W = 0.94, p-value = 0.2
# 
#   Modified robust Brown-Forsythe Levene-type test based on the absolute
#   deviations from the median
# 
# data:  etch
# Test Statistic = 0.2, p-value = 0.9
# 
#   Bartlett test of homogeneity of variances
# 
# data:  etch by power
# Bartlett's K-squared = 0.43, df = 3, p-value = 0.9

#              2.5 % 97.5 %
# (Intercept) 533.88 568.52
# power180     11.71  60.69
# power200     49.71  98.69
# power220    131.31 180.29
# 
# Call:
# lm(formula = etch ~ n.power)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -25.4  -13.0    2.8   13.2   25.6 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   625.40       8.17   76.55  < 2e-16 ***
# n.power180    -38.00      11.55   -3.29   0.0046 ** 
# n.power160    -74.20      11.55   -6.42  8.4e-06 ***
# n.power220     81.60      11.55    7.06  2.7e-06 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 18.3 on 16 degrees of freedom
# Multiple R-squared:  0.926,   Adjusted R-squared:  0.912 
# F-statistic: 66.8 on 3 and 16 DF,  p-value: 2.88e-09
#              2.5 % 97.5 %
# (Intercept) 608.08 642.72
# n.power180  -62.49 -13.51
# n.power160  -98.69 -49.71
# n.power220   57.11 106.09