library(BsMD)
qqnorm.lenth <- function(effects) {
  abseffects <- abs(effects)
  m <- length(abseffects)
  # Initial pseudo sd based on all |effects|
  s0 <- 1.5 * median(abseffects)
  # Drop the clearly large effects, then refine the estimate of sd (PSE)
  non.sig <- abseffects < 2.5 * s0
  pse <- 1.5 * median(abseffects[non.sig])
  # Simultaneous margin of error (Lenth 1989), with m/3 degrees of freedom
  sme <- qt(1 - (1 - 0.95^(1/m))/2, m/3) * pse
  sig <- abseffects > sme
  hqq <- qqnorm(effects, type = "n")
  text(hqq$x, hqq$y, labels = names(effects), col = sig + 1)
  qqline(effects)
  cat("Significant Factors Selected by Lenth's Method:\n",
      names(abseffects)[sig], "\n")
  names(effects)[sig]
}
A chemical product is produced in a pressure vessel. A factorial experiment is carried out in the pilot plant to study the factors thought to influence the filtration rate of this product. The four factors are temperature (A), pressure (B), concentration of formaldehyde (C), and stirring rate (D). Each factor is present at two levels. A single replicate of the \(2^4\) experiment is carried out, and the 16 runs are made in random order. The process engineer is interested in maximizing the filtration rate. Current process conditions give filtration rates of around 75 gal/h. The process also currently uses factor C at the high level. The engineer would like to reduce the formaldehyde concentration as much as possible but has been unable to do so, because lowering it always results in lower filtration rates.
options(digits = 2)
filtration <- data.frame(
  y = scan(text = "45 71 48 65 68 60 80 65 43 100 45 104 75 86 70 96"),
  A = rep(c(-1, 1), 8),
  B = rep(rep(c(-1, 1), each = 2), 4),
  C = rep(rep(c(-1, 1), each = 4), 2),
  D = rep(c(-1, 1), each = 8)
)
filtration
y | A | B | C | D |
---|---|---|---|---|
45 | -1 | -1 | -1 | -1 |
71 | 1 | -1 | -1 | -1 |
48 | -1 | 1 | -1 | -1 |
65 | 1 | 1 | -1 | -1 |
68 | -1 | -1 | 1 | -1 |
60 | 1 | -1 | 1 | -1 |
80 | -1 | 1 | 1 | -1 |
65 | 1 | 1 | 1 | -1 |
43 | -1 | -1 | -1 | 1 |
100 | 1 | -1 | -1 | 1 |
45 | -1 | 1 | -1 | 1 |
104 | 1 | 1 | -1 | 1 |
75 | -1 | -1 | 1 | 1 |
86 | 1 | -1 | 1 | 1 |
70 | -1 | 1 | 1 | 1 |
96 | 1 | 1 | 1 | 1 |
Attach the dataset so that y, A, B, C, D can be accessed directly without typing filtration$:
attach(filtration)
par(mfrow = c(3, 2))
interaction.plot(A, B, y, ylim = range(y), col = 1:2)
interaction.plot(C, D, y, ylim = range(y), col = 1:2)
interaction.plot(A, C, y, ylim = range(y), col = 1:2)
interaction.plot(B, C, y, ylim = range(y), col = 1:2)
interaction.plot(D, C, y, ylim = range(y), col = 1:2)
interaction.plot(A, D, y, ylim = range(y), col = 1:2)
##
## Call:
## lm(formula = y ~ (A + B + C + D)^4, data = filtration)
##
## Residuals:
## ALL 16 residuals are 0: no residual degrees of freedom!
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.0625 NA NA NA
## A 10.8125 NA NA NA
## B 1.5625 NA NA NA
## C 4.9375 NA NA NA
## D 7.3125 NA NA NA
## A:B 0.0625 NA NA NA
## A:C -9.0625 NA NA NA
## A:D 8.3125 NA NA NA
## B:C 1.1875 NA NA NA
## B:D -0.1875 NA NA NA
## C:D -0.5625 NA NA NA
## A:B:C 0.9375 NA NA NA
## A:B:D 2.0625 NA NA NA
## A:C:D -0.8125 NA NA NA
## B:C:D -1.3125 NA NA NA
## A:B:C:D 0.6875 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 15 and 0 DF, p-value: NA
y | (Intercept) | A | B | C | D | A:B | A:C | A:D | B:C | B:D | C:D | A:B:C | A:B:D | A:C:D | B:C:D | A:B:C:D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 |
71 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 | -1 |
48 | 1 | -1 | 1 | -1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | 1 | -1 | 1 | -1 |
65 | 1 | 1 | 1 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 |
68 | 1 | -1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | 1 | -1 |
60 | 1 | 1 | -1 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | -1 | 1 | 1 |
80 | 1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | -1 | 1 |
65 | 1 | 1 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 | 1 | -1 | -1 | -1 | -1 |
43 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 |
100 | 1 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | 1 |
45 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 |
104 | 1 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | -1 | 1 | -1 | -1 | 1 | -1 | -1 | -1 |
75 | 1 | -1 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 | -1 | 1 |
86 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | -1 | -1 |
70 | 1 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | -1 |
96 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Multiply \(y\) by the signs of \(X_j\), the \(j\)th column of the design matrix:
(Intercept) | A | B | C | D | A:B | A:C | A:D | B:C | B:D | C:D | A:B:C | A:B:D | A:C:D | B:C:D | A:B:C:D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | -45 | -45 | -45 | -45 | 45 | 45 | 45 | 45 | 45 | 45 | -45 | -45 | -45 | -45 | 45 |
71 | 71 | -71 | -71 | -71 | -71 | -71 | -71 | 71 | 71 | 71 | 71 | 71 | 71 | -71 | -71 |
48 | -48 | 48 | -48 | -48 | -48 | 48 | 48 | -48 | -48 | 48 | 48 | 48 | -48 | 48 | -48 |
65 | 65 | 65 | -65 | -65 | 65 | -65 | -65 | -65 | -65 | 65 | -65 | -65 | 65 | 65 | 65 |
68 | -68 | -68 | 68 | -68 | 68 | -68 | 68 | -68 | 68 | -68 | 68 | -68 | 68 | 68 | -68 |
60 | 60 | -60 | 60 | -60 | -60 | 60 | -60 | -60 | 60 | -60 | -60 | 60 | -60 | 60 | 60 |
80 | -80 | 80 | 80 | -80 | -80 | -80 | 80 | 80 | -80 | -80 | -80 | 80 | 80 | -80 | 80 |
65 | 65 | 65 | 65 | -65 | 65 | 65 | -65 | 65 | -65 | -65 | 65 | -65 | -65 | -65 | -65 |
43 | -43 | -43 | -43 | 43 | 43 | 43 | -43 | 43 | -43 | -43 | -43 | 43 | 43 | 43 | -43 |
100 | 100 | -100 | -100 | 100 | -100 | -100 | 100 | 100 | -100 | -100 | 100 | -100 | -100 | 100 | 100 |
45 | -45 | 45 | -45 | 45 | -45 | 45 | -45 | -45 | 45 | -45 | 45 | -45 | 45 | -45 | 45 |
104 | 104 | 104 | -104 | 104 | 104 | -104 | 104 | -104 | 104 | -104 | -104 | 104 | -104 | -104 | -104 |
75 | -75 | -75 | 75 | 75 | 75 | -75 | -75 | -75 | -75 | 75 | 75 | 75 | -75 | -75 | 75 |
86 | 86 | -86 | 86 | 86 | -86 | 86 | 86 | -86 | -86 | 86 | -86 | -86 | 86 | -86 | -86 |
70 | -70 | 70 | 70 | 70 | -70 | -70 | -70 | 70 | 70 | 70 | -70 | -70 | -70 | 70 | -70 |
96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 |
Contrasts are the column-wise sums of \(y \times X\), that is, the sum of the \(y_i\) at the high level (\(y_{X_j+}\)) minus the sum of the \(y_i\) at the low level (\(y_{X_j-}\)). For the intercept, it is the sum of all \(y_i\). In matrix language, \[\mbox{Contrast}_j=X_j'y=y_{X_j+}-y_{X_j-}.\] The total sample size here is \(N=16\); \(N\) appears in the formulas below.
Effect | Contrast |
---|---|
(Intercept) | 1121 |
A | 173 |
B | 25 |
C | 79 |
D | 117 |
A:B | 1 |
A:C | -145 |
A:D | 133 |
B:C | 19 |
B:D | -3 |
C:D | -9 |
A:B:C | 15 |
A:B:D | 33 |
A:C:D | -13 |
B:C:D | -21 |
A:B:C:D | 11 |
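The contrasts above can be reproduced directly from the model matrix. The following sketch rebuilds the data from the values shown earlier and computes \(X_j'y\) (the variable names are my own):

```r
# Rebuild the filtration data (same values and run order as above)
y <- c(45, 71, 48, 65, 68, 60, 80, 65, 43, 100, 45, 104, 75, 86, 70, 96)
A <- rep(c(-1, 1), 8)
B <- rep(rep(c(-1, 1), each = 2), 4)
C <- rep(rep(c(-1, 1), each = 4), 2)
D <- rep(c(-1, 1), each = 8)
# Model matrix with all interactions up to order 4
X <- model.matrix(~ (A + B + C + D)^4)
# Contrast_j = X_j' y, i.e. the column-wise sums of y * X
contrasts <- drop(t(X) %*% y)
contrasts[c("(Intercept)", "A", "A:C", "A:D")]
# (Intercept) 1121, A 173, A:C -145, A:D 133
```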
In matrix language, \(\hat\tau = (X'X)^{-1}X'y\). The differences between the means at the high and low levels (the effects of A, B, ABC, etc.) are \(2\times\hat\tau\). The formula for the coefficients can be written as: \[\mbox{Contrast}_j=X_j'y=y_{X_j+}-y_{X_j-}\] \[\hat \tau_j ={\mbox{Contrast}_j\over N}=\frac{\bar y_{X_j+}-\bar y_{X_j-}}{2}\]
Effect | Coefficient |
---|---|
(Intercept) | 70.06 |
A | 10.81 |
B | 1.56 |
C | 4.94 |
D | 7.31 |
A:B | 0.06 |
A:C | -9.06 |
A:D | 8.31 |
B:C | 1.19 |
B:D | -0.19 |
C:D | -0.56 |
A:B:C | 0.94 |
A:B:D | 2.06 |
A:C:D | -0.81 |
B:C:D | -1.31 |
A:B:C:D | 0.69 |
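Because the design columns are orthogonal with \(X'X = N I\), the least-squares coefficients are exactly the contrasts divided by \(N=16\). A quick numerical check (rebuilding the data; the names are my own):

```r
y <- c(45, 71, 48, 65, 68, 60, 80, 65, 43, 100, 45, 104, 75, 86, 70, 96)
A <- rep(c(-1, 1), 8)
B <- rep(rep(c(-1, 1), each = 2), 4)
C <- rep(rep(c(-1, 1), each = 4), 2)
D <- rep(c(-1, 1), each = 8)
g <- lm(y ~ (A + B + C + D)^4)
X <- model.matrix(g)
tau <- drop(t(X) %*% y) / 16   # Contrast_j / N
# Matches the lm coefficients
all.equal(tau, coef(g))        # TRUE
```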
The linear model is written as:
\[y = 70 + 11 x_A + 1.6 x_B + 4.9 x_C + 7.3 x_D +\\ 0.062 x_Ax_B - 9.1 x_Ax_C + 8.3 x_Ax_D + 1.2 x_Bx_C - 0.19 x_Bx_D - 0.56 x_Cx_D + \\ 0.94 x_Ax_Bx_C + 2.1 x_Ax_Bx_D - 0.81 x_Ax_Cx_D - 1.3 x_Bx_Cx_D +\\ 0.69 x_Ax_Bx_Cx_D +\\ \epsilon,\] where \(x_j=1\) if factor \(j\) is at the high level, and \(x_j=-1\) if factor \(j\) is at the low level, for \(j = A,B,C,D\).
## Warning in anova.lm(g): ANOVA F-tests on an essentially perfect fit are
## unreliable
Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
A | 1 | 1870.56 | 1870.56 | NaN | NaN |
B | 1 | 39.06 | 39.06 | NaN | NaN |
C | 1 | 390.06 | 390.06 | NaN | NaN |
D | 1 | 855.56 | 855.56 | NaN | NaN |
A:B | 1 | 0.06 | 0.06 | NaN | NaN |
A:C | 1 | 1314.06 | 1314.06 | NaN | NaN |
A:D | 1 | 1105.56 | 1105.56 | NaN | NaN |
B:C | 1 | 22.56 | 22.56 | NaN | NaN |
B:D | 1 | 0.56 | 0.56 | NaN | NaN |
C:D | 1 | 5.06 | 5.06 | NaN | NaN |
A:B:C | 1 | 14.06 | 14.06 | NaN | NaN |
A:B:D | 1 | 68.06 | 68.06 | NaN | NaN |
A:C:D | 1 | 10.56 | 10.56 | NaN | NaN |
B:C:D | 1 | 27.56 | 27.56 | NaN | NaN |
A:B:C:D | 1 | 7.56 | 7.56 | NaN | NaN |
Residuals | 0 | 0.00 | NaN | NA | NA |
The sums of squares can be computed by \[SS_j=N \times \hat \tau_j^2=\frac{\mbox{Contrast}^2_j}{N},\] where \(N=2^k\times n\) is the total number of observations. In this example, \(N=16\) since \(k=4,\ n=1\). This formula is also a specialization of the general formula for the treatment sum of squares: \[SS_\mbox{treatment}=\sum_{i=1}^a n_i (\bar y_i-\bar y)^2\] with \[n_i=N/2,\quad \bar y_{X_j+}-\bar y = \hat\tau_j,\quad \bar y_{X_j-}-\bar y = -\hat\tau_j,\quad a = 2.\] Therefore, \[SS_j=(N/2)\times\hat\tau_j^2 + (N/2)\times\hat \tau_j^2=N\times \hat\tau_j^2.\]
(Intercept) | A | B | C | D | A:B | A:C | A:D | B:C | B:D | C:D | A:B:C | A:B:D | A:C:D | B:C:D | A:B:C:D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
78540 | 1871 | 39 | 390 | 856 | 0.06 | 1314 | 1106 | 23 | 0.56 | 5.1 | 14 | 68 | 11 | 28 | 7.6 |
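The identity \(SS_j = N\hat\tau_j^2\) is easy to verify numerically against the ANOVA table; a sketch, rebuilding the data and full model:

```r
y <- c(45, 71, 48, 65, 68, 60, 80, 65, 43, 100, 45, 104, 75, 86, 70, 96)
A <- rep(c(-1, 1), 8)
B <- rep(rep(c(-1, 1), each = 2), 4)
C <- rep(rep(c(-1, 1), each = 4), 2)
D <- rep(c(-1, 1), each = 8)
g <- lm(y ~ (A + B + C + D)^4)
# SS_j = N * tau_j^2, excluding the intercept
SS <- 16 * coef(g)[-1]^2
round(SS[c("A", "A:C", "A:D")], 2)   # 1870.56, 1314.06, 1105.56
```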
\(SS_j\) is just the sum of squares of the fitted component \(X_j\hat \tau_j\):
(Intercept) | A | B | C | D | A:B | A:C | A:D | B:C | B:D | C:D | A:B:C | A:B:D | A:C:D | B:C:D | A:B:C:D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
70 | -11 | -1.6 | -4.9 | -7.3 | 0.06 | -9.1 | 8.3 | 1.2 | -0.19 | -0.56 | -0.94 | -2.1 | 0.81 | 1.3 | 0.69 |
70 | 11 | -1.6 | -4.9 | -7.3 | -0.06 | 9.1 | -8.3 | 1.2 | -0.19 | -0.56 | 0.94 | 2.1 | -0.81 | 1.3 | -0.69 |
70 | -11 | 1.6 | -4.9 | -7.3 | -0.06 | -9.1 | 8.3 | -1.2 | 0.19 | -0.56 | 0.94 | 2.1 | 0.81 | -1.3 | -0.69 |
70 | 11 | 1.6 | -4.9 | -7.3 | 0.06 | 9.1 | -8.3 | -1.2 | 0.19 | -0.56 | -0.94 | -2.1 | -0.81 | -1.3 | 0.69 |
70 | -11 | -1.6 | 4.9 | -7.3 | 0.06 | 9.1 | 8.3 | -1.2 | -0.19 | 0.56 | 0.94 | -2.1 | -0.81 | -1.3 | -0.69 |
70 | 11 | -1.6 | 4.9 | -7.3 | -0.06 | -9.1 | -8.3 | -1.2 | -0.19 | 0.56 | -0.94 | 2.1 | 0.81 | -1.3 | 0.69 |
70 | -11 | 1.6 | 4.9 | -7.3 | -0.06 | 9.1 | 8.3 | 1.2 | 0.19 | 0.56 | -0.94 | 2.1 | -0.81 | 1.3 | 0.69 |
70 | 11 | 1.6 | 4.9 | -7.3 | 0.06 | -9.1 | -8.3 | 1.2 | 0.19 | 0.56 | 0.94 | -2.1 | 0.81 | 1.3 | -0.69 |
70 | -11 | -1.6 | -4.9 | 7.3 | 0.06 | -9.1 | -8.3 | 1.2 | 0.19 | 0.56 | -0.94 | 2.1 | -0.81 | -1.3 | -0.69 |
70 | 11 | -1.6 | -4.9 | 7.3 | -0.06 | 9.1 | 8.3 | 1.2 | 0.19 | 0.56 | 0.94 | -2.1 | 0.81 | -1.3 | 0.69 |
70 | -11 | 1.6 | -4.9 | 7.3 | -0.06 | -9.1 | -8.3 | -1.2 | -0.19 | 0.56 | 0.94 | -2.1 | -0.81 | 1.3 | 0.69 |
70 | 11 | 1.6 | -4.9 | 7.3 | 0.06 | 9.1 | 8.3 | -1.2 | -0.19 | 0.56 | -0.94 | 2.1 | 0.81 | 1.3 | -0.69 |
70 | -11 | -1.6 | 4.9 | 7.3 | 0.06 | 9.1 | -8.3 | -1.2 | 0.19 | -0.56 | 0.94 | 2.1 | 0.81 | 1.3 | 0.69 |
70 | 11 | -1.6 | 4.9 | 7.3 | -0.06 | -9.1 | 8.3 | -1.2 | 0.19 | -0.56 | -0.94 | -2.1 | -0.81 | 1.3 | -0.69 |
70 | -11 | 1.6 | 4.9 | 7.3 | -0.06 | 9.1 | -8.3 | 1.2 | -0.19 | -0.56 | -0.94 | -2.1 | 0.81 | -1.3 | -0.69 |
70 | 11 | 1.6 | 4.9 | 7.3 | 0.06 | -9.1 | 8.3 | 1.2 | -0.19 | -0.56 | 0.94 | 2.1 | -0.81 | -1.3 | 0.69 |
Understand the Pseudo Standard Deviation
## [1] 2
## [1] 10
## [1] 102
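The factor 1.5 in the pseudo standard deviation comes from the normal distribution: for zero-mean normal noise, \(\mathrm{median}(|Z|)\approx 0.675\,\sigma\), so \(1.5\times\mathrm{median}(|Z|)\approx 1.01\,\sigma\). A simulation sketch (my own, not the document's code):

```r
set.seed(1)                          # for reproducibility
z <- rnorm(1e5, mean = 0, sd = 10)   # pure noise with sigma = 10
# 1.5 * median(|z|) is a robust estimate of sigma:
round(1.5 * median(abs(z)))          # close to 10
```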
Lenth's Method
QQ plot of effects with Lenth significance
## Significant Factors Selected by Lenth's Method:
## A D A:C A:D
## [1] "A" "D" "A:C" "A:D"
Line plots with Lenth significance
## alpha PSE ME SME
## 0.05 1.31 3.37 6.85
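The PSE, ME, and SME printed above (from BsMD's `LenthPlot`) can be reproduced in base R. Here `tau` holds the 15 coefficient estimates from the full model; the calculation below is a sketch of Lenth's method applied to those coefficients:

```r
# Coefficient estimates from the full model (copied from the table above)
tau <- c(A = 10.8125, B = 1.5625, C = 4.9375, D = 7.3125,
         "A:B" = 0.0625, "A:C" = -9.0625, "A:D" = 8.3125,
         "B:C" = 1.1875, "B:D" = -0.1875, "C:D" = -0.5625,
         "A:B:C" = 0.9375, "A:B:D" = 2.0625, "A:C:D" = -0.8125,
         "B:C:D" = -1.3125, "A:B:C:D" = 0.6875)
m   <- length(tau)                                   # 15 estimates
s0  <- 1.5 * median(abs(tau))                        # initial pseudo sd
pse <- 1.5 * median(abs(tau)[abs(tau) < 2.5 * s0])   # pseudo standard error
me  <- qt(0.975, m/3) * pse                          # margin of error
sme <- qt((1 + 0.95^(1/m))/2, m/3) * pse             # simultaneous ME
round(c(PSE = pse, ME = me, SME = sme), 2)           # 1.31 3.37 6.85
```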
Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
A | 1 | 1871 | 1871 | 96 | 0 |
C | 1 | 390 | 390 | 20 | 0 |
D | 1 | 856 | 856 | 44 | 0 |
A:C | 1 | 1314 | 1314 | 67 | 0 |
A:D | 1 | 1106 | 1106 | 57 | 0 |
Residuals | 10 | 195 | 20 | NA | NA |
Important note: the residual sum of squares of the reduced model equals the sum of the sums of squares of the "removed" terms (here \(39.06+0.06+22.56+0.56+5.06+14.06+68.06+10.56+27.56+7.56=195.1\)).
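This pooling of the removed terms into the residual can be checked directly; a sketch rebuilding both models (`suppressWarnings` hides the perfect-fit warning for the full model, whose sums of squares are still valid):

```r
y <- c(45, 71, 48, 65, 68, 60, 80, 65, 43, 100, 45, 104, 75, 86, 70, 96)
A <- rep(c(-1, 1), 8)
B <- rep(rep(c(-1, 1), each = 2), 4)
C <- rep(rep(c(-1, 1), each = 4), 2)
D <- rep(c(-1, 1), each = 8)
full    <- lm(y ~ (A + B + C + D)^4)
reduced <- lm(y ~ A + C + D + A:C + A:D)
aovF <- suppressWarnings(anova(full))
aovR <- anova(reduced)
# Terms in the full model that were dropped from the reduced one
dropped <- setdiff(rownames(aovF), rownames(aovR))
sum(aovF[dropped, "Sum Sq"])        # 195.125 ...
aovR["Residuals", "Sum Sq"]         # ... equals the residual SS
```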
##
## Call:
## lm(formula = y ~ A + C + D + A * C + A * D, data = filtration)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.375 -1.500 0.063 2.906 5.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.06 1.10 63.44 2.3e-14 ***
## A 10.81 1.10 9.79 1.9e-06 ***
## C 4.94 1.10 4.47 0.0012 **
## D 7.31 1.10 6.62 5.9e-05 ***
## A:C -9.06 1.10 -8.21 9.4e-06 ***
## A:D 8.31 1.10 7.53 2.0e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.4 on 10 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.949
## F-statistic: 56.7 on 5 and 10 DF, p-value: 5.14e-07
The final linear model for relating y to x based on the above “lm” output is written as follows: \[y=70.06 + 10.81 x_A + 4.94 x_C + 7.31 x_D - 9.06 x_Ax_C + 8.31 x_Ax_D + \epsilon \]
When \(x_A=1\), the coefficient for \(x_C\) is \(4.94-9.06 = -4.12\), implying that the filtration rate decreases by about 8 gal/h when C moves from the low level to the high level; when \(x_A=-1\), the coefficient for \(x_C\) is \(4.94+9.06 = 14.00\). We see that high C (formaldehyde) on average increases the filtration rate, but decreases it when temperature (A) is high (+1). Look at the interaction plot:
When \(x_A=1\), the coefficient for \(x_D\) is \(7.31+8.31 = 15.62\); when \(x_A=-1\), it is \(7.31-8.31 = -1.00\). High D (stirring rate) increases the filtration rate on average: it increases the rate substantially when A (temperature) is high (+1), and decreases it slightly when A is low (-1).
Prediction of \(y_i\) in each combination of \(A,C,D\):
# h2 is the reduced model fitted above: lm(y ~ A + C + D + A:C + A:D)
newdata <- data.frame(
  A = rep(c(-1, 1), each = 4),
  C = rep(rep(c(-1, 1), each = 2), 2),
  D = rep(c(-1, 1), 4)
)
predicted.filtration <- predict(h2, newdata = newdata)
data.frame(newdata, predicted.filtration)
A | C | D | predicted.filtration |
---|---|---|---|
-1 | -1 | -1 | 46 |
-1 | -1 | 1 | 44 |
-1 | 1 | -1 | 74 |
-1 | 1 | 1 | 72 |
1 | -1 | -1 | 69 |
1 | -1 | 1 | 101 |
1 | 1 | -1 | 61 |
1 | 1 | 1 | 92 |
The combination of high temperature (A+), low formaldehyde (C-), and high stirring rate (D+) can provide a good filtration rate.
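This recommendation can be confirmed from the reduced model's predictions; a self-contained sketch (`expand.grid` enumerates the eight \(A,C,D\) settings):

```r
y <- c(45, 71, 48, 65, 68, 60, 80, 65, 43, 100, 45, 104, 75, 86, 70, 96)
A <- rep(c(-1, 1), 8)
C <- rep(rep(c(-1, 1), each = 4), 2)
D <- rep(c(-1, 1), each = 8)
h2 <- lm(y ~ A + C + D + A:C + A:D)
grid <- expand.grid(A = c(-1, 1), C = c(-1, 1), D = c(-1, 1))
pred <- predict(h2, newdata = grid)
grid[which.max(pred), ]   # A = 1, C = -1, D = 1
max(pred)                 # 100.625, the highest predicted filtration rate
```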