# Multiple regression example with R

We begin by simulating some data. Our response variable `y` (perhaps FA?) is related to `age` and `group`. It is also related to test `score`, but this relationship is modified by `group` (an interaction). Further, as might be expected, `score` is also somewhat correlated with `age`. There is also a sampling error term.

```r
n     = 1000
age   = runif(n)
score = runif(n) + 0.5*age
group = factor(rep(1:3, length.out=n))
e     = rnorm(n, 0, 0.1)

y     = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + e
```

Now we fit the full multiple regression and examine the results.

```r
fit = lm(y ~ group*score + age)
summary(fit)
```

Output:

```
Call:
lm(formula = y ~ group * score + age)

Residuals:
      Min        1Q    Median        3Q       Max
-0.349251 -0.065301  0.001216  0.064021  0.301822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.98404    0.01311   75.03   <2e-16 ***
group2        1.02755    0.01866   55.06   <2e-16 ***
group3        2.02072    0.01879  107.53   <2e-16 ***
score         1.61725    0.01680   96.29   <2e-16 ***
age           0.76173    0.01204   63.24   <2e-16 ***
group2:score  0.96657    0.02299   42.05   <2e-16 ***
group3:score  1.97391    0.02344   84.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0967 on 993 degrees of freedom
Multiple R-squared: 0.9968,	Adjusted R-squared: 0.9968
F-statistic: 5.187e+04 on 6 and 993 DF,  p-value: < 2.2e-16
```

We can see that the parameter estimates (i.e. the partial regression coefficients) recover the weights we used when simulating the data, once R's default treatment coding is taken into account: group 1 is the baseline, so the intercept is ≈ 1, the baseline `score` slope is ≈ 0.6 + 1 = 1.6, and the group 2 and group 3 offsets and interaction terms are each ≈ 1 and ≈ 2.
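We can check this correspondence numerically. The sketch below re-creates the simulation and fit so it runs standalone; the `set.seed` call is an addition for reproducibility (the original seed is unknown, so the exact numbers will differ from the output above):

```r
set.seed(1)  # hypothetical seed, added for reproducibility
n     = 1000
age   = runif(n)
score = runif(n) + 0.5*age
group = factor(rep(1:3, length.out=n))
y     = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + rnorm(n, 0, 0.1)
fit   = lm(y ~ group*score + age)

# Coefficients implied by the simulation under treatment coding:
# intercept, group2, group3, score, age, group2:score, group3:score
true_coefs = c(1, 1, 2, 1.6, 0.75, 1, 2)
max(abs(coef(fit) - true_coefs))  # small, and shrinks as n grows
```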

Next, we can plot the simple pairwise relationships between the variables to explore the data.

```r
lm_y.a = lm(y ~ age)
dev.new()
plot(y ~ age)
abline(lm_y.a)

lm_y.g = lm(y ~ group)
dev.new()
plot(y ~ group)

lm_y.s = lm(y ~ score)
dev.new()
plot(y ~ score)
abline(lm_y.s)

lm_score.a = lm(score ~ age)
dev.new()
plot(score ~ age)
abline(lm_score.a)
```

We can also use `ggplot2` to display some of these relationships jointly. Here we display `y` vs. `age`, with plots faceted by `group`. The points are colored by `score`, which we can see is correlated with both `y` and `age`.

```r
library(ggplot2)
df = data.frame(age, score, group, y)
dev.new()
qplot(age, y, color=score, data=df) + facet_wrap(~group, nrow=1)
```

Finally, we can generate a partial regression plot of `y` on `score`. To do so, we

1. Residualize `response` for all the independent variables except `score`.
2. Residualize `score` for the other independent variables.
3. Plot the residuals against each other.
```r
lm_y.g.a = lm(y ~ group + age)
lm_score.g.a = lm(score ~ group + age)
fit2 = lm(lm_y.g.a$res ~ lm_score.g.a$res:group + 0)
summary(fit2)
```

Output:

```
Call:
lm(formula = lm_y.g.a$res ~ lm_score.g.a$res:group + 0)

Residuals:
      Min        1Q    Median        3Q       Max
-0.438602 -0.105403 -0.001883  0.106990  0.470098

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
lm_score.g.a$res:group1  1.60419    0.02955   54.28   <2e-16 ***
lm_score.g.a$res:group2  2.58006    0.02915   88.52   <2e-16 ***
lm_score.g.a$res:group3  3.57188    0.03103  115.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.156 on 997 degrees of freedom
Multiple R-squared: 0.9602,	Adjusted R-squared:  0.96
F-statistic:  8010 on 3 and 997 DF,  p-value: < 2.2e-16
```

The magical thing to note here is that these per-group regression slopes on the residualized data are approximately equal to the per-group `score` slopes implied by the full model: 1.617, 1.617 + 0.967 = 2.584, and 1.617 + 1.974 = 3.591, matching 1.604, 2.580, and 3.572 above. This is the heart of multiple regression!
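The agreement can be verified side by side. This sketch re-creates the data and both fits so it runs standalone; as before, the `set.seed` call is an assumption added for reproducibility, so the numbers will differ slightly from the output above:

```r
set.seed(1)  # hypothetical seed, added for reproducibility
n     = 1000
age   = runif(n)
score = runif(n) + 0.5*age
group = factor(rep(1:3, length.out=n))
y     = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + rnorm(n, 0, 0.1)

fit  = lm(y ~ group*score + age)        # full model
ry   = resid(lm(y ~ group + age))       # y residualized for group and age
rs   = resid(lm(score ~ group + age))   # score residualized likewise
fit2 = lm(ry ~ rs:group + 0)            # per-group slopes, no intercept

# Per-group score slopes implied by the full model (baseline + interaction)
b = coef(fit)
full_slopes = c(b["score"],
                b["score"] + b["group2:score"],
                b["score"] + b["group3:score"])
cbind(full = full_slopes, residualized = coef(fit2))  # nearly identical columns
```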

Now we can make the plot:

```r
df2 = data.frame(lm_y.g.a$res, lm_score.g.a$res, group)  # data.frame() turns the `$` in these names into `.`
dev.new()
qplot(lm_score.g.a.res, lm_y.g.a.res, color=group, data=df2) + stat_smooth(method='lm')
```

The slopes of the three trendlines are given by the parameter estimates, and since the data are residualized, the intercepts are all zero.
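The zero intercepts can be checked directly by refitting the residualized regression with intercepts estimated freely; because residuals from a model containing `group` have mean zero within each group, the intercepts come out at machine zero. A standalone sketch (the `set.seed` call is again an added assumption):

```r
set.seed(1)  # hypothetical seed, added for reproducibility
n     = 1000
age   = runif(n)
score = runif(n) + 0.5*age
group = factor(rep(1:3, length.out=n))
y     = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + rnorm(n, 0, 0.1)

ry   = resid(lm(y ~ group + age))      # residualized response
rs   = resid(lm(score ~ group + age))  # residualized score
fit3 = lm(ry ~ rs*group)               # intercepts now estimated freely
coef(fit3)[c("(Intercept)", "group2", "group3")]  # all essentially zero
```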

