We begin by simulating some data. Our response variable `y` (perhaps FA?) is related to `age` and `group`. It is also related to test `score`, but this relationship is modified by `group` (an interaction). Further, as might be expected, `score` is also somewhat correlated with `age`. There is also a sampling error term.

```r
n = 1000
age = runif(n)
score = runif(n) + 0.5*age
group = factor(rep(1:3, length.out=n))
e = rnorm(n, 0, 0.1)
y = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + e
```

Now we fit the full multiple regression and examine the results.

```r
fit = lm(y ~ group*score + age)
summary(fit)
```

Output:

```
Call:
lm(formula = y ~ group * score + age)

Residuals:
      Min        1Q    Median        3Q       Max
-0.349251 -0.065301  0.001216  0.064021  0.301822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.98404    0.01311   75.03   <2e-16 ***
group2        1.02755    0.01866   55.06   <2e-16 ***
group3        2.02072    0.01879  107.53   <2e-16 ***
score         1.61725    0.01680   96.29   <2e-16 ***
age           0.76173    0.01204   63.24   <2e-16 ***
group2:score  0.96657    0.02299   42.05   <2e-16 ***
group3:score  1.97391    0.02344   84.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0967 on 993 degrees of freedom
Multiple R-squared: 0.9968,	Adjusted R-squared: 0.9968
F-statistic: 5.187e+04 on 6 and 993 DF,  p-value: < 2.2e-16
```

We can see that the parameter estimates (i.e. the partial regression coefficients) are approximately equal to the weights that we gave these variables when we simulated the data.
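As a quick sanity check, we can compare the fitted coefficients directly to the weights implied by the simulation. The `true` vector below is derived by hand from the simulation formula under R's default treatment coding (group 1 is the baseline, so, e.g., the baseline `score` slope is 0.6 + 1 = 1.6 and the `group2:score` interaction is the difference 2.6 - 1.6 = 1):

```r
# True coefficients implied by the simulation, in the order of coef(fit):
# intercept (group 1), group2, group3, score, age, group2:score, group3:score
true = c(1, 1, 2, 1.6, 0.75, 1, 2)
round(cbind(estimate = coef(fit), true = true), 2)
```

The differences should all be small relative to the standard errors reported by `summary(fit)`.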

Next, we can plot the simple correlations between the variables to explore the data.

```r
lm_y.a = lm(y ~ age)
dev.new()
plot(y ~ age)
abline(lm_y.a$coef[1], lm_y.a$coef[2])

lm_y.g = lm(y ~ group)
dev.new()
plot(y ~ group)

lm_y.s = lm(y ~ score)
dev.new()
plot(y ~ score)
abline(lm_y.s$coef[1], lm_y.s$coef[2])

lm_score.a = lm(score ~ age)
dev.new()
plot(score ~ age)
abline(lm_score.a$coef[1], lm_score.a$coef[2])
```

We can also use `ggplot2` to display some of these correlations jointly. Here we display the response `y` vs. `age`, with plots faceted by `group`. The points are colored by `score`, which we can see is correlated with both `y` and `age`.

```r
library(ggplot2)
df = data.frame(age, score, group, y)
dev.new()
qplot(age, y, color=score, data=df) + facet_wrap(~group, nrow=1)
```

Finally, we can generate a partial regression plot of the response `y` on `score`. To do so, we

- Residualize `y` on all the independent variables except `score`.
- Residualize `score` on the other independent variables.
- Plot the residuals against each other.

```r
lm_y.g.a = lm(y ~ group + age)
lm_score.g.a = lm(score ~ group + age)
fit2 = lm(lm_y.g.a$res ~ lm_score.g.a$res:group + 0)
summary(fit2)
```

Output:

```
Call:
lm(formula = lm_y.g.a$res ~ lm_score.g.a$res:group + 0)

Residuals:
      Min        1Q    Median        3Q       Max
-0.438602 -0.105403 -0.001883  0.106990  0.470098

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
lm_score.g.a$res:group1  1.60419    0.02955   54.28   <2e-16 ***
lm_score.g.a$res:group2  2.58006    0.02915   88.52   <2e-16 ***
lm_score.g.a$res:group3  3.57188    0.03103  115.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.156 on 997 degrees of freedom
Multiple R-squared: 0.9602,	Adjusted R-squared: 0.96
F-statistic: 8010 on 3 and 997 DF,  p-value: < 2.2e-16
```

The magical thing to note here is that these simple per-`group` slopes of the residualized `y` on the residualized `score` are approximately equal to the group-specific partial regression coefficients for `score` on the full data set (the `score` coefficient plus the corresponding interaction term). This is the heart of multiple regression!
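We can check this numerically. In this sketch (assuming `fit` and `fit2` from above), the group-specific `score` slopes implied by the full model are the `score` coefficient plus the corresponding interaction term:

```r
b = coef(fit)
implied = unname(c(b["score"],
                   b["score"] + b["group2:score"],
                   b["score"] + b["group3:score"]))
round(cbind(residualized = coef(fit2), full_model = implied), 2)
```

The two columns should agree closely (they are not exactly identical here, because the residualization step leaves out the group-by-`score` interaction).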

Now we can make the plot:

```r
df2 = data.frame(lm_y.g.a$res, lm_score.g.a$res, group)
dev.new()
qplot(lm_score.g.a.res, lm_y.g.a.res, color=group, data=df2) + stat_smooth(method='lm')
```

The slopes of the three trendlines are given by the parameter estimates, and since the data are residualized, the intercepts are all zero.
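To confirm, we can fit a separate line within each group of the residualized data (a sketch; the column names of `df2` follow from the `data.frame()` call above):

```r
# Per-group fits on the residualized data: the slopes match coef(fit2),
# and the intercepts are essentially zero because residualizing on
# group centers both variables within each group.
by(df2, df2$group, function(d)
   round(coef(lm(lm_y.g.a.res ~ lm_score.g.a.res, data = d)), 3))
```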

The R code for this example is available here.

<bibtex>
@book{wickham_ggplot2_2009,
  address = {New York},
  title = {ggplot2: Elegant Graphics for Data Analysis},
  publisher = {Springer},
  author = {Hadley Wickham},
  year = {2009}
}
@book{venables_modern_2002,
  address = {New York},
  edition = {4th},
  title = {Modern Applied Statistics with S},
  publisher = {Springer},
  author = {W N Venables and B D Ripley},
  year = {2002}
}
</bibtex>
