Multiple regression example with R

We begin by simulating some data. Our response variable y (perhaps FA?) is related to age and group. It is also related to test score, but this relationship is modified by group (an interaction). Further, as might be expected, score is also somewhat correlated with age. There is also a sampling error term.

n     = 1000
age   = runif(n)                          # age, uniform on [0, 1]
score = runif(n) + 0.5*age                # test score, correlated with age
group = factor(rep(1:3, length.out=n))    # three equal-sized groups
e     = rnorm(n, 0, 0.1)                  # sampling error

# Group shifts the intercept, score has both a main effect and a
# group-dependent (interaction) effect, and age has a main effect.
y     = as.numeric(group) + 0.6*score + as.numeric(group)*score + 0.75*age + e

Now we fit the full multiple regression and examine the results.

fit = lm(y ~ group*score + age)    # group*score expands to group + score + group:score
summary(fit)

Output:

Call:
lm(formula = y ~ group * score + age)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.349251 -0.065301  0.001216  0.064021  0.301822 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.98404    0.01311   75.03   <2e-16 ***
group2        1.02755    0.01866   55.06   <2e-16 ***
group3        2.02072    0.01879  107.53   <2e-16 ***
score         1.61725    0.01680   96.29   <2e-16 ***
age           0.76173    0.01204   63.24   <2e-16 ***
group2:score  0.96657    0.02299   42.05   <2e-16 ***
group3:score  1.97391    0.02344   84.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.0967 on 993 degrees of freedom
Multiple R-squared: 0.9968,	Adjusted R-squared: 0.9968 
F-statistic: 5.187e+04 on 6 and 993 DF,  p-value: < 2.2e-16

We can see that the parameter estimates (i.e. the partial regression coefficients) are approximately equal to the weights that we gave these variables when we simulated the data. For example, the intercept of about 1 is the group 1 offset, the score slope of about 1.6 is the base weight of 0.6 plus the group 1 interaction weight of 1, and the group 2 and group 3 terms add about 1 and 2 on top of their respective baselines.
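To make the comparison concrete, we can line the fitted coefficients up against the weights implied by the simulation; something like:

# True weights implied by the simulation: baseline group effect of 1, offsets
# of 1 and 2 for groups 2 and 3, a group 1 score slope of 0.6 + 1 = 1.6,
# interaction offsets of 1 and 2, and an age slope of 0.75.
truth = c(`(Intercept)` = 1, group2 = 1, group3 = 2, score = 1.6,
          age = 0.75, `group2:score` = 1, `group3:score` = 2)
round(cbind(estimate = coef(fit), truth = truth[names(coef(fit))]), 3)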

Next, we can plot the simple correlations between the variables to explore the data.

# Response vs. each variable, with least-squares lines where appropriate
lm_y.a = lm(y ~ age)
dev.new()
plot(y ~ age)
abline(lm_y.a)

lm_y.g = lm(y ~ group)
dev.new()
plot(y ~ group)    # group is a factor, so this draws boxplots

lm_y.s = lm(y ~ score)
dev.new()
plot(y ~ score)
abline(lm_y.s)

lm_score.a = lm(score ~ age)
dev.new()
plot(score ~ age)
abline(lm_score.a)
[Figures: response vs. age, response vs. group, response vs. score, and score vs. age]
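The same relationships can be checked numerically; for instance:

# Pairwise correlations among the continuous variables (group is a factor,
# so it is better explored with the boxplots above).
round(cor(cbind(y, age, score)), 2)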

We can also use ggplot2 to display some of these correlations jointly. Here we display response vs. age, with plots faceted by group. The points are colored by score, which we can see is correlated with both response and age.

library(ggplot2)
df = data.frame(age, score, group, y)
dev.new()
qplot(age, y, color=score, data=df) + facet_wrap(~group, nrow=1)
[Figure: raw data plotted with ggplot2, faceted by group]
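One caveat for readers on a recent version of ggplot2: qplot() has since been deprecated, so the equivalent call with ggplot() itself would be something like:

# Same plot without qplot(): map age/y/score aesthetics and facet by group.
ggplot(df, aes(age, y, color = score)) +
    geom_point() +
    facet_wrap(~ group, nrow = 1)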

Finally, we can generate a partial regression plot of response on score. To do so, we

  1. Residualize response for all the independent variables except score.
  2. Residualize score for the other independent variables.
  3. Plot the residuals against each other.

lm_y.g.a = lm(y ~ group + age)          # step 1: residualize response on group and age
lm_score.g.a = lm(score ~ group + age)  # step 2: residualize score on group and age
fit2 = lm(lm_y.g.a$res ~ lm_score.g.a$res:group + 0)  # per-group slopes ($res matches $residuals)
summary(fit2)

Output:

Call:
lm(formula = lm_y.g.a$res ~ lm_score.g.a$res:group + 0)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.438602 -0.105403 -0.001883  0.106990  0.470098 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
lm_score.g.a$res:group1  1.60419    0.02955   54.28   <2e-16 ***
lm_score.g.a$res:group2  2.58006    0.02915   88.52   <2e-16 ***
lm_score.g.a$res:group3  3.57188    0.03103  115.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.156 on 997 degrees of freedom
Multiple R-squared: 0.9602,	Adjusted R-squared:  0.96 
F-statistic:  8010 on 3 and 997 DF,  p-value: < 2.2e-16

The magical thing to note here is that these simple regression slopes on the residualized data are approximately equal to the partial regression coefficients from the full model: group 1's slope of 1.60 matches the score estimate of 1.62, and the group 2 and group 3 slopes of 2.58 and 3.57 match score plus the corresponding interaction terms (1.62 + 0.97 and 1.62 + 1.97). This is the heart of multiple regression!
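The match is exact when the residualization is done against all of the other columns of the full design matrix, interactions included (the Frisch-Waugh-Lovell theorem); above it is only approximate because the interaction columns were left out of the residualizing models. A sketch of the exact version for the score coefficient, reusing fit from above:

# Residualize y and the score column on *all* other columns of the full
# design matrix, then regress; the slope reproduces coef(fit)["score"].
X      = model.matrix(fit)
others = X[, colnames(X) != "score"]    # everything except score itself
ry     = resid(lm(y ~ others + 0))      # 'others' already includes an intercept column
rs     = resid(lm(X[, "score"] ~ others + 0))
coef(lm(ry ~ rs + 0))                   # equals coef(fit)["score"]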

Now we can make the plot:

# data.frame() converts the '$' in these names to '.', which gives the
# column names lm_y.g.a.res and lm_score.g.a.res used below.
df2 = data.frame(lm_y.g.a$res, lm_score.g.a$res, group)
dev.new()
qplot(lm_score.g.a.res, lm_y.g.a.res, color=group, data=df2) + stat_smooth(method='lm')
[Figure: partial regression plot of response on score, colored by group]

The slopes of the three trendlines are given by the parameter estimates above, and since group was included in both residualizing models, the residuals have mean zero within each group, so the intercepts are all zero.
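We can verify this by refitting the per-group lines with intercepts allowed; something like:

# The intercept and group offsets come out at (numerically) zero, and the
# slope terms reproduce the fit2 estimates in baseline-plus-offset form.
round(coef(lm(lm_y.g.a$res ~ lm_score.g.a$res * group)), 3)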

The R code for this example is available here.

References

<bibtex>
@book{wickham_ggplot2_2009,
  address   = {New York},
  title     = {ggplot2: Elegant Graphics for Data Analysis},
  publisher = {Springer},
  author    = {Hadley Wickham},
  year      = {2009}
}

@book{venables_modern_2002,
  address   = {New York},
  edition   = {4th},
  title     = {Modern Applied Statistics with S},
  publisher = {Springer},
  author    = {W. N. Venables and B. D. Ripley},
  year      = {2002}
}
</bibtex>
