Preparing Data

On this page, we create artificial data from random numbers. The rnorm function generates normally distributed random numbers. Its first argument is the number of values to generate, its second argument is the mean, and its third argument is the standard deviation (not the variance).

x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 10, 2)
x3 <- rnorm(100, 10, 3)
y <-  x1 + 2 * x2 + 3 * x3 + rnorm(100, 0, 1)
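
Since the third argument of rnorm is a standard deviation, the sample variance should come out near its square. The following is a minimal sketch to check this; the seed value 123 and the sample size of 10000 are arbitrary choices made here for illustration and are not part of the original example.

```r
# Fix the seed so the random draws are reproducible (123 is an arbitrary choice).
set.seed(123)

# Draw a large sample so the sample statistics sit close to the true values.
x <- rnorm(10000, 10, 2)   # mean = 10, sd = 2

mean(x)  # close to 10 (the second argument)
sd(x)    # close to 2  (the third argument)
var(x)   # close to 4  (the square of the third argument)
```

Calling set.seed before generating x1, x2, x3 and y in the same way would make the regression results repeatable from run to run.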

Multiple Regression Analysis by lm Function

We used the lm function for regression with a single explanatory variable. The same function also handles multiple regression analysis.

result <- lm(y ~ x1 + x2 + x3)

Check the result.

summary(result)

The summary function prints a summary of the regression analysis. The output looks like the following.


Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.96202 -0.62604 -0.01741  0.73180  2.64501 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.80572    1.21700   0.662     0.51    
x1           0.85806    0.09992   8.588  1.6e-13 ***
x2           2.03232    0.04711  43.140  < 2e-16 ***
x3           3.04166    0.03769  80.698  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9662 on 96 degrees of freedom
Multiple R-squared:  0.9887,	Adjusted R-squared:  0.9883 
F-statistic:  2798 on 3 and 96 DF,  p-value: < 2.2e-16

The summary shows that the regression equation is y = 0.85806 * x1 + 2.03232 * x2 + 3.04166 * x3 + 0.80572. Because the data are random numbers, you will not necessarily get the same result. The p-value of each slope coefficient is very small, so each one is significant at the 0.1% level. The intercept is not significant, but this is usually ignored: it only means that we cannot reject the hypothesis that the intercept is 0.
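
To make the link between the coefficient table and the regression equation concrete, here is a small sketch that applies the equation by hand and compares it with the fitted values stored by lm. The seed is an arbitrary assumption, so the numbers will differ from the output shown above.

```r
set.seed(123)  # arbitrary seed; results differ from the printed output above
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 10, 2)
x3 <- rnorm(100, 10, 3)
y  <- x1 + 2 * x2 + 3 * x3 + rnorm(100, 0, 1)
result <- lm(y ~ x1 + x2 + x3)

# Named vector of estimates: (Intercept), x1, x2, x3
b <- coef(result)

# Apply the regression equation by hand.
y_hat <- b["(Intercept)"] + b["x1"] * x1 + b["x2"] * x2 + b["x3"] * x3

# The hand-computed values should agree with what lm stores.
same <- all.equal(unname(y_hat), unname(fitted(result)))
print(same)
```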

Because there are now several explanatory variables, look at the Adjusted R-squared this time. In multiple regression, adding explanatory variables, even meaningless ones, never lowers the Multiple R-squared. The Adjusted R-squared compensates for this drawback by penalizing the number of explanatory variables.
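
The adjustment can be reproduced by hand with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The sketch below assumes an arbitrary seed and introduces a hypothetical variable called noise that is unrelated to y; the helper adj_r2 is also only illustrative.

```r
set.seed(123)  # arbitrary seed
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 10, 2)
x3 <- rnorm(100, 10, 3)
y  <- x1 + 2 * x2 + 3 * x3 + rnorm(100, 0, 1)
noise <- rnorm(100, 10, 4)  # hypothetical variable, independent of y

fit_small <- lm(y ~ x1 + x2 + x3)
fit_large <- lm(y ~ x1 + x2 + x3 + noise)

# Illustrative helper: adjusted R-squared computed from the plain R-squared.
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- length(residuals(fit))   # number of observations
  p  <- length(coef(fit)) - 1    # explanatory variables, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# The plain R-squared does not go down when a variable is added.
summary(fit_large)$r.squared >= summary(fit_small)$r.squared

# The hand-computed value should match what summary() reports.
all.equal(adj_r2(fit_large), summary(fit_large)$adj.r.squared)
```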

Significance of Coefficient

If a meaningless explanatory variable is included, you can notice it in the output of the summary function. First, generate a random variable x4 that is independent of y. Then run the multiple regression analysis again with x1 through x4 as the explanatory variables.

x4 <- rnorm(100, 10, 4)
result2 <- lm(y ~ x1 + x2 + x3 + x4)
summary(result2)

The output looks like the following.


Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.95356 -0.61666 -0.03673  0.74286  2.65140 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.780315   1.227547   0.636    0.527    
x1          0.856324   0.100668   8.506 2.55e-13 ***
x2          2.031220   0.047563  42.706  < 2e-16 ***
x3          3.041296   0.037908  80.228  < 2e-16 ***
x4          0.005898   0.024471   0.241    0.810    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.971 on 95 degrees of freedom
Multiple R-squared:  0.9887,	Adjusted R-squared:  0.9882 
F-statistic:  2078 on 4 and 95 DF,  p-value: < 2.2e-16

The coefficient of x4 is shown to be not significant. In other words, we cannot reject the hypothesis that the coefficient of x4 is zero.
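
Instead of reading the p-values off the printed table, they can also be pulled out programmatically. The following sketch assumes an arbitrary seed and a 5% threshold chosen only for illustration; with different random draws x4 may occasionally look significant by chance.

```r
set.seed(123)  # arbitrary seed
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 10, 2)
x3 <- rnorm(100, 10, 3)
y  <- x1 + 2 * x2 + 3 * x3 + rnorm(100, 0, 1)
x4 <- rnorm(100, 10, 4)
result2 <- lm(y ~ x1 + x2 + x3 + x4)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(result2))
p_values <- coefs[, "Pr(>|t|)"]

# Terms whose p-value is not below the 5% threshold (an illustrative cutoff).
print(names(p_values)[p_values >= 0.05])

# 95% confidence interval for the x4 coefficient.
ci_x4 <- confint(result2, "x4", level = 0.95)
print(ci_x4)
```

If the confidence interval for x4 contains zero, that matches the conclusion drawn from its p-value.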

Selection of Explanatory Variables

If some of the explanatory variables may be irrelevant, you can use the stepwise method, which adds or removes variables one at a time and keeps only those judged appropriate. R can do this automatically with the step function.

result3 <- lm(y~1)
step(result3, direction="both", scope=list(upper=~x1+x2+x3+x4))

The output looks like the following.


Start:  AIC=476.45
y ~ 1

       Df Sum of Sq     RSS    AIC
+ x3    1    9548.9  1946.6 300.87
+ x2    1    1748.3  9747.2 461.96
<none>              11495.5 476.45
+ x4    1       2.8 11492.7 478.43
+ x1    1       0.8 11494.7 478.45

Step:  AIC=300.87
y ~ x3

       Df Sum of Sq     RSS    AIC
+ x2    1    1755.6   191.0  70.73
+ x1    1     113.9  1832.7 296.84
<none>               1946.6 300.87
+ x4    1      10.2  1936.4 302.34
- x3    1    9548.9 11495.5 476.45

Step:  AIC=70.73
y ~ x3 + x2

       Df Sum of Sq    RSS    AIC
+ x1    1     125.6   65.4 -34.43
+ x4    1       3.9  187.1  70.66
<none>               191.0  70.73
- x2    1    1755.6 1946.6 300.87
- x3    1    9556.2 9747.2 461.96

Step:  AIC=-34.43
y ~ x3 + x2 + x1

       Df Sum of Sq    RSS    AIC
<none>                65.4 -34.43
+ x4    1       0.5   65.0 -33.15
- x1    1     125.6  191.0  70.73
- x2    1    1767.2 1832.7 296.84
- x3    1    9679.7 9745.1 463.94

Call:
lm(formula = y ~ x3 + x2 + x1)

Coefficients:
(Intercept)           x3           x2           x1  
     0.8057        3.0416        2.0323       0.8580

You will see that x4 is left out of the final model.
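
The AIC column in the trace can be checked by hand. For lm models, step relies on extractAIC, which computes n * log(RSS / n) + 2 * (number of coefficients). The sketch below assumes an arbitrary seed, is meant to be run at the top level of a script, and uses trace = 0 only to suppress the step-by-step printout; the names null_model and selected are illustrative.

```r
set.seed(123)  # arbitrary seed
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 10, 2)
x3 <- rnorm(100, 10, 3)
y  <- x1 + 2 * x2 + 3 * x3 + rnorm(100, 0, 1)
x4 <- rnorm(100, 10, 4)

# Start from the intercept-only model and let step() search up to x1..x4.
null_model <- lm(y ~ 1)
selected <- step(null_model, direction = "both",
                 scope = list(upper = ~ x1 + x2 + x3 + x4),
                 trace = 0)  # trace = 0 hides the intermediate tables

# Formula of the model that step() settled on.
print(formula(selected))

# Recompute the AIC that step() uses for an lm fit.
n   <- length(y)
rss <- sum(residuals(selected)^2)
k   <- length(coef(selected))          # coefficients, including the intercept
manual_aic <- n * log(rss / n) + 2 * k

# Should agree with the second element returned by extractAIC().
all.equal(manual_aic, extractAIC(selected)[2])
```

Keeping the return value of step, as done here with selected, gives a regular lm object that can be passed to summary or predict.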