Linear Regression and ANOVA in R
Linear Regression in R
In linear regression, we are attempting to estimate constants m1,…,mp and c so that the formula y = m1x1 + ⋯ + mpxp + c "fits" all the given data points Pi = (xi,1,…,xi,p,yi) up to experimental error. The variables xj are referred to as the "exogenous" variables and the variable y is called the "endogenous" variable; loosely speaking, the values xi,j are the ones we use to control the experiment, while y is the measurement we want to predict.
We write the error terms ei as ei = yi − (m1xi,1 + ⋯ + mpxi,p + c). In these identities it is worth noting that only the yi and the xi,j are known. We need to determine the "optimal" m1,…,mp and c so that the following assumptions are upheld.
The assumptions behind normal linear regression are:
Exogeneity: This is the assumption that all the errors are concentrated in the measurement of yi. In other words, the values xi,j are very precise and the experimental error is entirely contained in ei.
Independence: The experiments are independent and so ei are independent random variables. In many cases, the weaker assumption that ei and ej have covariance 0 is enough.
Identical: We assume that ei are identically distributed. In fact, often this assumption is weakened to the assumption that all of them have the same variance σ2. This weaker assumption is called homoskedasticity.
Normality: We assume that ei are normally distributed around 0 so that all of them follow the distribution N(0,σ2). This distribution is replaced by others in the General Linear Model.
Lack of Perfect Multi-collinearity: This is the assumption that we have enough measurements so that the column vectors xj = (x1,j,…,xN,j) and the vector u = (1,…,1) are linearly independent, i.e. they span a subspace of dimension p+1. Since we may as well assume that no variable xj is a linear combination of the other x's, this assumption is usually quite easy to ensure with a large number N of data points.
The matrix defined by
$$Q = \begin{pmatrix} x_1\cdot x_1 & \dots & x_1\cdot x_p & x_1\cdot u \\ \vdots & \ddots & \vdots & \vdots \\ x_p\cdot x_1 & \dots & x_p\cdot x_p & x_p\cdot u \\ u\cdot x_1 & \dots & u\cdot x_p & u\cdot u \end{pmatrix}$$
is the matrix of dot products (the Gram matrix) of the vectors x1,…,xp and u; it is closely related to the matrix of covariances of the xj. It can be used to calculate the estimators ˆm1,…,ˆmp,ˆc by solving the normal equations Q(ˆm1,…,ˆmp,ˆc)ᵀ = (x1⋅y,…,xp⋅y,u⋅y)ᵀ. The vector ˆy = (ˆy1,…,ˆyN) defined by ˆy = ˆm1x1 + ⋯ + ˆmpxp + ˆcu is called the vector of fitted values. The difference y−ˆy is called the vector of residuals; we denote it as r in order to distinguish it from e, which is unknown!
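This computation can be sketched on synthetic data; the toy model and all variable names below are illustrative and not part of the example that follows.

```r
# Sketch: estimate the coefficients by solving the normal equations Q b = X^T y.
# Synthetic data with known true coefficients m1 = 2, m2 = -3, c = 5.
set.seed(42)
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 2*x1 - 3*x2 + 5 + rnorm(N, sd = 0.5)   # experimental error e ~ N(0, 0.25)

X <- cbind(x1, x2, 1)        # last column is the vector u = (1, ..., 1)
Q <- t(X) %*% X              # the matrix Q of dot products above
b <- solve(Q, t(X) %*% y)    # estimators (m1hat, m2hat, chat)

# R's built-in lm() produces the same estimates (intercept listed first):
coef(lm(y ~ x1 + x2))
```

Solving the normal equations directly is fine for a sketch like this; `lm()` uses a numerically more stable QR decomposition internally.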
An Example
We will use linear regression techniques to explore the performance of BS-MS students at IISER Mohali who have completed two years.
First, we read in the table of data.
atable <- read.csv("atable.csv")
atable[1:3,]
|   | major <fctr> | Bcr <int> | Bsc <int> | Ccr <int> | Csc <int> | Mcr <int> | Msc <int> | Pcr <int> | Psc <int> |
|---|---|---|---|---|---|---|---|---|---|
| 1 | B | 16 | 100 | 16 | 102 | 12 | 90 | 16 | 112 |
| 2 | B | 16 | 100 | 16 | 104 | 12 | 48 | 16 | 80 |
| 3 | B | 16 | 100 | 16 | 106 | 12 | 54 | 16 | 108 |
As you can see, the table contains (anonymized) scores of each student in Biology, Chemistry, Maths and Physics as well as the completed credits in each. The column `majsc` contains the score in the chosen major and the column `nmajsc` contains the sum of the remaining scores; `majcr` and `nmajcr` are the corresponding credit totals.
Next, we calculate the performance indices from the scores
atable$Bpi <- atable$Bsc/atable$Bcr
atable$Cpi <- atable$Csc/atable$Ccr
atable$Mpi <- atable$Msc/atable$Mcr
atable$Ppi <- atable$Psc/atable$Pcr
atable$majpi <- atable$majsc/atable$majcr
atable$nmajpi <- atable$nmajsc/atable$nmajcr
Note that `majpi` contains the performance of the student in the major and that `nmajpi` contains the performance of the student in the remaining subjects.
Finally, we create a new table (called a `data.frame` in R) which contains only the relevant data.
perfdat <- atable[,c("major","Bpi","Cpi","Mpi","Ppi","majpi","nmajpi")]
perfdat[1:3,]
|   | major <fctr> | Bpi <dbl> | Cpi <dbl> | Mpi <dbl> | Ppi <dbl> | majpi <dbl> | nmajpi <dbl> |
|---|---|---|---|---|---|---|---|
| 1 | B | 6.25 | 6.375 | 7.5 | 7.00 | 6.25 | 6.909091 |
| 2 | B | 6.25 | 6.500 | 4.0 | 5.00 | 6.25 | 5.272727 |
| 3 | B | 6.25 | 6.625 | 4.5 | 6.75 | 6.25 | 6.090909 |
Is Cpi a linear combination?
Each subject is tested independently. However, we also know that students who perform well in one subject often perform well in another. So let us look for a formula of the form Cpi = c + m1⋅Bpi + m2⋅Mpi + m3⋅Ppi. In other words, we are trying to write the performance in Chemistry as a linear combination of the performance in the other subjects!
In R, this is done as follows:
lm_Cpi <- lm(Cpi ~ Bpi + Mpi + Ppi,data=perfdat)
We now look at a summary of the result of this linear regression model.
summary(lm_Cpi)
Call:
lm(formula = Cpi ~ Bpi + Mpi + Ppi, data = perfdat)
Residuals:
Min 1Q Median 3Q Max
-2.67700 -0.49614 0.02346 0.47709 1.93330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.46995 0.16639 8.834 <2e-16 ***
Bpi 0.32134 0.02353 13.656 <2e-16 ***
Mpi 0.22455 0.02660 8.442 <2e-16 ***
Ppi 0.29066 0.03228 9.005 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7003 on 706 degrees of freedom
Multiple R-squared: 0.6992, Adjusted R-squared: 0.6979
F-statistic: 546.9 on 3 and 706 DF, p-value: < 2.2e-16
Estimates
Whew! That is a lot of information! Let us try to break it down.
First of all, we can calculate the matrix Q as above.
Xmat <- as.matrix(cbind(perfdat[,c("Bpi","Mpi","Ppi")],1))
Qmat <- t(Xmat) %*% Xmat
Qmat
Bpi Mpi Ppi 1
Bpi 42101.56 35036.75 37310.188 5386.250
Mpi 35036.75 30798.75 32117.188 4506.500
Ppi 37310.19 32117.19 34305.078 4828.375
1 5386.25 4506.50 4828.375 710.000
We use this to calculate the estimators by solving the normal equations Q b = XT y, where b is the vector of coefficients (with ˆc last).
t(solve(Qmat,t(Xmat) %*% perfdat$Cpi))
Bpi Mpi Ppi 1
[1,] 0.3213371 0.2245547 0.2906597 1.46995
We see that these are the same as the coefficients of the linear form as estimated by R.
coefficients(lm_Cpi)
(Intercept) Bpi Mpi Ppi
1.4699502 0.3213371 0.2245547 0.2906597
This is the same as the column called "Estimate" in the summary. These are the estimated values of c and m1, m2 and m3. In other words, we have obtained the linear regression formula Cpi = 1.46995 + 0.32134⋅Bpi + 0.22455⋅Mpi + 0.29066⋅Ppi. As statisticians (and experimentalists!) we cannot be satisfied with just getting the answer. We must ask ourselves, "How good is this formula?" which is another way of asking "How well does it fit the data?"
Residuals
The primary source of all measurements of deviation from the model is the vector r of residuals. It can be directly obtained in R as follows:
resid <- residuals(lm_Cpi)
As per our assumption, these values should be normally distributed around 0 with some common variance σ2.
summary(resid)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.67700 -0.49610 0.02346 0.00000 0.47710 1.93300
This is the first part of the summary of `lm_Cpi` that we saw above. (We will see below that the mean is 0 by construction of r, and so it is not displayed in the summary.)
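The fact that the residuals of any fit with an intercept sum to zero is easy to check on synthetic data; the toy model below is illustrative.

```r
# Sketch: for any lm() fit that includes an intercept, the residual vector r
# is orthogonal to u = (1, ..., 1), so the residuals sum to (numerically) zero.
set.seed(1)
x   <- runif(50)
y   <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)

sum(residuals(fit))   # zero up to floating-point error
```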
The histogram we obtain is
hist(resid)
which seems to indicate that the distribution is as we might expect.
Residual Standard Error
If we define s2 = (r⋅r)/(N−p−1), then (one can show that) this is an unbiased estimator for σ2. Recall that σ2 is the variance of each "experimental error" ei. Thus s is called the "standard error of the equation" (the denominator reflects the fact that p+1 degrees of freedom are "used up" by the estimation process).
degfree <- length(perfdat$Cpi) - 3 - 1
mys <- sqrt(sum(resid^2)/degfree)
c(mys,degfree)
[1] 0.7002666 706.0000000
We see that this is the value called Residual standard error in the previous summary. The second number 706 is the number of degrees of freedom N−p−1.
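The formula for s can be checked against the value R itself reports; the toy data below is illustrative, while `summary(fit)$sigma` and `df.residual()` are standard parts of R's lm interface.

```r
# Sketch: s^2 = (r . r)/(N - p - 1) is exactly what summary.lm() reports
# as the "Residual standard error".
set.seed(7)
x   <- rnorm(80)
y   <- 2 + 3*x + rnorm(80, sd = 1.5)
fit <- lm(y ~ x)

s <- sqrt(sum(residuals(fit)^2) / df.residual(fit))  # here N - p - 1 = 78
c(s, summary(fit)$sigma)                             # the two values agree
```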
Now σ2 is determined by the experimental parameters: it is a measure of the experimental error inherent in the apparatus used to measure y. So s2 alone is not a measure of how well the regression model fits the data.
R-squared
We next examine whether the variation within the yi’s can be expressed as a combination of the variation among the xi,j’s using the linear expression.
Our solution of the linear regression was based on the idea that r is orthogonal to the vectors xj and to the vector u. In particular, we have 0 = r⋅u = r1 + ⋯ + rN. It follows that the mean of the yi's is the same as the mean of the ˆyi's. Let μy denote this mean.
As a consequence of orthogonality of y−ˆy and ˆy−μyu, one can prove the identity ∑i(yi−μy)2=∑i(yi−ˆyi)2+∑i(ˆyi−μy)2 This is written conceptually as SStot=SSres+SSexp where:
- SStot is the total sum of squared differences.
- SSres is the sum of squares of the residuals (or errors); note that this is the same as r⋅r.
- SSexp is the explained sum of squared differences.
Now, if the model is good, most of the variance in the yi's should come from the second term. In other words, we should expect R2 = ∑i(ˆyi−μy)2 / ∑i(yi−μy)2 = 1 − SSres/SStot to be close to 1 (it lies between 0 and 1). The number R2 is called the coefficient of determination.
myN <- length(perfdat$Cpi)
sstot <- (myN - 1)*var(perfdat$Cpi)  # = sum((Cpi - mean(Cpi))^2)
ssred <- (myN - 1)*var(resid)        # = r.r, since mean(resid) = 0
ssexp <- sstot - ssred
rsquared <- ssexp/sstot
rsquared
[1] 0.6991628
The value of R2 is what is displayed in the summary with the label Multiple R-squared. The expression 1 − (1−R2)⋅(N−1)/(N−p−1) takes the number of independent parameters into account and so is called the Adjusted R-squared value.
arsquared <- 1 - (1 - rsquared)*(myN - 1)/degfree
arsquared
[1] 0.6978845
As the number of experiments increases, one can show (by an application of the law of large numbers) that R2 increases to 1. Hence, looking at how close R2 is to 1 is not adequate to measure how good our linear expression is. It is merely an indicator that we have carried out enough experiments to determine a fit, if a linear formula holds.
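Both R2 and its adjusted variant can be recomputed from the residuals and checked against R's summary; the synthetic model below is illustrative, while `$r.squared` and `$adj.r.squared` are standard components of an lm summary.

```r
# Sketch: R^2 = 1 - SSres/SStot and the adjusted value, recomputed by hand.
set.seed(3)
N  <- 60; p <- 2
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- x1 + 0.5*x2 + rnorm(N)
fit <- lm(y ~ x1 + x2)

ssres <- sum(residuals(fit)^2)        # SSres = r . r
sstot <- sum((y - mean(y))^2)         # SStot
r2    <- 1 - ssres/sstot
ar2   <- 1 - (1 - r2)*(N - 1)/(N - p - 1)

c(r2 - summary(fit)$r.squared, ar2 - summary(fit)$adj.r.squared)  # both ~ 0
```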
How good is the fit?
Under the null hypothesis that the explanatory variables are not relevant to the measurement of y, the ratio F = (SSexp/p) / (SSres/(N−p−1)) obeys the F-distribution (Fisher–Snedecor distribution) with (p, N−p−1) degrees of freedom. Thus, a large value of F (much larger than 1) gives an indication that the linear expression of y in terms of the mi's and c has some significance.
fstat <- (ssexp*degfree)/(ssred*3)
fstat
[1] 546.9281
We thus see this value displayed as the F-statistic in the summary of `lm_Cpi`. Along with it, we see a p-value: the probability, under the null hypothesis, of observing an F-statistic at least this large. A low p-value indicates a strong rejection of the null hypothesis.
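The p-value can be reproduced from the F-statistic with `pf()`; the toy data below is illustrative, while `summary(fit)$fstatistic` is a standard component of an lm summary.

```r
# Sketch: the p-value for the overall fit is the upper tail of the
# F(p, N - p - 1) distribution at the observed F-statistic.
set.seed(11)
x   <- rnorm(40)
y   <- 2*x + rnorm(40)
fit <- lm(y ~ x)

fs <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
```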
How good are the Estimators?
Similarly, we can also think of the estimators ˆmi and ˆc as random variables. Under the standard assumptions, these are (for large numbers N of data points) normally distributed with variance given by σ2⋅(Q−1)i,i. Using s2 as an (unbiased) estimator for σ2 allows us to estimate this variance. The square root of this is called the standard error for the estimator ˆmi.
reserr <- as.numeric(mys)*sqrt(diag(solve(Qmat)))
reserr
[1] 0.02353082 0.02660022 0.03227911 0.16639497
In the second column of the coefficients table we see the standard error for each of these entries, calculated as above. (Note that with our notation the standard error for ˆc comes last.) The column called t value contains the ratio of the first two columns. This is the value to be used with the Student t-distribution (if the errors ei are normally distributed) to get the last column: the p-value for the null hypothesis that this particular term in our formula is not significant, i.e. that the corresponding coefficient is 0. In our case, we can see that all these values are low.
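The t value and Pr(>|t|) columns can likewise be reproduced by hand from the Estimate and Std. Error columns; the toy data below is illustrative, while `pt()` is R's Student-t distribution function.

```r
# Sketch: t value = Estimate / Std. Error, and the p-value is the two-sided
# tail of the Student-t distribution with N - p - 1 degrees of freedom.
set.seed(5)
x   <- rnorm(50)
y   <- 1 + 0.8*x + rnorm(50)
fit <- lm(y ~ x)

ct   <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
tval <- ct[, "Estimate"] / ct[, "Std. Error"]
pval <- 2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE)

cbind(tval - ct[, "t value"], pval - ct[, "Pr(>|t|)"])   # both columns ~ 0
```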
All the above can be used to indicate that there is indeed a non-trivial linear relationship between the Chemistry performance and the performance in the remaining subjects. However, it would be too much to conclude that we need not have examinations in Chemistry at all and just define the score for it as the above linear combination of the scores in the other subjects!
ANOVA
The above kind of analysis falls into the broad category of "analysis of variance". The acronym ANOVA stands for ANalysis Of VAriance; however, it has the specific meaning of applying these tests in the context where the linear regression is actually associated with a partitioning of a population sample according to the values of a certain "factor". (There is also MANOVA, or Multivariate ANOVA, which we will not discuss here.)
More precisely, suppose that all the yi's are measurements of a certain quantity across a population. The variable xi,j is an indicator of whether the i-th individual has a certain characteristic fj; it is 1 if the i-th individual belongs to the sub-population j and 0 otherwise. Moreover, we assume that each individual lies in exactly one sub-population. The question that we are posing is whether this stratification (factor-isation!) of the population into sub-populations is significant in the context of the measurement y.
The linear regression model is then of the form y=m1x1+m2x2+⋯+mpxp+c where there are p distinct sub-populations with indicators xi. This equation is short-hand for the assertion that y is distributed with mean c+mj for individuals in the sub-population j. The ANOVA analysis will then assign a significance to these distinctions between the values of y for these sub-populations.
Since the variables xj are indicator variables, one can calculate the terms as follows. We take c = μ to be the mean of all the yi's. We take mj = μj − μ, where μj is the mean of the yi's for those i which belong to the j-th sub-population. To understand the total variance, we divide it as follows:
- ∑j nj(μj−μ)2 is the "explained" part of the sum of squares, where nj denotes the size of the j-th sub-population.
- ∑j (nj−1)sj2 is the error part of the sum of squares, where sj2 denotes the sample variance of the y's within the j-th sub-population; in other words, (nj−1)sj2 = ∑{i : xi,j=1} (yi−μj)2.
The total sum of squares ∑i (yi−μ)2 is the sum of these two quantities, since ∑{i : xi,j=1} (yi−μj) = 0. This allows one to use the F-distribution to check the significance of the deviations of the individual sub-populations.
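This decomposition is easy to verify numerically on synthetic grouped data; the group labels and means below are illustrative.

```r
# Sketch: SStot = SSexp + SSres for a one-way layout with three groups.
set.seed(9)
g <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = c(0, 0.5, 1)[as.integer(g)])

mu    <- mean(y)                 # overall mean
muj   <- tapply(y, g, mean)      # per-group means
nj    <- tapply(y, g, length)    # group sizes
ssexp <- sum(nj * (muj - mu)^2)           # between-group ("explained") part
ssres <- sum((y - muj[as.integer(g)])^2)  # within-group (error) part
sstot <- sum((y - mu)^2)

all.equal(sstot, ssexp + ssres)  # TRUE
```

The between-group sum of squares computed here is exactly the `Sum Sq` entry for the factor in the `aov`/`anova` table.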
Does performance differ as per the major?
We can ask whether the performance in the major is significantly different for different choices of majors. We can use the choice of major as a factor to divide the population and see if the performance in a particular subject (say Mathematics) varies significantly across different choices of the major.
aov_Mpi <- aov(Mpi ~ major, data=perfdat)
summary(aov_Mpi)
Df Sum Sq Mean Sq F value Pr(>F)
major 3 10.4 3.453 1.116 0.342
Residuals 706 2184.8 3.095
There does not seem to be a very large difference, as the F-value is only a bit bigger than 1. We can also check whether any particular pair of majors differs by using Tukey's Honest Significant Difference method for further exploration.
tuk_Mpi <- TukeyHSD(aov_Mpi)
tuk_Mpi
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Mpi ~ major, data = perfdat)
$major
diff lwr upr p adj
C-B -0.07364897 -0.5216495 0.3743515 0.9744878
M-B -0.16156017 -0.6808957 0.3577754 0.8539038
P-B -0.31449864 -0.7772269 0.1482296 0.2985532
M-C -0.08791120 -0.6077190 0.4318966 0.9723304
P-C -0.24084967 -0.7041079 0.2224085 0.5384358
P-M -0.15293848 -0.6854919 0.3796149 0.8810952
We see that none of the pairwise differences are significant. Hence, it does not seem to be the case that performance in mathematics differs across different choices of majors.
Post-hoc Analysis
Methods like ANOVA and Tukey's HSD are often used as exploratory tools for post-hoc analysis. These tools can be used to "discover" patterns in the data that had not been suspected earlier.
It is important to note that since the experiment was not designed to test the alternative hypothesis, the significance can be overplayed due to erroneous interpretation. Specifically, these tools cannot confidently be said to indicate a specific pattern without further testing of that pattern. The tests are really of the null hypothesis that there is no pattern. Hence, a rejection of the hypothesis can merely indicate that there is a pattern, not what the pattern is. The exploratory results of post-hoc analysis can suggest a new hypothesis to be tested by means of a controlled experiment.