Linear Regression and ANOVA in R
Linear Regression in R
In linear regression, we are attempting to estimate constants $a_1, \dots, a_k$ and $b$ so that the formula
$$ y = a_1 x_1 + a_2 x_2 + \cdots + a_k x_k + b $$
“fits” all the given data points $(x_{1,i}, \dots, x_{k,i}, y_i)$ up to experimental error. The variables $x_1, \dots, x_k$ are referred to as the “exogenous” variables and the variable $y$ is called the “endogenous” variable; loosely speaking, the values of $x_j$ are the ones we use to control the experiment, while $y$ is the measurement we want to predict.
We write the error terms as
$$ e_i = y_i - (a_1 x_{1,i} + a_2 x_{2,i} + \cdots + a_k x_{k,i} + b), \quad i = 1, \dots, N. $$
In these identities it is worth noting that only the $x_{j,i}$ and the $y_i$ are known. We need to determine the “optimal” $a_1, \dots, a_k$ and $b$ so that the following assumptions are upheld.
The assumptions behind normal linear regression are:
Exogeneity: This is the assumption that all the errors are concentrated in the measurement of $y$. In other words, the values $x_{j,i}$ are very precise and the experimental error is entirely contained in the $e_i$.
Independence: The experiments are independent and so $e_1, \dots, e_N$ are independent random variables. In many cases, the weaker assumption that $e_i$ and $e_j$ have covariance 0 for $i \neq j$ is enough.
Identical: We assume that $e_1, \dots, e_N$ are identically distributed. In fact, often this assumption is weakened to the assumption that all of them have the same variance $\sigma^2$. This weaker assumption is called homoskedasticity.
Normality: We assume that the $e_i$ are normally distributed around $0$ so that all of them follow the distribution $N(0, \sigma^2)$. This distribution is replaced by others in the General Linear Model.
Lack of Perfect Multi-collinearity: This is the assumption that we have enough measurements so that the column vectors $x_j = (x_{j,1}, \dots, x_{j,N})$ and the constant vector $(1, \dots, 1)$ span a vector space of dimension $k + 1$. Since we may as well assume that no variable $x_j$ is a linear combination of the other $x_l$'s, this assumption is usually quite easy to ensure with a large number $N$ of data points.
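To make these assumptions concrete, here is a minimal simulation sketch in R (the coefficients 2 and -1, the intercept 0.5, and the error standard deviation 0.3 are invented purely for illustration): the exogenous values are exact, and the errors are independent, identically distributed and normal, so lm should recover the chosen constants.
set.seed(1)
N  <- 200
x1 <- runif(N)                       # exogenous values: known exactly, no measurement error
x2 <- runif(N)
e  <- rnorm(N, mean = 0, sd = 0.3)   # errors: independent, identically N(0, sigma^2) distributed
y  <- 2*x1 - 1*x2 + 0.5 + e          # endogenous variable built from the (hypothetical) formula
coef(lm(y ~ x1 + x2))                # estimates should be close to 0.5 (intercept), 2 and -1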
It is convenient to rewrite all of this in matrix form. Let $X$ be the $N \times (k+1)$ matrix whose columns are the vectors $x_j = (x_{j,1}, \dots, x_{j,N})$ together with a final column of $1$'s, and let $\beta$ be the column vector $(a_1, \dots, a_k, b)$. The model then reads $y = X\beta + e$, and minimizing $\|y - X\beta\|^2$ leads to the normal equations
$$ (X^T X)\,\hat\beta = X^T y. $$
The matrix $Q$ defined by $Q = X^T X$ is (up to centering and scaling) the matrix of covariances of the $x_j$'s. It can be used to calculate the estimators via $\hat\beta = Q^{-1} X^T y$. The vector $\hat y$ defined by $\hat y = X\hat\beta$ is called the vector of fitted values. The difference $y - \hat y$ is called the vector of residuals; we denote it as $\hat e$ in order to distinguish it from $e$, which is unknown!
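Before turning to a real data set, here is a minimal sketch of this recipe using R's built-in mtcars data purely as a stand-in; it checks that solving the normal equations gives the same coefficients as lm.
X    <- as.matrix(cbind(mtcars[, c("wt", "hp")], 1))   # exogenous columns plus a constant column
y    <- mtcars$mpg                                     # endogenous variable
Q    <- t(X) %*% X                                     # Q = X^T X
beta <- solve(Q, t(X) %*% y)                           # estimators from the normal equations
yhat <- X %*% beta                                     # fitted values
ehat <- y - yhat                                       # residuals
coef(lm(mpg ~ wt + hp, data = mtcars))                 # same coefficients, intercept listed first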
An Example
We will use linear regression techniques to explore the performance of BS-MS students at IISER Mohali who have completed two years.
First, we read in the table of data.
atable <- read.csv("atable.csv")
atable[1:3,]
As you can see the table contains (anonymized) scores of each student in Biology, Chemistry, Maths and Physics as well as the completed credits in each. The column majsc
contains the score in the chosen major and the column nmajsc
contains the sum of the remaining scores.
Next, we calculate the performance indices from the scores.
atable$Bpi <- atable$Bsc/atable$Bcr
atable$Cpi <- atable$Csc/atable$Ccr
atable$Mpi <- atable$Msc/atable$Mcr
atable$Ppi <- atable$Psc/atable$Pcr
atable$majpi <- atable$majsc/atable$majcr
atable$nmajpi <- atable$nmajsc/atable$nmajcr
Note that majpi
contains the performance of the student in the major and that nmajpi
contains the performance of the student in the remaining subjects.
Finally, we create a new table (called a data.frame
in R
) which contains only the relevant data.
perfdat <- atable[,c("major","Bpi","Cpi","Mpi","Ppi","majpi","nmajpi")]
perfdat[1:3,]
Is Cpi a linear combination?
Each subject is tested independently. However, we also know that students who perform well in one subject often perform well in another. So let us look for a formula of the form
$$ Cpi = a_1 \cdot Bpi + a_2 \cdot Mpi + a_3 \cdot Ppi + b. $$
In other words, we are trying to write the performance in Chemistry as a linear combination of the performance in the other subjects!
In R, this is done as follows:
lm_Cpi <- lm(Cpi ~ Bpi + Mpi + Ppi,data=perfdat)
We now look at a summary of the result of this linear regression model.
summary(lm_Cpi)
Call:
lm(formula = Cpi ~ Bpi + Mpi + Ppi, data = perfdat)
Residuals:
Min 1Q Median 3Q Max
-2.67700 -0.49614 0.02346 0.47709 1.93330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.46995 0.16639 8.834 <2e-16 ***
Bpi 0.32134 0.02353 13.656 <2e-16 ***
Mpi 0.22455 0.02660 8.442 <2e-16 ***
Ppi 0.29066 0.03228 9.005 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7003 on 706 degrees of freedom
Multiple R-squared: 0.6992, Adjusted R-squared: 0.6979
F-statistic: 546.9 on 3 and 706 DF, p-value: < 2.2e-16
Estimates
Whew! That is a lot of information! Let us try to break it down.
First of all, we can calculate the matrix $Q = X^T X$ as above.
Xmat <- as.matrix(cbind(perfdat[,c("Bpi","Mpi","Ppi")],1))
Qmat <- t(Xmat) %*% Xmat
Qmat
Bpi Mpi Ppi 1
Bpi 42101.56 35036.75 37310.188 5386.250
Mpi 35036.75 30798.75 32117.188 4506.500
Ppi 37310.19 32117.19 34305.078 4828.375
1 5386.25 4506.50 4828.375 710.000
We use this to calculate the estimators by solving the above linear equation.
t(solve(Qmat,t(Xmat) %*% perfdat$Cpi))
Bpi Mpi Ppi 1
[1,] 0.3213371 0.2245547 0.2906597 1.46995
We see that these are the same as the coefficients of the linear form as estimated by R.
coefficients(lm_Cpi)
(Intercept) Bpi Mpi Ppi
1.4699502 0.3213371 0.2245547 0.2906597
This is the same as the column called “Estimate” in the summary. These are the estimated values of $a_1$, $a_2$, $a_3$ and $b$. In other words, we have got the linear regression formula
$$ Cpi \approx 0.321 \cdot Bpi + 0.225 \cdot Mpi + 0.291 \cdot Ppi + 1.470. $$
As statisticians (and experimentalists!) we cannot be satisfied with just getting the answer. We must ask ourselves, “How good is this formula?” which is another way of asking “How well does it fit the data?”
Residuals
The primary source of all measurements of deviation from the model is the vector of residuals. It can be directly obtained in R as follows:
resid <- residuals(lm_Cpi)
As per our assumption, these values should be normally distributed around 0 with some common variance $\sigma^2$.
summary(resid)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.67700 -0.49610 0.02346 0.00000 0.47710 1.93300
This is the first part of the summary of lm_Cpi
that we saw above. (We will see below that the mean is 0 by the choice of the estimators, and so it is not displayed in that summary.)
The histogram we obtain is
hist(resid)
which seems to indicate that the distribution is as we might expect.
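A normal Q-Q plot (an optional extra check, using the resid vector computed above) is often sharper than a histogram for spotting departures from normality.
qqnorm(resid)   # points should lie close to a straight line if the residuals are normal
qqline(resid)   # reference line through the first and third quartiles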
Residual Standard Error
If we define
$$ s^2 = \frac{\hat e \cdot \hat e}{N - k - 1} = \frac{\sum_i \hat e_i^2}{N - k - 1}, $$
then (one can show that) this is an unbiased estimator for $\sigma^2$. Recall that $\sigma^2$ is the variance of each “experimental error” $e_i$. Thus $s$ is called the “standard error of the equation” (the denominator $N - k - 1$ is due to the fact that $k + 1$ dimensions are “used up by the estimation process”).
degfree <- length(perfdat$Cpi)-3-1
mys <- sqrt(resid%*%resid/degfree[1])
c(mys,degfree)
[1] 0.7002666 706.0000000
We see that this is the value called Residual standard error in the previous summary. The second number, 706, is the number of degrees of freedom $N - k - 1$.
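The same value can also be extracted directly from the fitted model object, as a quick cross-check.
sigma(lm_Cpi)   # residual standard error as computed by R; should agree with mys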
Now $\sigma^2$ is determined by the experimental parameters. It is a measure of experimental error, determined by the experimental apparatus used to measure $y$. So $s$ alone is not a measure of how well the regression model fits the data.
R-squared
We next examine whether the variation within the $y_i$'s can be expressed as a combination of the variation among the $x_j$'s using the linear expression.
Our solution of the linear regression was based on the idea that $\hat e$ is orthogonal to the vectors $x_j$ and to the vector $(1, \dots, 1)$. In particular, we have
$$ \sum_i \hat e_i = 0. $$
It follows that the mean of the $\hat y_i$'s is the same as the mean of the $y_i$'s. Let $\bar y$ denote this mean.
As a consequence of the orthogonality of $\hat e$ and $\hat y - \bar y\,(1, \dots, 1)$, one can prove the identity
$$ \sum_i (y_i - \bar y)^2 = \sum_i \hat e_i^2 + \sum_i (\hat y_i - \bar y)^2. $$
This is written conceptually as
$$ SS_{tot} = SS_{res} + SS_{exp} $$
where:
$SS_{tot} = \sum_i (y_i - \bar y)^2$ is the total Sum of Square differences;
$SS_{res} = \sum_i \hat e_i^2$ is the Sum of Squares of the residuals (or errors) (note that this is the same as $\hat e \cdot \hat e$);
$SS_{exp} = \sum_i (\hat y_i - \bar y)^2$ is the explained Sum of Square differences.
Now, if the model is good, most of the variance in the $y_i$'s should come from the explained term. In other words, we should expect
$$ R^2 = \frac{SS_{exp}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} $$
to be close to 1 (it lies between 0 and 1). The number $R^2$ is called the coefficient of determination.
myN <- length(perfdat$Cpi)
sstot <- myN*var(perfdat$Cpi)
ssred <- myN*var(resid)
ssexp <- sstot - ssred
rsquared <- ssexp/sstot
rsquared
[1] 0.6991628
The value of $R^2$ is what is displayed in the summary with the label Multiple R-squared. The expression
$$ \bar R^2 = 1 - (1 - R^2)\,\frac{N}{N - k - 1} $$
takes the number of independent parameters into account and so is called the Adjusted R-squared value. (R itself uses $N - 1$ in place of $N$ here, which is why the value below differs very slightly from the 0.6979 reported in the summary.)
arsquared <- 1 - (ssred*(length(perfdat$Cpi)))/(sstot*degfree)
arsquared
[1] 0.6974583
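For comparison, both values can be extracted directly from the summary object; the adjusted value is the 0.6979 reported earlier, computed with $N - 1$ in place of $N$.
summary(lm_Cpi)$r.squared       # Multiple R-squared
summary(lm_Cpi)$adj.r.squared   # Adjusted R-squared as computed by R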
As the number of experiments increases, one can show (by an application of the law of large numbers) that $R^2$ increases to 1. Hence, looking at how close $R^2$ is to 1 is not adequate to measure how good our linear expression is. It is merely an indicator that we have carried out enough experiments to determine a fit, if a linear formula holds.
How good is the fit?
Under the null hypothesis that the explanatory variables are not relevant to the measurement of $y$, the ratio
$$ F = \frac{SS_{exp}/k}{SS_{res}/(N - k - 1)} $$
obeys the $F$-distribution (Fisher-Snedecor distribution) with $(k, N - k - 1)$ degrees of freedom. Thus, having a large value of $F$ (much larger than 1) gives an indication that the linear expression of $y$ in terms of the $x_j$'s and $b$ has some significance.
fstat <- (ssexp*degfree)/(ssred*3)
fstat
[1] 546.9281
We thus see this value displayed as the F-statistic in the summary of lm_Cpi. Along with it, we see a $p$-value. A low $p$-value indicates a strong rejection of the null hypothesis.
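This $p$-value can be reproduced as the upper tail of the $F(3, 706)$ distribution, continuing with the fstat and degfree values computed above.
pf(fstat, df1 = 3, df2 = degfree, lower.tail = FALSE)   # upper-tail probability of the F-distribution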
How good are the Estimators?
Similarly, we can also think of the estimators $\hat a_j$ and $\hat b$ as random variables. Under the standard assumptions, these are (for large numbers $N$ of data points) normally distributed with variance given by $\sigma^2\,(Q^{-1})_{jj}$. Using $s^2$ as an estimator for $\sigma^2$ allows us to calculate this variance. The square root of this is called the standard error for the estimator $\hat a_j$.
reserr <- mys*sqrt(diag(solve(Qmat)))*sqrt(degfree/(myN-1))
reserr
[1] 0.02348098 0.02654388 0.03221075 0.16604256
In the second column of the summary we see the standard error for each of these entries, which agrees (up to the slightly different normalization factor used in our code) with what we calculated above. (Note that with our notation the standard error for $\hat b$ comes last.) The column called t value contains the ratio of the first two columns. This is the value to be used with the Student-$t$ distribution (if the errors $e_i$ are normally distributed) to get the last column. The last column is the $p$-value for the null hypothesis that this particular term in our formula is not significant. In our case, we can see that all these values are low.
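As a quick check, the t values and p-values can be recomputed from the first two columns of the coefficient table of lm_Cpi.
ctab  <- summary(lm_Cpi)$coefficients                     # Estimate, Std. Error, t value, Pr(>|t|)
tvals <- ctab[, "Estimate"] / ctab[, "Std. Error"]        # ratio of the first two columns
pvals <- 2 * pt(-abs(tvals), df = df.residual(lm_Cpi))    # two-sided Student-t probabilities, 706 df
cbind(tvals, pvals)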
All the above can be used to indicate that there is indeed a non-trivial linear relationship between the Chemistry performance and the performance in the remaining subjects. However, it would be too much to conclude that we need not have examinations in Chemistry at all and just define the score for it as the above linear combination of the scores in the other subjects!
ANOVA
The above kind of analysis falls into the broad category of “analysis of variance”. The acronym ANOVA stands for ANalysis Of VAriance; however, it has the specific meaning of applying these tests in the context where the linear regression is actually associated with a partitioning of a population sample according to the values of a certain “factor”. (There is also MANOVA, or Multivariate ANOVA, which we will not discuss here.)
More precisely, suppose that all the $y_i$'s are values of a measurement taken from a certain population. The variables $x_{j,i}$ are actually indicators of whether the $i$-th individual has a certain characteristic $j$; hence $x_{j,i}$ is $1$ if the $i$-th individual belongs to the sub-population $P_j$ and 0 otherwise. Moreover, we assume that each individual lies in only one sub-population. The question that we are posing is whether this stratification (factor-isation!) of the population into sub-populations is significant in the context of the measurement $y$.
The linear regression model is then of the form
$$ y = a_1 x_1 + a_2 x_2 + \cdots + a_k x_k + b + e, $$
where there are $k$ distinct sub-populations with indicators $x_1, \dots, x_k$. This equation is short-hand for the assertion that $y$ is distributed with mean $a_j + b$ for individuals in the sub-population $P_j$. The ANOVA analysis will then assign a significance to these distinctions between the values of $a_j$ for these sub-populations.
Since the variables $x_j$ are indicator variables, one can calculate the terms as follows. We take $b$ to be the mean $\bar y$ of all the $y_i$'s. We take $a_j = \bar y_j - \bar y$, where $\bar y_j$ is the mean of the $y_i$'s for those $i$ which belong to the $j$-th sub-population. To understand the total variance, we divide it as follows:
$SS_{exp} = \sum_j n_j (\bar y_j - \bar y)^2$ is the “explained” part of the sum of squares, where $n_j$ denotes the size of the $j$-th sub-population.
$SS_{res} = \sum_j n_j s_j^2$ is the error part of the sum of squares, where $s_j^2$ denotes the sample variance of the $y_i$'s within the $j$-th sub-population; in other words
$$ s_j^2 = \frac{1}{n_j} \sum_{i \in P_j} (y_i - \bar y_j)^2. $$
The total sum of squares is the sum of these two quantities since
$$ \sum_i (y_i - \bar y)^2 = \sum_j \sum_{i \in P_j} (y_i - \bar y_j)^2 + \sum_j n_j (\bar y_j - \bar y)^2. $$
This allows one to use the $F$-distribution to check for the significance of the deviations of the individual sub-populations.
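As an illustration, here is this decomposition computed by hand for the Mathematics performance grouped by choice of major (using the perfdat table constructed earlier); the two quantities are exactly the Sum Sq entries in the aov summary of the next section.
grand      <- mean(perfdat$Mpi)                            # overall mean
nj         <- table(perfdat$major)                         # sub-population sizes n_j
meanj      <- tapply(perfdat$Mpi, perfdat$major, mean)     # sub-population means
ss_between <- sum(nj * (meanj - grand)^2)                  # explained part of the sum of squares
ss_within  <- sum((perfdat$Mpi - meanj[perfdat$major])^2)  # error part of the sum of squares
c(ss_between, ss_within)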
Does performance differ as per the major?
We can ask whether the performance in the major is significantly different for different choices of majors. We can use the choice of major as a factor to divide the population and see if the performance in a particular subject (say Mathematics) varies significantly across different choices of the major.
aov_Mpi <- aov(Mpi ~ major, data=perfdat)
summary(aov_Mpi)
Df Sum Sq Mean Sq F value Pr(>F)
major 3 10.4 3.453 1.116 0.342
Residuals 706 2184.8 3.095
There does not seem to be a very large difference as the $F$ value is just a bit bigger than 1. We can also check whether this is subject-specific by using Tukey’s Honest Significant Difference method for further exploration.
tuk_Mpi <- TukeyHSD(aov_Mpi)
tuk_Mpi
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Mpi ~ major, data = perfdat)
$major
diff lwr upr p adj
C-B -0.07364897 -0.5216495 0.3743515 0.9744878
M-B -0.16156017 -0.6808957 0.3577754 0.8539038
P-B -0.31449864 -0.7772269 0.1482296 0.2985532
M-C -0.08791120 -0.6077190 0.4318966 0.9723304
P-C -0.24084967 -0.7041079 0.2224085 0.5384358
P-M -0.15293848 -0.6854919 0.3796149 0.8810952
We see that none of the pairwise differences are significant. Hence, it does not seem to be the case that performance in mathematics differs across different choices of majors.
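The confidence intervals in the Tukey table can also be displayed graphically; intervals that cross zero correspond to non-significant differences.
plot(tuk_Mpi)   # confidence intervals for the pairwise differences of means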
Post-hoc Analysis
Methods like ANOVA and Tukey’s HSD are often used as exploratory tools for post-hoc analysis. These tools can be used to “discover” patterns in the data that had not been suspected earlier.
It is important to note that since the experiment was not designed to test the alternative hypothesis, the significance can be overstated due to erroneous interpretation. Specifically, these tools cannot confidently be said to indicate a specific pattern without further testing of that pattern. The tests are really tests of the null hypothesis that there is no pattern. Hence, a rejection of the hypothesis can merely indicate that there is a pattern, not what the pattern is. The exploratory results of post-hoc analysis can suggest a new hypothesis to be tested by means of a controlled experiment.