Linear Regression and ANOVA in R

Linear Regression

Linear Regression in R

In linear regression, we are attempting to estimate constants m_1,\dots,m_p and c so that the formula y=m_1x_1+\dots+m_px_p+c “fits” all the given data points P_i=(x_{i,1},\dots,x_{i,p},y_i) up to experimental error. The variables x_j are referred to as the “exogenous” variables and the variable y is called the “endogenous” variable; loosely speaking, the values x_{i,j} are the ones we use to control the experiment, while y is the measurement we want to predict.

We write the error terms e_i as 
    e_i = y_i - \left(m_1 x_{i,1} + \dots + m_{p} x_{i,p} +c \right)
In these identities it is worth noting that only the y_i and the x_{i,j} are known. We need to determine the “optimal” m_1,\dots,m_p and c so that the following assumptions are upheld.

The assumptions behind normal linear regression are:

  1. Exogeneity: This is the assumption that all the errors are concentrated in the measurement of y_i. In other words, the values x_{i,j} are very precise and the experimental error is entirely contained in e_i.

  2. Independence: The experiments are independent and so e_i are independent random variables. In many cases, the weaker assumption that e_i and e_j have covariance 0 is enough.

  3. Identical: We assume that e_i are identically distributed. In fact, often this assumption is weakened to the assumption that all of them have the same variance \sigma^2. This weaker assumption is called homoskedasticity.

  4. Normality: We assume that the e_i are normally distributed around 0 so that all of them follow the distribution N(0,\sigma^2). This distribution is replaced by others in the Generalized Linear Model.

  5. Lack of Perfect Multi-collinearity: This is the assumption that we have enough measurements so that the column vectors \mathbf{x}_j=(x_{1,j},\dots,x_{N,j}) and the vector \mathbf{u}=(1,\dots,1) are linearly independent, i.e. they span a vector space of dimension p+1. Since we may as well assume that no variable x_j is a linear combination of the other x’s, this assumption is usually quite easy to ensure with a large number N of data points.
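To make these assumptions concrete, here is a minimal simulation sketch (all names and numbers below are made up for illustration) that generates data satisfying them and fits the model using R’s lm function:

# simulate data satisfying the assumptions (illustrative values only)
set.seed(42)
N  <- 200
x1 <- runif(N)                      # exogenous variables, measured exactly
x2 <- runif(N)
e  <- rnorm(N, mean = 0, sd = 0.1)  # independent, identically distributed normal errors
y  <- 2*x1 - 3*x2 + 1 + e           # endogenous variable with "true" m_1 = 2, m_2 = -3, c = 1
fit <- lm(y ~ x1 + x2)              # least-squares estimates of m_1, m_2 and c
coef(fit)                           # should be close to 2, -3 and 1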

Putting \mathbf{y} = (y_1,\dots,y_N) and \mathbf{e}=(e_1,\dots,e_N), the above equation can be written as 
    \mathbf{e} = \mathbf{y} - \left(m_1 \mathbf{x}_{1} + \dots + m_{p} \mathbf{x}_{p} +c\mathbf{u} \right)
Under these hypotheses, the maximum likelihood estimate for m_1,\dots,m_p and c is obtained by minimising the sum of squares of the errors e_i; it can be calculated by solving the following system of p+1 linear equations in the p+1 unknowns m_1, \dots, m_p and c:
\begin{align*}
    \mathbf{x}_1\cdot \mathbf{y} &= 
        m_1 \mathbf{x}_1\cdot\mathbf{x}_1                                
        + \dots + m_p \mathbf{x}_1\cdot\mathbf{x}_p
        + c \mathbf{x}_1\cdot\mathbf{u} \\
    & \vdots \\
    \mathbf{x}_p\cdot \mathbf{y} &=
        m_1 \mathbf{x}_p\cdot\mathbf{x}_1
        + \dots + m_p \mathbf{x}_p\cdot\mathbf{x}_p
        + c \mathbf{x}_p\cdot\mathbf{u} \\
    \mathbf{u}\cdot \mathbf{y}   &=
        m_1 \mathbf{u}\cdot\mathbf{x}_1
        + \dots + m_p \mathbf{u}\cdot\mathbf{x}_p
        + c \mathbf{u}\cdot\mathbf{u}
\end{align*}

The matrix defined by 
  Q=\begin{pmatrix}
        \mathbf{x}_1\cdot\mathbf{x}_1 & \dots & \mathbf{x}_1\cdot\mathbf{x}_p & \mathbf{x}_1\cdot\mathbf{u} \\
        \vdots & \ddots & \vdots & \vdots \\
        \mathbf{x}_p\cdot\mathbf{x}_1 & \dots & \mathbf{x}_p\cdot\mathbf{x}_p & \mathbf{x}_p\cdot\mathbf{u} \\
        \mathbf{u}\cdot\mathbf{x}_1 & \dots & \mathbf{u}\cdot\mathbf{x}_p & \mathbf{u}\cdot\mathbf{u}
    \end{pmatrix}
is the matrix of dot products (the Gram matrix) of the vectors \mathbf{x}_j and \mathbf{u}. It can be used to calculate the estimators \hat{m}_1,\dots,\hat{m}_p,\hat{c}. The vector \hat{\mathbf{y}}=\left(\hat{y}_1,\dots,\hat{y}_N\right) defined by 
    \hat{\mathbf{y}} = \hat{m}_1 \mathbf{x}_{1} + \dots + \hat{m}_{p} \mathbf{x}_{p} +\hat{c}\mathbf{u}
is called the vector of fitted values. The difference \mathbf{y}-\hat{\mathbf{y}} is called the vector of residuals; we denote it as \mathbf{r} in order to distinguish it from \mathbf{e} which is unknown!
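Continuing the simulation sketch above (fit is the hypothetical lm object constructed there), the fitted values and residuals can be read off directly, and they reconstruct y exactly:

yhat <- fitted(fit)                 # the vector of fitted values
r    <- residuals(fit)              # the vector of residuals
all.equal(as.numeric(yhat + r), y)  # y = yhat + r, so this should be TRUE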

An Example

We will use linear regression techniques to explore the performance of BS-MS students at IISER Mohali who have completed two years.

First, we read in the table of data.

atable <- read.csv("atable.csv")
atable[1:3,]

As you can see the table contains (anonymized) scores of each student in Biology, Chemistry, Maths and Physics as well as the completed credits in each. The column majsc contains the score in the chosen major and the column nmajsc contains the sum of the remaining scores.

Next, we calculate the performance indices from the scores.

atable$Bpi <- atable$Bsc/atable$Bcr
atable$Cpi <- atable$Csc/atable$Ccr
atable$Mpi <- atable$Msc/atable$Mcr
atable$Ppi <- atable$Psc/atable$Pcr
atable$majpi <- atable$majsc/atable$majcr
atable$nmajpi <- atable$nmajsc/atable$nmajcr

Note that majpi contains the performance of the student in the major and that nmajpi contains the performance of the student in the remaining subjects.

Finally, we create a new table (called a data.frame in R) which contains only the relevant data.

perfdat <- atable[,c("major","Bpi","Cpi","Mpi","Ppi","majpi","nmajpi")]
perfdat[1:3,]

Is Cpi a linear combination?

Each subject is tested independently. However, we also know that students who perform well in one subject often perform well in another. So let us look for a formula of the form 
    \tt{Cpi} = c + m_1 \tt{Bpi} + m_2 \tt{Mpi} + m_3 \tt{Ppi}
In other words, we are trying to write the performance in Chemistry as a linear combination of the performance in other subjects!

In R, this is done as follows:

lm_Cpi <- lm(Cpi ~ Bpi + Mpi + Ppi,data=perfdat)

We now look at a summary of the result of this linear regression model.

summary(lm_Cpi)

Call:
lm(formula = Cpi ~ Bpi + Mpi + Ppi, data = perfdat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.67700 -0.49614  0.02346  0.47709  1.93330 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.46995    0.16639   8.834   <2e-16 ***
Bpi          0.32134    0.02353  13.656   <2e-16 ***
Mpi          0.22455    0.02660   8.442   <2e-16 ***
Ppi          0.29066    0.03228   9.005   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7003 on 706 degrees of freedom
Multiple R-squared:  0.6992,    Adjusted R-squared:  0.6979 
F-statistic: 546.9 on 3 and 706 DF,  p-value: < 2.2e-16

Estimates

Whew! That is a lot of information! Let us try to break it down.

First of all, we can calculate the matrix Q as above.

Xmat <- as.matrix(cbind(perfdat[,c("Bpi","Mpi","Ppi")],1))
Qmat <- t(Xmat) %*% Xmat
Qmat
         Bpi      Mpi       Ppi        1
Bpi 42101.56 35036.75 37310.188 5386.250
Mpi 35036.75 30798.75 32117.188 4506.500
Ppi 37310.19 32117.19 34305.078 4828.375
1    5386.25  4506.50  4828.375  710.000

We use this to calculate the estimators by solving the above system of linear equations.

t(solve(Qmat,t(Xmat) %*% perfdat$Cpi))
           Bpi       Mpi       Ppi       1
[1,] 0.3213371 0.2245547 0.2906597 1.46995

We see that these are the same as the coefficients of the linear form as estimated by R.

coefficients(lm_Cpi)
(Intercept)         Bpi         Mpi         Ppi 
  1.4699502   0.3213371   0.2245547   0.2906597 

This is the same as the column called “Estimate” in the summary. These are the estimated values of c and m_1, m_2 and m_3. In other words, we have got the linear regression formula 
   \tt{Cpi} = 1.46995 + 0.32134 \cdot\tt{Bpi} + 0.22455 \cdot\tt{Mpi} + 0.29066 \cdot\tt{Ppi}
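To apply this formula to new data, we can use R’s predict function. Here is a small illustration (the scores 8, 7 and 7.5 are purely hypothetical); it simply evaluates the fitted linear formula at the given values:

predict(lm_Cpi, newdata = data.frame(Bpi = 8, Mpi = 7, Ppi = 7.5))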
As statisticians (and experimentalists!) we cannot be satisfied with just getting the answer. We must ask ourselves, “How good is this formula?” which is another way of asking “How well does it fit the data?”

Residuals

The primary source of all measurements of deviation from the model is the vector \mathbf{r} of residuals. It can be directly obtained in R as follows:

resid <- residuals(lm_Cpi)

As per our assumption, these values should be normally distributed around 0 with some common variance \sigma^2.

summary(resid)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.67700 -0.49610  0.02346  0.00000  0.47710  1.93300 

This is the first part of the summary of lm_Cpi that we saw above. (We will see below that the mean is 0 by construction of \mathbf{r}, which is why it is not displayed in that summary.)

The histogram we obtain is

hist(resid)

which seems to indicate that the distribution is as we might expect.
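As an additional standard diagnostic (not part of the summary above), a normal quantile–quantile plot compares the residuals with the normal distribution more directly:

qqnorm(resid)   # points should lie close to a straight line if the residuals are normal
qqline(resid)   # reference line through the quartiles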

Residual Standard Error

If we define 
    s^2 = \frac{\mathbf{r}\cdot\mathbf{r}}{N-p-1}
then (one can show that) this is an unbiased estimator for \sigma^2. Recall that \sigma^2 is the variance of each “experimental error” e_i. Thus s is called the “standard error of the equation” (the denominator reflects the fact that p+1 dimensions are “used up” by the estimation process).

degfree <- length(perfdat$Cpi) - 3 - 1   # N - p - 1 with p = 3
mys <- sqrt(sum(resid^2)/degfree)        # the estimate s
c(mys,degfree)
[1]   0.7002666 706.0000000

We see that this is the value called Residual standard error in the previous summary. The second number 706 is the number of degrees of freedom N-p-1.

Now \sigma^2 is determined by the experimental setup: it is a measure of the experimental error in the apparatus used to measure y. So s^2 alone is not a measure of how well the regression model fits the data.

R-squared

We next examine whether the variation within the y_i’s can be expressed as a combination of the variation among the x_{i,j}’s using the linear expression.

Our solution of the linear regression was based on the idea that \mathbf{r} is orthogonal to the vectors \mathbf{x}_j and to the vector \mathbf{u}. In particular, we have 
    0 = \mathbf{r}\cdot\mathbf{u} = r_1 + \dots + r_N 
It follows that the mean of the y_i’s is the same as the mean of the \hat{y}_i’s. Let \mu_y denote this mean.
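We can verify this orthogonality numerically for our fit; a quick check using the Xmat and resid objects constructed above:

crossprod(Xmat, resid)                      # all entries should be numerically zero
c(mean(perfdat$Cpi), mean(fitted(lm_Cpi)))  # the two means agree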

As a consequence of orthogonality of \mathbf{y}-\hat{\mathbf{y}} and \hat{\mathbf{y}}-\mu_y\mathbf{u}, one can prove the identity 
    \sum_i (y_i - \mu_y)^2 = \sum_i (y_i -\hat{y}_i)^2  +
    \sum_i (\hat{y}_i - \mu_y)^2 
This is written conceptually as 
    \mathrm{SS}_{\mathrm{tot}} =
        \mathrm{SS}_{\mathrm{res}} + \mathrm{SS}_{\mathrm{exp}}
where:

  • \mathrm{SS}_{\mathrm{tot}} is the total sum of squared differences
  • \mathrm{SS}_{\mathrm{res}} is the sum of squares of the residuals (or errors) (note that this is the same as \mathbf{r}\cdot\mathbf{r})
  • \mathrm{SS}_{\mathrm{exp}} is the explained sum of squared differences.

Now, if the model is good, most of the variance in the y_i’s should come from the second term. In other words, we should expect 
    R^2 = \frac{\sum_i (\hat{y_i}-\mu_y)^2}{\sum_i (y_i-\mu_y)^2}
    = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}
    to be close to 1 (it lies between 0 and 1). The number R^2 is called the coefficient of determination.

myN <- length(perfdat$Cpi)
sstot <- sum((perfdat$Cpi - mean(perfdat$Cpi))^2)   # total sum of squares
ssred <- sum(resid^2)                               # residual sum of squares r.r
ssexp <- sstot - ssred                              # explained sum of squares
rsquared <- ssexp/sstot
rsquared
[1] 0.6991628

The value of R^2 is what is displayed in the summary with the label Multiple R-squared. The expression 1-(1-R^2)\frac{N-1}{N-p-1} takes the number of independent parameters into account and so is called the Adjusted R-squared value.

arsquared <- 1 - (ssred/sstot)*(myN - 1)/degfree
arsquared

This agrees with the value labelled Adjusted R-squared (0.6979) in the summary.

As the number of experiments increases, one can show (by an application of the law of large numbers) that R^2 stabilises near its limiting value. Hence, looking at how close R^2 is to 1 is not by itself adequate to measure how good our linear expression is. It is merely an indicator that we have carried out enough experiments to determine a fit, if a linear formula holds.

How good is the fit?

Under the null hypothesis that the explanatory variables are not relevant to the measurement of y, the ratio 
    F = \frac{
            \mathrm{SS}_{\mathrm{exp}}/p
            }{
            \mathrm{SS}_{\mathrm{res}}/(N-p-1)
            }
obeys the F-distribution (Fisher–Snedecor distribution) with (p,N-p-1) degrees of freedom. Thus, having a large value of F (much larger than 1) gives an indication that the linear expression for y in terms of the x_j’s (with coefficients m_i and c) has some significance.

fstat <- (ssexp*degfree)/(ssred*3)
fstat
[1] 546.9281

We thus see this value displayed as the F-statistic in the summary of lm_Cpi. Along with it, we see a p-value. A low p-value indicates a strong rejection of the null hypothesis.
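This p-value can be computed directly from the F-distribution; a quick check using pf from base R (the result is far smaller than the 2.2e-16 threshold displayed in the summary):

pf(fstat, df1 = 3, df2 = degfree, lower.tail = FALSE)   # upper-tail probability of the F-statistic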

How good are the Estimators?

Similarly, we can also think of the estimators \hat{m}_i and \hat{c} as random variables. Under the standard assumptions, these are (for large numbers N of data points) normally distributed with variance given by \sigma^2\cdot(Q^{-1})_{i,i}. Using s^2 as an estimator for \sigma^2 allows us to estimate this variance. The square root of this estimate is called the standard error of the estimator \hat{m}_i.

reserr <- mys*sqrt(diag(solve(Qmat)))
reserr

These are the values displayed in the second column, called Std. Error, of the coefficients table in the summary of lm_Cpi: 0.02353 for Bpi, 0.02660 for Mpi, 0.03228 for Ppi and 0.16639 for the intercept. (Note that with our notation the standard error for \hat{c} comes last.)

The column called t value contains the ratio of the first two columns (Estimate divided by Std. Error). This is the value to be used with the Student t-distribution (if the errors e_i are normally distributed) to obtain the last column. The last column gives the p-value for the null hypothesis that this particular term in our formula is not significant, that is, that the corresponding coefficient is 0. In our case, we can see that all these values are low.
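These two columns can also be reproduced by hand; a sketch reusing the reserr and degfree objects from above (recall that in our ordering the intercept comes last):

est  <- coefficients(lm_Cpi)[c("Bpi", "Mpi", "Ppi", "(Intercept)")]  # reorder to match reserr
tval <- est/reserr                        # the "t value" column
pval <- 2*pt(-abs(tval), df = degfree)    # the two-sided "Pr(>|t|)" column
rbind(tval, pval)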

All the above can be used to indicate that there is indeed a non-trivial linear relationship between the Chemistry performance and the performance in the remaining subjects. However, it would be too much to conclude that we need not have examinations in Chemistry at all and just define the score for it as the above linear combination of the scores in the other subjects!

ANOVA

The above kind of analysis falls into the broad category of “analysis of variance”. The acronym ANOVA stands for ANalysis Of VAriance; however, it has the specific meaning of applying these tests in the context where the linear regression is associated with a partitioning of a population sample according to the values of a certain “factor”. (There is also MANOVA, or Multivariate ANOVA, which we will not discuss here.)

More precisely, suppose that all the y_i’s are measurements of a certain quantity on individuals drawn from a population. The variable x_{i,j} is an indicator of whether the i-th individual has a certain characteristic f_j; that is, x_{i,j} is 1 if the i-th individual belongs to the sub-population j and 0 otherwise. Moreover, we assume that each individual lies in only one sub-population. The question that we are posing is whether this stratification (factor-isation!) of the population into sub-populations is significant in the context of the measurement y.

The linear regression model is then of the form 
    y =  m_1 x_1 + m_2 x_2 + \dots + m_p x_p + c
where there are p distinct sub-populations with indicators x_i. This equation is short-hand for the assertion that y is distributed with mean c+m_j for individuals in the sub-population j. The ANOVA analysis will then assign a significance to these distinctions between the values of y for these sub-populations.
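In R we do not construct these indicator variables by hand; when a factor appears in a model formula, the indicators are generated automatically. A quick look at the coding for our data (note that R drops one level of the factor and treats it as a baseline, which is an equivalent but slightly different parametrization from the one described next):

head(model.matrix(~ major, data = perfdat))   # intercept plus one indicator column per non-baseline major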

Since the variables x_j are indicator variables, one can calculate the terms as follows. We take c=\mu to be the mean of all the y_i’s. We take m_j=\mu_j-\mu where \mu_j is the mean of y_i’s for those i which belong to the j-th population. To understand the total variance, we divide it as follows:

  • \sum_j n_j (\mu_j-\mu)^2 is the “explained” part of the sum of squares, where n_j denotes the size of the j-th sub-population.

  • \sum_j (n_j-1) s_j^2 is the error part of the sum of squares where s_j^2 denotes the sample variance of y’s within the j-th population; in other words 
(n_j-1)s_j^2=\sum_{x_{i,j}=1} (y_i-\mu_j)^2

The total sum of squares \sum_i (y_i-\mu)^2 is the sum of these two quantities since 
    \sum_{x_{i,j}=1} (y_i-\mu_j)=0
This allows one to use the F-distribution, as above, to check the significance of the deviations of the sub-population means from the overall mean.

Does performance differ as per the major?

We can ask whether performance differs significantly across the different choices of major. We use the choice of major as a factor to divide the population and check whether the performance in a particular subject (say Mathematics) varies significantly between these groups.

aov_Mpi <- aov(Mpi ~ major, data=perfdat)
summary(aov_Mpi)
             Df Sum Sq Mean Sq F value Pr(>F)
major         3   10.4   3.453   1.116  0.342
Residuals   706 2184.8   3.095               
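The Sum Sq column can be reproduced by hand using the decomposition described above; a minimal sketch with tapply (the values should match 10.4 and 2184.8 up to rounding):

grp_mean <- tapply(perfdat$Mpi, perfdat$major, mean)     # the mu_j
grp_size <- tapply(perfdat$Mpi, perfdat$major, length)   # the n_j
ss_major <- sum(grp_size*(grp_mean - mean(perfdat$Mpi))^2)                              # "explained" part
ss_resid <- sum(tapply(perfdat$Mpi, perfdat$major, function(y) sum((y - mean(y))^2)))   # error part
c(ss_major, ss_resid)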

There does not seem to be a very large difference, as the F value is just a bit bigger than 1. We can also check whether any particular pair of majors differs by using Tukey’s Honest Significant Difference method for further exploration.

tuk_Mpi <- TukeyHSD(aov_Mpi)
tuk_Mpi
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Mpi ~ major, data = perfdat)

$major
           diff        lwr       upr     p adj
C-B -0.07364897 -0.5216495 0.3743515 0.9744878
M-B -0.16156017 -0.6808957 0.3577754 0.8539038
P-B -0.31449864 -0.7772269 0.1482296 0.2985532
M-C -0.08791120 -0.6077190 0.4318966 0.9723304
P-C -0.24084967 -0.7041079 0.2224085 0.5384358
P-M -0.15293848 -0.6854919 0.3796149 0.8810952

We see that none of the pairwise differences are significant. Hence, it does not seem to be the case that performance in mathematics differs across different choices of majors.
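The same comparisons can be displayed graphically; plot has a method for TukeyHSD objects that draws the family-wise confidence intervals:

plot(tuk_Mpi)   # intervals that cross 0 correspond to non-significant differences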

Post-hoc Analysis

Methods like ANOVA and Tukey’s HSD are often used as exploratory tools for post-hoc analysis. These tools can be used to “discover” patterns in the data that had not been suspected earlier.

It is important to note that since the experiment was not designed to test the alternative hypothesis, the significance can be overstated due to erroneous interpretation. Specifically, these tools cannot confidently be said to indicate a specific pattern without further testing of that pattern. The tests are really of the null hypothesis that there is no pattern. Hence, a rejection of the hypothesis can merely indicate that there is a pattern, not what the pattern is. The exploratory results of post-hoc analysis can suggest a new hypothesis to be tested by means of a controlled experiment.

