Lab Session 6
Lab Assignment 6
Kapil Paranjape
11/04/2018
Changing our viewpoint
All science is based on hypothesis about the world around us. Sometimes, we repeatedly find that results of experiments are not what we expect. This leads us to question the hypothesis. Since we cannot work without at least one hypothesis, we need to formulate an alternative hypothesis. The alternative hypothesis needs to be mutually exclusive from the original (or null) hypothesis but the two need not be exhaustive.
Some examples can clarify the situation:
Medicine
Physics
IISER
Coaching
In each case, we see that the hypothesis and the alternative hypothesis are mutually exclusive but not exhaustive.
The way in we test whether to abandon the null Hypothesis () or not is called “testing of hypothesis”. Note that we typically do not (expect to) find definitive evidence to accept the alternative Hypothesis (
). In the practice of science, we will often carry out further calculations and experiments based on the alternative hypothesis and then test it against some other alternative hypothesis.
To make this clear, when the Physics community accepted that experiments (like double slit experiment) had given significant evidence that Newtonian physics was to be rejected, this was not statistical proof that Quantum Mechanics was to be accepted.
In summary, the null hypothesis has a theoretical prediction which is a particular value (typically 0, but a non-zero value can always be translated to 0!). The alternative hypothesis provides a theory that has the prediction that the value is different from 0. The statistical process of testing hypothesis is to carry out experiments to check whether the value obtained is significantly different from 0 (or not).
Ozone levels
To make matters more concrete let us examine the data provided by the IISER Mohali environmental monitoring facility.
dat <- read.csv("summer_data_ozone_temp_solar.csv")
Let us compare the ozone levels for 9:00am and for 12:00noon.
summary(dat$o3_avg[dat$time==9])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.458 22.480 32.030 34.890 43.580 151.600 9
summary(dat$o3_avg[dat$time==12])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3.887 39.170 52.230 53.180 64.140 154.400 14
We notice that the levels at 12:00noon are higher. So we form a theory that this is so. Our null hypothesis is:
: Ozone levels do not depend on the time of day.
Our alternate hypothesis is:
: The ozone levels are higher at 12:00noon compared with 09:00am.
For each day we compare the ozone level
at 9:00am to the ozone level
at 12:00noon. We define
as
myx <- dat[dat$time==9,c("day","month","year","o3_avg")]
myy <- dat[dat$time==12,c("day","month","year","o3_avg")]
myxy <- merge(myx,myy,by=c("day","month","year"),suffixes=c(".nine",".noon"))
myxy <- myxy[!is.na(myxy$o3_avg.nine) & !is.na(myxy$o3_avg.noon),]
myxy[1:5,]
myz <- (myxy$o3_avg.nine < myxy$o3_avg.noon)
We can think of as a Bernoulli variable with
under
and with
under the alternate hypothesis.
Thus, the expected value of is
and so the expected value of the “statistic”
is
where
is number of days for which we have a measurement. Moreover, the variance is
for this variable. So the following is the expected distribution. We plot where our
lies as well (as a point in red).
myN <- length(myz)
myw <- sum(myz)
mym <- myN/2
mys <- sqrt(myN/4)
vals<- 0:myN
plot(vals,dbinom(vals,myN,p=0.5),ylab="")
points(myw,0,pch=16,cex=1.5,col="red")
We see that our point is so far outside the bulk of the distribution that our null hypothesis has to be rejected with high probability!
In other words, it is “clear”" that the time of day is having an effect on ozone levels.
A different approach
Note that we could also have calculated the differences and looked at the average
of these values.
myt <- (myxy$o3_avg.nine - myxy$o3_avg.noon)
Under the null hypothesis, these values would be close to 0 while under the alternate hypothesis, these values would vary significantly from 0. Since we have a large number of observations, we can expect the distribution of to be normal around 0 with deviation as determined by the variance. We can take the latter to be the sum of the variances of
and
.
myv <- mean(myt)
mysd <- sqrt((var(myxy$o3_avg.nine)+var(myxy$o3_avg.noon))/myN)
tvals <- seq(-20*mysd,20*mysd,length=100)
plot (tvals, dnorm(tvals,0,mysd), type="l", xlab="", ylab="", main="" )
points(myv,0,pch=16,cex=1.5,col="red")
Again, we see that the point representing our data is very far outside the range predicted by the null hypothesis.
Types of errors
Errors in testing are often classified into two broad types:
Type I error : Rejection of the null hypothesis when in fact the null hypothesis is true.
Type II error : Failing to reject the null hypothesis when in fact the null hypothesis is false.
It is good to be aware of reasons for such errors which are outlined in the notes.
Small sample size
A typical error is due to small sample size. Let us take such a small sample.
mysam <- sample(myN,10)
nz <- myz[mysam]
nt <- myt[mysam]
nxy <- myxy[mysam,]
Let us test the hypothesis as in the first method:
nN <- length(nz)
nw <- sum(nz)
nm <- nN/2
ns <- sqrt(nN/4)
vals<- 0:nN
plot(vals,dbinom(vals,nN,p=0.5),ylab="")
points(nw,0,pch=16,cex=1.5,col="red")
We see that the result is not so clear-cut anymore!
Now let us test the hypothesis as in the second method.
nv <- mean(nt)
nsd <- sqrt((var(nxy$o3_avg.nine)+var(nxy$o3_avg.noon))/nN)
tvals <- seq(-20*nsd,20*nsd,length=100)
plot (tvals, dnorm(tvals,0,nsd), type="l", xlab="", ylab="", main="" )
points(nv,0,pch=16,cex=1.5,col="red")
Here it actually looks as if the hypothesis in fact accepted! However, the correct statement is that the hypothesis is not rejected.
Thus, small sample sizes can result in errors.
Assignment
Load the data as in the previous assignment.
Do the above calculation for some chosen 1 month. Is the result still significant?
Divide the data into data for hot days (temperature > 30) and for cool days (temperate < 20)
Test the null hypothesis that there is no significant difference of ozone levels between hot and cool days.