Lab Session 6

Lab Assignment 6

Changing our viewpoint

All science is based on hypothesis about the world around us. Sometimes, we repeatedly find that results of experiments are not what we expect. This leads us to question the hypothesis. Since we cannot work without at least one hypothesis, we need to formulate an alternative hypothesis. The alternative hypothesis needs to be mutually exclusive from the original (or null) hypothesis but the two need not be exhaustive.

Some examples can clarify the situation:

  • Medicine

    H_0: the new and old drug are equally effective.

    H_a: the new drug is more effective

  • Physics

    H_0: Newtonian mechanics holds

    H_a: Quantum mechanics holds

  • IISER

    H_0: IISER education makes no change to students

    H_a: IISER education adds value to students

  • Coaching

    H_0: Coaching classes have no effect on JEE performance

    H_a: Coaching classes improve JEE performance

In each case, we see that the hypothesis and the alternative hypothesis are mutually exclusive but not exhaustive.

The way in we test whether to abandon the null Hypothesis (H_0) or not is called “testing of hypothesis”. Note that we typically do not (expect to) find definitive evidence to accept the alternative Hypothesis (H_a). In the practice of science, we will often carry out further calculations and experiments based on the alternative hypothesis and then test it against some other alternative hypothesis.

To make this clear, when the Physics community accepted that experiments (like double slit experiment) had given significant evidence that Newtonian physics was to be rejected, this was not statistical proof that Quantum Mechanics was to be accepted.

In summary, the null hypothesis H_0 has a theoretical prediction which is a particular value (typically 0, but a non-zero value can always be translated to 0!). The alternative hypothesis provides a theory that has the prediction that the value is different from 0. The statistical process of testing hypothesis is to carry out experiments to check whether the value obtained is significantly different from 0 (or not).

Ozone levels

To make matters more concrete let us examine the data provided by the IISER Mohali environmental monitoring facility.

dat <- read.csv("summer_data_ozone_temp_solar.csv")

Let us compare the ozone levels for 9:00am and for 12:00noon.

summary(dat$o3_avg[dat$time==9])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.458  22.480  32.030  34.890  43.580 151.600       9 
summary(dat$o3_avg[dat$time==12])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.887  39.170  52.230  53.180  64.140 154.400      14 

We notice that the levels at 12:00noon are higher. So we form a theory that this is so. Our null hypothesis is:

H_0: Ozone levels do not depend on the time of day.

Our alternate hypothesis is:

H_a: The ozone levels are higher at 12:00noon compared with 09:00am.

For each day d_i we compare the ozone level X_i at 9:00am to the ozone level Y_i at 12:00noon. We define Z_i as 
  Z_i = \begin{cases}
            1 & X_i < Y_i \\
            0 & \text{otherwise}
        \end{cases}

myx <- dat[dat$time==9,c("day","month","year","o3_avg")]
myy <- dat[dat$time==12,c("day","month","year","o3_avg")]
myxy <- merge(myx,myy,by=c("day","month","year"),suffixes=c(".nine",".noon"))
myxy <- myxy[!is.na(myxy$o3_avg.nine) & !is.na(myxy$o3_avg.noon),]
myxy[1:5,]
myz <- (myxy$o3_avg.nine < myxy$o3_avg.noon)

We can think of Z_i as a Bernoulli variable with p=0.5 under H_0 and with p>0.5 under the alternate hypothesis.

Thus, the expected value of Z_i is 1/2 and so the expected value of the “statistic” w=\sum_i Z_i is N/2 where N is number of days for which we have a measurement. Moreover, the variance is N/4 for this variable. So the following is the expected distribution. We plot where our w lies as well (as a point in red).

myN <- length(myz)
myw <- sum(myz)
mym <- myN/2
mys <- sqrt(myN/4)
vals<- 0:myN
plot(vals,dbinom(vals,myN,p=0.5),ylab="")
points(myw,0,pch=16,cex=1.5,col="red")

We see that our point is so far outside the bulk of the distribution that our null hypothesis H_0 has to be rejected with high probability!

In other words, it is “clear”" that the time of day is having an effect on ozone levels.

A different approach

Note that we could also have calculated the differences T_i=X_i-Y_i and looked at the average v=(1/N)\sum_i T_i of these values.

myt <- (myxy$o3_avg.nine - myxy$o3_avg.noon)

Under the null hypothesis, these values would be close to 0 while under the alternate hypothesis, these values would vary significantly from 0. Since we have a large number of observations, we can expect the distribution of w to be normal around 0 with deviation as determined by the variance. We can take the latter to be the sum of the variances of x and y.

myv <- mean(myt)
mysd <- sqrt((var(myxy$o3_avg.nine)+var(myxy$o3_avg.noon))/myN)
tvals <- seq(-20*mysd,20*mysd,length=100)
plot (tvals, dnorm(tvals,0,mysd), type="l", xlab="", ylab="", main="" )
points(myv,0,pch=16,cex=1.5,col="red")

Again, we see that the point representing our data is very far outside the range predicted by the null hypothesis.

Types of errors

Errors in testing are often classified into two broad types:

Type I error : Rejection of the null hypothesis when in fact the null hypothesis is true.

Type II error : Failing to reject the null hypothesis when in fact the null hypothesis is false.

It is good to be aware of reasons for such errors which are outlined in the notes.

Small sample size

A typical error is due to small sample size. Let us take such a small sample.

mysam <- sample(myN,10)
nz <- myz[mysam]
nt <- myt[mysam]
nxy <- myxy[mysam,]

Let us test the hypothesis as in the first method:

nN <- length(nz)
nw <- sum(nz)
nm <- nN/2
ns <- sqrt(nN/4)
vals<- 0:nN
plot(vals,dbinom(vals,nN,p=0.5),ylab="")
points(nw,0,pch=16,cex=1.5,col="red")

We see that the result is not so clear-cut anymore!

Now let us test the hypothesis as in the second method.

nv <- mean(nt)
nsd <- sqrt((var(nxy$o3_avg.nine)+var(nxy$o3_avg.noon))/nN)
tvals <- seq(-20*nsd,20*nsd,length=100)
plot (tvals, dnorm(tvals,0,nsd), type="l", xlab="", ylab="", main="" )
points(nv,0,pch=16,cex=1.5,col="red")

Here it actually looks as if the hypothesis in fact accepted! However, the correct statement is that the hypothesis is not rejected.

Thus, small sample sizes can result in errors.

Assignment

  1. Load the data as in the previous assignment.

  2. Do the above calculation for some chosen 1 month. Is the result still significant?

  3. Divide the data into data for hot days (temperature > 30) and for cool days (temperate < 20)

  4. Test the null hypothesis that there is no significant difference of ozone levels between hot and cool days.


पिछ्ला सुधार: सोमवार, 9 अप्रैल 2018, 7:32 अपराह्न