MTH202(2017-18): Lab Assignment 5

Lab Assignment 5

Which coin?

We have a box with three coins with different biases which are known. We pick a coin at random and flip it 100 times. Can we decide on the basis of the results which was the “most likely” coin which was picked?

To answer this question, let $p_i$ for $i=1,2,3$ be the probability of getting Heads depending on the coin which was picked. Let $a$ Heads and $100-a$ Tails be the result of the experiment. The probability of that particular result with the $i$ -th coin is $L_i=\binom{100}{a} p_i^{a}(1-p_i)^{100-a}$ Note that the contant in front is the same, the only changing parameter is $p_i$ . This number $L_i$ is called the likelihood; it is the probability of obtaining the result which has been obtained if we had picked the $i$ -th coin. The value of $i$ for which $L_i$ is maximum is called the maximum likelihood estimate (MLE) for which coin was picked. It is that choice of $i$ for which the probability was highest was what actually occurred!

As usual, let us simulate and calculate.

coins <- c(0.3,0.5,0.7)
chosen <- sample(coins,1)

The vector coins is that of the $p_i$ ’s. The chosen one is randomly picked. We carry out one observation of 100 coin flips:

heads <- rbinom(1,100,chosen)

Now, we calculate the probability of the given result for each of the coins:

lcoins <- sapply(coins,function(p){dbinom(heads,100,prob=p);})
result <- data.frame(coins,lcoins)
result

We see that the largest value is for the second coin, so our maximum likelihood estimate for the chosen coin is the second coin. Let’s check!

chosen

[1] 0.5

We also notice that the values of likelihood are between 0 and 1 (as they are probabilities) and these are often small. So one can also take the “log”-likelihood. Since $\log$ is an order-preserving function, it is good enough. (Note that we will get negative values of logs, so the maximum is the one which is smallest in magnitude!)

result$llcoins <- sapply(coins,function(p){dbinom(heads,100,prob=p,log=TRUE)})
result

This shows a little more clearly that the second coin is the most likely.

We can also consider a situation where the possible values of $p$ vary continuously. In that case, how do we find the maximum likelihood estimate. We can plot:

curve(dbinom(heads,100,p=x),from=0.3,to=0.7,xlab="p",ylab="likelihood")

We see that the maximum occurs somewhere between $0.55$ and $0.6$ .

We can also use calculus to calculate the value of $p$ where $L(p)=\binom{100}{h}p^a(1-p)^{100-a}$ takes a maximum where $a$ is the value of heads. This is easily calculated, by solving $dL/dp=0$ for $p$ to get $p=a/100$ .

heads/100

[1] 0.57

Continuous distributions

The same problem can also be considered for (absolutely) continuous distributions. We can think of a distribution which has a continuous parameter $\theta$ . Now, we make a large number of observations. To find the maximum likelihood estimate for $\theta$ we can calculate the probability $L(\theta)$ of the event that already took place for each value of $\theta$ . The MLE is that $\hat{\theta}$ for which $L(\hat{\theta})$ is maximum. However, there is no “probability” for an actual value in this case! So we use the density function in this case. (This can be justified, but we will not do so here!)

The data set faithful in R gives the waiting times between eruptions of a water “geyser” called “Old Faithful” in USA.

faithful[1:5,]

We can try to model it using a waiting time distribution with a single parameter $\lambda$ . The distribution density is given by $\lambda\exp(-t\lambda)$ . This makes it easier to calculate the log-likelihood:

loglhood <- function(lambda){sum(dexp(faithful$waiting,rate=lambda,log=TRUE))}

Let us plot this.

x <- seq(0.001,0.1,0.001)
plot(x,sapply(x,loglhood),type='l',xlab="lambda",ylab="log-likelihood")

This seems to indicate that the MLE is somewhere in the range 0.01 to 0.02.

In fact, we can again use calculus to check that the MLE $\hat{\lambda}=1/T$ , where $T$ is the average waiting time between eruptions.

1/mean(faithful$waiting)

[1] 0.01410496

We can do a similar analysis of faithful$eruptions. This measures the quantum of eruption of the geyser.

hist(faithful$eruptions,breaks=15,main="Histogram of Eruptions",xlab="Duration")

This looks like a bimodal distribution with density something like $(1/2)(N(m_1,s)+N(m_2,s))$ where $s=(m_2-m_1)/6$ . Let us try to fit it to such a distribution. We first define functions to calculate the log-likelihood.

bimod <- function(x,m1,m2){log(dnorm(x,m1,(m2-m1)/6)+dnorm(x,m2,(m2-m1)/6))}
loglik <- function(m1,m2){      sum(sapply(faithful$eruptions,function(x){lbimod(x,m1,m2)}))}

The problem is that this is a function of two variables and so it might appear that we need a 3-dimensional plot. However, since we are trying to locate “peak”, we can make do with a contour plot.

m1range <- seq(1.5,3.0,length=50)
m2range <- seq(4.0,5.0,length=50)
llik <- matrix(NA,50,50)
for(i in 1:50)
  for(j in 1:50)
    llik[i,j] <- loglik(m1range[i],m2range[j])
contour(m1range,m2range,llik,levels=10*(-30:-10),drawlabels=FALSE,xlab=expression(m[1]),ylab=expression(m[2]))

We see that the most likely estimate for $m_1$ is about 2.05 and for $m_2$ is around 4.3. In such situations, solving the problem using calculus is possible, but somewhat more complicated than the visual method!

Note: This calculation does not say that the distribution $(1/2)(N(m_1,s)+N(m_2,s))$ (with $s=(m_2-m_1)/6$ ) fits the given data. It merely says that if this is the type of distribution, then $m_1=2.05$ and $m_2=4.3$ are (approximately) the MLE’s for these parameters.

Assignment

Take 10 different coins with unknown biases and choose one at random.

mycoins <- runif(10,min=0.2,max=0.8)
mychosen <- sample(mycoins,1)

Now flip the coin a certain number of times.

flips <- 107
heads <- rbinom(1,flips,chosen)

Apply the method given above to find maximum likelihood estimate for which coin it is.

Do assignment 1 with a different number of flips like 50 or 200. How good are is the MLE now?
Copy the data file /usr/local/lib/summer_data_ozone_temp_solar.csv to your project directory and load it (or load it directly).

dat <- read.csv('/usr/local/lib/summer_data_ozone_temp_solar.csv')
dat[1:5,]

Now, isolate the data for noon time ozone levels.

ozo <- dat$o3_avg[dat$time==12]
summary(ozo)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  6.075  41.470  53.760  52.250  62.330 103.600       7

hist(ozo)

Assuming that this belongs to a normal distribution $N(m,s)$ centred around the mean $m$ , estimate the maximum likelihood estimate for $s$ .

m <- mean(ozo)
ll(s) <- function(s){sum(dnorm(ozo,mean=m,sd=s,log=TRUE))}

Here ll is the log likelihood function. So make an appropriate plot and “guess” the value.

Last modified: Monday, 2 April 2018, 6:36 PM

MTH202 Probability and Statistics

Lab Assignment 5

Which coin?

Continuous distributions

Assignment

Hosted and maintained by IISER Mohali Computer Centre