Lab Assignment 3

Law of Large Numbers

The Law of Large Numbers can be interpreted in statistical thinking as follows. If we take a large number of samples (with replacement) from a population and measure a certain quantity, then the average of the sample values obtained comes close to the population average with high probability. (We assume, as usual, that each individual in the population is equally likely to be picked.) In other words, for a large enough sample, the sample average is close to the population average with high probability. Let us examine this by simulation.
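
As a quick illustration before we turn to the lab data, the following sketch draws samples of increasing size from a made-up population (a Poisson(3) population of size 10000, chosen only for illustration) and computes the sample averages; these should settle near the population average.

pop <- rpois(10000, lambda = 3)   # a made-up population, only for illustration
mean(pop)                         # the population average
sizes <- c(10, 100, 1000, 5000)
sapply(sizes, function(n) mean(sample(pop, n, replace = TRUE)))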

We first import some data. (To do this we need to use a bash shell chunk rather than an R chunk.)

cp /usr/local/lib/dat.csv .

We can now import it into our computation (this can also be done by using the Import Dataset button on the top-right tab):

dat <- read.csv("dat.csv")

We observe that there are two columns in this dataset. Let us plot histograms of each to get a sense of the data.

hist(dat$X)

hist(dat$Y)

Now we calculate the basic attributes of this dataset.

num <- dim(dat)[1]
num
[1] 180
summary(dat$X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   1.500   1.472   2.000   4.000 
mX <- mean(dat$X)
summary(dat$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    9.50   12.00   11.41   14.00   15.00 
mY <- mean(dat$Y)

For later use, we also need the variances:

vX <- var(dat$X)
vX
[1] 1.370732
vY <- var(dat$Y)
vY
[1] 9.269483

By Chebyshev’s inequality, the probability that the sample average deviates from the population average by at least c is at most s^2/(n c^2), where s^2 is the variance and n is the sample size. Let us see what we get for samples of size 15.

samn <- 15
cX <- 0.7
pX <- vX/(samn*(cX)^2)
pX
[1] 0.1864942
cY <- 2.0
pY <- vY/(samn*(cY)^2)
pY
[1] 0.1544914

We can see that these probabilities are fairly small, so such large deviations are unlikely!

We now calculate the sample averages to see how close these are to the actual values.

First we create a random sample of size 15.

sdat <- dat[sample(c(1:num), samn), ]

Now we calculate the averages for this sample:

smX <- mean(sdat$X)
smX
[1] 1.9
smY <- mean(sdat$Y)
smY
[1] 10.76667

Thus the actual deviations can be compared with the projected deviations:

c(abs(smX-mX),cX)
[1] 0.4277778 0.7000000
c(abs(smY-mY),cY)
[1] 0.6416667 2.0000000

We see that the actual deviations are substantially smaller than the bounds we allowed for!
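
A single sample could of course be lucky. One way to check the Chebyshev bound more seriously is to repeat the sampling many times and estimate how often the deviation actually exceeds cX. The sketch below does this for the X column, using 1000 repetitions (an arbitrary choice):

reps <- 1000
devX <- replicate(reps, abs(mean(dat[sample(c(1:num), samn), "X"]) - mX))
mean(devX > cX)   # estimated probability of a deviation larger than cX
pX                # the Chebyshev bound computed above

The estimated probability will typically come out well below the bound pX, which is consistent with Chebyshev's inequality being a rather conservative bound.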

Poisson vs Binomial

The Poisson distribution is typically used as a good approximation to the Binomial distribution. Let us define a function that compares these two distributions.

comp <- function(p, N, k){
  a <- p*N                        # expected number of successes, the Poisson parameter
  # Binomial probabilities (blue), plotted against the number of successes k
  plot(k, dbinom(k, N, p), xlab="successes", ylab="probs", col="blue")
  # Poisson approximation (red)
  points(k, dpois(k, a), col="red")
}

The value p represents the probability of success in a single Bernoulli trial. The Binomial distribution is the one associated with N independent trials of this type. We compare the probabilities of k successes, for k ranging over a vector of values. The blue colour represents the “real” probabilities coming from the Binomial distribution and the red colour gives the Poisson approximation.

The approximation is supposed to work well when:

  • The probability p of success is quite small.
  • The number N of Bernoulli trials (in the Binomial) is very large.
  • We are determining the probability of a small number k of successes.

Note that the first two conditions need to be coordinated so that the expected number of successes a = Np stays a known constant. This constant is the parameter of the Poisson distribution.

p <- 1/5
N <- 10
ks <- c(0:10)
comp(p,N,ks)
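
For contrast, here is a run (with one possible choice of values) where the conditions above are closer to being met: N is much larger and p much smaller, chosen so that a = Np = 2 is the same as before. The red Poisson points should now sit almost on top of the blue Binomial ones.

p <- 1/500
N <- 1000
ks <- c(0:10)
comp(p,N,ks)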

Assignment Problems

  1. Try other values of cX and cY and other sample sizes with the same data set, and see how well the weak law of large numbers works.

  2. Vary the values of a (via values of N and p) and k to get a feeling for when the Poisson distribution is a good approximation to the Binomial distribution and when it is not.

  3. Repeat the first exercise with other data sets that are built into R.

