Lab Assignment 2

Frequency

Given a certain population of individuals and some determinable attribute of this population, we can make a chart showing the number of individuals who share a particular value for that attribute.

For example, the built-in dataset discoveries in R contains the number of discoveries made each year from 1860 to 1959. For each k=0,1,\dots we can count the number of years in which exactly k discoveries were made. In the following plot, the x-axis gives the number of discoveries and the y-axis gives the frequencies.

hist(discoveries, xlab="number of discoveries", ylab="", main="",breaks=12)

Instead of using this built-in dataset, we can “fake” one of our own which is more “regular” by repeating various values.

fake <- c(rep(1:21,1:21),rep(7:0,(0:7)*3)+22)
hist(fake,breaks=seq(-0.5,29.5,by=1))
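
The construction of fake relies on rep accepting a vector of repeat counts, and the histogram relies on seq placing the breaks halfway between integers so that each integer gets its own bar. A minimal sketch on smaller inputs (the names small and brks are illustrative, not part of the lab):

```r
# rep with a vector of counts: the i-th value is repeated times[i] times
small <- rep(1:3, 1:3)
small                      # 1 2 2 3 3 3

# a count of 0 drops the value entirely, as happens to 7 in rep(7:0, (0:7)*3)
rep(2:0, (0:2) * 3)        # 1 1 1 0 0 0 0 0 0

# breaks at half-integers so that each integer falls in its own bin
brks <- seq(-0.5, 3.5, by = 1)
brks                       # -0.5 0.5 1.5 2.5 3.5
```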

We can calculate the “standard” statistics of this data. The mode is shared by 21 and 22, each of which occurs 21 times.

median(fake)
[1] 18
mean(fake)
[1] 16.91111
var(fake)
[1] 37.93475
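
These values can be reproduced from the definitions. The sketch below uses only base R; note that var divides by n - 1 (the sample variance), not by n. The names n, m and v are illustrative.

```r
fake <- c(rep(1:21, 1:21), rep(7:0, (0:7) * 3) + 22)

n <- length(fake)                   # 315 observations
m <- sum(fake) / n                  # same as mean(fake)
v <- sum((fake - m)^2) / (n - 1)    # same as var(fake), which uses n - 1

all.equal(m, mean(fake))            # TRUE
all.equal(v, var(fake))             # TRUE
```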

Thus, we see that mean, median and mode can be different! Instead of calculating each statistic separately, R's summary function reports the minimum, quartiles, median, mean and maximum in a single call.

summary(fake)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   13.00   18.00   16.91   22.00   28.00 

We can also calculate the mode by finding which value occurs most frequently.

vals <- unique(fake)
tab <- tabulate(match(fake, vals))
vals[tab == max(tab)]
[1] 21 22
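
An equivalent way to find the mode, sketched here with base R's table, also shows how many times each modal value occurs. The names counts and modes are illustrative.

```r
fake <- c(rep(1:21, 1:21), rep(7:0, (0:7) * 3) + 22)

counts <- table(fake)                          # frequency of each distinct value
modes <- as.numeric(names(counts)[counts == max(counts)])
modes                                          # 21 22
max(counts)                                    # 21: both values occur 21 times
```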

Multiple Choice

Instead of fake data we can also simulate an experiment.

Here we consider an experiment in which students answer 10 four-choice questions on a test by selecting answers at random. For simplicity, the correct answer to every question is assumed to be choice 1. A correct answer earns 3 marks and a wrong answer earns -1, so a typical score sheet looks like the following.

sc <- sample(c(3,-1,-1,-1),10,replace=TRUE)
sc
 [1]  3 -1 -1  3 -1 -1 -1  3 -1 -1

The actual score is given by

sum(sc)
[1] 2

Let us now assume that 1000 students take the test and look at their scores.

escores <- replicate(1000,sum(sample(c(3,-1,-1,-1),10,replace=TRUE)))
summary(escores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-10.000  -2.000  -2.000  -0.148   2.000  18.000 
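
The sample mean is close to 0, which is what the marking scheme predicts: a random guess is correct with probability 1/4, so each question contributes 3(1/4) + (-1)(3/4) = 0 marks on average. A small sketch of that arithmetic (the variable names are illustrative):

```r
p_correct <- 1/4                                         # one right choice out of four
per_question <- 3 * p_correct + (-1) * (1 - p_correct)   # average marks per question
expected_total <- 10 * per_question                      # ten independent questions
expected_total                                           # 0
```

With this scheme, random guessing neither gains nor loses marks on average.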

We can also plot a histogram of these scores, as above.

hist(escores,breaks=seq(-10.5,30.5,by=1))

We can actually calculate the precise probability of each possible score; the possible scores run from -10 to 30 in steps of 4. The probability of getting k correct answers (3 marks each) and 10-k wrong answers (-1 each) is P(C_k)=\binom{10}{k} (1/4)^k (3/4)^{10-k}. The score associated with this is 3k-(10-k), that is, 4k-10. We can plot these probabilities for k between 0 and 10.

ks <- c(0:10)
s_to_p <- data.frame(score=4*ks-10,prob=dbinom(ks,10,1/4))
plot(s_to_p,type="l")
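
One way to check the simulation against these exact probabilities is to tabulate the simulated scores over the same eleven possible values. This is only a sketch: the seed and the names emp and cmp are arbitrary choices, not part of the lab.

```r
set.seed(1)  # arbitrary seed, only so that the run is repeatable
escores <- replicate(1000, sum(sample(c(3, -1, -1, -1), 10, replace = TRUE)))

ks <- 0:10
s_to_p <- data.frame(score = 4 * ks - 10, prob = dbinom(ks, 10, 1/4))

# relative frequency of each possible score, including scores never observed
emp <- table(factor(escores, levels = s_to_p$score)) / length(escores)

# exact and simulated probabilities side by side
cmp <- cbind(s_to_p, simulated = as.numeric(emp))
cmp
```

With 1000 students the simulated column should be close to prob for the common scores, while the rare high scores may not appear at all.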

To see clearly what happens near the 0 score, let us plot only the part with scores in the range -10 to 5.

plot(s_to_p[s_to_p$score >= -10 & s_to_p$score <=5, ],type="l")

We can also find the score at which this probability is largest.

s_to_p$score[s_to_p$prob == max(s_to_p$prob)]
[1] -2

Lab Assignment 2

  1. Play with the basic R functions rep, seq, sample to see how they work.

  2. Create some fake data which consists of the numbers k in 1:50 each repeated k times. Plot its histogram and calculate the various summary statistics. Do the same where the numbers k in 1:50 are repeated (50-k) times or for other distributions.

  3. Simulate an examination with different negative marking schemes and see what the scores look like. For example, no negative marks or 4 marks for the correct answer.


Last modified: Tuesday, 20 February 2018, 2:33 PM