Lab Session 1

Lab Assignment 1

Random-ness

We will see how numbers you enter can be checked for random-ness.

mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
num <- length(mydata)

Now we can compare this with random generation

randata <- sample(c(0,1),num,replace=TRUE)

Now, obviously, the two sets are different!

mydata
 [1] 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1
randata
 [1] 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 1 0 1 1 1 0 1 0 0

The question we want to ask is: “Are they statistically different?”

Simple tests

We can just count the number of 0’s and 1’s in each case.

sum(mydata)
[1] 14
sum(randata)
[1] 13

What should we expect? In “random” data, 1’s and 0’s should be equally likely. So we should compare these counts with the value

0.5*num
[1] 14

On this count, our data looks “more random” than the random sample! So we are still left with the question of how significant a deviation of 13 from 14 is in this kind of analysis.
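One way to put a number on that question is an exact binomial test. The sketch below is not part of the original lab; it assumes the "fair coin" model (each digit is 1 with probability 0.5, independently) and uses the base R functions binom.test and dbinom.

```r
# Same 28 digits as mydata above, repeated so this block runs on its own.
mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
num <- length(mydata)

# Exact two-sided binomial test: is the count of 1's consistent with p = 0.5?
res <- binom.test(sum(mydata), num, p = 0.5)
res$p.value

# Probability that a fair sequence of this length has exactly 13 ones
p13 <- dbinom(13, size = num, prob = 0.5)
p13

# Probability that a fair sequence misses 14 by at least one, in either direction
p_off <- 1 - dbinom(14, size = num, prob = 0.5)
p_off
```

Under this model a fair sequence of length 28 misses the value 14 roughly 85% of the time, so a count of 13 is not surprising at all.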

Counting pairs

How about the successive pairs? We can form tables of pairs.

mypairs  <- data.frame(a=mydata[1:(num-1)],b=mydata[2:num])
ranpairs <- data.frame(a=randata[1:(num-1)],b=randata[2:num])

How often does the pair (0,0) occur in each?

sum(mypairs$a==0 & mypairs$b==0)
[1] 5
sum(ranpairs$a==0 & ranpairs$b==0)
[1] 6

We expect each of the four possible pairs to make up about a quarter of the total. There are num-1 = 27 pairs, so num/4 is a close enough approximation.

num/4
[1] 7

We can also calculate the other pairs for my data

sum(mypairs$a==0 & mypairs$b==1)
[1] 9
sum(mypairs$a==1 & mypairs$b==0)
[1] 8
sum(mypairs$a==1 & mypairs$b==1)
[1] 5

We can similarly do it for the random data

sum(ranpairs$a==0 & ranpairs$b==1)
[1] 8
sum(ranpairs$a==1 & ranpairs$b==0)
[1] 8
sum(ranpairs$a==1 & ranpairs$b==1)
[1] 5

Again, how do we know whether the data I entered is “random” enough, given that even the randomly generated data deviates from the expectation?

Of course, I could also count triples and so on.
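As a sketch of how this might be done more compactly, table() can produce all four pair counts in one call, and the same shifting idea extends to triples. The names pair_counts, triples and triple_counts are illustrative and do not appear elsewhere in the lab.

```r
# Same 28 digits as mydata above, repeated so this block runs on its own.
mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
num <- length(mydata)

# All four pair counts at once: rows are the first digit, columns the second.
pair_counts <- table(first = mydata[1:(num - 1)], second = mydata[2:num])
pair_counts

# Triples: paste three shifted copies together and tabulate the labels.
triples <- paste0(mydata[1:(num - 2)], mydata[2:(num - 1)], mydata[3:num])
triple_counts <- table(triples)
triple_counts

# There are num - 2 triples and 8 equally likely patterns,
# so each pattern is expected about this many times.
(num - 2) / 8
```

Patterns that never occur are simply missing from triple_counts, so a short table is itself a hint that some triples were avoided.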

Run lengths

One other test is to look at lengths of “runs”. A run is a maximal stretch of consecutive identical values (all 0’s or all 1’s).

rle(mydata)
Run Length Encoding
  lengths: int [1:18] 1 2 1 1 1 1 2 2 1 3 ...
  values : num [1:18] 0 1 0 1 0 1 0 1 0 1 ...
rle(randata)
Run Length Encoding
  lengths: int [1:17] 1 1 4 1 1 4 1 1 2 1 ...
  values : num [1:17] 0 1 0 1 0 1 0 1 0 1 ...

Here we see some difference. Since I was trying to be careful to give equal numbers of 0’s and 1’s, I avoided a long run. However, as we shall see later, long runs are not as improbable as one might think! While a run of length 7 would be unusual, a run of length 4 in a sample of length 28 is not particularly unlikely.
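The claim about run lengths can be checked roughly by simulation. This is a minimal sketch, assuming fair independent digits; the helper longest_run, the seed and the number of simulations n_sim are arbitrary choices made here, not part of the lab.

```r
# Same 28 digits as mydata above, repeated so this block runs on its own.
mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
num <- length(mydata)

# Length of the longest run in a sequence (hypothetical helper)
longest_run <- function(x) max(rle(x)$lengths)
longest_run(mydata)

# Simulate many fair sequences of the same length and record each longest run.
set.seed(1)      # arbitrary seed, only so that reruns give the same numbers
n_sim <- 10000   # number of simulated sequences (an arbitrary choice)
sim_longest <- replicate(n_sim, longest_run(sample(c(0, 1), num, replace = TRUE)))

# Estimated probability that the longest run has length at least 4, and at least 7
mean(sim_longest >= 4)
mean(sim_longest >= 7)
```

Comparing the two estimates shows how quickly long runs become rare as the required length grows.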

A Mathematician’s answer

So, did I manage to type a “random” sequence or not? The mathematician’s answer is to calculate and give an estimate of the probability that the sequence is “random”!

This is done as follows. We use probability theory to calculate what the deviation from the expectation would be if there were a large number of experiments such as the one above with random sequences of the same length. We can then estimate how likely it is that my sequence is one of them.

In more complex situations, such a calculation may not be possible! For example, even in the case above, without some additional probability theory, we do not know what the expected result is! In that case, in modern statistics, the accepted method is to carry out repeated simulations (using software like R). We can generate a large number (say about 1000) of sequences such as randata and look at the mean and variance of the values that we obtain for counts such as those above. By the “Central Limit Theorem” (which we will learn later) the distribution of these values will be approximately “normal”. So we can decide how far from the mean is “significant”.
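The simulation approach described above might look like the following sketch for the count of (0,0) pairs. The helper count00, the seed and n_sim are assumptions introduced for illustration; the "number of standard deviations from the mean" at the end is one simple way to judge significance, not the only one.

```r
# Same 28 digits as mydata above, repeated so this block runs on its own.
mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
num <- length(mydata)

# Number of successive (0,0) pairs in a sequence (hypothetical helper)
count00 <- function(x) {
  n <- length(x)
  sum(x[1:(n - 1)] == 0 & x[2:n] == 0)
}

# Generate n_sim random sequences like randata and count (0,0) pairs in each.
set.seed(2018)   # arbitrary seed, only for reproducibility
n_sim <- 1000
sim00 <- replicate(n_sim, count00(sample(c(0, 1), num, replace = TRUE)))

# Mean and spread of the simulated counts
mean(sim00)
sd(sim00)

# How many standard deviations mydata lies from the simulated mean
z <- (count00(mydata) - mean(sim00)) / sd(sim00)
z
```

A histogram, hist(sim00), gives a quick visual check of how close the simulated counts are to a bell shape.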

Today’s experiment

  1. Open a project called Lab1.

  2. Type out a large-ish sequence (about 100) of 0’s and 1’s as your own mydata.

  3. Use commands such as the one above to generate a random sequence randata.

  4. Calculate the number of 1’s in each case and compare your results with the number expected.

  5. Calculate the number of pairs in each case and compare your results with the number expected.

  6. Look at the run-length calculations and see how different they are.

  7. Repeat steps 4-5 after re-generating randata (as in step 3). This will give you some idea how the data varies.

  8. (Starred) Try to “touch-up” mydata to meet the expected values. How successful are you?
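To make the repetition in step 7 less tedious, the counts from steps 4 to 6 could be gathered into one function. This is only a sketch: summarise_bits is a hypothetical helper, not something the lab provides, and the 28-digit mydata is a stand-in for your own longer sequence.

```r
# Hypothetical helper: the counts from steps 4-6 for one 0/1 sequence.
summarise_bits <- function(x) {
  n <- length(x)
  a <- x[1:(n - 1)]          # first element of each successive pair
  b <- x[2:n]                # second element of each successive pair
  r <- rle(x)$lengths        # lengths of the runs
  c(ones    = sum(x),
    p00     = sum(a == 0 & b == 0),
    p01     = sum(a == 0 & b == 1),
    p10     = sum(a == 1 & b == 0),
    p11     = sum(a == 1 & b == 1),
    runs    = length(r),
    longest = max(r))
}

# Stand-in data: replace with your own sequence of about 100 digits.
mydata <- c(0,1,1,0,1,0,1,0,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1)
my_summary <- summarise_bits(mydata)
my_summary

# Step 7: regenerate randata several times; each column is one regeneration.
set.seed(7)   # arbitrary seed, only for reproducibility
sims <- replicate(5, summarise_bits(sample(c(0, 1), length(mydata), replace = TRUE)))
sims
```

Reading across a row of sims shows how much one statistic varies between random sequences, which is the scale against which my_summary should be judged.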

Last modified: Tuesday, 20 February 2018, 2:32 PM