## Variance of sample: decreased or not?

One beautiful day I was watching the Khan Academy statistics lessons about the variance and standard deviation of a sample. To give a quick intro, the variance of a population is defined by the formula $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$, but for a sample it is $S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$. Don't get confused by the notation; the only difference that matters here is the denominator: for a sample it is decreased by one. Sal Khan, the teacher, encouraged me to try it and see whether this is true. I immediately figured out that this could be done with a not-so-difficult program.
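To make the difference in denominators concrete, here is a minimal sketch that computes both versions by hand. The data set and the helper name `variance` are made up for illustration; only the two denominators come from the definitions above.

```python
# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def variance(values, ddof=0):
    """Sum of squared deviations from the mean, divided by len(values) - ddof."""
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (len(values) - ddof)


population_style = variance(data)       # denominator n
sample_style = variance(data, ddof=1)   # denominator n - 1

print(population_style)
print(sample_style)
```

The numerator is the same in both calls, so the version with the decreased denominator is always the larger of the two.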

Follow me and I’ll show you how I did it!

## Idea and plan

Let’s try to break things down.

- Generate a population of *N* random integers, from *a* to *b*;
- Calculate the variance and standard deviation of the population;
- Extract a sample of *n* random members of the population;
- Calculate the variance and standard deviation of the sample using both the population and the sample formulas, and compare them with the population values;
- Repeat the procedure from the first point *k* times;
- Derive some conclusion from the data produced.

**Note**: If you have any interest in reading my code, it is available for everyone. I even encourage you to check it and tell me how it can be improved!

## Execution of the plan

### Step 1: generate population

A population can contain any data. Literally, any. Neither you nor I can tell that the data is not valid (apart from some obvious cases), because we are the ones asking statistics for insights! So I chose a *random distribution* strategy: every member is a random integer between *a* and *b*.

The following code explains itself:

    from random import randint

    # N is the population size; a and b are the inclusive bounds from the plan
    population = []
    for i in range(N):
        integer = randint(a, b)
        population.append(integer)

### Step 2: calculate variance and standard deviation

One of the reasons I love Python is its libraries, and NumPy is one that is worth mentioning. What I love most about it is how it works with multi-dimensional arrays, but this time I will use it only for calculating variance and standard deviation.

    import numpy as np

    # By default np.var and np.std divide by N (the population formulas)
    population_variance = np.var(population)
    population_standard_deviation = np.std(population)

### Step 3: extract sample

If you ask what a sample is, I want to give you a simple answer: it is a subset of the population, a small part of the data that could possibly be collected.

    from random import sample as generate_sample

    # Draw n distinct members of the population (sampling without replacement)
    sample = generate_sample(population, n)

### Step 4: calculate some statistics

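    # Sample statistics using the population formulas (denominator n)
    sample_sigma2 = np.var(sample)
    sample_sigma = np.std(sample)

    # Sample statistics using the sample formulas (denominator n - 1)
    sample_s2 = np.var(sample, ddof=1)
    sample_s = np.std(sample, ddof=1)

    # Absolute differences from the population values computed in step 2
    sigma2_minus_s2 = abs(population_variance - sample_s2)
    sigma_minus_s = abs(population_standard_deviation - sample_s)
    sigma2_minus_sigma2 = abs(population_variance - sample_sigma2)
    sigma_minus_sigma = abs(population_standard_deviation - sample_sigma)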
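The `ddof=1` argument in the code above is what switches NumPy from the population formula to the sample formula: NumPy divides by the number of elements minus `ddof`. The short check below is only a sketch on a made-up sample, but it shows that the two results differ by exactly the factor n / (n - 1).

```python
import numpy as np

# Hypothetical sample, used only to illustrate the ddof parameter.
sample = [3, 8, 15, 21, 42]
n = len(sample)

divide_by_n = np.var(sample)                  # ddof=0 (default): denominator n
divide_by_n_minus_1 = np.var(sample, ddof=1)  # ddof=1: denominator n - 1

# Same numerator, different denominator.
ratio_matches = bool(np.isclose(divide_by_n_minus_1, divide_by_n * n / (n - 1)))
print(ratio_matches)
```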

### Step 5: repeat (and collect data)

The software I wrote is asynchronous, so that it runs the computations much faster than synchronous software would, which makes this step a little too complicated to explain in a few paragraphs.

If you are interested in this topic, then please read a post about it. 🙂
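Leaving the asynchronous part aside, the repeat-and-collect step can be sketched as a plain sequential loop. This is not the code I actually ran: the function names `run_trial` and `run_experiment`, the seed, and the parameter values in the last call are all assumptions picked for illustration.

```python
import random

import numpy as np


def run_trial(rng, N, a, b, n):
    """One pass of steps 1-4; returns the four absolute differences."""
    population = [rng.randint(a, b) for _ in range(N)]
    sample = rng.sample(population, n)
    pop_var, pop_std = np.var(population), np.std(population)
    return (
        abs(pop_var - np.var(sample)),          # variance, denominator n
        abs(pop_var - np.var(sample, ddof=1)),  # variance, denominator n - 1
        abs(pop_std - np.std(sample)),          # std, denominator n
        abs(pop_std - np.std(sample, ddof=1)),  # std, denominator n - 1
    )


def run_experiment(k, N, a, b, n, seed=0):
    """Repeat the trial k times and average each difference."""
    rng = random.Random(seed)
    results = np.array([run_trial(rng, N, a, b, n) for _ in range(k)])
    return results.mean(axis=0)


# Hypothetical parameters; the post does not state the ones I used.
averages = run_experiment(k=200, N=1000, a=0, b=1000, n=30)
labels = [
    "variance, denominator n",
    "variance, denominator n - 1",
    "std, denominator n",
    "std, denominator n - 1",
]
for label, value in zip(labels, averages):
    print(f"{label}: {value:.5f}")
```

Each trial is independent of the others, which is what makes this step a natural candidate for running on several workers at once.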

### Step 6: statistics

After 1000 tests, I got the following data.

**Note**: it is worth mentioning that the minuend (the number before the minus sign) is calculated from the population’s data and the subtrahend (the number after the minus sign) is calculated from the sample’s data.

    Averages:
    |σ²-σ²| = 2713.33889
    |σ²-S²| = 2706.47660
    |σ-σ|   = 4.71017
    |σ-S|   = 4.69359

## Conclusions

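The experiment I conducted supports the hypothesis: on average, the formula with the decreased denominator landed closer to the population variance than the formula with the normal denominator (2706.47660 versus 2713.33889), although the gap is small. And it kinda makes sense to me, although I cannot explain it.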
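One way to probe why the decreased denominator helps is to average the estimates themselves instead of their absolute differences. The sketch below is a different measurement from the one in my experiment, and the seed, population bounds, sample size and number of repetitions are all assumptions. With a small sample, the estimates that divide by n tend to come out too low on average, while the ones that divide by n - 1 tend to sit closer to the population variance.

```python
import random

import numpy as np

rng = random.Random(42)  # fixed seed, chosen arbitrarily

# Hypothetical population: 10,000 random integers from 0 to 100.
population = [rng.randint(0, 100) for _ in range(10_000)]
population_variance = np.var(population)

n, k = 10, 2000  # small sample size, number of repetitions
divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(k):
    sample = rng.sample(population, n)
    divide_by_n.append(np.var(sample))
    divide_by_n_minus_1.append(np.var(sample, ddof=1))

mean_n = np.mean(divide_by_n)
mean_n_minus_1 = np.mean(divide_by_n_minus_1)

print(f"population variance:          {population_variance:.2f}")
print(f"mean estimate, divisor n:     {mean_n:.2f}")
print(f"mean estimate, divisor n - 1: {mean_n_minus_1:.2f}")
```

The smaller the sample, the more visible the gap between the two averages becomes, because the two estimates always differ by the factor n / (n - 1).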

Another thing I noticed is that the same goes for standard deviation (4.69359 versus 4.71017), though Sal Khan said it was not true. Well, this case may then be an exception. 🙂 I call this experiment successful!

Have you ever written software to prove your point?

Can you think of a hypothesis that boggles you and that software could prove? Try it! It is fun! 🙂
