Variance of sample: decreased or not?

One beautiful day I was watching the Khan Academy statistics lessons on the variance and standard deviation of a sample. As a quick intro, the variance of a population is defined by the formula \( \sigma^2 = { { \sum_{i=1}^N ( x_i - \mu )^2 } \over N } \), but for a sample it is \( S^2 = { { \sum_{i=1}^n ( x_i - \bar{x} )^2 } \over { n - 1 } } \). Don’t let the notation confuse you: the important difference is the denominator, which for a sample is decreased by one. Sal Khan, the teacher, encouraged me to try it and see whether this really works better. I immediately figured that this could be done with a not-so-difficult program.

Follow me and I’ll show you how I did it!

Idea and plan

Let’s try to break things down.

1. Generate a population of N random integers from a to b;
2. Calculate the variance and standard deviation of the population;
3. Extract a sample of n random members of the population;
4. Calculate the sample’s variance and standard deviation using both the population formula and the sample formula, and compare each with the population values;
5. Repeat the procedure from point 1 k times;
6. Derive some conclusion from the data produced.

Note: If you have any interest in reading my code, it is available for everyone. I even encourage you to check it and tell me how it can be improved!

Execution of the plan

Step 1: generate population

A population can contain any data. Literally any. Neither you nor I can say that the data is not valid (except in some obvious cases), because we are the ones asking statistics for insights! So I chose to fill the population with uniformly distributed random integers.

The following code explains itself:

from random import randint

# N, a and b are the experiment's parameters: the population size
# and the integer range. randint(a, b) includes both endpoints.
population = [randint(a, b) for _ in range(N)]

Step 2: calculate variance and standard deviation

One of the reasons I love Python is its libraries, and NumPy is one that is worth mentioning. What I love most about it is how it works with multi-dimensional arrays, but this time I will use it only to calculate variance and standard deviation.

import numpy as np

# These names are reused in step 4.
population_sigma2 = np.var(population)  # population variance
population_sigma = np.std(population)   # population standard deviation

Step 3: extract sample

If you ask what a sample is, I want to give you a simple answer: it is a subset of the population, a small part of the data that could realistically be collected.

from random import sample as generate_sample

sample = generate_sample(population, n)

Step 4: calculate statistics

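# Default ddof=0: divide by n (the population formula applied to the sample).
sample_sigma2 = np.var(sample)
sample_sigma = np.std(sample)
# ddof=1: divide by n - 1 (the sample formula).
sample_s2 = np.var(sample, ddof=1)
sample_s = np.std(sample, ddof=1)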

sigma2_minus_s2 = abs(population_sigma2 - sample_s2)
sigma_minus_s = abs(population_sigma - sample_s)
sigma2_minus_sigma2 = abs(population_sigma2 - sample_sigma2)
sigma_minus_sigma = abs(population_sigma - sample_sigma)
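If the ddof argument looks mysterious, the small check below may help. It is only an illustrative sketch: the data list and the by_hand_* names are made up for this example and are not part of the experiment. It computes both formulas by hand and compares them with what NumPy returns.

```python
import numpy as np

# Made-up data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

# The two formulas differ only in the denominator.
by_hand_population = squared_deviations / n        # divide by n
by_hand_sample = squared_deviations / (n - 1)      # divide by n - 1

# np.var divides by (n - ddof); ddof defaults to 0.
numpy_population = np.var(data)
numpy_sample = np.var(data, ddof=1)

print(by_hand_population, numpy_population)
print(by_hand_sample, numpy_sample)
```

So ddof=1 is simply NumPy’s way of saying “decrease the denominator by one”.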

Step 5: repeat (and collect data)

The software I wrote runs the computations asynchronously, which is much faster than running them one after another, so this step is a little too complicated to explain in a few paragraphs.

If you are interested in this topic, then please read a post about it. 🙂
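For readers who only want the idea, a plain synchronous version of the loop could look roughly like the sketch below. This is not the asynchronous program that produced the results: the function names run_trial and run_experiment, the seed, and all parameter values (k, N, a, b, n) are assumptions picked for illustration.

```python
import random
import numpy as np

def run_trial(rng, N, a, b, n):
    """One pass through steps 1-4; returns the four absolute differences."""
    population = [rng.randint(a, b) for _ in range(N)]
    sample = rng.sample(population, n)  # drawn without replacement

    population_sigma2 = np.var(population)
    population_sigma = np.std(population)

    return (
        abs(population_sigma2 - np.var(sample)),          # variance, divide by n
        abs(population_sigma2 - np.var(sample, ddof=1)),  # variance, divide by n - 1
        abs(population_sigma - np.std(sample)),           # std, divide by n
        abs(population_sigma - np.std(sample, ddof=1)),   # std, divide by n - 1
    )

def run_experiment(k, N, a, b, n, seed=0):
    """Repeat the trial k times and average each difference."""
    rng = random.Random(seed)
    results = np.array([run_trial(rng, N, a, b, n) for _ in range(k)])
    return results.mean(axis=0)

# Hypothetical parameter values, not the ones used for the results in this post.
averages = run_experiment(k=200, N=1000, a=0, b=100, n=20)
labels = ["var, divide by n", "var, divide by n - 1",
          "std, divide by n", "std, divide by n - 1"]
for label, value in zip(labels, averages):
    print(f"{label}: {value:.5f}")
```

Each trial builds a fresh population, as in point 5 of the plan.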

After 1000 tests, I got the following data.

Note: it is worth mentioning that the minuend (the number before the minus sign) is calculated from the population’s data and the subtrahend (the number after the minus sign) is calculated from the sample’s data.

Averages:

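|σ²-σ²| = 2713.33889 (sample variance with denominator n)
|σ²-S²| = 2706.47660 (sample variance with denominator n-1)
|σ-σ| = 4.71017 (sample standard deviation with denominator n)
|σ-S| = 4.69359 (sample standard deviation with denominator n-1)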

Conclusion

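The experiment I conducted is consistent with the hypothesis. On average, the formula with the decreased denominator landed closer to the population variance than the formula with the normal denominator, although the gap is small (about 2706.48 versus 2713.34). And it kinda makes sense to me, although I cannot explain it.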
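One common textbook explanation is that the deviations are measured from the sample’s own mean, which sits closer to the sample’s points than the population mean does, so dividing by n tends to underestimate the variance. A way to see this is to average the estimates themselves instead of the absolute differences. The sketch below does that under assumed parameters (N, a, b, n, k and the seed are all picked for illustration, with a deliberately small n so the effect is easy to see); it is not part of the original experiment.

```python
import random
import numpy as np

rng = random.Random(0)

# Hypothetical parameters; a small n makes the effect visible.
N, a, b = 1000, 0, 100
n, k = 5, 20000

population = [rng.randint(a, b) for _ in range(N)]
population_sigma2 = np.var(population)

# k samples of size n, one per row, each drawn without replacement.
samples = np.array([rng.sample(population, n) for _ in range(k)])

# Average of the estimates over all samples, for each denominator.
mean_divide_by_n = np.var(samples, axis=1).mean()
mean_divide_by_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()

print("population variance:     ", population_sigma2)
print("average with n:          ", mean_divide_by_n)
print("average with n - 1:      ", mean_divide_by_n_minus_1)
```

With these settings the average of the divide-by-n estimates should come out at roughly (n - 1)/n of the population variance, while the divide-by-(n - 1) average should sit close to it.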

Another thing I noticed is that the same goes for the standard deviation, though Sal Khan said it was not true. Well, this case may be an exception then. 🙂 I call this experiment successful!

Have you ever written software to prove your point?

Can you think of a hypothesis that puzzles you and that software could prove? Try it! It is fun! 🙂