Variance of sample: decreased or not?
One beautiful day I was watching Khan Academy statistics lessons about the variance and standard deviation of a sample. As a quick intro, the variance of a population is defined by the formula $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, but for a sample it is
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Don't let the notation confuse you; the only difference that matters here is the denominator: for a sample it is decreased by one. Sal Khan, the teacher, encouraged me to try it and see whether the decreased denominator really gives a better estimate. I immediately figured that this could be done with a not-so-difficult program.
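To make the two denominators concrete, here is a minimal sketch that applies both formulas by hand to a tiny data set. The numbers are purely illustrative and are not data from the experiment.

```python
# Illustrative data only; these values are not from the experiment.
data = [2, 4, 4, 4, 5, 5, 7, 9]

count = len(data)
mean = sum(data) / count

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population formula: divide by the number of values.
variance_population_formula = squared_deviations / count

# Sample formula: divide by one less than the number of values.
variance_sample_formula = squared_deviations / (count - 1)

print(variance_population_formula)  # 4.0
print(variance_sample_formula)
```

The same sum is divided by a smaller number in the second case, so the sample formula gives the larger value whenever the data is not all identical.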
Follow me and I’ll show you how I did it!
Idea and plan
Let's try to break things down.
- Generate a population of N random integers from a to b;
- Calculate the variance and standard deviation of the population;
- Extract a sample of n random members of the population;
- Calculate the variance and standard deviation of the sample using both the population and the sample formulas, and compare each result with the population values;
- Repeat the procedure from the first step k times;
- Derive some conclusions from the data produced.
Note: If you have any interest in reading my code, it is available for everyone. I even encourage you to check it and tell me how it can be improved!
Execution of the plan
Step 1: generate population
A population can contain any data. Literally any. Neither you nor I can tell that the data is not valid (except in some obvious cases), because we are asking statistics for insights! So I chose to fill the population with random integers.
The following code explains itself:
    from random import randint

    # N is the population size; a and b are the inclusive bounds of the integers.
    population = []
    for i in range(N):
        integer = randint(a, b)
        population.append(integer)
Step 2: calculate variance and standard deviation
One of the reasons I love Python is its libraries, and NumPy is one that is worth mentioning. I love it mostly for its support for multi-dimensional arrays, but this time I will use it only to calculate variance and standard deviation.
    import numpy as np

    population_variance = np.var(population)
    population_standard_deviation = np.std(population)
Step 3: extract sample
If you ask what a sample is, I want to give you a simple answer: it is a subset of the population, a small part of the data that could possibly be collected.
    from random import sample as generate_sample

    sample = generate_sample(population, n)
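One detail worth knowing about `random.sample` is that it draws without replacement, so each member of the population can be picked at most once. A minimal sketch, assuming made-up values for the population and n (the post does not state the real ones):

```python
from collections import Counter
from random import Random

# Hypothetical values for illustration; the post does not state N, a, b or n.
rng = Random(42)  # seeded generator, only so that reruns give the same draw
population = [rng.randint(1, 100) for _ in range(20)]
n = 5

# Draw n members without replacement: each position is used at most once.
sample = rng.sample(population, n)

# Counter subtraction keeps only positive counts, so the result is empty
# when no value occurs more often in the sample than in the population.
leftover = Counter(sample) - Counter(population)

print(len(sample))   # 5
print(not leftover)  # True
```

Asking for more members than the population holds raises a `ValueError`, so n has to stay at or below N.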
Step 4: calculate some statistics
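    # Sample statistics with the population formulas (denominator n)
    sample_sigma2 = np.var(sample)
    sample_sigma = np.std(sample)

    # Sample statistics with the sample formulas (denominator n - 1)
    sample_s2 = np.var(sample, ddof=1)
    sample_s = np.std(sample, ddof=1)

    # Absolute differences from the population values computed in step 2
    sigma2_minus_s2 = abs(population_variance - sample_s2)
    sigma_minus_s = abs(population_standard_deviation - sample_s)
    sigma2_minus_sigma2 = abs(population_variance - sample_sigma2)
    sigma_minus_sigma = abs(population_standard_deviation - sample_sigma)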
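NumPy's `ddof` argument ("delta degrees of freedom") is subtracted from the number of elements to form the divisor: the default `ddof=0` divides by n, while `ddof=1` divides by n - 1. The sketch below mirrors that behaviour without NumPy and cross-checks it against the standard library's `statistics.pvariance` and `statistics.variance`, which use the same two denominators. The `variance_with_ddof` helper and the data are hypothetical, for illustration only.

```python
import math
import statistics


def variance_with_ddof(data, ddof=0):
    """Variance with divisor len(data) - ddof (hypothetical helper)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - ddof)


# Made-up sample, not data from the experiment.
sample = [12, 15, 11, 18, 14]

# ddof=0 matches the population formula, ddof=1 the sample formula.
print(math.isclose(variance_with_ddof(sample, ddof=0), statistics.pvariance(sample)))  # True
print(math.isclose(variance_with_ddof(sample, ddof=1), statistics.variance(sample)))  # True
```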
Step 5: repeat (and collect data)
The software I wrote is asynchronous, so that the computations run much faster than they would in synchronous software, which makes this step a bit too complicated to explain in a few paragraphs.
If you are interested in this topic, then please read a post about it. 🙂
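For readers who only want the gist, a plain sequential loop is enough to reproduce the idea, just more slowly. The sketch below is not the asynchronous program described above. It uses the standard library's `statistics` module instead of NumPy, and the function name `run_experiment` and every parameter value are assumptions, since the post does not state N, a, b, n or k.

```python
import random
import statistics


def run_experiment(N, a, b, n, k, seed=None):
    """Average absolute errors over k trials (hypothetical sequential sketch)."""
    rng = random.Random(seed)
    totals = {
        "sigma2_minus_sigma2": 0.0,
        "sigma2_minus_s2": 0.0,
        "sigma_minus_sigma": 0.0,
        "sigma_minus_s": 0.0,
    }

    for _ in range(k):
        # Steps 1 and 2: a fresh population and its true statistics.
        population = [rng.randint(a, b) for _ in range(N)]
        population_variance = statistics.pvariance(population)
        population_standard_deviation = statistics.pstdev(population)

        # Step 3: a sample drawn without replacement.
        sample = rng.sample(population, n)

        # Step 4: population formula (divide by n) vs sample formula (n - 1).
        totals["sigma2_minus_sigma2"] += abs(population_variance - statistics.pvariance(sample))
        totals["sigma2_minus_s2"] += abs(population_variance - statistics.variance(sample))
        totals["sigma_minus_sigma"] += abs(population_standard_deviation - statistics.pstdev(sample))
        totals["sigma_minus_s"] += abs(population_standard_deviation - statistics.stdev(sample))

    # Step 5: average each accumulated error over the k trials.
    return {name: total / k for name, total in totals.items()}


# Example values only; the original experiment's parameters are not given.
averages = run_experiment(N=1000, a=1, b=1000, n=30, k=200, seed=1)
for name, value in averages.items():
    print(f"{name} = {value:.5f}")
```

Passing a `seed` makes a run repeatable, which helps when comparing the effect of different sample sizes.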
Step 6: statistics
After 1000 tests, I got the following data.
Note: it is worth mentioning that the minuend (the number before the minus sign) is calculated from the population's data and the subtrahend (the number after the minus sign) is calculated from the sample's data.
Averages:
    |σ²-σ²| = 2713.33889
    |σ²-S²| = 2706.47660
    |σ-σ| = 4.71017
    |σ-S| = 4.69359
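To put these averages in perspective, the relative gap between the two formulas can be worked out directly from the four numbers above. A small sketch, with variable names chosen here for illustration:

```python
# Averages copied from the results above.
variance_error_population_formula = 2713.33889  # |σ²-σ²|
variance_error_sample_formula = 2706.47660      # |σ²-S²|
std_error_population_formula = 4.71017          # |σ-σ|
std_error_sample_formula = 4.69359              # |σ-S|

# How much smaller the average error is with the n - 1 denominator, in percent.
relative_gain_variance = (
    (variance_error_population_formula - variance_error_sample_formula)
    / variance_error_population_formula * 100
)
relative_gain_std = (
    (std_error_population_formula - std_error_sample_formula)
    / std_error_population_formula * 100
)

print(f"variance: {relative_gain_variance:.2f}% smaller average error")
print(f"standard deviation: {relative_gain_std:.2f}% smaller average error")
```

Both gaps come out well under one percent.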
Conclusions
The experiment I conducted supports the hypothesis: on average, the formula with the decreased denominator landed closer to the population variance than the formula with the normal denominator, although the gap is small. It kind of makes sense to me, although I cannot explain it.
Another thing I noticed is that the same goes for standard deviation, though Sal Khan said it was not true. Well, this case may be an exception then. 🙂 I call this experiment successful!
Have you ever written software to prove your point?
Can you think of a hypothesis that boggles your mind and that software could prove? Try it! It is fun! 🙂