Renáta Németh, Dávid Simon
ELTE
The variance and the standard deviation are used with interval-ratio variables.
They give us information about the overall variation and, unlike the range or the IQR, are not based on only two values.
They are the most frequently used measures of variability.
They reflect how much, on average, each value of the variable deviates from the mean.
The sensitivity of the mean to outliers carries over to the calculation of these measures. (Hence they are not appropriate for highly skewed distributions; see Section Finding the appropriate measure of variability.)
They cannot be negative. A value of 0 means that there is no variability in the distribution (that is, each observation has the same value). A greater value shows greater variability.
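The variance and the standard deviation can be calculated from each other. The variance is the mean of the squared deviations from the mean; the standard deviation is its square root.
Variance:
$$s_Y^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}$$
where $Y$ denotes the variable, $Y_i$ is its value for the $i$-th observation, $n$ is the sample size, and $\bar{Y}$ is the mean.
Standard deviation:
$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}$$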
Why do we use squared deviations?
If we simply used the deviations, their sum would always be zero, because the negative and positive deviations cancel each other out. E.g. for the sample {1, 2, 3}, whose mean is 2, the sum of deviations would be
$$(1 - 2) + (2 - 2) + (3 - 2) = -1 + 0 + 1 = 0,$$
so the variance would also be 0, even though there is some variability in the distribution!
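The cancellation described above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `deviations` is made up for this example and is not part of the original material.

```python
def deviations(sample):
    """Return each value's deviation from the sample mean."""
    mean = sum(sample) / len(sample)
    return [y - mean for y in sample]


sample = [1, 2, 3]
devs = deviations(sample)

# Positive and negative deviations cancel, so their sum carries no information.
total_deviation = sum(devs)

# Squared deviations are never negative, so they cannot cancel.
total_squared = sum(d ** 2 for d in devs)

print("deviations:", devs)
print("sum of deviations:", total_deviation)
print("sum of squared deviations:", total_squared)
```

The same cancellation occurs for any sample, because the mean is by definition the point around which the deviations balance.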
We could use the absolute values of the deviations, but absolute values are mathematically difficult to work with. Another difference between absolute and squared deviations is that squaring increases deviations greater than 1 and decreases deviations smaller than 1. That is, squaring penalizes larger deviations. E.g. for the sample {1, 3, 8}, whose mean is 4, the sum of absolute deviations would be
$$|1 - 4| + |3 - 4| + |8 - 4| = 3 + 1 + 4 = 8,$$
while the sum of squared deviations is
$$(1 - 4)^2 + (3 - 4)^2 + (8 - 4)^2 = 9 + 1 + 16 = 26.$$
Example for calculating the variance and the standard deviation
Consider the sample {1, 3, 8} again. The mean is 4, so the variance is (9 + 1 + 16)/3 = 26/3 ≈ 8.67, and the standard deviation is its square root, about 2.94.
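The worked example can be reproduced step by step in Python. The sketch below assumes the definition with n in the denominator, as used in this section; the function names `population_variance` and `population_sd` are chosen for this illustration only.

```python
import math


def population_variance(sample):
    """Mean of the squared deviations from the mean (n in the denominator)."""
    mean = sum(sample) / len(sample)
    return sum((y - mean) ** 2 for y in sample) / len(sample)


def population_sd(sample):
    """Square root of the variance defined above."""
    return math.sqrt(population_variance(sample))


sample = [1, 3, 8]
mean = sum(sample) / len(sample)

# Absolute deviations grow linearly; squared deviations weight large gaps more.
abs_sum = sum(abs(y - mean) for y in sample)
sq_sum = sum((y - mean) ** 2 for y in sample)

variance = population_variance(sample)
sd = population_sd(sample)

print(f"mean = {mean}")
print(f"sum of absolute deviations = {abs_sum}")
print(f"sum of squared deviations = {sq_sum}")
print(f"variance = {variance:.2f}, standard deviation = {sd:.2f}")
```

Note that the largest deviation (8 is 4 away from the mean) contributes half of the sum of absolute deviations but 16 of the 26 units in the sum of squared deviations.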
Question
The variance of a variable is 0 if and only if all observations have the same value. Which other measures of variability have this property?
The standard deviation
The variance is based on squared deviations and is therefore no longer expressed in the original units of measurement.
For example, according to Hungarian ISSP data 2006, individual monthly net income has a mean of 134,244 Ft, while its variance is about 26.5 billion (in squared forints), which is difficult to interpret.
Thus, the square root of the variance is taken. This measure is called the standard deviation.
In the previous example the standard deviation is 162,817 Ft. We can say that the typical deviation from the mean income of about 134,000 Ft is about 163,000 Ft. That is, income shows large variability, since the standard deviation is greater than the mean.
The interpretation of the standard deviation is clearer when comparing two groups or two points in time:
Example
Hungarian parliamentary election 1990 and 2002, first round turnout rates by county
(source: Hungarian Central Bureau of Statistics, Társadalmi helyzetkép, 2002).
County | 1990 | 2002
Budapest | 71.2 | 77.5
Pest | 63.3 | 70.6
Fejér | 64.5 | 69.6
Komárom-Esztergom | 64.5 | 71.0
Veszprém | 70.9 | 72.6
Gy-M-S | 76.4 | 73.9
Vas | 76.8 | 74.2
Zala | 69.3 | 70.7
Baranya | 65.9 | 71.8
Somogy | 62.5 | 68.0
Tolna | 64.0 | 68.5
B-A-Z | 61.0 | 68.0
Heves | 65.3 | 70.1
Nógrád | 62.6 | 69.3
H-B | 56.3 | 66.0
J-N-Sz | 59.0 | 66.7
Sz-Sz-B | 53.8 | 65.8
Bács-Kiskun | 60.7 | 65.0
Békés | 54.6 | 66.9
Csongrád | 63.4 | 67.3
Total | 65.8 | 70.5
Calculate the standard deviation of turnout rates in 1990 and in 2002.
The formula:
$$s_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}$$
First step: calculate the mean. Can we use the national turnout rates (65.8 and 70.5) as means?
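No. The national turnout rate is not equal to the mean of the county-specific turnout rates, because counties differ in their number of eligible voters (the national rate is a weighted mean). The unweighted mean of the 20 county-specific rates for 1990 is:
$$\bar{Y}_{1990} = \frac{71.2 + 63.3 + \dots + 63.4}{20} = \frac{1286.0}{20} = 64.3$$
The mean for 2002 is:
$$\bar{Y}_{2002} = \frac{77.5 + 70.6 + \dots + 67.3}{20} = \frac{1393.5}{20} \approx 69.7$$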
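After substitution into the formula, the standard deviation for 1990 is obtained as:
$$s_{1990} = \sqrt{\frac{(71.2 - 64.3)^2 + (63.3 - 64.3)^2 + \dots + (63.4 - 64.3)^2}{20}} = \sqrt{\frac{750.78}{20}} \approx 6.13$$
For 2002:
$$s_{2002} = \sqrt{\frac{(77.5 - 69.675)^2 + (70.6 - 69.675)^2 + \dots + (67.3 - 69.675)^2}{20}} = \sqrt{\frac{196.18}{20}} \approx 3.13$$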
Interpret the difference in the means and the standard deviations!
Compared to 1990, the mean county-specific turnout rate increased by about 5 percentage points by 2002. The standard deviation decreased by about half, which shows that county-specific turnout rates were more homogeneous in 2002.
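The calculation for the turnout example can be checked with Python's standard `statistics` module. In the sketch below the county figures are copied from the table above, the "Total" row is left out because it is not a county, and `statistics.pstdev` is used because it divides by n, matching the formula of this section. The variable names are illustrative.

```python
import statistics

# Turnout rates (%) by county, copied from the table: (1990, 2002).
turnout = {
    "Budapest": (71.2, 77.5),
    "Pest": (63.3, 70.6),
    "Fejér": (64.5, 69.6),
    "Komárom-Esztergom": (64.5, 71.0),
    "Veszprém": (70.9, 72.6),
    "Gy-M-S": (76.4, 73.9),
    "Vas": (76.8, 74.2),
    "Zala": (69.3, 70.7),
    "Baranya": (65.9, 71.8),
    "Somogy": (62.5, 68.0),
    "Tolna": (64.0, 68.5),
    "B-A-Z": (61.0, 68.0),
    "Heves": (65.3, 70.1),
    "Nógrád": (62.6, 69.3),
    "H-B": (56.3, 66.0),
    "J-N-Sz": (59.0, 66.7),
    "Sz-Sz-B": (53.8, 65.8),
    "Bács-Kiskun": (60.7, 65.0),
    "Békés": (54.6, 66.9),
    "Csongrád": (63.4, 67.3),
}

rates_1990 = [rates[0] for rates in turnout.values()]
rates_2002 = [rates[1] for rates in turnout.values()]

# Unweighted means of the county-specific rates.
mean_1990 = statistics.mean(rates_1990)
mean_2002 = statistics.mean(rates_2002)

# pstdev divides by n (the definition used in this section).
sd_1990 = statistics.pstdev(rates_1990)
sd_2002 = statistics.pstdev(rates_2002)

print(f"1990: mean = {mean_1990:.2f}, SD = {sd_1990:.2f}")
print(f"2002: mean = {mean_2002:.2f}, SD = {sd_2002:.2f}")
print(f"SD ratio 2002/1990 = {sd_2002 / sd_1990:.2f}")
```

The ratio printed in the last line is a quick way to see that the spread of county turnout rates in 2002 was roughly half of what it was in 1990.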
Remark
In some textbooks there is n-1 instead of n in the denominator of the above formulas. The choice between the two definitions depends on convention. The variance defined with n-1 is often called the sample variance; it has some desirable properties when used to estimate the population variance (in particular, it is an unbiased estimator). The population variance is always defined with n in the denominator.
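The two conventions can be compared directly in Python, whose standard `statistics` module offers both: `pvariance` divides by n and `variance` divides by n-1. The sketch below reuses the sample {1, 3, 8} from earlier in this section; the variable names are illustrative.

```python
import statistics

sample = [1.0, 3.0, 8.0]
n = len(sample)

var_n = statistics.pvariance(sample)         # sum of squared deviations / n
var_n_minus_1 = statistics.variance(sample)  # sum of squared deviations / (n - 1)

print(f"variance with n:     {var_n:.2f}")
print(f"variance with n - 1: {var_n_minus_1:.2f}")

# The two definitions differ only by the factor n / (n - 1),
# which approaches 1 as the sample size grows.
print(f"ratio: {var_n_minus_1 / var_n:.2f}  (n / (n - 1) = {n / (n - 1):.2f})")
```

For large samples the difference between the two definitions is negligible, so the choice matters mostly for small n.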