Renáta Németh, Dávid Simon
ELTE
The (arithmetic) mean is what people call the “average”.
Appropriate for interval-ratio variables.
Let Y denote an interval-ratio variable. The mean is then:
$$\bar{y} = \frac{\sum y_{i}}{n}$$
where ȳ (y-bar) denotes the sample mean of Y, n is the sample size, Σ (sigma) is the summation sign in mathematics, which here denotes summing over all y-values, and y_{i} is the value of Y measured for the ith observation in the sample. Generally, lowercase letters denote sample values, while uppercase letters denote variables or population factors.
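The definition above (the sum of all observed values divided by the sample size) can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` is hypothetical, and the five income values are made up for the example, not taken from any data set.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# Hypothetical sample of five monthly incomes (illustrative values only)
y = [1200, 1500, 1800, 2100, 2400]
y_bar = sample_mean(y)
print(y_bar)  # 1800.0
```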
Example. ISSP 2006, Hungarian data. Mean of monthly net income by party preference.
| Party preference | Mean | n | Std. deviation |
|---|---|---|---|
| MDF | 224,050.00 | 10 | 198,666.730 |
| SZDSZ | 133,392.86 | 14 | 158,119.986 |
| FKGP | 57,166.67 | 6 | 11,214.574 |
| MSZP | 123,963.76 | 264 | 149,650.388 |
| FIDESZ | 125,898.94 | 231 | 158,621.847 |
| Munkáspárt | 75,400.00 | 6 | 34,556.620 |
| MIÉP | 165,433.50 | 8 | 207,676.491 |
| Other | 159,100.00 | 10 | 181,112.273 |
| Uncertain | 148,636.12 | 283 | 176,798.697 |
| Total | 134,243.96 | 832 | 162,816.877 |
Which party's supporters have the highest mean income? Which have the second highest? Which have the lowest?
It is important to note that the data on uncertain voters are also informative: they seem to have a higher mean income than voters with a definite party preference.
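The comparison between uncertain and certain voters can be checked from the table itself, because the mean of a combined group is the weighted mean of the subgroup means, with the subgroup sizes as weights. The sketch below is one way to do this in Python. The helper name `pooled_mean` is hypothetical, and the figures are the Mean and n columns copied from the table above.

```python
# (mean, n) for each group, copied from the table above
groups = {
    "MDF": (224050.00, 10),
    "SZDSZ": (133392.86, 14),
    "FKGP": (57166.67, 6),
    "MSZP": (123963.76, 264),
    "FIDESZ": (125898.94, 231),
    "Munkáspárt": (75400.00, 6),
    "MIÉP": (165433.50, 8),
    "Other": (159100.00, 10),
    "Uncertain": (148636.12, 283),
}


def pooled_mean(rows):
    """Weighted mean of group means: sum(n_i * mean_i) / sum(n_i)."""
    rows = list(rows)  # allow generators, since we iterate twice
    total_n = sum(n for _, n in rows)
    return sum(mean * n for mean, n in rows) / total_n


# All respondents: should reproduce the Total row (about 134,244)
total_mean = pooled_mean(groups.values())

# Respondents who named a party, i.e. everyone except "Uncertain"
certain_mean = pooled_mean(
    row for party, row in groups.items() if party != "Uncertain"
)
uncertain_mean = groups["Uncertain"][0]

print(f"total:     {total_mean:,.2f}")
print(f"certain:   {certain_mean:,.2f}")   # about 126,825
print(f"uncertain: {uncertain_mean:,.2f}")
```

The pooled mean of the certain voters comes out below the mean of the uncertain voters, which is the difference described above. Whether that difference holds in the population is a separate question of statistical inference.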
Comment I:
The data above are from a sample. Party-specific differences in mean income may arise simply due to sampling error, which is caused by observing a sample instead of the whole population (e.g. if, by chance, the sample happened to include the one MDF supporter with an extremely high income). The question arises whether the observed differences in mean income also hold for the population (in technical terms: whether they are statistically significant differences). Statistical inference, introduced in later courses, gives the answer to that question.
Comment II:
The mean considers only one feature of the distribution. A high mean income among MDF supporters does not necessarily imply that every MDF supporter has a high income (that is, that variability is low). As an extreme example, consider the case where only a few MDF supporters with extremely high incomes pull the mean up. That is, income may have a distribution with high variability among the MDF supporters. The standard deviation, a measure of the variability of the distribution, is shown in the fourth column of the table above. Standard deviation will be discussed in the next lecture.
Sensitivity to outliers (also called extremes).
Unlike with the mode or the median, every value enters into the calculation of the mean. Therefore the mean is sensitive to extremely high or extremely low values in the distribution.
Example: a) no outlier
| Y (monthly net income, $) | Sample frequency | Σy_{i} |
|---|---|---|
| 1,000 | 1 | 1,000 |
| 2,000 | 2 | 4,000 |
| 3,000 | 4 | 12,000 |
| 4,000 | 2 | 8,000 |
| 5,000 | 1 | 5,000 |
| Total | n=10 | Σy_{i}=30,000 |

ȳ = Σy_{i}/n = 30,000/10 = 3,000
b) one outlier
| Y (monthly net income, $) | Sample frequency | Σy_{i} |
|---|---|---|
| 1,000 | 1 | 1,000 |
| 2,000 | 2 | 4,000 |
| 3,000 | 4 | 12,000 |
| 4,000 | 2 | 8,000 |
| 35,000 | 1 | 35,000 |
| Total | n=10 | Σy_{i}=60,000 |

ȳ = Σy_{i}/n = 60,000/10 = 6,000
The income of only one person has changed, but the mean has doubled!
What are the medians in the above cases?
The median did not change, because it is not sensitive to outliers.
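Both examples can be reproduced with Python's standard `statistics` module. The sketch below expands each frequency table into a list of ten individual observations and then computes the mean and the median; the helper name `expand` is hypothetical.

```python
from statistics import mean, median


def expand(freq_table):
    """Turn a {value: frequency} table into a flat list of observations."""
    return [value for value, freq in freq_table.items() for _ in range(freq)]


# a) no outlier, b) one outlier: the frequency tables from the examples above
no_outlier = expand({1000: 1, 2000: 2, 3000: 4, 4000: 2, 5000: 1})
one_outlier = expand({1000: 1, 2000: 2, 3000: 4, 4000: 2, 35000: 1})

for label, sample in [("a) no outlier", no_outlier), ("b) one outlier", one_outlier)]:
    print(f"{label}: n={len(sample)}, mean={mean(sample):.0f}, median={median(sample):.0f}")
# a) no outlier: n=10, mean=3000, median=3000
# b) one outlier: n=10, mean=6000, median=3000
```

With ten observations the median is the average of the 5th and 6th ordered values, which are both 3,000 in each sample, so replacing the largest income leaves it unchanged.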