
## SOCIAL STATISTICS

Renáta Németh, Dávid Simon

ELTE

## Chapter 3 - Lecture 3

Topics

• Frequency distributions for interval-ratio variables

• Cumulative distribution

• Rates

## Frequency distributions for interval-ratio variables

A frequency distribution for nominal and ordinal level variables is simple to construct. List the categories and count the number of observations that fall into each category.

Example: marital status of the respondent (nominal)

| Marital status | Frequency | Percentage |
|---|---|---|
| Married | 559 | 55.9 |
| Widowed | 164 | 16.4 |
| Divorced | 110 | 11.0 |
| Unmarried partners | 24 | 2.4 |
| Single | 143 | 14.3 |
| Total | 1000 | 100.0 |

How close do you feel to your town/city? (ordinal)

| Response | Frequency | Percentage |
|---|---|---|
| Very close | 587 | 58.7 |
| Close | 250 | 25.0 |
| Not very close | 102 | 10.2 |
| Not close at all | 60 | 6.0 |
| Total | 999 | 100 |
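The counting step described above can be sketched in Python. The `responses` list is a small hypothetical sample (not the survey data in the tables), and the helper name `frequency_distribution` is only illustrative.

```python
from collections import Counter

# Hypothetical raw answers to a marital-status question (illustration only).
responses = [
    "Married", "Single", "Married", "Widowed", "Married",
    "Divorced", "Single", "Married", "Widowed", "Married",
]


def frequency_distribution(values):
    """Return {category: (frequency, percentage)} for a list of observations."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (freq, round(100 * freq / total, 1))
        for category, freq in counts.items()
    }


distribution = frequency_distribution(responses)
for category, (freq, pct) in distribution.items():
    print(f"{category:<10}{freq:>4}{pct:>7.1f}")
print(f"{'Total':<10}{len(responses):>4}{100.0:>7.1f}")
```

The same function works for ordinal variables; only the order in which the categories are listed matters there.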

Interval-ratio variables usually have a wide range of values, which makes simple frequency distributions very difficult to read.

Example: age of respondent

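| Age | Frequency | Percentage |
|---|---|---|
| 18 | 13 | 1.3 |
| 19 | 13 | 1.3 |
| 20 | 17 | 1.7 |
| 21 | 12 | 1.2 |
| 22 | 11 | 1.1 |
| 23 | 13 | 1.3 |
| 24 | 17 | 1.7 |
| 25 | 8 | 0.8 |
| 26 | 31 | 3.1 |
| 27 | 13 | 1.3 |
| 28 | 16 | 1.6 |
| 29 | 15 | 1.5 |
| 30 | 15 | 1.5 |
| 31 | 14 | 1.4 |
| 32 | 19 | 1.9 |
| 33 | 15 | 1.5 |
| 34 | 19 | 1.9 |
| 35 | 20 | 2.0 |
| 36 | 15 | 1.5 |
| 37 | 21 | 2.1 |
| 38 | 14 | 1.4 |
| 39 | 22 | 2.2 |
| 40 | 20 | 2.0 |
| 41 | 28 | 2.8 |
| 42 | 27 | 2.7 |
| 43 | 16 | 1.6 |
| 44 | 19 | 1.9 |
| 45 | 23 | 2.3 |
| 46 | 23 | 2.3 |
| 47 | 16 | 1.6 |
| 48 | 20 | 2.0 |
| 49 | 17 | 1.7 |
| 50 | 13 | 1.3 |
| 51 | 22 | 2.2 |
| 52 | 13 | 1.3 |
| 53 | 14 | 1.4 |
| 54 | 17 | 1.7 |
| 55 | 16 | 1.6 |
| 56 | 17 | 1.7 |
| 57 | 17 | 1.7 |
| 58 | 15 | 1.5 |
| 59 | 7 | 0.7 |
| 60 | 14 | 1.4 |
| 61 | 16 | 1.6 |
| 62 | 21 | 2.1 |
| 63 | 17 | 1.7 |
| 64 | 14 | 1.4 |
| 65 | 12 | 1.2 |
| 66 | 17 | 1.7 |
| 67 | 16 | 1.6 |
| 68 | 10 | 1.0 |
| 69 | 18 | 1.8 |
| 70 | 17 | 1.7 |
| 71 | 12 | 1.2 |
| 72 | 12 | 1.2 |
| 73 | 14 | 1.4 |
| 74 | 9 | 0.9 |
| 75 | 7 | 0.7 |
| 76 | 8 | 0.8 |
| 77 | 2 | 0.2 |
| 78 | 10 | 1.0 |
| 79 | 7 | 0.7 |
| 80 | 4 | 0.4 |
| 81 | 5 | 0.5 |
| 82 | 4 | 0.4 |
| 83 | 6 | 0.6 |
| 84 | 2 | 0.2 |
| 85 | 2 | 0.2 |
| 86 | 2 | 0.2 |
| 87 | 4 | 0.4 |
| 88 | 4 | 0.4 |
| 89 | 1 | 0.1 |
| Total | 1000 | 100.0 |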

For easier reading, the large number of distinct values can be reduced to a smaller number of groups (classes), each containing a range of values.

How to construct classes?

Two possible methods:

1. On a theoretical basis: class intervals depend on what makes sense for the purpose of the research

(e.g. age groups may be defined according to legal/economic/social age boundaries; child: 0–18, adult: 19–61, elderly: 62–)

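2. Mathematical methods:

a) equal class widths (here: 10 years)

| Age | Frequency | Percentage |
|---|---|---|
| -19 | 26 | 2.6 |
| 20-29 | 153 | 15.3 |
| 30-39 | 174 | 17.4 |
| 40-49 | 209 | 20.9 |
| 50-59 | 151 | 15.1 |
| 60-69 | 155 | 15.5 |
| 70+ | 132 | 13.2 |
| Total | 1000 | 100.0 |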

b) equal class sizes (quantiles)

| Age | Frequency | Percentage |
|---|---|---|
| 18-31 | 208 | 20.8 |
| 32-41 | 193 | 19.3 |
| 42-52 | 209 | 20.9 |
| 53-65 | 197 | 19.7 |
| 66+ | 193 | 19.3 |
| Total | 1000 | 100.0 |

Terminology: quintiles (divided into 5), “the first (or lowest) quintile is 31”, etc.

Quantiles can be computed with the help of the cumulative distribution.
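Grouping by equal class widths (10-year age classes, as in the first of the two tables above) can be sketched as follows. The `ages` list is hypothetical and the function name `decade_class` is an illustrative choice; the open-ended lowest and highest classes mirror the "-19" and "70+" rows of the table.

```python
from collections import Counter

# Hypothetical ages (illustration only, not the survey data).
ages = [18, 23, 27, 31, 35, 38, 42, 45, 49, 51, 58, 63, 67, 72, 80]


def decade_class(age):
    """Map an age to a 10-year class, with open-ended lowest and highest classes."""
    if age <= 19:
        return "-19"
    if age >= 70:
        return "70+"
    lower = age // 10 * 10
    return f"{lower}-{lower + 9}"


class_counts = Counter(decade_class(age) for age in ages)
for label, freq in class_counts.items():
    print(f"{label:<6}{freq:>3}{100 * freq / len(ages):>7.1f}")
```

With equal widths the class frequencies generally differ; with quantiles the widths differ and the frequencies are roughly equal.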

## Cumulative distribution

A cumulative frequency (percentage) distribution shows the frequencies (percentages) at or below each category of the variable.

For which levels of measurement is this meaningful?

Example (ISSP 2006):

“Do you think it should or should not be the government’s responsibility to provide a job for everyone who wants one?”

| Response | Frequency | Cumulative frequency | Percentage | Cumulative percentage |
|---|---|---|---|---|
| Definitely should be | 516 | 516 | 51.7 | 51.7 |
| Probably should be | 389 | 905 | 38.9 | 90.6 |
| Probably should not be | 84 | 989 | 8.4 | 99.0 |
| Definitely should not be | 10 | 999 | 1.0 | 100.0 |
| Total | 999 | | 100.0 | |

It is easy to see

• what percentage of the respondents think the government is responsible to some extent (90.6%),

• what percentage of the respondents do not think that the government definitely should not be responsible (99.0%).
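The cumulative columns of the ISSP table above can be reproduced with a running sum. This is a minimal sketch; the variable names are illustrative, and the frequencies are copied from the table.

```python
from itertools import accumulate

# Frequencies from the ISSP 2006 table above.
categories = [
    "Definitely should be",
    "Probably should be",
    "Probably should not be",
    "Definitely should not be",
]
frequencies = [516, 389, 84, 10]

total = sum(frequencies)
# Running sum: each entry counts the cases at or below that category.
cumulative_freq = list(accumulate(frequencies))
cumulative_pct = [round(100 * c / total, 1) for c in cumulative_freq]

for name, f, cf, cp in zip(categories, frequencies, cumulative_freq, cumulative_pct):
    print(f"{name:<26}{f:>5}{cf:>5}{100 * f / total:>7.1f}{cp:>7.1f}")
```

The cumulative percentages are computed from the cumulative frequencies rather than by adding up rounded percentages, which avoids accumulating rounding error.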

Back to the quantiles.

Quantiles can be easily computed using the cumulative percentage distribution. For example, 20% of the observations are at or below the first quintile.

In some cases it is not obvious which threshold to choose as a quantile; see the cumulative distribution of age below. What is the first quintile here? 30 or 31?

Rule of thumb: choose the lowest category that has a cumulative percentage greater than 20%.

Following the rule, let us choose 31 as the first quintile here.

There are more sophisticated alternative methods for selecting quantiles in such ambiguous cases; see, for example, Frankfort-Nachmias (1997).

Which values are the second, third and fourth quintiles?

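| Age | Frequency | Percentage | Cumulative percentage |
|---|---|---|---|
| 18 | 13 | 1.3 | 1.3 |
| 19 | 13 | 1.3 | 2.6 |
| 20 | 17 | 1.7 | 4.3 |
| 21 | 12 | 1.2 | 5.5 |
| 22 | 11 | 1.1 | 6.6 |
| 23 | 13 | 1.3 | 7.9 |
| 24 | 17 | 1.7 | 9.6 |
| 25 | 8 | 0.8 | 10.4 |
| 26 | 31 | 3.1 | 13.5 |
| 27 | 13 | 1.3 | 14.8 |
| 28 | 16 | 1.6 | 16.4 |
| 29 | 15 | 1.5 | 17.9 |
| 30 | 15 | 1.5 | 19.4 |
| 31 | 14 | 1.4 | 20.8 |
| 32 | 19 | 1.9 | 22.7 |
| 33 | 15 | 1.5 | 24.2 |
| 34 | 19 | 1.9 | 26.1 |
| 35 | 20 | 2.0 | 28.1 |
| 36 | 15 | 1.5 | 29.6 |
| 37 | 21 | 2.1 | 31.7 |
| 38 | 14 | 1.4 | 33.1 |
| 39 | 22 | 2.2 | 35.3 |
| 40 | 20 | 2.0 | 37.3 |
| 41 | 28 | 2.8 | 40.1 |
| 42 | 27 | 2.7 | 42.8 |
| 43 | 16 | 1.6 | 44.4 |
| 44 | 19 | 1.9 | 46.3 |
| 45 | 23 | 2.3 | 48.6 |
| 46 | 23 | 2.3 | 50.9 |
| 47 | 16 | 1.6 | 52.5 |
| 48 | 20 | 2.0 | 54.5 |
| 49 | 17 | 1.7 | 56.2 |
| 50 | 13 | 1.3 | 57.5 |
| 51 | 22 | 2.2 | 59.7 |
| 52 | 13 | 1.3 | 61.0 |
| 53 | 14 | 1.4 | 62.4 |
| 54 | 17 | 1.7 | 64.1 |
| 55 | 16 | 1.6 | 65.7 |
| 56 | 17 | 1.7 | 67.4 |
| 57 | 17 | 1.7 | 69.1 |
| 58 | 15 | 1.5 | 70.6 |
| 59 | 7 | 0.7 | 71.3 |
| 60 | 14 | 1.4 | 72.7 |
| 61 | 16 | 1.6 | 74.3 |
| 62 | 21 | 2.1 | 76.4 |
| 63 | 17 | 1.7 | 78.1 |
| 64 | 14 | 1.4 | 79.5 |
| 65 | 12 | 1.2 | 80.7 |
| 66 | 17 | 1.7 | 82.4 |
| 67 | 16 | 1.6 | 84.0 |
| 68 | 10 | 1.0 | 85.0 |
| 69 | 18 | 1.8 | 86.8 |
| 70 | 17 | 1.7 | 88.5 |
| 71 | 12 | 1.2 | 89.7 |
| 72 | 12 | 1.2 | 90.9 |
| 73 | 14 | 1.4 | 92.3 |
| 74 | 9 | 0.9 | 93.2 |
| 75 | 7 | 0.7 | 93.9 |
| 76 | 8 | 0.8 | 94.7 |
| 77 | 2 | 0.2 | 94.9 |
| 78 | 10 | 1.0 | 95.9 |
| 79 | 7 | 0.7 | 96.6 |
| 80 | 4 | 0.4 | 97.0 |
| 81 | 5 | 0.5 | 97.5 |
| 82 | 4 | 0.4 | 97.9 |
| 83 | 6 | 0.6 | 98.5 |
| 84 | 2 | 0.2 | 98.7 |
| 85 | 2 | 0.2 | 98.9 |
| 86 | 2 | 0.2 | 99.1 |
| 87 | 4 | 0.4 | 99.5 |
| 88 | 4 | 0.4 | 99.9 |
| 89 | 1 | 0.1 | 100.0 |
| Total | 1000 | 100.0 | |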
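The rule of thumb can be applied mechanically to the age table. The sketch below is one possible implementation, not the only way to define quantiles: the function name `quantile_thresholds` is hypothetical, and the frequencies are copied from the table above. It works with cumulative counts instead of rounded percentages, so that "greater than 20%" is tested exactly.

```python
from itertools import accumulate

# Age frequencies for ages 18..89, copied from the table above.
ages = list(range(18, 90))
frequencies = [
    13, 13, 17, 12, 11, 13, 17, 8, 31, 13, 16, 15,   # 18-29
    15, 14, 19, 15, 19, 20, 15, 21, 14, 22,           # 30-39
    20, 28, 27, 16, 19, 23, 23, 16, 20, 17,           # 40-49
    13, 22, 13, 14, 17, 16, 17, 17, 15, 7,            # 50-59
    14, 16, 21, 17, 14, 12, 17, 16, 10, 18,           # 60-69
    17, 12, 12, 14, 9, 7, 8, 2, 10, 7,                # 70-79
    4, 5, 4, 6, 2, 2, 2, 4, 4, 1,                     # 80-89
]


def quantile_thresholds(values, freqs, k):
    """Upper thresholds of the first k-1 of k quantile classes.

    Rule of thumb from the text: take the lowest value whose cumulative
    percentage is greater than i/k (for i = 1, ..., k-1).
    """
    total = sum(freqs)
    cumulative = list(accumulate(freqs))
    thresholds = []
    for i in range(1, k):
        # Integer comparison (cum * k > total * i) avoids rounding problems.
        threshold = next(
            v for v, cum in zip(values, cumulative) if cum * k > total * i
        )
        thresholds.append(threshold)
    return thresholds


print(quantile_thresholds(ages, frequencies, 5))  # quintiles
print(quantile_thresholds(ages, frequencies, 4))  # quartiles
```

For these data the quintile call returns 31, 41, 52 and 65, which matches the class boundaries 18-31, 32-41, 42-52, 53-65, 66+ used in the quintile table earlier.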

Further examples of quantiles:

quartiles (divided into 4):

| Age | Frequency | Percentage |
|---|---|---|
| 18-34 | 261 | 26.1 |
| 35-46 | 248 | 24.8 |
| 47-62 | 255 | 25.5 |
| 63+ | 236 | 23.6 |
| Total | 1000 | 100.0 |

deciles (10):

| Age | Frequency | Percentage |
|---|---|---|
| 18-25 | 104 | 10.4 |
| 26-31 | 104 | 10.4 |
| 32-37 | 109 | 10.9 |
| ... | … | … |
| 73+ | 91 | 9.1 |
| Total | 1000 | 100.0 |

terciles (or tertiles) (3):

| Age | Frequency | Percentage |
|---|---|---|
| 18-39 | 353 | 35.3 |
| 40-56 | 321 | 32.1 |
| 57+ | 326 | 32.6 |
| Total | 1000 | 100.0 |

percentiles (100)

The 25th percentile is the first (lowest) quartile; the 30th percentile is the third decile, etc.

median (the 50th percentile)

See the section Median.

Application: comparing two frequency distributions

During industrialization, the age structure changed radically:

• life expectancy increased,

• infant mortality decreased, while

• birth rate decreased.

Based on the age terciles below, try to decide which country is developed and which is developing.

For another example of the application of quantiles see Section Decile ratio.

What value to assign to a class?

A frequent problem in research practice.

Example: in income questions, respondents are often asked to identify an interval rather than a single precise value.

What is your monthly net income?

Response categories:

Less than 100,000 Ft

100,000 to 200,000 Ft

200,001 to 350,000 Ft

350,001 to 600,000 Ft

More than 600,000 Ft

What are the advantages of this form of question?

• income is a sensitive topic, associated with high non-response; this form is less sensitive

• many people do not know their precise net income

If we want to treat the variable as interval-ratio, we have to assign a value to each category (for example, in order to compute total household income).

A possible solution is the middle of the interval:

| Response category | Assigned value |
|---|---|
| Less than 100,000 Ft | 50,000 Ft |
| 100,000 to 200,000 Ft | 150,000 Ft |
| 200,001 to 350,000 Ft | 275,000 Ft |
| 350,001 to 600,000 Ft | 475,000 Ft |
| More than 600,000 Ft | ? |

The upper limit of the last interval is not known; a value for this class may be estimated from external data sources.
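The midpoint assignment can be written down as a short sketch. Two assumptions are made here that the question itself does not state: the lowest class is taken to start at 0 Ft, and midpoints are rounded down to a whole forint. The open-ended top class is deliberately left without a value.

```python
# Income classes from the question above, as (lower, upper) bounds in Ft.
# Assumptions: the lowest class starts at 0; the top class has no upper bound (None).
income_classes = [
    (0, 100_000),
    (100_000, 200_000),
    (200_001, 350_000),
    (350_001, 600_000),
    (600_001, None),
]


def class_midpoint(lower, upper):
    """Midpoint of a closed class, rounded down to a whole forint; None if open-ended."""
    if upper is None:
        return None
    return (lower + upper) // 2


midpoints = [class_midpoint(lo, hi) for lo, hi in income_classes]
print(midpoints)
```

Returning `None` for the top class makes the missing information explicit, so that a later computation cannot silently use an invented value.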

## Rates

Terms such as birth rate or unemployment rate are often used by social scientists.

A rate is a number obtained by dividing the number of cases (births, unemployed persons, etc.) by the size of the total population.

• The numerator and the denominator are measured in the same time period (most frequently in a year).

• Rates can be calculated on a more narrowly defined subpopulation, e.g. the unemployment rate within the labor force (employed + unemployed persons).

For further application examples see the lecture about social indicators.

Example:

In 1989 the number of sick-pay days per worker was about 25:

number of sick-pay days in 1989 (101.8 million) / number of entitled persons in 1989 (4.064 million) ≈ 25

Using rates,

• different time points (trends) and

• different populations

can be compared by controlling for different population sizes.

Example:

When comparing the social security expenditures of two countries, simply contrasting the number of sick-pay days does not yield a valid comparison, because the number of entitled persons may be different.

Similar example: per capita GDP

Rates are often expressed as rates per thousand or hundred thousand to make the numbers easier to interpret.

For example, the suicide rate in Hungary in 2002 was 28 per 100,000 persons, instead of 0.00028 suicides per person.

Again: when comparing two regions with regard to suicidal tendencies, contrasting the number of suicides does not yield a valid comparison because of the different population sizes. However, the number of suicides per 100,000 persons is a meaningful indicator. E.g. in 2002 the suicide rate was 38.5 in the Southern Great Plain region of Hungary, while the country’s overall rate was 28.
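The difference between comparing counts and comparing rates can be illustrated with two invented regions. All numbers below are hypothetical and chosen only so that the two comparisons disagree; the helper name `per_100k` is likewise illustrative.

```python
# Hypothetical regions (illustration only): (number of suicides, population).
regions = {
    "Region A": (300, 1_000_000),
    "Region B": (150, 300_000),
}


def per_100k(cases, population):
    """Cases per 100,000 inhabitants."""
    return cases * 100_000 / population


rates = {name: per_100k(cases, pop) for name, (cases, pop) in regions.items()}

# Region with the larger raw count versus region with the larger rate.
most_cases = max(regions, key=lambda name: regions[name][0])
highest_rate = max(rates, key=rates.get)

print("Most cases:  ", most_cases)
print("Highest rate:", highest_rate)
for name, value in rates.items():
    print(f"{name}: {value:.1f} per 100,000")
```

Here the region with more cases has the lower rate, which is why the count alone would be misleading.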

(Remark: Suicide as a cultural/sociological phenomenon. In southern and south-eastern districts of the Hungarian Plain, the suicide rate has been 2-3 times higher for 135 years than in the western and north-western areas of the country.)

Rates are computed from population data (based on official data sources such as censuses) rather than sample data. Such information is regularly reported by national bureaus of statistics.

Two healthcare indicators:

• indicator A: number of GPs per 100,000 inhabitants

• indicator B: number of patients per GP

What does an increase in indicator A / in indicator B imply?

Example: Hungarian city crime ranking

Do the data yield a valid comparison? (Source: Unified System of Criminal Statistics of the Investigative Authorities and of Public Prosecution, 2008)

No! The best ranked Pilis has 11,000 inhabitants, while the second best ranked Ózd has 38,000.

Such inadequate indicators are sometimes reported in the media.

However, the indicator below is better defined. Why?

In addition to crime rate, number of crimes is also reported here. Why could it be informative?

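| Ranking | City | Crimes per 10,000 inhabitants | Total number of crimes |
|---|---|---|---|
| 1. | Lengyeltóti | 181 | 61 |
| 2. | Tiszalök | 168 | 99 |
| 3. | Nyékládháza | 154 | 76 |
| 4. | Siófok | 145 | 349 |
| 5. | Harkány | 128 | 49 |
| 6. | Vásárosnamény | 119 | 107 |
| 7. | Jászberény | 118 | 320 |
| ... | | | |
| 10. | Hajdúsámson | 113 | 142 |
| ... | | | |
| 19. | Ózd | 94 | 341 |
| ... | | | |
| 23. | Komló | 83 | 217 |
| ... | | | |
| 27. | Szigetszentmiklós | 76 | 233 |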
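One way to explore the two columns together is to rank the listed cities both ways and to back out the population that the two figures imply. This is only a sketch: the implied population is an approximation (the published rate is rounded), and the variable names are illustrative.

```python
# Rows listed in the table above: (city, crimes per 10,000 inhabitants, total crimes).
cities = [
    ("Lengyeltóti", 181, 61),
    ("Tiszalök", 168, 99),
    ("Nyékládháza", 154, 76),
    ("Siófok", 145, 349),
    ("Harkány", 128, 49),
    ("Vásárosnamény", 119, 107),
    ("Jászberény", 118, 320),
    ("Hajdúsámson", 113, 142),
    ("Ózd", 94, 341),
    ("Komló", 83, 217),
    ("Szigetszentmiklós", 76, 233),
]

by_rate = sorted(cities, key=lambda row: row[1], reverse=True)
by_count = sorted(cities, key=lambda row: row[2], reverse=True)

# Rough population implied by the two columns (approximate, since the rate is rounded).
implied_population = {
    city: round(total / rate * 10_000) for city, rate, total in cities
}

print("Highest rate:", by_rate[0][0])
print("Most crimes: ", by_count[0][0])
print("Implied population of Lengyeltóti:", implied_population["Lengyeltóti"])
```

Among the listed cities, the one with the highest rate is not the one with the most crimes, and a very high rate can rest on a small absolute number of cases in a small town.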

What information is most important when interpreting statistical data?

1. When were the data collected?

2. What is the research population?

3. If sample data:

a) Method of sampling?

b) Sample size?

c) Nonresponse rate?

4. Exact definition of variables? If table: what are the row and column headings?