SOCIAL STATISTICS

Renáta Németh, Dávid Simon

ELTE

Chapter 3 - Lecture 3

Topics

  • Frequency distributions for interval-ratio variables

  • Cumulative distribution

  • Rates

Frequency distributions for interval-ratio variables

A frequency distribution for nominal and ordinal level variables is simple to construct. List the categories and count the number of observations that fall into each category.

Example: marital status of the respondent (nominal)

Marital status        Frequency   Percentage
Married                     559         55.9
Widowed                     164         16.4
Divorced                    110         11.0
Unmarried partners           24          2.4
Single                      143         14.3
Total                      1000        100.0
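A minimal Python sketch of the construction described above: count the observations in each category, then convert the counts to percentages. The function names are illustrative and the tiny list of raw observations is hypothetical; the frequencies are those of the marital status table.

```python
from collections import Counter

def frequency_distribution(observations):
    """Count how many observations fall into each category."""
    return Counter(observations)

def percentage_distribution(frequencies):
    """Convert {category: frequency} to {category: percentage}, one decimal."""
    total = sum(frequencies.values())
    return {cat: round(100 * n / total, 1) for cat, n in frequencies.items()}

# Hypothetical raw observations, only to show the counting step.
raw = ["Married", "Single", "Married", "Widowed"]
small_example = frequency_distribution(raw)

# Frequencies copied from the marital status table.
marital_status = {
    "Married": 559,
    "Widowed": 164,
    "Divorced": 110,
    "Unmarried partners": 24,
    "Single": 143,
}
percentages = percentage_distribution(marital_status)

for category, n in marital_status.items():
    print(f"{category:<20}{n:>6}{percentages[category]:>8.1f}")
```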

Example: How close do you feel to your town/city? (ordinal)

                      Frequency   Percentage
Very close                  587         58.7
Close                       250         25.0
Not very close              102         10.2
Not close at all             60          6.0
Total                       999          100

Interval-ratio variables usually have a wide range of values, which makes simple frequency distributions very difficult to read.

Example: age of respondent

Age     Frequency   Percentage
18             13          1.3
19             13          1.3
20             17          1.7
21             12          1.2
22             11          1.1
23             13          1.3
24             17          1.7
25              8          0.8
26             31          3.1
27             13          1.3
28             16          1.6
29             15          1.5
30             15          1.5
31             14          1.4
32             19          1.9
33             15          1.5
34             19          1.9
35             20          2.0
36             15          1.5
37             21          2.1
38             14          1.4
39             22          2.2
40             20          2.0
41             28          2.8
42             27          2.7
43             16          1.6
44             19          1.9
45             23          2.3
46             23          2.3
47             16          1.6
48             20          2.0
49             17          1.7
50             13          1.3
51             22          2.2
52             13          1.3
53             14          1.4
54             17          1.7
55             16          1.6
56             17          1.7
57             17          1.7
58             15          1.5
59              7          0.7
60             14          1.4
61             16          1.6
62             21          2.1
63             17          1.7
64             14          1.4
65             12          1.2
66             17          1.7
67             16          1.6
68             10          1.0
69             18          1.8
70             17          1.7
71             12          1.2
72             12          1.2
73             14          1.4
74              9          0.9
75              7          0.7
76              8          0.8
77              2          0.2
78             10          1.0
79              7          0.7
80              4          0.4
81              5          0.5
82              4          0.4
83              6          0.6
84              2          0.2
85              2          0.2
86              2          0.2
87              4          0.4
88              4          0.4
89              1          0.1
Total        1000        100.0

To make the distribution easier to read, the large number of different values can be reduced to a smaller number of groups (classes), each containing a range of values.

How to construct classes?

Two possible methods:

1. On a theoretical basis: class intervals depend on what makes sense for the purpose of the research

(e.g. age groups may be defined according to legal/economic/social age boundaries; child: 0–18, adult: 19–61, elderly: 62–)

2. Mathematical methods:

a) equal intervals (e.g. decades)

Age      Frequency   Percentage
-19             26          2.6
20-29          153         15.3
30-39          174         17.4
40-49          209         20.9
50-59          151         15.1
60-69          155         15.5
70+            132         13.2
Total         1000        100.0
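The equal-interval grouping can be sketched in a few lines of Python. The frequencies are copied from the single-year age table; the helper name `decade_label` is illustrative, and the treatment of the two open-ended classes ("-19" and "70+") simply mirrors the table above.

```python
from collections import Counter

# Frequencies for ages 18..89, copied from the age table (one row per decade).
AGE_FREQ = [
    13, 13,                                    # 18-19
    17, 12, 11, 13, 17, 8, 31, 13, 16, 15,     # 20-29
    15, 14, 19, 15, 19, 20, 15, 21, 14, 22,    # 30-39
    20, 28, 27, 16, 19, 23, 23, 16, 20, 17,    # 40-49
    13, 22, 13, 14, 17, 16, 17, 17, 15, 7,     # 50-59
    14, 16, 21, 17, 14, 12, 17, 16, 10, 18,    # 60-69
    17, 12, 12, 14, 9, 7, 8, 2, 10, 7,         # 70-79
    4, 5, 4, 6, 2, 2, 2, 4, 4, 1,              # 80-89
]
age_freq = dict(zip(range(18, 90), AGE_FREQ))

def decade_label(age):
    """Map an age to its class; the first and last classes are open-ended."""
    if age < 20:
        return "-19"
    if age >= 70:
        return "70+"
    low = age // 10 * 10
    return f"{low}-{low + 9}"

# Add up the single-year frequencies within each class.
classes = Counter()
for age, n in age_freq.items():
    classes[decade_label(age)] += n

total = sum(classes.values())
for label, n in classes.items():
    print(f"{label:<8}{n:>6}{100 * n / total:>8.1f}")
```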

b) equal class sizes (quantiles)

Age      Frequency   Percentage
18-31          208         20.8
32-41          193         19.3
42-52          209         20.9
53-65          197         19.7
66+            193         19.3
Total         1000        100.0

Terminology: quintiles (divided into 5), “the first (or lowest) quintile is 31” etc.

Quantiles can be computed with the help of the cumulative distribution.

Cumulative distribution

A cumulative frequency (percentage) distribution shows the frequencies (percentages) at or below each category of the variable.

For which levels of measurement is this meaningful?

Example (ISSP 2006):

“Do you think it should or should not be the government’s responsibility to provide a job for everyone who wants one?”

Response                    Frequency   Cumulative frequency   Percentage   Cumulative percentage
Definitely should be              516                    516         51.7                    51.7
Probably should be                389                    905         38.9                    90.6
Probably should not be             84                    989          8.4                    99.0
Definitely should not be           10                    999          1.0                   100.0
Total                             999                               100.0

It is easy to see

  • what percentage of the respondents think the government is responsible to some extent (90.6%),

  • what percentage of the respondents do not think that the government definitely should not be responsible (99.0%).
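A short Python sketch of how the cumulative columns of the ISSP table can be obtained. It assumes the categories are listed in their natural order; the cumulative percentages are computed from the cumulative frequencies rather than by adding up rounded percentages, which avoids accumulating rounding error.

```python
from itertools import accumulate

# Categories in their natural order, with frequencies from the ISSP 2006 table.
categories = [
    "Definitely should be",
    "Probably should be",
    "Probably should not be",
    "Definitely should not be",
]
frequencies = [516, 389, 84, 10]

total = sum(frequencies)
# Running total: number of cases at or below each category.
cumulative_freq = list(accumulate(frequencies))
# Cumulative percentage, rounded to one decimal place.
cumulative_pct = [round(100 * c / total, 1) for c in cumulative_freq]

for cat, f, cf, cp in zip(categories, frequencies, cumulative_freq, cumulative_pct):
    print(f"{cat:<26}{f:>5}{cf:>6}{cp:>8.1f}")
```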

Back to the quantiles.

Quantiles can be easily computed using the cumulative percentage distribution. For example, 20% of the observations are at or below the first quintile.

In some cases it is not obvious which threshold to choose as a quantile. See the cumulative distribution of age below: what is the first quintile here, 30 or 31?

Rule of thumb: choose the lowest category that has a cumulative percentage greater than 20%.

Following the rule, we choose 31 as the first quintile here.

There are more sophisticated alternative methods for selecting quantiles in such ambiguous cases; see, for example, Frankfort-Nachmias (1997).

Which values are the second, third and fourth quintiles?

Age     Frequency   Percentage   Cumulative percentage
18             13          1.3                     1.3
19             13          1.3                     2.6
20             17          1.7                     4.3
21             12          1.2                     5.5
22             11          1.1                     6.6
23             13          1.3                     7.9
24             17          1.7                     9.6
25              8          0.8                    10.4
26             31          3.1                    13.5
27             13          1.3                    14.8
28             16          1.6                    16.4
29             15          1.5                    17.9
30             15          1.5                    19.4
31             14          1.4                    20.8
32             19          1.9                    22.7
33             15          1.5                    24.2
34             19          1.9                    26.1
35             20          2.0                    28.1
36             15          1.5                    29.6
37             21          2.1                    31.7
38             14          1.4                    33.1
39             22          2.2                    35.3
40             20          2.0                    37.3
41             28          2.8                    40.1
42             27          2.7                    42.8
43             16          1.6                    44.4
44             19          1.9                    46.3
45             23          2.3                    48.6
46             23          2.3                    50.9
47             16          1.6                    52.5
48             20          2.0                    54.5
49             17          1.7                    56.2
50             13          1.3                    57.5
51             22          2.2                    59.7
52             13          1.3                    61.0
53             14          1.4                    62.4
54             17          1.7                    64.1
55             16          1.6                    65.7
56             17          1.7                    67.4
57             17          1.7                    69.1
58             15          1.5                    70.6
59              7          0.7                    71.3
60             14          1.4                    72.7
61             16          1.6                    74.3
62             21          2.1                    76.4
63             17          1.7                    78.1
64             14          1.4                    79.5
65             12          1.2                    80.7
66             17          1.7                    82.4
67             16          1.6                    84.0
68             10          1.0                    85.0
69             18          1.8                    86.8
70             17          1.7                    88.5
71             12          1.2                    89.7
72             12          1.2                    90.9
73             14          1.4                    92.3
74              9          0.9                    93.2
75              7          0.7                    93.9
76              8          0.8                    94.7
77              2          0.2                    94.9
78             10          1.0                    95.9
79              7          0.7                    96.6
80              4          0.4                    97.0
81              5          0.5                    97.5
82              4          0.4                    97.9
83              6          0.6                    98.5
84              2          0.2                    98.7
85              2          0.2                    98.9
86              2          0.2                    99.1
87              4          0.4                    99.5
88              4          0.4                    99.9
89              1          0.1                   100.0
Total        1000        100.0
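The rule of thumb stated before the table can be sketched in Python as follows. The function name `quantile_cut_points` is illustrative; the comparison is done on counts (c * k > i * total) instead of rounded percentages, which is an implementation choice rather than part of the rule. This is only the simple rule from the text, not one of the more sophisticated methods mentioned above.

```python
from itertools import accumulate

# Frequencies for ages 18..89, copied from the table above (one row per decade).
AGE_FREQ = [
    13, 13,                                    # 18-19
    17, 12, 11, 13, 17, 8, 31, 13, 16, 15,     # 20-29
    15, 14, 19, 15, 19, 20, 15, 21, 14, 22,    # 30-39
    20, 28, 27, 16, 19, 23, 23, 16, 20, 17,    # 40-49
    13, 22, 13, 14, 17, 16, 17, 17, 15, 7,     # 50-59
    14, 16, 21, 17, 14, 12, 17, 16, 10, 18,    # 60-69
    17, 12, 12, 14, 9, 7, 8, 2, 10, 7,         # 70-79
    4, 5, 4, 6, 2, 2, 2, 4, 4, 1,              # 80-89
]
AGES = list(range(18, 90))

def quantile_cut_points(values, freqs, k):
    """Return the k-1 cut points for k groups.

    The i-th cut point is the lowest value whose cumulative share
    is greater than i/k (the rule of thumb from the text).
    """
    total = sum(freqs)
    cumulative = list(accumulate(freqs))
    cuts = []
    for i in range(1, k):
        # c / total > i / k, rewritten with integers to avoid rounding issues
        cuts.append(next(v for v, c in zip(values, cumulative) if c * k > i * total))
    return cuts

quintiles = quantile_cut_points(AGES, AGE_FREQ, 5)
quartiles = quantile_cut_points(AGES, AGE_FREQ, 4)
print(quintiles)  # [31, 41, 52, 65]
print(quartiles)  # [34, 46, 62]
```

The quintile cut points agree with the class limits of the quintile table shown earlier (18-31, 32-41, 42-52, 53-65, 66+).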

Further examples of quantiles:

quartiles (divided into 4):

Age      Frequency   Percentage
18-34          261         26.1
35-46          248         24.8
47-62          255         25.5
63+            236         23.6
Total         1000        100.0

deciles (10):

Age      Frequency   Percentage
18-25          104         10.4
26-31          104         10.4
32-37          109         10.9
...
73+             91          9.1
Total         1000        100.0

terciles (or tertiles) (3):

Age      Frequency   Percentage
18-39          353         35.3
40-56          321         32.1
57+            326         32.6
Total         1000        100.0

percentiles (100)

The 25th percentile is the lowest quartile; the 30th percentile is the third decile etc.

median (the 50th percentile)

see Section Median

Application: comparing two frequency distributions

During industrialization, the age structure changed radically:

  • life expectancy increased,

  • infant mortality decreased, while

  • birth rate decreased.

Based on the age terciles below, try to find out which country is developed and which is developing.

[Figure: age terciles of the two countries]

For another example of the application of quantiles see Section Decile ratio.

What value to assign to a class?

This is a frequent problem in research practice.

Example: in income questions, respondents are often asked to identify an interval rather than a single precise value.

What is your monthly net income?

Response categories:

Less than 100,000 Ft

100,000 to 200,000 Ft

200,001 to 350,000 Ft

350,001 to 600,000 Ft

More than 600,000 Ft

What are the advantages of this form of question?

  • income is a sensitive topic, associated with high non-response; this form is less sensitive

  • many people do not know their precise net income

If we want to treat the variable as interval-ratio, we have to assign a value to each category (for example, in order to compute total household income).

A possible solution is the middle of the interval:

Response category          Assigned value
Less than 100,000 Ft            50,000 Ft
100,000 to 200,000 Ft          150,000 Ft
200,001 to 350,000 Ft          275,000 Ft
350,001 to 600,000 Ft          475,000 Ft
More than 600,000 Ft                    ?

The upper limit of the last interval is not known; it may be estimated from external data sources.
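The midpoint assignment can be sketched in Python as below. Two assumptions are made that the question itself does not state: the lowest class is taken to start at 0 Ft (which is what the assigned value of 50,000 Ft implies), and the open-ended top class is represented with an upper limit of None, so that no value is invented for it.

```python
# (lower limit, upper limit) in Ft; None marks the unknown upper limit.
# The lower limit 0 for the first class is an assumption.
INCOME_CLASSES = {
    "Less than 100,000 Ft": (0, 100_000),
    "100,000 to 200,000 Ft": (100_000, 200_000),
    "200,001 to 350,000 Ft": (200_001, 350_000),
    "350,001 to 600,000 Ft": (350_001, 600_000),
    "More than 600,000 Ft": (600_001, None),
}

def class_midpoint(lower, upper):
    """Return the middle of a class, or None if the class is open-ended."""
    if upper is None:
        return None  # needs an estimate from an external data source
    return (lower + upper) / 2

midpoints = {label: class_midpoint(lo, hi) for label, (lo, hi) in INCOME_CLASSES.items()}

for label, value in midpoints.items():
    print(label, "->", "unknown" if value is None else f"{value:,.0f} Ft")
```

The exact midpoints of the third and fourth classes are 275,000.5 and 475,000.5 Ft; the table rounds them to 275,000 and 475,000 Ft.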

Rates

Terms such as birth rate or unemployment rate are often used by social scientists.

A rate is a number obtained by dividing the number of cases (births, unemployed persons, etc.) by the size of the total population.

  • The numerator and the denominator are measured in the same time period (most frequently a year).

  • Rates can also be calculated for a more narrowly defined subpopulation, e.g. the unemployment rate within the labor force (employed + unemployed persons).

For further application examples see the lecture about social indicators.

Example:

In 1989, the number of sick-pay days per worker was about 25:

number of sick-pay days in 1989 (101.8 million) / number of entitled persons in 1989 (4.064 million) ≈ 25
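As a small worked example in Python, the definition of a rate can be written as a function and applied to the sick-pay figures. The function name `rate` and its `per` parameter are illustrative choices.

```python
def rate(cases, population, per=1):
    """Number of cases divided by population size, optionally scaled by `per`."""
    if population <= 0:
        raise ValueError("population must be positive")
    return per * cases / population

# Figures from the example: both refer to the same period (the year 1989).
sick_pay_days_1989 = 101.8e6
entitled_persons_1989 = 4.064e6

days_per_worker = rate(sick_pay_days_1989, entitled_persons_1989)
print(round(days_per_worker, 2))
```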

Advantages: by controlling for differences in population size, rates make it possible to compare

  • different time points (trends) and

  • different populations.

Example:

When comparing the social security expenditures of two countries, simply contrasting the numbers of sick-pay days does not yield a valid comparison, because the number of entitled persons may be different.

Similar example: per capita GDP

Rates are often expressed as rates per thousand or hundred thousand to make the numbers easier to interpret.

For example, the suicide rate in Hungary in 2002 was 28 per 100,000 persons, which is easier to read than 0.00028 suicides per person.

Again: when comparing two regions with regard to suicidal tendencies, contrasting the numbers of suicides does not yield a valid comparison because of the different population sizes. However, the number of suicides per 100,000 persons is a meaningful indicator. E.g. in 2002, the suicide rate was 38.5 in the Southern Great Plain region of Hungary, while the country’s overall rate was 28.
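A brief Python sketch of the scaling and of the comparison it makes possible. The two rates are the 2002 figures quoted above; the helper names are illustrative.

```python
PER = 100_000  # rates below are expressed per 100,000 persons

def to_per_person(rate_per_100k):
    """Convert a rate per 100,000 persons to a rate per person."""
    return rate_per_100k / PER

def rate_ratio(rate_a, rate_b):
    """How many times higher rate_a is than rate_b (same scale for both)."""
    return rate_a / rate_b

hungary_2002 = 28.0
southern_great_plain_2002 = 38.5

per_person = to_per_person(hungary_2002)
ratio = rate_ratio(southern_great_plain_2002, hungary_2002)

print(per_person)  # the hard-to-read per-person form
print(ratio)       # regional rate relative to the national rate
```

Because both rates are on the same per-100,000 scale, their ratio does not depend on the population sizes of the region and the country.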

(Remark: Suicide as a cultural/sociological phenomenon. In southern and south-eastern districts of the Hungarian Plain, the suicide rate has been 2-3 times higher for 135 years than in the western and north-western areas of the country.)

Rates are computed from population data (based on official data sources such as censuses) rather than sample data. Such information is regularly reported by national bureaus of statistics.

Two healthcare indicators:

  • indicator A: number of GPs per 100,000 inhabitants

  • indicator B: number of patients per GP

What does an increase in indicator A / in indicator B imply?

Example: Hungarian city crime ranking

Do the data yield a valid comparison? (Source: Unified System of Criminal Statistics of the Investigative Authorities and of Public Prosecution, 2008)

[Figure: Hungarian city crime ranking]

No! The best ranked Pilis has 11,000 inhabitants, while the second best ranked Ózd has 38,000.

Such inadequate indicators are sometimes reported in the media.

However, the indicator below is better defined. Why?

In addition to the crime rate, the number of crimes is also reported here. Why could it be informative?

Ranking   City                 Crimes per 10,000 inhabitants   Total number of crimes
1.        Lengyeltóti                                    181                       61
2.        Tiszalök                                       168                       99
3.        Nyékládháza                                    154                       76
4.        Siófok                                         145                      349
5.        Harkány                                        128                       49
6.        Vásárosnamény                                  119                      107
7.        Jászberény                                     118                      320
...
10.       Hajdúsámson                                    113                      142
...
19.       Ózd                                             94                      341
...
23.       Komló                                           83                      217
...
27.       Szigetszentmiklós                               76                      233
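One possible reading of why the total number of crimes is informative, sketched in Python with a few rows of the table: together with the rate, the count reveals the approximate size of the city, and in a small city a single additional crime shifts the rate noticeably. The function names are illustrative, and the implied populations are only approximate because the published rates are rounded.

```python
# (city, crimes per 10,000 inhabitants, total number of crimes), from the table.
CRIME_TABLE = [
    ("Lengyeltóti", 181, 61),
    ("Siófok", 145, 349),
    ("Jászberény", 118, 320),
    ("Ózd", 94, 341),
]

def implied_population(rate_per_10k, crimes):
    """Population size implied by a rate per 10,000 and a case count."""
    return crimes / rate_per_10k * 10_000

def rate_change_per_extra_crime(rate_per_10k, crimes):
    """Increase in the rate per 10,000 caused by one additional crime."""
    return rate_per_10k / crimes

for city, crime_rate, crimes in CRIME_TABLE:
    population = implied_population(crime_rate, crimes)
    sensitivity = rate_change_per_extra_crime(crime_rate, crimes)
    print(f"{city:<14}{population:>10.0f}{sensitivity:>8.2f}")
```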

What information is most important when interpreting statistical data?

1. When were the data collected?

2. What is the research population?

3. If sample data:

a) Method of sampling?

b) Sample size?

c) Nonresponse rate?

4. Exact definition of variables? If table: what are the row and column headings?