
## SOCIAL STATISTICS

Renáta Németh, Dávid Simon

ELTE

## Interpretation pitfall III: Simpson’s paradox

(Fictitious example)

Does factory X discriminate against Roma job applicants?

| New workers in 2002 | Factory X | Other factories |
|---|---|---|
| Roma workers | 108 | 1530 |
| Non-Roma workers | 123 | 1200 |

How to calculate?

Percentage of Roma workers among new workers:

in factory X: below 50% (108 < 123)

in the other factories: above 50% (1530 > 1200)
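The comparison above can be made exact by computing the shares themselves. The following is a minimal Python sketch; the dictionary layout and the helper name `roma_share` are chosen here for illustration only.

```python
# Counts of new workers, taken from the aggregate table above.
new_workers = {
    "Factory X": {"roma": 108, "non_roma": 123},
    "Other factories": {"roma": 1530, "non_roma": 1200},
}


def roma_share(counts):
    """Share of Roma workers among all new workers of one employer."""
    return counts["roma"] / (counts["roma"] + counts["non_roma"])


aggregate_shares = {name: roma_share(c) for name, c in new_workers.items()}
for name, share in aggregate_shares.items():
    print(f"{name}: {share:.1%}")
# Factory X: 46.8%
# Other factories: 56.0%
```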

However, the CEO of factory X provides the detailed data below:

| New workers in 2002 without secondary education | Factory X | Other factories |
|---|---|---|
| Roma workers | 51 | 1210 |
| Non-Roma workers | 23 | 630 |

| New workers in 2002 with secondary education | Factory X | Other factories |
|---|---|---|
| Roma workers | 57 | 320 |
| Non-Roma workers | 100 | 570 |

How can the CEO argue? How does she/he calculate?

According to the CEO: "At our company, the percentage of Roma among new workers is higher than at the other companies, both among those with and among those without secondary education."

Percentage of Roma among new workers without secondary education at factory X: 51/(51+23) = 69%,

at the other factories: 1210/(1210+630) = 66%;

while the percentage of Roma among new workers with secondary education at factory X: 57/(57+100) = 36.3%,

at the other factories: 320/(320+570) = 36.0%.
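The CEO's calculation, and the way it reverses once the two education groups are pooled, can be reproduced with a short Python sketch. The education labels follow the calculation in the text above; the names `strata`, `share` and `pooled` are illustrative assumptions.

```python
# (roma, non_roma) counts of new workers by education level and employer.
# Education labels follow the CEO's calculation in the text above.
strata = {
    "without secondary education": {
        "Factory X": (51, 23),
        "Other factories": (1210, 630),
    },
    "with secondary education": {
        "Factory X": (57, 100),
        "Other factories": (320, 570),
    },
}


def share(roma, non_roma):
    """Share of Roma workers in a group."""
    return roma / (roma + non_roma)


def pooled(employer):
    """Roma share for one employer after adding up both education levels."""
    roma = sum(strata[level][employer][0] for level in strata)
    non_roma = sum(strata[level][employer][1] for level in strata)
    return share(roma, non_roma)


# Within each education level, factory X has the higher Roma share ...
for level, by_employer in strata.items():
    x = share(*by_employer["Factory X"])
    others = share(*by_employer["Other factories"])
    print(f"{level}: X {x:.1%} vs. others {others:.1%}")

# ... but in the pooled data the comparison is reversed.
print(f"pooled: X {pooled('Factory X'):.1%} vs. others {pooled('Other factories'):.1%}")
```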

Why did the picture change after controlling for education?

The phenomenon is called Simpson’s paradox: a trend present in the whole group is reversed when the group is split into subgroups. It seems paradoxical, but it can be explained:

What is the difference between X and the other factories regarding education of workers? How does general educational level of Roma people differ from the education of non-Romas?

Why does the paradox emerge? Basically for two reasons. First, factory X offers jobs that require a higher educational level. Second, Roma people tend to have a lower educational level than the general population.

The aggregation was hiding a confounding variable: education.
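One way to see the role of the confounder is to write each employer's overall Roma share as a weighted average of the shares within the education groups, where the weights are the proportions of new workers in each group. The sketch below is an illustration under that framing; the function name `decompose` and the data layout are assumptions made here, and the education labels follow the calculation in the text.

```python
# (roma, non_roma) counts; education labels follow the calculation in the text.
counts = {
    "Factory X": {
        "without secondary": (51, 23),
        "with secondary": (57, 100),
    },
    "Other factories": {
        "without secondary": (1210, 630),
        "with secondary": (320, 570),
    },
}


def decompose(by_level):
    """Return (weights, shares, pooled share) for one employer."""
    total = sum(r + n for r, n in by_level.values())
    # Weight: proportion of the employer's new workers in each education group.
    weights = {lvl: (r + n) / total for lvl, (r, n) in by_level.items()}
    # Share: proportion of Roma workers within each education group.
    shares = {lvl: r / (r + n) for lvl, (r, n) in by_level.items()}
    # The pooled share is the weighted average of the group shares.
    pooled = sum(weights[lvl] * shares[lvl] for lvl in by_level)
    return weights, shares, pooled


results = {name: decompose(by_level) for name, by_level in counts.items()}
for name, (weights, shares, pooled) in results.items():
    print(name)
    for lvl in weights:
        print(f"  {lvl}: weight {weights[lvl]:.0%}, Roma share {shares[lvl]:.0%}")
    print(f"  pooled Roma share: {pooled:.1%}")
```

Factory X takes about 68% of its new workers from the group with secondary education, where the Roma share is low at every employer, while the other factories take about 67% from the group without secondary education. The different weights, not the within-group shares, produce the lower overall figure for factory X.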

One may go further, by entering a fourth variable, gender, into the analysis:

| New female workers in 2002 with secondary education | Factory X | Other factories |
|---|---|---|
| Roma workers | 49 | 250 |
| Non-Roma workers | 19 | 80 |

| New male workers in 2002 with secondary education | Factory X | Other factories |
|---|---|---|
| Roma workers | 8 | 70 |
| Non-Roma workers | 81 | 490 |

Roma are underrepresented at factory X among workers with secondary education, for both genders.

Percentage of Roma workers, among females:

Factory X: 49/(49+19)=72%

Other factories: 250/(250+80)=76%

Among males:

Factory X: 8/(8+81)=9%,

Other factories: 70/(70+490)=12.5%.

After entering a fourth variable into the analysis (that is, controlling for gender), the picture changes again.
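The second reversal can be checked the same way. The sketch below uses the two gender tables for workers with secondary education; the names `by_gender`, `share` and `pooled_share` are illustrative assumptions.

```python
# (roma, non_roma) counts among new workers with secondary education.
by_gender = {
    "female": {"Factory X": (49, 19), "Other factories": (250, 80)},
    "male": {"Factory X": (8, 81), "Other factories": (70, 490)},
}


def share(roma, non_roma):
    """Share of Roma workers in a group."""
    return roma / (roma + non_roma)


def pooled_share(employer):
    """Roma share for one employer after adding up both genders."""
    roma = sum(by_gender[g][employer][0] for g in by_gender)
    non_roma = sum(by_gender[g][employer][1] for g in by_gender)
    return share(roma, non_roma)


# Within each gender, factory X has the lower Roma share ...
for gender, by_employer in by_gender.items():
    x = share(*by_employer["Factory X"])
    others = share(*by_employer["Other factories"])
    print(f"{gender}: X {x:.1%} vs. others {others:.1%}")

# ... although with both genders pooled, factory X has the higher share.
print(f"pooled: X {pooled_share('Factory X'):.1%} "
      f"vs. others {pooled_share('Other factories'):.1%}")
```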

Lesson: the relationship between two variables might be hidden by a third variable, only to be revealed when the third variable is controlled.

(The example is from Alan Crowe’s homepage, where the same tables are presented in another story.)

The example showed what may happen to the relationship between two variables, when a third variable is introduced and subtables are constructed by dividing the first table. Some possible outcomes:

• The original relationship stays the same in each of the subtables.

• The original relationship disappears in each of the subtables.

• The original relationship is maintained in one of the subtables but not in the other.

• The relationship between two variables might be hidden by a third variable, only to be revealed when the third variable is introduced.

In sociology, Paul Lazarsfeld used the above logic to understand the relationship between two variables by controlling for the effect of a third (the "elaboration model").

## Lessons from the three interpretation pitfalls

The examples show both advantages and limitations of social statistics.

• The result of the analysis depends on which aspects we take into account (see Simpson’s paradox: education, gender).

• We should include all relevant aspects in the analysis.

• No statistical method can help us choose the relevant aspects (deciding which aspects are scientifically relevant requires practical rather than statistical knowledge).

• Statistical tools do not offer automated solutions; practical knowledge is always needed.

• Since the choice of relevant aspects cannot be totally objective, all results can only be interpreted within the framework of the particular model; but

• appropriate statistical tools provide much more effective and correct analysis than ad hoc approaches.

• Results can be manipulated by selecting aspects according to one’s own (economic, political etc.) interests.

• At first sight, each of the flawed interpretations above seemed plausible. The goal of this course is to build a routine for avoiding these pitfalls.

## Some words about quantitative and qualitative research

Is social statistics relevant for understanding social issues?

Common arguments against quantitative research:

• These tools cannot help us understand society; they say nothing about intentions or motivations

• Scope of the data is restricted (questionnaires are too short to be detailed enough)

• The analytical concepts are constructed by the researcher

• The observer cannot be independent of the phenomenon observed

Qualitative methods: aimed at data quality rather than data quantity, e.g. in-depth interviews, focus groups, participant observation, etc.

• Explicit constructivism (it says roughly that social phenomena are always the result of meaning-making activities of groups or individuals).

• Limitations: limited generalizability (can we arrive at a general conclusion about unemployed people based on a few interviews with unemployed persons?)

Suggestion for consensus:

• The two approaches can complement each other (the design of a questionnaire can be based on qualitative research and, vice versa, qualitative research might involve using text analysis software)

• Often the research question itself determines which approach to choose (exploring the motives and family background of drug addicts obviously requires a qualitative approach)

(Further reading: Qualitative and Quantitative Research: Conjunctions and Divergences)