
SOCIAL STATISTICS

Renáta Németh, Dávid Simon

ELTE

When should we not use correlation and linear regression?

  • when the relationship is not linear (as seen below)

    In this graph the dependent variable clearly depends on the independent variable, yet linear regression would yield results similar to those for a case of independence, because the relationship is non-linear. The simplest remedy is to split the range of the independent variable into two parts within which the relationship is close to linear (0-50 and 50-100 in the above example).

  • when there are extreme cases in the sample

    In the above example 10 cases show independence, but one case is the odd one out, with extreme values on both the dependent and the independent variable. Linear regression will therefore indicate a strong relationship, even though about 90% of our cases show no relationship whatsoever.

    What we can do is ignore the (few) extreme cases, after analysing their other properties to find out what makes them so extreme. After this, linear regression should yield reliable results. Warning: we may ignore only a small number of cases (no more than about 10%), because dropping more might lure us into constructing an explanation just to support our preliminary hypothesis.

Advice: for variables with a high level of measurement (interval or ratio), always make a scatterplot to get a first impression of the data.
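
The two pitfalls above can be reproduced with a few lines of Python. This is only a sketch on made-up numbers, not the data behind the graphs: the inverted-U values, the ten "independent" cases, the extreme case at (50, 50) and the helper name pearson_r are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Case 1 (made-up data): an inverted U, rising up to x = 50 and falling after it.
x = list(range(0, 101))
y = [2500 - (v - 50) ** 2 for v in x]
r_whole = pearson_r(x, y)            # close to 0 despite the perfect dependence
r_lower = pearson_r(x[:51], y[:51])  # x from 0 to 50: strongly positive
r_upper = pearson_r(x[50:], y[50:])  # x from 50 to 100: strongly negative

# Case 2 (made-up data): ten unrelated cases plus one extreme case.
x10 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y10 = [5, 3, 6, 4, 5, 5, 4, 6, 3, 5]
r_without = pearson_r(x10, y10)             # close to 0
r_with = pearson_r(x10 + [50], y10 + [50])  # above 0.9, driven by one case

print("whole range:", round(r_whole, 3))
print("0-50:", round(r_lower, 3), " 50-100:", round(r_upper, 3))
print("without extreme case:", round(r_without, 3))
print("with extreme case:", round(r_with, 3))
```

In both cases a single correlation coefficient computed on all the data is misleading, which is why looking at the scatterplot first is worthwhile.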

Important

Linear regression has some mathematical and statistical prerequisites. Suffice it to say here that the dependent variable must follow a normal distribution (strictly speaking, its deviations from the regression line must) and that the standard deviation of the dependent variable must not depend on the value of the independent variable. These conditions must always be checked before doing linear regression.
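
As a rough illustration of the second condition, the sketch below fits a least-squares line and compares the spread of the deviations from the line for low and high values of the independent variable. The data are invented so that the scatter grows with x, and the helper fit_line is a hypothetical name. In practice a scatterplot of the deviations or a formal test would be used; this only shows the idea.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Made-up data: the scatter around the line grows as x grows.
x = list(range(1, 21))
y = [2 * v + 0.5 * v * (1 if v % 2 == 0 else -1) for v in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Spread of the deviations for the lower and the upper half of x.
sd_low = pstdev(residuals[:10])
sd_high = pstdev(residuals[10:])

print("slope:", round(b, 3))
print("spread for low x:", round(sd_low, 2))
print("spread for high x:", round(sd_high, 2))
```

A clearly larger spread in one half than in the other, as in this made-up example, suggests that the standard deviation of the dependent variable depends on the independent variable.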

Let’s check the conditions for linear regression in our data on age and income.

The graph tells us that

  • the curve suggests a non-linear relationship

  • the standard deviation of income increases until middle age and subsequently decreases

  • there are some highly extreme cases

  • moreover, income doesn’t follow a normal distribution (although the graph doesn’t actually show this)

The correct procedure would be to normalise the distribution of income, split up the age data, and look at the relationship within the different age groups.

Some more notes on regression

  • watch out for the unit of measurement: the regression coefficient depends on the units in which the variables are measured

  • several variables can be used as independent variables
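
The note on the unit of measurement can be illustrated with a short Python sketch. The ages and incomes are hypothetical, and slope_and_r is a helper written for this example. The same income is expressed once in single currency units and once in thousands; the slope changes by a factor of 1000, while the correlation stays the same.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Hypothetical data: age in years, income in single currency units.
age = [25, 30, 35, 40, 45, 50]
income_units = [150000, 180000, 210000, 260000, 250000, 300000]
# The same incomes expressed in thousands of units.
income_thousands = [v / 1000 for v in income_units]

slope_units, r_units = slope_and_r(age, income_units)
slope_thousands, r_thousands = slope_and_r(age, income_thousands)

print("slope (units per year):", round(slope_units, 1))
print("slope (thousands per year):", round(slope_thousands, 4))
print("correlation:", round(r_units, 4), round(r_thousands, 4))
```

A regression coefficient should therefore always be reported together with the units of both variables.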