Renáta Németh, Dávid Simon

ELTE

when the relationship is not linear (as seen below)

In this graph the dependent variable clearly depends on the independent variable, yet linear regression would yield results similar to those for two independent variables, because the relationship is non-linear. The simplest remedy is to split the range of the independent variable into two parts within which the relationship is close to linear (0-50 and 50-100 in the above example) and to examine each part separately.
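The split described above can be sketched in a few lines of Python. The data are hypothetical (an inverted U that peaks at 50), and `fit_line` is an illustrative helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: y rises until x = 50 and falls afterwards.
xs = list(range(0, 101))
ys = [x * (100 - x) / 25 for x in xs]

slope_all, _ = fit_line(xs, ys)              # whole range, 0-100
slope_low, _ = fit_line(xs[:51], ys[:51])    # lower part, 0-50
slope_high, _ = fit_line(xs[50:], ys[50:])   # upper part, 50-100

print("whole range:", round(slope_all, 3))
print("0-50:       ", round(slope_low, 3))
print("50-100:     ", round(slope_high, 3))
```

With these made-up values the slope over the whole range is essentially zero, as if the variables were independent, while the two halves give clear slopes of opposite sign (about +2 and -2).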

if there are extreme cases in the sample

In the above example 10 cases show independence, but one case is the odd one out, with extreme values on both the dependent and the independent variable. As a result, linear regression will indicate a strong relationship, even though in about 90% of our cases there is no relationship whatsoever.

What we can do is ignore the (few) extreme cases, after analysing their other properties to find out what makes them so extreme. Linear regression on the remaining cases should then yield reliable results. Warning: we may only ignore a small number of cases (not more than about 10%), because discarding more could tempt us into shaping the data to fit our preliminary hypothesis.
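The influence of a single extreme case can be illustrated with a small sketch. The numbers are invented for the illustration: ten cases constructed to have zero correlation, plus one extreme case. `pearson_r` is a hypothetical helper written out by hand.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Ten hypothetical cases with no linear relationship between x and y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [4, 6, 5, 7, 5, 5, 7, 5, 6, 4]

# The same cases plus one case that is extreme on both variables.
xs_extreme = xs + [50]
ys_extreme = ys + [60]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs_extreme, ys_extreme)

print("without the extreme case:", round(r_without, 3))
print("with the extreme case:   ", round(r_with, 3))
```

The correlation is zero for the ten ordinary cases and very close to 1 once the extreme case is included, which is why such cases should be examined before the regression is trusted.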

**Advice**: for variables with a high level of measurement, always make a scatterplot to get a first impression of the data.

**Important**

Linear regression has some mathematical and statistical prerequisites. Suffice it to say here that the dependent variable must follow a normal distribution (strictly speaking, at any given value of the independent variable), and that the standard deviation of the dependent variable must not depend on the value of the independent variable. These conditions must always be checked before doing linear regression.
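A rough way to check the second condition is to compare the spread of the residuals at low and at high values of the independent variable. The sketch below assumes hypothetical data whose scatter grows with x; the comparison of two halves is only one simple option, and `fit_line` is an illustrative helper.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: y is about 2*x, with deviations that grow with x.
xs = list(range(1, 41))
ys = [2 * x + (0.2 * x if i % 2 == 0 else -0.2 * x) for i, x in enumerate(xs)]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# xs is sorted, so the first half holds the low x values, the second the high ones.
half = len(xs) // 2
sd_low = statistics.pstdev(residuals[:half])
sd_high = statistics.pstdev(residuals[half:])
ratio = sd_high / sd_low

print("residual SD, low x: ", round(sd_low, 2))
print("residual SD, high x:", round(sd_high, 2))
print("ratio:", round(ratio, 2))
```

A ratio far from 1, as here, suggests that the standard deviation depends on the independent variable, so the condition is not met. A histogram of the residuals would be a natural companion check for the normality condition.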

Let’s check whether our data on age and income meet the conditions for linear regression.

The graph tells us that

the curve suggests a non-linear relationship

the standard deviation of income increases until middle age and subsequently decreases

there are some highly extreme cases

moreover, income doesn’t follow the normal distribution (the graph doesn’t actually show this)

The correct procedure would be to normalise the distribution of income, split up the age data, and look at the relationship within each age group.
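The text does not say how income should be normalised. One common choice for right-skewed variables such as income is a logarithmic transformation, and the sketch below uses it as an assumption. The income values are hypothetical, and `skewness` is an illustrative helper (0 means a symmetric distribution).

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes: most are moderate, a few are very high.
income = [80, 90, 100, 110, 120, 130, 150, 180, 250, 400, 900]
log_income = [math.log(v) for v in income]

skew_raw = skewness(income)
skew_log = skewness(log_income)

print("skewness of income:     ", round(skew_raw, 2))
print("skewness of log(income):", round(skew_log, 2))
```

For these made-up values the logarithm reduces the skewness considerably, although it does not remove it. The transformed variable could then be analysed separately within each age group.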

Some more notes on regression

watch out for the unit of measurement

several variables can be used as independent variables
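Both notes can be illustrated with one sketch. It assumes hypothetical data in which income (in thousand forints) is determined exactly by age and years of schooling. `fit_two_predictors` is an illustrative helper that solves the least-squares equations for two independent variables by hand.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 + intercept; returns (b1, b2, intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products of the centred variables.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the two predictors are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2, my - b1 * m1 - b2 * m2

# Hypothetical data: income in thousand forints.
age = [25, 30, 35, 40, 45, 50]
schooling = [12, 16, 11, 17, 12, 15]
income_k = [50 + 2 * a + 10 * s for a, s in zip(age, schooling)]

b_age, b_school, _ = fit_two_predictors(age, schooling, income_k)

# The same incomes expressed in forints instead of thousand forints.
income_huf = [v * 1000 for v in income_k]
b_age_huf, b_school_huf, _ = fit_two_predictors(age, schooling, income_huf)

print("per year of age (thousand forints):", round(b_age, 3))
print("per year of age (forints):         ", round(b_age_huf, 3))
```

The relationship is the same in both fits, but the coefficients differ by a factor of 1000, so a coefficient can only be interpreted together with the units of the variables involved.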