Renáta Németh, Dávid Simon

ELTE

**Contents**

Normal distribution

Lognormal distribution

**Introduction**

So far a lot has been said about the distribution of variables: their graphical representation, characteristics, measures of central tendency and standard deviation. All the distributions seen so far have been empirical distributions. Now we shall look at a theoretical distribution.

Theoretical distributions are based not on a set of actual data, but some kind of theoretical consideration or function. They can be useful because many empirical distributions approach one of the theoretical ones.

Here are some examples to remind us of the distribution types. The data in this section come from the four-item block measuring xenophobia in the ISSP 1995 survey.

All five graphs have one thing in common: they approach a theoretical distribution called the normal distribution. The normal distribution can also be described by its measures of central tendency and its standard deviation, and it can be represented graphically. Its advantage over empirical distributions is that its mathematical characteristics are known exactly, so they can be used to characterise variables whose distribution approaches normal.

Let’s see to what extent the above examples approach normal distribution.

As can be seen, they more or less fit the normal distribution curve.

**The characteristics of normal distribution**

A normal distribution can be characterised using its mean and standard deviation. Unlike empirical distributions, a normal distribution is completely defined by these two indices, so the entire curve can be reproduced from these two pieces of information.

Notation:

N (mean, standard deviation)

here: N (0,1)
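To illustrate that the mean and the SD are enough to reproduce the whole curve, here is a minimal Python sketch. The function name `normal_density` is our own choice for illustration; the formula is the usual density of the normal distribution, and the standard library's `NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_density(x, mean, sd):
    """Height of the normal curve N(mean, sd) at the point x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Reproduce a few points of the standard normal curve N(0, 1).
xs = [-2, -1, 0, 1, 2]
curve = [normal_density(x, mean=0, sd=1) for x in xs]
for x, y in zip(xs, curve):
    print(f"x = {x:+d}  density = {y:.4f}")

# Cross-check against the standard library's implementation.
library_curve = [NormalDist(0, 1).pdf(x) for x in xs]
```

Changing `mean` and `sd` in the call gives the curve of any other normal distribution; nothing else is needed.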

What can be said about the mode and the median of normal distribution?

Another typical feature is that the normal distribution is not skewed: it is symmetrical about the mean (explain). Its shape is often likened to a bell, hence the name 'bell curve'.

**The area under the curve**

Consider a normal distribution whose mean=0 and SD=1. What does the area painted blue represent?

The blue area represents the number/percentage of cases between -2 and -1. Here the unit of the y axis was chosen so that the area under the whole curve is 1, so the area above any interval gives the proportion of cases falling between its two endpoints.

Consider the following graph. What kind of conclusion can be drawn from the fact that the curve is symmetrical?

Because the curve is symmetrical, any two intervals of the same width at the same distance from the mean contain the same number/percentage of cases.
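The area above an interval and the effect of symmetry can be checked numerically. The sketch below assumes the standard normal distribution N(0, 1) and uses `NormalDist` from Python's standard library; the helper name `area_between` is ours.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_between(dist, lower, upper):
    """Area under the curve of `dist` between lower and upper."""
    return dist.cdf(upper) - dist.cdf(lower)

# The interval from -2 to -1 and its mirror image from 1 to 2.
blue_area = area_between(std_normal, -2, -1)
mirror_area = area_between(std_normal, 1, 2)

print(f"Area between -2 and -1: {blue_area:.4f}")
print(f"Area between  1 and  2: {mirror_area:.4f}")
```

Both areas come out at roughly 0.136, that is, about 13.6% of the cases lie in each interval.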

So far we have been looking at normal distributions with mean=0 and SD=1. This type of normal distribution is called the standard normal distribution.

The graph above shows how we can arrive at any type of normal distribution from the standard normal distribution.

E.g.: Let’s create the normal distribution where mean=1 and SD=2

Procedure:

0. Take the standard normal distribution (blue line)

1. Multiply all the values of the variable by the SD given (purple line, where the mean is still 0, while the SD is exactly as required)

2. Add to each value the mean given (yellow line, the curve whose mean is 1 and whose SD is 2)
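The two steps can be tried out on simulated data. This is only a sketch: the sample size and the random seed are arbitrary assumptions, and the values are random draws from a standard normal distribution rather than real data.

```python
import random
from statistics import fmean, pstdev

random.seed(1)  # arbitrary seed, only to make the run repeatable

# Step 0: values following the standard normal distribution N(0, 1).
standard = [random.gauss(0, 1) for _ in range(50_000)]

target_mean, target_sd = 1, 2

# Step 1: multiply by the required SD (the mean stays at 0).
scaled = [z * target_sd for z in standard]

# Step 2: add the required mean (the SD stays at 2).
shifted = [x + target_mean for x in scaled]

print(f"after step 1: mean = {fmean(scaled):.2f}, SD = {pstdev(scaled):.2f}")
print(f"after step 2: mean = {fmean(shifted):.2f}, SD = {pstdev(shifted):.2f}")
```

Because the values are a random sample, the printed mean and SD are close to, but not exactly, the required 1 and 2.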

Usually we come across the reverse of this operation: we transform an arbitrary normal distribution into the standard one. This procedure is called standardization and the values we get are called z values. The above procedure is reversed:

Take a normal distribution curve (or a variable with a normal distribution)

Subtract the mean, which thus becomes 0.

Divide by the SD, which thus becomes 1.
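The same two steps can be written as a short function. The scores below are hypothetical numbers made up for illustration, and the sketch uses the population SD (`pstdev`); a statistical package may divide by the sample SD instead, which changes the z values slightly.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Turn raw values into z values: subtract the mean, divide by the SD."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population SD
    return [(v - mean) / sd for v in values]

# Hypothetical raw scores, not taken from the survey.
scores = [4, 7, 9, 10, 12, 12, 15, 19]
z_values = standardize(scores)

print([round(z, 2) for z in z_values])
print(f"mean of z values: {fmean(z_values):.2f}")
print(f"SD of z values:   {pstdev(z_values):.2f}")
```

After standardization the mean is 0 and the SD is 1, whatever the original unit of measurement was.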

**When do we use standardization in practice?**

As we saw in the previous lecture, the value of the regression coefficient depends on the unit of measurement used. However, if we standardize the variables, this is no longer true.

**Note:** the computer performs this operation when it runs the regression; the resulting b value is called Beta, the standardized regression coefficient.
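A small sketch can show why standardization removes the dependence on the unit. The height and weight figures are hypothetical, the helper names `slope` and `standardize` are ours, and the example is restricted to a regression with a single explanatory variable.

```python
from statistics import fmean, pstdev

def slope(x, y):
    """Least-squares regression coefficient b of y on x."""
    mx, my = fmean(x), fmean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = sum((a - mx) ** 2 for a in x)
    return numerator / denominator

def standardize(values):
    """Subtract the mean and divide by the (population) SD."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical data: the same heights in two different units.
height_cm = [150, 160, 165, 170, 180, 185]
height_m = [h / 100 for h in height_cm]
weight_kg = [52, 58, 63, 70, 77, 85]

b_cm = slope(height_cm, weight_kg)  # kg per centimetre
b_m = slope(height_m, weight_kg)    # kg per metre: 100 times larger

beta_cm = slope(standardize(height_cm), standardize(weight_kg))
beta_m = slope(standardize(height_m), standardize(weight_kg))

print(f"b (cm): {b_cm:.3f}   b (m): {b_m:.3f}")
print(f"Beta (cm): {beta_cm:.3f}   Beta (m): {beta_m:.3f}")
```

The unstandardized coefficient changes by a factor of 100 when we switch from centimetres to metres, while the coefficient computed on the standardized variables stays the same.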

**How to interpret the value of the standardized variable?**

As the above example shows, we can standardize not only the theoretical distribution but also the variables that we assume follow (or approach) a normal distribution.

Let’s see the standardized version of the xenophobia variable used earlier.

What does it mean to have a z score 1.5 for xenophobia?

It means the given person is 1.5 SD above the mean of xenophobia in the given sample.

**Note 1:** the graph is visibly different because the SPSS program creates its own percentiles for the bar chart.

**Note 2:** we can use the characteristics of the normal distribution to interpret the standardized variable (if its distribution is normal), a bit like the way we use centimeter and meter as units of measurement.
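For example, if we assume that the standardized variable really is normally distributed, a z score can be turned into a statement about the share of cases. The sketch below does this for z = 1.5 using Python's standard library; the normality of the variable is an assumption, not something the code checks.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

z = 1.5
share_below = std_normal.cdf(z)  # proportion of cases with a lower z score
share_above = 1 - share_below    # proportion of cases with a higher z score

print(f"Share of cases below z = {z}: {share_below:.3f}")
print(f"Share of cases above z = {z}: {share_above:.3f}")
```

Under this assumption a person with a z score of 1.5 scores higher than roughly 93% of the sample.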

How do we calculate, for a specific curve, how many or what percentage of cases there are in a given interval? This is what the Standard Normal Distribution Table is for.

**Note:** to save space, the table makes use of the symmetry of the bell curve and contains no negative values. The values belonging to negative numbers are obtained in the following way:

F(x) = 1 − F(−x)

Why is this so? (Consider the symmetry and the area under the curve.)

Let’s find out how many cases there are in the following intervals in a standard normal distribution:

Intervals:

| From | To | Proportion of cases |
|---|---|---|
| 0 | 1 | |
| -1 | 0 | |
| 0.5 | 1 | |
| -1.5 | -1 | |
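The table lookup, including the symmetry rule for negative values, can be imitated in a few lines. This is a sketch: `table_lookup` stands in for the printed table (it refuses negative values, as the table does), and `F` is our name for the cumulative proportion below x.

```python
from statistics import NormalDist

_std_normal = NormalDist(mu=0, sigma=1)

def table_lookup(x):
    """Stand-in for the printed table, which lists F(x) only for x >= 0."""
    if x < 0:
        raise ValueError("the table contains no negative values")
    return _std_normal.cdf(x)

def F(x):
    """Proportion of cases below x, using symmetry for negative x."""
    return table_lookup(x) if x >= 0 else 1 - table_lookup(-x)

intervals = [(0, 1), (-1, 0), (0.5, 1), (-1.5, -1)]
proportions = {(a, b): F(b) - F(a) for a, b in intervals}

for (a, b), p in proportions.items():
    print(f"from {a:>4} to {b:>2}: {p:.4f}")
```

The first two intervals give the same proportion, which is again the symmetry of the curve at work.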

Let’s calculate the proportions for other normal distributions:

N(1,2), interval from 0 to 1

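Procedure (this is also, in fact, standardization):

Subtract the mean from both endpoints of the interval (here: -1 and 0).

Divide the values we get by the SD (here: -0.5 and 0).

Look up these values in the Table (here: F(−0.5) = 1 − F(0.5) = 1 − 0.691 = 0.309 and F(0) = 0.5). The proportion of cases in the interval is the difference between the two: 0.5 − 0.309 = 0.191.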
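The procedure can be wrapped in one function that works for any normal distribution. This is a sketch under the notation used here, N(mean, SD); the function name `interval_proportion` is ours, and the last lines only cross-check the result against a direct calculation.

```python
from statistics import NormalDist

def interval_proportion(mean, sd, lower, upper):
    """Proportion of cases of N(mean, sd) between lower and upper."""
    # Steps 1 and 2: subtract the mean, then divide by the SD.
    z_lower = (lower - mean) / sd
    z_upper = (upper - mean) / sd
    # Step 3: look up both z values in the standard normal distribution.
    std_normal = NormalDist(mu=0, sigma=1)
    return std_normal.cdf(z_upper) - std_normal.cdf(z_lower)

# N(1, 2), interval from 0 to 1.
p = interval_proportion(mean=1, sd=2, lower=0, upper=1)
print(f"Proportion of cases: {p:.3f}")

# Cross-check without standardizing first.
direct = NormalDist(mu=1, sigma=2).cdf(1) - NormalDist(mu=1, sigma=2).cdf(0)
```

The result is about 0.191. The same function can be called with other means, SDs and interval endpoints.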

Further examples:

N(1,3), intervals from 0 to 1 and from -1 to 1

**Lognormal distribution**

The lognormal distribution does not occur often in real life, but since the distribution of income usually follows this pattern, it is worth remembering.

A variable has a lognormal distribution when the logarithm of its values is normally distributed.

E.g.: Self-declared income in Hungary in 1995

How can we interpret the graph?

What can we do if we want to use a procedure that requires normal distribution for income data?
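One common answer is to take the logarithm of the incomes and work with the transformed variable. The sketch below uses simulated, not real, income data: the parameters of `random.lognormvariate` and the seed are arbitrary assumptions chosen only to produce a right-skewed variable.

```python
import math
import random
from statistics import fmean, median

random.seed(7)  # arbitrary seed, only to make the run repeatable

# Simulated "incomes" with a lognormal distribution (hypothetical parameters).
incomes = [random.lognormvariate(10, 0.8) for _ in range(10_000)]

# The transformation: take the natural logarithm of every value.
log_incomes = [math.log(v) for v in incomes]

# In a right-skewed distribution the mean is well above the median.
raw_ratio = fmean(incomes) / median(incomes)
# In a symmetrical distribution the mean and the median nearly coincide.
log_gap = fmean(log_incomes) - median(log_incomes)

print(f"raw incomes: mean / median = {raw_ratio:.2f}")
print(f"log incomes: mean - median = {log_gap:.3f}")
```

The raw values are strongly skewed to the right, while their logarithms are close to symmetrical, so procedures that require a normal distribution can be applied to the logarithm of income.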

Variables with a high (really high) level of measurement often follow a normal distribution, but there are not too many of those around.

Responses to attitude questions often show normal distribution.

Almost all indices tend to have normal distribution.

In general: the more components an index combines, the closer its distribution comes to normal. (This has to do with what mathematicians call the central limit theorem.)
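This tendency can be illustrated with a small simulation. Everything in the sketch is an assumption made for illustration: the items are hypothetical five-point questions with a skewed answer distribution, the answers are drawn independently of each other (real survey items are usually correlated), and skewness is used as a simple measure of how far the index is from the symmetrical normal shape.

```python
import random
from statistics import fmean

random.seed(3)  # arbitrary seed, only to make the run repeatable

ANSWERS = [1, 2, 3, 4, 5]
WEIGHTS = [0.50, 0.25, 0.15, 0.07, 0.03]  # hypothetical, strongly skewed item

def skewness(values):
    """Third standardized moment: 0 for a symmetrical distribution."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def composite_index(n_items):
    """Index value of one simulated respondent: the sum of n_items answers."""
    return sum(random.choices(ANSWERS, weights=WEIGHTS, k=n_items))

skew_by_items = {}
for n_items in (1, 4, 16):
    index_values = [composite_index(n_items) for _ in range(5_000)]
    skew_by_items[n_items] = skewness(index_values)
    print(f"{n_items:>2} items: skewness = {skew_by_items[n_items]:.2f}")
```

The skewness shrinks as more items are added to the index, so the distribution of the index moves closer to the symmetrical bell shape.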