
Chapter 6

Operations on distributions

6.1  Skewness

Skewness is a statistic that measures the asymmetry of a distribution.

Given a sequence of values, xi, the sample skewness is:

    g1 = m3 / m2^{3/2}

    m2 = (1/n) ∑_i (xi − µ)²

    m3 = (1/n) ∑_i (xi − µ)³

You might recognize m2 as the mean squared deviation (also known as variance); m3 is the mean cubed deviation.
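A direct Python translation of these formulas, as a sketch (the book's solution uses the name Skewness; this snake_case version is just illustrative):

```python
def skewness(xs):
    """Computes the sample skewness, g1 = m3 / m2**1.5."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5
```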

Negative skewness indicates that a distribution “skews left;” that is, it extends farther to the left than the right. Positive skewness indicates that a distribution skews right.

In practice, computing the skewness of a sample is usually not a good idea. If there are any outliers, they have a disproportionate effect on g1.

Another way to evaluate the asymmetry of a distribution is to look at the relationship between the mean and median. Extreme values have more effect on the mean than the median, so in a distribution that skews left, the mean is less than the median.

Pearson’s median skewness coefficient is an alternative measure of skewness that explicitly captures the relationship between the mean, µ, and the median, µ1/2:

    gp = 3 (µ − µ1/2) / σ

This statistic is robust, which means that it is less vulnerable to the effect of outliers.
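A sketch of gp using the standard library’s statistics module (the exercise below asks you to write your own version named PearsonSkewness):

```python
import statistics

def pearson_skewness(xs):
    """Computes Pearson's median skewness, gp = 3 (mean - median) / sigma."""
    mu = statistics.mean(xs)
    median = statistics.median(xs)
    sigma = statistics.pstdev(xs)   # population standard deviation
    return 3 * (mu - median) / sigma
```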

Exercise 6.1 Write a function named Skewness that computes g1 for a sample.

Compute the skewness for the distributions of pregnancy length and birth weight. Are the results consistent with the shape of the distributions?

Write a function named PearsonSkewness that computes gp for these distributions. How does gp compare to g1?

Exercise 6.2 The “Lake Wobegon effect” is an amusing nickname¹ for illusory superiority, which is the tendency for people to overestimate their abilities relative to others. For example, in some surveys, more than 80% of respondents believe that they are better than the average driver (see http://wikipedia.org/wiki/Illusory_superiority).

If we interpret “average” to mean median, then this result is logically impossible, but if “average” is the mean, this result is possible, although unlikely.

What percentage of the population has more than the average number of legs?

Exercise 6.3 The Internal Revenue Service of the United States (IRS) provides data about income taxes, and other statistics, at http://irs.gov/taxstats. If you did Exercise 4.13, you have already worked with this data; otherwise, follow the instructions there to extract the distribution of incomes from this dataset.

What fraction of the population reports a taxable income below the mean?

Compute the median, mean, skewness and Pearson’s skewness of the income data. Because the data has been binned, you will have to make some approximations.

The Gini coefficient is a measure of income inequality. Read about it at http://wikipedia.org/wiki/Gini_coefficient and write a function called Gini that computes it for the income distribution.

¹If you don’t get it, see http://wikipedia.org/wiki/Lake_Wobegon.


Hint: use the PMF to compute the relative mean difference (see http://wikipedia.org/wiki/Mean_difference).

You can download a solution to this exercise from http://thinkstats.com/gini.py.

6.2  Random Variables

A random variable represents a process that generates a random number.

Random variables are usually written with a capital letter, like X. When you see a random variable, you should think “a value selected from a distribution.”

For example, the formal definition of the cumulative distribution function is:

    CDFX(x) = P(X ≤ x)

I have avoided this notation until now because it is so awful, but here’s what it means: The CDF of the random variable X, evaluated for a particular value x, is defined as the probability that a value generated by the random process X is less than or equal to x.

As a computer scientist, I find it helpful to think of a random variable as an object that provides a method, which I will call generate, that uses a random process to generate values.

For example, here is a definition for a class that represents random variables:

class RandomVariable(object):
    """Parent class for all random variables."""

And here is a random variable with an exponential distribution:

class Exponential(RandomVariable):
    def __init__(self, lam):
        self.lam = lam

    def generate(self):
        return random.expovariate(self.lam)

The init method takes the parameter, λ, and stores it as an attribute. The generate method returns a random value from the exponential distribution with that parameter.
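For example, here is a quick sanity check that generate behaves as expected; the class definitions are repeated so the snippet runs on its own, and the sample mean should land near 1/λ:

```python
import random

class RandomVariable(object):
    """Parent class for all random variables."""

class Exponential(RandomVariable):
    def __init__(self, lam):
        self.lam = lam

    def generate(self):
        return random.expovariate(self.lam)

random.seed(17)
X = Exponential(2.0)
sample = [X.generate() for _ in range(100000)]
mean = sum(sample) / len(sample)
# the mean of an exponential distribution is 1/lam, so this is near 0.5
```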

70

Chapter 6. Operations on distributions

Each time you invoke generate, you get a different value. The value you get is called a random variate, which is why many function names in the random module include the word “variate.”

If I were just generating exponential variates, I would not bother to define a new class; I would use random.expovariate. But for other distributions it might be useful to use RandomVariable objects. For example, the Erlang distribution is a continuous distribution with parameters λ and k (see http://wikipedia.org/wiki/Erlang_distribution).

One way to generate values from an Erlang distribution is to add k values from an exponential distribution with the same λ. Here’s an implementation:

class Erlang(RandomVariable):
    def __init__(self, lam, k):
        self.lam = lam
        self.k = k
        self.expo = Exponential(lam)

    def generate(self):
        total = 0
        for i in range(self.k):
            total += self.expo.generate()
        return total

The init method creates an Exponential object with the given parameter; then generate uses it. In general, the init method can take any set of parameters and the generate function can implement any random process.

Exercise 6.4 Write a definition for a class that represents a random variable with a Gumbel distribution (see http://wikipedia.org/wiki/Gumbel_distribution).

6.3  PDFs

The derivative of a CDF is called a probability density function, or PDF.

For example, the PDF of an exponential distribution is

    PDFexpo(x) = λ e^{−λx}


The PDF of a normal distribution is

    PDFnormal(x) = (1 / (σ √(2π))) exp( −(1/2) ((x − µ) / σ)² )

Evaluating a PDF for a particular value of x is usually not useful. The result is not a probability; it is a probability density.

In physics, density is mass per unit of volume; in order to get a mass, you have to multiply by volume or, if the density is not constant, you have to integrate over volume.

Similarly, probability density measures probability per unit of x. In order to get a probability mass², you have to integrate over x. For example, if X is a random variable whose PDF is PDFX, we can compute the probability that a value from X falls between −0.5 and 0.5:

    P(−0.5 ≤ X < 0.5) = ∫_{−0.5}^{0.5} PDFX(x) dx

Or, since the CDF is the integral of the PDF, we can write

    P(−0.5 ≤ X < 0.5) = CDFX(0.5) − CDFX(−0.5)

For some distributions we can evaluate the CDF explicitly, so we would use the second option. Otherwise we usually have to integrate the PDF numerically.
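To make this concrete, here is a sketch that computes P(−0.5 ≤ X < 0.5) for a standard normal distribution both ways: a simple midpoint-rule integration of the PDF, and a difference of CDFs evaluated with math.erf.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def integrate(f, low, high, steps=10000):
    """Midpoint-rule numerical integration of f from low to high."""
    dx = (high - low) / steps
    return sum(f(low + (i + 0.5) * dx) for i in range(steps)) * dx

p_integral = integrate(normal_pdf, -0.5, 0.5)
p_cdf = normal_cdf(0.5) - normal_cdf(-0.5)
# both are about 0.383
```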

Exercise 6.5 What is the probability that a value chosen from an exponential distribution with parameter λ falls between 1 and 20? Express your answer as a function of λ. Keep this result handy; we will use it in Section 8.8.
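One way to check your algebra numerically: this sketch differences the exponential CDF, CDFexpo(x) = 1 − e^{−λx}, rather than stating the closed form.

```python
import math

def prob_between(lam, low=1.0, high=20.0):
    """P(low <= X < high) for X with an exponential distribution,
    computed as CDFexpo(high) - CDFexpo(low)."""
    return (1 - math.exp(-lam * high)) - (1 - math.exp(-lam * low))
```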

Exercise 6.6 In the BRFSS (see Section 4.5), the distribution of heights is roughly normal with parameters µ = 178 cm and σ² = 59.4 cm² for men, and µ = 163 cm and σ² = 52.8 cm² for women.

In order to join Blue Man Group, you have to be male between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: see Section 4.3.

²To take the analogy one step farther, the mean of a distribution is its center of mass, and the variance is its moment of inertia.


6.4  Convolution

Suppose we have two random variables, X and Y, with distributions CDFX and CDFY. What is the distribution of the sum Z = X + Y?

One option is to write a RandomVariable object that generates the sum:

class Sum(RandomVariable):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def generate(self):
        return self.X.generate() + self.Y.generate()

Given any RandomVariables, X and Y, we can create a Sum object that represents Z. Then we can use a sample from Z to approximate CDFZ.

This approach is simple and versatile, but not very efficient; we have to generate a large sample to estimate CDFZ accurately, and even then it is not exact.
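Here is that approach in miniature; the classes are repeated so the snippet runs on its own, and empirical_cdf estimates CDFZ at a point from the sample:

```python
import random

class RandomVariable(object):
    """Parent class for all random variables."""

class Exponential(RandomVariable):
    def __init__(self, lam):
        self.lam = lam

    def generate(self):
        return random.expovariate(self.lam)

class Sum(RandomVariable):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def generate(self):
        return self.X.generate() + self.Y.generate()

random.seed(17)
Z = Sum(Exponential(1.0), Exponential(1.0))
sample = [Z.generate() for _ in range(100000)]

def empirical_cdf(sample, z):
    """Fraction of the sample at or below z: an estimate of CDFZ(z)."""
    return sum(1 for s in sample if s <= z) / len(sample)
```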

If CDFX and CDFY are expressed as functions, sometimes we can find CDFZ exactly. Here’s how:

1. To start, assume that the particular value of X is x. Then CDFZ(z) is

       P(Z ≤ z | X = x) = P(Y ≤ z − x)

   Let’s read that back. The left side is “the probability that the sum is less than z, given that the first term is x.” Well, if the first term is x and the sum has to be less than z, then the second term has to be less than z − x.

2. To get the probability that Y is less than z − x, we evaluate CDFY:

       P(Y ≤ z − x) = CDFY(z − x)

   This follows from the definition of the CDF.

3. Good so far? Let’s go on. Since we don’t actually know the value of x, we have to consider all values it could have and integrate over them:

       P(Z ≤ z) = ∫_{−∞}^{∞} P(Z ≤ z | X = x) PDFX(x) dx

   The integrand is “the probability that Z is less than or equal to z, given that X = x, times the probability that X = x.”

   Substituting from the previous steps we get

       P(Z ≤ z) = ∫_{−∞}^{∞} CDFY(z − x) PDFX(x) dx

   The left side is the definition of CDFZ, so we conclude:

       CDFZ(z) = ∫_{−∞}^{∞} CDFY(z − x) PDFX(x) dx

4. To get PDFZ, take the derivative of both sides with respect to z. The result is

       PDFZ(z) = ∫_{−∞}^{∞} PDFY(z − x) PDFX(x) dx

   If you have studied signals and systems, you might recognize that integral. It is the convolution of PDFY and PDFX, denoted with the operator ∗:

       PDFZ = PDFY ∗ PDFX

   So the distribution of the sum is the convolution of the distributions. See http://wiktionary.org/wiki/booyah!

As an example, suppose X and Y are random variables with an exponential distribution with parameter λ. The distribution of Z = X + Y is:

    PDFZ(z) = ∫_{−∞}^{∞} PDFX(x) PDFY(z − x) dx = ∫_{−∞}^{∞} λ e^{−λx} λ e^{−λ(z−x)} dx

Now we have to remember that PDFexpo is 0 for all negative values, but we can handle that by adjusting the limits of integration:

    PDFZ(z) = ∫_0^z λ e^{−λx} λ e^{−λ(z−x)} dx

Now we can combine terms and move constants outside the integral:

    PDFZ(z) = λ² e^{−λz} ∫_0^z dx = λ² z e^{−λz}

This, it turns out, is the PDF of an Erlang distribution with parameter k = 2 (see http://wikipedia.org/wiki/Erlang_distribution). So the convolution of two exponential distributions (with the same parameter) is an Erlang distribution.
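We can check this result numerically: the empirical CDF of a Monte Carlo sample of X + Y should match the CDF obtained by integrating PDFZ(z) = λ² z e^{−λz}, which works out to 1 − e^{−λz}(1 + λz).

```python
import math
import random

random.seed(17)
lam = 1.0
sample = [random.expovariate(lam) + random.expovariate(lam)
          for _ in range(200000)]

def empirical_cdf(z):
    """Monte Carlo estimate of P(X + Y <= z)."""
    return sum(1 for s in sample if s <= z) / len(sample)

def erlang2_cdf(z):
    """Erlang(k=2) CDF: the integral of lam**2 * t * exp(-lam*t)
    from 0 to z."""
    return 1 - math.exp(-lam * z) * (1 + lam * z)
```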


Exercise 6.7 If X has an exponential distribution with parameter λ, and Y has an Erlang distribution with parameters k and λ, what is the distribution of the sum Z = X + Y?

Exercise 6.8 Suppose I draw two values from a distribution; what is the distribution of the larger value? Express your answer in terms of the PDF or CDF of the distribution.

As the number of values increases, the distribution of the maximum converges on one of the extreme value distributions; see http://wikipedia.org/wiki/Gumbel_distribution.

Exercise 6.9 If you are given Pmf objects, you can compute the distribution of the sum by enumerating all pairs of values:

for x in pmf_x.Values():
    for y in pmf_y.Values():
        z = x + y

Write a function that takes PMFX and PMFY and returns a new Pmf that represents the distribution of the sum Z = X + Y.

Write a similar function that computes the PMF of Z = max(X, Y).
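A sketch of both functions using plain dicts (value → probability) instead of the book’s Pmf class; accumulating px · py for each pair is exactly the loop above:

```python
def pmf_sum(pmf_x, pmf_y):
    """Distribution of Z = X + Y, with PMFs as value->probability dicts."""
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            z = x + y
            pmf_z[z] = pmf_z.get(z, 0) + px * py
    return pmf_z

def pmf_max(pmf_x, pmf_y):
    """Distribution of Z = max(X, Y)."""
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            z = max(x, y)
            pmf_z[z] = pmf_z.get(z, 0) + px * py
    return pmf_z
```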

6.5  Why normal?

I said earlier that normal distributions are amenable to analysis, but I didn’t say why. One reason is that they are closed under linear transformation and convolution. To explain what that means, it will help to introduce some notation.

If the distribution of a random variable, X, is normal with parameters µ and σ, you can write

    X ∼ N(µ, σ²)

where the symbol ∼ means “is distributed” and the script letter N stands for “normal.”

A linear transformation of X is something like X′ = aX + b, where a and b are real numbers. A family of distributions is closed under linear transformation if X′ is in the same family as X. The normal distribution has this property; if X ∼ N(µ, σ²), then

    X′ ∼ N(aµ + b, a²σ²)


Normal distributions are also closed under convolution. If Z = X + Y and X ∼ N(µX, σ²X) and Y ∼ N(µY, σ²Y), then

    Z ∼ N(µX + µY, σ²X + σ²Y)

The other distributions we have looked at do not have these properties.
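A quick numerical illustration of closure under convolution, using random.gauss: with µX = 1, σX = 2, µY = 3, σY = 1, the sum should have mean 4 and variance 2² + 1² = 5.

```python
import random
import statistics

random.seed(17)
mu_x, sigma_x = 1.0, 2.0
mu_y, sigma_y = 3.0, 1.0

# Draw Z = X + Y many times; closure under convolution predicts
# Z ~ N(mu_x + mu_y, sigma_x**2 + sigma_y**2) = N(4, 5)
zs = [random.gauss(mu_x, sigma_x) + random.gauss(mu_y, sigma_y)
      for _ in range(100000)]
mean = statistics.mean(zs)
var = statistics.pvariance(zs)
```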

Exercise 6.10 If X ∼ N(µX, σ²X) and Y ∼ N(µY, σ²Y), what is the distribution of Z = aX + bY?

Exercise 6.11 Let’s see what happens when we add values from other distributions. Choose a pair of distributions (any two of exponential, normal, lognormal, and Pareto) and choose parameters that make their mean and variance similar.

Generate random numbers from these distributions and compute the distribution of their sums. Use the tests from Chapter 4 to see if the sum can be modeled by a continuous distribution.

6.6  Central limit theorem

So far we have seen:

• If we add values drawn from normal distributions, the distribution of the sum is normal.

• If we add values drawn from other distributions, the sum does not generally have one of the continuous distributions we have seen.

But it turns out that if we add up a large number of values from almost any distribution, the distribution of the sum converges to normal.

More specifically, if the distribution of the values has mean and standard deviation µ and σ, the distribution of the sum is approximately N(nµ, nσ²).

This is called the Central Limit Theorem. It is one of the most useful tools for statistical analysis, but it comes with caveats:

• The values have to be drawn independently.

• The values have to come from the same distribution (although this requirement can be relaxed).


• The values have to be drawn from a distribution with finite mean and variance, so most Pareto distributions are out.

• The number of values you need before you see convergence depends on the skewness of the distribution. Sums from an exponential distribution converge for small sample sizes. Sums from a lognormal distribution do not.
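A small simulation makes the theorem tangible: summing 30 exponential variates at a time (with λ = 1, so µ = 1 and σ² = 1), the sums’ mean and variance should come out near nµ = 30 and nσ² = 30.

```python
import random
import statistics

random.seed(17)
lam, n_terms, n_samples = 1.0, 30, 20000

# Each value is the sum of 30 draws from an exponential distribution,
# which has mu = 1/lam and sigma**2 = 1/lam**2.
sums = [sum(random.expovariate(lam) for _ in range(n_terms))
        for _ in range(n_samples)]
mean = statistics.mean(sums)
var = statistics.pvariance(sums)
# CLT predicts approximately N(n*mu, n*sigma**2) = N(30, 30)
```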

The Central Limit Theorem explains, at least in part, the prevalence of normal distributions in the natural world. Most characteristics of animals and other life forms are affected by a large number of genetic and environmental factors whose effect is additive. The characteristics we measure are the sum of a large number of small effects, so their distribution tends to be normal.

Exercise 6.12 If I draw a sample, x1 .. xn, independently from a distribution with finite mean µ and variance σ², what is the distribution of the sample mean:

    x̄ = (1/n) ∑_i xi

As n increases, what happens to the variance of the sample mean? Hint: review Section 6.5.

Exercise 6.13 Choose a distribution (one of exponential, lognormal or Pareto) and choose values for the parameter(s). Generate samples with sizes 2, 4, 8, etc., and compute the distribution of their sums. Use a normal probability plot to see if the distribution is approximately normal. How many terms do you have to add to see convergence?

Exercise 6.14 Instead of the distribution of sums, compute the distribution of products; what happens as the number of terms increases? Hint: look at the distribution of the log of the products.

6.7  The distribution framework

At this point we have seen PMFs, CDFs and PDFs; let’s take a minute to review. Figure 6.1 shows how these functions relate to each other.

We started with PMFs, which represent the probabilities for a discrete set of values. To get from a PMF to a CDF, we computed a cumulative sum. To be more consistent, a discrete CDF should be called a cumulative mass function (CMF), but as far as I can tell no one uses that term.


[Figure 6.1: A framework that relates representations of distribution functions. The diagram has a discrete column (PMF and CDF, sometimes called a CMF) and a continuous column (PDF and CDF): a cumulative sum takes a PMF to a discrete CDF, and diff goes back; integrating takes a PDF to a continuous CDF, and differentiating goes back; smoothing takes a discrete CDF to a continuous CDF; binning takes a PDF back to a PMF.]

To get from a CDF to a PMF, you can compute differences in cumulative probabilities.

Similarly, a PDF is the derivative of a continuous CDF; or, equivalently, a CDF is the integral of a PDF. But remember that a PDF maps from values to probability densities; to get a probability, you have to integrate.

To get from a discrete to a continuous distribution, you can perform various kinds of smoothing. One form of smoothing is to assume that the data come from an analytic continuous distribution (like exponential or normal) and to estimate the parameters of that distribution. And that’s what Chapter 8 is about.

If you divide a PDF into a set of bins, you can generate a PMF that is at least an approximation of the PDF. We use this technique in Chapter 8 to do Bayesian estimation.

Exercise 6.15 Write a function called MakePmfFromCdf that takes a Cdf object and returns the corresponding Pmf object.

You can find a solution to this exercise in thinkstats.com/Pmf.py.
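As a sketch of the idea, using a sorted list of (value, cumulative probability) pairs in place of a Cdf object and a dict in place of a Pmf: each probability is the difference between successive cumulative probabilities.

```python
def make_pmf_from_cdf(cdf_items):
    """Given a CDF as a sorted list of (value, cumulative probability)
    pairs, returns a PMF as a value -> probability dict by taking
    successive differences of the cumulative probabilities."""
    pmf = {}
    prev = 0.0
    for value, cum in cdf_items:
        pmf[value] = cum - prev
        prev = cum
    return pmf
```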