Think Stats: Probability and Statistics for Programmers by Allen Downey

Chapter 2

Descriptive statistics

2.1 Means and averages

In the previous chapter, I mentioned three summary statistics—mean, variance and median—without explaining what they are. So before we go any farther, let's take care of that.

If you have a sample of n values, x_i, the mean, µ, is the sum of the values divided by the number of values; in other words

\[ \mu = \frac{1}{n} \sum_i x_i \]

The words “mean” and “average” are sometimes used interchangeably, but I will maintain this distinction:

• The “mean” of a sample is the summary statistic computed with the previous formula.

• An “average” is one of many summary statistics you might choose to describe the typical value or the central tendency of a sample.

Sometimes the mean is a good description of a set of values. For example, apples are all pretty much the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is 3 pounds, it would be a reasonable summary to say they are about a half pound each.

But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one Atlantic Giant® pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told you “The average pumpkin in my garden is 100 pounds,” that would be wrong, or at least misleading.

In this example, there is no meaningful average because there is no typical pumpkin.

2.2 Variance

If there is no single number that summarizes pumpkin weights, we can do a little better with two numbers: mean and variance.

In the same way that the mean is intended to describe the central tendency, variance is intended to describe the spread. The variance of a set of values is

\[ \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 \]

The term x_i − µ is called the “deviation from the mean,” so variance is the mean squared deviation, which is why it is denoted σ². The square root of variance, σ, is called the standard deviation.

By itself, variance is hard to interpret. One problem is that the units are strange; in this case the measurements are in pounds, so the variance is in pounds squared. Standard deviation is more meaningful; in this case the units are pounds.
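To make these definitions concrete, here is a minimal sketch in plain Python (not the thinkstats.py functions the next exercise refers to) that computes the mean, variance and standard deviation of the pumpkin weights above:

import math

def mean(xs):
    # Sum of the values divided by the number of values.
    return float(sum(xs)) / len(xs)

def var(xs):
    # Mean squared deviation from the mean.
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

weights = [1, 1, 1, 3, 3, 591]    # pumpkin weights in pounds
print(mean(weights))              # 100.0
print(var(weights))               # 48217.0 (pounds squared)
print(math.sqrt(var(weights)))    # standard deviation, about 219.6 pounds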

Exercise 2.1 For the exercises in this chapter you should download http://thinkstats.com/thinkstats.py, which contains general-purpose functions we will use throughout the book. You can read documentation of these functions in http://thinkstats.com/thinkstats.html.

Write a function called Pumpkin that uses functions from thinkstats.py to compute the mean, variance and standard deviation of the pumpkin weights in the previous section.

Exercise 2.2 Reusing code from survey.py and first.py, compute the standard deviation of gestation time for first babies and others. Does it look like the spread is the same for the two groups?

How big is the difference in the means compared to these standard deviations? What does this comparison suggest about the statistical significance of the difference?


If you have prior experience, you might have seen a formula for variance with n − 1 in the denominator, rather than n. This statistic is called the “sample variance,” and it is used to estimate the variance in a population using a sample. We will come back to this in Chapter 8.

2.3 Distributions

Summary statistics are concise, but dangerous, because they obscure the data. An alternative is to look at the distribution of the data, which describes how often each value appears.

The most common representation of a distribution is a histogram, which is a graph that shows the frequency or probability of each value.

In this context, frequency means the number of times a value appears in a dataset—it has nothing to do with the pitch of a sound or tuning of a radio signal. A probability is a frequency expressed as a fraction of the sample size, n.

In Python, an efficient way to compute frequencies is with a dictionary. Given a sequence of values, t:

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

The result is a dictionary that maps from values to frequencies. To get from frequencies to probabilities, we divide through by n, which is called normalization:

n = float(len(t))
pmf = {}
for x, freq in hist.items():
    pmf[x] = freq / n

The normalized histogram is called a PMF, which stands for “probability mass function”; that is, it’s a function that maps from values to probabilities (I’ll explain “mass” in Section 6.3).

It might be confusing to call a Python dictionary a function. In mathematics, a function is a map from one set of values to another. In Python, we usually represent mathematical functions with function objects, but in this case we are using a dictionary (dictionaries are also called “maps,” if that helps).
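As a quick check, here is what the two loops above produce for a small, illustrative list (dictionary order may vary):

t = [1, 2, 2, 3, 5]

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

n = float(len(t))
pmf = {}
for x, freq in hist.items():
    pmf[x] = freq / n

print(hist)   # {1: 1, 2: 2, 3: 1, 5: 1}
print(pmf)    # {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}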


2.4 Representing histograms

I wrote a Python module called Pmf.py that contains class definitions for Hist objects, which represent histograms, and Pmf objects, which represent PMFs. You can read the documentation at thinkstats.com/Pmf.html and download the code from thinkstats.com/Pmf.py.

The function MakeHistFromList takes a list of values and returns a new Hist object. You can test it in Python’s interactive mode:

>>> import Pmf
>>> hist = Pmf.MakeHistFromList([1, 2, 2, 3, 5])
>>> print hist
<Pmf.Hist object at 0xb76cf68c>

Pmf.Hist means that this object is a member of the Hist class, which is defined in the Pmf module. In general, I use upper case letters for the names of classes and functions, and lower case letters for variables.

Hist objects provide methods to look up values and their probabilities. Freq takes a value and returns its frequency:

>>> hist.Freq(2)
2

If you look up a value that has never appeared, the frequency is 0.

>>> hist.Freq(4)
0

Values returns an unsorted list of the values in the Hist:

>>> hist.Values()
[1, 5, 3, 2]

To loop through the values in order, you can use the built-in function sorted:

for val in sorted(hist.Values()):
    print val, hist.Freq(val)

If you are planning to look up all of the frequencies, it is more efficient to use Items, which returns an unsorted list of value-frequency pairs:

for val, freq in hist.Items():
    print val, freq


Exercise 2.3 The mode of a distribution is the most frequent value (see http://wikipedia.org/wiki/Mode_(statistics)). Write a function called Mode that takes a Hist object and returns the most frequent value.

As a more challenging version, write a function called AllModes that takes a Hist object and returns a list of value-frequency pairs in descending order of frequency. Hint: the operator module provides a function called itemgetter which you can pass as a key to sorted.
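To illustrate the hint without giving away the exercise, here is itemgetter at work on a hypothetical list of value-frequency pairs:

from operator import itemgetter

pairs = [(1, 1), (2, 2), (3, 1), (5, 1)]   # hypothetical (value, frequency) pairs

# itemgetter(1) extracts the frequency from each pair, so sorted orders
# the pairs by frequency; reverse=True makes the order descending.
print(sorted(pairs, key=itemgetter(1), reverse=True))
# [(2, 2), (1, 1), (3, 1), (5, 1)]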

2.5 Plotting histograms

There are a number of Python packages for making figures and graphs. The one I will demonstrate is pyplot, which is part of the matplotlib package at http://matplotlib.sourceforge.net.

This package is included in many Python installations. To see whether you have it, launch the Python interpreter and run:

import matplotlib.pyplot as pyplot
pyplot.pie([1,2,3])
pyplot.show()

If you have matplotlib you should see a simple pie chart; otherwise you will have to install it.

Histograms and PMFs are most often plotted as bar charts. The pyplot function to draw a bar chart is bar. Hist objects provide a method called Render that returns a sorted list of values and a list of the corresponding frequencies, which is the format bar expects:

>>> vals, freqs = hist.Render()
>>> rectangles = pyplot.bar(vals, freqs)
>>> pyplot.show()

I wrote a module called myplot.py that provides functions for plotting histograms, PMFs and other objects we will see soon. You can read the documentation at thinkstats.com/myplot.html and download the code from thinkstats.com/myplot.py. Or you can use pyplot directly, if you prefer. Either way, you can find the documentation for pyplot on the web.

Figure 2.1 shows histograms of pregnancy lengths for first babies and others.

[Figure 2.1: Histogram of pregnancy lengths (frequency vs. weeks, first babies and others).]

Histograms are useful because they make the following features immediately apparent:

Mode: The most common value in a distribution is called the mode. In Figure 2.1 there is a clear mode at 39 weeks. In this case, the mode is the summary statistic that does the best job of describing the typical value.

Shape: Around the mode, the distribution is asymmetric; it drops off quickly to the right and more slowly to the left. From a medical point of view, this makes sense. Babies are often born early, but seldom later than 42 weeks. Also, the right side of the distribution is truncated because doctors often intervene after 42 weeks.

Outliers: Values far from the mode are called outliers. Some of these are just unusual cases, like babies born at 30 weeks. But many of them are probably due to errors, either in the reporting or recording of data.

Although histograms make some features apparent, they are usually not useful for comparing two distributions. In this example, there are fewer “first babies” than “others,” so some of the apparent differences in the histograms are due to sample sizes. We can address this problem using PMFs.

2.6 Representing PMFs

Pmf.py provides a class called Pmf that represents PMFs. The notation can be confusing, but here it is: Pmf is the name of the module and also the class, so the full name of the class is Pmf.Pmf. I often use pmf as a variable name.


Finally, in the text, I use PMF to refer to the general concept of a probability mass function, independent of my implementation.

To create a Pmf object, use MakePmfFromList, which takes a list of values:

>>> import Pmf
>>> pmf = Pmf.MakePmfFromList([1, 2, 2, 3, 5])
>>> print pmf
<Pmf.Pmf object at 0xb76cf68c>

Pmf and Hist objects are similar in many ways. The methods Values and Items work the same way for both types. The biggest difference is that a Hist maps from values to integer counters; a Pmf maps from values to floating-point probabilities.

To look up the probability associated with a value, use Prob:

>>> pmf.Prob(2)
0.4

You can modify an existing Pmf by incrementing the probability associated with a value:

>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6

Or you can multiply a probability by a factor:

>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3

If you modify a Pmf, the result may not be normalized; that is, the probabilities may no longer add up to 1. To check, you can call Total, which returns the sum of the probabilities:

>>> pmf.Total()
0.9

To renormalize, call Normalize:

>>> pmf.Normalize()
>>> pmf.Total()
1.0

Pmf objects provide a Copy method so you can make and modify a copy without affecting the original.


Exercise 2.4 According to Wikipedia, “Survival analysis is a branch of statistics which deals with death in biological organisms and failure in mechanical systems;” see http://wikipedia.org/wiki/Survival_analysis.

As part of survival analysis, it is often useful to compute the remaining lifetime of, for example, a mechanical component. If we know the distribution of lifetimes and the age of the component, we can compute the distribution of remaining lifetimes.

Write a function called RemainingLifetime that takes a Pmf of lifetimes and an age, and returns a new Pmf that represents the distribution of remaining lifetimes.
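As a starting point, here is a minimal sketch of the computation using a plain dictionary that maps lifetimes to probabilities, rather than the book’s Pmf class; the function name and the example numbers are illustrative:

def remaining_lifetime(lifetime_pmf, age):
    # Keep only lifetimes longer than the current age, shift them so
    # they represent *remaining* time, then renormalize.
    remaining = {}
    for lifetime, prob in lifetime_pmf.items():
        if lifetime > age:
            remaining[lifetime - age] = remaining.get(lifetime - age, 0.0) + prob
    total = sum(remaining.values())
    for t in remaining:
        remaining[t] /= total
    return remaining

# Hypothetical lifetimes (in years) for a component, observed at age 2:
pmf = {1: 0.2, 2: 0.3, 3: 0.3, 4: 0.2}
print(remaining_lifetime(pmf, 2))   # {1: 0.6, 2: 0.4}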

Exercise 2.5 In Section 2.1 we computed the mean of a sample by adding up the elements and dividing by n. If you are given a PMF, you can still compute the mean, but the process is slightly different:

\[ \mu = \sum_i p_i x_i \]

where the x_i are the unique values in the PMF and p_i = PMF(x_i). Similarly, you can compute variance like this:

\[ \sigma^2 = \sum_i p_i (x_i - \mu)^2 \]

Write functions called PmfMean and PmfVar that take a Pmf object and compute the mean and variance. To test these methods, check that they are consistent with the methods Mean and Var in Pmf.py.
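Here is a sketch of the two computations on a plain dictionary PMF (the Pmf-object versions are left as the exercise); the function names are illustrative:

def pmf_mean(pmf):
    # mu = sum over the unique values of p_i * x_i
    return sum(p * x for x, p in pmf.items())

def pmf_var(pmf):
    # sigma^2 = sum over the unique values of p_i * (x_i - mu)^2
    mu = pmf_mean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())

pmf = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}   # the PMF of [1, 2, 2, 3, 5]
print(pmf_mean(pmf))   # 2.6, the same as the sample mean
print(pmf_var(pmf))    # 1.84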

2.7 Plotting PMFs

There are two common ways to plot Pmfs:

• To plot a Pmf as a bar graph, you can use pyplot.bar or myplot.Hist. Bar graphs are most useful if the number of values in the Pmf is small.

• To plot a Pmf as a line, you can use pyplot.plot or myplot.Pmf. Line plots are most useful if there are a large number of values and the Pmf is smooth.

Figure 2.2 shows the PMF of pregnancy lengths as a bar graph.

[Figure 2.2: PMF of pregnancy lengths (probability vs. weeks, first babies and others).]

Using the PMF, we can see more clearly where the distributions differ. First babies seem to be less likely to arrive on time (week 39) and more likely to be late (weeks 41 and 42).

The code that generates the figures in this chapter is available from http://thinkstats.com/descriptive.py. To run it, you will need the modules it imports and the data from the NSFG (see Section 1.3).

Note: pyplot provides a function called hist that takes a sequence of values, computes the histogram and plots it. Since I use Hist objects, I usually don’t use pyplot.hist.

2.8 Outliers

Outliers are values that are far from the central tendency. Outliers might be caused by errors in collecting or processing the data, or they might be correct but unusual measurements. It is always a good idea to check for outliers, and sometimes it is useful and appropriate to discard them.

In the list of pregnancy lengths for live births, the 10 lowest values are {0, 4, 9, 13, 17, 17, 18, 19, 20, 21}. Values below 20 weeks are certainly errors, and values higher than 30 weeks are probably legitimate. But values in between are hard to interpret.

On the other end, the highest values are:

weeks  count
43     148
44     46
45     10
46     1
47     1
48     7
50     2

[Figure 2.3: Difference in percentage, by week; y-axis: 100 (PMF_first − PMF_other), x-axis: weeks.]

Again, some values are almost certainly errors, but it is hard to know for sure. One option is to trim the data by discarding some fraction of the highest and lowest values (see http://wikipedia.org/wiki/Truncated_mean).
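Here is a minimal sketch of trimming: sort the values and discard a fraction p from each end. The function name and the sample list are illustrative:

def trim(values, p=0.01):
    # Discard the lowest p fraction and the highest p fraction of the values.
    values = sorted(values)
    n = int(p * len(values))
    return values[n:len(values) - n]

# With p=0.1, the lowest and highest values of this list are dropped:
print(trim([0, 4, 9, 17, 21, 39, 40, 41, 43, 50], p=0.1))
# [4, 9, 17, 21, 39, 40, 41, 43]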

2.9 Other visualizations

Histograms and PMFs are useful for exploratory data analysis; once you have an idea what is going on, it is often useful to design a visualization that focuses on the apparent effect.

In the NSFG data, the biggest differences in the distributions are near the mode. So it makes sense to zoom in on that part of the graph, and to transform the data to emphasize differences.

Figure 2.3 shows the difference between the PMFs for weeks 35–45. I multiplied by 100 to express the differences in percentage points.

This figure makes the pattern clearer: first babies are less likely to be born in week 39, and somewhat more likely to be born in weeks 41 and 42.
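The transformation in Figure 2.3 is simple to compute. Here is a sketch, using plain-dict PMFs with hypothetical probabilities loosely shaped like the real data:

pmf_first = {38: 0.06, 39: 0.43, 41: 0.09}   # hypothetical values
pmf_other = {38: 0.07, 39: 0.48, 41: 0.07}

for week in [38, 39, 41]:
    # Difference in percentage points between the two PMFs.
    diff = 100 * (pmf_first.get(week, 0) - pmf_other.get(week, 0))
    print((week, diff))
# The week-39 entry is about -5.0: in this made-up data, first babies
# are about 5 percentage points less likely to be born that week.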


2.10 Relative risk

We started with the question, “Do first babies arrive late?” To make that more precise, let’s say that a baby is early if it is born during Week 37 or earlier, on time if it is born during Week 38, 39 or 40, and late if it is born during Week 41 or later. Ranges like these that are used to group data are called bins.

Exercise 2.6 Create a file named risk.py. Write functions named ProbEarly, ProbOnTime and ProbLate that take a PMF and compute the fraction of births that fall into each bin. Hint: write a generalized function that these functions call.

Make three PMFs, one for first babies, one for others, and one for all live births. For each PMF, compute the probability of being born early, on time, or late.

One way to summarize data like this is with relative risk, which is a ratio of two probabilities. For example, the probability that a first baby is born early is 18.2%. For other babies it is 16.8%, so the relative risk is 1.08. That means that first babies are about 8% more likely to be early.

Write code to confirm that result, then compute the relative risks of being born on time and being late. You can download a solution from http://thinkstats.com/risk.py.
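As a sketch of the generalized function the hint suggests, here is a bin probability and the relative-risk arithmetic on a plain-dict PMF; the function names and bin bounds are illustrative:

def prob_range(pmf, low, high):
    # Total probability of the values from low to high, inclusive.
    return sum(p for week, p in pmf.items() if low <= week <= high)

# ProbEarly, ProbOnTime and ProbLate can then call prob_range with
# bins (0, 37), (38, 40) and (41, 50), for some upper bound past the data.

def relative_risk(p1, p2):
    # Ratio of two probabilities.
    return p1 / p2

# Using the percentages quoted above:
print(relative_risk(0.182, 0.168))   # about 1.08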

2.11 Conditional probability

Imagine that someone you know is pregnant, and it is the beginning of Week 39. What is the chance that the baby will be born in the next week? How much does the answer change if it’s a first baby?

We can answer these questions by computing a conditional probability, which is (ahem!) a probability that depends on a condition. In this case, the condition is that we know the baby didn’t arrive during Weeks 0–38.

Here’s one way to do it:

1. Given a PMF, generate a fake cohort of 1000 pregnancies. For each number of weeks, x, the number of pregnancies with duration x is 1000 · PMF(x).

2. Remove from the cohort all pregnancies with length less than 39.


3. Compute the PMF of the remaining durations; the result is the conditional PMF.

4. Evaluate the conditional PMF at x = 39 weeks.

This algorithm is conceptually clear, but not very efficient. A simple alternative is to remove from the distribution the values less than 39 and then renormalize.
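Here is a sketch of the more efficient alternative on a plain-dict PMF; the function name and the example probabilities are illustrative:

def condition_on(pmf, week):
    # Remove values less than week, then renormalize the rest.
    cond = dict((w, p) for w, p in pmf.items() if w >= week)
    total = sum(cond.values())
    for w in cond:
        cond[w] /= total
    return cond

# Hypothetical PMF of pregnancy lengths: the conditional probability
# of week 39, given that the baby was not born before week 39:
pmf = {38: 0.1, 39: 0.5, 40: 0.3, 41: 0.1}
cond = condition_on(pmf, 39)
print(cond[39])   # 0.5 / 0.9, about 0.56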

Exercise 2.7 Write a function that implements either of these algorithms and computes the probability that a baby will be born during Week 39, given that it was not born prior to Week 39.

Generalize the function to compute the probability that a baby will be born during Week x, given that it was not born prior to Week x, for all x. Plot this value as a function of x for first babies and others.

You can download a solution to this problem from http://thinkstats.com/conditional.py.

2.12

Re