2.1 Means and averages
In the previous chapter, I mentioned three summary statistics—mean, vari-
ance and median—without explaining what they are. So before we go any
farther, let’s take care of that.
If you have a sample of $n$ values, $x_i$, the mean, $\mu$, is the sum of the
values divided by the number of values; in other words

$$\mu = \frac{1}{n} \sum_i x_i$$
The words “mean” and “average” are sometimes used interchangeably, but
I will maintain this distinction:
• The “mean” of a sample is the summary statistic computed with the
previous formula.
• An “average” is one of many summary statistics you might choose to
describe the typical value or the central tendency of a sample.
Sometimes the mean is a good description of a set of values. For example,
apples are all pretty much the same size (at least the ones sold in supermar-
kets). So if I buy 6 apples and the total weight is 3 pounds, it would be a
reasonable summary to say they are about a half pound each.
But pumpkins are more diverse. Suppose I grow several varieties in my
garden, and one day I harvest three decorative pumpkins that are 1 pound
each, two pie pumpkins that are 3 pounds each, and one Atlantic Gi-
ant® pumpkin that weighs 591 pounds. The mean of this sample is 100
pounds, but if I told you “The average pumpkin in my garden is 100
pounds,” that would be wrong, or at least misleading.
In this example, there is no meaningful average because there is no typical
pumpkin.
2.2 Variance
If there is no single number that summarizes pumpkin weights, we can do
a little better with two numbers: mean and variance.
In the same way that the mean is intended to describe the central tendency,
variance is intended to describe the spread. The variance of a set of values
is

$$\sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2$$
The term $x_i - \mu$ is called the "deviation from the mean," so variance is
the mean squared deviation, which is why it is denoted $\sigma^2$. The square
root of variance, $\sigma$, is called the standard deviation.
By itself, variance is hard to interpret. One problem is that the units are
strange; in this case the measurements are in pounds, so the variance is in
pounds squared. Standard deviation is more meaningful; in this case the
units are pounds.
Exercise 2.1 For the exercises in this chapter you should download http://thinkstats.com/thinkstats.py, which contains general-purpose functions we will use throughout the book. You can read documentation of these functions in http://thinkstats.com/thinkstats.html.
Write a function called Pumpkin that uses functions from thinkstats.py to compute the mean, variance and standard deviation of the pumpkin weights in the previous section.
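Here is a minimal sketch of the computation in plain Python; the exercise intends you to call the corresponding functions in thinkstats.py instead, whose names are not shown here:

import math

def Pumpkin(weights):
    """Sketch: computes the mean, variance and standard deviation
    of a sample, following the formulas in Sections 2.1 and 2.2.
    """
    n = len(weights)
    mu = sum(weights) / float(n)
    var = sum((x - mu) ** 2 for x in weights) / float(n)
    sigma = math.sqrt(var)
    return mu, var, sigma

# the six pumpkins from Section 2.1
print Pumpkin([1, 1, 1, 3, 3, 591])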
Exercise 2.2 Reusing code from survey.py and first.py, compute the stan-
dard deviation of gestation time for first babies and others. Does it look like
the spread is the same for the two groups?
How big is the difference in the means compared to these standard devia-
tions? What does this comparison suggest about the statistical significance
of the difference?
If you have prior experience, you might have seen a formula for variance
with n − 1 in the denominator, rather than n. This statistic is called the
“sample variance,” and it is used to estimate the variance in a population
using a sample. We will come back to this in Chapter 8.
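For reference, that estimator is

$$S^2 = \frac{1}{n-1} \sum_i (x_i - \mu)^2$$

where $\mu$ here is the sample mean.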
2.3 Distributions
Summary statistics are concise, but dangerous, because they obscure the
data. An alternative is to look at the distribution of the data, which de-
scribes how often each value appears.
The most common representation of a distribution is a histogram, which is
a graph that shows the frequency or probability of each value.
In this context, frequency means the number of times a value appears in a
dataset—it has nothing to do with the pitch of a sound or tuning of a radio
signal. A probability is a frequency expressed as a fraction of the sample
size, n.
In Python, an efficient way to compute frequencies is with a dictionary.
Given a sequence of values, t:
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
The result is a dictionary that maps from values to frequencies. To get from
frequencies to probabilities, we divide through by n, which is called nor-
malization:
n = float(len(t))
pmf = {}
for x, freq in hist.items():
    pmf[x] = freq / n
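For example, given the small sample that appears later in this chapter, the two loops produce these dictionaries:

t = [1, 2, 2, 3, 5]
# after the loops above:
# hist is {1: 1, 2: 2, 3: 1, 5: 1}
# pmf is {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}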
The normalized histogram is called a PMF, which stands for “probability
mass function”; that is, it’s a function that maps from values to probabilities
(I’ll explain “mass” in Section 6.3).
It might be confusing to call a Python dictionary a function. In mathematics,
a function is a map from one set of values to another. In Python, we usually
represent mathematical functions with function objects, but in this case we
are using a dictionary (dictionaries are also called “maps,” if that helps).
2.4 Representing histograms
I wrote a Python module called Pmf.py that contains class definitions for Hist objects, which represent histograms, and Pmf objects, which represent PMFs. You can read the documentation at thinkstats.com/Pmf.html and download the code from thinkstats.com/Pmf.py.
The function MakeHistFromList takes a list of values and returns a new Hist object. You can test it in Python's interactive mode:

>>> import Pmf
>>> hist = Pmf.MakeHistFromList([1, 2, 2, 3, 5])
>>> print hist
<Pmf.Hist object at 0xb76cf68c>
Pmf.Hist means that this object is a member of the Hist class, which is de-
fined in the Pmf module. In general, I use upper case letters for the names
of classes and functions, and lower case letters for variables.
Hist objects provide methods to look up values and their probabilities. Freq takes a value and returns its frequency:

>>> hist.Freq(2)
2
If you look up a value that has never appeared, the frequency is 0.
>>> hist.Freq(4)
0
Values returns an unsorted list of the values in the Hist:

>>> hist.Values()
[1, 5, 3, 2]
To loop through the values in order, you can use the built-in function sorted:

for val in sorted(hist.Values()):
    print val, hist.Freq(val)
If you are planning to look up all of the frequencies, it is more efficient to use Items, which returns an unsorted list of value-frequency pairs:

for val, freq in hist.Items():
    print val, freq
Exercise 2.3 The mode of a distribution is the most frequent value (see http://wikipedia.org/wiki/Mode_(statistics)). Write a function called Mode that takes a Hist object and returns the most frequent value.

As a more challenging version, write a function called AllModes that takes a Hist object and returns a list of value-frequency pairs in descending order of frequency. Hint: the operator module provides a function called itemgetter which you can pass as a key to sorted.
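One possible solution, sketched here using only the Hist methods introduced in this section (treat it as a starting point, not the book's reference implementation):

import operator

def Mode(hist):
    """Returns the most frequent value in hist."""
    best_val, best_freq = None, 0
    for val, freq in hist.Items():
        if freq > best_freq:
            best_val, best_freq = val, freq
    return best_val

def AllModes(hist):
    """Returns value-frequency pairs, most frequent first."""
    return sorted(hist.Items(), key=operator.itemgetter(1), reverse=True)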
2.5 Plotting histograms
There are a number of Python packages for making figures and graphs. The one I will demonstrate is pyplot, which is part of the matplotlib package at http://matplotlib.sourceforge.net.
This package is included in many Python installations. To see whether you
have it, launch the Python interpreter and run:
import matplotlib.pyplot as pyplot
pyplot.pie([1,2,3])
pyplot.show()
If you have matplotlib you should see a simple pie chart; otherwise you
will have to install it.
Histograms and PMFs are most often plotted as bar charts. The pyplot function to draw a bar chart is bar. Hist objects provide a method called Render that returns a sorted list of values and a list of the corresponding frequencies, which is the format bar expects:

>>> vals, freqs = hist.Render()
>>> rectangles = pyplot.bar(vals, freqs)
>>> pyplot.show()
I wrote a module called myplot.py that provides functions for plotting histograms, PMFs and other objects we will see soon. You can read the documentation at thinkstats.com/myplot.html and download the code from thinkstats.com/myplot.py. Or you can use pyplot directly, if you prefer. Either way, you can find the documentation for pyplot on the web.
Figure 2.1 shows histograms of pregnancy lengths for first babies and oth-
ers.
[Figure 2.1: Histogram of pregnancy lengths: frequency by weeks, for first babies and others.]

Histograms are useful because they make the following features immediately apparent:
Mode: The most common value in a distribution is called the mode. In
Figure 2.1 there is a clear mode at 39 weeks. In this case, the mode is
the summary statistic that does the best job of describing the typical
value.
Shape: Around the mode, the distribution is asymmetric; it drops off
quickly to the right and more slowly to the left. From a medical point
of view, this makes sense. Babies are often born early, but seldom later
than 42 weeks. Also, the right side of the distribution is truncated
because doctors often intervene after 42 weeks.
Outliers: Values far from the mode are called outliers. Some of these are
just unusual cases, like babies born at 30 weeks. But many of them are
probably due to errors, either in the reporting or recording of data.
Although histograms make some features apparent, they are usually not
useful for comparing two distributions. In this example, there are fewer
“first babies” than “others,” so some of the apparent differences in the his-
tograms are due to sample sizes. We can address this problem using PMFs.
2.6 Representing PMFs
Pmf.py provides a class called Pmf that represents PMFs. The notation can be confusing, but here it is: Pmf is the name of the module and also the class, so the full name of the class is Pmf.Pmf. I often use pmf as a variable name.
Finally, in the text, I use PMF to refer to the general concept of a probability
mass function, independent of my implementation.
To create a Pmf object, use MakePmfFromList, which takes a list of values:

>>> import Pmf
>>> pmf = Pmf.MakePmfFromList([1, 2, 2, 3, 5])
>>> print pmf
<Pmf.Pmf object at 0xb76cf68c>
Pmf and Hist objects are similar in many ways. The methods Values and
Items work the same way for both types. The biggest difference is that
a Hist maps from values to integer counters; a Pmf maps from values to
floating-point probabilities.
To look up the probability associated with a value, use Prob:

>>> pmf.Prob(2)
0.4
You can modify an existing Pmf by incrementing the probability associated with a value:

>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6
Or you can multiply a probability by a factor:

>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3
If you modify a Pmf, the result may not be normalized; that is, the probabilities may no longer add up to 1. To check, you can call Total, which returns the sum of the probabilities:

>>> pmf.Total()
0.9
To renormalize, call Normalize:

>>> pmf.Normalize()
>>> pmf.Total()
1.0
Pmf objects provide a Copy method so you can make and modify a copy without affecting the original.
Exercise 2.4 According to Wikipedia, “Survival analysis is a branch of statis-
tics which deals with death in biological organisms and failure in mechani-
cal systems;" see http://wikipedia.org/wiki/Survival_analysis.
As part of survival analysis, it is often useful to compute the remaining life-
time of, for example, a mechanical component. If we know the distribution
of lifetimes and the age of the component, we can compute the distribution
of remaining lifetimes.
Write a function called RemainingLifetime that takes a Pmf of lifetimes and
an age, and returns a new Pmf that represents the distribution of remaining
lifetimes.
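Here is one possible sketch, using the Pmf methods shown in Section 2.6; the empty constructor Pmf.Pmf() is an assumption about the module's interface:

import Pmf

def RemainingLifetime(lifetimes, age):
    """Distribution of remaining lifetime, given survival to `age`.

    lifetimes: Pmf of total lifetimes
    age: current age of the component
    """
    remaining = Pmf.Pmf()    # assumed empty-Pmf constructor
    for lifetime, prob in lifetimes.Items():
        if lifetime >= age:
            # what's left after `age` keeps the same probability weight
            remaining.Incr(lifetime - age, prob)
    remaining.Normalize()
    return remaining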
Exercise 2.5 In Section 2.1 we computed the mean of a sample by adding
up the elements and dividing by n. If you are given a PMF, you can still
compute the mean, but the process is slightly different:
$$\mu = \sum_i p_i x_i$$

where the $x_i$ are the unique values in the PMF and $p_i = \mathrm{PMF}(x_i)$. Similarly, you can compute variance like this:

$$\sigma^2 = \sum_i p_i (x_i - \mu)^2$$
Write functions called PmfMean and PmfVar that take a Pmf object and compute the mean and variance. To test these methods, check that they are consistent with the methods Mean and Var in Pmf.py.
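A direct translation of the formulas, assuming only the Items method shown earlier, might look like this:

def PmfMean(pmf):
    """Computes the mean as the sum of p_i * x_i."""
    return sum(p * x for x, p in pmf.Items())

def PmfVar(pmf):
    """Computes the variance as the sum of p_i * (x_i - mu)**2."""
    mu = PmfMean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.Items())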
2.7 Plotting PMFs
There are two common ways to plot Pmfs:
• To plot a Pmf as a bar graph, you can use pyplot.bar or myplot.Hist.
Bar graphs are most useful if the number of values in the Pmf is small.

• To plot a Pmf as a line, you can use pyplot.plot or myplot.Pmf. Line
plots are most useful if there are a large number of values and the Pmf
is smooth.
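For example, assuming Pmf objects provide a Render method analogous to the one Hist provides (an assumption about the interface), a line plot takes only a few lines:

# assumes pmf.Render() returns sorted values and their probabilities,
# like the Hist method of the same name
vals, probs = pmf.Render()
pyplot.plot(vals, probs)
pyplot.show()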
Figure 2.2 shows the PMF of pregnancy lengths as a bar graph. Using the PMF, we can see more clearly where the distributions differ. First babies seem to be less likely to arrive on time (week 39) and more likely to be late (weeks 41 and 42).

[Figure 2.2: PMF of pregnancy lengths: probability by weeks, for first babies and others.]
The code that generates the figures in this chapter is available from http://thinkstats.com/descriptive.py. To run it, you will need the modules it imports and the data from the NSFG (see Section 1.3).
Note: pyplot provides a function called hist that takes a sequence of values, computes the histogram and plots it. Since I use Hist objects, I usually don't use pyplot.hist.
2.8 Outliers
Outliers are values that are far from the central tendency. Outliers might
be caused by errors in collecting or processing the data, or they might be
correct but unusual measurements. It is always a good idea to check for
outliers, and sometimes it is useful and appropriate to discard them.
In the list of pregnancy lengths for live births, the 10 lowest values are {0, 4,
9, 13, 17, 17, 18, 19, 20, 21}. Values below 20 weeks are certainly errors, and
values higher than 30 weeks are probably legitimate. But values in between
are hard to interpret.
On the other end, the highest values are:
weeks  count
43     148
44     46
45     10
46     1
47     1
48     7
50     2

Again, some values are almost certainly errors, but it is hard to know for sure. One option is to trim the data by discarding some fraction of the highest and lowest values (see http://wikipedia.org/wiki/Truncated_mean).

[Figure 2.3: Difference in percentage, by week: 100 (PMF_first - PMF_other) plotted against weeks 34-46.]
2.9 Other visualizations
Histograms and PMFs are useful for exploratory data analysis; once you
have an idea what is going on, it is often useful to design a visualization
that focuses on the apparent effect.
In the NSFG data, the biggest differences in the distributions are near the
mode. So it makes sense to zoom in on that part of the graph, and to trans-
form the data to emphasize differences.
Figure 2.3 shows the difference between the PMFs for weeks 35–45. I multi-
plied by 100 to express the differences in percentage points.
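A sketch of the computation behind the figure, assuming first_pmf and other_pmf were built with MakePmfFromList (the variable names are illustrative):

weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    # express the difference in percentage points
    diffs.append(100 * (p1 - p2))
pyplot.bar(weeks, diffs)
pyplot.show()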
This figure makes the pattern clearer: first babies are less likely to be born
in week 39, and somewhat more likely to be born in weeks 41 and 42.
2.10 Relative risk
We started with the question, “Do first babies arrive late?” To make that
more precise, let’s say that a baby is early if it is born during Week 37 or
earlier, on time if it is born during Week 38, 39 or 40, and late if it is born
during Week 41 or later. Ranges like these that are used to group data are
called bins.
Exercise 2.6 Create a file named risk.py. Write functions named ProbEarly, ProbOnTime and ProbLate that take a PMF and compute the fraction of births that fall into each bin. Hint: write a generalized function that these functions call.
Make three PMFs, one for first babies, one for others, and one for all live
births. For each PMF, compute the probability of being born early, on time,
or late.
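A possible shape for the generalized helper the hint mentions, assuming Prob returns 0 for values that don't appear in the Pmf; the upper bound of 50 weeks reflects the highest value in the table in Section 2.8:

def ProbRange(pmf, low, high):
    """Total probability that a value falls in [low, high]."""
    return sum(pmf.Prob(week) for week in range(low, high + 1))

def ProbEarly(pmf):
    return ProbRange(pmf, 0, 37)

def ProbOnTime(pmf):
    return ProbRange(pmf, 38, 40)

def ProbLate(pmf):
    return ProbRange(pmf, 41, 50)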
One way to summarize data like this is with relative risk, which is a ratio
of two probabilities. For example, the probability that a first baby is born
early is 18.2%. For other babies it is 16.8%, so the relative risk is 1.08. That
means that first babies are about 8% more likely to be early.
Write code to confirm that result, then compute the relative risks of being
born on time and being late. You can download a solution from http://thinkstats.com/risk.py.
2.11 Conditional probability
Imagine that someone you know is pregnant, and it is the beginning of
Week 39. What is the chance that the baby will be born in the next week?
How much does the answer change if it’s a first baby?
We can answer these questions by computing a conditional probability,
which is (ahem!) a probability that depends on a condition. In this case, the
condition is that we know the baby didn’t arrive during Weeks 0–38.
Here’s one way to do it:
1. Given a PMF, generate a fake cohort of 1000 pregnancies. For each
number of weeks, x, the number of pregnancies with duration x is
1000 · PMF(x).
2. Remove from the cohort all pregnancies with length less than 39.
3. Compute the PMF of the remaining durations; the result is the condi-
tional PMF.
4. Evaluate the conditional PMF at x = 39 weeks.
This algorithm is conceptually clear, but not very efficient. A simple alter-
native is to remove from the distribution the values less than 39 and then
renormalize.
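A sketch of this second algorithm, using only methods shown in Section 2.6 (Copy, Values, Mult, Normalize, Prob):

def ConditionalProb(pmf, week):
    """Probability of birth during `week`, given not born before `week`."""
    cond = pmf.Copy()
    for val in cond.Values():
        if val < week:
            cond.Mult(val, 0)    # zero out durations before the condition
    cond.Normalize()
    return cond.Prob(week)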
Exercise 2.7 Write a function that implements either of these algorithms and
computes the probability that a baby will be born during Week 39, given
that it was not born prior to Week 39.
Generalize the function to compute the probability that a baby will be born
during Week x, given that it was not born prior to Week x, for all x. Plot this
value as a function of x for first babies and others.
You can download a solution to this problem from http://thinkstats.com/conditional.py.