3.1
The class size paradox
At many American colleges and universities, the student-to-faculty ratio is
about 10:1. But students are often surprised to discover that their average
class size is bigger than 10. There are two reasons for the discrepancy:
• Students typically take 4–5 classes per semester, but professors often
teach 1 or 2.
• The number of students who enjoy a small class is small, but the num-
ber of students in a large class is (ahem!) large.
The first effect is obvious (at least once it is pointed out); the second is more
subtle. So let’s look at an example. Suppose that a college offers 65 classes
in a given semester, with the following distribution of sizes:
s✐③❡
❝♦✉♥t
✺✲ ✾
✽
✶✵✲✶✹
✽
✶✺✲✶✾
✶✹
✷✵✲✷✹
✹
✷✺✲✷✾
✻
✸✵✲✸✹
✶✷
✸✺✲✸✾
✽
✹✵✲✹✹
✸
✹✺✲✹✾
✷
26
Chapter 3. Cumulative distribution functions
If you ask the Dean for the average class size, he would construct a PMF,
compute the mean, and report that the average class size is 24.
But if you survey a group of students, ask them how many students are in
their classes, and compute the mean, you would think that the average class
size was higher.
Exercise 3.1 Build a PMF of these data and compute the mean as perceived
by the Dean. Since the data have been grouped in bins, you can use the
mid-point of each bin.
Now find the distribution of class sizes as perceived by students and com-
pute its mean.
Suppose you want to find the distribution of class sizes at a college, but you
can’t get reliable data from the Dean. An alternative is to choose a random
sample of students and ask them the number of students in each of their
classes. Then you could compute the PMF of their responses.
The result would be biased because large classes would be oversampled,
but you could estimate the actual distribution of class sizes by applying an
appropriate transformation to the observed distribution.
Write a function called ❯♥❜✐❛sP♠❢ that takes the PMF of the observed values
and returns a new Pmf object that estimates the distribution of class sizes.
You can download a solution to this problem from ❤tt♣✿✴✴t❤✐♥❦st❛ts✳
❝♦♠✴❝❧❛ss❴s✐③❡✳♣②.
Exercise 3.2 In most foot races, everyone starts at the same time. If you are
a fast runner, you usually pass a lot of people at the beginning of the race,
but after a few miles everyone around you is going at the same speed.
When I ran a long-distance (209 miles) relay race for the first time, I noticed
an odd phenomenon: when I overtook another runner, I was usually much
faster, and when another runner overtook me, he was usually much faster.
At first I thought that the distribution of speeds might be bimodal; that is,
there were many slow runners and many fast runners, but few at my speed.
Then I realized that I was the victim of selection bias. The race was unusual
in two ways: it used a staggered start, so teams started at different times;
also, many teams included runners at different levels of ability.
As a result, runners were spread out along the course with little relationship
between speed and location. When I started running my leg, the runners
near me were (pretty much) a random sample of the runners in the race.
3.2. The limits of PMFs
27
So where does the bias come from? During my time on the course, the
chance of overtaking a runner, or being overtaken, is proportional to the
difference in our speeds. To see why, think about the extremes. If another
runner is going at the same speed as me, neither of us will overtake the
other. If someone is going so fast that they cover the entire course while I
am running, they are certain to overtake me.
Write a function called ❇✐❛sP♠❢ that takes a Pmf representing the actual
distribution of runners’ speeds, and the speed of a running observer, and
returns a new Pmf representing the distribution of runners’ speeds as seen
by the observer.
To test your function, get the distribution of speeds from a normal road race
(not a relay). I wrote a program that reads the results from the James Joyce
Ramble 10K in Dedham MA and converts the pace of each runner to MPH.
Download it from ❤tt♣✿✴✴t❤✐♥❦st❛ts✳❝♦♠✴r❡❧❛②✳♣②. Run it and look at
the PMF of speeds.
Now compute the distribution of speeds you would observe if you ran a
relay race at 7.5 MPH with this group of runners. You can download a
solution from ❤tt♣✿✴✴t❤✐♥❦st❛ts✳❝♦♠✴r❡❧❛②❴s♦❧♥✳♣②
3.2
The limits of PMFs
PMFs work well if the number of values is small. But as the number of
values increases, the probability associated with each value gets smaller and
the effect of random noise increases.
For example, we might be interested in the distribution of birth weights. In
the NSFG data, the variable t♦t❛❧✇❣t❴♦③ records weight at birth in ounces.
Figure 3.1 shows the PMF of these values for first babies and others.
Overall, these distributions resemble the familiar “bell curve,” with many
values near the mean and a few values much higher and lower.
But parts of this figure are hard to interpret. There are many spikes and
valleys, and some apparent differences between the distributions. It is hard
to tell which of these features are significant. Also, it is hard to see overall
patterns; for example, which distribution do you think has the higher mean?
These problems can be mitigated by binning the data; that is, dividing the
domain into non-overlapping intervals and counting the number of values
28
Chapter 3. Cumulative distribution functions
Birth weight PMF
0.040
first babies
0.035
others
0.030
0.025
0.020
probability 0.015
0.010
0.005
0
0.000
50
100
150
200
250
weight (ounces)
Figure 3.1: PMF of birth weights. This figure shows a limitation of PMFs:
they are hard to compare.
in each bin. Binning can be useful, but it is tricky to get the size of the bins
right. If they are big enough to smooth out noise, they might also smooth
out useful information.
An alternative that avoids these problems is the cumulative distribution
function, or CDF. But before we can get to that, we have to talk about per-
centiles.
3.3
Percentiles
If you have taken a standardized test, you probably got your results in the
form of a raw score and a percentile rank. In this context, the percentile
rank is the fraction of people who scored lower than you (or the same). So
if you are “in the 90th percentile,” you did as well as or better than 90% of
the people who took the exam.
Here’s how you could compute the percentile rank of a value, ②♦✉r❴s❝♦r❡,
relative to the scores in the sequence s❝♦r❡s:
❞❡❢ P❡r❝❡♥t✐❧❡❘❛♥❦✭s❝♦r❡s✱ ②♦✉r❴s❝♦r❡✮✿
❝♦✉♥t ❂ ✵
❢♦r s❝♦r❡ ✐♥ s❝♦r❡s✿
✐❢ s❝♦r❡ ❁❂ ②♦✉r❴s❝♦r❡✿
❝♦✉♥t ✰❂ ✶
3.4. Cumulative distribution functions
29
♣❡r❝❡♥t✐❧❡❴r❛♥❦ ❂ ✶✵✵✳✵ ✯ ❝♦✉♥t ✴ ❧❡♥✭s❝♦r❡s✮
r❡t✉r♥ ♣❡r❝❡♥t✐❧❡❴r❛♥❦
For example, if the scores in the sequence were 55, 66, 77, 88 and 99, and
you got the 88, then your percentile rank would be ✶✵✵ ✯ ✹ ✴ ✺ which is
80.
If you are given a value, it is easy to find its percentile rank; going the other
way is slightly harder. If you are given a percentile rank and you want to
find the corresponding value, one option is to sort the values and search for
the one you want:
❞❡❢ P❡r❝❡♥t✐❧❡✭s❝♦r❡s✱ ♣❡r❝❡♥t✐❧❡❴r❛♥❦✮✿
s❝♦r❡s✳s♦rt✭✮
❢♦r s❝♦r❡ ✐♥ s❝♦r❡s✿
✐❢ P❡r❝❡♥t✐❧❡❘❛♥❦✭s❝♦r❡s✱ s❝♦r❡✮ ❃❂ ♣❡r❝❡♥t✐❧❡❴r❛♥❦✿
r❡t✉r♥ s❝♦r❡
The result of this calculation is a percentile. For example, the 50th percentile
is the value with percentile rank 50. In the distribution of exam scores, the
50th percentile is 77.
Exercise 3.3 This implementation of P❡r❝❡♥t✐❧❡ is not very efficient. A bet-
ter approach is to use the percentile rank to compute the index of the corre-
sponding percentile. Write a version of P❡r❝❡♥t✐❧❡ that uses this algorithm.
You can download a solution from ❤tt♣✿✴✴t❤✐♥❦st❛ts✳❝♦♠✴s❝♦r❡❴
❡①❛♠♣❧❡✳♣②.
Exercise 3.4 Optional: If you only want to compute one percentile, it is not
efficient to sort the scores. A better option is the selection algorithm, which
you can read about at ❤tt♣✿✴✴✇✐❦✐♣❡❞✐❛✳♦r❣✴✇✐❦✐✴❙❡❧❡❝t✐♦♥❴❛❧❣♦r✐t❤♠.
Write (or find) an implementation of the selection algorithm and use it to
write an efficient version of P❡r❝❡♥t✐❧❡.
3.4
Cumulative distribution functions
Now that we understand percentiles, we are ready to tackle the cumulative
distribution function (CDF). The CDF is the function that maps values to
their percentile rank in a distribution.
30
Chapter 3. Cumulative distribution functions
The CDF is a function of x, where x is any value that might appear in the
distribution. To evaluate CDF(x) for a particular value of x, we compute the
fraction of the values in the sample less than (or equal to) x.
Here’s what that looks like as a function that takes a sample, t, and a value,
①:
❞❡❢ ❈❞❢✭t✱ ①✮✿
❝♦✉♥t ❂ ✵✳✵
❢♦r ✈❛❧✉❡ ✐♥ t✿
✐❢ ✈❛❧✉❡ ❁❂ ①✿
❝♦✉♥t ✰❂ ✶✳✵
♣r♦❜ ❂ ❝♦✉♥t ✴ ❧❡♥✭t✮
r❡t✉r♥ ♣r♦❜
This function should look familiar; it is almost identical to P❡r❝❡♥t✐❧❡❘❛♥❦,
except that the result is in a probability in the range 0–1 rather than a per-
centile rank in the range 0–100.
As an example, suppose a sample has the values {1, 2, 2, 3, 5}. Here are some
values from its CDF:
CDF(0) = 0
CDF(1) = 0.2
CDF(2) = 0.6
CDF(3) = 0.8
CDF(4) = 0.8
CDF(5) = 1
We can evaluate the CDF for any value of x, not just values that appear in
the sample. If x is less than the smallest value in the sample, CDF(x) is 0. If
x is greater than the largest value, CDF(x) is 1.
Figure 3.2 is a graphical representation of this CDF. The CDF of a sample is
a step function. In the next chapter we will see distributions whose CDFs
are continuous functions.
3.5
Representing CDFs
I have written a module called ❈❞❢ that provides a class named ❈❞❢ that
represents CDFs.
You can read the documentation of this module at
3.5. Representing CDFs
31
1.0
CDF
0.8
0.6
CDF(x) 0.4
0.2
0
0.0
1
2
3
4
5
6
x
Figure 3.2: Example of a CDF.
❤tt♣✿✴✴t❤✐♥❦st❛ts✳❝♦♠✴❈❞❢✳❤t♠❧ and you can download it from ❤tt♣✿
✴✴t❤✐♥❦st❛ts✳❝♦♠✴❈❞❢✳♣②.
Cdfs are implemented with two sorted lists: ①s, which contains the values,
and ♣s, which contains the probabilities. The most important methods Cdfs
provide are:
Pr♦❜✭①✮: Given a value x, computes the probability p = CDF(x).
❱❛❧✉❡✭♣✮: Given a probability p, computes the corresponding value, x; that
is, the inverse CDF of p.
Because ①s and ♣s are sorted, these operations can use the bisection algo-
rithm, which is efficient. The run time is proportional to the logarithm of
the number of values; see ❤tt♣✿✴✴✇✐❦✐♣❡❞✐❛✳♦r❣✴✇✐❦✐✴❚✐♠❡❴❝♦♠♣❧❡①✐t②.
Cdfs also provide ❘❡♥❞❡r, which returns two lists, ①s and ♣s, suitable for
plotting the CDF. Because the CDF is a step function, these lists have two
elements for each unique value in the distribution.
The Cdf module provides several functions for making Cdfs, including
▼❛❦❡❈❞❢❋r♦♠▲✐st, which takes a sequence of values and returns their Cdf.
Finally, ♠②♣❧♦t✳♣② provides functions named ❈❞❢ and ❈❞❢s that plot Cdfs as
lines.
Exercise 3.5 Download ❈❞❢✳♣② and r❡❧❛②✳♣② (see Exercise 3.2) and generate
a plot that shows the CDF of running speeds. Which gives you a better sense
of the shape of the distribution, the PMF or the CDF? You can download a
solution from ❤tt♣✿✴✴t❤✐♥❦st❛ts✳❝♦♠✴r❡❧❛②❴❝❞❢✳♣②.
32
Chapter 3. Cumulative distribution functions
Birth weight CDF
1.0
first babies
others
0.8
0.6
probability 0.4
0.2
0
0.0
50
100
150
200
weight (ounces)
Figure 3.3: CDF of birth weights.
3.6
Back to the survey data
Figure 3.3 shows the CDFs of birth weight for first babies and others in the
NSFG dataset.
This figure makes the shape of the distributions, and the differences be-
tween them, much clearer. We can see that first babies are slightly lighter
throughout the distribution, with a larger discrepancy above the mean.
Exercise 3.6 How much did you weigh at birth? If you don’t know, call your
mother or someone else who knows. Using the pooled data (all live births),
compute the distribution of birth weights and use it to find your percentile
rank. If you were a first baby, find your percentile rank in the distribution
for first babies. Otherwise use the distribution for others. If you are in the
90th percentile or higher, call your mother back and apologize.
Exercise 3.7 Suppose you and your classmates compute the percentile rank
of your birth weights and then compute the CDF of the percentile ranks.
What do you expect it to look like? Hint: what fraction of the class do you
expect to be above the median?
3.7
Conditional distributions
A conditional distribution is the distribution of a subset of the data which
is selected according to a condition.
3.8. Random numbers
33
For example, if you are above average in weight, but way above average in
height, then you might be relatively light for your height. Here’s how you
could make that claim more precise.
1. Select a cohort of people who are the same height as you (within some
range).
2. Find the CDF of weight for those people.
3. Find the percentile rank of your weight in that distribution.
Percentile ranks are useful for comparing measurements from different
tests, or tests applied to different groups.
For example, people who compete in foot races are usually grouped by age
and gender. To compare people in different groups, you can convert race
times to percentile ranks.
Exercise 3.8 I recently ran the James Joyce Ramble 10K in Dedham MA.
The results are available from ❤tt♣✿✴✴❝♦♦❧r✉♥♥✐♥❣✳❝♦♠✴r❡s✉❧ts✴✶✵✴♠❛✴
❆♣r✷✺❴✷✼t❤❆♥❴s❡t✶✳s❤t♠❧. Go to that page and find my results. I came
in 97th in a field of 1633, so what is my percentile rank in the field?
In my division (M4049 means “male between 40 and 49 years of age”) I
came in 26th out of 256. What is my percentile rank in my division?
If I am still running in 10 years (and I hope I am), I will be in the M5059
division. Assuming that my percentile rank in my division is the same,
how much slower should I expect to be?
I maintain a friendly rivalry with a student who is in the F2039 division.
How fast does she have to run her next 10K to “beat” me in terms of per-
centile ranks?
3.8
Random numbers
CDFs are useful for generating random numbers with a given distribution.
Here’s how:
• Choose a random probability in the range 0–1.
• Use ❈❞❢✳❱❛❧✉❡ to find the value in the distribution that corresponds
to the probability you chose.
34
Chapter 3. Cumulative distribution functions
It might not be obvious why this works, but since it is easier to implement
than to explain, let’s try it out.
Exercise 3.9 Write a function called ❙❛♠♣❧❡, that takes a Cdf and an integer,
n, and returns a list of n values chosen at random from the Cdf. Hint: use
r❛♥❞♦♠✳r❛♥❞♦♠. You will find a solution to this exercise in ❈❞❢✳♣②.
Using the distribution of birth weights from the NSFG, generate a random
sample with 1000 elements. Compute the CDF of the sample. Make a plot
that shows the original CDF and the CDF of the random sample. For large
values of n, the distributions should be the same.
This process, generating a random sample based on a measured sample, is
called resampling.
There are two ways to draw a sample from a population: with and without
replacement. If you imagine drawing marbles from an urn1, “replacement”
means putting the marbles back as you go (and stirring), so the population
is the same for every draw. “Without replacement,” means that each marble
can only be drawn once, so the remaining population is different after each
draw.
In Python, sampling with replacement can be implemented with
r❛♥❞♦♠✳r❛♥❞♦♠ to choose a percentile rank, or r❛♥❞♦♠✳❝❤♦✐❝❡ to choose an
element from a sequence. Sampling without replacement is provided by
r❛♥❞♦♠✳s❛♠♣❧❡.
Exercise 3.10 The numbers generated by r❛♥❞♦♠✳r❛♥❞♦♠ are supposed to be
uniform between 0 and 1; that is, every value in the range should have the
same probability.
Generate 1000 numbers from r❛♥❞♦♠✳r❛♥❞♦♠ and plot their PMF and CDF.
Can you tell whether they are uniform?
You can read about the uniform distribution at ❤tt♣✿✴✴✇✐❦✐♣❡❞✐❛✳♦r❣✴
✇✐❦✐✴❯♥✐❢♦r♠❴❞✐str✐❜✉t✐♦♥❴✭❞✐s❝r❡t❡✮.
3.9
Summary statistics revisited
Once you have computed a CDF, it is easy to compute other summary statis-
tics. The median is just the 50th percentile2. The 25th and 75th percentiles
1The marbles-in-an-urn scenario is a standard model for random sampling processes
(see ❤tt♣✿✴✴✇✐❦✐♣❡❞✐❛✳♦r❣✴✇✐❦✐✴❯r♥❴♣r♦❜❧❡♠).
2You might see other definitions of the median. In particular, some sources suggest
that if you have an even number of elements in a sample, the median is the average of
3.10. Glossary
35
are often used to check whether a distribution is symmetric, and their dif-
ference, which is called the interquartile range, measures the spread.
Exercise 3.11 Write a function called ▼❡❞✐❛♥ that takes a Cdf and computes
the median, and one called ■♥t❡rq✉❛rt✐❧❡ that computes the interquartile
range.
Compute the 25th, 50th, and 75th percentiles of the birth weight CDF. Do
these values suggest that the distribution is symmetric?
3.10
Glossary
percentile rank: The percentage of values in a distribution that are less than
or equal to a given value.
CDF: Cumulative distribution function, a function that maps from values
to their percentile ranks.
percentile: The value associated with a given percentile rank.
conditional distribution: A distribution computed under the assumption
that some condition holds.
resampling: The process of generating a random sample from a distribu-
tion that was computed from a sample.
replacement: During a sampling process, “replacement” indicates that the
population is the same for every sample. “Without replacement” in-
dicates that each element can be selected only once.
interquartile range: A measure of spread, the difference between the 75th
and 25th percentiles.
the middle two elements. This is an unnecessary special case, and it has the odd effect of generating a value that is not in the sample. As far as I’m concerned, the median is the 50th percentile. Period.
36
Chapter 3. Cumulative distribution functions