8.1 The estimation game
Let’s play a game. I’ll think of a distribution, and you have to guess what it
is. We’ll start out easy and work our way up.
I’m thinking of a distribution. I’ll give you two hints; it’s a normal distribution,
and here’s a random sample drawn from it:
{−0.441, 1.774, −0.101, −1.138, 2.975, −2.138}
What do you think is the mean parameter, µ, of this distribution?
One choice is to use the sample mean to estimate µ. Up until now we have
used the symbol µ for both the sample mean and the mean parameter, but
now, to distinguish them, I will use x̄ for the sample mean. In this example,
x̄ is 0.155, so it would be reasonable to guess µ = 0.155.
This process is called estimation, and the statistic we used (the sample
mean) is called an estimator.
Using the sample mean to estimate µ is so obvious that it is hard to imagine
a reasonable alternative. But suppose we change the game by introducing
outliers.
I’m thinking of a distribution. It’s a normal distribution, and here’s a sam-
ple that was collected by an unreliable surveyor who occasionally puts the
decimal point in the wrong place.
{−0.441, 1.774, −0.101, −1.138, 2.975, −213.8}
Now what’s your estimate of µ? If you use the sample mean your guess is
−35.12. Is that the best choice? What are the alternatives?
One option is to identify and discard outliers, then compute the sample
mean of the rest. Another option is to use the median as an estimator.
Which estimator is the best depends on the circumstances (for example,
whether there are outliers) and on what the goal is. Are you trying to mini-
mize errors, or maximize your chance of getting the right answer?
If there are no outliers, the sample mean minimizes the mean squared error
(MSE). If we play the game many times, and each time compute the error
x̄ − µ, the sample mean minimizes

MSE = (1/m) ∑ (x̄ − µ)²

where m is the number of times you play the estimation game (not to be
confused with n, which is the size of the sample used to compute x̄).
Minimizing MSE is a nice property, but it’s not always the best strategy.
For example, suppose we are estimating the distribution of wind speeds
at a building site. If we guess too high, we might overbuild the structure,
increasing its cost. But if we guess too low, the building might collapse.
Because cost as a function of error is asymmetric, minimizing MSE is not
the best strategy.
As another example, suppose I roll three six-sided dice and ask you to pre-
dict the total. If you get it exactly right, you get a prize; otherwise you get
nothing. In this case the value that minimizes MSE is 10.5, but that would
be a terrible guess. For this game, you want an estimator that has the high-
est chance of being right, which is a maximum likelihood estimator (MLE).
If you pick 10 or 11, your chance of winning is 1 in 8, and that’s the best you
can do.
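To check that claim, we can enumerate all 6³ = 216 equally likely outcomes of three dice and count the totals. This short sketch (not part of the book's code) confirms that 10 and 11 are tied for the most likely total, each with probability 27/216 = 1/8:

from itertools import product
from collections import Counter

# Count how many of the 216 equally likely rolls produce each total.
totals = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))

for total in sorted(totals):
    print(total, totals[total] / 216.0)   # 10 and 11 each appear 27 times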
Exercise 8.1 Write a function that draws 6 values from a normal distribution
with µ = 0 and σ = 1. Use the sample mean to estimate µ and compute the error x̄ − µ. Run the function 1000 times and compute MSE.
Now modify the program to use the median as an estimator. Compute MSE
again and compare to the MSE for x̄.
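Here is one way you might set up this experiment; it is a sketch rather than the book's solution, using the standard library's random and statistics modules:

import random
from statistics import median

def estimate_game(n=6, m=1000):
    """Play the estimation game m times and report MSE for two estimators."""
    mu, sigma = 0.0, 1.0
    err_mean, err_median = [], []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        err_mean.append(sum(xs) / n - mu)     # error of the sample mean
        err_median.append(median(xs) - mu)    # error of the sample median

    print('MSE of sample mean:  ', sum(e * e for e in err_mean) / m)
    print('MSE of sample median:', sum(e * e for e in err_median) / m)

estimate_game()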
8.2 Guess the variance
I’m thinking of a distribution. It’s a normal distribution, and here’s a (familiar)
sample:
{−0.441, 1.774, −0.101, −1.138, 2.975, −2.138}
What do you think is the variance, σ², of my distribution? Again, the obvious choice is to use the sample variance as an estimator. I will use S² to denote the sample variance, to distinguish it from the unknown parameter σ².

S² = (1/n) ∑ (xᵢ − x̄)²
For large samples, S² is an adequate estimator, but for small samples it tends
to be too low. Because of this unfortunate property, it is called a biased
estimator.
An estimator is unbiased if the expected total (or mean) error, after many
iterations of the estimation game, is 0. Fortunately, there is another simple
statistic that is an unbiased estimator of σ²:

S²ₙ₋₁ = (1/(n−1)) ∑ (xᵢ − x̄)²

The biggest problem with this estimator is that its name and symbol are
used inconsistently. The name “sample variance” can refer to either S² or
S²ₙ₋₁, and the symbol S² is used for either or both.

For an explanation of why S² is biased, and a proof that S²ₙ₋₁ is unbiased, see
http://wikipedia.org/wiki/Bias_of_an_estimator.
Exercise 8.2 Write a function that draws 6 values from a normal distribution
with µ = 0 and σ = 1. Use the sample variance to estimate σ² and compute
the error S² − σ². Run the function 1000 times and compute mean error (not
squared).

Now modify the program to use the unbiased estimator S²ₙ₋₁. Compute the
mean error again and see if it converges to zero as you increase the number
of games.
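One possible sketch for this exercise (not the book's solution; it assumes the true parameters µ = 0 and σ = 1 given in the exercise):

import random

def variance_game(n=6, m=1000):
    """Compare the biased and unbiased variance estimators."""
    mu, sigma = 0.0, 1.0
    err_biased, err_unbiased = [], []
    for _ in range(m):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)
        err_biased.append(ss / n - sigma ** 2)          # error of S^2
        err_unbiased.append(ss / (n - 1) - sigma ** 2)  # error of S^2_{n-1}

    print('mean error of S^2:      ', sum(err_biased) / m)
    print('mean error of S^2_{n-1}:', sum(err_unbiased) / m)

variance_game()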
8.3 Understanding errors
Before we go on, let’s clear up a common source of confusion. Properties
like MSE and bias are long-term expectations based on many iterations of
the estimation game.
While you are playing the game, you don’t know the errors. That is, if I
give you a sample and ask you to estimate a parameter, you can compute
the value of the estimator, but you can’t compute the error. If you could,
you wouldn’t need the estimator!
The reason we talk about estimation error is to describe the behavior of
different estimators in the long run. In this chapter we run experiments to
examine those behaviors; these experiments are artificial in the sense that
we know the actual values of the parameters, so we can compute errors.
But when you work with real data, you don’t, so you can’t.
Now let’s get back to the game.
8.4 Exponential distributions
I’m thinking of a distribution. It’s an exponential distribution, and here’s a
sample:
{5.384, 4.493, 19.198, 2.790, 6.122, 12.844}
What do you think is the parameter, λ, of this distribution?
In general, the mean of an exponential distribution is 1/λ, so working backwards, we might choose

λ̂ = 1 / x̄

It is common to use hat notation for estimators, so λ̂ is an estimator of λ.
And not just any estimator; it is also the MLE¹. So if you want to
maximize your chance of guessing λ exactly, λ̂ is the way to go.
But we know that x̄ is not robust in the presence of outliers, so we expect λ̂
to have the same problem.
Maybe we can find an alternative based on the sample median. Remember
that the median of an exponential distribution is log(2)/λ, so working
backwards again, we can define an estimator

λ̂½ = log(2) / µ½

where µ½ is the sample median.
Exercise 8.3 Run an experiment to see which of λ̂ and λ̂½ yields lower MSE. Test whether either of them is biased.
¹See http://wikipedia.org/wiki/Exponential_distribution#Maximum_likelihood.
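For Exercise 8.3, one possible setup is the following sketch; it is not the book's solution, and it assumes a true parameter of λ = 1, drawing samples with random.expovariate:

import math
import random
from statistics import median

def exponential_game(lam=1.0, n=6, m=1000):
    """Compare the mean-based and median-based estimators of lambda."""
    err_mean, err_median = [], []
    for _ in range(m):
        xs = [random.expovariate(lam) for _ in range(n)]
        err_mean.append(1.0 / (sum(xs) / n) - lam)         # lambda-hat = 1/xbar
        err_median.append(math.log(2) / median(xs) - lam)  # median-based estimator

    print('MSE  of 1/xbar:        ', sum(e * e for e in err_mean) / m)
    print('MSE  of log(2)/median: ', sum(e * e for e in err_median) / m)
    print('bias of 1/xbar:        ', sum(err_mean) / m)
    print('bias of log(2)/median: ', sum(err_median) / m)

exponential_game()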
8.5 Confidence intervals
So far we have looked at estimators that generate single values, known as
point estimates. For many problems, we might prefer an interval that spec-
ifies an upper and lower bound on the unknown parameter.
Or, more generally, we might want that whole distribution; that is, the range
of values the parameter could have, and for each value in the range, a notion
of how likely it is.
Let’s start with confidence intervals.
I’m thinking of a distribution. It’s an exponential distribution, and here’s a
sample:
{5.384, 4.493, 19.198, 2.790, 6.122, 12.844}
I want you to give me a range of values that you think is likely to contain the
unknown parameter λ. More specifically, I want a 90% confidence interval,
which means that if we play this game over and over, your interval will
contain λ 90% of the time.
It turns out that this version of the game is hard, so I’m going to tell you the
answer, and all you have to do is test it.
Confidence intervals are usually described in terms of the miss rate, α, so a
90% confidence interval has miss rate α = 0.1. The confidence interval for
the λ parameter of an exponential distribution is

( λ̂ χ²(2n, 1 − α/2) / 2n,  λ̂ χ²(2n, α/2) / 2n )

where n is the sample size, λ̂ is the mean-based estimator from the previous
section, and χ²(k, p) is the critical value of a chi-squared distribution with k degrees
of freedom that cuts off an upper tail with probability p (see http://wikipedia.org/wiki/Chi-square_distribution).
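As a concrete check, here is a sketch (not from the book) that evaluates this interval for the sample above using scipy.stats; chi2.ppf is the chi-squared quantile function, so the lower endpoint uses the α/2 quantile and the upper endpoint the 1 − α/2 quantile:

from scipy.stats import chi2

sample = [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]
n = len(sample)
alpha = 0.1                                # miss rate for a 90% interval
lam_hat = 1.0 / (sum(sample) / n)          # mean-based estimator of lambda

low = lam_hat * chi2.ppf(alpha / 2, 2 * n) / (2 * n)
high = lam_hat * chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n)
print('90%% CI for lambda: (%.3f, %.3f)' % (low, high))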
In general, confidence intervals are hard to compute analytically, but rel-
atively easy to estimate using simulation. But first we need to talk about
Bayesian estimation.
8.6 Bayesian estimation
If you collect a sample and compute a 90% confidence interval, it is tempting
to say that the true value of the parameter has a 90% chance of falling in the
interval. But from a frequentist point of view, that is not correct because the
parameter is an unknown but fixed value. It is either in the interval you
computed or not, so the frequentist definition of probability doesn’t apply.
So let’s try a different version of the game.
I’m thinking of a distribution. It’s an exponential distribution, and I chose
λ from a uniform distribution between 0.5 and 1.5. Here’s a sample, which
I’ll call X:
{2.675, 0.198, 1.152, 0.787, 2.717, 4.269}
Based on this sample, what value of λ do you think I chose?
In this version of the game, λ is a random quantity, so we can reasonably talk
about its distribution, and we can compute it easily using Bayes’s theorem.
Here are the steps:
1. Divide the range (0.5, 1.5) into a set of equal-sized bins. For each bin,
we define Hi, which is the hypothesis that the actual value of λ falls in
the i th bin. Since λ was drawn from a uniform distribution, the prior
probability, P(Hi), is the same for all i.
2. For each hypothesis, we compute the likelihood, P(X|Hi), which is
the chance of drawing the sample X given Hi.
P(X | Hi) = ∏ⱼ expo(λi, xj)

where expo(λ, x) is a function that computes the PDF of the exponential
distribution with parameter λ, evaluated at x:

PDFexpo(λ, x) = λ e^(−λx)
The symbol ∏ represents the product of a sequence (see http://wikipedia.org/wiki/Multiplication#Capital_Pi_notation).
3. Then by Bayes’s theorem the posterior distribution is

P(Hi | X) = P(Hi) P(X | Hi) / f

where f is the normalization factor

f = ∑ᵢ P(Hi) P(X | Hi)
Given a posterior distribution, it is easy to compute a confidence interval.
For example, to compute a 90% CI, you can use the 5th and 95th percentiles
of the posterior.
Bayesian confidence intervals are sometimes called credible intervals; for a
discussion of the differences, see http://wikipedia.org/wiki/Credible_interval.
8.7 Implementing Bayesian estimation
To represent the prior distribution, we could use a Pmf, Cdf, or any other
representation of a distribution, but since we want to map from a hypothesis
to a probability, a Pmf is a natural choice.
Each value in the Pmf represents a hypothesis; for example, the value 0.5
represents the hypothesis that λ is 0.5. In the prior distribution, all hypothe-
ses have the same probability. So we can construct the prior like this:
def MakeUniformSuite(low, high, steps):
    hypos = [low + (high-low) * i / (steps-1.0) for i in range(steps)]
    pmf = Pmf.MakePmfFromList(hypos)
    return pmf
This function makes and returns a Pmf that represents a collection of related
hypotheses, called a suite. Each hypothesis has the same probability, so the
distribution is uniform.
The arguments low and high specify the range of values; steps is the number of hypotheses.
To perform the update, we take a suite of hypotheses and a body of evi-
dence:
def Update(suite, evidence):
    for hypo in suite.Values():
        likelihood = Likelihood(evidence, hypo)
        suite.Mult(hypo, likelihood)
    suite.Normalize()
For each hypothesis in the suite, we multiply the prior probability by the
likelihood of the evidence. Then we normalize the suite.
In this function, suite has to be a Pmf, but evidence can be any type, as
long as Likelihood knows how to interpret it.
Here’s the likelihood function:
def Likelihood(evidence, hypo):
    param = hypo
    likelihood = 1
    for x in evidence:
        likelihood *= ExpoPdf(x, param)
    return likelihood
In Likelihood we assume that evidence is a sample from an exponential
distribution and compute the product defined in the previous section.
ExpoPdf evaluates the PDF of the exponential distribution at x:
def ExpoPdf(x, param):
    p = param * math.exp(-param * x)
    return p
Putting it all together, here’s the code that creates the prior and computes
the posterior:
evidence = [2.675, 0.198, 1.152, 0.787, 2.717, 4.269]
prior = MakeUniformSuite(0.5, 1.5, 100)
posterior = prior.Copy()
Update(posterior, evidence)
You can download the code in this section from http://thinkstats.com/estimate.py.
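Once you have the posterior, you can read off a 90% interval from its percentiles. Here is a sketch of one way to do that; it assumes the Pmf provides Values and Prob, as in the code above, and CredibleInterval is a name chosen here, not part of the Pmf module:

def CredibleInterval(pmf, percentage=90):
    """Return the values at the alpha and (1-alpha) percentiles of pmf."""
    alpha = (1 - percentage / 100.0) / 2
    total = 0.0
    low = high = None
    for value in sorted(pmf.Values()):
        total += pmf.Prob(value)
        if low is None and total >= alpha:
            low = value
        if high is None and total >= 1 - alpha:
            high = value
    return low, high

print(CredibleInterval(posterior, 90))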
When I think of Bayesian estimation, I imagine a room full of people, where
each person has a different guess about whatever you are trying to estimate.
So in this example they each have a guess about the correct value of λ.
Initially, each person has a degree of confidence about their own hypothesis.
After seeing the evidence, each person updates their confidence based on
P(E|H), the likelihood of the evidence, given their hypothesis.
Most often the likelihood function computes a probability, which is at most
1, so initially everyone’s confidence goes down (or stays the same). But then
we normalize, which increases everyone’s confidence.
So the net effect is that some people get more confident, and some less,
depending on the relative likelihood of their hypothesis.
8.8 Censored data
The following problem appears in Chapter 3 of David MacKay’s Informa-
tion Theory, Inference and Learning Algorithms, which you can download from
http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/.
Unstable particles are emitted from a source and decay at a dis-
tance x, a real number that has an exponential probability distri-
bution with [parameter] λ. Decay events can only be observed
if they occur in a window extending from x = 1 cm to x = 20 cm.
N decays are observed at locations { x1, ..., xN }. What is λ?
This is an example of an estimation problem with censored data; that is, we
know that some data are systematically excluded.
One of the strengths of Bayesian estimation is that it can deal with censored
data with relative ease. We can use the method from the previous section
with only one change: we have to replace PDFexpo with the conditional dis-
tribution:
PDFcond(λ, x) = λ e^(−λx) / Z(λ)

for 1 < x < 20, and 0 otherwise, with

Z(λ) = ∫₁²⁰ λ e^(−λx) dx = e^(−λ) − e^(−20λ)
You might remember Z( λ) from Exercise 6.5. I told you to keep it handy.
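Concretely, the only change to the code from Section 8.7 is the density function used inside Likelihood. Here is a sketch of a conditional version; the window limits 1 and 20 come from the problem statement, and the function name is chosen here, not taken from estimate.py:

import math

def CondExpoPdf(x, param, low=1.0, high=20.0):
    """Exponential PDF conditioned on falling in the observation window."""
    if not low < x < high:
        return 0.0
    z = math.exp(-param * low) - math.exp(-param * high)   # Z(lambda)
    return param * math.exp(-param * x) / z

Replacing ExpoPdf with CondExpoPdf in Likelihood gives a posterior that accounts for the censoring.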
Exercise 8.4 Download http://thinkstats.com/estimate.py, which contains the code from the previous section, and make a copy named decay.py.
Modify decay.py to compute the posterior distribution of λ for the sample
X = {1.5, 2, 3, 4, 5, 12}. For the prior you can use a uniform distribution
between 0 and 1.5 (not including 0).
You can download a solution to this problem from http://thinkstats.com/decay.py.
Exercise 8.5 In the 2008 Minnesota Senate race the final vote count was
1,212,629 votes for Al Franken and 1,212,317 votes for Norm Coleman.
Franken was declared the winner, but as Charles Seife points out in Proofi-
ness, the margin of victory was much smaller than the margin of error, so
the result should have been considered a tie.
Assuming that there is a chance that any vote might be lost and a chance that
any vote might be double-counted, what is the probability that Coleman
actually received more votes?
Hint: you will have to fill in some details to model the error process.
8.9 The locomotive problem
The locomotive problem is a classic estimation problem also known as the
“German tank problem.” Here is the version that appears in Mosteller, Fifty
Challenging Problems in Probability:
“A railroad numbers its locomotives in order 1..N. One day you
see a locomotive with the number 60. Estimate how many loco-
motives the railroad has.”
Before you read the rest of this section, try to answer these questions:
1. For a given estimate, N̂, what is the likelihood of the evidence, P(E | N̂)? What is the maximum likelihood estimator?
2. If we see train i, it seems reasonable that we would guess some multiple of i, so let’s assume N̂ = ai. What value of a minimizes mean squared error?

3. Still assuming that N̂ = ai, can you find a value of a that makes N̂ an unbiased estimator?
4. For what value of N is 60 the average value?
5. What is the Bayesian posterior distribution assuming a prior distribu-
tion that is uniform from 1 to 200?
For best results, you should take some time to work on these questions be-
fore you continue.
For a given estimate, N̂, the likelihood of seeing train i is 1/N̂ if i ≤ N̂, and
0 otherwise. So the MLE is N̂ = i. In other words, if you see train 60 and
you want to maximize your chance of getting the answer exactly right, you
should guess that there are 60 trains.
But this estimator doesn’t do very well in terms of MSE. We can do better
by choosing N̂ = ai; all we have to do is find a good value for a.
Suppose that there are, in fact, N trains. Each time we play the estimation
game, we see train i and guess ai, so the squared error is (ai − N)².
If we play the game N times and see each train once, the mean squared error
is

MSE = (1/N) ∑_{i=1}^{N} (ai − N)²
To minimize MSE, we take the derivative with respect to a:

dMSE/da = (1/N) ∑_{i=1}^{N} 2i(ai − N) = 0
And solve for a.

a = 3N / (2N + 1)
At first glance, that doesn’t seem very useful, because N appears on the
right-hand side, which suggests that we need to know N to choose a, but if
we knew N, we wouldn’t need an estimator in the first place.
However, for large values of N, the optimal value for a converges to 3/2, so
we could choose N̂ = 3i/2.
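As a quick check (not from the book), here is a simulation that plays the game with an assumed fleet of N = 200 trains and compares the MLE (a = 1) with a = 3/2; the factor 3/2 gives a noticeably lower MSE.

import random

def locomotive_mse(N=200, m=10000, factors=(1.0, 1.5)):
    """Estimate the MSE of N-hat = a*i for each factor a, seeing one train per game."""
    for a in factors:
        errors = [a * random.randint(1, N) - N for _ in range(m)]
        print('a = %.1f  MSE = %.0f' % (a, sum(e * e for e in errors) / m))

locomotive_mse()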
To find an unbiased estimator, we can