
Chapter 8

Estimation

8.1 The estimation game

Let’s play a game. I’ll think of a distribution, and you have to guess what it

is. We’ll start out easy and work our way up.

I’m thinking of a distribution. I’ll give you two hints: it’s a normal distribution, and here’s a random sample drawn from it:

{−0.441, 1.774, −0.101, −1.138, 2.975, −2.138}

What do you think is the mean parameter, µ, of this distribution?

One choice is to use the sample mean to estimate µ. Up until now we have used the symbol µ for both the sample mean and the mean parameter, but now, to distinguish them, I will use x̄ for the sample mean. In this example, x̄ is 0.155, so it would be reasonable to guess µ = 0.155.

This process is called estimation, and the statistic we used (the sample

mean) is called an estimator.

Using the sample mean to estimate µ is so obvious that it is hard to imagine

a reasonable alternative. But suppose we change the game by introducing

outliers.

I’m thinking of a distribution. It’s a normal distribution, and here’s a sample that was collected by an unreliable surveyor who occasionally puts the decimal point in the wrong place.

{−0.441, 1.774, −0.101, −1.138, 2.975, −213.8}


Now what’s your estimate of µ? If you use the sample mean, your guess is −35.12. Is that the best choice? What are the alternatives?

One option is to identify and discard outliers, then compute the sample

mean of the rest. Another option is to use the median as an estimator.

Which estimator is the best depends on the circumstances (for example, whether there are outliers) and on what the goal is. Are you trying to minimize errors, or maximize your chance of getting the right answer?

If there are no outliers, the sample mean minimizes the mean squared error

(MSE). If we play the game many times, and each time compute the error

x̄ − µ, the sample mean minimizes

MSE = (1/m) ∑ (x̄ − µ)²

where m is the number of times you play the estimation game (not to be confused with n, which is the size of the sample used to compute x̄).

Minimizing MSE is a nice property, but it’s not always the best strategy.

For example, suppose we are estimating the distribution of wind speeds

at a building site. If we guess too high, we might overbuild the structure,

increasing its cost. But if we guess too low, the building might collapse.

Because cost as a function of error is asymmetric, minimizing MSE is not

the best strategy.

As another example, suppose I roll three six-sided dice and ask you to predict the total. If you get it exactly right, you get a prize; otherwise you get nothing. In this case the value that minimizes MSE is 10.5, but that would be a terrible guess. For this game, you want an estimator that has the highest chance of being right, which is a maximum likelihood estimator (MLE). If you pick 10 or 11, your chance of winning is 1 in 8, and that’s the best you can do.
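You can check the 1-in-8 figure by enumerating all 6³ = 216 equally likely outcomes. Here is a quick sketch:

from itertools import product

# Count how many of the 216 equally likely outcomes yield each total.
counts = {}
for dice in product(range(1, 7), repeat=3):
    total = sum(dice)
    counts[total] = counts.get(total, 0) + 1

# 27 outcomes total 10 and another 27 total 11; 27/216 = 1/8.
print(counts[10], counts[11], counts[10] / 216.0)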

Exercise 8.1 Write a function that draws 6 values from a normal distribution with µ = 0 and σ = 1. Use the sample mean to estimate µ and compute the error x̄ − µ. Run the function 1000 times and compute MSE.

Now modify the program to use the median as an estimator. Compute MSE again and compare to the MSE for x̄.
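Here is a minimal sketch of one way to run this experiment; it is not the book’s official solution, just an illustration using the standard random module:

import random

def Estimate(n=6, m=1000):
    # Play the estimation game m times and report MSE for both estimators.
    mean_errors, median_errors = [], []
    for _ in range(m):
        xs = sorted(random.gauss(0.0, 1.0) for _ in range(n))
        xbar = sum(xs) / n
        median = (xs[n//2 - 1] + xs[n//2]) / 2.0   # median of an even-sized sample
        mean_errors.append(xbar - 0.0)             # the error is x̄ − µ, with µ = 0
        median_errors.append(median - 0.0)

    MSE = lambda errors: sum(e**2 for e in errors) / len(errors)
    print('MSE of sample mean:', MSE(mean_errors))
    print('MSE of sample median:', MSE(median_errors))

Estimate()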

8.2 Guess the variance

I’m thinking of a distribution. It’s a normal distribution, and here’s a (familiar) sample:

{−0.441, 1.774, −0.101, −1.138, 2.975, −2.138}

What do you think is the variance, σ², of my distribution? Again, the obvious choice is to use the sample variance as an estimator. I will use S² to denote the sample variance, to distinguish it from the unknown parameter σ².

S² = (1/n) ∑ (xᵢ − x̄)²

For large samples, S² is an adequate estimator, but for small samples it tends to be too low. Because of this unfortunate property, it is called a biased estimator.

An estimator is unbiased if the expected total (or mean) error, after many iterations of the estimation game, is 0. Fortunately, there is another simple statistic that is an unbiased estimator of σ²:

S²ₙ₋₁ = (1/(n − 1)) ∑ (xᵢ − x̄)²

The biggest problem with this estimator is that its name and symbol are used inconsistently. The name “sample variance” can refer to either S² or S²ₙ₋₁, and the symbol S² is used for either or both.

For an explanation of why S² is biased, and a proof that S²ₙ₋₁ is unbiased, see http://wikipedia.org/wiki/Bias_of_an_estimator.

Exercise 8.2 Write a function that draws 6 values from a normal distribution with µ = 0 and σ = 1. Use the sample variance to estimate σ² and compute the error S² − σ². Run the function 1000 times and compute the mean error (not squared).

Now modify the program to use the unbiased estimator S²ₙ₋₁. Compute the mean error again and see if it converges to zero as you increase the number of games.
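A sketch of this experiment, along the same lines as the previous one (again, an illustration rather than the book’s solution):

import random

def EstimateVariance(n=6, m=1000):
    # The true variance is 1, so each error is the estimate minus 1.
    biased_errors, unbiased_errors = [], []
    for _ in range(m):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar)**2 for x in xs)
        biased_errors.append(ss / n - 1.0)          # error of S²
        unbiased_errors.append(ss / (n - 1) - 1.0)  # error of S²ₙ₋₁

    print('mean error of S²:', sum(biased_errors) / m)       # tends to be negative
    print('mean error of S²ₙ₋₁:', sum(unbiased_errors) / m)  # near 0

EstimateVariance()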

8.3 Understanding errors

Before we go on, let’s clear up a common source of confusion. Properties

like MSE and bias are long-term expectations based on many iterations of

the estimation game.


While you are playing the game, you don’t know the errors. That is, if I

give you a sample and ask you to estimate a parameter, you can compute

the value of the estimator, but you can’t compute the error. If you could,

you wouldn’t need the estimator!

The reason we talk about estimation error is to describe the behavior of

different estimators in the long run. In this chapter we run experiments to

examine those behaviors; these experiments are artificial in the sense that

we know the actual values of the parameters, so we can compute errors.

But when you work with real data, you don’t, so you can’t.

Now let’s get back to the game.

8.4 Exponential distributions

I’m thinking of a distribution. It’s an exponential distribution, and here’s a

sample:

{5.384, 4.493, 19.198, 2.790, 6.122, 12.844}

What do you think is the parameter, λ, of this distribution?

In general, the mean of an exponential distribution is 1/ λ, so working back-

wards, we might choose

ˆ λ = 1 / ¯x

It is common to use hat notation for estimators, so λ̂ is an estimator of λ. And not just any estimator; it is also the maximum likelihood estimator¹. So if you want to maximize your chance of guessing λ exactly, λ̂ is the way to go.

But we know that x̄ is not robust in the presence of outliers, so we expect λ̂ to have the same problem.

Maybe we can find an alternative based on the sample median. Remember that the median of an exponential distribution is log(2)/λ, so working backwards again, we can define an estimator

λ̂_{1/2} = log(2) / µ_{1/2}

where µ_{1/2} is the sample median.

Exercise 8.3 Run an experiment to see which of λ̂ and λ̂_{1/2} yields lower MSE. Test whether either of them is biased.

¹See http://wikipedia.org/wiki/Exponential_distribution#Maximum_likelihood.
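For Exercise 8.3, the comparison might look like the following sketch; the choices λ = 2 and n = 10 are arbitrary:

import math
import random

def CompareEstimators(lam=2.0, n=10, m=10000):
    mean_based, median_based = [], []
    for _ in range(m):
        xs = sorted(random.expovariate(lam) for _ in range(n))
        lam_hat = n / sum(xs)                     # λ̂ = 1 / x̄
        median = (xs[n//2 - 1] + xs[n//2]) / 2.0  # n is even here
        lam_med = math.log(2) / median            # λ̂_{1/2}
        mean_based.append(lam_hat - lam)
        median_based.append(lam_med - lam)

    MSE = lambda errors: sum(e**2 for e in errors) / m
    Mean = lambda errors: sum(errors) / m
    print('MSE:', MSE(mean_based), MSE(median_based))
    print('mean error:', Mean(mean_based), Mean(median_based))

CompareEstimators()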


8.5 Confidence intervals

So far we have looked at estimators that generate single values, known as point estimates. For many problems, we might prefer an interval that specifies an upper and lower bound on the unknown parameter.

Or, more generally, we might want the whole distribution; that is, the range of values the parameter could have, and for each value in the range, a notion of how likely it is.

Let’s start with confidence intervals.

I’m thinking of a distribution. It’s an exponential distribution, and here’s a

sample:

{5.384, 4.493, 19.198, 2.790, 6.122, 12.844}

I want you to give me a range of values that you think is likely to contain the

unknown parameter λ. More specifically, I want a 90% confidence interval,

which means that if we play this game over and over, your interval will

contain λ 90% of the time.

It turns out that this version of the game is hard, so I’m going to tell you the

answer, and all you have to do is test it.

Confidence intervals are usually described in terms of the miss rate, α, so a

90% confidence interval has miss rate α = 0.1. The confidence interval for

the λ parameter of an exponential distribution is

( λ̂ χ²(2n, 1 − α/2) / (2n),  λ̂ χ²(2n, α/2) / (2n) )

where n is the sample size, λ̂ is the mean-based estimator from the previous section, and χ²(k, x) is the critical value of a chi-squared distribution with k degrees of freedom and right-tail probability x (see http://wikipedia.org/wiki/Chi-square_distribution).
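If you have SciPy installed, you can evaluate this interval directly. chi2.ppf is the inverse CDF, so chi2.ppf(α/2, 2n) is the critical value with right-tail probability 1 − α/2, and vice versa. A sketch using the sample above:

from scipy.stats import chi2

sample = [5.384, 4.493, 19.198, 2.790, 6.122, 12.844]
n = len(sample)
lam_hat = n / sum(sample)   # the mean-based estimator λ̂ = 1/x̄
alpha = 0.1                 # miss rate for a 90% interval

low = lam_hat * chi2.ppf(alpha / 2, 2 * n) / (2 * n)
high = lam_hat * chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n)
print(lam_hat)       # about 0.118
print((low, high))   # roughly (0.051, 0.207)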

In general, confidence intervals are hard to compute analytically, but relatively easy to estimate using simulation. But first we need to talk about Bayesian estimation.

8.6 Bayesian estimation

If you collect a sample and compute a 90% confidence interval, it is tempting

to say that the true value of the parameter has a 90% chance of falling in the


interval. But from a frequentist point of view, that is not correct because the

parameter is an unknown but fixed value. It is either in the interval you

computed or not, so the frequentist definition of probability doesn’t apply.

So let’s try a different version of the game.

I’m thinking of a distribution. It’s an exponential distribution, and I chose

λ from a uniform distribution between 0.5 and 1.5. Here’s a sample, which

I’ll call X:

{2.675, 0.198, 1.152, 0.787, 2.717, 4.269}

Based on this sample, what value of λ do you think I chose?

In this version of the game, λ is a random quantity, so we can reasonably talk

about its distribution, and we can compute it easily using Bayes’s theorem.

Here are the steps:

1. Divide the range (0.5, 1.5) into a set of equal-sized bins. For each bin, we define Hᵢ, which is the hypothesis that the actual value of λ falls in the i-th bin. Since λ was drawn from a uniform distribution, the prior probability, P(Hᵢ), is the same for all i.

2. For each hypothesis, we compute the likelihood, P(X|Hᵢ), which is the chance of drawing the sample X given Hᵢ:

   P(X | Hᵢ) = ∏ⱼ expo(λᵢ, xⱼ)

   where expo(λ, x) is a function that computes the PDF of the exponential distribution with parameter λ, evaluated at x:

   PDFexpo(λ, x) = λ e^(−λx)

   The symbol ∏ represents the product of a sequence (see http://wikipedia.org/wiki/Multiplication#Capital_Pi_notation).

3. Then by Bayes’s theorem the posterior distribution is

   P(Hᵢ|X) = P(Hᵢ) P(X|Hᵢ) / f

   where f is the normalization factor

   f = ∑ᵢ P(Hᵢ) P(X | Hᵢ)


Given a posterior distribution, it is easy to compute a confidence interval.

For example, to compute a 90% CI, you can use the 5th and 95th percentiles

of the posterior.

Bayesian confidence intervals are sometimes called credible intervals; for a

discussion of the differences, see http://wikipedia.org/wiki/Credible_interval.

8.7 Implementing Bayesian estimation

To represent the prior distribution, we could use a Pmf, Cdf, or any other

representation of a distribution, but since we want to map from a hypothesis

to a probability, a Pmf is a natural choice.

Each value in the Pmf represents a hypothesis; for example, the value 0.5 represents the hypothesis that λ is 0.5. In the prior distribution, all hypotheses have the same probability. So we can construct the prior like this:

def MakeUniformSuite(low, high, steps):
    hypos = [low + (high - low) * i / (steps - 1.0) for i in range(steps)]
    pmf = Pmf.MakePmfFromList(hypos)
    return pmf

This function makes and returns a Pmf that represents a collection of related

hypotheses, called a suite. Each hypothesis has the same probability, so the

distribution is uniform.

The arguments low and high specify the range of values; steps is the number of hypotheses.

To perform the update, we take a suite of hypotheses and a body of evidence:

def Update(suite, evidence):
    for hypo in suite.Values():
        likelihood = Likelihood(evidence, hypo)
        suite.Mult(hypo, likelihood)
    suite.Normalize()

For each hypothesis in the suite, we multiply the prior probability by the

likelihood of the evidence. Then we normalize the suite.

In this function, suite has to be a Pmf, but evidence can be any type, as long as Likelihood knows how to interpret it.


Here’s the likelihood function:

def Likelihood(evidence, hypo):
    param = hypo
    likelihood = 1
    for x in evidence:
        likelihood *= ExpoPdf(x, param)
    return likelihood

In Likelihood we assume that evidence is a sample from an exponential distribution and compute the product in the previous section.

ExpoPdf evaluates the PDF of the exponential distribution at x:

def ExpoPdf(x, param):
    p = param * math.exp(-param * x)
    return p

Putting it all together, here’s the code that creates the prior and computes

the posterior:

evidence = [2.675, 0.198, 1.152, 0.787, 2.717, 4.269]
prior = MakeUniformSuite(0.5, 1.5, 100)
posterior = prior.Copy()
Update(posterior, evidence)

You can download the code in this section from http://thinkstats.com/estimate.py.
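To extract the 90% interval suggested in the previous section, we can walk the posterior in sorted order, accumulating probability. This helper is a sketch; it assumes only that the Pmf provides an Items method that returns value–probability pairs:

def Percentile(pmf, percentage):
    # Return the smallest value whose cumulative probability
    # is at least the given percentage.
    total = 0.0
    for value, prob in sorted(pmf.Items()):
        total += prob
        if total >= percentage / 100.0:
            return value

low, high = Percentile(posterior, 5), Percentile(posterior, 95)
print('90% credible interval:', (low, high))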

When I think of Bayesian estimation, I imagine a room full of people, where

each person has a different guess about whatever you are trying to estimate.

So in this example they each have a guess about the correct value of λ.

Initially, each person has a degree of confidence about their own hypothesis.

After seeing the evidence, each person updates their confidence based on

P(E|H), the likelihood of the evidence, given their hypothesis.

Most often the likelihood function computes a probability, which is at most

1, so initially everyone’s confidence goes down (or stays the same). But then

we normalize, which increases everyone’s confidence.

So the net effect is that some people get more confident, and some less,

depending on the relative likelihood of their hypothesis.


8.8 Censored data

The following problem appears in Chapter 3 of David MacKay’s Information Theory, Inference, and Learning Algorithms, which you can download from http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/.

Unstable particles are emitted from a source and decay at a distance x, a real number that has an exponential probability distribution with [parameter] λ. Decay events can only be observed if they occur in a window extending from x = 1 cm to x = 20 cm. N decays are observed at locations {x1, ..., xN}. What is λ?

This is an example of an estimation problem with censored data; that is, we

know that some data are systematically excluded.

One of the strengths of Bayesian estimation is that it can deal with censored

data with relative ease. We can use the method from the previous section

with only one change: we have to replace PDFexpo with the conditional distribution:

PDFcond(λ, x) = λ e^(−λx) / Z(λ)

for 1 < x < 20, and 0 otherwise, with

Z(λ) = ∫₁²⁰ λ e^(−λx) dx = e^(−λ) − e^(−20λ)

You might remember Z( λ) from Exercise 6.5. I told you to keep it handy.
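In code, the change amounts to swapping ExpoPdf for a conditional version. Here is a sketch, with the observation window passed in as parameters:

import math

def CondExpoPdf(x, lam, low=1.0, high=20.0):
    # PDF of an exponential distribution, conditioned on low < x < high.
    if not low < x < high:
        return 0.0
    z = math.exp(-lam * low) - math.exp(-lam * high)   # Z(λ)
    return lam * math.exp(-lam * x) / z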

Exercise 8.4 Download http://thinkstats.com/estimate.py, which contains the code from the previous section, and make a copy named decay.py. Modify decay.py to compute the posterior distribution of λ for the sample X = {1.5, 2, 3, 4, 5, 12}. For the prior you can use a uniform distribution between 0 and 1.5 (not including 0).

You can download a solution to this problem from http://thinkstats.com/decay.py.

Exercise 8.5 In the 2008 Minnesota Senate race the final vote count was

1,212,629 votes for Al Franken and 1,212,317 votes for Norm Coleman.

Franken was declared the winner, but as Charles Seife points out in Proofiness, the margin of victory was much smaller than the margin of error, so the result should have been considered a tie.


Assuming that there is a chance that any vote might be lost and a chance that

any vote might be double-counted, what is the probability that Coleman

actually received more votes?

Hint: you will have to fill in some details to model the error process.
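There is no single right model. As one illustrative sketch, suppose each recorded vote was independently lost with some probability and double-counted with some other probability, then simulate; the error rates below are arbitrary assumptions, not data:

import math
import random

def TrueCount(recorded, p_lost=0.01, p_double=0.01):
    # A lost vote makes the true count higher than recorded; a
    # double-counted vote makes it lower. Use a normal approximation
    # to the binomial so we don't loop over 1.2 million votes.
    def binom(n, p):
        mu, sigma = n * p, math.sqrt(n * p * (1 - p))
        return int(round(random.gauss(mu, sigma)))
    return recorded + binom(recorded, p_lost) - binom(recorded, p_double)

def ProbColemanWon(m=10000):
    wins = sum(TrueCount(1212317) > TrueCount(1212629) for _ in range(m))
    return wins / m

print(ProbColemanWon())

The answer depends strongly on the assumed error rates, which is exactly Seife’s point.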

8.9 The locomotive problem

The locomotive problem is a classic estimation problem also known as the

“German tank problem.” Here is the version that appears in Mosteller, Fifty

Challenging Problems in Probability:

“A railroad numbers its locomotives in order 1..N. One day you

see a locomotive with the number 60. Estimate how many loco-

motives the railroad has.”

Before you read the rest of this section, try to answer these questions:

1. For a given estimate, N̂, what is the likelihood of the evidence, P(E|N̂)? What is the maximum likelihood estimator?

2. If we see train i, it seems reasonable that we would guess some multiple of i, so let’s assume N̂ = ai. What value of a minimizes mean squared error?

3. Still assuming that N̂ = ai, can you find a value of a that makes N̂ an unbiased estimator?

4. For what value of N is 60 the average value?

5. What is the Bayesian posterior distribution assuming a prior distribution that is uniform from 1 to 200?

For best results, you should take some time to work on these questions before you continue.

For a given estimate, N̂, the likelihood of seeing train i is 1/N̂ if i ≤ N̂, and 0 otherwise. So the MLE is N̂ = i. In other words, if you see train 60 and you want to maximize your chance of getting the answer exactly right, you should guess that there are 60 trains.

But this estimator doesn’t do very well in terms of MSE. We can do better by choosing N̂ = ai; all we have to do is find a good value for a.


Suppose that there are, in fact, N trains. Each time we play the estimation game, we see train i and guess ai, so the squared error is (ai − N)².

If we play the game N times and see each train once, the mean squared error is

MSE = (1/N) ∑ᵢ₌₁ᴺ (ai − N)²

To minimize MSE, we take the derivative with respect to a:

dMSE/da = (1/N) ∑ᵢ₌₁ᴺ 2i(ai − N) = 0

And solve for a:

a = 3N / (2N + 1)

At first glance, that doesn’t seem very useful, because N appears on the

right-hand side, which suggests that we need to know N to choose a, but if

we knew N, we wouldn’t need an estimator in the first place.

However, for large values of N, the optimal value for a converges to 3/2, so we could choose N̂ = 3i/2.
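A quick simulation confirms that scaling up the observation reduces MSE. This sketch fixes the true N and plays the game many times:

import random

def CompareMSE(N=1000, m=100000):
    # Each train is equally likely to be the one we see.
    se_mle, se_scaled = 0.0, 0.0
    for _ in range(m):
        i = random.randint(1, N)
        se_mle += (i - N) ** 2           # guess N̂ = i
        se_scaled += (1.5 * i - N) ** 2  # guess N̂ = 3i/2
    print('MSE of i:', se_mle / m)       # about N²/3
    print('MSE of 3i/2:', se_scaled / m) # about N²/4

CompareMSE()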

To find an unbiased estimator, we can