Collaborative Statistics (MT230-Spring 2012) by Barbara Illowsky, Ph.D., Susan Dean

Chapter 11 The Chi-Square Distribution

11.1 The Chi-Square Distribution

This module provides an introduction to the chi-square distribution as part of the Collaborative Statistics collection (col10522) by Barbara Illowsky and Susan Dean.

Student Learning Outcomes

By the end of this chapter, the student should be able to:

  • Interpret the chi-square probability distribution as the sample size changes.

  • Conduct and interpret chi-square goodness-of-fit hypothesis tests.

  • Conduct and interpret chi-square test of independence hypothesis tests.

  • Conduct and interpret chi-square homogeneity hypothesis tests.

  • Conduct and interpret chi-square single variance hypothesis tests.

Introduction

Have you ever wondered if lottery numbers were evenly distributed or if some numbers occurred with a greater frequency? How about if the types of movies people preferred were different across different age groups? What about if a coffee machine was dispensing approximately the same amount of coffee each time? You could answer these questions by conducting a hypothesis test.

You will now study a new distribution, one that is used to determine the answers to the above examples. This distribution is called the Chi-square distribution.

In this chapter, you will learn the three major applications of the Chi-square distribution:

  • The goodness-of-fit test, which determines if data fit a particular distribution, such as with the lottery example

  • The test of independence, which determines if events are independent, such as with the movie example

  • The test of a single variance, which tests variability, such as with the coffee example

Although most chi-square calculations are done on a calculator or computer, a table of values is also available (see the Table of Contents 15. Tables). TI-83+ and TI-84 calculator instructions are included in the text.

Optional Collaborative Classroom Activity

Look in the sports section of a newspaper or on the Internet for some sports data (baseball averages, basketball scores, golf tournament scores, football odds, swimming times, etc.). Plot a histogram and a boxplot using your data. See if you can determine a probability distribution that your data fits. Have a discussion with the class about your choice.

11.2 Notation

This module provides an overview of chi-square distribution notation as part of the Collaborative Statistics collection (col10522) by Barbara Illowsky and Susan Dean.

The notation for the chi-square distribution is:

$\chi^2 \sim \chi^2_{df}$

where df = degrees of freedom, which depend on how the chi-square distribution is being used. (If you want to practice calculating chi-square probabilities, use df = n – 1. The degrees of freedom for the three major uses are each calculated differently.)

For the $\chi^2$ distribution, the population mean is μ = df and the population standard deviation is $\sigma = \sqrt{2 \cdot df}$.

The random variable is shown as $\chi^2$ but may be any upper case letter.

The random variable for a chi-square distribution with k degrees of freedom is the sum of k independent, squared standard normal variables.

$\chi^2 = (Z_1)^2 + (Z_2)^2 + \dots + (Z_k)^2$
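The definition above can be checked with a small simulation. The following is a minimal sketch, not part of the original text: the helper name `chi_square_draw`, the seed, the choice of df = 5, and the number of draws are all assumptions made for illustration. It builds each chi-square value as a sum of squared standard normal values and compares the sample mean and standard deviation with μ = df and σ = sqrt(2 · df).

```python
import random
import statistics


def chi_square_draw(k, rng):
    """Return one draw: the sum of k independent squared standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(2012)   # fixed seed (arbitrary) so the run is repeatable
k = 5                       # degrees of freedom, chosen only for illustration
draws = [chi_square_draw(k, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.stdev(draws)

# Theory: mean = df, standard deviation = sqrt(2 * df)
print("sample mean:", round(sample_mean, 2), "theory:", k)
print("sample sd:  ", round(sample_sd, 2), "theory:", round((2 * k) ** 0.5, 2))
```

The simulated values should land close to, but not exactly on, the theoretical ones, because they come from a random sample.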

11.3 Facts About the Chi-Square Distribution

  1. The curve is nonsymmetrical and skewed to the right.

  2. There is a different chi-square curve for each df.

    Figure 11.1: Two nonsymmetrical chi-square curves with different df. (a) A curve that begins at (0,∞) and slopes downward to (∞,0). (b) A curve skewed to the right, with the peak closer to the left and more values in the tail on the right.

  3. The test statistic for any test is always greater than or equal to zero.

  4. When df > 90, the chi-square curve approximates the normal. For $X \sim \chi^2_{1000}$, the mean is μ = df = 1000 and the standard deviation is $\sigma = \sqrt{2 \cdot 1000} \approx 44.7$. Therefore, X ~ N(1000, 44.7), approximately.

  5. The mean, μ , is located just to the right of the peak.

    Figure 11.2: A nonsymmetrical chi-square curve skewed to the right, with the mean marked on the x-axis just to the right of the peak.
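The arithmetic behind fact 4 can be sketched in a few lines. This is only an illustration of μ = df and σ = sqrt(2 · df) for df = 1000; the variable names are assumptions, not notation from the text.

```python
import math

df = 1000                   # degrees of freedom used in fact 4
mean = df                   # mu = df
sd = math.sqrt(2 * df)      # sigma = sqrt(2 * df)

# Prints: mean = 1000, standard deviation = 44.7
print(f"mean = {mean}, standard deviation = {sd:.1f}")
```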

In the next sections, you will learn about four different applications of the Chi-Square Distribution. These hypothesis tests are almost always right-tailed tests. To understand why, look carefully at the actual definition of the test statistic, and think about the following while you study the next four sections. If the expected and observed values are "far" apart, then the test statistic will be "large" and we will reject in the right tail. The only way to obtain a test statistic very close to zero is for the observed and expected values to be very close to each other. A left-tailed test could be used to determine if the fit were "too good." A "too good" fit might occur if data had been manipulated or invented. Think about the implications of right-tailed versus left-tailed hypothesis tests as you learn the applications of the Chi-Square Distribution.

11.4 Goodness-of-Fit Test

This module describes how the chi-square distribution is used to conduct a goodness-of-fit test.

In this type of hypothesis test, you determine whether the data "fit" a particular distribution or not. For example, you may suspect your unknown data fit a binomial distribution. You use a chi-square test (meaning the distribution for the hypothesis test is chi-square) to determine if there is a fit or not. The null and the alternate hypotheses for this test may be written in sentences or may be stated as equations or inequalities.

The test statistic for a goodness-of-fit test is:

(11.1)
$\sum_{k} \frac{(O - E)^2}{E}$

where:

  • O = observed values (data)

  • E = expected values (from theory)

  • k = the number of different data cells or categories

The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are k terms of the form $\frac{(O - E)^2}{E}$, one for each category.

The degrees of freedom are df = (number of categories - 1).

The goodness-of-fit test is almost always right tailed. If the observed values and the corresponding expected values are not close to each other, then the test statistic can get very large and will be way out in the right tail of the chi-square curve.

The expected value for each cell needs to be at least 5 in order to use this test.
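The test statistic and degrees of freedom described above can be sketched as two small functions. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the three-category counts at the bottom are made up only to exercise the code; they do not come from the text.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells; the two lists must line up cell by cell."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of cells")
    if any(e < 5 for e in expected):
        raise ValueError("every expected value should be at least 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(expected):
    """df = number of categories - 1."""
    return len(expected) - 1


# Made-up counts for three categories, used only to exercise the functions.
observed = [18, 22, 20]
expected = [20, 20, 20]
statistic = chi_square_statistic(observed, expected)
df = degrees_of_freedom(expected)
print("test statistic:", round(statistic, 4), "df:", df)
```

Because every term is a square divided by a positive expected value, the result can never be negative, which matches the fact that the test statistic is always greater than or equal to zero.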

Example 11.1

Absenteeism of college students from math classes is a major concern to math instructors because missing class appears to increase the drop rate. Suppose that a study was done to determine if the actual student absenteeism follows faculty perception. The faculty expected that a group of 100 students would miss class according to the following chart.

Table 11.1.
Number of absences per term    Expected number of students
0 - 2                          50
3 - 5                          30
6 - 8                          12
9 - 11                         6
12+                            2

A random survey across all mathematics courses was then done to determine the actual number (observed) of absences in a course. The next chart displays the result of that survey.

Table 11.2.
Number of absences per term    Actual number of students
0 - 2                          35
3 - 5                          40
6 - 8                          20
9 - 11                         1
12+                            4

Determine the null and alternate hypotheses needed to conduct a goodness-of-fit test.

Ho: Student absenteeism fits faculty perception.

The alternate hypothesis is the opposite of the null hypothesis.

Ha: Student absenteeism does not fit faculty perception.

Can you use the information as it appears in the charts to conduct the goodness-of-fit test?

No. Notice that the expected number of students for the "12+" entry is less than 5 (it is 2). Combine that group with the "9 - 11" group to create new tables where the expected number of students for each entry is at least 5. The new tables are below.

Table 11.3.
Number of absences per term    Expected number of students
0 - 2                          50
3 - 5                          30
6 - 8                          12
9+                             8

Table 11.4.
Number of absences per term    Actual number of students
0 - 2                          35
3 - 5                          40
6 - 8                          20
9+                             5

What are the degrees of freedom (df)?

There are 4 "cells" or categories in each of the new tables.

df = number of cells - 1 = 4 - 1 = 3
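The regrouping step in this example can be sketched in code. The helper `merge_small_tail` below is hypothetical and deliberately simple: it assumes the only cell that can fall below 5 is the last one, which is the case for Table 11.1, and it folds that cell into the one before it. The counts are the ones from Tables 11.1 and 11.2.

```python
def merge_small_tail(labels, expected, observed, minimum=5):
    """Fold the last cell into the one before it while its expected count is below `minimum`.

    Simplification: only the final cell is checked, which is enough for Table 11.1.
    """
    labels, expected, observed = list(labels), list(expected), list(observed)
    while len(expected) > 1 and expected[-1] < minimum:
        low = labels[-2].split(" - ")[0].rstrip("+")   # "9 - 11" -> "9"
        labels[-2:] = [low + "+"]
        expected[-2:] = [expected[-2] + expected[-1]]
        observed[-2:] = [observed[-2] + observed[-1]]
    return labels, expected, observed


labels = ["0 - 2", "3 - 5", "6 - 8", "9 - 11", "12+"]
expected = [50, 30, 12, 6, 2]    # Table 11.1
observed = [35, 40, 20, 1, 4]    # Table 11.2

new_labels, new_expected, new_observed = merge_small_tail(labels, expected, observed)
df = len(new_expected) - 1       # number of cells - 1

print(new_labels)     # ['0 - 2', '3 - 5', '6 - 8', '9+']
print(new_expected)   # [50, 30, 12, 8]
print(new_observed)   # [35, 40, 20, 5]
print("df =", df)     # df = 3
```

Note that the same cells are merged in both lists, so the observed and expected values still line up category by category.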

Example 11.2

Employers particularly want to know which days of the week employees are absent in a five day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as follows:

Table 11.5. Day of the Week Employees Were Most Absent
                      Monday    Tuesday    Wednesday    Thursday    Friday
Number of Absences    15        12         9            9           15

For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five day work week? Test at a 5% significance level.

The null and alternate hypotheses are:

  • Ho: The absent days occur with equal frequencies, that is, they fit a uniform distribution.

  • Ha: The absent days occur with unequal frequencies, that is, they do not fit a uniform distribution.

If the absent days occur with equal frequencies, then, out of 60 absent days (the total in the sample: 15 + 12 + 9 + 9 + 15 = 60), there would be 12 absences on Monday, 12 on Tuesday, 12 on Wednesday, 12 on Thursday, and 12 on Friday. These numbers are the expected (E) values. The values in the table are the observed (O) values or data.
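The expected values under the uniform null hypothesis can be sketched as follows. This is only an illustration of the reasoning in the paragraph above; the variable names are assumptions, and the observed counts are taken from Table 11.5.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
observed = [15, 12, 9, 9, 15]                  # Table 11.5

total = sum(observed)                          # 60 responses in the sample
expected = [total / len(days)] * len(days)     # equal frequencies: 12 per day

for day, o, e in zip(days, observed, expected):
    print(f"{day:<10} O = {o:>2}  E = {e:.0f}")
```

Each expected value is at least 5, so the requirement for using the goodness-of-fit test is met without combining any days.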

This time, calculate the $\chi^2$ test statistic by hand. Make a chart with the following headings and fill in the columns:

  • Expected (E) values (12, 12, 12, 12, 12)

  • Observed (O) values (15, 12, 9, 9, 15)