Tuesday, March 17, 2009

Class notes for 3/16

One of the major uses of statistics is finding confidence intervals for parameters. We take a value measured on a sample (a statistic), use it to estimate a range for the corresponding value in the population (the parameter), and assign a confidence level to that range. For the average of a numerical variable, this means finding a range for mux using x-bar, sx and n, and looking up a Confidence Level Multiplier (CLM) on the Student's t-score table, the fifth page of the yellow handout, labeled Table A-3.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

There are some things we need to test before we create the interval. We have two questions to ask about the data, and if we get a yes answer to either question, we can proceed. If we get no answers to both, then we should not use this data set to create a confidence interval.

Question #1: Is n > 30?
Question #2: Is the underlying data set normally distributed?

Question #1 is easy to answer, and if the answer is yes, we don't have to bother with Question #2. It might be that you are simply told by the person who collected the data that the answer is yes. But if that is not the case, here are two tests the data set should pass to get a yes answer to Question #2. Both of these tests are using arbitrary cut-off points, the second test being a preview of tests we are going to do later in the class called goodness of fit.

Test #1: The Mid-range Outlier test. Recall that the mid-range is (high + low)/2. Take the z-score for the mid-range, (mid-range - x-bar)/sx. If this value is more than 0.5 or less than -0.5, the data set fails and we answer no to Question #2. A failure means there is an outlier either high or low: either the low value is more than one full standard deviation farther from the average than the high value, or vice versa. In a small data set, this can skew the statistics and make us less confident that the sample values are near the parameters of the underlying population.
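Test #1 can be sketched in a few lines of Python. This is only an illustration, not part of the class handout: the function names are made up for this example, and it assumes sx is the sample standard deviation (the n - 1 version), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev


def midrange_z(data):
    """z-score of the mid-range: (mid-range - x-bar)/sx."""
    mid_range = (max(data) + min(data)) / 2
    # stdev() is the sample standard deviation (divides by n - 1),
    # assumed here to match sx.
    return (mid_range - mean(data)) / stdev(data)


def passes_midrange_test(data, cutoff=0.5):
    """Yes to Test #1 when the z-score is between -cutoff and +cutoff."""
    return abs(midrange_z(data)) <= cutoff


# Hypothetical data sets, made up for illustration only.
balanced = [1, 2, 3, 4, 5]     # high and low are equally far from x-bar
lopsided = [1, 2, 3, 4, 20]    # the high value is an outlier

print(passes_midrange_test(balanced))
print(passes_midrange_test(lopsided))
```

The cutoff is left as a parameter because, as noted above, 0.5 is an arbitrary choice.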

Test #2: The Goodness of Fit test. We can do this test as long as n is 10 or more. If the data set has fewer than 10 entries and it failed Test #1, you shouldn't try to do the confidence interval. In this test, count the number of entries that are more than x-bar and call that value above. If the underlying data set is normally distributed, we would expect about half the entries to be above average and half below average, so about n/2 in each group. The Goodness of Fit test for this data set is done by plugging these values into this formula and making this check.

Is (above - n/2)^2/(n/2) < 1.353?

Example: carries per game for 30 randomly selected NFL running backs from the 2008 season.

1.2 4.7 13.1 3.2 8.8 18.2 3.5 12.5 4.8 8.2 0.7
17.1 0.2 11.4 3.2 8.6 6.3 17.4 17.1 17.8
1.2 0.2 0.1 8.9 4.0 8.6 0.7 2.8 0.5 0.4

x-bar = 6.85
sx = 6.17
n = 30
mid-range = (18.2 + 0.1)/2 = 9.15
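If you would rather check these summary values on a computer than on a calculator, here is a minimal sketch using Python's standard `statistics` module. The variable names are chosen for this example, and it assumes sx means the sample standard deviation (dividing by n - 1).

```python
from statistics import mean, stdev

# The 30 carries-per-game values from the example data set.
carries = [
    1.2, 4.7, 13.1, 3.2, 8.8, 18.2, 3.5, 12.5, 4.8, 8.2, 0.7,
    17.1, 0.2, 11.4, 3.2, 8.6, 6.3, 17.4, 17.1, 17.8,
    1.2, 0.2, 0.1, 8.9, 4.0, 8.6, 0.7, 2.8, 0.5, 0.4,
]

n = len(carries)
x_bar = mean(carries)
s_x = stdev(carries)  # sample standard deviation, assumed to match sx
mid_range = (max(carries) + min(carries)) / 2

print(f"n = {n}")
print(f"x-bar = {x_bar:.2f}")
print(f"sx = {s_x:.2f}")
print(f"mid-range = {mid_range:.2f}")
```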

Test #1: z(mid-range) = (9.15-6.85)/6.17 = 0.37

Because this is between -0.5 and 0.5, we can answer yes to Test #1 and move on to Test #2.

Test #2: number of values above average, which we call above = 13. n/2 = 15, so the test is

(13-15)^2/15 = 4/15, which rounds to 0.267 and is below 1.353, so we can answer yes to this and proceed.
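The same check can be written as a short Python sketch. The function names are invented for this illustration; the 1.353 cutoff and the minimum of 10 entries are the ones used in these notes.

```python
def count_above(data, x_bar):
    """Number of entries strictly greater than the average."""
    return sum(1 for x in data if x > x_bar)


def goodness_of_fit_value(above, n):
    """(above - n/2)^2 / (n/2): how far the split is from half and half."""
    expected = n / 2
    return (above - expected) ** 2 / expected


def passes_goodness_of_fit(above, n, cutoff=1.353):
    """Yes to Test #2 when the value is below the cutoff."""
    if n < 10:
        raise ValueError("Test #2 needs at least 10 entries")
    return goodness_of_fit_value(above, n) < cutoff


# The example from these notes: 13 of the 30 values are above average.
value = goodness_of_fit_value(13, 30)
print(f"{value:.3f}", passes_goodness_of_fit(13, 30))
```

Because the difference is squared, 13 above and 17 above give the same value for n = 30.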

n = 30 so degrees of freedom = 29. Here are the three most commonly used CLM values.

CLM90% = 1.699
CLM95% = 2.045
CLM99% = 2.756

Using the formula, here are the values for each of the confidence levels.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

90% confidence:
6.85 - 1.699*6.17/sqrt(30) < mux < 6.85 + 1.699*6.17/sqrt(30)
4.94 < mux < 8.76

95% confidence:
6.85 - 2.045*6.17/sqrt(30) < mux < 6.85 + 2.045*6.17/sqrt(30)
4.55 < mux < 9.15

99% confidence:
6.85 - 2.756*6.17/sqrt(30) < mux < 6.85 + 2.756*6.17/sqrt(30)
3.75 < mux < 9.95

The sentence explaining the confidence interval would go as follows.

Given a sample of the carries per game of 30 running backs from the 2008 NFL season, we are 99% confident the true average number of carries per game for all NFL running backs that season is between 3.75 and 9.95.
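To double-check the endpoint arithmetic, here is a minimal Python sketch of the interval formula x-bar ± CLMxx%*sx/sqrt(n). The function name `confidence_interval` is made up for this example. The CLM values are typed in from the table for 29 degrees of freedom, not computed.

```python
from math import sqrt


def confidence_interval(x_bar, s_x, n, clm):
    """Return (low, high) for x-bar -/+ clm * sx / sqrt(n)."""
    margin = clm * s_x / sqrt(n)
    return x_bar - margin, x_bar + margin


# CLM values for 29 degrees of freedom, copied from the t-score table.
clm_df29 = {"90%": 1.699, "95%": 2.045, "99%": 2.756}

for level, clm in clm_df29.items():
    low, high = confidence_interval(6.85, 6.17, 30, clm)
    print(f"{level} confidence: {low:.2f} < mux < {high:.2f}")
```

Note that a higher confidence level has a larger multiplier and therefore a wider interval.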

Practice problem set

Here is the data for average yards per carry for the same 30 randomly selected NFL running backs for the 2008 season.


4.6 2.6 3.9 4.4 4.2 3.8 5.5 3.6 3.1 3.7
6.8 5.5 2.0 5.6 4.4
4.8 5.7 3.6 4.3 3.5
5.0 0.0 -2.0 2.8 3.3 4.4 2.0 6.0 1.3 4.0


Problem #1: Find x-bar and sx, rounded to two places after the decimal point.
Problem #2: Do the z(mid-range) test. Does the set pass or fail?
Problem #3: Do the second test for normal distribution. Does the set pass or fail?
Problem #4: Find the endpoints for the 95% confidence interval.
Problem #5: Write the sentence explaining the interval.

Answers in the comments.

1 comment:

Prof. Hubbard said...

Problem #1: Find x-bar and s_x, rounded to two places after the decimal point.

x-bar = 3.75
s_x = 1.81

Problem #2: Do the z(mid-range) test. Does the set pass or fail?

mid-range = (6.8 + -2.0)/2 = 2.4
z(mid-range) = (2.4 - 3.75)/1.81 = -0.74
It fails this test, so we shouldn't use the set for a confidence interval, but we will go ahead anyway, just for practice.

Problem #3: Do the second test for normal distribution. Does the set pass or fail?

above = 17, so (17-15)^2/15 = 4/15, which rounds to 0.267 and is below 1.353, so the data set passes this test.

Problem #4: Find the endpoints for the 95% confidence interval.

Low value:
3.75 - 2.045*1.81/sqrt(30) = 3.07

High value:
3.75 + 2.045*1.81/sqrt(30) = 4.43

Problem #5: Write the sentence explaining the interval.

Technically, because we failed the mid-range test we shouldn't use this set of data, but if we did, the sentence would be as follows.

Given a sample of the yards per carry of 30 running backs from the 2008 NFL season, we are 95% confident the true average number of yards per carry for all NFL running backs that season is between 3.07 and 4.43.