Statistics on a budget: Class notes for 3/18

Thursday, March 19, 2009

Tests before confidence interval for average of population

We have discussed finding a confidence interval for mux by using x-bar, the average of a sample taken from that population. Because we don't know the average of the population, we can't know the standard deviation of the population, since you need the average to compute the standard deviation. Because of this uncertainty, we use Student's t-scores instead of z-scores for the Confidence Level Multipliers for xx%, or CLMxx% for short.

There are times when we aren't allowed to use this method for approximation. We have two questions, and if we answer no to both of them, this method should not be used and only non-parametric methods are useful when dealing with this data set. (We will learn a few non-parametric methods later in the class, but most of these methods are taught in more advanced stats classes.)

Question #1: Is n > 30? If a sample is big enough, Gosset decided that his t-scores are going to give a reasonable approximation. This is always the first question because it is so easy to answer, and if we get a yes here, we don't even need to ask Question #2.

Question #2: Is the underlying data set normally distributed? It might be that we will be told the answer to this question without having to do any work to test this ourselves, and if we are told "yes" we can proceed. If we aren't told, here are two tests. There are other tests that can be done, but these two are easy and useful, and if the data set passes both these tests, we will assume we can move forward.

The z(mid-range) test. Recall that the mid-range is the average of the highest and lowest values in the data set. If one of these scores is a lot farther away from average than the other, we can consider that the data set may be too skewed to be reliable. What this test does is look to see if one extreme value (the highest or lowest) is more than one standard deviation farther away from average than the other extreme value. Understand that one standard deviation is an arbitrary value. The formula in this inequality is z(mid-range), which is where the test's name comes from.
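The inequality itself did not survive in these notes, so here is a sketch reconstructed from the description, not necessarily the exact form used in class. If one extreme value is more than one standard deviation farther from x-bar than the other, then the mid-range sits more than half a standard deviation away from x-bar, so the check below assumes the test is -0.5 < z(mid-range) < 0.5. The function names and both sample lists are made up for illustration.

```python
import statistics

def z_mid_range(data):
    """z-score of the mid-range, measured against x-bar and sx of the sample."""
    mid_range = (max(data) + min(data)) / 2
    x_bar = statistics.mean(data)
    s_x = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return (mid_range - x_bar) / s_x

def passes_z_mid_range_test(data, cutoff=0.5):
    # Assumed cutoff: the extremes differ by at most one standard deviation
    # in their distance from x-bar exactly when |z(mid-range)| <= 0.5.
    return abs(z_mid_range(data)) < cutoff

symmetric_sample = [4, 5, 5, 6, 6, 6, 7, 7, 8]   # hypothetical data
skewed_sample = [1, 1, 2, 2, 2, 3, 3, 20]        # hypothetical data

print(passes_z_mid_range_test(symmetric_sample))
print(passes_z_mid_range_test(skewed_sample))
```

The first list is balanced around its average and should pass; the second has one high value far from the rest and should fail.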

The mean to median test. If the data is normally distributed, then the mean and median of the underlying population should be equal. If the mean and median of the sample are significantly different, then this test should produce a number greater than 1.353. (This is a simple version of a goodness-of-fit test, which we will be exploring in greater depth later in the semester.)

Once we have the average value x-bar, count how many values are greater than x-bar, and call this value above. We should expect that random samples from a normally distributed set should have about a 50%-50% split of values above and below average, so about n/2 in each group.
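The formula for this test is also missing from these notes, so the sketch below rests on an assumption: the test number is (above - n/2)^2 / (n/2), compared against 1.353. That guess fits the cut-off, because 1.353 is half of 2.706, the 0.10 column value for 1 degree of freedom in the chi squared table, and the full two-group statistic (above and below) is twice the single term when no value equals x-bar. The function names and sample data are hypothetical.

```python
import statistics

def mean_to_median_statistic(data):
    """Assumed test number: (above - n/2)^2 / (n/2)."""
    n = len(data)
    x_bar = statistics.mean(data)
    # Count values strictly greater than x-bar; ties with x-bar are not counted.
    above = sum(1 for x in data if x > x_bar)
    expected = n / 2  # a 50%-50% split would put n/2 values above average
    return (above - expected) ** 2 / expected

def passes_mean_to_median_test(data, cutoff=1.353):
    # A result above the cut-off suggests mean and median differ too much.
    return mean_to_median_statistic(data) < cutoff

symmetric_sample = [4, 5, 5, 6, 6, 6, 7, 7, 8]   # hypothetical data
skewed_sample = [1, 1, 2, 2, 2, 3, 3, 20]        # hypothetical data

print(mean_to_median_statistic(symmetric_sample))
print(mean_to_median_statistic(skewed_sample))
```

In the skewed list only one of the eight values is above average, far from the expected four, so the test number lands above the cut-off.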

Again, the cut-off point 1.353 is arbitrary, and it is derived from the chi squared table. If the data set passes both these tests, we can answer yes to the question about the normally distributed underlying population and proceed to find the interval for mux, which is the following inequality.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)
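As a rough illustration of the inequality above, here is a small sketch that computes both endpoints. The function name and the data are made up, and the multiplier is passed in by hand the way you would read it from the t-score table; 2.306 is the 95% t-score for 8 degrees of freedom.

```python
import math
import statistics

def mean_confidence_interval(data, clm):
    """Endpoints of x-bar - CLM*sx/sqrt(n) < mux < x-bar + CLM*sx/sqrt(n)."""
    n = len(data)
    x_bar = statistics.mean(data)
    s_x = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    margin = clm * s_x / math.sqrt(n)
    return x_bar - margin, x_bar + margin

sample = [4, 5, 5, 6, 6, 6, 7, 7, 8]  # hypothetical data, n = 9 so 8 degrees of freedom
clm_95 = 2.306                        # t-score read from the table for 95% confidence

low, high = mean_confidence_interval(sample, clm_95)
print(f"{low:.2f} < mux < {high:.2f}")
```

Since n is not greater than 30 here, this only makes sense if the sample has already passed the normality checks.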

Confidence interval for standard deviation of population

The confidence interval formula for sigmax is completely different than the formula for mux, and uses values from a new table, Table A-4, known as the chi squared table. Chi is pronounced like the chi in chiropractor, "kai", not "chee" or "chai".

Note that for this class we won't test the underlying data for normal distribution before using this formula, although the formula does assume the population is roughly normal.

Let's give an example. If we have a set where n = 13 and sx = 3.21, we use n-1 as our degrees of freedom. Here is the line from Table A-4 for the row that corresponds to 12.

df     0.995   0.99    0.975   0.95    0.90    0.10    0.05    0.025   0.01    0.005
12     3.074   3.571   4.404   5.226   6.304   18.549  21.026  23.337  26.217  28.299

The values of chi^2R and chi^2L are taken from the table as follows.

90% confidence: The right value is from the 0.05 column, the left value is from the 0.95 column. (Note that 0.95 - 0.05 = 0.90 or 90%)

95% confidence: The right value is from the 0.025 column, the left value is from the 0.975 column. (Note that 0.975 - 0.025 = 0.95 or 95%)

99% confidence: The right value is from the 0.005 column, the left value is from the 0.995 column. (Note that 0.995 - 0.005 = 0.99 or 99%)

In this instance, here are the confidence intervals for each of the standard percentages of confidence.

90% confidence: sqrt(3.21^2*12/21.026) < sigmax < sqrt(3.21^2*12/5.226)

95% confidence: sqrt(3.21^2*12/23.337) < sigmax < sqrt(3.21^2*12/4.404)

99% confidence: sqrt(3.21^2*12/28.299) < sigmax < sqrt(3.21^2*12/3.074)
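If you would rather check your calculator work with a short program, here is one possible sketch. The dictionary holds the row for 12 degrees of freedom from Table A-4, and the column pairs follow the rule given above for picking chi^2L and chi^2R. The names CHI2_DF12, COLUMNS and sigma_confidence_interval are invented for this example.

```python
import math

# Row for 12 degrees of freedom from Table A-4, keyed by column heading.
CHI2_DF12 = {
    "0.995": 3.074, "0.99": 3.571, "0.975": 4.404, "0.95": 5.226, "0.90": 6.304,
    "0.10": 18.549, "0.05": 21.026, "0.025": 23.337, "0.01": 26.217, "0.005": 28.299,
}

# (left column, right column) for each standard percentage of confidence.
COLUMNS = {90: ("0.95", "0.05"), 95: ("0.975", "0.025"), 99: ("0.995", "0.005")}

def sigma_confidence_interval(s_x, n, chi2_row, confidence_pct):
    """Endpoints of sqrt(sx^2*(n-1)/chi^2R) < sigmax < sqrt(sx^2*(n-1)/chi^2L)."""
    left_col, right_col = COLUMNS[confidence_pct]
    chi2_left = chi2_row[left_col]
    chi2_right = chi2_row[right_col]
    scaled = s_x ** 2 * (n - 1)
    # Dividing by the larger (right) value gives the lower endpoint.
    return math.sqrt(scaled / chi2_right), math.sqrt(scaled / chi2_left)

for pct in (90, 95, 99):
    low, high = sigma_confidence_interval(3.21, 13, CHI2_DF12, pct)
    print(f"{pct}% confidence: {low:.2f} < sigmax < {high:.2f}")
```

Note that the chi_row argument must be the row for n-1 degrees of freedom, so a different sample size needs a different row from the table.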

For practice, use your calculator to give the answers above, rounded to two places after the decimal. Answers in the comments.

More practice sets will be posted later today.

1 comment:

Prof. Hubbard said... (March 19, 2009 at 10:24 AM)

90% confidence: 2.43 < sigmax < 4.86

95% confidence: 2.30 < sigmax < 5.30

99% confidence: 2.09 < sigmax < 6.34
