Thursday, March 12, 2009

Class notes for 3/11

The positive and negative z-score tables connect z-scores to proportions, which are numbers between 0 and 1 written to four decimal places. For example, z = 1.23 corresponds to .8907, which means that 89.07% of the data in a normally distributed set should have a z-score of 1.23 or less, while (100-89.07)% = 10.93% of the data will have a z-score of 1.23 or more. Moreover, the normal curve is symmetric around z = 0. This means z = -1.23 is the cut-off point between the low 10.93% of the data and the high 89.07% of the data, just the opposite of the percentages that correspond to z = 1.23.
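The table lookup described here can also be reproduced numerically. What follows is a minimal sketch, assuming Python 3.8 or later (for statistics.NormalDist in the standard library). The function name proportion_below is made up for illustration and is not part of the class handout.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def proportion_below(z):
    """Proportion of a normally distributed data set with a z-score of z or less."""
    return standard_normal.cdf(z)

below = proportion_below(1.23)   # share of data at or below z = 1.23
above = 1 - below                # share of data at or above z = 1.23
print(round(below, 4), round(above, 4))

# By symmetry, z = -1.23 swaps the two percentages.
print(round(proportion_below(-1.23), 4))
```

Rounded to four places, the results should agree with the printed table: .8907 below z = 1.23 and .1093 above it.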



Critical values, or Confidence Level Multipliers (CLMxx%)

Sometimes, instead of being interested in the highest n% or the lowest, we will need to deal with the middle n% for what is known as a confidence interval. In the bottom right hand corner of the Positive z Scores table (first page of yellow handout), there is a small table labeled Common Critical Values. On tests and homework, I will call these Confidence Level Multipliers, or CLMxx%, where xx% is the percentage of confidence associated with the z-scores. The table gives the z-scores I will be calling CLM90%, CLM95% and CLM99%.

CLM90%: The end points are -1.645 and +1.645
CLM95%: The end points are -1.96 and +1.96
CLM99%: The end points are -2.575 and +2.575

For example, what this means for the case of CLM90% is that the low 5% of the data is below z=-1.645, the middle 90% is between z=-1.645 and z=+1.645, and the high 5% is above z=+1.645.
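The three multipliers can be recovered from the normal curve itself. This is a sketch under the assumption that Python 3.8 or later is available. The helper name clm is hypothetical, chosen to echo the CLMxx% notation.

```python
from statistics import NormalDist

def clm(confidence):
    """Positive z-score that bounds the middle `confidence` share of a normal curve."""
    tail = (1 - confidence) / 2            # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)  # z-score with that much area above it

for level in (0.90, 0.95, 0.99):
    print(f"CLM{level:.0%}: the end points are -{clm(level):.3f} and +{clm(level):.3f}")
```

The computed 99% value rounds to 2.576, while the Common Critical Values table prints 2.575. The difference is only a matter of rounding.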


Student's t-scores

Confidence intervals are used over and over again in statistics, most often to estimate the value of a population parameter when all we can effectively gather is a statistic from a sample. For numerical data, we aren't allowed to use the normal distribution table for this process, because the standard deviation sx of a sample isn't a very precise estimator of the standard deviation sigmax of the underlying population. To deal with this extra level of uncertainty, a statistician named William Gosset came up with the t-score distribution, also known as Student's t-score because Gosset published his work under the pseudonym Student. He used this pseudonym to get around a ban on publishing imposed by his superiors at the Guinness brewery, where he worked.

The critical t-score values are published in Table A-3. The values depend on the degrees of freedom, which for a single sample set of data is equal to n-1. For every number of degrees of freedom, we could have another positive and negative t-score table two pages long, just like the z-score table, but that would take up far too much room, so statistics textbooks have resorted to publishing just the highlights. There are five columns on the table, and each column is labeled with two numbers, one for "Area in One Tail" and one for "Area in Two Tails". Let's look at the row for 13 degrees of freedom.

1 tail___0.005______0.01_____0.025______0.05______0.10
2 tails__0.01_______0.02_____0.05_______0.10______0.20

13_______3.012_____2.650_____2.160_____1.771_____1.350

What this means is that if we have a sample of size 14, then the degrees of freedom are 13 and we can use these numbers to find the cut-off points for certain percentages. The formula for t-scores looks like the formula for z-scores, t = (x - x-bar)/sx, but we use a different look-up table to decide what these numbers mean. For example, the second column in row 13 is the number 2.650, listed under an area of 0.01 in one tail and 0.02 in two tails. This means that in a sample of 14, a Student's t-score of -2.650 is the cutoff for the bottom 1%, the t-score of +2.650 is the cutoff for the top 1% and the middle 98% is between t-scores of -2.650 and +2.650.

As the degrees of freedom get larger, the numbers in the columns get smaller. The last row has the label Large and reads as follows.

Large____2.576_____2.326_____1.960_____1.645_____1.282

These values match the critical values from the z-score table, up to rounding (the Common Critical Values table lists 2.575 where this row lists 2.576). As the data set size gets larger, the differences between the z-distribution and the t-distribution shrink down to nothing.
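For readers who want to check table entries without the handout, here is a rough sketch that approximates critical t-scores using only Python's standard library. It is not how the published table was produced. It integrates the Student's t density with Simpson's rule and then searches for the cutoff by bisection, and the names t_pdf, t_cdf and t_critical are made up for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=1000):
    """Area to the left of t (for t >= 0): 0.5 plus Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, two_tail_area):
    """Positive cutoff that leaves two_tail_area split evenly between the tails."""
    target = 1 - two_tail_area / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:      # widen the bracket until it holds the cutoff
        hi *= 2
    for _ in range(50):                # bisection: halve the bracket each pass
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(13, 0.05), 3))    # row 13, two tails 0.05
print(round(t_critical(1000, 0.05), 3))  # large df, close to the z-score 1.960
```

With 13 degrees of freedom and 0.05 in two tails the result should round to 2.160, and with a very large number of degrees of freedom it should settle near the Large row.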

When can we use t-scores?

Because we don't know sigmax, we are prohibited from using the z-score tables. But there are cases when we shouldn't use the t-score tables either. Here is the decision method.

Step 1: Is n at least 30? If yes, we are good. If no, go to Step 2.

Step 2: Is the sample normally distributed or can we assume the underlying data set is normally distributed? If yes, we can continue. If no, we would only be able to use non-parametric statistical techniques, which are not covered in this course.

For example, the cotinine data sets we have on the handout sheet have one set that looks normally distributed, the smokers data, and two that do not look normally distributed, the exposed and unexposed non-smoker data. Because n=40 for all the sets, we can use the t-score method on each of them, since we answered yes to Step 1. If the data sets had fewer than 30 subjects, we would not be able to use the t-score method on the non-smoker data, because we would have answered no in both Step 1 and Step 2.
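The two-step decision method can be written as a small function. This is only a sketch. The function name can_use_t_scores and the looks_normal flag are hypothetical, and deciding whether data looks normally distributed is still a judgment made by looking at the data.

```python
def can_use_t_scores(n, looks_normal):
    """Apply the two-step decision method for using the t-score tables."""
    # Step 1: a sample of at least 30 is enough on its own.
    if n >= 30:
        return True
    # Step 2: a smaller sample needs normally distributed data.
    return looks_normal

# Cotinine sets from the handout: n = 40 for each set.
print(can_use_t_scores(40, looks_normal=True))    # smokers
print(can_use_t_scores(40, looks_normal=False))   # non-smokers

# The same non-smoker data with fewer than 30 subjects, for example n = 20.
print(can_use_t_scores(20, looks_normal=False))
```

The first two calls return True because Step 1 is satisfied. The last returns False, which is the case where only non-parametric techniques would apply.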


The formula for a confidence interval for the mean of a population given the mean of a sample

The confidence interval for the mean of a population, given the mean of the sample, has two endpoints, found as follows:

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

Let's take an example. We have the heights of males from Data Set #1. The statistics from that set are as follows.

n = 18
x-bar = 71.17
sx = 2.57

The sample size is 18, which is less than 30, so since we answer no in Step 1, we have to move on to Step 2. Here we can answer yes, because we can assume that human height is a normally distributed set. Let's now move on to finding the 95% confidence interval for the average male height given this sample.

Since n = 18, degrees of freedom = 17, and the CLM95% = 2.110. This is because the area in two tails of 5% is the same as the middle region having 95%. Here is our formula with these numbers plugged in.

71.17 - 2.110*2.57/sqrt(18) < mux < 71.17 + 2.110*2.57/sqrt(18)

69.9 < mux < 72.4

From census information, we know the average height of males in the United States is 69.5 inches, so this interval does not contain the true value. This semester, both data sets had average male heights well above the national average, largely because of how many football and baseball players are enrolled in the classes, as well as a few other tall males who are not on the sports teams. This is a good example of the fact that a confidence interval is NOT a promise of a correct answer, which is exactly why the method states a confidence level rather than a guarantee. (Note: if we changed to the 99% confidence interval, our confidence level multiplier would be 2.898 instead of 2.110, and the 99% confidence interval would contain the correct answer.)
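The arithmetic for both intervals can be checked with a short sketch. The multipliers 2.110 and 2.898 are typed in from the t-score table rather than computed, and the function name confidence_interval is made up for this illustration.

```python
import math

def confidence_interval(x_bar, s_x, n, clm):
    """Endpoints x-bar - CLM*sx/sqrt(n) and x-bar + CLM*sx/sqrt(n)."""
    margin = clm * s_x / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Male heights from Data Set #1.
n, x_bar, s_x = 18, 71.17, 2.57
census_mean = 69.5   # average U.S. male height in inches, from the notes

# Multipliers for 17 degrees of freedom, read from the table.
for label, clm in (("95%", 2.110), ("99%", 2.898)):
    low, high = confidence_interval(x_bar, s_x, n, clm)
    contains = low < census_mean < high
    print(label, round(low, 2), round(high, 2), contains)
```

The 95% interval comes out to roughly 69.89 to 72.45 and misses 69.5, while the 99% interval reaches down to about 69.41 and contains it.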
