Wednesday, March 4, 2009

Class notes for 3/4

If we have a data set that is normally distributed, we can turn the raw scores, what we usually call the x values, into z-scores. The z-score answers the question "How many standard deviations away from the average is the raw score x?"
In any set of numerical data, we can find both the average (x-bar in a sample or mux in a population) and the standard deviation (sx in a sample or sigmax in a population). The standard deviation will always be a non-negative number, and it can only be zero if the data set is really boring and all the values are exactly the same, a very rare situation if the data is random.
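For anyone who wants to check these numbers on a computer instead of a calculator, here is a minimal Python sketch. The data list is made up purely for illustration, and the helper name z_score is my own choice, not something from the textbook.

```python
import statistics

# Hypothetical toy data, made up for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = statistics.mean(data)      # the average
s_x = statistics.stdev(data)       # sample standard deviation (divides by n - 1)
sigma_x = statistics.pstdev(data)  # population standard deviation (divides by n)

def z_score(x, average, std_dev):
    """How many standard deviations x lies away from the average."""
    return (x - average) / std_dev

print(x_bar, s_x, sigma_x)
print(z_score(9, x_bar, sigma_x))

# The "really boring" case: every value the same, so the spread is zero.
print(statistics.pstdev([3, 3, 3, 3]))
```

Note that a standard deviation of zero would make the z-score formula divide by zero, which is one more reason the all-equal data set is a special case.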

Not every data set is normally distributed. The signs of a normally distributed set include, but are not limited to:

a) the average and the median are very close to equal
b) most of the data is near the average, with only a few values either far above average or far below average

For example, the data sets for cotinine levels in smokers, exposed non-smokers and unexposed non-smokers are very different from one another. The smoking data is very near the normal distribution, with x-bar at 172.475 and the median at 170. The two non-smoking data sets have a much bigger split between the average and the median. For the exposed non-smokers, x-bar is 60.575 and the median is 1.5. For the unexposed non-smokers, x-bar is 16.35 and the median is 0.
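One rough way to put a number on "very close to equal" is to compare the gap between the average and the median to the size of the average. The sketch below does that with the three pairs of values quoted above. The relative_gap yardstick is my own ad hoc choice for illustration, not a formal test of normality.

```python
# Averages and medians quoted in the notes for the three cotinine data sets.
groups = {
    "smokers": (172.475, 170),
    "exposed non-smokers": (60.575, 1.5),
    "unexposed non-smokers": (16.35, 0),
}

def relative_gap(average, median):
    """Gap between average and median as a fraction of the average.

    An ad hoc yardstick for illustration, not a formal normality test.
    """
    return abs(average - median) / average

for name, (average, median) in groups.items():
    print(f"{name}: relative gap = {relative_gap(average, median):.3f}")
```

The smokers come out with a gap of under 2% of the average, while both non-smoking groups come out above 90%.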

With normally distributed data, the z-scores can be confidently mapped to percentages, using the positive and negative z-score tables, the first two pages of the yellow sheets.

Raw score to z-score (formula) and z-score to percentage (table lookup)

Example #1: What percentage of U.S. women are 5 feet tall or less? The average height of women in the United States is 63.6 inches (5'3.6") and the standard deviation is 2.5 inches. The z-score for 5 feet, or 60 inches, is (60-63.6)/2.5 = -1.44. We now look that number up on the negative z-score table, where the row is the -1.4 row and the column is the .04 column. (This is akin to stem and leaf plots where the ones place and the tenths place are the stem, while the hundredths place is the leaf.)

  z    ...   .03     .04     .05
 ...
-1.4        .0764   .0749   .0735  ...


So the z-score -1.44 corresponds to .0749, which can also be written as 7.49%. That means that 7.49% of women in the U.S. are under five feet tall. Conversely, the percentage of women above five feet tall is (100 - 7.49)% = 92.51%. The "under" number is always the value you will find on the yellow sheet, while the "over" number is 100% minus the table value written as a percentage. Because the normal distribution is symmetrical, 92.51% is also the percentage that corresponds to a z-score of 1.44, the opposite of -1.44 on the other side of zero.
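The table lookup in Example #1 can be double-checked with Python's built-in statistics module, which is an assumption on my part since the class uses the yellow sheets. NormalDist().cdf(z) gives the area to the left of z under the standard normal curve, which is the same "under" number the table lists.

```python
from statistics import NormalDist

mean_height = 63.6   # inches, from the notes
std_dev = 2.5        # inches, from the notes

# Raw score to z-score, rounded to two places like the table.
z = round((60 - mean_height) / std_dev, 2)

under = NormalDist().cdf(z)   # fraction below 60 inches
over = 1 - under              # fraction above 60 inches

print(z)                 # -1.44
print(round(under, 4))   # 0.0749
print(round(over, 4))    # 0.9251

# Symmetry: the "under" value for +1.44 matches the "over" value for -1.44.
print(round(NormalDist().cdf(-z), 4))
```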

Example #2: What percentage of U.S. women are 6 feet tall or more? The z-score for 6 feet, or 72 inches, is (72-63.6)/2.5 = 3.36. We now look that number up on the positive z-score table, where the row is the 3.3 row and the column is the .06 column.

  z    ...   .05     .06     .07
 ...
 3.3        .9996   .9996   .9996  ...


The z-score 3.36 corresponds to .9996, which can also be written as 99.96%. That means that 99.96% of women in the U.S. are under six feet tall. Conversely, the percentage of women above six feet tall is (100 - 99.96)% = 0.04%. About 4 in every 10,000 women in the U.S. are over six feet tall.
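Here is a sketch of Example #2 done two ways in Python, again assuming the built-in statistics module rather than the yellow sheets. The first route mirrors the table method (z-score, then standard normal curve). The second asks a normal distribution with the given average and standard deviation directly. Both should agree.

```python
from statistics import NormalDist

heights = NormalDist(mu=63.6, sigma=2.5)   # figures from the notes

# Route 1: z-score first, then the standard normal curve (what the table does).
z = round((72 - 63.6) / 2.5, 2)
under_via_z = NormalDist().cdf(z)

# Route 2: ask the height distribution directly, no z-score step.
under_direct = heights.cdf(72)

over = 1 - under_direct
print(z, round(under_via_z, 4), round(under_direct, 4))
print(f"{over:.2%} of women are over six feet tall")
```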

Example #3: What percentage of U.S. women are between 5 and 6 feet tall? Using the information from examples 1 and 2, we get .9996 - .0749 = .9247, or 92.47% of women in the U.S. are between five feet and six feet tall.
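The "between" calculation is just one "under" value minus another, which makes it easy to wrap in a small function. This is a minimal sketch, and the name fraction_between is made up for this example.

```python
from statistics import NormalDist

def fraction_between(low, high, average, std_dev):
    """Fraction of a normal population with values between low and high."""
    dist = NormalDist(average, std_dev)
    return dist.cdf(high) - dist.cdf(low)

# Women between 5 feet (60 inches) and 6 feet (72 inches).
share = fraction_between(60, 72, 63.6, 2.5)
print(round(share, 4))
```

The result should round to the same .9247 we got by subtracting the two table values.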

Percentage to z-score (backwards table lookup)

Here's a different question. What is the z-score that corresponds to the 53rd percentile? What this means is we want the z-score that corresponds as closely as possible to .5300 on our look-up table. Since 53% is more than 50%, we will look on the positive z-scores side.

The percentage for z = 0.07 is .5279, while the percentage for z = 0.08 is .5319. The first one is .0021 below .5300 and the second is .0019 above, but we are going to add a third option, the average between them. The average of 0.07 and 0.08 is 0.075. The average of .5279 and .5319 is .5299, which is much closer than the other values, since it is only .0001 below .5300. For that reason, we will choose 0.075 as the z-score that corresponds most closely to the 53rd percentile.
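The backwards lookup can also be checked by computer. In the sketch below, which assumes Python's statistics module, inv_cdf does the reverse of the table: hand it a percentage as a decimal and it returns the z-score. The two cdf lines reproduce the table entries on either side of .5300.

```python
from statistics import NormalDist

std_normal = NormalDist()   # average 0, standard deviation 1

# The two table entries that bracket .5300
below = round(std_normal.cdf(0.07), 4)
above = round(std_normal.cdf(0.08), 4)

# Backwards lookup: percentage in, z-score out.
z_53 = std_normal.inv_cdf(0.53)

print(below, above)
print(round(z_53, 3))
```

The computed value agrees with the 0.075 from averaging the two table rows once it is rounded to three decimal places.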

z-scores to raw scores

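If we know that the z-score for the 53rd percentile is 0.075, what height for U.S. women corresponds to that z-score? We use the formulas below, which are the z-score formula solved for x.

x = x-bar + z*sx (sample)
x = mux + z*sigmax (population)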


In our case, this says the height that corresponds to the 53rd percentile is x = 63.6 + 0.075*2.5 = 63.7875 inches. That is only a tiny bit taller than average and still under 64 inches, or 5'4". Not surprisingly, the 53rd percentile is very close to the 50th percentile.
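Going from z-score back to raw score is the z-score formula run in reverse: multiply by the standard deviation, then add the average. Here is a minimal Python sketch with made-up helper names (raw_score and z_score), including a round trip to confirm the two conversions undo each other.

```python
def raw_score(z, average, std_dev):
    """z-score to raw score: x = average + z * standard deviation."""
    return average + z * std_dev

def z_score(x, average, std_dev):
    """Raw score to z-score: z = (x - average) / standard deviation."""
    return (x - average) / std_dev

height = raw_score(0.075, 63.6, 2.5)   # figures from the notes
feet, inches = divmod(height, 12)
print(f"{height:.4f} inches is about {int(feet)}'{inches:.1f}\"")

# Round trip: converting the height back should return the z-score we started with.
print(round(z_score(height, 63.6, 2.5), 3))
```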

We will continue this discussion on Monday.
