Tuesday, March 31, 2009

Class notes for 3/30


One of the most common uses of statistics is in opinion polling, which is most popular in election years. TV, radio and newspapers report nearly constantly on the results of polls, with new ones released every day. The numbers for the candidates, or at least how much of a lead one candidate has over the major competitor, are reported up front, and sometimes the margin of error is given at the end of the report. Almost never is the margin of error given with a confidence level attached. The one major exception to this oversight is The New York Times, which does explain the confidence level in a sidebar; in opinion polls, that confidence level is always 95%.

The true percentage of the population p is expected to be inside an interval that surrounds p-hat, the percentage of our sample. The Confidence Level Multipliers are taken from the z-score table instead of the t-score table. They are given in the lower right hand corner of the Positive z Score table (Table A-2), where they are labeled Common Critical Values.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575

Let's take some polling data from last year's election and find the margins of error.

Final poll from Florida - 2008
n = 678
p-hat(Obama) = 49%
p-hat(McCain) = 48%
p-hat(undecided or other candidates) = 3%

Margin of error for Obama = 1.96*sqrt(.49*.51/678) = 0.037629145... ~ 3.7%
Margin of error for McCain = 1.96*sqrt(.48*.52/678) = 0.037606552... ~ 3.7%

The margin of error is typically rounded to the nearest tenth of a percent, and unless there is a lot of undecided support or support for other candidates, it is very common in a two-person race that the margin of error for each candidate will round to the same number.

The correct sentence to explain this data would be as follows: If the election were held the day the poll was taken, we are 95% confident that Obama would get between 45.3% and 52.7% of the vote, while McCain would garner between 44.3% and 51.7% of the vote.

The final election results in Florida, rounded to the nearest thousand, were as follows.

Obama 4,282,000 51.0%
McCain 4,045,000 48.2%
Other 63,000 0.8%

Both candidates were inside the 95% confidence intervals stated in the final poll. The other candidate vote is significantly lower than the 3% of the final poll, but that included undecided voters, who may have chosen one candidate over another, or not voted at all.

Final poll from North Dakota - 2008
n = 500
p-hat(Obama) = 46%
p-hat(McCain) = 47%
p-hat(undecided or other candidates) = 7%

Margin of error for Obama = 1.96*sqrt(.46*.54/500) = 0.043686461... ~ 4.4%
Margin of error for McCain = 1.96*sqrt(.47*.53/500) = 0.043747973... ~ 4.4%

Again, the two margins of error round to the same tenth of a percent.

The correct sentence to explain this data would be as follows: If the election were held the day the poll was taken, we are 95% confident that Obama would get between 41.6% and 50.4% of the vote, while McCain would garner between 42.6% and 51.4% of the vote.

The final election results in North Dakota, rounded to the nearest thousand, were as follows.

Obama 141,000 44.6%
McCain 169,000 53.5%
Other 6,000 1.9%

While Obama's result was inside the 95% confidence interval stated in the final poll, McCain did better than expected. 95% confidence means there will be mistakes about 5% of the time, which is about 1 chance in 20 of being wrong. If the polling company had used the 99% confidence interval, which in polling data no one ever does, the numbers would have been as follows.

Margin of error for Obama = 2.575*sqrt(.46*.54/500) = 0.057394203... ~ 5.7%
Margin of error for McCain = 2.575*sqrt(.47*.53/500) = 0.057475015... ~ 5.7%

McCain's 53.5% is still above the high end of the 99% confidence interval.

In class, we did samples of m&m's. Here are the totals for the second class, with both samples put together.

n = 1700
p-hat(red) = 221/1700 = 13.0%
p-hat(blue) = 382/1700 ~ 22.5%

sp-hat(red) = sqrt(.13*.87/1700) = 0.008156556...
sp-hat(blue) = sqrt(.225*.775/1700) = 0.010127859...

Here are the confidence intervals for the percentage of red m&m's in the current world population of milk chocolate m&m's, or at least those manufactured in Hackettstown, New Jersey.

90% confidence interval: 13% +/- 1.645*sqrt(.13*.87/1700) = 13% +/- 1.3% = 11.7% to 14.3%
95% confidence interval: 13% +/- 1.96*sqrt(.13*.87/1700) = 13% +/- 1.6% = 11.4% to 14.6%
99% confidence interval: 13% +/- 2.575*sqrt(.13*.87/1700) = 13% +/- 2.1% = 10.9% to 15.1%

Practice problems:

Find the confidence intervals for the true percentage of blue milk chocolate m&m's, for 90% confidence, 95% confidence and 99% confidence. Do the 99% confidence intervals for blue and red overlap?

Answers in the comments.

Tuesday, March 24, 2009

Class notes for 3/23

So far, we have been using numerical data and coming up with confidence intervals for the average or standard deviation of the underlying population using formulas involving n, x-bar and sx, as well as multipliers from the Student's t-score tables (Table A-3) and the Chi square table (Table A-4). We use Student's t-scores instead of z-scores because the standard deviation of the sample (sx) might not be very close to the standard deviation of the population (sigmax), and the t-scores numerically deal with the extra uncertainty created by this variation.

When we use categorical data, the most important parameter we try to predict is the proportion of a value in the population, called p, which we estimate using the proportion from the sample, known as p-hat.

Again we will create a confidence interval, but the formula for the standard deviation is very different.

sp-hat = sqrt(p-hat * q-hat/n)

The Confidence Level Multipliers for xx% are taken from the z-score table (Table A-2) instead of the t-score table, and this is because the standard deviation for the sample and the standard deviation for the population are expected to be relatively close to one another. The values for the CLMxx% are given on the first page of your yellow sheets in the lower left hand corner.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575

Example: Consider data sets #1 and #2, and the proportion of males. Let's find the 95% confidence interval for the underlying population, which we will limit to students at Laney who take statistics.

Data set #1:
n = 38
f(males) = 18
p-hat(males) = 18/38 ~= .474
q-hat(males) = 1 - p-hat(males) ~= .526

.474 - 1.96*sqrt(.474*.526/38) < p < .474 + 1.96*sqrt(.474*.526/38)
.315 < p < .633

Given this sample of 38 students, we are 95% confident the percentage of male students taking statistics at Laney is between 31.5% and 63.3%.


Data set #2: n = 42
f(males) = 12
p-hat(males) = 12/42 ~= .286
q-hat(males) = 1 - p-hat(males) ~= .714

.286 - 1.96*sqrt(.286*.714/42) < p < .286 + 1.96*sqrt(.286*.714/42)
.149 < p < .423

Given this sample of 42 students, we are 95% confident the percentage of male students taking statistics at Laney is between 14.9% and 42.3%.

Data sets #1 and #2 combined: n = 80
f(males) = 30
p-hat(males) = 30/80 = .375
q-hat(males) = 1 - p-hat(males) = .625

.375 - 1.96*sqrt(.375*.625/80) < p < .375 + 1.96*sqrt(.375*.625/80)
.269 < p < .481


Given this sample of 80 students, we are 95% confident the percentage of male students taking statistics at Laney is between 26.9% and 48.1%.

Notice how much our intervals disagree with one another. This is because our best point estimates from the three sets are .474, .286 and .375. Also notice that the width of the 95% confidence interval tends to be smaller as n gets bigger. When the sample size is 38, the width of the confidence interval is .318. At n = 42, it is .274 wide. At n = 80, the width is .212. The most common way to make a confidence interval narrower is to increase the size of the sample.

There are two other ways to change the width. If you ask for a higher confidence level, the interval will get wider. If p and q are close to 50%, the confidence interval will be wider than if they are both far away from 50%.

Questions:

1. Use data set #1 to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

2. Use data set #2 to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

3. Use data sets #1 and #2 combined to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

Answers in the comments.

Sunday, March 22, 2009

Practice problems for population percentiles

Based on data from the National Health Survey, the average height in inches for women in the United States is 63.6 inches and the standard deviation is 2.5 inches. These are mux and sigmax, respectively. If we ask how many women are exactly some height, like 5'0" (60 inches) for example, the answer is assumed to be 0%. But if a woman says she is 5'0" tall, and she has measured herself correctly, she could be anywhere between 59.5 and 60.5 inches tall, which is to say 4' 11.5" and 5' 0.5". Let's find the z-scores for 59.5 and 60.5.

z(59.5) = (59.5-63.6)/2.5 = -1.64
z(60.5) = (60.5-63.6)/2.5 = -1.24

Using the Negative z-score table, we look up the percentages for each of these z-scores.

-1.64 -> .0505
-1.24 -> .1075

Subtracting the percentages, we get .1075 - .0505 = .0570, which says 5.7% of American women list their height as 5'0".

Questions

1. What percentage of American women list their height as 5'1"?
2. What percentage of American women list their height as 5'2"?
3. What percentage of American women list their height as 5'3"?
4. What percentage of American women list their height as 5'4"?
5. What percentage of American women list their height as 5'5"?
6. What percentage of American women list their height as 5'6"?
7. What percentage of American women list their height as 5'7"?
8. What percentage of American women list their height as 5'8"?
9. What percentage of American women list their height as 5'9"?

Answers in the comments.

Thursday, March 19, 2009

practice data for confidence intervals

Data set

70 69 68 67 73 77 75 68 72 71
71 72 72 72 71 73 67 72 72 78
72 68 65 67 73 77 77 72 72 69

Find the following statistics. If any statistic is not a whole number, round to the nearest hundredth.

n =
x-bar =
sx =
mid-range =
above =

Does the data pass the mean to median test?

Does the data pass the z(mid-range) test?

If it passes both tests, find the 95% confidence interval for mux, and write it as a sentence.

Find the 95% confidence interval for sigmax, and write it as a sentence.

Answers in the comments.

Class notes for 3/18

Tests before confidence interval for average of population

We have discussed finding a confidence interval for mux by using x-bar, the average of a sample taken from that population. Because we don't know the average of the population, we can't know the standard deviation of the population, since you need the average to compute the standard deviation. Because of this uncertainty, we use Student's t-scores instead of z-scores for the Confidence Level Multipliers for xx%, or CLMxx% for short.

There are times when we aren't allowed to use this method for approximation. We have two questions, and if we answer no to both of them, this method should not be used and only non-parametric methods are useful when dealing with this data set. (We will learn a few non-parametric methods later in the class, but most of these methods are taught in more advanced stats classes.)

Question #1: Is n > 30? If a sample is big enough, Gossett found that his t-scores give a reasonable approximation. This is always the first question because it is so easy to answer, and if we get a yes here, we don't even need to ask Question #2.

Question #2: Is the underlying data set normally distributed? It might be that we will be told the answer to this question without having to do any work to test this ourselves, and if we are told "yes" we can proceed. If we aren't told, here are two tests. There are other tests that can be done, but these two are easy and useful, and if the data set passes both these tests, we will assume we can move forward.


The z(mid-range) test. Recall that the mid-range is the average of the highest and lowest values in the data set. If one of these extremes is a lot farther away from average than the other, the data set may be too skewed to be reliable. This test checks whether one extreme value (the highest or lowest) is more than one standard deviation farther away from average than the other extreme value. Understand that one standard deviation is an arbitrary cut-off. The quantity being tested is z(mid-range), which is where the test's name comes from.


The mean to median test. If the data is normally distributed, then the mean and median of the underlying population should be equal. If the mean and median of the sample are significantly different, then this test should produce a number greater than 1.353. (This is a simple version of a goodness-of-fit test, which we will be exploring in greater depth later in the semester.)

Once we have the average value x-bar, count how many values are greater than x-bar, and call this value above. We should expect that random samples from a normally distributed set should have about a 50%-50% split of values above and below average, so about n/2 in each group.

Again, the cut-off point 1.353 is arbitrary, and it is derived from the chi squared table. If we get a yes to both these questions, we can answer yes to the question about the normally distributed underlying population and proceed to find the interval for mux, which is the following inequality.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

Confidence interval for standard deviation of population


The confidence interval formula for sigmax is completely different than the formula for mux, and uses values from a new table, Table A-4, known as the chi squared table. Chi is pronounced like the chi in chiropractor, "kai", not "chee" or "chai".

Note that we don't have to test for normal distribution of the underlying data.

Let's give an example. If we have a set where n = 13 and sx = 3.21, we use n-1 as our degrees of freedom. Here is the line from Table A-4 for the row that corresponds to 12.

_____0.995___0.99___0.975___0.95___0.90___0.10____0.05___0.025____0.01___0.005
_12__3.074__3.571___4.404__5.226__6.304__18.549__21.026__23.337__26.217__28.299

The values of chi^2R and chi^2L are taken from the table as follows.

90% confidence: The right value is from the 0.05 column, the left value is from the 0.95 column. (Note that 0.95 - 0.05 = 0.90 or 90%)

95% confidence: The right value is from the 0.025 column, the left value is from the 0.975 column. (Note that 0.975 - 0.025 = 0.95 or 95%)

99% confidence: The right value is from the 0.005 column, the left value is from the 0.995 column. (Note that 0.995 - 0.005 = 0.99 or 99%)

In this instance, here are the confidence intervals for each of the standard percentages of confidence.

90% confidence: sqrt(3.21^2*12/21.026) < sigmax < sqrt(3.21^2*12/5.226)
95% confidence: sqrt(3.21^2*12/23.337) < sigmax < sqrt(3.21^2*12/4.404)
99% confidence: sqrt(3.21^2*12/28.299) < sigmax < sqrt(3.21^2*12/3.074)

For practice, use your calculator to give the answers above, rounded to two places after the decimal. Answers in the comments.

More practice sets will be posted later today.

Wednesday, March 18, 2009

CORRECTION ON HOMEWORK 8

The formula for the confidence interval for the standard deviation given on the homework is wrong, but the formula given in class today is correct. In the numerator, we need sx ^ 2, not just sx as is written on the homework. Make sure to make this correction to the formula before working on the answer.

Tuesday, March 17, 2009

Class notes for 3/16

One of the major uses of statistics is to find confidence intervals for parameters from statistics, which means to find the approximate range of a value of the population by taking the associated value from a sample and assigning a confidence level to that range. For the average of a numerical variable, this means finding a range for mux using x-bar, sx and n, and looking up a Confidence Level Multiplier on the Student's t-score table, the fifth page of the yellow handout, labeled Table A-3.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

There are some things we need to test before we create the interval. We have two questions to ask about the data, and if we get a yes answer to either question, we can proceed. If we get no answers to both, then we should not use this data set to create a confidence interval.

Question #1: Is n > 30?
Question #2: Is the underlying data set normally distributed?

Question #1 is easy to answer, and if the answer is yes, we don't have to bother with Question #2. It might be that you are simply told by the person who collected the data that the answer is yes. But if that is not the case, here are two tests the data set should pass to get a yes answer to Question #2. Both of these tests are using arbitrary cut-off points, the second test being a preview of tests we are going to do later in the class called goodness of fit.

Test #1: The Mid-range Outlier test. Recall that the mid-range is (high + low)/2. Take the z-score for the mid-range, (mid-range - x-bar)/sx, and if this value is more than 0.5 or less than -0.5, we will answer no to Question #2. Failing this test means there is an outlier either high or low, that either the low value is one full standard deviation farther from the average than the high value, or vice versa. In a small data set, this can skew the statistics and make us less confident about the values being near the parameters from the underlying population.

Test #2: The Goodness of Fit test. We can do this test as long as n is 10 or more. If the data set has fewer than 10 entries and it failed Test #1, you shouldn't try to do the confidence interval. In this test, count the number of entries that are greater than x-bar and call that value above. If the underlying data set is normally distributed, we would expect about half the entries to be above average and half below average, so about n/2 in each group. The Goodness of Fit test for this data set is done by plugging these values into the formula below and making this check.

Is (above - n/2)^2/(n/2) < 1.353? If yes, the data passes the test.

Example: here is a data set, the carries per game for 30 randomly selected NFL running backs from the 2008 season.

1.2 4.7 13.1 3.2 8.8 18.2 3.5 12.5 4.8 8.2 0.7
17.1 0.2 11.4 3.2 8.6 6.3 17.4 17.1 17.8
1.2 0.2 0.1 8.9 4.0 8.6 0.7 2.8 0.5 0.4

x-bar = 6.85
sx = 6.17
n = 30
mid-range = (18.2 + 0.1)/2 = 9.15

Test #1: z(mid-range) = (9.15-6.85)/6.17 = 0.37

Because this is less than one half, we can answer yes and move on to test #2.

Test #2: number of values above average, which we call above = 13. n/2 = 15, so the test is

(13-15)^2/15 = 4/15, which rounds to 0.267 and is below 1.353, so we can answer yes to this and proceed.

n = 30 so degrees of freedom = 29. Here are the three most commonly used CLM values.

CLM90% = 1.699
CLM95% = 2.045
CLM99% = 2.756

Using the formula, here are the values for each of the confidence levels.

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

90% confidence:
6.85 - 1.699*6.17/sqrt(30) < mux < 6.85 + 1.699*6.17/sqrt(30)
4.94 < mux < 8.76

95% confidence:
6.85 - 2.045*6.17/sqrt(30) < mux < 6.85 + 2.045*6.17/sqrt(30)
4.55 < mux < 9.15

99% confidence:
6.85 - 2.756*6.17/sqrt(30) < mux < 6.85 + 2.756*6.17/sqrt(30)
3.75 < mux < 9.95

The sentence explaining the confidence interval would go as follows.

Given a sample of the carries per game of 30 running backs from the 2008 NFL season, we are 99% confident the true average number of carries per game for all NFL running backs that season is between 3.75 and 9.95.

Practice problem set

Here is the data for average yards per carry for the same 30 randomly selected NFL running backs for the 2008 season.


4.6 2.6 3.9 4.4 4.2 3.8 5.5 3.6 3.1 3.7
6.8 5.5 2.0 5.6 4.4 4.8 5.7 3.6 4.3 3.5
5.0 0.0 -2.0 2.8 3.3 4.4 2.0 6.0 1.3 4.0


Problem #1: Find x-bar and sx, rounded to two places after the decimal point.
Problem #2: Do the z(mid-range) test. Does the set pass or fail?
Problem #3: Do the second test for normal distribution. Does the set pass or fail?
Problem #4: Find the endpoints for the 95% confidence interval.
Problem #5: Write the sentence explaining the interval.

Answers in the comments.

Sunday, March 15, 2009

using the TI30X IIs for confidence intervals



Let's say we have the following statistics from a data set and we want to find the 95% confidence interval for mux, the underlying population's true average value.

n = 36
x-bar = 25.35
sx = 6.78

The degrees of freedom will be n-1 = 35, but 35 is not one of the choices on Table A-3, so we have to go with row 34 instead. The Confidence Level Multiplier for 95%, or CLM95%, is 2.032. If you have a TI-30xIIs, key in the following numbers into your calculator to get the high threshold.

25.35+2.032*6.78/[2nd][x^2]36[enter]

The answer should be 27.64616, which we can round to 27.65.

All we have to do to get the low threshold is to change the + to a -. Key in the following.

[up][right][right][right][right][right]-[enter]

After you key this in, the equation line should look like

25.35-2.032*6.78/sqrt(36

and the answer is 23.05384, which rounds to 23.05.

This says we are 95% confident using this data set that the true average of the underlying population is between 23.05 and 27.65.

Thursday, March 12, 2009

Class notes for 3/11

The positive and negative z-score tables connect z-scores to proportions, numbers between 0 and 1 written to four decimal places of accuracy. For example, z = 1.23 corresponds to .8907, which means that 89.07% of data in a normally distributed set should have a z-score of 1.23 or less, while (100-89.07)% = 10.93% of data will have a z-score of 1.23 or more. Moreover, the normal curve is symmetric around z = 0. This means when z = -1.23, that is the cut-off point between the low 10.93% of the data and the high 89.07% of the data, just the opposite of the percentages that correspond to z = 1.23.



Critical values, or Confidence Level Multipliers (CLMxx%)

Sometimes, instead of being interested in the highest n% or the lowest, we will need to deal with the middle n% for what is known as a confidence interval. In the bottom right hand corner of the Positive z Scores table (first page of yellow handout), there is a small table labeled Common Critical Values. On tests and homework, I will call these Confidence Level Multipliers, or CLMxx%, where xx% is the percentage of confidence associated with the z-scores. The table gives the z-scores I will be calling CLM90%, CLM95% and CLM99%.

CLM90%: The end points are -1.645 and +1.645
CLM95%: The end points are -1.96 and +1.96
CLM99%: The end points are -2.575 and +2.575

For example, what this means for the case of CLM90% is that the low 5% of the data is below z=-1.645, the middle 90% is between z=-1.645 and z=+1.645, and the high 5% is above z=+1.645.


Student's t-scores

Confidence intervals are used over and over again in statistics, most especially in trying to find out what value the parameter of the population has, when all we can effectively gather is a statistic from a sample. For numerical data, we aren't allowed to use the normal distribution table for this process, because the standard deviation sx of a sample isn't a very precise estimator of sigmax of the underlying population. To deal with this extra level of uncertainty, a statistician named William Gossett came up with the t-score distribution, also known as Student's t-score because Gossett published all his work under the pseudonym Student. He used this fake name for publishing to get around a ban on publishing in journals established by his superiors at the Guinness Brewing Company where he worked.

The critical t-score values are published on table A-3. The values depend on the Degrees of Freedom, which in the case of a single sample set of data is equal to n-1. For every degree of freedom, we could have another positive and negative t-score table two pages long, just like the z-score table, but that would take up way too much room, so statistics textbooks have reverted instead to publishing just the highlights. There are five columns on the table, each column labeled with a number from "Area in One Tail" and "Area in Two Tails". Let's look at the degrees of freedom row for the number 13.

1 tail___0.005______0.01_____0.025______0.05______0.10
2 tails__0.01_______0.02_____0.05_______0.10______0.20

13_______3.012_____2.650_____2.160_____1.771_____1.333

What this means is that if we have a sample of size 14, then the degrees of freedom are 13 and we can use these numbers to find the cut-off points for certain percentages. The formula for t-scores looks like the formula for z-scores, t = (x - x-bar)/sx, but we use the different look-up table to decide what these numbers mean. For example, the second column in row 13 is the number 2.650. This means that in a sample of 14, a Student's t-score of -2.650 is the cutoff for the bottom 1%, the t-score of +2.650 is the cutoff for the top 1% and the middle 98% is between t-scores of -2.650 and +2.650.

As the degrees of freedom get larger, the numbers in the columns get smaller. The last row has the label Large and reads as follows.

Large____2.576_____2.326_____1.960_____1.645_____1.282

These values exactly correspond to the z-score table. As the data set size gets larger, the differences between the z-distribution and the t-distribution shrink down to nothing.

When can we use t-scores?

Because we don't know sigmax, we are prohibited from using the z-score tables. But there are cases when we shouldn't use the t-score tables either. Here is the decision method.

Step 1: Is n at least 30? If yes, we are good. If no, go to Step 2.

Step 2: Is the sample normally distributed or can we assume the underlying data set is normally distributed? If yes, we can continue. If no, we would only be able to use non-parametric statistical techniques, which are not covered in this course.

For example, the cotinine data sets we have on the handout sheet have one set that looks normally distributed, the smokers data, and two that do not look normally distributed, the exposed and unexposed non-smoker data. Because n=40 for all the sets, we can use the t-score method because we answered yes to Step 1. If the data sets had less than 30 subjects, we would not be able to use the non-smoker data because we would have answered no to the questions from both Step 1 and Step 2.


The formula for a confidence interval for the mean of a population given the mean of a sample.

The confidence interval for the mean of a population given the mean of the sample is a formula that gives us two endpoints as follows

x-bar - CLMxx%*sx/sqrt(n) < mux < x-bar + CLMxx%*sx/sqrt(n)

Let's take an example. We have the heights of males from Data Set #1. The statistics from that set are as follows.

n = 18
x-bar = 71.17
sx = 2.57

The sample size is 18, which is less than 30, so since we answer no in Step 1, we have to move on to Step 2. Here we can answer yes, because we can assume that human height is a normally distributed set. Let's now move on to finding the 95% confidence interval for the average male height given this sample.

Since n = 18, degrees of freedom = 17, and the CLM95% = 2.110. This is because the area in two tails of 5% is the same as the middle region having 95%. Here is our formula with these numbers plugged in.

71.17 - 2.110*2.57/sqrt(18) < mux < 71.17 + 2.110*2.57/sqrt(18)
69.9 < mux < 72.4

From census information, we know the average height of males in the United States is 69.5 inches, so this interval does not contain the true value. This semester, both the data sets had averages for male heights well above average, largely because of how many football and baseball players are enrolled in the classes, as well as a few other tall males who are not on the sports teams. This is a good example that a confidence interval is NOT a promise of a correct answer, and that statistical methods include confidence intervals exactly for this reason. (note: If we changed the interval to the 99% confidence interval, our confidence level multiplier would be 2.898 instead of 2.110, and the 99% confidence interval would contain the correct answer.)

Tuesday, March 10, 2009

Class notes for 3/9


Not all data sets are normally distributed, but it has been proven that the averages of samples of a fixed size taken from a data set are approximately normally distributed around the average of the whole data set. In the language we have used in class, this means that if we take a sample and get an average x-bar, it should be relatively close to the average of the population mux. This is called the Central Limit Theorem.


The standard deviation of these sample averages is given by the equation sigmax-bar = sigmax/sqrt(n). As n gets larger, sigmax-bar gets smaller.


To find z(x-bar), we subtract mux from x-bar and divide by sigmax-bar. On a calculator it's easiest to type in (x-bar-mux)/sigmax*sqrt(n).

Here is an example of the difference between z(x) and z(x-bar). We know that for IQ scores, the data set is normally distributed, mux = 100 and sigmax = 15. This means an IQ of 115 has a z-score of (115-100)/15 = 1. The z-score of 1 corresponds to the proportion .8413, which says that the percentage of people with IQs below 115 is 84.13% and the percentage with IQs above 115 is (100-84.13)% = 15.87%.

The Central Limit Theorem z-score answers a different question: what if we have a group of 8 people whose average IQ is 115? How often does that happen? Now the formula changes to (115-100)/15*sqrt(8) = 2.828..., which rounds to 2.83. The proportion that corresponds to a z-score of 2.83 is .9977, which means that about 99.77% of all groups of eight people have average IQs under 115, while only (100-99.77)% = 0.23% of groups of eight have average IQs at 115 or over.

A standard usage of the Central Limit Theorem is to take a data set and see if the result is unusual or not. This is done by the following procedure.

1. Choose an outlying value, either a z-score or the percentage that corresponds to it.
2. Take a data sample from a population where you know the average and standard deviation already. If the Central Limit Theorem z-score gives us a value beyond the outlying value, we flag the sample we took as an outlying sample.

Saturday, March 7, 2009

Practice problems for standard deviation

The data for IQ test scores is normally distributed. The average IQ is 100 and the standard deviation is 15.

1. What are the z-scores for the following IQ scores?

a. 90
b. 112
c. 120

2. Using the answers from part 1 and the Positive and Negative z-score tables.

a. What percentage of the population has an IQ less than 90?

b. What percentage of the population has an IQ higher than 120?

c. What percentage of the population has an IQ between 112 and 120?

3. Percentiles

a. Find the z-score that corresponds to the 63rd percentile. Find the IQ that corresponds to the 63rd percentile, rounded to the nearest whole number.

b. Find the z-score that corresponds the 13th percentile. Find the IQ that corresponds to the 13th percentile, rounded to the nearest whole number.

Answers in the comments.

Wednesday, March 4, 2009

Class notes for 3/4

If we have a data set that is normally distributed, we turn raw scores, what we usually call the x values, into z-scores. The z-score answers the question "How many standard deviations away from the average is the raw score x?"

In any set of numerical data, we can find both the average (x-bar in a sample or mux in a population) and the standard deviation (sx in a sample or sigmax in a population). The standard deviation will always be a non-negative number, and it can only be zero if the data set is really boring and all the values are exactly the same, a very rare situation if the data is random.

Not every data set is normally distributed. The signs of a normally distributed set include, but are not limited to

a) the average and the median are very close to equal
b) most of the data is near the average, with only a few values either far above average or far below average

For example, the data concerning cotinine levels for smokers, exposed non-smokers and unexposed non-smokers are very different from one another. The smoking data is very near the normal distribution, with x-bar at 172.475 and the median at 170. The two non-smoking data sets have a much bigger split between the average and the median. With the exposed non-smokers, x-bar is 60.575 while the median is 1.5; the unexposed have an x-bar of 16.35 and a median of 0.

With normally distributed data, the z-scores can be confidently mapped to percentages, using the positive and negative z-score tables, the first two pages of the yellow sheets.

Raw score to z-score (formula) and z-score to percentage (table lookup)

Example #1: What percentage of U.S. women are 5 feet tall or less? The average height of women in the United States is 63.6 inches (5'3.6") and the standard deviation is 2.5 inches. The z-score for 5 feet, or 60 inches, is (60-63.6)/2.5 = -1.44. We now look that number up on the negative z-score table, where the row is the -1.4 row and the column is the .04 column. (This is akin to stem and leaf plots where the ones place and the tenths place are the stem, while the hundredths place is the leaf.)

___...___.03____.04____.05
...
-1.4____.0764 _.0749 _.0735 ...


So the z-score -1.44 corresponds to .0749, which can also be written as 7.49%. That means that 7.49% of women in the U.S. are under five feet tall. Conversely, the percentage of women above five feet tall is (100 - 7.49)% = 92.51%. The "under" number is always the value you will find on the yellow sheet, while the over number will equal (100 - table value)%. Because the normal distribution is symmetrical, 92.51% is the percentage that corresponds to a z-score of 1.44, the opposite of -1.44 on the other side of zero.

Example #2: What percentage of U.S. women are 6 feet tall or more? The z-score for 6 feet, or 72 inches, is (72-63.6)/2.5 = 3.36. We now look that number up on the positive z-score table, where the row is the 3.3 row and the column is the .06 column.

___...___.05____.06____.07
...
3.3_____.9996 _.9996 _.9996 ...


The z-score 3.36 corresponds to .9996, which can also be written as 99.96%. That means that 99.96% of women in the U.S. are under six feet tall. Conversely, the percentage of women above six feet tall is (100 - 99.96)% = 0.04%. About 4 in every 10,000 women in the U.S. are over six feet tall.

Example #3: What percentage of U.S. women are between 5 and 6 feet tall? Using the information from examples 1 and 2, we get .9996 - .0749 = .9247, or 92.47% of women in the U.S. are between five feet and six feet tall.

Percentage to z-score (backwards table lookup)

Here's a different question. What is the z-score that corresponds to the 53rd percentile? What this means is we want the z-score that corresponds as closely as possible to .5300 on our look-up table. Since 53% is more than 50%, we will look on the positive z-scores side.

The percentage for z = 0.07 is .5279, while the percentage for z = 0.08 is .5319. The first one is .0021 below .5300 and the second is .0019 above, but we are going to add a third option, the average between them. The average of 0.07 and 0.08 is 0.075. The average of .5279 and .5319 is .5299, which is much closer than the other values, since it is only .0001 below .5300. For that reason, we will choose 0.075 as the z-score that corresponds most closely to the 53rd percentile.

z-scores to raw scores

If we know that the z-score for the 53rd percentile is 0.075, what height for U.S. women corresponds to that z-score? We solve the z-score formula for the raw score, which gives x = mux + z*sigmax (or x = x-bar + z*sx when working from a sample).

This would say in our case that the height that corresponds to the 53rd percentile is x = 63.6 + 0.075*2.5 = 63.7875 inches. Only a tiny bit taller, still under 64 inches or 5'4", the 53rd percentile is very close to the 50th percentile, not surprisingly.

We will continue this discussion on Monday.

Tuesday, March 3, 2009

Class notes for 3/2

For most of the rest of the semester, we will be working with the idea of standard deviation, a way to measure how spread out a set of data is. There are going to be a lot of different ways to compute standard deviation depending on what kind of set we are dealing with, and the first two we will learn are sx and sigmax, which are the standard deviations for a sample of numerical data and a population of numerical data, respectively. I am going to go through the steps of calculating these numbers with a small set of data first, then show how to key in the data to the TI-30XIIs, which is a huge time saver.

Data set #1: 1, 2, 3, 4, 5, 6

Step #1: Find the average. 1+2+3+4+5+6 = 21, and 21/6 = 3.5. So x-bar or mux is 3.5, depending on whether we have a sample or a population.

Step #2: Take the squares of all the values of data minus the average, then add them together.

(1-3.5)^2 = (-2.5)^2 = 6.25
(2-3.5)^2 = (-1.5)^2 = 2.25
(3-3.5)^2 = (-0.5)^2 = 0.25
(4-3.5)^2 = 0.5^2 = 0.25
(5-3.5)^2 = 1.5^2 = 2.25
(6-3.5)^2 = 2.5^2 = 6.25
sum = 17.5

Step #3: Divide the sum by N or n-1, depending on population or sample.

17.5/6 = 2.91666...
17.5/5 = 3.5

Step #4: Take the square root of the value from Step #3.

sigmax = sqrt(2.91666...) ~ 1.707825...

sx = sqrt(3.5) ~ 1.870828...

For a small set of data and an average that is exact, this isn't so hard. As data sets get larger, this becomes a lot of work to do by hand, which is why a calculator is so valuable.

Steps for TI-30x

Step #1: Get into one variable mode and clear the data set.
If the word STAT is on your screen, this key sequence will do the trick.

[2ND][STATVAR][ENTER][2ND][DATA][ENTER]

If the word STAT is not on your screen, type these key strokes.

[2ND][DATA][ENTER]

Now you are ready to enter in the data. Type in the values shown after each prompt; the prompts (X1=, FRQ= and so on) are what is already on the screen.

[DATA]
X1= 1 [DOWN]
FRQ = 1 [DOWN]
X2= 2 [DOWN]
FRQ = 1 [DOWN]
X3= 3 [DOWN]
FRQ = 1 [DOWN]
X4= 4 [DOWN]
FRQ = 1 [DOWN]
X5= 5 [DOWN]
FRQ = 1 [DOWN]
X6= 6 [DOWN]
FRQ = 1 [DOWN]
[STATVAR]

The read out will now give you the following information as you scroll left and right.

n = 6
x-bar (or mux) = 3.5
sx = 1.8708...
sigmax = 1.7078...
sum(x) = 21
sum (x^2) = 91

If you move the underline to the x-bar and press [ENTER], the equation line will now have the symbol x-bar on it, which means the calculator can do equations with the exact values of the average and the standard deviations in them, which will be useful in calculating z-scores.

The standard deviations are roughly equal to the average distance away from the average of all the data in the set. The reason for the n-1 instead of n in the sx equation is the idea of degrees of freedom in a data set. If you know the average of a set of data and the size of the set, you know the total, and if I give you the total of all but one of the values of a set, you can subtract to find the last value. In other words, once the total is fixed, only n-1 of the values are free to vary.


The reason we work with these rough approximations of the average distance instead of the exact value comes from calculus. The normal curve is a bell shaped curve with area = 1 under the curve from negative infinity to infinity, so any vertical line we draw cuts the area into two parts, where the area under the curve to the left of the line is x and the area to the right of the line is 1-x. A lot of data sets, though not all, have this kind of distribution, and by using z-scores of the raw scores, which is (raw - average)/(standard deviation), we can compare two data sets that are normally distributed, even if they have different averages and different standard deviations. We will look at this in greater detail next class.

Monday, March 2, 2009

Hint for smoker data set on salmon colored sheet

For the three data sets where you are trying to find x-bar and sx, here are the values of sigmax. If your calculator has a different value for sigmax, you need to check the data list to make sure you haven't made a mistake. The size of each of the data sets is n = 40. Round your answers to two places after the decimal.

Smokers: sigmax = 117.9951244
Exposed non-smokers: sigmax = 136.3469632
Non-exposed non-smokers: sigmax = 61.74688251