Tuesday, March 31, 2009

Class notes for 3/30


One of the most common uses of statistics is in opinion polling, which is most popular in election years. TV, radio and newspapers report poll results almost constantly, with new polls released every day. The numbers for the candidates, or at least how much of a lead one candidate has over the major competitor, are reported up front, and sometimes at the end of the report the margin of error will be given. Almost never is the margin of error given with a confidence level attached. The one major exception to this oversight is The New York Times, which does explain the confidence level in a sidebar, that confidence level always being 95% in opinion polls.

As we see in the sidebar above, the true percentage of the population p is expected to be inside an interval that surrounds p-hat, the percentage found in our sample. The interval is p-hat plus or minus the margin of error, where margin of error = CLM*sqrt(p-hat*(1 - p-hat)/n) and n is the sample size. The Confidence Level Multipliers (CLM) are taken from the z-score table instead of the t-score table. They are given in the lower right-hand corner of the Positive z Score table (Table A-2), where they are labeled Common Critical Values.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575
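
The table values can be cross-checked numerically. What follows is only a minimal sketch using Python's standard library; the function name clm is an illustrative choice and does not come from the textbook. For a two-sided interval, the multiplier is the z score that leaves half of the leftover probability in each tail.

```python
from statistics import NormalDist

def clm(confidence):
    """Two-sided critical z value for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2            # area left in each tail
    return NormalDist().inv_cdf(1 - tail)  # z score with that area above it

for level in (0.90, 0.95, 0.99):
    print(f"CLM{level:.0%} = {clm(level):.3f}")
```

The computed 99% value is closer to 2.576 than to the table's 2.575. The difference is far too small to change a margin of error rounded to the nearest tenth of a percent.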

Let's take some polling data from last year's election and find the margins of error.

Final poll from Florida - 2008
n = 678
p-hat(Obama) = 49%
p-hat(McCain) = 48%
p-hat(undecided or other candidates) = 3%

Margin of error for Obama = 1.96*sqrt(.49*.51/678) = 0.037629145... ~ 3.8%
Margin of error for McCain = 1.96*sqrt(.48*.52/678) = 0.037606552... ~ 3.8%

The margin of error is typically rounded to the nearest tenth of a percent, and unless many respondents are undecided or support other candidates, it is very common in a two-person race that the margin of error for each candidate will round to the same number.

The correct sentence to explain the preceding data would be as follows: If the election were held the day the poll was taken, we are 95% confident that Obama would get between 45.2% and 52.8% of the vote, while McCain would garner between 44.2% and 51.8% of the vote.
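
The arithmetic for the Florida poll can be reproduced with a short script. This is a sketch, not a polling company's actual method; the function names margin_of_error and confidence_interval are made up for illustration, and the default multiplier of 1.96 assumes 95% confidence.

```python
from math import sqrt

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion (z=1.96 means 95% confidence)."""
    return z * sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(p_hat, n, z=1.96):
    """Return (low, high) bounds for the true population proportion."""
    e = margin_of_error(p_hat, n, z)
    return p_hat - e, p_hat + e

n = 678  # sample size of the Florida poll
for name, p_hat in (("Obama", 0.49), ("McCain", 0.48)):
    low, high = confidence_interval(p_hat, n)
    print(f"{name}: margin of error {margin_of_error(p_hat, n):.6f}, "
          f"interval {low:.1%} to {high:.1%}")
```

The script rounds only at the final printing step, which avoids small discrepancies that can creep in when an already rounded margin of error is added to or subtracted from p-hat.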

The final election results in Florida, rounded to the nearest thousand, were as follows.

Obama 4,282,000 51.0%
McCain 4,045,000 48.2%
Other 63,000 0.8%

Both candidates' results fell inside the 95% confidence intervals stated in the final poll. The vote for other candidates is much lower than the 3% in the final poll, but that 3% included undecided voters, who may have chosen one of the two major candidates or not voted at all.

Final poll from North Dakota - 2008
n = 500
p-hat(Obama) = 46%
p-hat(McCain) = 47%
p-hat(undecided or other candidates) = 7%

Margin of error for Obama = 1.96*sqrt(.46*.54/500) = 0.043686461... ~ 4.4%
Margin of error for McCain = 1.96*sqrt(.47*.53/500) = 0.043747973... ~ 4.4%

Again, the two margins of error round to the same tenth of a percent.

The correct sentence to explain the preceding data would be as follows: If the election were held the day the poll was taken, we are 95% confident that Obama would get between 41.6% and 50.4% of the vote, while McCain would garner between 42.6% and 51.4% of the vote.

The final election results in North Dakota, rounded to the nearest thousand, were as follows.

Obama 141,000 44.6%
McCain 169,000 53.5%
Other 6,000 1.9%

While Obama's result was inside the 95% confidence interval stated in the final poll, McCain did better than expected: his 53.5% is above the 51.4% upper end of his interval. 95% confidence means the interval will miss the true value about 5% of the time, which is about 1 chance in 20 of being wrong. If the polling company had used the 99% confidence level, which published polls almost never do, the numbers would have been as follows.

Margin of error for Obama = 2.575*sqrt(.46*.54/500) = 0.057394203... ~ 5.7%
Margin of error for McCain = 2.575*sqrt(.47*.53/500) = 0.057475015... ~ 5.7%

McCain's 53.5% is still above the high end of the 99% confidence interval, which is 47% + 5.7% = 52.7%.
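
The comparison of the North Dakota poll with the actual result can be checked at both confidence levels. The sketch below is only an illustration; the helper name covers is hypothetical, and the poll and election figures are the ones quoted in these notes.

```python
from math import sqrt

def covers(p_hat, n, z, actual):
    """True if the interval p_hat +/- z*sqrt(p_hat*(1-p_hat)/n) contains actual."""
    e = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - e <= actual <= p_hat + e

n = 500                                     # North Dakota poll sample size
poll = {"Obama": 0.46, "McCain": 0.47}      # p-hat from the final poll
actual = {"Obama": 0.446, "McCain": 0.535}  # election results
multipliers = {"95%": 1.96, "99%": 2.575}

for label, z in multipliers.items():
    for name, p_hat in poll.items():
        verdict = "inside" if covers(p_hat, n, z, actual[name]) else "outside"
        print(f"{label} interval for {name}: actual result is {verdict}")
```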

In class, we took samples of m&m's. Here are the totals for the second class, with both samples put together.

n = 1700
p-hat(red) = 221/1700 = 13.0%
p-hat(blue) = 382/1700 ~ 22.5%

sp-hat(red) = sqrt(.13*.87/1700) = 0.008156556...
sp-hat(blue) = sqrt(.225*.775/1700) = 0.010127859...

Here are the confidence intervals for the percentage of red m&m's in the current world population of milk chocolate m&m's, or at least those manufactured in Hackettstown, New Jersey.

90% confidence interval: 13% +/- 1.645*sqrt(.13*.87/1700) = 13% +/- 1.3% = 11.7% to 14.3%
95% confidence interval: 13% +/- 1.96*sqrt(.13*.87/1700) = 13% +/- 1.6% = 11.4% to 14.6%
99% confidence interval: 13% +/- 2.575*sqrt(.13*.87/1700) = 13% +/- 2.1% = 10.9% to 15.1%
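
All three intervals can be produced at once from the raw counts. This is a sketch under the assumption that the table multipliers above are used as given; the function name proportion_intervals is invented for illustration.

```python
from math import sqrt

def proportion_intervals(count, n, multipliers):
    """Map each confidence label to (low, high) bounds for the true proportion."""
    p_hat = count / n
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
    return {label: (p_hat - z * se, p_hat + z * se)
            for label, z in multipliers.items()}

CLM = {"90%": 1.645, "95%": 1.96, "99%": 2.575}

# 221 red m&m's out of 1700 in the combined samples
for label, (low, high) in proportion_intervals(221, 1700, CLM).items():
    print(f"{label} confidence interval: {low:.1%} to {high:.1%}")
```

Swapping in a different count gives the intervals for another color. Note that the script uses the unrounded p-hat, so for some counts the last digit can differ slightly from a calculation that rounds p-hat first.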

Practice problems:

Find the confidence intervals for the true percentage of blue milk chocolate m&m's, for 90% confidence, 95% confidence and 99% confidence. Do the 99% confidence intervals for blue and red overlap?

Answers in the comments.

1 comment:

Prof. Hubbard said...

90% confidence interval:
22.5% +/- 1.645*sqrt(.225*.775/1700) =
22.5% +/- 1.7% = 20.8% to 24.2%

95% confidence interval:
22.5% +/- 1.96*sqrt(.225*.775/1700) =
22.5% +/- 2.0% = 20.5% to 24.5%

99% confidence interval:
22.5% +/- 2.575*sqrt(.225*.775/1700) =
22.5% +/- 2.6% = 19.9% to 25.1%

The two 99% confidence intervals do not overlap, since blue m&m's are expected to be at least 19.9%, while red m&m's are expected to be at most 15.1%. Recall that both of these intervals are given 99% confidence, so there is 1 chance in 100 for each sample that it missed the mark. Still, red would have to be much higher than expected or blue much lower for there actually to be more red milk chocolate m&m's in the world than blue at this time.
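
The overlap question can also be settled mechanically. Here is a minimal sketch, with a hypothetical helper named intervals_overlap, applied to the two 99% intervals as stated in percent.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (low, high) and b = (low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

red_99 = (10.9, 15.1)   # 99% interval for red, in percent
blue_99 = (19.9, 25.1)  # 99% interval for blue, in percent

print("Overlap:", intervals_overlap(red_99, blue_99))
print(f"Gap between the intervals: {blue_99[0] - red_99[1]:.1f} percentage points")
```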