Thursday, April 2, 2009

Class notes for 4/1

We have learned how to find a confidence interval for a proportion in a population. Unlike numerical data, where the confidence level multipliers, known in this class as CLMxx%, are taken from the t-score tables, the CLMxx% for proportions are from the z-score table.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575
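
The multipliers above can be checked without a printed table. The following is a minimal sketch, assuming Python's standard library `statistics.NormalDist`; the helper name `clm` is made up for this illustration and is not notation from the class.

```python
from statistics import NormalDist

def clm(confidence_level):
    """Two-sided z multiplier for a confidence level such as 0.95."""
    # Split the leftover probability evenly between the two tails.
    tail = (1 - confidence_level) / 2
    # inv_cdf returns the z-score with the given area to its left.
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"CLM{level:.0%} = {clm(level):.3f}")
```

The 99% value computed this way is closer to 2.576. The 2.575 used in class is the usual table reading halfway between z = 2.57 and z = 2.58, and the difference rarely matters in practice.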

The most common use of confidence intervals for proportions is in opinion polls, though when people in the media talk about "margin of error", they rarely say that the margin of error is associated with a confidence level, which in opinion polls is always 95%. For example, in early 2008 the opinion polls before the New Hampshire primary showed Barack Obama leading a multi-candidate Democratic race, but the primary was won by Hillary Clinton. Four candidates polled well over 1% of the voters: Clinton, Obama, John Edwards and Bill Richardson. Clinton's final total was not within the 95% confidence interval set for her by the poll results, but the other three finished within theirs. These things happen. The idea of the 95% confidence interval is that it leaves open the possibility that it will get the numbers wrong about 1 time in every 20.

Final opinion poll (true result in parentheses)
n = 500
Obama: 39% +/- 4.3% (36.5%, inside the confidence interval)
Clinton: 34% +/- 4.2% (39.1%, outside the confidence interval)
Edwards: 15% +/- 3.1% (16.9%, inside the confidence interval)
Richardson: 4% +/- 1.7% (4.6%, inside the confidence interval)
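
The margins of error in this list come from CLM95% * sqrt(p-hat * (1 - p-hat) / n). Here is a small sketch that recomputes them and checks which true results landed inside the interval. The function names and the dictionary layout are choices made for this illustration only.

```python
from math import sqrt

CLM95 = 1.96
N = 500

def margin_of_error(p_hat, n, clm=CLM95):
    """Margin of error for a sample proportion at the given multiplier."""
    return clm * sqrt(p_hat * (1 - p_hat) / n)

def inside_interval(p_hat, actual, n):
    """True if the actual result is within p_hat +/- the margin of error."""
    return abs(actual - p_hat) <= margin_of_error(p_hat, n)

# candidate: (final poll share, actual primary share), from the list above
poll = {
    "Obama": (0.39, 0.365),
    "Clinton": (0.34, 0.391),
    "Edwards": (0.15, 0.169),
    "Richardson": (0.04, 0.046),
}

for name, (p_hat, actual) in poll.items():
    moe = margin_of_error(p_hat, N)
    verdict = "inside" if inside_interval(p_hat, actual, N) else "outside"
    print(f"{name}: {p_hat:.0%} +/- {moe:.1%} ({actual:.1%}, {verdict})")
```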

Obama did worse than predicted; everyone else did better, especially Clinton, whose 39.1% was about 0.9 points above the 38.2% upper end of her 95% confidence interval. Again, remember that the 95% confidence interval is not a promise. It says it will be right about 19 times out of 20, but it cannot say when the 1 time in 20 that it is wrong will happen. If Bill Richardson had outperformed expectations and gotten 9% of the vote, it would still have been a mistake by the polls, but no one would have paid much attention, because it wouldn't have changed who finished first.

Of all the websites, TV shows and newspapers that report on opinion polls, the only one that consistently explains them correctly is The New York Times, which has a standard sidebar it puts next to opinion poll results.


Finding n when the margin of error is given

Opinion polling companies always report the margin of error, but the public does not completely understand it, largely because the media do not explain it. In general, the lower the margin of error the better. Given that you can't change the industry standard confidence level of 95%, the simplest way to guarantee a low margin of error is to increase the sample size. While this is simple, it's also expensive. If the polling company in New Hampshire had wanted all the margins of error to be no more than 3.0%, it could have solved the margin of error formula for n, which gives n >= CLM95%^2 * p-hat * (1 - p-hat) / MoE95%^2, with MoE95% = .03 and p-hat = 39%, the best estimate for the leader, who they assumed was Barack Obama. The leader's percentage is used because p-hat * (1 - p-hat) is largest for the candidate closest to 50%, so that candidate has the largest margin of error.

n >= 1.96^2*.39*.61/.03^2 = 1,015.46...

This says a sample of 1,016 likely voters would have produced a margin of error for each candidate of no more than 3.0%. If we did not have the previous information that the highest percentage expected was about 39%, we would have had to assume someone might be close to 50%, which would increase the needed sample size.

n >= 1.96^2*.5*.5/.03^2 = 1,067.11...

In this case, n would have to be 1,068 to guarantee the margin of error of 3.0% or less.
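
The two sample size calculations above can be reproduced with a short function. This is only a sketch: the name `required_sample_size` and its default arguments are assumptions made for illustration, with p-hat defaulting to the worst case of 50%.

```python
from math import ceil

def required_sample_size(moe, p_hat=0.5, clm=1.96):
    """Smallest whole n whose margin of error is at most `moe`."""
    # Solve moe >= clm * sqrt(p_hat * (1 - p_hat) / n) for n, then round up,
    # because rounding down would leave the margin slightly too large.
    return ceil(clm**2 * p_hat * (1 - p_hat) / moe**2)

print(required_sample_size(0.03, p_hat=0.39))  # leader expected near 39%
print(required_sample_size(0.03))              # no prior estimate, assume 50%
```

Rounding up rather than to the nearest whole number is what turns 1,015.46 into 1,016 and 1,067.11 into 1,068.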


Confidence of victory

The margin of error is the industry standard, but the people who use these numbers, most especially the news media, really don't understand them very well. Here is a different method, developed by this author, that produces a more useful piece of mathematical information. It is called the confidence of victory method.

Confidence of victory should only be used if the top two vote getters combined are getting 90% or more of the respondents to the opinion poll. So in the New Hampshire primary, we could not use this method. In the final poll taken in New Hampshire before the general election, these were the results.

n = 700
Obama 51%
McCain 44%

Since they add up to 95% of the respondents, we can use the confidence of victory method. What we do is effectively ignore the 5% who are voting for third party candidates, prefer none of the candidates or are still telling pollsters they are undecided. We figure out how many people in the poll said they prefer Obama and how many prefer McCain by multiplying the percents by the size of the poll.

f(Obama) = 700 * .51 = 357
f(McCain) = 700 * .44 = 308

new n = 357+308 = 665

p-hat(Obama) = 357/665 ~ 53.7%
p-hat(McCain) = 308/665 ~ 46.3%

sp-hat = sqrt(.537*.463/665) ~ 1.93%

z(Obama) = (53.7 - 50)%/1.93% ~ 1.91

This says Obama's percentage is about 1.91 standard deviations above 50%. The percentage he gets in the actual election may be higher or lower than what we see here; we assume there's about an even chance he will do better than the final opinion poll and an even chance he will do worse. What the public actually cares about is whether he wins or loses. The confidence of victory method looks up the proportion on the z-score table that corresponds to this z-score. That number is the confidence level we have that the true percentages in the population polled would show the leader in the poll winning the election. In this example, z = 1.91 corresponds to .9719 on our positive z-score table. Because the confidence of victory method is sensitive to small changes, we should round to the nearest percent and use this sentence to describe the results.

If the election had been held when the poll was taken, we are 97% confident that Obama would have held on to the lead shown in the poll and won the election in New Hampshire.
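
The steps of the confidence of victory calculation can be sketched in a few lines. The function below is an illustration, not the author's own code: its name, its argument order and the use of `statistics.NormalDist` in place of the printed z-score table are all assumptions.

```python
from math import sqrt
from statistics import NormalDist

def confidence_of_victory(n, leader_share, trailer_share):
    """Return (z, confidence) that the poll leader is ahead in the population polled."""
    if leader_share + trailer_share < 0.90:
        raise ValueError("top two candidates need at least 90% of respondents")
    # Keep only the respondents who chose one of the top two candidates.
    f_leader = n * leader_share
    f_trailer = n * trailer_share
    new_n = f_leader + f_trailer
    # Leader's share of the two-way race and its standard error.
    p_hat = f_leader / new_n
    s_p_hat = sqrt(p_hat * (1 - p_hat) / new_n)
    # How many standard errors the leader sits above the 50% needed to win.
    z = (p_hat - 0.5) / s_p_hat
    return z, NormalDist().cdf(z)

z, confidence = confidence_of_victory(700, 0.51, 0.44)
print(f"z = {z:.2f}, confidence of victory = {confidence:.0%}")
```

Because the code carries unrounded values through every step, its z-score can differ in the second decimal place from a hand calculation that rounds p-hat and the standard error first. Rounding the final answer to the nearest percent, as the method recommends, hides most of that difference.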

In the actual election, Obama outpolled McCain 55% to 44%, which is to say he did better than expected. Confidence of victory is not concerned with the margin of victory, just whether the candidate favored in the polls wins the actual election.

In 2008, the final polls in the 50 states and Washington D.C. had two states too close to call, Missouri and Indiana. Both elections were very close, called late in the evening, Missouri for McCain and Indiana for Obama. In the other 49 contests where confidence of victory claimed an advantage for one side or the other, 48 contests were won by the person leading in the most recent poll, which is to say the confidence of victory method was vindicated about 98% of the time. The only state where the confidence of victory method did not get the right result was North Carolina. McCain had a 60% confidence of victory in North Carolina, but Obama actually won the state.

In 2004, there were two states that the confidence of victory method got wrong, Ohio and Florida, both of which looked to be favoring John Kerry in the final polls. Of course, 2004 was a much closer election, and either of those states could have turned the tide. 2008 was an electoral college landslide. Even if McCain had won North Carolina, he still would have lost the election.

Practice problems

Here are some final poll numbers from the 2008 election where the totals favoring either Obama or McCain add up to over 90%. Use the confidence of victory method.

Colorado
n = 600
Obama 52%
McCain 45%

Arizona
n = 600
Obama 46%
McCain 50%

Answers in the comments.

1 comment:

Prof. Hubbard said...

Colorado
n = 600
Obama 52%
McCain 45%

f(Obama) = 600 * .52 = 312
f(McCain) = 600 * .45 = 270

new n = 312+270 = 582

p-hat(Obama) = 312/582 ~ 53.6%
p-hat(McCain) = 270/582 ~ 46.4%
sp-hat = sqrt(.536*.464/582) ~ 2.1%

z(Obama) = (.536-.5)/.021 ~ 1.74

This corresponds to .9591, so we say the confidence of victory number for Obama in Colorado is 96%.



Arizona
n = 600
Obama 46%
McCain 50%

f(Obama) = 600 * .46 = 276
f(McCain) = 600 * .50 = 300

new n = 276+300 = 576

p-hat(Obama) = 276/576 ~ 47.9%
p-hat(McCain) = 300/576 ~ 52.1%

sp-hat = sqrt(.521*.479/576) ~ 2.1%

z(McCain) = (.521-.5)/.021 ~ 1.00

This corresponds to .8413, so we say the confidence of victory number for McCain in Arizona is 84%.