Saturday, March 1, 2014

Notes for February 25 and 27


The dependent probabilities in a 52 card deck

One of the simplest mathematical models of dependency is sampling without replacement, which is the way most card games, lotteries and the game of Bingo work. You have a set of outcomes which get effectively randomized and a trial is performed, meaning a card is taken from the deck or a ping pong ball is removed from the hopper or a bingo marker is removed from the spinner. Once an item is removed, the number of possible outcomes has been reduced by one and the probabilities for success and failure of certain outcomes change.

Looking for an ace: There are 52 cards in a standard deck and 4 of them are aces. If I draw a card from a randomized deck, the chances are 4/52 = 1/13 ~= 7.7% that the card will be an ace. What are the chances the second card is an ace?

That depends on the first card.

Probability that the second card is an ace, given the first card is an ace is 3/51 = 1/17 ~= 5.9%.

Probability that the second card is an ace, given the first card is not an ace is 4/51 ~= 7.8%.
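The two conditional probabilities above can be checked with exact fractions. This is only a sketch, and the variable names are made up for illustration. The last line goes one step past the notes and weights the two cases by how likely the first card is to be an ace or not, which gives the chance of an ace on the second card when we do not know the first card.

```python
from fractions import Fraction

aces, deck = 4, 52  # a standard deck, as in the notes

p_first_ace = Fraction(aces, deck)

# One card is gone, so the denominator drops to 51 in both cases.
p_second_given_ace = Fraction(aces - 1, deck - 1)   # one ace is already used up
p_second_given_not_ace = Fraction(aces, deck - 1)   # all four aces are still in the deck

# Weight each case by the chance of that first card (not part of the notes).
p_second_ace = (p_first_ace * p_second_given_ace
                + (1 - p_first_ace) * p_second_given_not_ace)

print(p_second_given_ace, float(p_second_given_ace))
print(p_second_given_not_ace, float(p_second_given_not_ace))
print(p_second_ace)
```

Before we look at the first card, the second card is just as likely to be an ace as the first one was. The dependence only shows up once we know what the first card was.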

This is the opposite of the mathematical model of free throw shooting, where we re-calculate the probabilities by adding the most recent make or miss into the percentage, so a miss brings the odds down and a make brings the odds up. With cards, not getting an ace makes the odds a little better next time, and getting an ace makes the odds worse.


The formula for the dependent probability model of sampling without replacement is given at the left. The two numbers in parentheses are a binomial coefficient, the number you get when you use nCr on your calculator, which I pronounce "n choose r" in class. The pairs of numbers that look like a base and an exponent, except that the exponent is underlined, follow the convention developed by Donald Knuth at Stanford for writing the numbers that you get on your calculator using the nPr function, which I pronounce "n fall r", referring to the name "the falling factorial".

If we think about a deck of cards, the lowercase letters refer to the hand: n is the size of the hand, r is the number of successful trials (r for right) and w is the number of unsuccessful trials (w for wrong), so r+w=n. The uppercase letters refer to the deck: T is the size of the deck, G is the number of cards we consider a success if we draw them and B is the number of cards we consider a failed trial if we draw them. The letter T stands for Total, G for Good and B for Bad. Again, we have an equation, G+B=T.
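For anyone who wants the formula typed out, here is how it could be written in LaTeX. This is a reconstruction from the description above and from the calculator keystrokes used in the examples, so the layout is an assumption, not a copy of the original picture. The underlined exponent is the falling factorial.

```latex
P(r \text{ good and } w \text{ bad in } n \text{ draws})
  = \binom{n}{r}\,
    \frac{G^{\underline{r}}\; B^{\underline{w}}}{T^{\underline{n}}},
\qquad
x^{\underline{k}} = x\,(x-1)\cdots(x-k+1)
```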

Example: If we want to consider drawing a heart a success and anything else a failure, what is the probability of drawing three hearts and two non-hearts in a five card hand from a well-shuffled 52 card deck?

Here are the six numbers we need.
n = 5
r = 3
w = 2
T = 52
G = 13
B = 39

On a TI-30XIIs, here are the keys you would press.

5[prb][right]3×13[prb]3×39[prb]2÷52[prb]5[enter]

The calculator will read as follows.

5 nCr 3*13 nPr 3*39 nPr 2/52 nPr 5
0.081542617

This means the probability of exactly three hearts and two cards of some other suit is about 8.15%.
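The same calculation can be done away from the calculator. Below is a minimal sketch in Python, assuming version 3.8 or later so that `math.comb` (nCr) and `math.perm` (nPr) are available. The function name `prob_without_replacement` is made up for this example.

```python
from math import comb, perm

def prob_without_replacement(r, w, G, B):
    """Chance of exactly r good and w bad cards in a hand of r + w cards.

    G and B are the numbers of good and bad cards in the deck.
    """
    n = r + w  # size of the hand
    T = G + B  # size of the deck
    # Same order as the calculator entry: n nCr r * G nPr r * B nPr w / T nPr n
    return comb(n, r) * perm(G, r) * perm(B, w) / perm(T, n)

p = prob_without_replacement(r=3, w=2, G=13, B=39)
print(round(p, 9))  # 0.081542617
```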

Let's say instead the deck had 10,000 cards and 2,500 hearts. Our numbers would change.
n = 5
r = 3
w = 2
T = 10000
G = 2500
B = 7500

On a TI-30XIIs, here are the keys you would press.

5[prb][right]3×2500[prb]3×7500[prb]2÷10000[prb]5[enter]

The calculator will read as follows.

5 nCr 3*2500 nPr 3*7500 nPr 2/10000 nPr 5
0.0878613102

The difference is small, but the second number is much closer to the odds of 3 successes out of 5 when the probability of success is .25 every time.


5 nCr 3*.25^3*.75^2
0.087890625

The point of this is that as the size of the deck gets larger, dependent and independent probabilities get closer together.
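To watch the two numbers get closer, here is a small sketch that keeps one quarter of the deck as hearts and lets the deck grow. The function names and the deck size of one million are assumptions made for illustration; only the 52 and 10,000 card decks come from the notes.

```python
from math import comb, perm

def dependent(r, w, G, B):
    """Exactly r good and w bad cards when drawing without replacement."""
    return comb(r + w, r) * perm(G, r) * perm(B, w) / perm(G + B, r + w)

def independent(r, w, p):
    """Exactly r successes and w failures when every trial has chance p."""
    return comb(r + w, r) * p**r * (1 - p)**w

target = independent(3, 2, 0.25)
gaps = []
for T in (52, 10_000, 1_000_000):
    G = T // 4  # one quarter of the deck counts as "good"
    p = dependent(3, 2, G, T - G)
    gaps.append(abs(target - p))
    print(f"T = {T:>9,}: dependent = {p:.9f}, gap = {gaps[-1]:.9f}")
print(f"independent: {target:.9f}")
```

Each time the deck gets bigger, the gap between the dependent and independent answers shrinks.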

When we use categorical data, the most important statistic we try to predict is the proportion of a value in the population, which we call p, which we will estimate using the proportion from the sample, known as p-hat.

Again we will create a confidence interval, but the formula for the standard deviation is very different.

sp-hat = sqrt(p-hat * q-hat/n)

The confidence level multipliers for xx% are taken from the z-score table (Table A-2) instead of the t-score table, and this is because the standard deviation for the sample and the standard deviation for the population are expected to be relatively close to one another. The values for the CLMxx% are given on the first page of your yellow sheets in the lower left hand corner.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575
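If the yellow sheets are not handy, the multipliers can be looked up from the standard normal curve. This is a sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `clm` is made up here. For 99% the computer gives about 2.576, while the table value used in these notes is 2.575, which is only a difference in how the table was rounded.

```python
from statistics import NormalDist

def clm(confidence):
    """z-score that leaves (1 - confidence)/2 in each tail of the normal curve."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"CLM{level:.0%} = {clm(level):.3f}")
```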

Example: Consider data sets #1 and #2, and the proportion of males. Let's find the 95% confidence interval for the underlying population, which we will limit to students at Laney who take statistics.

Data set #1:
n = 38
f(males) = 18
p-hat(males) = 18/38 ~= .474
q-hat(males) = 1 - p-hat(males) ~= .526

.474 - 1.96*sqrt(.474*.526/38) < p < .474 + 1.96*sqrt(.474*.526/38)
.315 < p < .633

Given this sample of 38 students, we are 95% confident the percentage of male students taking statistics at Laney is between 31.5% and 63.3%.


Data set #2:
n = 42
f(males) = 12
p-hat(males) = 12/42 ~= .286
q-hat(males) = 1 - p-hat(males) ~= .714

.286 - 1.96*sqrt(.286*.714/42) < p < .286 + 1.96*sqrt(.286*.714/42)
.149 < p < .423

Given this sample of 42 students, we are 95% confident the percentage of male students taking statistics at Laney is between 14.9% and 42.3%.

Data sets #1 and #2 combined:
n = 80
f(males) = 30
p-hat(males) = 30/80 = .375
q-hat(males) = 1 - p-hat(males) = .625

.375 - 1.96*sqrt(.375*.625/80) < p < .375 + 1.96*sqrt(.375*.625/80)
.269 < p < .481


Given this sample of 80 students, we are 95% confident the percentage of male students taking statistics at Laney is between 26.9% and 48.1%.

Notice how much our intervals disagree with one another. This is because our best point estimates from the three sets are .474, .286 and .375. Also notice that the width of the 95% confidence interval tends to be smaller as n gets bigger. When the sample size is 38, the width of the confidence interval is .318. At n = 42, it is .274 wide. At n = 80, the width is .212. The most common way to make a confidence interval narrower is to increase the size of the sample.
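The three intervals can be reproduced with a few lines of Python. This is a sketch with a made-up function name, `proportion_interval`. It does not round p-hat before plugging it in, so an endpoint or a width can differ from the hand calculations by about .001.

```python
from math import sqrt

def proportion_interval(successes, n, clm=1.96):
    """Confidence interval for a proportion: p-hat +/- CLM * sqrt(p-hat*q-hat/n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    margin = clm * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# (number of males, sample size) for each data set in the notes
samples = {"Data set #1": (18, 38), "Data set #2": (12, 42), "Combined": (30, 80)}

results = {}
for name, (males, n) in samples.items():
    low, high = proportion_interval(males, n)
    results[name] = (low, high)
    print(f"{name}: {low:.3f} < p < {high:.3f}, width {high - low:.3f}")
```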

There are two other ways to change the width. If you ask for a higher confidence level, the interval will get wider. If p and q are close to 50%, the confidence interval will be wider than if they are both far away from 50%.
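Both effects are easy to see by holding the sample size still and changing one thing at a time. The sketch below fixes n = 80. The function name `interval_width` and the particular values of p-hat tried are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def interval_width(p_hat, n, confidence):
    """Full width of the confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # confidence level multiplier
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

n = 80

# Same p-hat, higher and higher confidence level.
for confidence in (0.90, 0.95, 0.99):
    print(f"p-hat = .375, {confidence:.0%}: width {interval_width(0.375, n, confidence):.3f}")

# Same confidence level, p-hat moving toward 50%.
for p_hat in (0.10, 0.30, 0.50):
    print(f"p-hat = {p_hat:.2f}, 95%: width {interval_width(p_hat, n, 0.95):.3f}")
```

The widths go up in both loops: first as the confidence level rises, then as p-hat moves toward 50%.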

