Statistics on a budget: Notes for February 18 and 20

A probability is a number between 0 and 1, inclusive, and is represented by p(event) if it describes a population or p-hat(event) if it is estimated from a sample.

A probability space or event space is a list of what we call the simple events. The list follows two rules.

1. Every simple event is mutually exclusive from every other simple event, which means they cannot both happen simultaneously. For example, if I flip one coin, it cannot land "heads" and "tails" simultaneously.

2. When we add up the probabilities of all the simple events, the sum is 1. What this means is that the simple events (sometimes called simple outcomes) cover every possibility.

There are ways to create new probabilities from a set of simple events. The first methods we will discuss are AND, OR and NOT. Let's start with the most basic, NOT.

p(NOT x) is the probability that the event x will not happen. It is sometimes written as q(x) instead. It is always true that

p(x) + p(NOT x) = p(x) + q(x) = 1

Example #1: If we flip a coin, we have two possibilities, heads and tails. We usually assume that p(heads) = .5 and p(tails) = .5, but that doesn't have to be true. What is true in cases like this, with a categorical variable that has only two legitimate values, is that p(NOT heads) = q(heads) = p(tails), and p(heads) + p(tails) = 1. This means if p(heads) = .51, then q(heads) = 1 - .51 = .49.

If we roll a six-sided die, we have six possible outcomes, a 1, a 2, a 3, a 4, a 5 or a 6. The probability for not rolling a 1 is

p(NOT 1) = p(2) + p(3) + p(4) + p(5) + p(6)

Because these are simple events, we don't have to worry about overlap. And again, instead of calling it p(NOT 1), we can call it q(1).
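The die example can be checked with a few lines of Python. This is only a sketch, and it assumes a fair die (each face has probability 1/6), which the notes do not require. The name die is made up for the illustration.

```python
from fractions import Fraction

# Assumption: a fair die, so each of the six simple events has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Add up the simple events that are not a 1.
p_not_1 = sum(p for face, p in die.items() if face != 1)

# The complement rule gives the same answer with less work.
q_1 = 1 - die[1]

print(p_not_1, q_1)  # 5/6 5/6
```

Using fractions instead of decimals keeps the arithmetic exact, so the two answers can be compared directly.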

If two events A and B are mutually exclusive, then

p(A OR B) = p(A) + p(B)

If they are not mutually exclusive, that means p(A AND B) does not equal zero, and the rule for finding OR changes to

p(A OR B) = p(A) + p(B) - p(A AND B)

This is most easily explained with a contingency table. Let's say we have two variables for a game, one called Result (which is either a win or a loss) and one called Setting (which is either home or away). The table below lists wins and losses for the Warriors so far this season, broken into home record and away record.

      |  H |  A | Total
W     | 16 | 15 | 31
L     | 10 | 12 | 22
Total | 26 | 27 | 53 = grand total

p(Home) = 26/53, which rounded to the nearest thousandth is .491
p(Away) = 27/53, which rounded to the nearest thousandth is .509. Notice that p(Home) = q(Away) and vice versa.
p(Wins) = 31/53, which rounded to the nearest thousandth is .585
p(Losses) = 22/53, which rounded to the nearest thousandth is .415. Notice that p(Wins) = q(Losses) and vice versa.

p(Wins AND Home) = 16/53, which rounded to the nearest thousandth is .302. What this represents is the number of home wins divided by the total number of games.

p(Wins OR Home) is the probability that a game picked at random is either a home game or a win. We will get this by adding all the wins to all the home games, but we have to subtract the home wins, because they were counted twice.

p(Wins OR Home) = 31/53 + 26/53 - 16/53 = 41/53 or .774 rounded to the nearest thousandth.
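Here is one way the AND and OR calculations might be sketched in Python, using the win-loss counts from the table. The dictionary games and the helper p are names invented for this example, not anything standard.

```python
from fractions import Fraction

# Counts copied from the table: (result, setting) -> number of games.
games = {("W", "H"): 16, ("W", "A"): 15,
         ("L", "H"): 10, ("L", "A"): 12}
total = sum(games.values())  # 53 games, the grand total

def p(condition):
    """Probability that a game picked at random satisfies condition(result, setting)."""
    count = sum(n for (result, setting), n in games.items()
                if condition(result, setting))
    return Fraction(count, total)

p_wins = p(lambda r, s: r == "W")
p_home = p(lambda r, s: s == "H")
p_wins_and_home = p(lambda r, s: r == "W" and s == "H")
p_wins_or_home = p(lambda r, s: r == "W" or s == "H")

# The addition rule agrees with counting the cells directly.
assert p_wins_or_home == p_wins + p_home - p_wins_and_home

print(p_wins_or_home, round(float(p_wins_or_home), 3))  # 41/53 0.774
```

Counting the cells directly never double counts the home wins, which is why it matches the formula only after the subtraction.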

Besides AND, OR and NOT, we have the qualifier GIVEN. In a contingency table, this means we only look at the numbers in a single row or column.

p(Wins GIVEN Home) = 16/26 or .615 rounded to the nearest thousandth.

p(Home GIVEN Wins) = 16/31 or .516 rounded to the nearest thousandth.
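As a rough sketch, GIVEN just changes the denominator from the grand total to the total of one row or column. The variable names below are chosen for the example.

```python
from fractions import Fraction

# Counts from the Warriors table.
home_wins, home_losses = 16, 10
away_wins = 15

home_games = home_wins + home_losses  # the Home column total, 26
all_wins = home_wins + away_wins      # the Wins row total, 31

# GIVEN Home: only the Home column counts as the denominator.
p_wins_given_home = Fraction(home_wins, home_games)

# GIVEN Wins: only the Wins row counts as the denominator.
p_home_given_wins = Fraction(home_wins, all_wins)

print(round(float(p_wins_given_home), 3))  # 0.615
print(round(float(p_home_given_wins), 3))  # 0.516
```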

Notice that p(Wins) = .585 but p(Wins GIVEN Home) = .615. When these numbers are different, we say that the two categories Wins and Home are dependent. If they were the same, we would say the categories are independent. A contingency table where the categories are independent might look like this:

      |  H |  A | Total
W     | 16 |  8 | 24
L     | 10 |  5 | 15
Total | 26 | 13 | 39 = grand total

Now, p(Wins) = p(Wins GIVEN Home) = p(Wins GIVEN Away). This hypothetical win-loss record is at the same proportion whether on the road or at home.
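The independence test can be written as a small function. This is a sketch with a made-up function name, and it checks for exact equality, which real data will almost never show.

```python
from fractions import Fraction

def wins_independent_of_home(home_wins, away_wins, home_losses, away_losses):
    """True when p(Wins) equals p(Wins GIVEN Home) exactly."""
    total = home_wins + away_wins + home_losses + away_losses
    p_wins = Fraction(home_wins + away_wins, total)
    p_wins_given_home = Fraction(home_wins, home_wins + home_losses)
    return p_wins == p_wins_given_home

print(wins_independent_of_home(16, 15, 10, 12))  # False, the real record
print(wins_independent_of_home(16, 8, 10, 5))    # True, the hypothetical record
```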

Binomial distribution of an independent variable

If I say someone is a 70% free throw shooter, is every free throw attempt independent of what happened before? Often, we set up such an experiment assuming independence just to make our work simpler, but the human factor is involved, so in reality it's very likely to be dependent. Some people get frustrated after a few misses and will do worse. Others will learn from the mistakes of a few misses and figure out what they are doing wrong and make improvements. A player might be having a bad day for some reason, or might instead have excellent concentration or just really good luck that day. But again, these kinds of experiments are often set up as though each free throw trial is independent of what came before.

Let's look at flipping coins. A list of all possible events is called the event space. Here are some examples of event spaces.

Event space for flipping one coin
Heads (H)
Tails (T)

ways to get one head = 1
ways to get no heads = 1

Event space for flipping two coins
HH
HT
TH
TT

ways to get two heads = 1
ways to get one head = 2
ways to get no heads = 1

Event space for flipping three coins
HHH
HHT
HTH
HTT
THH
THT
TTH
TTT

ways to get three heads = 1
ways to get two heads = 3
ways to get one head = 3
ways to get no heads = 1
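The event spaces above can also be generated by a program instead of typed by hand. A minimal sketch, with ways_by_heads as an invented helper name:

```python
from collections import Counter
from itertools import product

def ways_by_heads(n_coins):
    """List every outcome of n_coins flips and count outcomes by number of heads."""
    event_space = ["".join(flips) for flips in product("HT", repeat=n_coins)]
    ways = Counter(outcome.count("H") for outcome in event_space)
    return event_space, ways

event_space, ways = ways_by_heads(3)
print(event_space)
# ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

# Ways to get three, two, one and no heads.
print([ways[heads] for heads in range(3, -1, -1)])  # [1, 3, 3, 1]
```

Changing the 3 to a larger number shows how quickly the event space grows: it doubles with every added coin.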

The list of numbers of ways to get r successes in n trials is often written in the pattern of the picture shown here, and this pattern is called Pascal's Triangle, at least in most of the world. The Italians call it Tartaglia's Triangle and the Chinese call it Yanghui's Triangle. None of these people actually invented it or claimed to have invented it. It's been around since before the time of Christ, and it has been studied all around the world.

While it is very common to see it presented in the form here as an equilateral triangle, it can also be presented where the first numbers in each row are lined up straight as follows

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
... etc.
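The rows can be built one at a time, because each number inside a row is the sum of the two numbers above it in the previous row. A short sketch, with pascal_rows as a made-up function name:

```python
def pascal_rows(n_rows):
    """Return the first n_rows rows of the triangle, starting with row 0."""
    rows = [[1]]
    while len(rows) < n_rows:
        prev = rows[-1]
        # Each inner number is the sum of the two neighbors above it.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows[:n_rows]

# Prints the five rows listed above, lined up on the left.
for row in pascal_rows(5):
    print(*row)
```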

It is standard to start counting the top row as row 0, and the leftmost column as column 0. For example, the 6 we see in the middle of the last row I typed in is row 4, column 2. Instead of having a copy of Pascal's Triangle around, our calculators have these numbers available. On Texas Instruments calculators, the function is under the probability menu. On the TI-30XIIs, the way to get that 6 is to type

4 [prb][right arrow]2[enter]

The calculator will read

4 nCr 2
6

All scientific calculators should have this function available, but all of them are slightly different. The TI-89 writes it as nCr(4,2) and Casio calculators write it as 4 C 2. I will pronounce it "4 choose 2", and when I type on the blog, I will type C(4,2). When I write it on the board or on tests, I will put a 4 on top of a 2 and surround both numbers with large parentheses. These numbers are called the binomial coefficients.
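If you have Python instead of a calculator, the same number is available from the standard library function math.comb, which requires Python 3.8 or later.

```python
import math

# "4 choose 2", the number in row 4, column 2.
print(math.comb(4, 2))  # 6

# All of row 4, columns 0 through 4.
print([math.comb(4, r) for r in range(5)])  # [1, 4, 6, 4, 1]
```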

The formula for finding the probability of exactly r successes in n independent trials, where the probability of success on any single trial is p, is

C(n,r)*p^r*q^w, where q = 1 - p and w = n - r

In some books, they don't use the letter q, instead replacing it with (1-p). Likewise, sometimes w is replaced with (n-r). I use the extra letters and include the relationships between them. The letters r and w stand for right and wrong. The letters p and q are standard in probability texts for the probability of a success or a failure.

Let's do an example. You are given a four-question multiple choice test, each question having five possible answers. The test is given in a language you do not read, so all you can do is guess. Each question is independent from the others, meaning that if C is the right answer to the first question, it's also possibly the answer to the second. The probability p of a correct guess is 1 chance in 5, or .2. The probability of failure q is 1-.2 = .8, and of course p + q = 1.

Probability of no correct answers = C(4,0)*.2^0*.8^4 = .4096
Probability of exactly one correct answer = C(4,1)*.2^1*.8^3 = .4096
Probability of exactly two correct answers = C(4,2)*.2^2*.8^2 = .1536
Probability of exactly three correct answers = C(4,3)*.2^3*.8^1 = .0256
Probability of four correct answers = C(4,4)*.2^4*.8^0 = .0016
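The five lines above all follow one pattern, so they can be produced by a single function. This is a sketch only: binomial_probability is a name made up here, and it uses math.comb for the binomial coefficient.

```python
import math

def binomial_probability(n, p, r):
    """Probability of exactly r successes in n independent trials."""
    q = 1 - p  # probability of failure on a single trial
    w = n - r  # number of failures
    return math.comb(n, r) * p**r * q**w

# The four-question test with a .2 chance of guessing each answer.
for r in range(5):
    print(r, round(binomial_probability(4, 0.2, r), 4))
```

The results are rounded to four places because decimals like .2 and .8 are not stored exactly by the computer.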

The expected value of correct answers is n*p, so in this case it's 4*.2 = .8, which isn't a score you can actually get, since you can't get a fraction of a correct answer on a multiple choice test. The expected value in this case says that over the long run, a test like this should average .8 right answers out of four. As we can see, the most likely thing to happen is actually a tie for first, where getting either no answers right or one answer right both have a probability of about 41%. If you need to get three answers right to pass the test, the probability of getting either three or four right is less than 3%, and the probability of getting everything right by chance is a very slim 16 chances in 10,000.

If you have a TI-83 or TI-84, there is a function under the distribution menu called binompdf(n,p,r). All you have to do is enter the function, then the three values in the order given, separated by commas.
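Both claims in that paragraph can be checked from the list of probabilities. In this sketch the expected value is computed the long way, by weighting each possible score by its probability, to show that it agrees with n*p. The variable names are chosen for the example.

```python
import math

n, p = 4, 0.2
probabilities = [math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

# Long-run average: weight each possible score by its probability.
expected = sum(r * prob for r, prob in enumerate(probabilities))
print(round(expected, 4))  # 0.8

# Passing means getting three or four right.
p_pass = probabilities[3] + probabilities[4]
print(round(p_pass, 4))  # 0.0272
```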

The function for three right in four trials with probability .2 at each trial is binompdf(4, .2, 3), which as we see above is .0256.

Practice problem.
The test is changed. There are now five multiple choice questions and four choices for each, but it is still given in a language you do not read.

Round the probabilities to four places after the decimal.

1. What is the expected value?

2. What is the probability of no correct answers?

3. What is the probability of exactly one correct answer?

4. What is the probability of exactly two correct answers?

5. What is the probability of exactly three correct answers?

6. What is the probability of exactly four correct answers?

7. What is the probability of five correct answers?

Answers in the comments.

1 comment:

Prof. Hubbard said...

1. What is the expected value?
5 * .25 = 1.25

2. What is the probability of no correct answers?
C(5,0)*.25^0*.75^5 ~= .2373

3. What is the probability of exactly one correct answer?
C(5,1)*.25^1*.75^4 ~= .3955

4. What is the probability of exactly two correct answers?
C(5,2)*.25^2*.75^3 ~= .2637

5. What is the probability of exactly three correct answers?
C(5,3)*.25^3*.75^2 ~= .0879

6. What is the probability of exactly four correct answers?
C(5,4)*.25^4*.75^1 ~= .0146

7. What is the probability of five correct answers?
C(5,5)*.25^5*.75^0 ~= .0010

(The sum of the rounded probabilities = 1.)

February 20, 2014 at 9:38 AM
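The practice answers can be double-checked the same way. A brief sketch for five questions with four choices each, so n = 5 and p = .25:

```python
import math

n, p = 5, 0.25

# Probability of exactly r correct answers, rounded to four places.
rounded = [round(math.comb(n, r) * p**r * (1 - p)**(n - r), 4)
           for r in range(n + 1)]

print(n * p)                   # 1.25, the expected value
print(rounded)                 # zero correct through five correct
print(round(sum(rounded), 4))  # 1.0
```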

Thursday, February 20, 2014
