Thursday, February 20, 2014

Notes for February 18 and 20


A probability is a number between 0 and 1, inclusive. We write it as p(event) when it describes a population and p-hat(event) when it is estimated from a sample.

A probability space or event space is a list of what we call the simple events. The list follows two rules.

1. Every simple event is mutually exclusive from every other simple event, which means they cannot both happen simultaneously. For example, if I flip one coin, it cannot land "heads" and "tails" simultaneously.

2. When we add up the probabilities of all the simple events, the sum is 1. What this means is that the simple events (sometimes called simple outcomes) cover every possible result; nothing outside the list can happen.

There are ways to create new probabilities from a set of simple events. The first methods we will discuss are AND, OR and NOT. Let's start with the most basic, NOT.

p(NOT x) is the probability that the event x will not happen. It is sometimes written as q(x) instead. It is always true that

p(x) + p(NOT x) = p(x) + q(x) = 1

Example #1: If we flip a coin, we have two possibilities, heads and tails. We usually assume that p(heads) = .5 and p(tails) = .5, but that doesn't have to be true. What is true in this case, with a categorical variable that has only two legitimate values, is that p(NOT heads) = q(heads) = p(tails), and the sum is 1. This means if p(heads) = .51, then q(heads) = 1 - .51 = .49.

If we roll a six-sided die, we have six possible outcomes, a 1, a 2, a 3, a 4, a 5 or a 6. The probability for not rolling a 1 is  

p(NOT 1) = p(2) + p(3) + p(4) + p(5) + p(6)

Because these are simple events, we don't have to worry about overlap. And again, instead of calling it p(NOT 1), we can call it q(1).

If two events A and B are mutually exclusive, then

p(A OR B) = p(A) + p(B)

If they are not mutually exclusive, that means p(A AND B) does not equal zero, and the rule for finding OR changes to   

p(A OR B) = p(A) + p(B) - p(A AND B)

This is most easily explained with a contingency table. Let's say we have two variables for a game, one called Result (which is either a win or a loss) and the other called Setting (which is either home or away). The table below lists wins and losses for the Warriors so far this season, broken into home record and away record.

        Home  Away  Total
Wins     16    15     31
Losses   10    12     22
Total    26    27     53 (grand total)

p(Home) = 26/53, which rounded to the nearest thousandth is .491
p(Away) = 27/53, which rounded to the nearest thousandth is .509. Notice that p(Home) = q(Away) and vice versa.
p(Wins) = 31/53, which rounded to the nearest thousandth is .585
p(Losses) = 22/53, which rounded to the nearest thousandth is .415. Notice that p(Wins) = q(Losses) and vice versa.

p(Wins AND Home) = 16/53, which rounded to the nearest thousandth is .302. What this represents is the number of home wins divided by the total number of games.

p(Wins OR Home) is the probability that a game picked at random is either a home game or a win. We will get this by adding all the wins to all the home games, but we have to subtract the home wins, because they were counted twice.

p(Wins OR Home) = 31/53 + 26/53 - 16/53 = 41/53 or .774 rounded to the nearest thousandth.
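
If you want to double-check these numbers with a computer, here is a minimal sketch in Python (the variable names are just illustrative; the counts come from the table above):

from fractions import Fraction

# Counts from the Warriors contingency table above.
wins_home, wins_away = 16, 15
losses_home, losses_away = 10, 12
total = wins_home + wins_away + losses_home + losses_away  # 53 games

p_wins = Fraction(wins_home + wins_away, total)            # 31/53
p_home = Fraction(wins_home + losses_home, total)          # 26/53
p_wins_and_home = Fraction(wins_home, total)               # 16/53

# OR rule for events that are not mutually exclusive:
# p(A OR B) = p(A) + p(B) - p(A AND B)
p_wins_or_home = p_wins + p_home - p_wins_and_home

print(p_wins_or_home, round(float(p_wins_or_home), 3))     # 41/53 0.774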


Besides AND, OR and NOT, we have the qualifier GIVEN. In a contingency table, this means we only look at the numbers in a single row or column.

p(Wins GIVEN Home) = 16/26 or .615 rounded to the nearest thousandth.

p(Home GIVEN Wins) = 16/31 or .516 rounded to the nearest thousandth.

Notice that p(Wins) = .585 but p(Wins GIVEN Home) = .615. When these numbers are different, we say that the two categories Wins and Home are dependent. If they were the same, we would say the categories are independent. A contingency table that was independent might look like this


        Home  Away  Total
Wins     16     8     24
Losses   10     5     15
Total    26    13     39 (grand total)

Now, p(Wins) = p(Wins GIVEN Home) = p(Wins GIVEN Away). This hypothetical win-loss record is at the same proportion whether on the road or at home.
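
Because a GIVEN probability just restricts the denominator to one row or column, the independence check takes only a few lines. A sketch in Python using the hypothetical table above:

from fractions import Fraction

# Hypothetical independent win-loss table above.
wins_home, wins_away = 16, 8
losses_home, losses_away = 10, 5

p_wins = Fraction(wins_home + wins_away, 39)                      # 24/39
p_wins_given_home = Fraction(wins_home, wins_home + losses_home)  # 16/26
p_wins_given_away = Fraction(wins_away, wins_away + losses_away)  # 8/13

# Independence: the GIVEN probabilities equal the overall probability.
print(p_wins == p_wins_given_home == p_wins_given_away)           # True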

Binomial distribution of independent trials


If I say someone is a 70% free throw shooter, is every free throw attempt independent of what happened before? Often, we set up such an experiment assuming independence just to make our work simpler, but the human factor is involved, so in reality it's very likely to be dependent. Some people get frustrated after a few misses and will do worse. Others will learn from the mistakes of a few misses and figure out what they are doing wrong and make improvements. A player might be having a bad day for some reason, or might instead have excellent concentration or just really good luck that day. But again, these kinds of experiments are often set up as though each free throw trial is independent of what came before.

Let's look at flipping coins. A list of all possible events is called the event space. Here are some examples of event spaces.

Event space for flipping one coin
Heads (H)
Tails (T)

ways to get one head = 1
ways to get no heads = 1

Event space for flipping two coins
HH
HT
TH
TT

ways to get two heads = 1
ways to get one head = 2
ways to get no heads = 1

Event space for flipping three coins
HHH
HHT
HTH
HTT
THH
THT
TTH
TTT

ways to get three heads = 1
ways to get two heads = 3
ways to get one head = 3
ways to get no heads = 1
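
For more coins, writing out the event space by hand gets tedious. Here is a short Python sketch that enumerates each event space above and tallies the ways to get each number of heads:

from itertools import product

# List every simple event for n coin flips and count heads.
for n in (1, 2, 3):
    counts = {}
    for outcome in product("HT", repeat=n):
        heads = outcome.count("H")
        counts[heads] = counts.get(heads, 0) + 1
    print(n, "coin(s):", counts)

# 1 coin(s): {1: 1, 0: 1}
# 2 coin(s): {2: 1, 1: 2, 0: 1}
# 3 coin(s): {3: 1, 2: 3, 1: 3, 0: 1}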


The list of numbers of ways to get r successes in n trials is often written in the pattern of the picture shown here, and this pattern is called Pascal's Triangle, at least in most of the world. The Italians call it Tartaglia's Triangle and the Chinese call it Yanghui's Triangle. None of these people actually invented it or claimed to have invented it. It's been around since before the time of Christ, and it has been studied all around the world.

While it is very common to see it presented in the form here as an equilateral triangle, it can also be presented where the first numbers in each row are lined up straight as follows

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
... etc.

It is standard to start counting the top row as row 0, and the leftmost column as column 0. For example, the 6 we see in the middle of the last row I typed in is in row 4, column 2. Instead of having a copy of Pascal's Triangle around, our calculators have these numbers available. On Texas Instruments calculators, the function is under the probability menu. On the TI-30XIIs, the way to get that 6 is to type

4 [prb][right arrow]2[enter]

The calculator will read

4 nCr 2
6

All scientific calculators should have this function available, but the notation varies slightly among them. The TI-89 writes it as nCr(4,2) and Casio calculators write it as 4 C 2. I will pronounce it "4 choose 2", and when I type on the blog, I will type C(4,2). When I write it on the board or on tests, I will put a 4 on top of a 2 and surround both numbers with a large pair of parentheses. These numbers are called the binomial coefficients.
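
If you would rather check these on a computer than a calculator, recent versions of Python (3.8 and later) have the binomial coefficients built into the math module:

from math import comb

# C(4, 2), pronounced "4 choose 2": row 4, column 2 of Pascal's Triangle.
print(comb(4, 2))                      # 6

# An entire row of Pascal's Triangle, counting rows from 0.
print([comb(4, k) for k in range(5)])  # [1, 4, 6, 4, 1]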


The formula for finding the probability of exactly r successes in n independent trials, where the probability of success on any single trial is p, is

p(exactly r successes) = C(n,r) × p^r × q^w, where q = 1 - p and w = n - r

In some books, they don't use the letter q, instead replacing it with (1-p). Likewise, sometimes w is replaced with (n-r). I use the extra letters and include the relationships between them. The letters r and w stand for right and wrong. The letters p and q are standard in probability texts for the probability of a success or a failure.

Let's do an example. You are given a four-question multiple choice test, each question having five possible answers. The test is given in a language you do not read, so all you can do is guess. Each question is independent from the others, meaning that if C is the right answer to the first question, it's also possibly the answer to the second. The probability p of a correct guess is 1 chance in 5, or .2. The probability of failure q is 1 - .2 = .8, and of course p + q = 1.

Probability of no correct answers = C(4,0)*.2^0*.8^4 = .4096
Probability of exactly one correct answer = C(4,1)*.2^1*.8^3 = .4096
Probability of exactly two correct answers = C(4,2)*.2^2*.8^2 = .1536
Probability of exactly three correct answers = C(4,3)*.2^3*.8^1 = .0256
Probability of four correct answers = C(4,4)*.2^4*.8^0 = .0016
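
The same computation sketched in Python, straight from the formula above:

from math import comb

n, p = 4, 0.2    # four questions, 1 chance in 5 of a correct guess
q = 1 - p        # probability of a wrong guess

for r in range(n + 1):
    w = n - r    # number of wrong answers
    print(r, "correct:", round(comb(n, r) * p**r * q**w, 4))

# 0 correct: 0.4096
# 1 correct: 0.4096
# 2 correct: 0.1536
# 3 correct: 0.0256
# 4 correct: 0.0016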

The expected value of correct answers is n*p, so in this case it's 4*.2 = .8, which isn't possible on any single test. You can't get a fraction of correct answers on a multiple choice test. The expected value in this case says that over the long run, a test like this should average .8 right answers out of four. As we can see, the most likely outcomes are actually a tie for first: getting either no answers right or one answer right each has a probability of about 41%. If you need to get three answers right to pass the test, the probability of getting either three or four right is less than 3%, and the chance of getting everything right by guessing is a very slim 16 in 10,000.

If you have a TI-83 or TI-84, there is a function under the distribution menu called binompdf(n,p,r). All you have to do is enter the function, then the three values in the order given, separated by commas.

The function for three right in four trials with probability .2 at each trial is binompdf(4, .2, 3), which as we see above is .0256.


Practice problem.
The test is changed. There are now five multiple choice questions and four choices for each, but it is still given in a language you do not read.

Round the probabilities to four places after the decimal.

1. What is the expected value?

2. What is the probability of no correct answers?

3. What is the probability of exactly one correct answer?

4. What is the probability of exactly two correct answers?

5. What is the probability of exactly three correct answers?

6. What is the probability of exactly four correct answers?

7. What is the probability of five correct answers?

Answers in the comments.
  

Monday, February 17, 2014

Notes for February 11 and 13

Reverse look-up of z scores: Percentiles to z-scores, z-scores to raw scores

So far, we only use the percentile look-up table if we know a data set is normally distributed. In these cases, we are given the average mux and the standard deviation sigmax and find a z-score z(x) for a raw score x using this formula.

z(x) = (x - mux)/sigmax

We can then use this z-score to look up a percentile on the orange look-up table.  

Example: For men, the average height is 70.5 inches (5' 10.5") and the standard deviation is 2.8 inches. What is the z-score for a man 6'0" tall (72 inches)?

z(72) = (72 -70.5)/2.8 = 1.5/2.8 = 0.535714..., which rounded to the nearest hundredth is 0.54. Using our lookup table in row 0.5 and column 0.04, we get .7054, which means a man 6'0" is as tall or taller than 70.54% of the male population. Subtracting from 100%, we can also say that 29.46% of the male population is 6'0" or taller.
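
For anyone checking with a computer, statistics.NormalDist in Python 3.8 and later can stand in for the orange look-up table. A sketch for this example:

from statistics import NormalDist

heights = NormalDist(70.5, 2.8)   # men's heights in inches

z = (72 - 70.5) / 2.8
print(round(z, 2))                # 0.54
print(round(heights.cdf(72), 4))  # 0.7039; the table gives .7054 because z is rounded to 0.54 first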

What if instead we wanted to find the cut-off height for the 75th percentile of men? What we have to find is where the table has two consecutive entries that look like this

... .74xx .75xx ...

What that would mean is that the cutoff for the 75th percentile lies between those two z-scores. Looking on the positive side of the chart we find

0.6 ... .7486 .7517 ...

in the columns 0.07 and 0.08. This means the 75th percentile lies between 0.67 and 0.68. Since .7486 is .0014 below .7500 and .7517 is .0017 above, the cutoff is a little less than halfway between, and 0.675 is a good approximation.

Now that we have a z-score, we use this formula to find the raw score x.

x = mux + z(x) × sigmax

In this case we get

x = 70.5" + 0.665 × 2.8 = 70.5" + 1.862" = 72.362" = 6" 0.4"

This says the 75th percentile of men's heights is just below six feet and one half inch tall.
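
The reverse look-up is also built into NormalDist as inv_cdf, which goes straight from a percentile to a raw score:

from statistics import NormalDist

heights = NormalDist(70.5, 2.8)

# Raw score at the 75th percentile, no table needed.
print(round(heights.inv_cdf(0.75), 2))   # 72.39 inches, matching the estimate above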

What about the percentage of men between 5'11" and 6'2"? Here we find the percentiles for both heights, then subtract the smaller percentage from the larger.

6'2" = 74", and z(74) = (74-70.5)/2.8 = 3.5/2.8 = 1.25

z = 1.25 corresponds to 0.8944

5'11" = 71" and z(71) = (71-70.5)/2.8 = 0.5/2.8 = 0.17857... = 0.18

z = 0.18 corresponds to 0.5714

0.8944 - 0.5714 = 0.3230, which says about 32.3% of men fall between exactly 5'11" and exactly 6'2" tall.
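
The same subtraction idea as a Python sketch:

from statistics import NormalDist

heights = NormalDist(70.5, 2.8)

# Percentage between 5'11" (71") and 6'2" (74"): subtract the smaller
# percentile from the larger.
print(round(heights.cdf(74) - heights.cdf(71), 2))   # 0.32, about 32%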

The Central Limit Theorem 

If we have a normally distributed population and take a random sample of size n, z(x-bar) differs from z(x) because the standard deviation changes: the standard deviation of the sample average is sigmax divided by the square root of n. The simplest way to compute z(x-bar) is to take the z-score the normal way, then multiply by the square root of n, the size of the sample. For example, if 10 men average a height of 6'0", that sample has a different z-score and corresponds to a different proportion than a single 6'0" man does.

z(72) = (72 - 70.5)/2.8 × sqrt(10) = 1.694..., which rounds to 1.69. Looking up row 1.6 and column 0.09, we get .9545. This says a sample of ten men averaging 6'0" tall has a higher average height than 95.45% of all samples of ten men, and only 4.55% of such samples average taller than that.

This rule is called the Central Limit Theorem. We will be using it to try to get estimates of the average of a population using the average of a sample that is assumed to be representative.
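
A sketch of the Central Limit Theorem calculation in Python, done both ways:

from math import sqrt
from statistics import NormalDist

mu, sigma, n = 70.5, 2.8, 10

# Method 1: ordinary z-score times sqrt(n).
z = (72 - mu) / sigma * sqrt(n)
print(round(z, 2))                     # 1.69

# Method 2: the sample average is normal with standard deviation sigma/sqrt(n).
sample_means = NormalDist(mu, sigma / sqrt(n))
print(round(sample_means.cdf(72), 4))  # 0.9549; the table gives .9545 for z rounded to 1.69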

Saturday, February 8, 2014

Notes for 4 February 2014

Two different versions of standard deviation of a numerical data set, sx and sigmax

The two formulas below are the standard deviations for a numerical data set. When we have a sample, we use sx. When the data set is a population, we use sigmax.

sx = sqrt( sum of (x - x-bar)² / (n - 1) )

sigmax = sqrt( sum of (x - mux)² / N )

We get two different values because the denominator is n - 1 in the first case and N in the latter. The reason for this is degrees of freedom.

The idea of degrees of freedom is to count how much information you need to get an answer. For example, if a football game ends in regulation we know that

final score = score in 1st + score in 2nd + score in 3rd + score in 4th

There are five pieces of information, but if you have any four of these numbers, you can find the fifth. Here we would say the degrees of freedom are 4, which is 5 - 1. For example, if we know the Seahawks scored 43 points total in the Super Bowl, and they scored 8 points in the first quarter, 14 points in the second quarter and 14 points in the third, we can figure out how much they scored in the 4th quarter without being told.

score in 4th = final score - score in 1st - score in 2nd - score in 3rd = 43 - 8 - 14 - 14 = 7

This is a situation where the degrees of freedom are n-1, just like with the standard deviation of the sample. The idea is that if we know the average and somehow we get all the scores except for one, we can find the last score by multiplying the average by n then subtracting all the scores we know to find the one score we don't know.
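
That last idea as a quick Python sketch:

# The average and any n-1 of the scores pin down the remaining score.
average = 10.75        # 43 points over 4 quarters
known = [8, 14, 14]    # quarters 1 through 3

last_quarter = average * 4 - sum(known)
print(last_quarter)    # 7.0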

A new way to measure data: z-scores
 
We can compare two sets of data against one another using the averages (x-bar or mux) and the standard deviations (sx or sigmax) with a formula known as the z-score. We subtract the average from the raw score x. If x is above average, we will get a positive number; if x is below average, we will get a negative number. (If x is exactly at the average, we'll get zero.) We then divide by the standard deviation to get the z-score. This tells us how many standard deviations we are away from average, either high or low.

Here is an example using the American and National League final standings last year. The data set we will check out is the number of wins. I will treat both leagues as samples of all of Major League Baseball; the reason there is a difference in the average wins is inter-league play.



American League:
x-bar = 81.3
sx = 13.7

National League:
x-bar = 80.7
sx = 11.1

The AL had slightly more wins, but their standard deviation is higher because the data is more spread out, largely because the Houston Astros were so terrible. In any data set, we can use z-scores to find outliers.

z is greater than 3: The score is very unusually high
z is between 2 and 3: The score is unusually high
z is between -3 and -2: The score is unusually low
z is less than -3: The score is very unusually low

The Astros had 51 wins, so their z-score is (51-81.3)/13.7 = -2.21. They are the only team in either league that can be considered an outlier, either high or low. 

We can also compare teams in the different leagues. For example, both the A's and the Braves had 96 wins, but because they are in different leagues, they won't have the same z-score.

z-score for the A's: (96 - 81.3)/13.7 = 1.07
z-score for the Braves: (96 - 80.7)/11.1 = 1.38

What this would say is that it was more impressive for the Braves to win 96 than for the A's. A big part of this is that the Athletics played 19 games against the Astros, winning 15 and losing 4. The Braves never played the Astros, and that alone means they got the same number of wins against a tougher set of opponents, which accounts for their higher z-score.
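
A sketch of these comparisons in Python:

def z_score(x, mean, sd):
    # How many standard deviations the raw score x sits from the mean.
    return round((x - mean) / sd, 2)

print(z_score(96, 81.3, 13.7))   # A's:    1.07
print(z_score(96, 80.7, 11.1))   # Braves: 1.38
print(z_score(51, 81.3, 13.7))   # Astros: -2.21, unusually low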

Z-scores in normally distributed sets

Not every set can be assumed to be normally distributed. Usually, we will be told a set is normally distributed and be given both mux and sigmax. If we have such a set, we can take a raw score and turn it into a percentile using the look-up table we got on the orange hand-out on Tuesday. Let's take an example.

It is assumed that IQ scores are normally distributed, where the average is 100 and the standard deviation is 15. This would say an IQ of 110 has a z-score of (110-100)/15 = 0.67. Because of our assumption, we can use the look-up table to find the percentile for an IQ of 110.

1. Use the Positive z-score side
2. Look in the row next to the 0.6 label in bold
3. Look in the column labeled 0.07.

The value at that position on the table is .7486. What this means is that 74.86% of the population has an IQ of 110 or less. If we subtract 74.86% from 100%, we get 25.14%. Rounding to the nearest percent, this means the 110 IQ is at about the cutoff for the 75th percentile.

We will discuss this in greater detail on Tuesday, February 11.

Saturday, February 1, 2014

Notes for 30 January 2014

Standard deviation formulas for samples and populations

The new topic on Thursday was standard deviation, a measure of how far spread out a data set is based on the average. (Note that the five number summary is also about the spread of the data, but it is based on the median.) For the first time, not only are there different symbols for standard deviation, sx for the sample and sigmax for a population, but the formulas to derive the values are different as well. (For average, the formulas for x-bar and mux are essentially the same.)

There are different methods for finding the standard deviation (you can see the alternate formulas at this page on the blog), but the one presented here is the simplest computationally:

sigmax = sqrt( (sum of x² - (sum of x)²/N) / N )

sx = sqrt( (sum of x² - (sum of x)²/n) / (n - 1) )

I will not force you to compute these by hand, but if you don't have a calculator with statistics functions, this is the easiest way to do the job.

Let's take the hockey scores data from earlier this week. The length of the list is 22, which we can think of as either n, the size of a sample, or N, the size of a population.

7, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1, 0, 0

These are the x values. The sum is 69.  We also need the x² values and their sum.

49, 36, 25, 25, 25, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 4, 4, 1, 1, 1, 0, 0

The sum of the x² is 289. This means the numerator of both fractions is

289 - 69²/22 = 72.59090909...

The square root of 72.59090909/22 is 1.81647..., which is the value for sigmax.

The square root of 72.59090909/21 is 1.859222..., which is the value for sx.
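
For anyone checking with Python instead of a calculator, the statistics module has both versions built in:

from statistics import pstdev, stdev

goals = [7, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3,
         2, 2, 1, 1, 1, 0, 0]

print(round(pstdev(goals), 4))   # 1.8165, sigmax: divide by N
print(round(stdev(goals), 4))    # 1.8592, sx: divide by n - 1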

The reason for the difference in formulas is a math concept called degrees of freedom, which we will discuss on Tuesday.