Tuesday, April 22, 2014

Notes for April 22 and April 24




Confidence interval for standard deviation of population


The confidence interval formula for sigmax is completely different than the formula for mux, and uses values from a new table, Table A-4, known as the chi squared table. Chi is pronounced like the chi in chiropractor, "kai", not "chee" or "chai".

Note that we won't test for normal distribution of the underlying data here, although the method does assume the population is normally distributed.

Let's give an example. If we have a set where n = 13 and sx = 3.21, we use n-1 as our degrees of freedom. Here is the line from Table A-4 for the row that corresponds to 12.

_____0.995___0.99___0.975___0.95___0.90___0.10____0.05___0.025____0.01___0.005
_12__3.074__3.571___4.404__5.226__6.304__18.549__21.026__23.337__26.217__28.299

The values of chi²Right and chi²Left are taken from the table as follows.

90% confidence: The right value is from the 0.05 column, the left value is from the 0.95 column. (Note that 0.95 - 0.05 = 0.90 or 90%)


95% confidence: The right value is from the 0.025 column, the left value is from the 0.975 column. (Note that 0.975 - 0.025 = 0.95 or 95%)


99% confidence: The right value is from the 0.005 column, the left value is from the 0.995 column. (Note that 0.995 - 0.005 = 0.99 or 99%)

In this instance, here are the confidence intervals for each of the standard percentages of confidence.

90% confidence: 3.21*sqrt(12/21.026) < sigmax < 3.21*sqrt(12/5.226)
2.425 < sigmax < 4.864

95% confidence: 3.21*sqrt(12/23.337) < sigmax < 3.21*sqrt(12/4.404)
2.302 < sigmax < 5.299

99% confidence: 3.21*sqrt(12/28.299) < sigmax < 3.21*sqrt(12/3.074)
2.090 < sigmax < 6.342
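
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names ROW_12, COLUMNS and sigma_interval are made up for this sketch, and the table values are typed in by hand from the row for 12 degrees of freedom shown earlier.

```python
import math

# Row d.f. = 12 of the chi squared table, copied by hand from the notes.
ROW_12 = {"0.995": 3.074, "0.975": 4.404, "0.95": 5.226,
          "0.05": 21.026, "0.025": 23.337, "0.005": 28.299}

# Columns that give (right value, left value) for each confidence level.
COLUMNS = {90: ("0.05", "0.95"), 95: ("0.025", "0.975"), 99: ("0.005", "0.995")}

def sigma_interval(s, n, percent, row):
    """Confidence interval (low, high) for the population standard deviation."""
    right_col, left_col = COLUMNS[percent]
    df = n - 1
    low = s * math.sqrt(df / row[right_col])   # larger chi squared gives the low end
    high = s * math.sqrt(df / row[left_col])   # smaller chi squared gives the high end
    return low, high

for percent in (90, 95, 99):
    low, high = sigma_interval(3.21, 13, percent, ROW_12)
    print(f"{percent}% confidence: {low:.3f} < sigma < {high:.3f}")
```

Dividing by the right value gives the low end because the right value is the bigger of the two table entries.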


Like confidence intervals for other parameters such as proportion and average, the 99% interval is the widest, the 95% is nested inside the 99% and the 90% is nested inside both the others.  Unlike other confidence intervals, we get the endpoints by multiplying our statistic sx by two positive numbers, one less than 1 and the other greater than 1. The other thing that is unlike the earlier confidence intervals is that our statistic is not in the exact center. For example, in the 90% interval 3.21 - 2.425 = .785, while 4.864 - 3.21 = 1.654, so there is a greater distance to the high end of the interval than there is to the low end.

Goodness of fit

If we want to know if a coin is fair we can do a two-tailed z-score test of the proportion of heads, where the null hypothesis is that heads and tails should each come up 50% of the time. For example, if I flip a coin 100 times and get 52 heads and 48 tails, that isn't very far off from 50-50. The test statistic would be

z = (.52-.5)/sqrt(.5*.5/100) = .4

A z-score of .4 corresponds to a proportion of .6554. Because this is a two-tailed test, we use the higher number (either the heads or the tails percentage), and the resulting proportion has to be fairly high before we reject the null hypothesis.

Two-tailed 90% confidence: over .9500   
Two-tailed 95% confidence: over .9750   
Two-tailed 99% confidence: over .9950   

If instead we got 60 heads and 40 tails, the z-score would produce a much higher proportion.

z = (.60-.5)/sqrt(.5*.5/100) =2.0

A z-score of 2.0 corresponds to a proportion of .9772. If we wanted proof to 90% confidence or 95% confidence, we would say this isn't a fair coin, rejecting the null hypothesis. If we want the proof to 99% confidence, the coin would have to be even more unfair, at least 63 to 37 in 100 flips (62 heads gives z = 2.4, which corresponds to .9918, still under .9950).
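
Here is a small Python sketch of the two coin calculations. The function names coin_z and normal_cdf are made up for illustration, and the proportion is computed with the standard library's math.erf instead of being looked up in the z table.

```python
import math

def normal_cdf(z):
    """Proportion of the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def coin_z(heads, flips, p0=0.5):
    """z-score of the observed proportion of heads against the null proportion p0."""
    p_hat = heads / flips
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / flips)

for heads in (52, 60):
    z = coin_z(heads, 100)
    print(f"{heads} heads: z = {z:.1f}, proportion = {normal_cdf(z):.4f}")
```

The printed proportions can then be compared to the two-tailed cutoffs of .9500, .9750 and .9950.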

We can't use this method to figure out if a six-sided die is fair, because we need to test that all six possibilities are coming up an equal number of times. For example, if we rolled the die 60 times, we would expect every number to show up exactly 10 times each. In the real world, that's not likely to happen, but we should expect something close.  Here is the result of an experiment done with the random number generator in Excel.

number:::  1  2  3  4  5  6
expected: 10 10 10 10 10 10
observed: 14  9 10 10  7 10

Our test statistic will be the sum of (Observed - Expected)²/Expected. Here are the six values.

1st: (14-10)²/10 = 16/10 = 1.6 
2nd: (9-10)²/10 = 1/10 = 0.1
3rd: (10-10)²/10 = 0/10 = 0.0
4th: (10-10)²/10 = 0/10 = 0.0
5th: (7-10)²/10 = 9/10 =0.9
6th: (10-10)²/10 = 0/10 = 0.0

The sum is 2.6. This is our test statistic. The degrees of freedom is the number of categories - 1, which in this case is 5. Look on the right side of the chi square table for the thresholds in row d.f. = 5.

90% threshold: The 0.10 column, which has 9.236 in row 5
95% threshold: The 0.05 column, which has 11.071 in row 5
99% threshold: The 0.01 column, which has 15.086 in row 5

Our test statistic is much lower than even the 90% threshold, so we fail to reject the null hypothesis. In other words, while the results weren't a perfect match, they were much too close to what we expected to conclude the die is unfair.
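
The goodness of fit calculation can be sketched in Python as follows. The function name chi_square_statistic and the dictionary THRESHOLDS_DF5 are made up for this sketch, and the thresholds are typed in by hand from row 5 of the table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [14, 9, 10, 10, 7, 10]                  # counts for faces 1 through 6
expected = [sum(observed) / len(observed)] * len(observed)   # 10 for each face

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1

# Thresholds for d.f. = 5, copied by hand from the chi square table.
THRESHOLDS_DF5 = {90: 9.236, 95: 11.071, 99: 15.086}

print(f"test statistic = {statistic:.1f}, d.f. = {df}")
for percent, threshold in THRESHOLDS_DF5.items():
    verdict = "reject" if statistic > threshold else "fail to reject"
    print(f"{percent}%: threshold {threshold} -> {verdict} the null hypothesis")
```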

Independence of contingency tables

You may remember the idea of dependent probability where p(A, given B) would not be equal to p(A). It is possible to make a contingency table where p(A, given B) = p(A) for any A and B, where one is a row and the other is a column. Here is an example of a team whose road record and home record are exactly the same.



___H || A|| Total
W |16||16|| 32
L |10||10|| 20
  |26||26|| 52 = grand total

 We see that p(Wins) = p(Wins GIVEN Home) = p(Wins GIVEN Away). This hypothetical win-loss record is at the same proportion whether on the road or at home.
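
As a quick check, the three probabilities can be worked out with exact fractions in Python. The dictionary layout and the helper name p_wins_given are made up for this sketch.

```python
from fractions import Fraction

# Win-loss table from the notes: rows are W and L, columns are Home and Away.
table = {"W": {"H": 16, "A": 16}, "L": {"H": 10, "A": 10}}

grand_total = sum(sum(row.values()) for row in table.values())
p_wins = Fraction(sum(table["W"].values()), grand_total)

def p_wins_given(site):
    """p(Wins GIVEN site): wins at that site divided by games at that site."""
    games_at_site = sum(row[site] for row in table.values())
    return Fraction(table["W"][site], games_at_site)

print(p_wins, p_wins_given("H"), p_wins_given("A"))   # all three reduce to 8/13
```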



When we look at a contingency table, it most likely won't be exactly independent, but we have a new test that once again uses the chi square table and a test statistic with the same formula as goodness of fit, the sum of (Observed - Expected)²/Expected. This time, the table we are given is the Observed and we must create the Expected using the formula

Expected = (row total) * (column total) / (grand total).

Here is an example.

___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
  |30||30|| 60 = grand total



Now the home and road records aren't identical. We create the expected by keeping the row and column totals and blanking out the values inside.

___H || A|| Total
W |__||__|| 40
L |__||__|| 20
  |30||30|| 60 = grand total



In the box for Home Wins, in the upper left hand corner, we put


40 * 30/60 = 20. (Usually this is a decimal number, but I designed it to be a whole number to make this example easier.)


___H || A|| Total
W |20||__|| 40
L |__||__|| 20
  |30||30|| 60 = grand total



We could do the (row total) * (column total) / (grand total) method three more times to fill in the rest, but we don't have to. We can simply subtract the number in the box we just filled in from the row total and the column total to get the rest of the top row and the left column.


___H || A|| Total
W |20||20|| 40
L |10||__|| 20
  |30||30|| 60 = grand total



Now it's easy to fill in the last box as well.


___H || A|| Total
W |20||20|| 40
L |10||10|| 20
  |30||30|| 60 = grand total
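
Building the Expected table can be sketched in Python like this. The function name expected_counts is made up for illustration, and it uses the (row total) * (column total) / (grand total) calculation for every box instead of the subtraction shortcut.

```python
def expected_counts(observed):
    """Expected table with the same row and column totals as the observed one."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

observed = [[22, 18],   # W: Home, Away
            [8, 12]]    # L: Home, Away
expected = expected_counts(observed)
print(expected)   # [[20.0, 20.0], [10.0, 10.0]]
```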



Now we get the four values of (Observed - Expected)²/Expected. Here is the Observed contingency table once again.


___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
  |30||30|| 60 = grand total



Top left: (22-20)²/20 = 4/20 = 0.2
Top right: (18-20)²/20 = 4/20 = 0.2
Bottom left: (8-10)²/10 = 4/10 = 0.4
Bottom right: (12-10)²/10 = 4/10 = 0.4

Sum = 1.2

The degrees of freedom for a contingency table is (# of rows - 1) * (# of columns - 1), which in this case is (2-1)*(2-1) = 1*1 = 1. (Notice that we only had to fill in one box in the contingency table and the rest could be found by subtraction.) When degrees of freedom = 1, our thresholds are as follows.

90% confidence level (column 0.10): 2.706
95% confidence level (column 0.05): 3.841
99% confidence level (column 0.01): 6.635

Our test statistic of 1.2 does not let us reject the null hypothesis. This means that while the home and road records are different, we do not consider them to be statistically significantly different.
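
The whole test for this table can be sketched in Python as follows. The function name independence_statistic and the dictionary THRESHOLDS_DF1 are made up for this sketch, and the thresholds are typed in by hand from row 1 of the chi square table.

```python
def independence_statistic(observed):
    """Return (test statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Thresholds for d.f. = 1, copied by hand from the chi square table.
THRESHOLDS_DF1 = {90: 2.706, 95: 3.841, 99: 6.635}

statistic, df = independence_statistic([[22, 18], [8, 12]])
print(f"test statistic = {statistic:.1f}, d.f. = {df}")
for percent, threshold in THRESHOLDS_DF1.items():
    verdict = "reject" if statistic > threshold else "fail to reject"
    print(f"{percent}%: threshold {threshold} -> {verdict} the null hypothesis")
```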


Restriction of range and correlation

 
 If we take two matched sets of data, we can create a trendline, yp = ax + b, which is also known as the predictor line or the line of regression or the line of least squares. Our example is a set with miles per gallon highway on the x axis and weight on the y axis. Not surprisingly, lighter cars get better gas mileage in general and heavier cars get fewer miles per gallon. Here is the set of 18 matched pairs, written as (mpg_highway, weight).

(24, 3930)

(24, 3985)

(26, 3995)

(26, 4020)

(27, 3515)

(27, 3175)

(27, 3225)

(28, 3220)

(29, 3115)

(29, 3450)

(30, 3525)

(30, 3245)

(30, 3115)

(31, 2795)

(32, 3235)

(34, 2500)

(37, 2440)

(37, 2290)

The value of R², the square of the correlation coefficient, isn't a perfect 1.000, but according to the table we use to check how strong the correlation is, for 18 points the 95% confidence level threshold is 0.2190 and the 99% is 0.3481. So a value of R² = 0.7643 for these two variables shows they have very strong correlation and that isn't surprising. Strong negative correlation here (we can tell it's negative because the line slopes downward) means lighter cars generally get better gas mileage than heavier cars. That makes sense.



The statistical concept that is new here is restriction of range. This says that if you only look at data where the x values are limited, the R² value will tend to go down. Let's say we only look at the first ten cars on the list, the ones that get under 30 mpg highway. We see that the trend is still downward and the slope is steeper. Our rule says that R² will usually be less, and it is less here: 0.60711 instead of 0.7643. The thresholds for 10 pairs are 0.3994 and 0.5852 for 95% and 99% confidence, respectively, so this level of correlation still surpasses both.


We see the same tendency when we look only at the cars getting 30 mpg highway or better. Here there are eight data points and the R² value is 0.61952. For 8 points, this level of correlation surpasses the 95% threshold of 0.4998, but does not surpass the 99% threshold of 0.6956. By our measure, this set has a slightly higher R² than the heavier cars do, but not as high as the set of all 18 cars.
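
For anyone who wants to experiment with restriction of range, here is a rough Python sketch that computes R² for the whole list and for the two restricted sets. The names r_squared, under_30 and at_least_30 are made up for this sketch. The values it prints come from the list exactly as typed above, so they may not agree digit for digit with the figures quoted from the original chart.

```python
def r_squared(pairs):
    """Square of the correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy ** 2 / (sxx * syy)

# (mpg_highway, weight) pairs as typed in the list above.
cars = [(24, 3930), (24, 3985), (26, 3995), (26, 4020), (27, 3515), (27, 3175),
        (27, 3225), (28, 3220), (29, 3115), (29, 3450), (30, 3525), (30, 3245),
        (30, 3115), (31, 2795), (32, 3235), (34, 2500), (37, 2440), (37, 2290)]

under_30 = [pair for pair in cars if pair[0] < 30]
at_least_30 = [pair for pair in cars if pair[0] >= 30]

for label, subset in (("all cars", cars),
                      ("under 30 mpg", under_30),
                      ("30 mpg or better", at_least_30)):
    print(f"{label}: n = {len(subset)}, R^2 = {r_squared(subset):.4f}")
```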



The Monty Hall Problem (or for younger people, The Game Show Problem)

Way back in the day, there was a game show called Let's Make a Deal and the host was Monty Hall. (As a survey in class showed, only one student was aware of this, watching it in re-runs on the cable  channel The Game Show Network.) There were many different games played in a half hour using many rules, but one of the famous ones is called The Monty Hall Problem. We can call it The Game Show Problem or more descriptively One Brand New Car and Two Goats.

The rules of the game are as follows. There are three closed doors and the contestant must choose one. Behind two of the doors there are bad prizes and behind the last door there is a great prize. Usually but not always, the bad prizes were goats. Usually but not always, the great prize was a brand new car.

(Note from a different perspective: If you have a place to raise them, goats are excellent producers of meat and milk. The most famous cheese of Greece, known as Feta Cheese, is traditionally made from sheep's milk, often with some goat's milk mixed in. A new car, on the other hand, usually means higher insurance rates and even though it's a prize, the winner has to pay the sales tax.)

Back to the game. After the contestant chooses a door, not knowing if the prize is good or bad, the game show host (Monty Hall) shows that there's a goat behind some unchosen door and asks the contestant if he (or she) wants to switch to another door. (In the three door game, switching means over to the only other door available.)

The math question here is this. Does it make sense to stick with your original door or switch?

Explaining the simplest situation, the three door, one car, two goat version: Okay, on the contestant's first choice, there is a 1/3 chance of getting the car and 2/3 probability of getting a goat. If the contestant doesn't switch, the chance of winning is 1/3.

If the contestant switches, we have to look at two possible situations.

1. The contestant picked the car in the first place. There is a 1/3 chance of this and if the contestant switches, there are no other cars so the contestant will lose.

2. The contestant picked a goat in the first place. There is a 2/3 chance this would happen. If the contestant's door has a goat and Monty shows the second goat, the door the contestant would switch to must have the car. Since switching wins in this case and loses in the first, the chance of winning by switching is 2/3.
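
The two situations can be checked by listing every case with exact fractions in Python. This is only a sketch: the name win_chances is made up, and it assumes the car is equally likely to be behind each door, the first pick is random, and Monty chooses at random when he has two goats to pick from.

```python
from fractions import Fraction

DOORS = ("1", "2", "3")

def win_chances():
    """Exact chances of winning the car by staying and by switching."""
    stay = switch = Fraction(0)
    for car in DOORS:                      # car is equally likely behind each door
        for pick in DOORS:                 # assumed: the first pick is random
            # Monty may only open an unchosen door that hides a goat.
            can_open = [d for d in DOORS if d != pick and d != car]
            for opened in can_open:
                weight = Fraction(1, 9) / len(can_open)
                other = next(d for d in DOORS if d not in (pick, opened))
                if pick == car:
                    stay += weight         # sticking with the first door wins
                if other == car:
                    switch += weight       # moving to the other closed door wins
    return stay, switch

stay, switch = win_chances()
print(stay, switch)   # 1/3 2/3
```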

Generalizing the problem. Let's call the number of doors D, the number of bad prizes B and the number of good prizes G, where G + B = D.


Change the rules so that there can be three doors or more with at least two bad prizes. (If you have a bad prize, the host still needs to show a bad prize.)

If you don't switch, the odds are G/D.

After doing some algebra, we see that the odds if you switch are G/D * (D-1)/(D-2). (The host opens one door with a bad prize, so there are D-2 closed doors left to switch to.) The second fraction is of the form big/little, so it must be more than 1. What this means is that no matter how many goats (at least two or the game doesn't work) and how many cars (at least one or you can only pick goats), it always makes sense to switch.
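
The algebra can be double checked with exact fractions in Python. This sketch assumes the host opens exactly one door with a bad prize and that the contestant then switches at random to one of the other D-2 closed doors. The names switch_chance and formula are made up for illustration.

```python
from fractions import Fraction

def switch_chance(doors, good):
    """Chance of a good prize after switching, worked out case by case."""
    bad = doors - good
    if good < 1 or bad < 2:
        raise ValueError("need at least one good prize and two bad prizes")
    remaining = doors - 2                      # closed doors left to switch to
    first_good = Fraction(good, doors)         # first pick was a good prize
    first_bad = Fraction(bad, doors)           # first pick was a bad prize
    return (first_good * Fraction(good - 1, remaining)
            + first_bad * Fraction(good, remaining))

def formula(doors, good):
    """Closed form from the notes: G/D * (D-1)/(D-2)."""
    return Fraction(good, doors) * Fraction(doors - 1, doors - 2)

for doors in range(3, 8):
    for good in range(1, doors - 1):           # leaves at least two bad prizes
        assert switch_chance(doors, good) == formula(doors, good)
        assert switch_chance(doors, good) > Fraction(good, doors)

print(switch_chance(3, 1), switch_chance(4, 1), switch_chance(5, 2))   # 2/3 3/8 8/15
```

With D = 3 and G = 1 this gives back the 2/3 from the original three door game.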


 
