Bayesian probability
If a trait is very rare, only a very accurate test gives us useful
information. For example, if a trait shows up in only 1 in 10,000
people but the test for the trait has an error rate of 1 in 1,000, we
should expect about 10 false positives for every true positive. Here is
the completed table for that situation.
________don't___have____row total
test + ____9,999__999______10,998__
test - _9,989,001___1_____9,989,002__
col.___9,999,000_1,000______10,000,000 grand total
In a situation such as this, testing positive twice could give us useful information, since testing positive once has an error rate of about 90.9%. We have to assume the errors are random and not deterministic. For example, if a test for a chemical compound in opium also catches a similar compound found in poppy seed bagels, testing twice won't get rid of the errors. Here we assume the errors are purely random.
Step 1: The top row of the first contingency table becomes the column total/grand total row of the second contingency table. This takes the numbers from the people who tested positive the first time and makes them the totals for those who will be tested twice.
________don't___have____row total
test + __________________________
test - __________________________
col._______9,999__999_____10,998 grand total
Step 2: Multiply the error rate by the "have" column total to find the number who have the trait but test negative. Round to the nearest whole number. (We didn't have to round before, but now we do.)
999*1/1000 = .999 ~= 1, so the number who test positive and have the trait is 999 - 1 = 998.
________don't___have____row total
test + ___________998____________
test - _____________1____________
col._______9,999__999_____10,998 grand total
Step 3: Multiply the error rate by the "don't have" column total to find the errors, the people who test positive a second time without having the trait.
9,999*1/1000 = 9.999 ~= 10. That means the test negative entry in that column is 9,999 - 10 = 9,989.
________don't___have____row total
test + _______10__998____________
test - _____9,989____1____________
col._______9,999__999_____10,998 grand total
Step 4: Fill in the row totals.
________don't___have____row total
test + _______10__998_____1,008___
test - _____9,989____1_____9,990___
col._______9,999__999_____10,998 grand total
Step 5: Find the error rate for testing positive twice. 10/1,008 = .0099... or about 1%.
Of the ten million people tested, we would send letters to 1,008 telling them they tested positive twice. Of those people, ten don't have the trait and are getting false information, but 998 are getting the right information. In the first test, there was someone with the trait who tested negative, and the same is true in the second test, so there are two people with the trait who did not get two positive test results. While this isn't a perfect situation, it's much better than the over 90% error rate we got for positive tests the first time through.
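For anyone who wants to check this arithmetic with software, here is a rough Python sketch of the two rounds of testing. The prevalence, error rate and rounding rule are the ones used above; the helper function test_round is my own name for illustration.

# Two rounds of testing for a rare trait: prevalence 1 in 10,000, error rate 1 in 1,000.
population = 10_000_000
have = population // 10_000        # 1,000 people with the trait
dont_have = population - have      # 9,999,000 people without it
error_rate = 1 / 1_000

def test_round(dont_have, have, error_rate):
    # Returns (false positives, true positives) for one round of testing,
    # rounding to the nearest whole person as in the notes.
    false_pos = round(dont_have * error_rate)
    true_pos = round(have * (1 - error_rate))
    return false_pos, true_pos

fp1, tp1 = test_round(dont_have, have, error_rate)   # 9,999 and 999
fp2, tp2 = test_round(fp1, tp1, error_rate)          # 10 and 998
print(fp1 / (fp1 + tp1))   # error rate after one positive test, about 0.909
print(fp2 / (fp2 + tp2))   # error rate after two positive tests, about 0.0099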
Relative frequency charts and ogives
Relative frequencies are also known as proportions and can sometimes be considered probabilities. If the proportions correspond to ordered categories, a line chart can be a clear way to present the data.
Here is the data for the Los Angeles Angels' scoring by inning, given as percentages. The first number is the percentage of their runs scored in that inning and the second number in brackets is the cumulative percentage of runs scored through that inning.
1st: 13.5% [13.5%]
2nd: 10.0% [23.5%]
3rd: 10.4% [33.9%]
4th: 11.0% [45.0%]
5th: 11.7% [56.6%]
6th: 11.7% [68.3%]
7th: 8.8% [77.1%]
8th: 12.3% [89.4%]
9th: 9.4% [98.8%]
extra innings: 1.2% [100.0%]
The Angels are fairly consistent, except for the big bump in the first inning and the drop-off in the 7th. Unsurprisingly, they score very few extra inning runs, though some teams like the Giants score four times as many.
In a line graph, we have two options. The first is the line in blue, which shows the production inning by inning. The red line shows the cumulative numbers, and that graph is called the ogive, pronounced "oh-jive". If the Angels were completely consistent, the ogive would be a perfectly straight line, but instead we see slight bends in the red line where run production increases and decreases from inning to inning.
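If you want to draw the two lines yourself, here is a rough Python sketch; it assumes the matplotlib library is available, and the running totals it computes may differ from the bracketed numbers above by a rounding hair.

import matplotlib.pyplot as plt

innings = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', 'extra']
per_inning = [13.5, 10.0, 10.4, 11.0, 11.7, 11.7, 8.8, 12.3, 9.4, 1.2]
ogive = [sum(per_inning[:i + 1]) for i in range(len(per_inning))]  # running totals

plt.plot(innings, per_inning, 'b-o', label='percent scored in each inning')
plt.plot(innings, ogive, 'r-o', label='cumulative percent (ogive)')
plt.ylabel('percent of runs')
plt.legend()
plt.show()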
Sunday, May 11, 2014
Tuesday, April 22, 2014
Notes for April 22 and April 24
Confidence interval for standard deviation of population
The confidence interval formula for sigmax is completely different from the formula for mux, and uses values from a new table, Table A-4, known as the chi-square table. Chi is pronounced like the chi in chiropractor, "kai", not "chee" or "chai".
Note that we don't have to test for normal distribution of the underlying data.
Let's give an example. If we have a set where n = 13 and sx = 3.21, we use n-1 as our degrees of freedom. Here is the line from Table A-4 for the row that corresponds to 12.
_____0.995___0.99___0.975___0.95___0.90___0.10____0.05___0.025____0.01___0.005
_12__3.074__3.571___4.404__5.226__6.304__18.549__21.026__23.337__26.217__28.299
The values of chi²Right and chi²Left are taken from the table as follows.
90% confidence: The right value is from the 0.05 column, the left value is from the 0.95 column. (Note that 0.95 - 0.05 = 0.90 or 90%.)
95% confidence: The right value is from the 0.025 column, the left value is from the 0.975 column. (Note that 0.975 - 0.025 = 0.95 or 95%.)
99% confidence: The right value is from the 0.005 column, the left value is from the 0.995 column. (Note that 0.995 - 0.005 = 0.99 or 99%.)
In this instance, here are the confidence intervals for each of the standard percentages of confidence.
90% confidence: 3.21*sqrt(12/21.026) < sigmax < 3.21*sqrt(12/5.226)
2.425 < sigmax < 4.864
95% confidence: 3.21*sqrt(12/23.337) < sigmax < 3.21*sqrt(12/4.404)
2.302 < sigmax < 5.299
99% confidence: 3.21*sqrt(12/28.299) < sigmax < 3.21*sqrt(12/3.074)
2.090 < sigmax < 6.342
Like confidence intervals for other parameters such as proportion and average, the 99% interval is the widest, the 95% is nested inside the 99% and the 90% is nested inside both the others. Unlike other confidence intervals, we get the endpoints by multiplying our statistic sx by two positive numbers, one less than 1 and the other greater than 1. The other thing that is unlike the earlier confidence intervals is that our statistic is not at the exact center. For example, in the 90% interval, 3.21 - 2.425 = .785, while 4.864 - 3.21 = 1.654, so there is a greater distance to the high end of the interval than there is to the low end.
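Here is a rough Python sketch of the same calculation, assuming the scipy library is available; chi2.ppf plays the role of Table A-4.

from math import sqrt
from scipy.stats import chi2

def sigma_confidence_interval(s, n, confidence):
    # Chi-square confidence interval for a population standard deviation.
    df = n - 1
    alpha = 1 - confidence
    chi2_right = chi2.ppf(1 - alpha / 2, df)   # e.g. the 0.025 column for 95%
    chi2_left = chi2.ppf(alpha / 2, df)        # e.g. the 0.975 column for 95%
    return s * sqrt(df / chi2_right), s * sqrt(df / chi2_left)

for conf in (0.90, 0.95, 0.99):
    print(conf, sigma_confidence_interval(3.21, 13, conf))
# should reproduce roughly (2.425, 4.864), (2.302, 5.299) and (2.090, 6.342)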
Goodness of fit
If we want to know if a coin is fair we can do a two tailed z-score test of the proportion of heads to tails, where the null hypothesis is that both should be 50%. For example, if I flip a coin 100 times and get 52 heads and 48 tails, that isn't very far off from 50-50. The test statistic would be
z = (.52-.5)/sqrt(.5*.5/100) = .4
A z-score of .4 corresponds to a proportion of .6554. Because this is a two-tailed test, we use whichever percentage is higher (heads or tails), and the corresponding look-up proportion has to be fairly high to reject the null hypothesis.
Two-tailed 90% confidence: over .9500
Two-tailed 95% confidence: over .9750
Two-tailed 99% confidence: over .9950
If instead we got 60 heads and 40 tails, the z-score would produce a much higher proportion.
z = (.60-.5)/sqrt(.5*.5/100) = 2.0
A z-score of 2.0 corresponds to a proportion of .9772. If we wanted proof to 90% confidence or 95% confidence, we would say this isn't a fair coin, rejecting the null hypothesis. If we wanted proof to 99% confidence, the coin would have to be even more unfair, at least 63 to 37 in 100 flips (63 heads gives z = 2.6 and a proportion of .9953, just over the .9950 threshold).
We can't use this method to figure out if a six-sided die is fair, because we need to test that all six possibilities are coming up an equal number of times. For example, if we rolled the die 60 times, we would expect every number to show up exactly 10 times each. In the real world, that's not likely to happen, but we should expect something close. Here is the result of an experiment done with the random number generator in Excel.
number::: 1 2 3 4 5 6
expected: 10 10 10 10 10 10
observed: 14 9 10 10 7 10
Our test statistic will be the sum of (Observed - Expected)²/Expected. Here are the six values.
1st: (14-10)²/10 = 16/10 = 1.6
2nd: (9-10)²/10 = 1/10 = 0.1
3rd: (10-10)²/10 = 0/10 = 0.0
4th: (10-10)²/10 = 0/10 = 0.0
5th: (7-10)²/10 = 9/10 = 0.9
6th: (10-10)²/10 = 0/10 = 0.0
The sum is 2.6. This is our test statistic. The degrees of freedom are the number of categories minus 1, which in this case is 5. Look on the right side of the chi-square table for the thresholds in the row for d.f. = 5.
90% threshold: The 0.10 column, which has 9.236 in row 5
95% threshold: The 0.05 column, which has 11.071 in row 5
99% threshold: The 0.01 column, which has 15.086 in row 5
Our test statistic is much lower than even the 90% confidence threshold, so we fail to reject the null hypothesis. In other words, while the result wasn't a perfect match, it was much too close to what we expected for us to conclude the die is unfair.
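Here is a rough Python sketch of the same goodness-of-fit calculation, assuming scipy is available. The chisquare function returns both the test statistic and a p-value, which the table method approximates with thresholds.

from scipy.stats import chisquare

observed = [14, 9, 10, 10, 7, 10]
expected = [10, 10, 10, 10, 10, 10]
statistic, p_value = chisquare(observed, f_exp=expected)
print(statistic)   # 2.6, matching the hand calculation
print(p_value)     # roughly 0.76, far from significant, so we fail to reject H0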
Independence of contingency tables
You may remember the idea of dependent probability, where p(A, given B) would not be equal to p(A). It is possible to make a contingency table where p(A, given B) = p(A) for any A and B, where one is a row and the other is a column. Here is an example of a team whose road record and home record are exactly the same.
___H || A|| Total
W |16||16|| 32
L |10||10|| 20
|26||26|| 52 = grand total
We see that p(Wins) = p(Wins GIVEN Home) = p(Wins GIVEN Away). This hypothetical win-loss record is at the same proportion whether on the road or at home.
When we look at a contingency table, it most likely won't be exactly independent, but we have a new test, using the chi-square table once again and a test statistic with the same formula as goodness of fit, the sum of (Observed - Expected)²/Expected. This time, the table we are given is the Observed and we must create the Expected using the formula
(row total) * (column total) / (grand total).
Here is an example.
___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
|30||30|| 60 = grand total
Now the home and road records aren't identical. We create the expected by keeping the row and column totals and blanking out the values inside.
___H || A|| Total
W |__||__|| 40
L |__||__|| 20
|30||30|| 60 = grand total
In the box for Home Wins, in the upper left-hand corner, we put
40 * 30/60 = 20. (Usually this is a decimal number, but I designed this example to come out to a whole number to make it easier.)
___H || A|| Total
W |20||__|| 40
L |__||__|| 20
|30||30|| 60 = grand total
We could do the (row total) * (column total) / (grand total) method three more times to fill in the rest, but we don't have to. We can simply subtract the number in the box we just filled in from the row and column totals to get the rest of the top row and the left column.
___H || A|| Total
W |20||20|| 40
L |10||__|| 20
|30||30|| 60 = grand total
Now it's easy to fill in the last box as well.
___H || A|| Total
W |20||20|| 40
L |10||10|| 20
|30||30|| 60 = grand total
Now we get the four values of (Observed - Expected)²/Expected. Here is the Observed contingency table once again.
___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
|30||30|| 60 = grand total
Top left: (22-20)²/20 = 4/20 = 0.2
Top right: (18-20)²/20 = 4/20 = 0.2
Bottom left: (8-10)²/10 = 4/10 = 0.4
Bottom right: (12-10)²/10 = 4/10 = 0.4
Sum = 1.2
The degrees of freedom for a contingency table is (# of rows - 1) * (# of columns - 1), which in this case is (2-1)*(2-1) = 1*1 = 1. (Notice that we only had to fill in one box in the contingency table and the rest could be found by subtraction.) When degrees of freedom = 1, our thresholds are as follows.
90% confidence level (column 0.10): 2.706
95% confidence level (column 0.05): 3.841
99% confidence level (column 0.01): 6.635
Our test statistic of 1.2 does not let us reject the null hypothesis. This means that while the home and road records are different, we do not consider them to be statistically significantly different.
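Here is a rough Python sketch of the same independence test, again assuming scipy is available. The correction=False option turns off a continuity correction so the result matches the hand calculation above.

from scipy.stats import chi2_contingency

observed = [[22, 18],   # wins at home, wins on the road
            [8, 12]]    # losses at home, losses on the road
statistic, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(statistic, dof)   # 1.2 with 1 degree of freedom
print(expected)         # the expected table: [[20, 20], [10, 10]]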
Restriction of range and correlation
If we take two matched sets of data, we can create a trendline, yp = ax + b, which is also known as the predictor line, the line of regression, or the line of least squares. The illustration plots miles per gallon on the highway on the x axis and weight on the y axis. Not surprisingly, lighter cars generally get better gas mileage and heavier cars get fewer miles per gallon. Here is the set of 18 matched pairs, written as (mpg_highway, weight).
(24, 3930)
(24, 3985)
(26, 3995)
(26, 4020)
(27, 3515)
(27, 3175)
(27, 3225)
(28, 3220)
(29, 3115)
(29, 3450)
(30, 3525)
(30, 3245)
(30, 3115)
(31, 2795)
(32, 3235)
(34, 2500)
(37, 2440)
(37, 2290)
The correlation coefficient R² isn't a perfect 1.000, but according to the table we use to check how strong a correlation is, for 18 points the 95% confidence threshold is 0.2190 and the 99% threshold is 0.3841. So a value of R² = 0.7643 for these two variables shows they have very strong correlation, and that isn't surprising. Strong negative correlation here (we can tell it's negative because the line slopes downward) means lighter cars generally get better gas mileage than heavier cars. That makes sense.
The statistical concept that is new here is restriction of range. It says that if you only look at data where the x values are limited to a narrower band, the R² value will tend to go down. Let's say we only look at the first eight cars on the list, the ones that get under 30 mpg highway. We see that the trend is still downward and the slope is steeper. Our rule says that R² will usually be less, and it is less here: 0.60711 instead of 0.7643. For 8 points, this level of correlation surpasses the 95% threshold of 0.4998, but does not surpass the 99% threshold of 0.6956.
We see the same tendency come true when we look only at the cars getting 30 mpg highway or better. Here there are ten data points and the R² value is 0.61952. The thresholds for 10 pairs are 0.3994 and 0.5852 for 95% and 99% confidence, respectively. By our measure, this set has better correlation than the heavier cars do, but not as good as the set of all 18 cars.
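Here is a rough Python sketch of the restriction-of-range comparison, assuming numpy is available. It computes R² for all 18 cars and then for the two restricted subsets; the printed values should land close to the figures quoted above.

import numpy as np

# (mpg_highway, weight) pairs from the list above
cars = [(24, 3930), (24, 3985), (26, 3995), (26, 4020), (27, 3515), (27, 3175),
        (27, 3225), (28, 3220), (29, 3115), (29, 3450), (30, 3525), (30, 3245),
        (30, 3115), (31, 2795), (32, 3235), (34, 2500), (37, 2440), (37, 2290)]

def r_squared(pairs):
    x, y = zip(*pairs)
    r = np.corrcoef(x, y)[0, 1]
    return r * r

print(r_squared(cars))                             # all 18 cars
print(r_squared([c for c in cars if c[0] < 30]))   # the 8 cars under 30 mpg
print(r_squared([c for c in cars if c[0] >= 30]))  # the 10 cars at 30 mpg or better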
The Monty Hall Problem (or for younger people, The Game Show Problem)
Way back in the day, there was a game show called Let's Make a Deal and the host was Monty Hall. (As a survey in class showed, only one student was aware of this, watching it in re-runs on the cable channel The Game Show Network.) There were many different games played in a half hour using many rules, but one of the famous ones is called The Monty Hall Problem. We can call it The Game Show Problem or more descriptively One Brand New Car and Two Goats.
The rules of the game are as follows. There are three closed doors and the contestant must choose one. Behind two of the doors there are bad prizes and behind the last door there is a great prize. Usually but not always, the bad prizes were goats. Usually but not always, the great prize was a brand new car.
(Note from a different perspective: If you have a place to raise them, goats are excellent producers of meat and milk. The most famous cheese of Greece, known as Feta Cheese, is almost always made from goat's milk. A new car, on the other hand, usually means higher insurance rate and even though it's a prize, the winner has to pay the sales tax.)
Back to the game. After the contestant chooses a door, not knowing if the prize is good or bad, the game show host (Monty Hall) shows that there's a goat behind some unchosen door and asks the contestant if he (or she) wants to switch to another door. (In the three door game, switching means over to the only other door available.)
The math question here is this. Does it make sense to stick with your original door or switch?
Explaining the simplest situation, the three door, one car, two goat version: Okay, on the contestant's first choice, there is a 1/3 chance of getting the car and 2/3 probability of getting a goat. If the contestant doesn't switch, the chance of winning is 1/3.
If the contestant switches, we have to look at two possible situations.
1. The contestant picked the car in the first place. There is a 1/3 chance of this, and if the contestant switches, there are no other cars, so the contestant will lose.
2. The contestant picked a goat in the first place. There was a 2/3 chance this would happen. If your door has a goat and Monty shows you the second goat, the door you would switch to must have the car, so if the contestant switches, the chance of winning is 2/3.
Generalizing the problem. Let's call the number of doors D, the number of bad prizes B and the number of good prizes G, where G + B = D.
Change the rules so that there can be three doors or more with at least two bad prizes. (If you have a bad prize, the host still needs to show a bad prize.)
If you don't switch, the chance of winning is G/D.
After doing some algebra, we see that if you do switch, the chance of winning is G/D * (D-1)/(D-2). The second fraction is of the form big/little, so it must be more than 1. What this means is that no matter how many goats (at least two or the game doesn't work) and how many cars (at least one or you can only pick goats), it always makes sense to switch.
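Here is a rough Python simulation of the game to back up the algebra; the function name and the 100,000-trial count are my own choices for illustration.

import random

def play(doors=3, cars=1, switch=True):
    # One play of the game: the host reveals a goat behind an unchosen door,
    # and the contestant either sticks or switches to a random remaining door.
    prizes = ['car'] * cars + ['goat'] * (doors - cars)
    random.shuffle(prizes)
    choice = random.randrange(doors)
    goat_doors = [i for i in range(doors) if i != choice and prizes[i] == 'goat']
    opened = random.choice(goat_doors)
    if switch:
        choice = random.choice([i for i in range(doors) if i not in (choice, opened)])
    return prizes[choice] == 'car'

trials = 100_000
for switch in (False, True):
    wins = sum(play(switch=switch) for _ in range(trials))
    print('switch' if switch else 'stay', wins / trials)   # about 1/3 staying, 2/3 switching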
Tuesday, April 8, 2014
Notes for April 8
The relations between variables in the line of regression
When we input the points into our calculators for two variable statistics, there are a lot of numbers produced. Here is the list from the TI-30XIIs
n
x-bar
s_x
sigma_x
y-bar
s_y
sigma_y
sum(x)
sum(x²)
sum(y)
sum(y²)
sum(xy)
a
b
r
On the take-home section of the second midterm, you see the messy formula for r that you need to use if you don't have a calculator to do it for you. There is a relationship between the slope a and r that goes as follows.
a = r × s_y/s_x
Remember that a is the slope of the trendline, which means the rise over run. The two standard deviations become our scaling factor and r decides if the line slopes positively (uphill from left to right) or negatively (downhill from left to right).
The formula yp = ax + b is in slope-intercept form, which means when x = 0, yp = b. The only x values we can plug into the formula are ones between the min and max values of x. We have a workaround for this, which is that the centroid (x-bar, y-bar) is always on the line. This means we can change the formula to point-slope form.
yp - y-bar = a(x - x-bar)
Getting rid of the parentheses it becomes
yp - y-bar = ax - a×x-bar
Adding y-bar to both sides we get
yp = ax - a×x-bar + y-bar
What this means is b = y-bar - a×x-bar
Note: on the midterm and on the board, I gave the residuals as yp - y. In point of fact, it should be the other way around, y - yp. It's okay to use the formula given in class on the test.
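Here is a rough Python sketch that mimics the calculator output and checks both relationships, a = r × s_y/s_x and b = y-bar - a×x-bar. It assumes Python 3.10 or later for statistics.correlation; the five points are the same ones used in the correlation notes below.

import statistics

def regression_from_stats(xs, ys):
    # Reproduce a, b and r from the summary statistics the calculator gives us.
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = statistics.correlation(xs, ys)   # needs Python 3.10+
    a = r * s_y / s_x                    # slope from the relationship above
    b = y_bar - a * x_bar                # intercept from the point-slope rearrangement
    return a, b, r

xs = [1, 2, 3, 4, 6]
ys = [1, 2, 4, 4, 5]
print(regression_from_stats(xs, ys))   # roughly a = 0.797, b = 0.649, r = 0.933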
Notes for April 1st and 3rd
Two averages from two populations
In the tests to see if the average of some numerical value is significantly different when comparing two populations, we need the averages, standard deviations and sizes of both populations. The score we use is a t-score and the degrees of freedom is the smaller of the two sample sizes minus 1.
Question: Do female Laney students sleep a number of hours each night different from male Laney students?
This uses data sets from a previous class. Here are the numbers for the students who submitted data, with the males listed as group #1. Again, let's assume a two-tailed test, since we don't have any information going in which should be greater, and let's do this test to 90% level of confidence.
With a test like this, we can arbitrarily choose which set is the first set and which is the second. Let's do it so x-bar1 > x-bar2. This way, our t-score will be positive, which is what the table expects.
H0: mu1 = mu2 (average hours of sleep are the same for males and females at Laney)
x-bar1 = 7.54
s1 = 1.47
n1 = 12
x-bar2 = 7.31
s2 = .94
n2 = 26
The degrees of freedom will be 12-1=11, and 10% in two tails gives us the thresholds of +/-1.796. Here is what to type into the calculator.
(7.54-7.31)/sqrt(1.47^2/12+.94^2/26)[enter]
0.4971...
2 tails__0.01_______0.02_____0.05_______0.10______0.20
11_______3.106_____2.718_____2.201_____1.796_____1.363
This number is less than every threshold, and so does not impress us enough to make us reject the null hypothesis. It's possible that larger samples would give us numbers that would show a difference, which if true would mean this example produced a Type II error, but we have no proof of that.
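Here is a rough Python sketch of the test statistic calculation, using only the standard library; the function name is just for illustration.

from math import sqrt

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    # t-score for comparing two sample means, each standard deviation paired with its own n
    return (xbar1 - xbar2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

t = two_sample_t(7.54, 1.47, 12, 7.31, 0.94, 26)
print(round(t, 4))   # about 0.4971, compared against the d.f. = 11 thresholds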
Matched pairs
Was the price of silver in 2007 significantly different than it was in 2008?
Side by side, we have two lists of prices of silver, the highest price in a given month in 2007, followed by the highest price in that same month in 2008. Take the differences in the prices and find the average and standard deviation. The size of the list is 12, so the degrees of freedom are 11. If we assume we did not know which year showed higher prices when we started this experiment, it makes sense to make this a two-tailed test. Just for a change of pace, let us use the 90% confidence level.
Mo.___2007___2008
Jan.__13.45__16.23
Feb.__14.49__19.81
Mar.__13.34__20.67
Apr.__14.01__17.74
May___12.90__18.19
Jun.__13.19__17.50
Jul.__12.86__18.84
Aug.__12.02__15.27
Sep.__12.77__12.62
Oct.__14.17__11.16
Nov.__14.69__10.26
Dec.__14.76__10.66
Find the test statistic t, the threshold from Table A-3 and determine if we should reject H0, which in matched pairs tests is always that mu1 = mu2.
Answers in the comments.
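If you want to check your work with software, here is a rough Python sketch using scipy's related-samples (matched pairs) t-test; it only prints the test statistic and p-value, so the interpretation is still up to you.

from scipy import stats

silver_2007 = [13.45, 14.49, 13.34, 14.01, 12.90, 13.19,
               12.86, 12.02, 12.77, 14.17, 14.69, 14.76]
silver_2008 = [16.23, 19.81, 20.67, 17.74, 18.19, 17.50,
               18.84, 15.27, 12.62, 11.16, 10.26, 10.66]

# Matched pairs: the test works on the twelve monthly differences.
t_statistic, p_value = stats.ttest_rel(silver_2008, silver_2007)
print(t_statistic, p_value)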
Correlation
When we have a data set, sometimes we collect more than one variable of information about the units. For example, in class surveys taken in previous classes, among the numerical variables were the height in inches, the GPA, the opinion about the difficulty of the class, age and average hours of sleep per night.
A question about two variables is if they are related to one another in some simple way. One simple way is correlation, which can be positive or negative. Here is a general definition of each.
Positive correlation between two numerical variables, call them x and y, means that the high values of x tend to be paired with the high values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the low values of y.
The variables x and y show negative correlation if the high values of x tend to be paired with the low values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the high values of y.
If we pick two variables at random, we do not expect to see correlation. We can write this as a null hypothesis, where the test statistic is r², the square of the correlation coefficient rx,y. A sign of low correlation is r² near 0. The values of rx,y are always between -1, which means perfect negative correlation, and +1, which means perfect positive correlation. This means 0 <= r² <= 1.
The second orange sheet gives us threshold numbers for the 99% confidence level and 95% confidence level for correlation given the number of points n. For instance, when n = 5, the thresholds are .7709 for 95% confidence and .9197 for 99% confidence. This splits up the numbers from 0 to 1 into three regions.
0 < r² < .7709: We fail to reject the null hypothesis, which means no strong correlation.
.7709 < r² < .9197: We reject the null hypothesis with 95% confidence, but not 99% confidence. This is fairly strong correlation.
.9197 < r²: We reject the null hypothesis with 99% confidence. This is very strong correlation.
Just like with any hypothesis test, we should decide the confidence level before testing. This is a two-tailed test, because whether correlation is positive or negative, the relationships between number sets can often give us vital scientific information.
There is an important warning: Correlation is not causation. Just because two number sets have a relation, it doesn't mean that x causes y or y causes x. Sometimes there is a hidden third factor that is the cause of both of the things we are looking at. Sometimes, it's random chance and there is no causative agent at all.
Here is a set of five points, listed as (x,y) in each case.
(1,1)
(2,2)
(3,4)
(4,4)
(6,5)
As we can see, the points are ordered from low to high in both coordinates, so we expect some correlation. If we input the points into our calculator, we get a value for r (which is the same as rx,y) of .933338696..., and r² = .8712, which is strong positive correlation, but not very strong positive correlation. Assuming the 95% confidence level is good enough for us, we can use the a and b variables from our calculator to give us the equation of the line
yp = .797x + .649
This is called the predictor line (that's where the p comes from) or the line of regression or the line of least squares or the trendline. Any such line for a given data set meets two important criteria. It passes through the centroid (x-bar, y-bar), the center point of all the data, and it minimizes the sum of the squares of the residuals, where the residual of a point is y - yp.
Let's find the absolute values of the residuals for each of the five points, using the rounded values of a and b.
Point (1,1): |1 - .797*1 - .649| = 0.446
Point (2,2): |2 - .797*2 - .649| = 0.243
Point (3,4): |4 - .797*3 - .649| = 0.96
Point (4,4): |4 - .797*4 - .649| = 0.163
Point (6,5): |5 - .797*6 - .649| = 0.431
As we can see, the point (3,4) is farthest from the line, while the point (4,4) is the closest. The centroid (3.2, 3.2) is exactly on the line if you use the un-rounded values of a and b, and even using the rounded values, the centroid only misses the line by .0006.
In class, we used five points, but the last point was (5,6) instead of (6,5). This changes the numbers. rx,y goes up to .973328527..., which is above the 99% confidence threshold. The formula for the new predictor line is
yp = 1.2x - .2
Where we see the difference in these two different examples is in the residuals.
Point (1,1): |1 - 1.2*1 + .2| = 0
Point (2,2): |2 - 1.2*2 + .2| = 0.2
Point (3,4): |4 - 1.2*3 + .2| = 0.6
Point (4,4): |4 - 1.2*4 + .2| = 0.6
Point (5,6): |6 - 1.2*5 + .2| = 0.2
The closest point is now exactly on the line, which is a rarity, but even the farthest away point is only .6 units away, closer than the farthest away on the line with the lower correlation coefficient.
As we get more points in our data set, the threshold that indicates strong correlation gets lower. This way, a few outlier points do not completely ruin the chances of the data showing correlation, though sometimes strong outliers can pull the correlation coefficient so close to zero that we cannot reject the null hypothesis that the two variables are not simply related.
Monday, March 31, 2014
Notes for March 25 and 27
The first hypothesis tests we studied were checking to see if an
experimental sample produced a value that was significantly different
from some known value produced either by math or by earlier experiments.
For example, in the lady tasting tea, since she has two choices each time a mixture is given to her, the math would say that her chance of getting it right by just guessing is 50% or H0: p = .5. In testing psychic abilities, there are five different symbols on the cards, so random guessing should get the right answer 1 out of 5 times, or 20%, so H0: p = .2.
In a test for average human body temperature, the assumption of 98.6 degrees Fahrenheit being the average came from an experiment performed in the 19th Century.
We can also do tests by taking samples from two different populations. The null hypothesis, as always, is an equality, the assumption that the parameters from the two different populations are the same. As always, we need convincing evidence that the difference is significant to reject the null hypothesis, and we can choose just how convincing that evidence must be by setting the confidence level, which is usually either 90% or 95% or 99%.
Two proportions from two populations
Like with the one proportion test, the test statistic is a z-score. We have the proportions from the two samples, p-hat1 = f1/n1 and p-hat2 = f2/n2, but we also need to create the pooled proportion p-bar = (f1 + f2)/(n1 + n2).
Here's an example from the polling data from last year.
Question: Was John McCain's popularity in 2008 Iowa significantly different from his popularity in Pennsylvania?
Let's assume we don't know either way, so it will be a two tailed test. Polling data traditionally uses the 95% confidence level, so that means the z-score will have to be either greater than or equal to 1.96 or less than or equal to -1.96 for us to reject the null hypothesis. Here are our numbers, with Iowa as the first data set.
f1 = 263
n1 = 658
p-hat1 = .400
f2 = 283
n2 = 657
p-hat2 = .430
p-bar = (263+283)/(658+657) = .415 (q-bar = .585)
Type this into your calculator.
(.400-.430)/sqrt(.415*.585/658+.415*.585/657)[enter]
The answer is -1.103..., which rounds to -1.10. This would say the difference we see in the two samples is not enough to convince us of a significant difference in popularity for McCain between the two states, so we would fail to reject the null hypothesis. In the actual election, McCain had 45.2% of the vote in Pennsylvania and 44.8% of the vote in Iowa, which are fairly close to equal.
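Here is a rough Python sketch of the z-score calculation, using the rounded proportions from the notes so the result matches the -1.10 above; only the standard library is needed.

from math import sqrt

# Rounded values from the notes: p-hat1 = .400 (Iowa), p-hat2 = .430 (Pennsylvania)
p_hat1, n1 = 0.400, 658
p_hat2, n2 = 0.430, 657
p_bar = 0.415            # pooled proportion (263 + 283) / (658 + 657), rounded
q_bar = 1 - p_bar

z = (p_hat1 - p_hat2) / sqrt(p_bar * q_bar / n1 + p_bar * q_bar / n2)
print(round(z, 2))       # about -1.10, inside the +/-1.96 cutoffs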
Student's t-scores
Confidence intervals are used over and over again in statistics, most especially in trying to find out what value the parameter of the population has, when all we can effectively gather is a statistic from a sample. For numerical data, we aren't allowed to use the normal distribution table for this process, because the standard deviation sx of a sample isn't a very precise estimator of sigmax of the underlying population. To deal with this extra level of uncertainty, a statistician named William Gossett came up with the t-score distribution, also known as Student's t-score because Gossett published all his work under the pseudonym Student. He used this fake name for publishing to get around a ban on publishing in journals established by his superiors at the Guinness Brewing Company where he worked.
The critical t-score values are published in Table A-3. The values depend on the degrees of freedom, which in the case of a single sample set of data is equal to n-1. For every degree of freedom, we could have another positive and negative t-score table two pages long, just like the z-score table, but that would take up way too much room, so statistics textbooks resort instead to publishing just the highlights. There are five columns on the table, each column labeled with a number for "Area in One Tail" and "Area in Two Tails". Let's look at the row for 13 degrees of freedom.
1 tail___0.005______0.01_____0.025______0.05______0.10
2 tails__0.01_______0.02_____0.05_______0.10______0.20
13_______3.012_____2.650_____2.160_____1.771_____1.350
What this means is that if we have a sample of size 14, then the degrees of freedom are 13 and we can use these numbers to find the cut-off points for certain percentages. The formula for t-scores looks like the formula for z-scores: where z = (x - mux)/sigmax, we have t = (x - x-bar)/sx. Because we don't know sigmax, we use the t-score table. For example, the second column in row 13 is the number 2.650. This means that in a sample of 14, a Student's t-score of -2.650 is the cutoff for the bottom 1%, a t-score of +2.650 is the cutoff for the top 1% and the middle 98% is between t-scores of -2.650 and +2.650.
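Here is a rough Python sketch that reproduces a row of Table A-3, assuming scipy is available; ppf is the inverse of the cumulative distribution, so we feed it 1 minus the one-tail area.

from scipy.stats import t

df = 13
for one_tail_area in (0.005, 0.01, 0.025, 0.05, 0.10):
    critical = t.ppf(1 - one_tail_area, df)   # area to the left is 1 - (area in one tail)
    print(one_tail_area, round(critical, 3))  # 3.012, 2.650, 2.160, 1.771, 1.350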
Using the t-score table
Let's say we have a t-score of 2.53 and n = 25, which means the degrees of freedom are 25-1 = 24. Here is the line of the t-score table that corresponds to d.f. = 24.
1 tail___0.005______0.01_____0.025______0.05______0.10
2 tails__0.01_______0.02_____0.05_______0.10______0.20
24_______2.797_____2.492_____2.064_____1.711_____1.318
What does this mean for our t-score of 2.53? If it were a z-score, the look-up table would give us an answer to four digits, .9943, which is beyond the 95% confidence threshold for one tail (.9943 > .9500) but not beyond the 99% confidence threshold for one tail, because those thresholds are .9950 high and .0050 low. On the t-score table, all we can say is that 2.53 is between 2.492 and 2.797, the closest scores on our line. In a two-tailed test, it is beyond the 0.02 threshold (which would be 98% confidence, a number we don't use much) but not beyond the 99% threshold. In a one-tailed (high) test, our t-score is between the 0.01 and 0.005 columns, which means it passes the 99% threshold. Unlike the z-score table, the t-score table only works with positive values, so if we get a negative t-score in a test, we follow these rules.
1. You have a negative t-score and the test is two tailed. Take the absolute value of the t-score and work with it.
2. You have a negative t-score and the test is one tailed low. Again, the absolute value will work.
3. You have a positive t-score and the test is one tailed low. This would be a problem, since only a negative t-score is useful in a one-tailed low test. You should fail to reject H0.
In the example below, we have yet another choice which always lets a one-tailed test be a one-tailed high test.
Two averages from two populations
In the tests to see if the average of some numerical value is significantly different when comparing two populations, we need the averages, standard deviations and sizes of both populations. The score we use is a t-score and the degrees of freedom is the smaller of the two sample sizes minus 1.
Question: Do female Laney students sleep a number of hours each night different from male Laney students?
This uses data sets from a previous class. Here are the numbers for the students who submitted data, with the males listed as group #1. Again, let's assume a two-tailed test, since we don't have any information going in which should be greater, and let's do this test to 90% level of confidence.
With a test like this, we can arbitrarily choose which set is the first set and which is the second. Let's do it so x-bar1 > x-bar2. This way, our t-score will be positive, which is what the table expects.
H0: mu1 = mu2 (average hours of sleep are the same for males and females at Laney)
x-bar1 = 7.54
s1 = 1.47
n1 = 12
x-bar2 = 7.31
s2 = .94
n2 = 26
The degrees of freedom will be 12-1=11, and 10% in two tails gives us the thresholds of +/-1.796. Here is what to type into the calculator.
(7.54-7.31)/sqrt(1.47^2/12+.94^2/26)[enter]
0.4971...
2 tails__0.01_______0.02_____0.05_______0.10______0.20
11_______3.106_____2.718_____2.201_____1.796_____1.363
This number is less than every threshold, and so does not impress us enough to make us reject the null hypothesis. It's possible that larger samples would give us numbers that would show a difference, which if true would mean this example produced a Type II error, but we have no proof of that.
Wednesday, March 26, 2014
Notes for March 18 and 20
The topic for the next few weeks is hypothesis testing. The main idea
is that experiments must be conducted to test the validity of an idea,
which is called a hypothesis. There are always two hypotheses
available, the null hypothesis H0 (pronounced "H zero" or "H nought") and the alternate hypothesis HA (pronounced "H A").
The standard is to assume the null hypothesis is true, which says that
nothing special is happening, which in most cases means that two things
we can measure should be equal or close to it. The alternate
hypothesis says the two measurements are different. We can have one
tailed high tests, where we want "large" positive test statistics. In
one tailed low tests, only negative test statistics with "large"
absolute value will do. In a two-tailed test, a "large" absolute value,
positive or negative, will work. We will only accept the
alternate hypothesis if the experiment produces impressive results given
our particular criteria for that test.
The basics of hypothesis testing are similar to the ideals of the English legal system, which is also the system used in United States courts, that a defendant is presumed innocent until proven guilty. There are different levels of proof of guilt in different trials, whether it is beyond a reasonable doubt or the less rigorous standard of preponderance of evidence.
In a case involving an alleged crime, there is the reality of what the defendant did and the result of the trial. If the defendant did the illegal act, then being found guilty is the correct result under the law. If the reality is that the defendant didn't do the act, the correct result would be a not guilty verdict.
The reasonable doubt standard is put in place in theory to make sending an innocent person to jail unlikely, and this is called a Type I error. The best known Type I result in legal history is Jesus Christ.
It is also possible that a person who did a crime will be found not guilty. This is called a Type II error. When I ask my students for an example of Type II error, O.J. Simpson's name still rings out the loudest.
In hypothesis testing, there is the reality and the result of the experiment. If H0 is true, the two things measured are equal or pretty close to equal. If HA is true, they are significantly different.
If the experiment produces a test statistic that is beyond the threshold we set for it, and "beyond" could mean lower if it is a one-tailed low test, or higher if it is a one-tailed high test, or either lower or higher in a two-tailed test, then we reject H0. If the test statistic fails to get "beyond" the threshold, we fail to reject H0.
Rejecting a true null hypothesis is a Type I error. Failing to reject a false null hypothesis is a Type II error.
In class, we discussed Sir Ronald Fisher and his hypothesis testing of the lady who said she could tell the difference in taste between tea poured into milk or milk poured into tea.
Here are the things that have to be done to make such an experiment work.
#1 Define the null hypothesis. In modern experiments, the null hypothesis is always defined as an equation. In a proportion test, the equation will be concerning p, the true probability of success. In the lady tasting tea, we would assume if nothing special is happening, then she is just guessing whether the tea or milk was poured first, and the probability of being correct on any given trial is 50% or .5. We write this as follows.
H0: p = .5
#2 Pick a threshold. The trials we are going to perform are taste tests where the lady cannot see the tea-milk mixture being poured. We have to decide on how high a test statistic we will consider impressive. The three standard choices are 90% confidence, 95% confidence or 99% confidence. For experiments in the medical field, where the decision is whether or not to bring a new drug to market, the 99% confidence level is common. For an experiment like this, where the result is not truly earth shattering, we might decide to use the 95% confidence threshold.
The experiment will produce a z-score. The thresholds for high z-scores are as follows:
90% threshold: z = 1.28
95% threshold: z = 1.645
99% threshold: z = 2.325
#3 Decide on the number of trials in an experiment. There is a tug-of-war in deciding the number of trials. More trials produce numbers we can be more confident in, but more trials are also more expensive and more time consuming. In the case of the lady tasting tea, we don't want to keep her drinking tea mixtures for hours.
Different books set different standards for the minimum number of trials based on np and nq. Some say both np > 5 and nq > 5. Others say both numbers should be greater than 10, yet others say 15. The standard that np >= 10 and nq >= 10 can be connected to the standard that says n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0 by a little algebraic manipulation.
np - 3*sqrt(pqn) > 0 [add the square root to both sides]
np > 3*sqrt(pqn) [square both sides]
n^2*p^2 > 9pqn [divide both sides by np]
np > 9q
Since q must be less than 1, but can be as close to 1 as we want, set it equal to 1 and the inequality becomes
np > 9, which we can change to np >= 10.
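Here is a minimal sketch of how we might check a proposed n and p against both standards; the function name is my own invention, and the second check is the n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0 condition from above.

import math

def check_sample_size(n, p, minimum=10):
    # Check np >= minimum and nq >= minimum, plus the 3-standard-deviation rule.
    q = 1 - p
    spread = 3 * math.sqrt(n * p * q)
    rule_np_nq = (n * p >= minimum) and (n * q >= minimum)
    rule_3_sd = (n * p - spread > 0) and (n * p + spread < n)
    return rule_np_nq, rule_3_sd

print(check_sample_size(10, 0.5))  # (False, True): ten trials at p = .5 fail np >= 10
print(check_sample_size(20, 0.5))  # (True, True): twenty trials pass both standards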
For example, let's look at the different possible positive z-score results if the lady were given ten trials, which would be enough if we used the lowest standard of np >= 5 and nq >= 5. (A short Python check of these numbers follows the list below.)
10 correct out of 10: z = 3.16227... ~= 3.16, which is above the 99% threshold.
(look-up table: .9992)
9 correct out of 10: z = 2.52982... ~= 2.53, which is above the 99% threshold.
(look-up table: .9943)
8 correct out of 10: z = 1.89737... ~= 1.90, which is above the 95% threshold, but not the 99%.
(look-up table: .9713)
7 correct out of 10: z = 1.26491... ~= 1.26, which is below the 90% threshold.
(look-up table: .8962)
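Here is the short Python check promised above. It uses the one-proportion z-score formula z = (p-hat - p)/sqrt(pq/n), rounds z to two decimal places the way we would before going to a printed table, and uses scipy's norm.cdf for the look-up table value; Python and scipy are my choice of tools, not part of the original experiment.

import math
from scipy.stats import norm

n, p = 10, 0.5
se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion

for correct in (10, 9, 8, 7):
    z = round((correct / n - p) / se, 2)  # round to two decimals, like a printed z table
    print(f"{correct} of {n}: z = {z:.2f}, look-up table: {norm.cdf(z):.4f}")

# 10 of 10: z = 3.16, look-up table: 0.9992
#  9 of 10: z = 2.53, look-up table: 0.9943
#  8 of 10: z = 1.90, look-up table: 0.9713
#  7 of 10: z = 1.26, look-up table: 0.8962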
If we set the bar at the 90% threshold, she could impress us by getting 8 right out of 10 or better. Likewise, at the 95% threshold, 8 of 10 will be beyond the threshold and the result would make us reject H0. At the 99% threshold, she would have to get 9 of 10 or 10 of 10 to make a z-score that breaks the threshold.
#4 Interpreting the test statistic. Let's say for the sake of argument that the lady got 8 of 10 correct. (There is a book about 20th Century statistics entitled The Lady Tasting Tea, where a witness to the experiment says he can't recall how many times she was tested, but the lady got a perfect score.) If we set the threshold at 90% confidence or 95% confidence, we would be impressed by the z score of 1.90 and we would reject H0, which says that we don't think she is "just guessing", but actually has the talent she says she has. If we set the value at 99% confidence, we would fail to reject H0.
Here's the thing. We could be wrong. If we reject H0 incorrectly, it means she was just guessing and she was very lucky during this test, a Type I error. A z-score of 1.90 corresponds to a look-up table value of .9713, so the probability of a lucky guesser getting 8 of 10 or better is 1 - .9713 = .0287, or about 2.87%; this probability is called the p-value in hypothesis testing. She passes the test by being better than 97.13% of lucky guessers, so if she is a lucky guesser, she would fool anyone who had set the threshold at 90% or 95%.
If we fail to reject H0, which we would do if we set the threshold at 99%, this could also be an error, but this time it would be a Type II error. Under this scenario, she got 8 of 10 but would usually do better. It takes more difficult math to figure out how good she actually is and how unlucky she had to be to get only 8 of 10. The probability of a Type I error is called alpha, and it is determined by the threshold. The probability of a Type II error is called beta, and it is usually explained in greater detail in the class after the introduction to statistics.
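Here is a minimal sketch of that p-value calculation and the decision at each threshold, again leaning on scipy for the normal cdf; the alpha values are just 1 minus the confidence level.

from scipy.stats import norm

z = 1.90                    # z-score for 8 correct out of 10
p_value = 1 - norm.cdf(z)   # one-tailed (high) p-value, about .0287
print(f"p-value = {p_value:.4f}")

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence  # probability of a Type I error at this threshold
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{confidence:.0%} confidence (alpha = {alpha:.2f}): {decision}")

# We reject H0 at 90% and 95% confidence but fail to reject at 99%.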
A low-tailed test
Let's say we want a low error rate. Unlike the lady tasting tea who needed a lot of right answers to impress us, now we need a low score to get a result that will make us reject the null hypothesis.
Now our z-score thresholds are
90% threshold: z = -1.28
95% threshold: z = -1.645
99% threshold: z = -2.325
Let's say we want to be convinced our error rate is less than 10% and we want to be convinced to the 95% confidence level.
H0: p = 0.10
HA: p < 0.10
n = 50
10% of 50 is 5, so we should check to see what happens at f = 4, 3, 2, 1 and 0 errors. Typing this into the calculator will look like the expression below (a short Python sketch of the same loop follows the list of results).
(f/50 - .1)/sqrt(.1*.9/50)
f = 4 gives a rounded z-score of -.47
(look-up table: .3192 fail to reject H0)
f = 3 gives a rounded z-score of -.94
(look-up table: .1736 fail to reject H0)
f = 2 gives a rounded z-score of -1.41
(look-up table: .0793 fail to reject H0 at 95% confidence, but reject at 90%)
f = 1 gives a rounded z-score of -1.89
(look-up table: .0294 reject H0 at 95% confidence, but fail to reject at 99%)
f = 0 gives a rounded z-score of -2.36
(look-up table: .0091 reject H0 at any confidence level we use: 90%, 95% or 99%)
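Here is the Python sketch mentioned above for the whole loop; the variable names are mine, and scipy supplies the normal cdf and inverse cdf.

import math
from scipy.stats import norm

n, p0, confidence = 50, 0.10, 0.95
se = math.sqrt(p0 * (1 - p0) / n)    # sqrt(.1*.9/50)
z_threshold = -norm.ppf(confidence)  # about -1.645 for 95% confidence

for f in (4, 3, 2, 1, 0):            # number of errors observed out of 50
    z = round((f / n - p0) / se, 2)
    decision = "reject H0" if z < z_threshold else "fail to reject H0"
    print(f"f = {f}: z = {z:.2f}, look-up table: {norm.cdf(z):.4f}, {decision}")

# At 95% confidence we reject H0 only when f = 1 or f = 0.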