Here are three sets of seven numbers each. The x values are the number of points scored by Kobe Bryant in the playoff games against the Houston Rockets, the y values are the points scored by the Lakers in those same games, and the z values are the difference between the Lakers' score and the Rockets' score in each game.
x: __32__40__33__35__26__32__33
y: __92_111_108__87_118__80__89
z: __-8_+13_+14_-12_+30_-15_+19
1. Find the correlation coefficients for all three pairs of number sets, rx,y, rx,z and ry,z.
2. What are the cut-off values for 95% confidence and 99% confidence for correlation?
3. Which of the pairs of sets has the highest absolute correlation and what confidence level does that correlation exceed?
Round all answers to three digits. Answers in the comments.
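For checking your work after trying the problem by hand, here is a short Python sketch of the correlation coefficient computed straight from the definition. The helper name pearson_r is my own, and the requested answers are still left for the comments.

```python
# A sketch of the correlation coefficient r computed from the definition;
# the function name pearson_r is mine, not part of the problem.

def pearson_r(xs, ys):
    """Correlation coefficient for the paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and the cross-product sum.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

x = [32, 40, 33, 35, 26, 32, 33]     # Kobe Bryant's points
y = [92, 111, 108, 87, 118, 80, 89]  # Lakers' points
z = [-8, 13, 14, -12, 30, -15, 19]   # Lakers' margin of victory or defeat

r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
```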
Thursday, May 28, 2009
Wednesday, May 20, 2009
Topic for final exam
The final exam will be given at two times.
May 22: 8 a.m. to 10 a.m.
May 29: 10 a.m. to noon.
The final is comprehensive. You will need your yellow sheets, a calculator, scratch paper and a pencil.
The test will be four or five pages long. The amount from each part of the class will be:
25%-30% from the first exam
25%-30% from the second exam
40%-50% from after the second exam
On the list of topics below, any topic with an asterisk (*) means that though it might have been introduced before the first or second exam, it gets used throughout the class, so I don't necessarily count it toward the percentage of problems promised for each section.
Any topic in bold means you are expected to know how to get the answer without a formula or instructions being provided. In many cases, this means knowing how to use your calculator properly.
First exam
==========
frequency tables
stem and leaf plots
five number summary
box and whiskers
IQR and outliers for box and whiskers
Mean*, median*, mode, mid-range
parameter* and statistic*
population* and sample*
categorical data*
numerical data*
Bar charts
Pie charts
Line charts
Ogives
dotplots
percentage increase and decrease
contingency tables*
degrees of freedom*
conditional probability*
frequency and relative frequency
inclusion-exclusion
complementary event
order of operations*
Second exam through April 8
==========
standard deviation*
confidence intervals and margin of error
t-scores and z-scores*
raw scores, z-scores and percentages*
common critical values
Central Limit Theorem
Confidence of victory
After second exam
=======
Binomial coefficients and falling factorial
expected value of correct results
dependent and independent probabilities
Classic and modern parimutuel
expected value of a game
exactly r correct out of n trials
Bayesian probabilities
Hypothesis testing:
null hypothesis, alternative hypothesis, type I error, type II error
test statistic
threshold for xx% confidence (one-tailed high, one-tailed low, two-tailed)
one sample testing
two samples testing
Correlation (rx,y and the equation of the line yp = ax + b)
If you have any specific questions or want to make time to talk to me before the final, send me an e-mail and we can make an appointment.
Monday, May 18, 2009
Class notes for 5/18: Regression and correlation
When we have a data set, sometimes we collect more than one variable of information about the units. For example, in our class survey, among the numerical variables were the height in inches, the GPA, the opinion about the difficulty of the class, age and average hours of sleep per night.
A question about two variables is if they are related to one another in some simple way. One simple way is correlation, which can be positive or negative. Here is a general definition of each.
Positive correlation between two numerical variables, call them x and y, means that the high values of x tend to be paired with the high values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the low values of y.
The variables x and y show negative correlation if the high values of x tend to be paired with the low values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the high values of y.
If we pick two variables at random, we do not expect to see correlation. We can write this as a null hypothesis, where the test statistic is rx,y, the correlation coefficient; the null hypothesis of no correlation is rx,y = 0. The values of rx,y are always between -1, which means perfect negative correlation, and +1, which means perfect positive correlation.
The seventh page of the yellow sheets gives us threshold numbers for the 99% confidence level and 95% confidence level for correlation given the number of points n. For instance, when n = 5, the thresholds are .878 for 95% confidence and .959 for 99% confidence. This splits up the numbers from -1 to 1 into five regions.
-1 <= rx,y <= -.959: Very strong negative correlation
-.959 < rx,y <= -.878: Strong negative correlation
-.878 < rx,y < .878: The correlation is not particularly strong, regardless of positive or negative.
.878 <= rx,y < .959: strong positive correlation
.959 <= rx,y <= 1: very strong positive correlation
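The five regions above can be written as a small lookup. Here is a Python sketch using the n = 5 thresholds (.878 for 95% confidence, .959 for 99% confidence); the function name is mine.

```python
# A sketch of the five-region classification; the thresholds are the
# n = 5 values from the yellow sheets, passed in as defaults.

def correlation_strength(r, t95=0.878, t99=0.959):
    """Classify a correlation coefficient r against the two thresholds."""
    if r <= -t99:
        return "very strong negative"
    if r <= -t95:
        return "strong negative"
    if r < t95:
        return "not particularly strong"
    if r < t99:
        return "strong positive"
    return "very strong positive"
```

For a different sample size n, the same function works with that size's thresholds substituted for the defaults.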
Just like with any hypothesis test, we should decide the confidence level before testing. This is a two-tailed test, because whether correlation is positive or negative, the relationships between number sets can often give us vital scientific information.
There is an important warning: Correlation is not causation. Just because two number sets have a relation, it doesn't mean that x causes y or y causes x. Sometimes there is a hidden third factor that is the cause of both of the things we are looking at. Sometimes, it's random chance and there is no causative agent at all.
Here is a set of five points, listed as (x,y) in each case.
(1,1)
(2,2)
(3,4)
(4,4)
(6,5)
As we can see, the points are ordered from low to high in both coordinates, so we expect some correlation. If we input the points into our calculator, we get a value for r (which is the same as rx,y) of .933338696..., which is strong positive correlation, but not very strong positive correlation. Assuming the 95% confidence level is good enough for us, we can use the a and b variables from our calculator to give us the equation of the line
yp = .797x + .649
This is called the predictor line (that's where the p comes from) or the line of regression or the line of least squares. Any such line for a given data set meets two important criteria. It passes through the centroid (x-bar, y-bar), the center point of all the data, and it minimizes the sum of the squares of the residuals, where the residual y - yp is the vertical distance from a point to the line.
Let's find the absolute values of the residuals for each of the five points, using the rounded values of a and b.
Point (1,1): |1 - .797*1 - .649| = 0.446
Point (2,2): |2 - .797*2 - .649| = 0.243
Point (3,4): |4 - .797*3 - .649| = 0.96
Point (4,4): |4 - .797*4 - .649| = 0.163
Point (6,5): |5 - .797*6 - .649| = 0.431
As we can see, the point (3,4) is farthest from the line, while the point (4,4) is the closest. The centroid (3.2, 3.2) is exactly on the line if you use the un-rounded values of a and b, and even using the rounded values, the centroid only misses the line by .0006.
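The calculator's a and b come from the standard least-squares formulas, and we can check both coefficients and the centroid claim with a short Python sketch; the function name fit_line is mine.

```python
# A sketch of the least-squares formulas behind the calculator's a and b,
# applied to the five example points above.

def fit_line(points):
    """Return slope a and intercept b of the least-squares line yp = ax + b."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    a = sxy / sxx
    b = y_bar - a * x_bar   # this choice forces the line through the centroid
    return a, b

points = [(1, 1), (2, 2), (3, 4), (4, 4), (6, 5)]
a, b = fit_line(points)
# a rounds to .797 and b to .649, and the centroid (3.2, 3.2)
# lies on the unrounded line.
```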
In class, we used five points, but the last point was (5,6) instead of (6,5). This changes the numbers. rx,y goes up to .973328527..., which is above the 99% confidence threshold. The formula for the new predictor line is
yp = 1.2x - .2
Where we see the difference in these two different examples is in the residuals.
Point (1,1): |1 - 1.2*1 + .2| = 0
Point (2,2): |2 - 1.2*2 + .2| = 0.2
Point (3,4): |4 - 1.2*3 + .2| = 0.6
Point (4,4): |4 - 1.2*4 + .2| = 0.6
Point (5,6): |6 - 1.2*5 + .2| = 0.2
The closest point is now exactly on the line, which is a rarity, but even the farthest away point is only .6 units away, closer than the farthest away on the line with the lower correlation coefficient.
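The residual comparison between the two examples can also be sketched in Python, using the rounded coefficients from the text (so small rounding artifacts remain); the helper name is mine.

```python
# A sketch comparing the residuals of the two example fits, using the
# rounded values of a and b given in the text.

def abs_residuals(points, a, b):
    """|y - yp| for each point, where yp = a*x + b."""
    return [abs(y - (a * x + b)) for x, y in points]

first = [(1, 1), (2, 2), (3, 4), (4, 4), (6, 5)]
second = [(1, 1), (2, 2), (3, 4), (4, 4), (5, 6)]

res1 = abs_residuals(first, 0.797, 0.649)   # max is 0.96 at (3,4)
res2 = abs_residuals(second, 1.2, -0.2)     # max is only 0.6
```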
As we get more points in our data set, we lower our threshold that shows correlation strength. This way, a few points that are outliers do not completely ruin the chances of the data showing correlation, though sometimes strong outliers can mess up the data set and the correlation coefficient gets so close to zero that we cannot reject the null hypothesis that the two variables are not simply related.
Labels: class notes, correlation, predictor line, regression
inputting the data for the worksheet using the TI-30XIIs
To get into the correct mode, follow these instructions.
[2nd][data]
move the underline so it is under 2-var, then press [enter].
[2nd][data]
move the underline so it is under clrdata, then press [enter].
now we enter the data.
[data]
x1 = 856.7[down]
y1 = 15.22[down]
x2 = 907.9[down]
y2 = 16.79[down]
x3 = 974.1[down]
y3 = 19.81[down]
x4 = 930.9[down]
y4 = 17.88[down]
x5 = 885.9[down]
y5 = 16.84[down]
x6 = 886.1[down]
y6 = 16.87[down]
x7 = 926.8[down]
y7 = 17.48[down]
x8 = 928.4[down]
y8 = 17.34[down]
x9 = 802.9[down]
y9 = 12.22[down]
x10 = 834.8[down]
y10 = 11.16[down]
x11 = 734.3[down]
y11 = 9.37[down]
x12 = 816.3[down]
y12 = 10.26[down]
x13 = [statvar] (instead of entering a 13th data point, press [statvar] to see the results)
The numbers you need are
rx,y = .946053079... ~= .946 (Above the 99% threshold of .708)
a = 0.048098608... ~= .0481
b = -26.92322636... ~= -26.9232
So the formula to find the points on the line is yp = ax + b using the exact values, or yp = .0481x - 26.9232 using the rounded values.
Answer for both versions to the nearest penny in the comments.
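As a cross-check on the calculator, here is a Python sketch that reproduces the 2-var statistics from the same twelve data pairs; the formulas are the standard least-squares ones, and the variable names are mine.

```python
# A sketch reproducing the TI-30XIIs 2-var results in Python, using the
# twelve (x, y) pairs entered above.

data = [(856.7, 15.22), (907.9, 16.79), (974.1, 19.81), (930.9, 17.88),
        (885.9, 16.84), (886.1, 16.87), (926.8, 17.48), (928.4, 17.34),
        (802.9, 12.22), (834.8, 11.16), (734.3, 9.37), (816.3, 10.26)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
sxx = sum((x - x_bar) ** 2 for x, _ in data)
syy = sum((y - y_bar) ** 2 for _, y in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)

r = sxy / (sxx * syy) ** 0.5   # rounds to .946
a = sxy / sxx                  # rounds to .0481
b = y_bar - a * x_bar          # rounds to -26.9232
```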
Labels: calculators, correlation, two variable statistics
Thursday, May 14, 2009
Class notes for 5/13: Two population tests for proportions and averages
The first hypothesis tests we studied were checking to see if an experimental sample produced a value that was significantly different from some known value produced either by math or by earlier experiments.
For example, in the lady tasting tea, since she has two choices each time a mixture is given to her, the math would say that her chance of getting it right by just guessing is 50% or H0: p = .5. In testing psychic abilities, there are five different symbols on the cards, so random guessing should get the right answer 1 out of 5 times, or 20%, so H0: p = .2.
In a test for average human body temperature, the assumption of 98.6 degrees Fahrenheit being the average came from an experiment performed in the 19th Century.
We can also do tests by taking samples from two different populations. The null hypothesis, as always, is an equality, the assumption that the parameters from the two different populations are the same. As always, we need convincing evidence that the difference is significant to reject the null hypothesis, and we can choose just how convincing that evidence must be by setting the confidence level, which is usually 90%, 95% or 99%.
Two proportions from two populations
Like with the one proportion test, the test statistic is a z-score. We have the proportions from the two samples, p-hat1 = f1/n1 and p-hat2 = f2/n2, but we also need to create the pooled proportion p-bar = (f1 + f2)/(n1 + n2).
Here's an example from the polling data from last year.
Question: Was John McCain's popularity in Iowa significantly different from his popularity in Pennsylvania?
Let's assume we don't know either way, so it will be a two-tailed test. Polling data traditionally uses the 95% confidence level, so the z-score will have to be either greater than or equal to 1.96 or less than or equal to -1.96 for us to reject the null hypothesis. Here are our numbers, with Iowa as the first data set.
f1 = 263
n1 = 658
p-hat1 = .400
f2 = 283
n2 = 657
p-hat2 = .430
p-bar = (263+283)/(658+657) = .415 (q-bar = .585)
Type this into your calculator.
(.400-.430)/sqrt(.415x.585/658+.415x.585/657)[enter]
The answer is -1.103..., which rounds to -1.10. This would say the difference we see in the two samples is not enough to convince us of a significant difference in popularity for McCain between the two states, so we would fail to reject the null hypothesis. In the actual election, McCain had 45.2% of the vote in Pennsylvania and 44.8% of the vote in Iowa, which are fairly close to equal.
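The keystrokes above can be mirrored in a short Python sketch, using the same rounded proportions as the text; the variable names are mine.

```python
# A sketch of the pooled two-proportion z test, using the rounded
# values given in the text.
from math import sqrt

n1, n2 = 658, 657        # Iowa and Pennsylvania poll sizes
p_hat1 = 0.400           # 263/658, rounded as in the text
p_hat2 = 0.430           # 283/657, rounded as in the text
p_bar = 0.415            # pooled proportion (263+283)/(658+657)
q_bar = 1 - p_bar        # .585

z = (p_hat1 - p_hat2) / sqrt(p_bar * q_bar / n1 + p_bar * q_bar / n2)
# z rounds to -1.10, inside (-1.96, 1.96), so we fail to reject H0.
```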
Two averages from two populations
In the tests to see if the average of some numerical value is significantly different when comparing two populations, we need the averages, standard deviations and sizes of both samples. The score we use is a t-score, and the number of degrees of freedom is the smaller of the two sample sizes minus 1.
Question: Do female Laney students sleep more hours each night than male Laney students?
We will take our data from the larger of the two class surveys, Data Set #2. Here are the numbers for the students who submitted data, with the females listed as group #1. Again, let's assume a two-tailed test, since we don't have any information going in which should be greater, and let's do this test to 90% level of confidence.
H0: mu1 = mu2 (average hours of sleep are the same for males and females at Laney)
x-bar1 = 7.31
s1 = .94
n1 = 26
x-bar2 = 7.54
s2 = 1.47
n2 = 12
The degrees of freedom will be 12-1=11, and 10% in two tails gives us the thresholds of +/-1.796. Here is what to type into the calculator.
(7.31-7.54)/sqrt(.94^2/26+1.47^2/12)[enter]
-0.4971...
This number is between the thresholds, and so does not impress us enough to make us reject the null hypothesis. It's possible that larger samples would give us numbers that would show a difference, which if true would mean this example produced a Type II error, but we have no proof of that.
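The two-sample t computation above can be sketched in Python with the survey values as listed; the variable names are mine.

```python
# A sketch of the two-sample t test for the sleep question, using the
# survey statistics listed above.
from math import sqrt

x_bar1, s1, n1 = 7.31, 0.94, 26   # female Laney students
x_bar2, s2, n2 = 7.54, 1.47, 12   # male Laney students

t = (x_bar1 - x_bar2) / sqrt(s1**2 / n1 + s2**2 / n2)
df = min(n1, n2) - 1              # 11 degrees of freedom
# t rounds to -0.497, between -1.796 and 1.796, so we fail to reject H0.
```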
Tuesday, May 12, 2009
Class notes for 5/11: Hypothesis testing for the mean of a population
t-scores and p values
If we have a z-score between -3.5 and +3.5, Table A-2 lets us find the p value associated with that z-score accurate to four decimal places. For example, if z = 1.71, the p value is .9564, which is to say that z-score is higher than 95.64% of data in a normally distributed set.
The t-score table is smaller, and to read a t-score correctly, we also need n, the size of the sample, because that gives us the Degrees of Freedom, which is n-1 in this case.
If n=10, then d.f. = 9, and the t-score table reads as follows.
                Area in One Tail
        0.005    0.01     0.025    0.05     0.10
df=9    3.250    2.821    2.262    1.833    1.383
If t = 1.71, that value isn't on our table, but because it lies between the values associated with 0.05 and 0.10, that means that score is in the top 10% of scores, but not in the top 5%.
If instead n=30 and d.f. = 29, here are the t-score values.
                Area in One Tail
        0.005    0.01     0.025    0.05     0.10
df=29   2.756    2.462    2.045    1.699    1.311
Now a t-score of 1.71 lies between 2.045 and 1.699, which means it is in the top 5%, but not the top 2.5%.
Like the z-score table, the t-score table is symmetric about the value t=0. If d.f.=29, t=-1.71 is a score in the bottom 5%, but not the bottom 2.5%.
Hypothesis testing for the mean of a population
Hypothesis testing for the mean of a population assumes we know the population mean from some previously obtained information. Perhaps that mean has changed over time or the previous information wasn't correct to begin with, but the null hypothesis assumes we know that mean, which we call mux. If we take a sample from the population, we will get the values x-bar, sx and n, and using those values and mux, we can get the t-score.
Just like with the hypothesis test for a proportion, the test can be one-tailed high, one-tailed low or two-tailed.
For example, if we were testing people who had studied using a special method and we were checking scores on a standardized test, we would only be impressed if the average went up, so a one-tailed high test would be appropriate.
If the experiment was dealing with a cholesterol drug, we would want to see a lower average reading, and a one-tailed low test would be used.
If we assume the average duration of a pop song on the charts today is the same as the duration of pop songs in the seventies, we can't assume beforehand if the new readings will be higher or lower, and would be surprised if the new average were significantly different in either direction, so a two-tailed test would be appropriate.
Here is some data we went over in class. In most textbooks, the 'normal' human body temperature is listed at 98.6 degrees Fahrenheit, based on the work of Dr. Carl Wunderlich back in the 19th Century. If we do a test, it should be a two-tailed test, since we would be surprised if the normal temperature is significantly higher or significantly lower than this. Since this is a medical experiment, let's use the 99% level of confidence.
The size of the sample was n=103, which means the degrees of freedom are 102. Our table doesn't have a listing for d.f.=102, and the next lowest available value is d.f.=100. Here are the table values for that row of Table A-3.
                Area in Two Tails
        0.01     0.02     0.05     0.10     0.20
df=100  2.626    2.364    1.984    1.660    1.290
With a two-tailed test at the 99% confidence level, this means we want the 0.01 column. The "middle" 99% of data lies between t-scores of -2.626 and 2.626. If the t-score lies in that range, we will fail to reject H0. If it is greater than or equal to 2.626 or less than or equal to -2.626, we will reject H0.
The values from this study found that x-bar = 98.2 degrees and the standard deviation sx was 0.62. Plugging into our t-score equation from above, we get
t = (98.2-98.6)/(0.62/sqrt(103)) = -6.547671977... ~= -6.548.
We don't get an exact p value for a number so far away from zero, but if we look at outlier z-score table, we can roughly approximate that this p value is somewhere around 1 in 1,000,000,000. We can say with 99% confidence that the average body temperature is not 98.6 degrees, but probably close to the sample average of 98.2 degrees. Our p value shows we could qualify for even greater confidence with our statement, but very rarely do tests ask for more than 99% confidence, and changing the criteria after the fact is not recommended. Still, publishing this incredibly tiny p value will convince people who can read a statistical report that the evidence is very strong indeed.
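The t-score computation for this study can be sketched in a few lines of Python; the variable names are mine.

```python
# A sketch of the body-temperature t-score: 98.6 is the null-hypothesis
# mean, and the sample values come from the study quoted above.
from math import sqrt

mu = 98.6                     # H0: the textbook normal temperature
x_bar, s, n = 98.2, 0.62, 103 # sample mean, standard deviation, size

t = (x_bar - mu) / (s / sqrt(n))
# t rounds to -6.548, far beyond the df=100 thresholds of +/-2.626,
# so we reject H0.
```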
This test also changed the idea of what should constitute a fever. Instead of one temperature of 100.4 degrees Fahrenheit being the absolute gauge, the temperature will fluctuate depending on the time of day, as do the normal temperature readings.
If we have a z-score between -3.5 and +3.5, Table A-2 lets us find the p value associated with that z-score accurate to four decimal places. For example, if z = 1.71, the p value is .9564, which is to say that z-score is higher than 95.64% of data in a normally distributed set.
The t-score table is smaller, and to read a t-score correctly, we also need n, the size of the sample, because that gives us the Degrees of Freedom, which is n-1 in this case.
If n=10, then d.f. = 9, and the t-score table reads as follows.
___________________________Area in One Tail_____________
_______0.005______0.01______0.025______0.05______0.10__________df=9___3.250_____2.821______2.262_____1.833_____1.383__________
If t = 1.71, that value isn't on our table, but because it lies between the values associated with 0.05 and 0.10, that means that score is in the top 10% of scores, but not in the top 5%.
If instead n=30 and d.f. = 29, here are the t-score values.
___________________________Area in One Tail_____________
_______0.005______0.01______0.025______0.05______0.10__________df=29__2.756_____2.462______2.045_____1.699_____1.311__________
Now a t-score of 1.71 lies between 2.045 and 1.699, which means it is in the top 5%, but not the top 2.5%.
Like the z-score table, the t-score table is symmetric about the value t=0. If d.f.=29, t=-1.71 is a score in the bottom 5%, but not the bottom 2.5%.
Hypothesis testing for the mean of a population
Hypothesis testing for the mean of a population assumes we know the population mean from some previously obtained information. Perhaps that mean has changed over time or the previous information wasn't correct to begin with, but the null hypothesis assumes we know that mean, which we call mux. If we take a sample from the population, we will get the values x-bar, sx and n, and using those values and mux, we can get the t-score.
Just like with the hypothesis test for a proportion, the test can be one-tailed high, one-tailed low or two-tailed.
For example, if we were testing people who had studied using a special method and we were checking scores on a standardized test, we would only be impressed if the average went up, so a one-tailed high test would be appropriate.
If the experiment was dealing with a cholesterol drug, we would want to see a lower average reading, and a one-tailed low test would be used.
If we assume the average duration of a pop song on the charts today is the same as the duration of pop songs in the seventies, we can't assume beforehand if the new readings will be higher or lower, and would be surprised if the new average were significantly different in either direction, so a two-tailed test would be appropriate.
Here is some data we went over in class. In most textbooks, the 'normal' human body temperature is listed at 98.6 degrees Fahrenheit, based on the work of Dr. Carl Wunderlich back in the 19th Century. If we do a test, it should be a two-tailed test, since we would be surprised if the normal temperature is significantly higher or significantly lower than this. Since this is a medical experiment, let's use the 99% level of confidence.
The size of the sample was n=103, which means the degrees of freedom are 102. Our table doesn't have a listing for d.f.=102, and the next lowest available value is d.f.=100. Here are the table values for that row of Table A-3.
___________________________Area in Two Tails____________
_________0.01______0.02_______0.05______0.10______0.20__
df=100__2.626_____2.364______1.984_____1.660_____1.290__
With a two-tailed test at the 99% confidence level, this means we want the 0.01 column. The "middle" 99% of data lies between t-scores of -2.626 and 2.626. If the t-score lies in that range, we will fail to reject H0. If it is greater than or equal to 2.626 or less than or equal to -2.626, we will reject H0.
The values from this study found that x-bar = 98.2 degrees and the standard deviation sx was 0.62. Plugging into our t-score equation from above, we get
t = (98.2 - 98.6)/(0.62/sqrt(103)) = -6.547671977... ~= -6.548.
We don't get an exact p value for a number so far away from zero, but if we look at the outlier z-score table, we can roughly approximate that this p value is somewhere around 1 in 1,000,000,000. We can say with 99% confidence that the average body temperature is not 98.6 degrees, but probably close to the sample average of 98.2 degrees. Our p value shows we could qualify for even greater confidence with our statement, but very rarely do tests ask for more than 99% confidence, and changing the criteria after the fact is not recommended. Still, publishing this incredibly tiny p value will convince people who can read a statistical report that the evidence is very strong indeed.
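For anyone who wants to check the arithmetic with a computer instead of a calculator, here is a short Python sketch of the body-temperature t-score computation:

```python
import math

# Hypothesis test for a mean: t = (x-bar - mu) / (s / sqrt(n))
x_bar = 98.2   # sample mean body temperature
mu = 98.6      # null-hypothesis mean (Wunderlich's 98.6 degrees)
s = 0.62       # sample standard deviation
n = 103        # sample size

t = (x_bar - mu) / (s / math.sqrt(n))
print(round(t, 3))  # -6.548
```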
This test also changed the idea of what should constitute a fever. Instead of one temperature of 100.4 degrees Fahrenheit being the absolute gauge, the temperature will fluctuate depending on the time of day, as do the normal temperature readings.
Labels:
hypothesis testing,
p-values,
Student's t-scores
Monday, May 11, 2009
Practice true-false questions about hypothesis testing.
1. If the confidence level is 90%, it is more common to make Type I errors than it is with a confidence level of 99%.
2. Proportion tests are never two-tailed.
3. If we reject H0, but in reality we shouldn't have, we have made a Type II error.
4. If we have a z-score of -1.38, we would reject H0 in a 90% confidence one-tailed low test.
5. If we have a t-score of -1.38 and n=7, we would reject H0 in a 90% confidence one-tailed low test.
6. The null hypothesis is always stated as an equation.
7. You can never start an experiment by assuming the alternate hypothesis is true.
8. If we have a z-score of -1.68, we would reject H0 in a 90% confidence two-tailed test.
9. For us to be 95% confident the lady tasting tea knew what she was doing, she had to get at least 95% of her answers correct.
10. You are allowed to do an experiment and decide what confidence level you want to use after you have seen the results.
Answers in the comments.
Class notes for 5/11: p-values and Confidence Levels
The test statistics we will use for hypothesis testing will differ, sometimes using z-scores, other times t-scores and yet other times the chi-square table. What all these tests have in common is that they correspond to probabilities, called p-values.
The idea is that if we have a test with 90% confidence, we will only reject H0 if an event happens 10% of the time or less.
Likewise, 95% confidence means we reject H0 when the probability of the event is less than or equal to 5%.
99% confidence means we reject H0 with tests that happen 1% of the time or less.
Here is how p-values correspond to the threshold where we reject the null hypothesis, which technically is the same as accepting the alternate hypothesis. When the test is one-tailed high, the p-value threshold is easy: .90 for 90% confidence, .95 for 95% confidence and .99 for 99% confidence. For one-tailed low, the p-value threshold equals 100% minus the confidence level. For a two-tailed test, the two tails have to add up to 100% minus the confidence level, so the 90% confidence level thresholds are at the p-values of 5% (.05) and 95% (.95).
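The thresholds just described can be written up as a small Python function. This follows the convention used in these notes, where the p-value is the area to the left of the test statistic; the function name is my own, not from the textbook.

```python
def reject_thresholds(confidence, tails):
    """p-value cutoff(s) for rejecting H0 in a 'high', 'low'
    or 'two' tailed test at the given confidence level."""
    alpha = round(1 - confidence, 10)   # round cleans up float fuzz
    if tails == 'high':
        return (confidence,)            # reject if p-value >= this
    if tails == 'low':
        return (alpha,)                 # reject if p-value <= this
    # two-tailed: reject outside the middle (confidence) band
    return (alpha / 2, 1 - alpha / 2)

print(reject_thresholds(0.90, 'two'))  # (0.05, 0.95)
```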
Remember, just because a test convinced us to reject the null hypothesis doesn't mean the null hypothesis is false. It could still be true, and we would be making a Type I error.
If we use the 90% confidence level, we should make Type I errors about 10% of the time.
At the 95% confidence level, the probability of Type I errors is 5%.
At 99% confidence level, the probability of Type I errors is 1%.
The lower likelihood of errors is why most tests done on medical data are done at the 99% confidence level.
The p-value is often published to show just how well the test did. Maybe you only asked to prove something to 90% confidence on a one-tailed high test, but the p-value is .9978. This shows people that read the findings that it would be strong enough data to reject H0 at even higher confidence levels.
Thursday, May 7, 2009
Class notes for 5/6: Hypothesis testing
The topic for the next few weeks is hypothesis testing. The main idea is that experiments must be conducted to test the validity of an idea, which is called a hypothesis. There are always two hypotheses available, the null hypothesis H0 (pronounced "H zero" or "H nought") and the alternate hypothesis HA (pronounced "H A"). The standard is to assume the null hypothesis is true, which says that nothing special is happening, which in most cases means that two things we can measure should be equal or close to it. The alternate hypothesis says the two measurements are different. We can have one-tailed high tests, where we want "large" positive test statistics. In one-tailed low tests, only negative test statistics with "large" absolute value will do. In a two-tailed test, "large" absolute value for either positive or negative numbers will work. We will only accept the alternate hypothesis if the experiment produces impressive results given our particular criteria for that test.
The basics of hypothesis testing are similar to the ideals of the English legal system, which is also the system used in United States courts, that a defendant is presumed innocent until proven guilty. There are different levels of proof of guilt in different trials, whether it is beyond a reasonable doubt or the less rigorous standard of preponderance of evidence.
In a case involving an alleged crime, there is the reality of what the defendant did and the result of the trial. If the defendant did the illegal act, then being found guilty is the correct result under the law. If the reality is that the defendant didn't do the act, the correct result would be a not guilty verdict.
The reasonable doubt standard is put in place in theory to make sending an innocent person to jail unlikely, and this is called a Type I error. The best known Type I result in legal history is Jesus Christ.
It is also possible that a person who did a crime will be found not guilty. This is called a Type II error. When I ask my students for an example of Type II error, O.J. Simpson's name still rings out the loudest.
In hypothesis testing, there is the reality and the result of the experiment. If H0 is true, the two things measured are equal or pretty close to equal. If HA is true, they are significantly different.
If the experiment produces a test statistic that is beyond the threshold we set for it, and "beyond" could mean lower if it is a one-tailed low test, or higher if it is a one-tailed high test, or either lower or higher in a two-tailed test, then we reject H0. If the test statistic fails to get "beyond" the threshold, we fail to reject H0.
Rejecting a true null hypothesis is a Type I error. Failing to reject a false null hypothesis is a Type II error.
In class, we discussed Sir Ronald Fisher and his hypothesis testing of the lady who said she could tell the difference in taste between tea poured into milk or milk poured into tea.
Here are the things that have to be done to make such an experiment work.
#1 Define the null hypothesis. In modern experiments, the null hypothesis is always defined as an equation. In a proportion test, the equation will be concerning p, the true probability of success. In the lady tasting tea, we would assume if nothing special is happening, then she is just guessing whether the tea or milk was poured first, and the probability of being correct on any given trial is 50% or .5. We write this as follows.
H0: p = .5
#2 Pick a threshold. The trials we are going to perform are taste tests where the lady cannot see the tea-milk mixture being poured. We have to decide on how high a test statistic we will consider impressive. The three standard choices are 90% confidence, 95% confidence or 99% confidence. For experiments in the medical field, where the decision is whether or not to bring a new drug to market, the 99% confidence level is common. For an experiment like this, where the result is not truly earth shattering, we might decide to use the 95% confidence threshold.
The experiment will produce a z-score; the thresholds for high z-scores are as follows:
90% threshold: z = 1.28
95% threshold: z = 1.645
99% threshold: z = 2.326
#3 Decide on the number of trials in an experiment. There is a tug-of-war in deciding the number of trials. More trials produces numbers we can be more confident in, but more trials is also more expensive and more time consuming. In the case of the lady tasting tea, we don't want to keep her drinking tea mixtures for hours.
Different books set different standards for the minimum number of trials based on np and nq. Some say both np > 5 and nq > 5. Others say both numbers should be greater than 10, yet others say 15. The standard that np >= 10 and nq >= 10 can be connected to the standard that says n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0 by a little algebraic manipulation.
np - 3*sqrt(pqn) > 0 [add the square root to both sides]
np > 3*sqrt(pqn) [square both sides]
n^2*p^2 > 9pqn [divide both sides by np]
np > 9q
Since q must be less than 1, but can be as close to 1 as we want, set it equal to 1 and the inequality becomes
np > 9, which we can change to np >= 10.
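The algebra above can be spot-checked numerically. This little Python sketch (not part of the class materials) compares the "three standard deviations above zero" condition with the simplified np > 9q condition for p = 0.5:

```python
import math

# np - 3*sqrt(p*q*n) > 0 is the 3-standard-deviation condition;
# the algebra above reduces it to np > 9q. For p = q = 0.5 the
# two conditions should agree at every sample size n.
def three_sd_ok(n, p):
    q = 1 - p
    return n * p - 3 * math.sqrt(p * q * n) > 0

def np_rule_ok(n, p):
    return n * p > 9 * (1 - p)

for n in range(1, 50):
    assert three_sd_ok(n, 0.5) == np_rule_ok(n, 0.5)
print("np > 9q matches the 3-standard-deviation condition for p = 0.5")
```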
For example, let's look at the different possible positive z-score results if the lady were given ten trials, which would be enough if we used the lowest standard of np >= 5 and nq >= 5.
10 correct out of 10: z = 3.16227... ~= 3.16, which is above the 99% threshold.
9 correct out of 10: z = 2.52982... ~= 2.53, which is above the 99% threshold.
8 correct out of 10: z = 1.89737... ~= 1.90, which is above the 95% threshold, but not the 99%.
7 correct out of 10: z = 1.26491... ~= 1.26, which is below the 90% threshold.
If we set the bar at the 90% threshold, she could impress us by getting 8 right out of 10 or better. Likewise, at the 95% threshold, 8 of 10 will be beyond the threshold and the result would make us reject H0. At the 99% threshold, she would have to get 9 of 10 or 10 of 10 to make a z-score that breaks the threshold.
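The z-scores in the list above come from the proportion formula z = (p-hat - p)/sqrt(pq/n). Here is a short Python sketch that reproduces them:

```python
import math

# z-scores for k correct out of n = 10 guesses under H0: p = 0.5,
# using z = (p-hat - p) / sqrt(p*q/n).
n, p = 10, 0.5
se = math.sqrt(p * (1 - p) / n)   # standard error, about 0.1581
for k in (10, 9, 8, 7):
    z = (k / n - p) / se
    print(k, "correct:", round(z, 2))
```

The output matches the list: 3.16, 2.53, 1.9 and 1.26.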
#4 Interpreting the test statistic. Let's say for the sake of argument that the lady got 8 of 10 correct. (There is a book about 20th Century statistics entitled The Lady Tasting Tea, where a witness to the experiment says he can't recall how many times she was tested, but the lady got a perfect score.) If we set the threshold at 90% confidence or 95% confidence, we would be impressed by the z score of 1.90 and we would reject H0, which says that we don't think she is "just guessing", but actually has the talent she says she has. If we set the value at 99% confidence, we would fail to reject H0.
Here's the thing. We could be wrong. If we reject H0 incorrectly, it means she was just guessing and she was very lucky during this test, a Type I error. A z-score of 1.90 corresponds to a probability of 97.13%, which is called the p-value in hypothesis testing. She can pass the test by being better than 97.13% of lucky guessers. If she is a lucky guesser, she would fool anyone who had set the threshold at 90% or 95%.
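The 97.13% figure comes from the standard normal table, but it can also be computed directly. This Python sketch uses the error function from the standard library instead of Table A-2 (the identity Phi(z) = (1 + erf(z/sqrt(2)))/2 is a standard fact, not something covered in class):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

print(round(phi(1.90), 4))  # 0.9713
```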
If we fail to reject H0, which we would do if we set the threshold at 99%, this could also be an error, but this time it would be a Type II error. Under this scenario, she got 8 of 10 but would usually do better. It takes more difficult math to figure out how good she actually is and how unlucky she had to be to get only 8 of 10. The probability of a Type I error is called alpha, and it is determined by the threshold. The probability of a Type II error is called beta, and it is usually explained in greater detail in the class after the introduction to statistics.
Tuesday, May 5, 2009
Class notes for 5/4, part 3: Bayesian probability and double testing
If a trait is very rare, only a very accurate test gives us useful information. For example, if a trait shows up in only 1 in 10,000 people but the test for the trait has an error rate of 1 in 1,000, we should expect about 10 false positives for every true positive. Here is the completed table for that situation.
________don't___have____row total
test + ____9,999__999______10,998__
test - _9,989,001___1_____9,989,002__
col.___9,999,000_1,000______10,000,000 grand total
In a situation such as this, testing positive twice could give us useful information, as testing positive once has an error rate of about 90.9%. We have to assume the errors are random and not deterministic. For example, if a test for a chemical compound in opium also catches a similar compound found in poppy seed bagels, testing twice won't get rid of the errors. Assuming just random errors, here is what we do.
Step 1: The top row of the first contingency table is the column total/grand total row of the second contingency table. What this does is takes the numbers from the people who tested positive the first time and makes them the totals for those who will be tested twice.
________don't___have____row total
test + __________________________
test - __________________________
col._______9,999__999_____10,998 grand total
Step 2: Multiply the error rate by the "have" column total to find the number who have the trait but test negative. Round to the nearest whole number. (We didn't have to round before, but now we do.)
999 * 1/1000 = .999 ~= 1, so one person tests negative, and the "test positive and have" entry is 999 - 1 = 998.
________don't___have____row total
test + ___________998____________
test - _____________1____________
col._______9,999__999_____10,998 grand total
Step 3: Multiply error rate by don't have column total to find the errors. 9,999*1/1000 = 9.999 ~= 10. That means the test negative in that column is 9,999 - 10 = 9,989.
________don't___have____row total
test + _______10__998____________
test - _____9,989____1____________
col._______9,999__999_____10,998 grand total
Step 4: row totals
________don't___have____row total
test + _______10__998_____1,008___
test - _____9,989____1_____9,990___
col._______9,999__999_____10,998 grand total
Step 5: Find the error rate for testing positive twice. 10/1,008 = .0099... or about 1%.
Of the ten million people tested, we would send letters to 1,008 telling them they tested positive twice. Of those people, ten don't have the trait and are getting false information, but 998 are getting the right information. In the first test, there was someone with the trait who tested negative, and the same is true in the second test, so there are two people with the trait who did not get two positive test results. While this isn't a perfect situation, it's much better than the over 90% error rate we got for positive tests the first time through.
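The bookkeeping in the five steps above can be scripted. Here is a Python sketch (the helper name split_column is my own, not from class) that builds the first table from the trait rate and error rate, then feeds the "test positive" row in as the column totals for the second round:

```python
def split_column(total, error_rate):
    """Split a column total into (wrong, right) test results,
    rounding the errors to the nearest whole person."""
    wrong = round(total * error_rate)
    return wrong, total - wrong

grand, trait_rate, err = 10_000_000, 1 / 10_000, 1 / 1_000
have = round(grand * trait_rate)        # 1,000 have the trait
dont = grand - have                     # 9,999,000 don't

# First test: the errors go the "wrong" way in each column.
have_neg, have_pos = split_column(have, err)   # 1 false negative, 999 true positives
dont_pos, dont_neg = split_column(dont, err)   # 9,999 false positives

# Second test, applied only to the first-round positives.
have_neg2, have_pos2 = split_column(have_pos, err)  # 1 and 998
dont_pos2, dont_neg2 = split_column(dont_pos, err)  # 10 and 9,989

twice_positive = have_pos2 + dont_pos2
print(twice_positive, round(dont_pos2 / twice_positive, 4))  # 1008 0.0099
```

The 1,008 letters and the roughly 1% error rate for double positives match the table above.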
Class notes for 5/4, part 2: Bayesian probability
Earlier in the term, we created contingency tables from reading data sets and filling in the positions of the table, then finding the row totals, column totals and the grand total. We then learned about conditional probability, where we find that p(left, given female) might not equal p(left, given male) or p(left). If these probabilities are not equal, we call them dependent, because the probability depends on whether we are looking at the whole population or some specific sub-population. If they are all equal, the probabilities are independent.
In Bayesian probability, we will be building a contingency table "backwards". Instead of filling in each value in the table then finding row totals, column totals and grand total, we will start with a trait in the population and a test for that trait. We will make a 2x2 contingency table, where the columns deal with having the trait or not and the rows refer to testing positive or testing negative. If the test has an error rate, as is often the case, some people are going to get incorrect information. What we will see is that the overall error rate can sometimes be quite different from the error rate for those who test positive and the error rate for those who test negative.
Let's say there is a genetic trait in the population that shows up in 25% of subjects, which we will write as 1 in 4. The test for the trait has a 2% error rate, so it is 1 in 50.
Step 1: The grand total is the product of the denominators of the fractions.
In our case, 4*50 = 200
________don't___have____row total
test + __________________________
test - __________________________
col._____________________200 grand total
Step 2: Multiply the grand total by the trait proportion to find the column totals.
Since 25% of the population has the trait, 25% of 200 = 50 subjects have the trait in our idealized sample. By subtraction, 150 don't have the trait.
________don't___have____row total
test + __________________________
test - __________________________
col._____150_____50______200 grand total
Step 3: Fill in the "have the trait" column by multiplying the error rate by the column total to fill in the mistaken position, and fill in the rest by subtracting.
In our case, the error rate is 1 in 50. This means for the people who have the trait, 1 person will test negative, while the other 49 will correctly test positive.
________don't___have____row total
test + ___________49______________
test - ____________1______________
col._____150_____50______200 grand total
Step 4: Fill in the "don't have the trait" column using the same method.
3 of the 150 will get the wrong information, which in their case will be a positive test. The other 147 will get the right information, a negative test result.
________don't___have____row total
test + ____3______49______________
test - ___147______1_______________
col._____150_____50______200 grand total
Step 5: Fill in the row totals.
________don't___have____row total
test + ____3______49______52______
test - ___147______1______148______
col._____150_____50______200 grand total
The error numbers are marked in bold and blue for the next step.
Step 6: Find the error rates given test positive and test negative.
p(error) = (3+1)/200 = 1/50 = .02, which was the advertised error rate.
p(error, given test positive) = 3/52 ~= .058, much higher than .02
p(error, given test negative) = 1/148 ~= .0068, much lower than .02
Unless the trait shows up in 50% of the population, we expect to get differences between the error rates for test positive and test negative. Whichever is the smaller part of the population should see a higher error rate. Just how far apart the error rates for the two test groups land depends on the size of the error rate relative to the trait rate. A 99% accurate test sounds good, but if the trait is very rare, we might well get more false positives than true positives.
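The five steps above can be collected into a single Python function. This is a sketch in my own notation (using exact fractions so no rounding is needed), not something from the textbook:

```python
from fractions import Fraction

def bayes_table(trait_rate, error_rate):
    """Build the 2x2 table backwards and return the conditional
    error rates (p(error|test +), p(error|test -))."""
    grand = trait_rate.denominator * error_rate.denominator  # Step 1
    have = grand * trait_rate                                # Step 2
    dont = grand - have
    have_neg = have * error_rate      # Step 3: errors in "have" column
    have_pos = have - have_neg
    dont_pos = dont * error_rate      # Step 4: errors in "don't" column
    dont_neg = dont - dont_pos
    pos = have_pos + dont_pos         # Step 5: row totals
    neg = have_neg + dont_neg
    return dont_pos / pos, have_neg / neg

err_pos, err_neg = bayes_table(Fraction(1, 4), Fraction(1, 50))
print(err_pos, err_neg)  # 3/52 1/148
```

Plugging in the 25% trait rate and 2% error rate reproduces the 3/52 and 1/148 found by hand above.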
In Bayesian probability, we will be building a contingency table "backwards". Instead of filling in each value in the table and then finding the row totals, column totals and grand total, we will start with a trait in the population and a test for that trait. We will make a 2x2 contingency table, where the columns deal with having the trait or not and the rows refer to testing positive or testing negative. If the test has an error rate, as is often the case, some people are going to get incorrect information. What we will see is that the overall error rate can sometimes be quite different from the error rate for those who test positive and the error rate for those who test negative.
Let's say there is a genetic trait in the population that shows up in 25% of subjects, which we will write as 1 in 4. The test for the trait has a 2% error rate, so it is 1 in 50.
Step 1: The grand total is the product of the denominators of the fractions.
In our case, 4*50 = 200
________don't___have____row total
test + __________________________
test - __________________________
col._____________________200 grand total
Step 2: Multiply the grand total by the trait proportion to find the column totals.
Since 25% of the population has the trait, 25% of 200 = 50 subjects have the trait in our idealized sample. By subtraction, 150 don't have the trait.
________don't___have____row total
test + __________________________
test - __________________________
col._____150_____50______200 grand total
Step 3: Fill in the "have the trait" column by multiplying the error rate by the column total to fill in the mistaken position, and fill in the rest by subtracting.
In our case, the error rate is 1 in 50. This means for the people who have the trait, 1 person will test negative, while the other 49 will correctly test positive.
________don't___have____row total
test + ___________49______________
test - ____________1______________
col._____150_____50______200 grand total
Step 4: Fill in the "don't have the trait" column using the same method.
3 of the 150 will get the wrong information, which in their case will be a positive test. The other 147 will get the right information, a negative test result.
________don't___have____row total
test + ____3______49______________
test - ___147______1_______________
col._____150_____50______200 grand total
Step 5: Fill in the row totals.
________don't___have____row total
test + ____3______49______52______
test - ___147______1______148______
col._____150_____50______200 grand total
The error numbers (the 3 false positives and the 1 false negative) are the key to the next step.
Step 6: Find the error rates given test positive and test negative.
p(error) = (3+1)/200 = 1/50 = .02, which was the advertised error rate.
p(error, given test positive) = 3/52 ~= .058, much higher than .02
p(error, given test negative) = 1/148 ~= .0068, much lower than .02
Unless the trait shows up in 50% of the population, we expect to get differences between the error rates for test positive and test negative. Whichever group is the smaller part of the population should see the higher error rate. Just how significant the differences are in the error rates between the two test groups depends on the ratio of the error rate to the trait rate. A 99% accurate test sounds good, but if the trait is very rare, we might well get more false positives than true positives.
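The six-step procedure above can be sketched in Python. This is my own sketch, not code from the class notes; the function name and variable names are mine. It uses exact fractions so the cell counts come out as whole numbers, just as in the worked example (trait rate 1 in 4, error rate 1 in 50).

```python
from fractions import Fraction

def bayes_table(trait_rate, error_rate):
    """Build the 2x2 contingency table 'backwards' and return the
    conditional error rates. Rates are Fractions like Fraction(1, 4)."""
    # Step 1: the grand total is the product of the denominators.
    grand = trait_rate.denominator * error_rate.denominator
    # Step 2: column totals.
    have = grand * trait_rate            # have the trait
    dont = grand - have                  # don't have the trait
    # Steps 3-4: the error rate puts counts in the "wrong" cell of each column.
    false_neg = have * error_rate        # have the trait, test -
    true_pos = have - false_neg          # have the trait, test +
    false_pos = dont * error_rate        # don't have the trait, test +
    true_neg = dont - false_pos          # don't have the trait, test -
    # Step 5: row totals.
    row_pos = false_pos + true_pos
    row_neg = true_neg + false_neg
    # Step 6: conditional error rates.
    return false_pos / row_pos, false_neg / row_neg

p_pos, p_neg = bayes_table(Fraction(1, 4), Fraction(1, 50))
print(float(p_pos))  # 3/52 ~= .058
print(float(p_neg))  # 1/148 ~= .0068
```

Changing the two Fractions reproduces the table for any trait rate and error rate, which is handy for checking the practice homework.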
Monday, May 4, 2009
Class notes for 5/4, part 1: more on expected value
The expected value of a two-outcome game (win or lose) can be written as EV = p*(profit + risk)/risk. By dividing by risk, the game's expected value can be thought of as a percentage of money returned to you on average every time you play. Again, this is an average, so the average outcome doesn't have to be achieved. In many games, it never is. Expected value is really about the long run.
Flipping a fair coin, if you call "heads" every time, you should win about 50% of the time, and so EV = .5(1+1)/1 = 100%, so calling heads is a way to make this a fair game with a fair coin.
If you mix up your calls, it's also a fair game.
If you go with "rock" every time in rock/paper/scissors, your opponent will catch on soon enough and go with "paper" every time, and you will lose money in the long run.
Rock/paper/scissors is a fair game only if you can mix up your calls using some random method, or at least a method hard for your opponent to predict. The best method is 1/3 rock, 1/3 paper and 1/3 scissors, and the expected value is 100%, meaning that in the long run you will neither win nor lose, but break even.
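The EV formula above translates directly into a one-line Python function. This is a sketch of my own (the function name is not from the notes), checked here against the fair-coin example.

```python
def expected_value(p, profit, risk):
    """EV = p * (profit + risk) / risk, the average fraction of the
    money risked that is returned to you per play."""
    return p * (profit + risk) / risk

# Fair coin, calling heads every time: p = .5, profit = 1, risk = 1
print(expected_value(0.5, 1, 1))  # 1.0, i.e. 100% -- a fair game
```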
Let's go back to roulette. We saw that playing a single number or playing either black or red produce the same expected value for the player.
Single number for player
p = 1/38
profit = 35
risk = 1
EV = 1/38*(35+1)/1 = 36/38 ~= 94.7%
Red (or Black) for player
p = 18/38
profit = 1
risk = 1
EV = 18/38*(1+1)/1 = 36/38 ~= 94.7%
For the casino, the probability of winning is 1 minus the probability for the player, and the risk and profit numbers are switched from the player's values.
Single number for casino
p = 37/38
profit = 1
risk = 35
EV = 37/38*(35+1)/35 = (36*37)/(35*38) ~= 100.15%
Red (or Black) for casino
p = 20/38
profit = 1
risk = 1
EV = 20/38*(1+1)/1 = 40/38 ~= 105.3%
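The four roulette calculations above can be checked numerically with the same EV formula from earlier in these notes (the function name is mine, not from the notes):

```python
def expected_value(p, profit, risk):
    # EV = p * (profit + risk) / risk
    return p * (profit + risk) / risk

print(expected_value(1 / 38, 35, 1))   # single number, player: ~ 0.947
print(expected_value(18 / 38, 1, 1))   # red or black, player:  ~ 0.947
print(expected_value(37 / 38, 1, 35))  # single number, casino: ~ 1.0015
print(expected_value(20 / 38, 1, 1))   # red or black, casino:  ~ 1.053
```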
When profit = risk, which is the same as saying the classic parimutuel odds are 1:1 or the modern parimutuel odds are +100 (which is the same as -100, though it is rarely written that way), the average of the two expected values will be 100%. The player expects to lose about 5.3 cents per dollar bet each game, and the casino expects to win about 5.3 cents.
When profit does not equal risk, we get different percentage advantages and disadvantages. Because the casino must risk 35 bets and the player only 1 bet when the player picks a single number at roulette, the expected value for the casino is still positive, but relatively small at about .15 of a cent per dollar risked. In reality, the casino rarely spins the wheel with only one bettor playing, so it is not truly risking only its own money on a single spin; it can use the losses of some players to help offset any possible winner. Even if that weren't the case, a game with an expected value greater than 100% means a winner in the long run, and that is the business model casinos operate on, very successfully, as anyone can see.
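The classic/modern conversion discussed above can be sketched in Python. This is my own sketch (function names are mine), using examples that are not taken from the homework below: positive modern odds give the profit per 100 risked, negative modern odds give the risk per 100 of profit.

```python
from fractions import Fraction

def modern_to_classic(odds):
    """Convert modern parimutuel odds (e.g. +150 or -200) to classic profit:risk odds."""
    if odds >= 100:
        f = Fraction(odds, 100)    # +150 means profit 150 per 100 risked
    else:
        f = Fraction(100, -odds)   # -200 means risk 200 to profit 100
    return f"{f.numerator}:{f.denominator}"

def classic_to_modern(profit, risk):
    """Convert classic profit:risk odds to modern odds, rounded to a whole number."""
    if profit >= risk:
        return round(100 * profit / risk)    # written with a + sign
    else:
        return -round(100 * risk / profit)   # written with a - sign

print(modern_to_classic(150))    # 3:2
print(modern_to_classic(-200))   # 1:2
print(classic_to_modern(3, 2))   # 150, i.e. +150
print(classic_to_modern(1, 2))   # -200
```

Note that classic odds like 4:3 do not convert to a whole modern number exactly, which is why the second function rounds.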
Practice homework for 5/4
Change the following parimutuel payoffs from modern to classic or vice versa.
a) -125
b) +144
c) 7:5
d) 4:3
Fill in the Bayesian contingency table for a trait that shows up in 1 in 50 people in the population and a test for the trait that has an error rate of 1 in 200. Find the following probabilities.
p(error, given test -)
p(error, given test +)
Answers in the comments.