The topic for the next few weeks is hypothesis testing. The main idea
is that experiments must be conducted to test the validity of an idea,
which is called a hypothesis. There are always two hypotheses
available, the null hypothesis H0 (pronounced "H zero" or "H nought") and the alternate hypothesis HA (pronounced "H A").
The standard is to assume the null hypothesis is true, which says that
nothing special is happening, which in most cases means that two things
we can measure should be equal or close to it. The alternate
hypothesis says the two measurements are different. We can have
one-tailed high tests, where we want "large" positive test statistics. In
one-tailed low tests, only negative test statistics with "large"
absolute value will do. In two-tailed tests, a "large" absolute value for
either a positive or negative number will work. We will only accept the
alternate hypothesis if the experiment produces impressive results given
our particular criteria for that test.
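These decision rules can be sketched as a small function. This is just my illustration, not part of the notes; the function name and arguments are made up for the sketch.

```python
def reject_h0(z, tail, threshold):
    """Decide whether a test statistic z is 'beyond' the cutoff.

    tail is 'high', 'low', or 'two'; threshold is the positive
    cutoff z-score for the chosen confidence level.
    """
    if tail == "high":   # one-tailed high: want a large positive z
        return z >= threshold
    if tail == "low":    # one-tailed low: want a large negative z
        return z <= -threshold
    if tail == "two":    # two-tailed: large absolute value either way
        return abs(z) >= threshold
    raise ValueError("tail must be 'high', 'low', or 'two'")
```

For example, reject_h0(1.90, "high", 1.645) returns True, while reject_h0(1.26, "high", 1.645) returns False.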
The
basics of hypothesis testing are similar to the ideals of the English
legal system, which is also the system used in United States courts,
that a defendant is presumed innocent until proven guilty. There are
different levels of proof of guilt in different trials, whether it is
beyond a reasonable doubt or the less rigorous standard of preponderance
of evidence.
In a case involving an alleged crime, there is the
reality of what the defendant did and the result of the trial. If the
defendant did the illegal act, then being found guilty is the correct
result under the law. If the reality is that the defendant didn't do
the act, the correct result would be a not guilty verdict.
The
reasonable doubt standard is put in place, in theory, to make sending an
innocent person to jail unlikely; convicting an innocent defendant is called a Type I error.
The best known Type I result in legal history is Jesus Christ.
It
is also possible that a person who did a crime will be found not
guilty. This is called a Type II error. When I ask my students for an
example of Type II error, O.J. Simpson's name still rings out the
loudest.
In hypothesis testing, there is the reality and the result of the experiment. If H0 is true, the two things measured are equal or pretty close to equal. If HA is true, they are significantly different.
If the experiment produces a test statistic that is beyond the threshold we set for it, we reject H0. Here "beyond" means lower in a one-tailed low test, higher in a one-tailed high test, and either lower or higher in a two-tailed test. If the test statistic fails to get beyond the threshold, we fail to reject H0.
Rejecting a true null hypothesis is a Type I error. Failing to reject a false null hypothesis is a Type II error.
In class, we discussed Sir Ronald Fisher and his hypothesis testing of
the lady who said she could tell the difference in taste between tea
poured into milk and milk poured into tea.
Here are the things that have to be done to make such an experiment work.
#1 Define the null hypothesis.
In modern experiments, the null hypothesis is always defined as an
equation. In a proportion test, the equation will be concerning p,
the true probability of success. In the lady tasting tea, we would
assume if nothing special is happening, then she is just guessing
whether the tea or milk was poured first, and the probability of being
correct on any given trial is 50% or .5. We write this as follows.
H0: p = .5
#2 Pick a threshold.
The trials we are going to perform are taste tests where the lady
cannot see the tea-milk mixture being poured. We have to decide on how
high a test statistic we will consider impressive. The three standard
choices are 90% confidence, 95% confidence or 99% confidence. For
experiments in the medical field, where the decision is whether or not
to bring a new drug to market, the 99% confidence level is common. For
an experiment like this, where the result is not truly earth shattering,
we might decide to use the 95% confidence threshold.
The experiment will produce a z-score. The thresholds for high z-scores are as follows:
90% threshold: z = 1.28
95% threshold: z = 1.645
99% threshold: z = 2.326
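These cutoffs can be checked with Python's standard library; statistics.NormalDist (available since Python 3.8) gives the inverse CDF of the standard normal curve.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
for confidence in (0.90, 0.95, 0.99):
    # inv_cdf returns the z-score with that much area to its left
    z = std_normal.inv_cdf(confidence)
    print(f"{confidence:.0%} threshold: z = {z:.3f}")
```

This prints 1.282, 1.645 and 2.326, matching the table above up to rounding.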
#3 Decide on the number of trials in an experiment.
There is a tug-of-war in deciding the number of trials. More trials
produce numbers we can be more confident in, but more trials are also
more expensive and more time consuming. In the case of the lady tasting
tea, we don't want to keep her drinking tea mixtures for hours.
Different books set different standards for the minimum number of trials based on np and nq. Some say both np > 5 and nq > 5. Others say both numbers should be greater than 10, yet others say 15. The standard that np >= 10 and nq >= 10 can be connected to the standard that says n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0 by a little algebraic manipulation.
np - 3*sqrt(pqn) > 0 [add the square root to both sides]
np > 3*sqrt(pqn) [square both sides]
n^2*p^2 > 9pqn [divide both sides by np]
np > 9q
Since q must be less than 1, but can be as close to 1 as we want, set it equal to 1 and the inequality becomes
np > 9, which we can change to np >= 10.
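Both rules of thumb are easy to check by machine. Here is a quick sketch applied to the tea experiment; the function names are mine, not standard terminology.

```python
import math

def passes_np_rule(n, p, minimum=10):
    """np and nq both at least `minimum` (10 here; some books use 5 or 15)."""
    q = 1 - p
    return n * p >= minimum and n * q >= minimum

def passes_three_sigma_rule(n, p):
    """Check n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0."""
    q = 1 - p
    spread = 3 * math.sqrt(n * p * q)
    return 0 < n * p - spread and n * p + spread < n

# The lady tasting tea: n = 10 trials, p = .5 under H0
print(passes_np_rule(10, 0.5))           # np = nq = 5, fails the np >= 10 standard
print(passes_three_sigma_rule(10, 0.5))  # 5 - 3*sqrt(2.5) ~= 0.26 > 0, passes
```

So n = 10 squeaks past the three-sigma rule and the np >= 5 standard, but fails the stricter np >= 10 standard.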
For example, let's look at the different possible positive z-score results if the lady were given ten trials, which would be enough if we used the lowest standard of np >= 5 and nq >= 5.
10 correct out of 10: z = 3.16227... ~= 3.16, which is above the 99% threshold.
(look-up table: .9992)
9 correct out of 10: z = 2.52982... ~= 2.53, which is above the 99% threshold.
(look-up table: .9943)
8 correct out of 10: z = 1.89737... ~= 1.90, which is above the 95% threshold, but not the 99%.
(look-up table: .9713)
7 correct out of 10: z = 1.26491... ~= 1.26, which is below the 90% threshold.
(look-up table: .8962)
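The z-scores above come from the usual one-proportion formula z = (phat - p)/sqrt(pq/n), where phat is the observed proportion of correct answers. A sketch reproducing the table:

```python
import math
from statistics import NormalDist

n, p = 10, 0.5                   # ten trials, guessing under H0
se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion

for correct in (10, 9, 8, 7):
    z = (correct / n - p) / se
    table = NormalDist().cdf(z)  # area to the left, as in the look-up table
    print(f"{correct} of {n}: z = {z:.2f}, table value = {table:.4f}")
```

The table values agree with the look-up numbers above to within rounding (the look-up table uses z rounded to two places, this computes the exact z).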
If
we set the bar at the 90% threshold, she could impress us by getting 8
right out of 10 or better. Likewise, at the 95% threshold, 8 of 10 will
be beyond the threshold and the result would make us reject H0. At the 99% threshold, she would have to get 9 of 10 or 10 of 10 to make a z-score that breaks the threshold.
#4 Interpret the test statistic.
Let's say for the sake of argument that the lady got 8 of 10 correct.
(There is a book about 20th Century statistics entitled The Lady Tasting Tea,
where a witness to the experiment says he can't recall how many times
she was tested, but the lady got a perfect score.) If we set the
threshold at 90% confidence or 95% confidence, we would be impressed by
the z-score of 1.90 and we would reject H0,
which means we don't think she is "just guessing", but that she actually has
the talent she says she has. If we set the value at 99% confidence, we
would fail to reject H0.
Here's the thing. We could be wrong. If we reject H0 incorrectly, it means she was just guessing and she was very lucky during this test, a Type I error. A z-score of 1.90 corresponds to a look-up table value of 97.13%, so the chance of a lucky guesser doing this well or better is 1 - .9713 = .0287, which is called the p-value
in hypothesis testing. She can pass the test by being better than
97.13% of lucky guessers. If she is a lucky guesser, she would fool
anyone who had set the threshold at 90% or 95%.
If we fail to reject H0,
which we would do if we set the threshold at 99%, this could also be an
error, but this time it would be a Type II error. Under this scenario,
she got 8 of 10, but she would usually do better. It's more difficult
math to figure out how good she actually is and how unlucky she had to
be to get only 8 of 10. The probability of a Type I error is called alpha, and it is determined by the threshold. The probability of a Type II error is called beta, and it usually explained in greater detail in the class after the introduction to statistics.
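To give a flavor of that harder math: beta depends on how good she actually is, which the experiment doesn't tell us, so we have to assume a true value of p. The value p = 0.8 below is purely an assumption for illustration. At the 95% threshold, rejecting H0 requires a z-score of 1.645 or more, which with ten trials means 8 or more correct, so beta is the binomial probability she gets 7 or fewer.

```python
import math

n = 10
true_p = 0.8  # assumed true skill level, chosen for illustration only

# P(X <= 7): she fails to get beyond the 95% threshold (needs 8+ correct)
beta = sum(
    math.comb(n, k) * true_p**k * (1 - true_p)**(n - k)
    for k in range(8)
)
print(f"beta = {beta:.4f}")  # roughly 0.32 under these assumptions
```

So even a taster who is right 80% of the time would fail this ten-trial test about a third of the time, which is why small experiments are risky.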
A one-tailed low test
Let's say we want a low error rate. Unlike the lady tasting tea, who needed a lot of right answers to impress us, we now need a low score to get a result that will make us reject the null hypothesis.
Now our z-score thresholds are
90% threshold: z = -1.28
95% threshold: z = -1.645
99% threshold: z = -2.326
Let's say we want to be convinced our error rate is less than 10% and we want to be convinced to the 95% confidence level.
H0: p = 0.10
HA: p < 0.10
n = 50
10% of 50 is 5, so we should check to see what happens at f = 4, 3, 2, 1 and 0 errors. Typing this into the calculator will look like
(f/50 - .1)/sqrt(.1*.9/50)
f = 4 gives a rounded z-score of -.47
(look-up table: .3192 fail to reject H0)
f = 3 gives a rounded z-score of -.94
(look-up table: .1736 fail to reject H0)
f = 2 gives a rounded z-score of -1.41
(look-up table: .0793 fail to reject H0 at 95% confidence, but reject at 90%)
f = 1 gives a rounded z-score of -1.89
(look-up table: .0294 reject H0 at 95% confidence, but fail to reject at 99%)
f = 0 gives a rounded z-score of -2.36
(look-up table: .0091 reject H0 at any confidence level we use 90%, 95% or 99%)
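The same standard-library sketch reproduces this low-tail table. Note that in a one-tailed low test, the look-up table value (the area to the left of z) is itself the p-value:

```python
import math
from statistics import NormalDist

n, p0 = 50, 0.10                  # H0: p = .10, with n = 50 trials
se = math.sqrt(p0 * (1 - p0) / n)

for f in (4, 3, 2, 1, 0):
    z = (f / n - p0) / se
    p_value = NormalDist().cdf(z)  # low tail: area to the left of z
    verdict = "reject H0" if p_value < 0.05 else "fail to reject H0"
    print(f"f = {f}: z = {z:.2f}, p-value = {p_value:.4f}, {verdict} at 95%")
```

Again the printed p-values match the look-up numbers above to within rounding, since the table uses z rounded to two places.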
Wednesday, March 26, 2014
Notes for March 18 and 20