Thursday, May 7, 2009

Class notes for 5/6: Hypothesis testing

The topic for the next few weeks is hypothesis testing. The main idea is that experiments must be conducted to test the validity of an idea, which is called a hypothesis. There are always two hypotheses available, the null hypothesis H0 (pronounced "H zero" or "H nought") and the alternate hypothesis HA (pronounced "H A"). The standard is to assume the null hypothesis is true; it says that nothing special is happening, which in most cases means that two things we can measure should be equal or nearly so. The alternate hypothesis says the two measurements are significantly different. We can have one-tailed high tests, where only "large" positive test statistics are impressive. In one-tailed low tests, only negative test statistics with "large" absolute value will do. In a two-tailed test, a "large" absolute value in either direction will work. We only accept the alternate hypothesis if the experiment produces impressive results given our particular criteria for that test.


The basics of hypothesis testing are similar to an ideal of the English legal system, which is also the system used in United States courts: a defendant is presumed innocent until proven guilty. There are different levels of proof of guilt in different kinds of trials, whether it is beyond a reasonable doubt or the less rigorous standard of preponderance of the evidence.

In a case involving an alleged crime, there is the reality of what the defendant did and the result of the trial. If the defendant did the illegal act, then being found guilty is the correct result under the law. If the reality is that the defendant didn't do the act, the correct result would be a not guilty verdict.

The reasonable doubt standard is put in place, in theory, to make sending an innocent person to jail unlikely; convicting an innocent defendant is called a Type I error. The best known Type I error in legal history is Jesus Christ.

It is also possible that a person who did a crime will be found not guilty. This is called a Type II error. When I ask my students for an example of Type II error, O.J. Simpson's name still rings out the loudest.


In hypothesis testing, there is the reality and the result of the experiment. If H0 is true, the two things measured are equal or pretty close to equal. If HA is true, they are significantly different.

If the experiment produces a test statistic that is beyond the threshold we set for it, we reject H0. Here "beyond" means lower in a one-tailed low test, higher in a one-tailed high test, and either lower or higher in a two-tailed test. If the test statistic fails to get beyond the threshold, we fail to reject H0.
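
Here is a minimal sketch of that decision rule in Python. The function name and the way the cutoff is passed in are just for illustration; they do not come from any particular statistics library.

def reject_h0(z, z_crit, tail):
    # z_crit is given as a positive number, e.g. 1.645 for 95% confidence.
    # tail is "high" (one-tailed high), "low" (one-tailed low) or "two" (two-tailed).
    if tail == "high":
        return z > z_crit        # only large positive values impress us
    if tail == "low":
        return z < -z_crit       # only large negative values impress us
    if tail == "two":
        return abs(z) > z_crit   # large values in either direction
    raise ValueError('tail must be "high", "low" or "two"')

# Example: a z-score of 1.90 in a one-tailed high test.
print(reject_h0(1.90, 1.645, "high"))   # True -> reject H0 at 95% confidence
print(reject_h0(1.90, 2.326, "high"))   # False -> fail to reject H0 at 99% confidence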

Rejecting a true null hypothesis is a Type I error. Failing to reject a false null hypothesis is a Type II error.

In class, we discussed Sir Ronald Fisher and his hypothesis testing of the lady who said she could tell the difference in taste between tea poured into milk or milk poured into tea.

Here are the things that have to be done to make such an experiment work.

#1 Define the null hypothesis. In modern experiments, the null hypothesis is always defined as an equation. In a proportion test, the equation concerns p, the true probability of success. In the lady tasting tea, we assume that if nothing special is happening, she is just guessing whether the tea or the milk was poured first, and the probability of being correct on any given trial is 50%, or .5. We write this as follows.

H0: p = .5

The alternate hypothesis says she does better than just guessing, HA: p > .5, which makes this a one-tailed high test.

#2 Pick a threshold. The trials we are going to perform are taste tests where the lady cannot see the tea-milk mixture being poured. We have to decide how high a test statistic we will consider impressive. The three standard choices are 90% confidence, 95% confidence and 99% confidence. For experiments in the medical field, where the decision is whether or not to bring a new drug to market, the 99% confidence level is common. For an experiment like this, where the result is not truly earth-shattering, we might decide to use the 95% confidence threshold.

The experiment will produce a z-score, and the thresholds for high z-scores are as follows:

90% threshold: z = 1.28
95% threshold: z = 1.645
99% threshold: z = 2.326
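
As a quick check, these cutoffs can be reproduced with Python's standard library (the statistics module, available in Python 3.8 and later). This is just a sketch showing where the numbers come from.

from statistics import NormalDist

# The inverse CDF of the standard normal gives the z-score that cuts off
# the top 10%, 5% or 1% of lucky guessers.
std_normal = NormalDist()   # mean 0, standard deviation 1
for confidence in (0.90, 0.95, 0.99):
    z = std_normal.inv_cdf(confidence)
    print(f"{confidence:.0%} threshold: z = {z:.3f}")

# Output:
# 90% threshold: z = 1.282
# 95% threshold: z = 1.645
# 99% threshold: z = 2.326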



#3 Decide on the number of trials in an experiment. There is a tug-of-war in deciding the number of trials. More trials produce numbers we can be more confident in, but more trials are also more expensive and more time consuming. In the case of the lady tasting tea, we don't want to keep her drinking tea mixtures for hours.

Different books set different standards for the minimum number of trials based on np and nq. Some say both np >= 5 and nq >= 5. Others say both numbers should be greater than 10; still others say 15. The standard that np >= 10 and nq >= 10 can be connected to the standard that says n > np + 3*sqrt(pqn) > np - 3*sqrt(pqn) > 0 by a little algebraic manipulation.

np - 3*sqrt(pqn) > 0 [add the square root to both sides]
np > 3*sqrt(pqn) [square both sides]
n^2*p^2 > 9pqn [divide both sides by np]
np > 9q

Since q must be less than 1, but can be as close to 1 as we want, set it equal to 1 and the inequality becomes

np > 9, which we can change to np >= 10. The same manipulation applied to the other end of the inequality, n > np + 3*sqrt(pqn), gives nq > 9p, which becomes nq >= 10.
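
Here is a small sketch that checks both versions of the rule for a given n and p; the function name is just for illustration.

from math import sqrt

def sample_size_checks(n, p):
    # Compare the np/nq rule of thumb with the 3-standard-deviation rule.
    q = 1 - p
    rule_np_nq = (n * p >= 10) and (n * q >= 10)
    # The 3-standard-deviation version: the interval np +/- 3*sqrt(npq)
    # must fit strictly between 0 and n.
    spread = 3 * sqrt(n * p * q)
    rule_3sd = (n * p - spread > 0) and (n * p + spread < n)
    return rule_np_nq, rule_3sd

# The lady tasting tea with 10 trials and p = .5:
print(sample_size_checks(10, 0.5))   # (False, True) -- np >= 10 fails, the looser 3-sd rule passes
# With 40 trials, both standards are satisfied:
print(sample_size_checks(40, 0.5))   # (True, True)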

For example, let's look at the different possible positive z-score results if the lady were given ten trials, which would be enough if we used the lowest standard of np >= 5 and nq >= 5.

10 correct out of 10: z = 3.16227... ~= 3.16, which is above the 99% threshold.
9 correct out of 10: z = 2.52982... ~= 2.53, which is above the 99% threshold.
8 correct out of 10: z = 1.89737... ~= 1.90, which is above the 95% threshold, but not the 99%.
7 correct out of 10: z = 1.26491... ~= 1.26, which is below the 90% threshold.
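
These z-scores come from the standard one-proportion test statistic, z = (p-hat - p) / sqrt(pq/n), where p-hat is the observed proportion of correct answers in n trials. A quick sketch of the calculation:

from math import sqrt

n, p = 10, 0.5                  # ten trials, probability .5 of guessing correctly under H0
std_error = sqrt(p * (1 - p) / n)

for correct in (10, 9, 8, 7):
    p_hat = correct / n                 # observed proportion of correct answers
    z = (p_hat - p) / std_error         # one-proportion z-statistic
    print(f"{correct} correct out of {n}: z = {z:.5f}")

# Output:
# 10 correct out of 10: z = 3.16228
# 9 correct out of 10: z = 2.52982
# 8 correct out of 10: z = 1.89737
# 7 correct out of 10: z = 1.26491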

If we set the bar at the 90% threshold, she could impress us by getting 8 right out of 10 or better. Likewise, at the 95% threshold, 8 of 10 will be beyond the threshold and the result would make us reject H0. At the 99% threshold, she would have to get 9 of 10 or 10 of 10 to make a z-score that breaks the threshold.

#4 Interpreting the test statistic. Let's say for the sake of argument that the lady got 8 of 10 correct. (There is a book about 20th Century statistics entitled The Lady Tasting Tea, where a witness to the experiment says he can't recall how many times she was tested, but the lady got a perfect score.) If we set the threshold at 90% confidence or 95% confidence, we would be impressed by the z-score of 1.90 and we would reject H0, which means we don't think she is "just guessing" but actually has the talent she says she has. If we set the threshold at 99% confidence, we would fail to reject H0.

Here's the thing. We could be wrong. If we reject H0 incorrectly, it means she was just guessing and she was very lucky during this test, a Type I error. A z-score of 1.90 puts her ahead of 97.13% of lucky guessers; the leftover tail probability, 1 - .9713 = .0287, or about 2.87%, is what is called the p-value in hypothesis testing. If she is just a lucky guesser, she would fool anyone who had set the threshold at 90% or 95%.
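
A sketch of that calculation with the standard library:

from statistics import NormalDist

std_normal = NormalDist()
z = 1.90                                 # z-score for 8 correct out of 10

percentile = std_normal.cdf(z)           # fraction of lucky guessers she beats
p_value = 1 - percentile                 # chance a pure guesser does this well or better

print(f"percentile: {percentile:.4f}")   # 0.9713
print(f"p-value:    {p_value:.4f}")      # 0.0287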

If we fail to reject H0, which we would do if we set the threshold at 99%, this could also be an error, but this time it would be a Type II error. Under this scenario, she got 8 of 10 but would usually do better. The math needed to figure out how good she actually is, and how unlucky she had to be to get only 8 of 10, is more difficult. The probability of a Type I error is called alpha, and it is determined by the threshold. The probability of a Type II error is called beta, and it is usually explained in greater detail in the class after the introduction to statistics.
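
As a rough illustration of what that later calculation looks like, here is a sketch that assumes, purely hypothetically, that her true success rate is 90%. At the 99% threshold we only reject H0 when she gets 9 or 10 of 10 correct, so beta is the chance she gets 8 or fewer correct anyway.

from math import comb

def binom_cdf(k, n, p):
    # Probability of k or fewer successes in n trials with success probability p.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

true_p = 0.9   # hypothetical true success rate, not a number from the experiment
n = 10

# A Type II error happens whenever she gets 8 or fewer correct,
# because only 9 or 10 correct pushes the z-score past the 99% threshold.
beta = binom_cdf(8, n, true_p)
print(f"beta = {beta:.4f}")   # about 0.2639, roughly a 26% chance of a Type II error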
