Tuesday, May 5, 2009

Class notes for 5/4, part 2: Bayesian probability

Earlier in the term, we created contingency tables by reading data sets and filling in each position of the table, then finding the row totals, column totals and the grand total. We then learned about conditional probability, where we found that p(left, given female) might not equal p(left, given male) or p(left). If these probabilities are not equal, we call them dependent, because the probability depends on whether we are looking at the whole population or some specific sub-population. If they are all equal, the probabilities are independent.

In Bayesian probability, we will be building a contingency table "backwards". Instead of filling in each value in the table and then finding the row totals, column totals and grand total, we will start with a trait in the population and a test for that trait. We will make a 2x2 contingency table, where the columns deal with having the trait or not and the rows refer to testing positive or testing negative. If the test has an error rate, as is often the case, some people are going to get incorrect information. What we will see is that the overall error rate can sometimes be quite different from the error rate for those who test positive and the error rate for those who test negative.

Let's say there is a genetic trait in the population that shows up in 25% of subjects, which we will write as 1 in 4. The test for the trait has a 2% error rate, which we will write as 1 in 50.

Step 1: The grand total is the product of the denominators of the fractions.

In our case, 4*50 = 200


________don't___have____row total

test + __________________________

test - __________________________

col._____________________200 grand total

Step 2: Multiply the grand total by the trait proportion to find the column totals.

Since 25% of the population has the trait, 25% of 200 = 50 subjects have the trait in our idealized sample. By subtraction, 150 don't have the trait.


________don't___have____row total

test + __________________________

test - __________________________

col._____150_____50______200 grand total

Step 3: Fill in the "have the trait" column. Multiply the error rate by the column total to get the mistaken position, then find the other position by subtracting.

In our case, the error rate is 1 in 50. This means for the people who have the trait, 1 person will test negative, while the other 49 will correctly test positive.

________don't___have____row total

test + ___________49______________

test - ____________1______________

col._____150_____50______200 grand total

Step 4: Fill in the "don't have the trait" column using the same method.

The error rate is still 1 in 50, and 150/50 = 3, so 3 of the 150 will get the wrong information, which in their case will be a positive test. The other 147 will get the right information, a negative test result.

________don't___have____row total

test + ____3______49______________

test - ___147______1_______________

col._____150_____50______200 grand total

Step 5: Fill in the row totals.

________don't___have____row total

test + ____3______49______52______

test - ___147______1______148______

col._____150_____50______200 grand total
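
Steps 1 through 5 can also be sketched in a few lines of Python. This is only an illustration of the method above, not something from class; the function name build_table and the dictionary keys are made up for this sketch, and it assumes the trait rate and error rate are given as exact fractions.

```python
from fractions import Fraction

def build_table(trait_rate, error_rate):
    """Build the 2x2 table 'backwards' from a trait rate and an error rate.

    Hypothetical helper for illustration; both rates must be Fractions.
    """
    # Step 1: grand total is the product of the denominators.
    grand_total = trait_rate.denominator * error_rate.denominator
    # Step 2: column totals.
    have = int(grand_total * trait_rate)
    dont = grand_total - have
    # Step 3: "have the trait" column (the mistake here is a negative test).
    have_neg = int(have * error_rate)
    have_pos = have - have_neg
    # Step 4: "don't have the trait" column (the mistake here is a positive test).
    dont_pos = int(dont * error_rate)
    dont_neg = dont - dont_pos
    # Step 5: row totals.
    return {
        "dont_pos": dont_pos, "have_pos": have_pos, "pos_total": dont_pos + have_pos,
        "dont_neg": dont_neg, "have_neg": have_neg, "neg_total": dont_neg + have_neg,
        "dont_total": dont, "have_total": have, "grand_total": grand_total,
    }

table = build_table(Fraction(1, 4), Fraction(1, 50))
print("test +", table["dont_pos"], table["have_pos"], table["pos_total"])
print("test -", table["dont_neg"], table["have_neg"], table["neg_total"])
print("col.  ", table["dont_total"], table["have_total"], table["grand_total"])
```

Using Fraction instead of decimals keeps every count a whole number, which matches the idea of an idealized sample.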

The error numbers, 3 (positive tests for people without the trait) and 1 (negative tests for people with the trait), are the ones we need for the next step.

Step 6: Find the error rates given test positive and test negative.
p(error) = (3+1)/200 = 1/50 = .02, which was the advertised error rate.
p(error, given test positive) = 3/52 ~= .058, much higher than .02
p(error, given test negative) = 1/148 ~= .0068, much lower than .02
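
As a quick check of the arithmetic in Step 6, here is a short Python sketch that recomputes the three error rates from the four counts in the table. The variable names are just labels chosen for this example.

```python
from fractions import Fraction

# Counts taken from the finished table above.
false_pos, true_pos = 3, 49     # test + row: don't have, have
true_neg, false_neg = 147, 1    # test - row: don't have, have

grand_total = false_pos + true_pos + true_neg + false_neg

# Overall error rate: all mistakes over everybody.
p_error = Fraction(false_pos + false_neg, grand_total)
# Error rate among those who test positive.
p_error_given_pos = Fraction(false_pos, false_pos + true_pos)
# Error rate among those who test negative.
p_error_given_neg = Fraction(false_neg, true_neg + false_neg)

for label, p in [("p(error)", p_error),
                 ("p(error, given test positive)", p_error_given_pos),
                 ("p(error, given test negative)", p_error_given_neg)]:
    print(label, "=", p, "~=", round(float(p), 4))
```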

Unless the trait shows up in 50% of the population, we expect the error rates for test positive and test negative to differ. Whichever test result corresponds to the smaller part of the population should see the higher error rate. Just how large the difference between the two test groups is depends on the size of the error rate relative to the trait rate. A 99% accurate test sounds good, but if the trait is very rare, we might well get more false positives than true positives.
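
To see the rare trait case in numbers, here is a small Python sketch. The 1 in 1000 trait rate is a made-up figure chosen only for illustration, paired with a 1 in 100 error rate to stand in for a 99% accurate test; the function name error_rates is also just for this example.

```python
from fractions import Fraction

def error_rates(trait_rate, error_rate):
    """Return (p(error, given test +), p(error, given test -)).

    Hypothetical helper; works with proportions of the population
    instead of counts, which gives the same ratios as the table.
    """
    false_pos = (1 - trait_rate) * error_rate        # don't have, test +
    true_pos = trait_rate * (1 - error_rate)         # have, test +
    false_neg = trait_rate * error_rate              # have, test -
    true_neg = (1 - trait_rate) * (1 - error_rate)   # don't have, test -
    return (false_pos / (false_pos + true_pos),
            false_neg / (false_neg + true_neg))

# The class example: trait 1 in 4, error 1 in 50.
class_pos, class_neg = error_rates(Fraction(1, 4), Fraction(1, 50))

# Assumed rare trait: 1 in 1000, with a 99% accurate test (error 1 in 100).
rare_pos, rare_neg = error_rates(Fraction(1, 1000), Fraction(1, 100))

# A trait in exactly half the population gives equal error rates.
half_pos, half_neg = error_rates(Fraction(1, 2), Fraction(1, 50))

print("class example:", class_pos, class_neg)
print("rare trait:   ", rare_pos, float(rare_pos))
print("50% trait:    ", half_pos, half_neg)
```

In the rare trait case, an idealized sample of 100,000 would have 999 false positives against only 99 true positives, so roughly nine out of ten positive results would be wrong.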
