Bayesian probability
If a trait is very rare, only a very accurate test gives us useful
information. For example, if a trait shows up in only 1 in 10,000
people but the test for the trait has an error rate of 1 in 1,000, we
should expect about 10 false positives for every true positive. Here is
the completed table for that situation, with ten million people tested.
            don't have     have    row total
test +           9,999      999       10,998
test -       9,989,001        1    9,989,002
col. total   9,999,000    1,000   10,000,000 (grand total)
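The counts in the table can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name `first_test_table` and the dictionary layout are assumptions made for illustration, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

def first_test_table(population, prevalence, error_rate):
    """Return the 2x2 counts for one round of testing.

    Each row is (don't have, have, row total).
    """
    have = round(population * prevalence)
    dont_have = population - have
    false_neg = round(have * error_rate)        # have the trait, test -
    false_pos = round(dont_have * error_rate)   # don't have it, test +
    true_pos = have - false_neg
    true_neg = dont_have - false_pos
    return {
        "test +": (false_pos, true_pos, false_pos + true_pos),
        "test -": (true_neg, false_neg, true_neg + false_neg),
        "col.": (dont_have, have, population),
    }

# 1 in 10,000 have the trait; the test is wrong 1 time in 1,000.
table = first_test_table(10_000_000, Fraction(1, 10_000), Fraction(1, 1_000))
false_pos, true_pos, positives = table["test +"]
share_false = false_pos / positives  # share of positive results that are wrong
```

Here `share_false` works out to 9,999/10,998, a little under 91%, which is why a single positive result says so little.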
In a situation such as this, testing positive twice could give us useful
information, because a single positive result is wrong about 90.9% of the
time (9,999 of the 10,998 positives are false). We have to assume the errors
are random and not deterministic. For example, if a test for a chemical
compound in opium also catches a similar compound found in poppy seed bagels,
testing twice won't get rid of the errors. Assuming the errors are purely
random, here is what we do.
Step 1: The top row of the first contingency table becomes the column
total/grand total row of the second contingency table. This takes the
counts of people who tested positive the first time and makes them the
totals for those who will be tested a second time.
            don't have     have    row total
test +            ____     ____         ____
test -            ____     ____         ____
col. total       9,999      999       10,998 (grand total)
Step 2: Multiply the error rate by the "have" column total to find the
number of people who have the trait but test negative. Round to the nearest
whole number. (We didn't have to round before, but now we do.)
999*1/1000 = 0.999 ~= 1. That means the number who test positive and have the trait is 999 - 1 = 998.
            don't have     have    row total
test +            ____      998         ____
test -            ____        1         ____
col. total       9,999      999       10,998 (grand total)
Step 3: Multiply the error rate by the "don't have" column total to find the
number of false positives. 9,999*1/1000 = 9.999 ~= 10. That means the test
negative count in that column is 9,999 - 10 = 9,989.
            don't have     have    row total
test +              10      998         ____
test -           9,989        1         ____
col. total       9,999      999       10,998 (grand total)
Step 4: Add across each row to fill in the row totals.
            don't have     have    row total
test +              10      998        1,008
test -           9,989        1        9,990
col. total       9,999      999       10,998 (grand total)
Step 5: Find the error rate for testing positive twice. 10/1,008 = .0099... or about 1%.
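Steps 1 through 5 can be collected into one short Python sketch. The function name `retest` and the dictionary layout are assumptions made for illustration, and the code relies on the same assumption as the text: the second test's errors are random and independent of the first.

```python
from fractions import Fraction

ERROR_RATE = Fraction(1, 1_000)  # error rate from the example

def retest(dont_have, have, error_rate=ERROR_RATE):
    """Retest everyone who tested positive the first time.

    Each row is (don't have, have, row total).
    """
    false_neg = round(have * error_rate)        # Step 2: have, but test -
    false_pos = round(dont_have * error_rate)   # Step 3: don't have, but test +
    true_pos = have - false_neg
    true_neg = dont_have - false_pos
    return {
        "test +": (false_pos, true_pos, false_pos + true_pos),   # Step 4
        "test -": (true_neg, false_neg, true_neg + false_neg),   # Step 4
        "col.": (dont_have, have, dont_have + have),             # Step 1
    }

# Step 1: the first test's positive row supplies the column totals.
second = retest(dont_have=9_999, have=999)
false_pos, true_pos, positives = second["test +"]
error_after_two = false_pos / positives   # Step 5: 10 / 1,008
```

Rounding to whole people happens inside `retest`, which mirrors the rounding in Steps 2 and 3.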
Of the ten million people tested, we would send letters to 1,008 telling
them they tested positive twice. Of those people, ten don't have the
trait and are getting false information, but 998 are getting the right
information. In the first test, there was someone with the trait who
tested negative, and the same is true in the second test, so there are
two people with the trait who did not get two positive test results.
While this isn't a perfect situation, it's much better than the over 90%
error rate we got for positive tests the first time through.
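As a cross-check, the error rate for two positive results can also be approximated directly from Bayes' rule, under the same assumption that the two tests err independently. In this sketch "++" stands for two positive results; that label is shorthand introduced here, not notation from the steps above.

```latex
\[
P(\text{don't have} \mid ++)
  = \frac{(0.001)^2 (0.9999)}{(0.001)^2 (0.9999) + (0.999)^2 (0.0001)}
  = \frac{0.0000009999}{0.0001008}
  \approx 0.0099
\]
```

The result, about 1%, agrees with 10/1,008. The small difference in the later decimal places comes from rounding to whole people in the table.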
Relative frequency charts and ogives
Relative frequencies are also known as proportions and can sometimes be considered probabilities. If the proportions correspond to ordered categories, a line chart can be a clear way to present the data.
Here is the data for the Los Angeles Angels' scoring by inning, given as percentages. The first number is the percentage of the team's runs scored in that inning, and the second number, in brackets, is the cumulative percentage of all runs scored up through that inning.
1st: 13.5% [13.5%]
2nd: 10.0% [23.5%]
3rd: 10.4% [33.9%]
4th: 11.0% [45.0%]
5th: 11.7% [56.6%]
6th: 11.7% [68.3%]
7th: 8.8% [77.1%]
8th: 12.3% [89.4%]
9th: 9.4% [98.8%]
extra innings: 1.2% [100.0%]
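The bracketed figures are running totals of the per-inning figures. A minimal Python sketch of that calculation follows; the variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Per-inning percentages as listed above (1st through 9th, then extra innings).
per_inning = [13.5, 10.0, 10.4, 11.0, 11.7, 11.7, 8.8, 12.3, 9.4, 1.2]

# Running total, rounded to one decimal place to hide floating-point noise.
cumulative = [round(total, 1) for total in accumulate(per_inning)]
```

Summing the rounded per-inning figures gives 44.9 for the 4th inning where the listing shows 45.0. The listed cumulative values were presumably computed from unrounded data, so a difference of 0.1 is to be expected.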
The Angels are fairly consistent, except for the big bump in the first inning and the drop-off in the 7th. Unsurprisingly, they score very few extra-inning runs, though some teams like the Giants score four times as many.
In a line graph, we have two possible options. The first is the line in blue, which shows the production inning by inning. The second, the red line, shows the cumulative numbers; that graph is called the ogive, pronounced "oh-jive". If the Angels were completely consistent, the ogive would be a completely straight line, but instead we see slight bends in the red line where run production per inning increases or decreases.
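A chart like the one described can be drawn with a few lines of Python. This sketch assumes the matplotlib library is installed. The function name `plot_scoring` and the output file name are hypothetical, and the colors simply follow the description above.

```python
# Values copied from the listing above.
innings = ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th", "extra"]
per_inning = [13.5, 10.0, 10.4, 11.0, 11.7, 11.7, 8.8, 12.3, 9.4, 1.2]
cumulative = [13.5, 23.5, 33.9, 45.0, 56.6, 68.3, 77.1, 89.4, 98.8, 100.0]

def plot_scoring(path="angels_scoring.png"):
    """Draw both lines and save the chart to `path` (a hypothetical file name)."""
    # Imported here so the data above can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    positions = list(range(len(innings)))
    ax.plot(positions, per_inning, color="blue", marker="o", label="per inning")
    ax.plot(positions, cumulative, color="red", marker="o", label="cumulative (ogive)")
    ax.set_xticks(positions)
    ax.set_xticklabels(innings)
    ax.set_ylabel("percentage of runs")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_scoring()` writes the image file and returns its path. The red line can only rise or stay level, because each point adds a non-negative percentage to the one before it.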
Sunday, May 11, 2014