Sunday, May 11, 2014

Notes for May 6th and 8th

Bayesian probability

If a trait is very rare, only a very accurate test gives us useful information. For example, if a trait shows up in only 1 in 10,000 people but the test for the trait has an error rate of 1 in 1,000, we should expect about 10 false positives for every true positive. Here is the completed table for that situation, using a population of 10,000,000 people.

____________don't have______have_____row total

test +___________9,999_______999________10,998

test -_______9,989,001_________1_____9,989,002

col. total___9,999,000_____1,000____10,000,000 grand total
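
The table above can be checked with a short script. This is only a sketch, not part of the class notes: the function name `contingency_table` and its parameters are made up for illustration, and it assumes, as the example does, that the same error rate applies to people who have the trait and to people who don't.

```python
def contingency_table(population, prevalence_denominator, error_denominator):
    """Build the 2x2 table of counts for one round of testing.

    Assumption: the test is wrong 1 time in `error_denominator` for both
    the "have" and the "don't have" columns.
    """
    have = population // prevalence_denominator
    dont_have = population - have
    false_negative = round(have / error_denominator)       # have, but test -
    false_positive = round(dont_have / error_denominator)  # don't have, but test +
    return {
        "test +": {"don't have": false_positive, "have": have - false_negative},
        "test -": {"don't have": dont_have - false_positive, "have": false_negative},
    }


table = contingency_table(10_000_000, 10_000, 1_000)
positives = table["test +"]
total_positive = positives["don't have"] + positives["have"]

print(table)
# False positives for every true positive (roughly 10).
print(positives["don't have"] / positives["have"])
# Share of all positive results that are wrong.
print(positives["don't have"] / total_positive)
```

The last line divides 9,999 by 10,998, which is the error rate among positive results for a single test.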

In a situation such as this, a single positive test is wrong about 90.9% of the time (9,999 of the 10,998 positives are false), so testing positive twice could give us more useful information. For retesting to help, we have to assume the errors are random and not deterministic. For example, if a test for a chemical compound in opium also catches a similar compound found in poppy seed bagels, testing twice won't get rid of the errors. Assuming only random errors, here is what we do.

Step 1: The top row of the first contingency table becomes the bottom row (column totals and grand total) of the second contingency table. This takes the people who tested positive the first time and makes them the totals for those who will be tested a second time.

____________don't have______have_____row total

test +________________________________________

test -________________________________________

col. total_______9,999_______999________10,998 grand total

Step 2: Multiply the error rate by the "have" column total to find the number of people who have the trait but test negative. Round to the nearest whole number. (We didn't have to round before, but now we do.)
999*1/1000 = .999 ~= 1, which means the number who test positive and have the trait is 999 - 1 = 998.

____________don't have______have_____row total

test +_______________________998______________

test -_________________________1______________

col. total_______9,999_______999________10,998 grand total

Step 3: Multiply the error rate by the "don't have" column total to find the false positives. 9,999*1/1000 = 9.999 ~= 10. That means the test negative count in that column is 9,999 - 10 = 9,989.

____________don't have______have_____row total

test +______________10_______998______________

test -___________9,989_________1______________

col. total_______9,999_______999________10,998 grand total

Step 4: Add across each row to find the row totals.

____________don't have______have_____row total

test +______________10_______998_________1,008

test -___________9,989_________1_________9,990

col. total_______9,999_______999________10,998 grand total

Step 5: Find the error rate for testing positive twice. 10/1,008 = .0099... or about 1%.

Of the ten million people tested, we would send letters to 1,008 telling them they tested positive twice. Of those people, ten don't have the trait and are getting false information, but 998 are getting the right information. In the first test, there was someone with the trait who tested negative, and the same is true in the second test, so there are two people with the trait who did not get two positive test results. While this isn't a perfect situation, it's much better than the over 90% error rate we got for positive tests the first time through. 
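
Steps 1 through 5 can be sketched in a few lines of Python. This is an illustration rather than anything from class: the function name `retest` and its parameters are hypothetical, and it assumes the second test has the same 1 in 1,000 error rate and that its errors are independent of the first test's errors.

```python
def retest(dont_have, have, error_denominator):
    """Apply one more round of testing to a group with known column totals.

    Counts are rounded to whole people, as in Steps 2 and 3.
    """
    false_negative = round(have / error_denominator)       # Step 2
    false_positive = round(dont_have / error_denominator)  # Step 3
    return {
        "test +": {"don't have": false_positive, "have": have - false_negative},
        "test -": {"don't have": dont_have - false_positive, "have": false_negative},
    }


# Step 1: the column totals are the "test +" row of the first table.
second = retest(dont_have=9_999, have=999, error_denominator=1_000)

# Step 4: row total for the people who tested positive twice.
twice_positive = second["test +"]
row_total = twice_positive["don't have"] + twice_positive["have"]

# Step 5: error rate among people who tested positive twice.
error_rate = twice_positive["don't have"] / row_total

print(second)
print(row_total, round(error_rate, 4))
```

The final line prints the 1,008 people who tested positive twice and an error rate of about 0.0099, matching the 10/1,008 worked out by hand.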

Relative frequency charts and ogives

Relative frequencies are also known as proportions and can sometimes be considered probabilities. If the proportions correspond to ordered categories, a line chart can be a clear way to present the data.

Here is the data for the Los Angeles Angels' scoring by inning, given as percentages. The first number is the percentage of the team's runs scored in that inning, and the second number in brackets is the cumulative percentage of runs scored up through that inning.

1st: 13.5% [13.5%]
2nd: 10.0% [23.5%]
3rd: 10.4% [33.9%]
4th: 11.0% [45.0%]
5th: 11.7% [56.6%]
6th: 11.7% [68.3%]
7th: 8.8% [77.1%]
8th: 12.3% [89.4%]
9th: 9.4% [98.8%]
extra innings: 1.2% [100.0%]

The Angels are fairly consistent, except for the big bump in the first inning and the drop-off in the 7th. Unsurprisingly, they score very few extra inning runs, though some teams like the Giants score four times as many.


In a line graph, we have two possible options. The first is the line in blue, which shows the production inning by inning. The red line shows the cumulative numbers, and that graph is called the ogive, pronounced "oh-jive". If the Angels were completely consistent, the ogive would be a straight line, but instead we see slight bends in the red line where run production increases or decreases from inning to inning.
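
The bracketed cumulative percentages, which are the heights of the ogive, are running totals of the inning-by-inning percentages. Here is a minimal sketch using Python's `itertools.accumulate`; the variable names are made up for illustration. Because the listed percentages are already rounded to one decimal place, a running total built from them can differ from the bracketed values by 0.1 (it gives 44.9% through the 4th inning where the list shows 45.0%).

```python
from itertools import accumulate

innings = ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th",
           "extra innings"]
# Percentage of runs scored in each inning, copied from the list above.
percent_by_inning = [13.5, 10.0, 10.4, 11.0, 11.7, 11.7, 8.8, 12.3, 9.4, 1.2]

# Running totals: these are the points the ogive passes through.
cumulative = [round(total, 1) for total in accumulate(percent_by_inning)]

for inning, pct, running in zip(innings, percent_by_inning, cumulative):
    print(f"{inning}: {pct:.1f}% [{running:.1f}%]")
```

Plotting `percent_by_inning` against the innings would give the inning-by-inning line, and plotting `cumulative` would give the ogive.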