Statistics on a budget: ogives

Sunday, May 11, 2014

Notes for May 6th and 8th

Bayesian probability

If a trait is very rare, only a very accurate test gives us useful information. For example, if a trait shows up in only 1 in 10,000 people but the test for the trait has an error rate of 1 in 1,000, we should expect about 10 false positives for every true positive. Here is the completed table for that situation.

________don't___have____row total

test + ____9,999__999______10,998__

test - _9,989,001___1_____9,989,002__

col.___9,999,000_1,000______10,000,000 grand total
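
Every entry in the table follows from three inputs: the number of people tested, how rare the trait is, and the error rate. Here is a minimal Python sketch of that arithmetic. The variable names are mine, and it assumes, as the notes do, that the test is wrong at the same 1-in-1,000 rate whether or not a person has the trait.

```python
from fractions import Fraction

population = 10_000_000
prevalence = Fraction(1, 10_000)   # 1 in 10,000 people have the trait
error_rate = Fraction(1, 1_000)    # assumed equal for both kinds of error

# Column totals
have = population * prevalence
dont_have = population - have

# Cells of the table
false_neg = have * error_rate        # have the trait, test -
true_pos = have - false_neg          # have the trait, test +
false_pos = dont_have * error_rate   # don't have the trait, test +
true_neg = dont_have - false_pos     # don't have the trait, test -

# Row totals
positives = false_pos + true_pos
negatives = true_neg + false_neg

print(f"test +  {int(false_pos):>10,} {int(true_pos):>6,} {int(positives):>11,}")
print(f"test -  {int(true_neg):>10,} {int(false_neg):>6,} {int(negatives):>11,}")
print(f"col.    {int(dont_have):>10,} {int(have):>6,} {population:>11,}")

# Share of positive results that are wrong
share_wrong = false_pos / positives
print(f"positives that are wrong: {float(share_wrong):.1%}")
```

Exact fractions are used so that no rounding sneaks in; every cell comes out as a whole number of people.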

In a situation such as this, testing positive twice could give us useful information, since a single positive result is wrong about 90.9% of the time (9,999 of the 10,998 positives). We have to assume the errors are random and not deterministic. For example, if a test for a chemical compound in opium also catches a similar compound found in poppy seed bagels, testing twice won't get rid of the errors. Assuming the errors are random, here is what we do.

Step 1: The top row of the first contingency table becomes the column total/grand total row of the second contingency table. This takes the counts of the people who tested positive the first time and makes them the totals for the group being tested a second time.

________don't___have____row total

test + __________________________

test - __________________________

col._______9,999__999_____10,998 grand total

Step 2: Multiply the error rate by the have column total to find how many people who have the trait test negative. Round to the nearest whole number. (We didn't have to round before, but now we do.)
999*1/1000 = .999 ~= 1, which means test positive and have is 999 - 1 = 998.

________don't___have____row total

test + ___________998____________

test - _____________1____________

col._______9,999__999_____10,998 grand total

Step 3: Multiply the error rate by the don't have column total to find the false positives. 9,999*1/1000 = 9.999 ~= 10. That means the test negative entry in that column is 9,999 - 10 = 9,989.

________don't___have____row total

test + _______10__998____________

test - _____9,989____1____________

col._______9,999__999_____10,998 grand total

Step 4: Add across each row to fill in the row totals.

________don't___have____row total

test + _______10__998_____1,008___

test - _____9,989____1_____9,990___

col._______9,999__999_____10,998 grand total

Step 5: Find the error rate for testing positive twice. 10/1,008 = .0099... or about 1%.

Of the ten million people tested, we would send letters to 1,008 telling them they tested positive twice. Of those people, ten don't have the trait and are getting false information, but 998 are getting the right information. In the first test, there was someone with the trait who tested negative, and the same is true in the second test, so there are two people with the trait who did not get two positive test results. While this isn't a perfect situation, it's much better than the over 90% error rate we got for positive tests the first time through.
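
Steps 1 through 5 can be sketched in a few lines of Python. This is only an illustration of the arithmetic above: the function name `retest` and the dictionary keys are my own, and the 1-in-1,000 error rate is assumed to apply to both columns, with results rounded to whole people as in Steps 2 and 3.

```python
def retest(dont_have, have, error_rate):
    """Build the second contingency table from the first table's top row."""
    false_neg = round(have * error_rate)        # Step 2: have, test -
    true_pos = have - false_neg                 # Step 2: have, test +
    false_pos = round(dont_have * error_rate)   # Step 3: don't have, test +
    true_neg = dont_have - false_pos            # Step 3: don't have, test -
    return {
        "false_pos": false_pos,
        "true_pos": true_pos,
        "true_neg": true_neg,
        "false_neg": false_neg,
        "positives": false_pos + true_pos,      # Step 4: row totals
        "negatives": true_neg + false_neg,
    }

# Step 1: the people who tested positive the first time are the new column totals.
second = retest(dont_have=9_999, have=999, error_rate=1 / 1000)

# Step 5: error rate among people who tested positive twice.
error_twice = second["false_pos"] / second["positives"]

print(second)
print(f"error rate after two positive tests: {error_twice:.1%}")
```

The same function could be fed its own top row again to see what a third test would do, though with counts this small the rounding starts to matter more than the result.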

Relative frequency charts and ogives

Relative frequencies are also known as proportions and can sometimes be considered probabilities. If the proportions correspond to ordered categories, a line chart can be a clear way to present the data.

Here is the data for the Los Angeles Angels' scoring by inning, given as percentages. The first number is the percentage of the team's runs scored in that inning, and the second number, in brackets, is the cumulative percentage of runs scored up through that inning.

1st: 13.5% [13.5%]
2nd: 10.0% [23.5%]
3rd: 10.4% [33.9%]
4th: 11.0% [45.0%]
5th: 11.7% [56.6%]
6th: 11.7% [68.3%]
7th: 8.8% [77.1%]
8th: 12.3% [89.4%]
9th: 9.4% [98.8%]
extra innings: 1.2% [100.0%]

The Angels are fairly consistent, except for the big bump in the first inning and the drop-off in the 7th. Unsurprisingly, they score very few extra inning runs, though some teams like the Giants score four times as many.

In a line graph, we have two possible options. The first is the line in blue, which shows the production inning by inning. The red line shows the cumulative numbers, and that graph is called the ogive, pronounced "oh-jive". If the Angels were completely consistent, the ogive would be a completely straight line, but instead we see slight bends in the red line where run production increases and decreases from inning to inning.
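
The points on an ogive are just running totals of the per-inning percentages. Here is a short Python sketch using the Angels figures from the list above; the variable names are mine. Because it adds the already rounded per-inning numbers, it gives 44.9% through the 4th inning, a tenth off the 45.0% in the list, presumably because the listed cumulative figures came from unrounded data.

```python
from itertools import accumulate

innings = ["1st", "2nd", "3rd", "4th", "5th",
           "6th", "7th", "8th", "9th", "extra innings"]
per_inning = [13.5, 10.0, 10.4, 11.0, 11.7, 11.7, 8.8, 12.3, 9.4, 1.2]

# Running total of the per-inning percentages; these are the ogive's heights.
cumulative = [round(total, 1) for total in accumulate(per_inning)]

for inning, pct, cum in zip(innings, per_inning, cumulative):
    print(f"{inning}: {pct:.1f}% [{cum:.1f}%]")
```

Plotting `per_inning` against the innings gives the blue line, and plotting `cumulative` gives the red one.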

Wednesday, February 18, 2009

Class notes for 2/18

Increase and decrease, in absolute terms and in percentage: When we have numbers that have changed over time, we can get the absolute change by subtracting the old number from the new. If the numbers are ratio data, which as you will recall means that zero stands for the complete lack of a thing, then we can also discuss the percentage change, which we get through the formula (new-old)/old*100. Don't forget to put the parentheses around the subtraction in the numerator.

For instance, the illustration shows the differing fortunes of two American corporations over the last eight years by comparing each company's market capitalization in 2000 to the same statistic in 2008. (Market capitalization means how much money it would cost to buy all the company's stock at the price it is trading for.)

Apple has had a good century so far, and their stock's value has risen from $5.56 billion in 2000 to $85.25 billion in 2008. In absolute terms that is an increase of (85.25-5.56) = 79.69 billion dollars.

The percentage increase is even more impressive: (85.25-5.56)/5.56*100 = 1433.27..., so the market capitalization has increased by a whopping 1433.3%.

General Motors has had a bad eight years, and their market capitalization has shrunk from $28.3 billion in 2000 to $2.99 billion in 2008. (2.99-28.3) = -25.31, so the absolute decrease is about $25.3 billion. The percentage decrease is (2.99-28.3)/28.3*100 = -89.43..., which means the percentage decrease for GM's market capitalization is about 89.4%. This illustrates an important point. It is possible for a percentage increase to be more than 100%, and this happens any time a number more than doubles. But unless a formerly positive number goes negative, and that is a rare thing, the percentage decrease will not be more than 100%.
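
Both calculations above are easy to check with a couple of small Python functions. The function names are my own; the formula is the (new-old)/old*100 from the notes, and the zero check is an added assumption, since a percentage change from zero is undefined.

```python
def absolute_change(old, new):
    """New value minus old value; negative means a decrease."""
    return new - old

def percent_change(old, new):
    """(new - old) / old * 100; negative means a percentage decrease."""
    if old == 0:
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) / old * 100

# Market capitalization in billions of dollars, 2000 versus 2008
apple_old, apple_new = 5.56, 85.25
gm_old, gm_new = 28.3, 2.99

print(f"Apple: {absolute_change(apple_old, apple_new):+.2f} billion, "
      f"{percent_change(apple_old, apple_new):+.1f}%")
print(f"GM:    {absolute_change(gm_old, gm_new):+.2f} billion, "
      f"{percent_change(gm_old, gm_new):+.1f}%")
```

Writing the subtraction inside its own parentheses in `percent_change` is the same precaution as putting parentheses around the numerator on a calculator.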

Bar charts, often loosely called histograms: Bar charts are a popular way to represent categorical data, whether ordered or unordered. One of the great advantages of bar charts over pie charts is that it is easier to compare two data sets side by side. In the picture on the left, we see the percentages of people in four different demographic groups in the states of Florida (the yellow bars) and Texas (the red bars). What we see is that Texas has a slightly larger percentage of children under 5 years old (8% to 6%), of juveniles (19% to 16%) and of adults under the age of 65 (63% to 61%), but Florida makes up for that gap by having a significantly higher percentage of senior citizens aged 65 and over (17% to 10%). Because Texas has a much higher population than Florida, it makes sense for us to look at relative frequency instead of frequency, so we can compare the data sets side by side and not have the Texas numbers completely overwhelm the Florida numbers.

Line charts: A very common use for line charts is to track numerical data over time. Newspapers and financial websites often show how the price of a commodity or stock is doing by using line charts, which look something like the profile of a mountain range. This graph was taken from kitco.com, a website devoted to trading metals, including gold, silver, copper and platinum. These are the prices of gold minute by minute for three days, Monday, Feb. 16, 2009 to Wednesday, Feb. 18. The Monday prices are shown in blue, the Tuesday prices in red and the Wednesday prices in green. What we can see is that prices rose on Monday from under $940 an ounce to nearly $960, Tuesday showed a steady climb from $960 to $970 an ounce, then Wednesday prices fell a bit early in the day, but rallied to finish near $980 an ounce.

Line charts can also be used to represent data that might be shown as histograms. This line chart in red and yellow shows the same demographic data we had in the bar chart section from above, with red showing Texas data and yellow showing Florida data.

There are two sets of lines in each color. The thin lines are the same as the histograms from the section above, while the thick lines that climb to 100% at the far right are ogives, pronounced "Oh-jives", which track the cumulative amount. Here are the numbers for each state, first for each demographic group on its own and then accumulated from youngest to oldest.

Florida:
Under 5 years old: 6.2%
5-17 years old: 16.0%
18-64 years old: 61.0%
65 and over: 16.8%

Florida (cumulative):
Under 5 years old: 6.2%
Under 18 years old: 22.2%
Under 65 years old: 83.2%
All age groups: 100%

Texas:
Under 5 years old: 8.1%
5-17 years old: 19.4%
18-64 years old: 62.5%
65 and over: 9.9%

Texas (cumulative):
Under 5 years old: 8.1%
Under 18 years old: 27.5%
Under 65 years old: 90.0%
All age groups: 100%

Pie charts: Pie charts work best with unordered categorical data. The pie slices are ordered from largest to smallest. The way most people set up the start of the data is to put a line segment pointing straight up and start putting the pie slices in clockwise from largest to smallest. This data shows the percentages of racial groups in the U.S. as of 2008. Besides the four main racial groups of whites (here identified as European Americans), Hispanics, African Americans and Asian Americans, all other racial groups account for less than 0.5%.

There is an alternative way of setting up pie charts where the "start line" points to 3 o'clock and the data is put from largest to smallest, moving counterclockwise. This picture is taken from the Sitemeter website attached to another blog that I maintain. 70% of the last 100 visitors are from the United States, 11% are from Canada, 10% cannot be recognized by the Sitemeter software by country of origin, followed by 3% United Kingdom, 2% Chile, and 1% each for Panama, Ireland, Guam and Germany.
