Tuesday, February 10, 2009

Class notes for 2/9

We are now dealing with proportions, and the formulas are as follows:

Population: p = F/N

Sample: p-hat = f/n

We often want to compare one proportion to another: two from the same sample, comparable proportions from different samples, or a sample proportion and the population proportion. Fractions with different denominators are hard to compare at a glance, so it is better to write the numbers as decimals or in scales based on powers of 10.

Scales based on powers of 10: The most famous scale based on powers of ten is percentage, which really means "per 100". It is much more common to see "53% of the people agree with the president's plan" than ".53 of the people..." or "53 out of every 100 people...". Technically, all those phrases are saying the same thing, but percentage is the most popular.

One of the places where decimals are used for proportions is in the sports pages. A batting average in baseball (hits/at-bats) is given as a decimal to three places, and likewise winning proportions (wins/total games) are written as .xxx. If a batter has 27 hits in 92 at-bats, the batting average 27/92 = .293478261... is shortened to .293 and pronounced "two ninety three". Likewise, a team that has won 17 games and lost 5 will have a winning proportion of 17/22 = .77272727..., which rounds to .773 and is often stated as "team has a winning percentage of seven seventy three." Technically, this is a mistake, because "percentage" means out of 100. The correct word from the dictionary, which no one ever uses, is "permillage", which means out of 1,000. The team in question would have a winning percentage of 77, and a winning permillage of 773.

In both of the cases from the sports pages, the greater number of places after the decimal is used to break ties. For example, a team with 14 wins and 4 losses is at .778, which is better than 17 wins and 5 losses, while 20 wins and 6 losses is at .769, so is slightly worse.

To get a number based on a power of 10 scale, you take the proportion and multiply by the power of ten, so it is either p*scale or p-hat*scale, depending on population or sample. Besides greater precision for breaking ties, sometimes we need greater precision because the proportions are so small.
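The scaling rule above can be sketched in a few lines of Python. The function name `on_scale` is made up for this illustration; it takes the count f, the total n and a power-of-ten scale, and the numbers are the sports examples already used in these notes.

```python
def on_scale(f, n, scale):
    """Return the proportion f/n expressed on the given power-of-ten scale."""
    return f / n * scale

# Batting average from the notes: 27 hits in 92 at-bats.
batting = on_scale(27, 92, 1)           # scale of 1 gives the plain decimal
print(f"{batting:.3f}")                 # 0.293

# Winning record from the notes: 17 wins and 5 losses, so 22 games.
print(round(on_scale(17, 22, 100)))     # percentage: 77
print(round(on_scale(17, 22, 1000)))    # permillage: 773
```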

When I ask a class what is the legal limit for blood alcohol while driving, invariably someone will say "point oh eight" and most people will agree. But .08 is wrong. .08 = 8%, and the correct answer is .08% = .0008. I don't blame the students. The number is badly represented and it is an easy mistake to make. Let's take a look at the number on other scales of 10.

.08 out of 100 is the same as
.8 out of 1,000 or
8 out of 10,000 or
80 out of 100,000

80 parts out of 100,000 is a tiny proportion. To give an idea, one ounce of pure alcohol mixed into ten gallons of blood would give you 78 parts out of 100,000, and most people have between a half gallon and a gallon and a half of blood in their body, between 4 and 12 pints. The amount of alcohol in a person's bloodstream that is over the legal limit is about the same amount of alcohol as found in a capful of mouthwash used after brushing your teeth.
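As a quick check of the arithmetic above, here is a small Python sketch. It assumes US measures (128 fluid ounces per gallon), which the notes do not state explicitly, and the variable names are only for illustration.

```python
from fractions import Fraction

legal_limit = Fraction(8, 10000)   # .08% written as a plain proportion, .0008

# The same proportion on four power-of-ten scales.
for scale in (100, 1_000, 10_000, 100_000):
    print(f"{float(legal_limit * scale):g} out of {scale:,}")

# One fluid ounce in ten US gallons (assumed: 128 fluid ounces per gallon).
ounce_in_ten_gallons = Fraction(1, 10 * 128)
print(round(float(ounce_in_ten_gallons * 100_000)))   # 78
```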

We will be dealing with much smaller proportions later in the class, where there are things that can be hazardous to your health at ranges measured in parts per billion, but for now, we will look at the per 100,000 scale for another type of statistic, measurements of mortality rates.

Here are the numbers of homicides in some local cities in 2007.

Oakland: 124 homicides
Richmond: 28 homicides
San Francisco: 98 homicides

Clearly, comparing these numbers is misleading, because we know these cities have very different numbers of citizens, so the standard way to measure these statistics is the per 100,000 population scale, which we find by the formula

f/n * scale

which in this case is

(# of homicides)/(city population) * 100,000

Oakland's population in 2007 is estimated at 415,000, Richmond at 106,000 and San Francisco at 825,000, so the murder rates on this standard scale are as follows

Oakland: 124/415000 * 100000 = 29.9
Richmond: 28/106000 * 100000 = 26.4
San Francisco: 98/825000 * 100000 = 11.9

So even though more people were murdered in San Francisco than in Richmond in 2007, the murder rate in Richmond was over twice as high, because Richmond has barely 1/8 of the population of San Francisco. (note: The trends for the three cities this decade are going in different directions. Oakland's murder rate is on the rise, while Richmond's is falling and San Francisco's has stayed about the same.)
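The same calculation can be sketched in Python, using only the counts and population estimates quoted above. The dictionary layout and the name `SCALE` are choices made for this example.

```python
# Homicide counts and 2007 population estimates as given in the notes.
cities = {
    "Oakland": (124, 415_000),
    "Richmond": (28, 106_000),
    "San Francisco": (98, 825_000),
}

SCALE = 100_000

# f/n * scale for each city.
rates = {city: f / n * SCALE for city, (f, n) in cities.items()}

for city, rate in rates.items():
    print(f"{city}: {rate:.1f} per {SCALE:,}")

# How many times higher Richmond's rate is than San Francisco's.
print(f"{rates['Richmond'] / rates['San Francisco']:.2f}")
```

The last line uses the unrounded rates, so the ratio is not affected by rounding each rate to one decimal place first.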

Calculating proportions (probabilities): There are times when we will need to find new proportions from information previously calculated, either adding and subtracting old numbers or multiplying or dividing. It's best to use the fractional forms of the data when available, then round the answers after using the exact numbers instead of using answers that might have been rounded. Every time you use a rounded answer in a calculation, there is a chance to increase the rounding error even more.
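A tiny Python sketch shows why rounding should wait until the end. The proportion 1/3 is a made-up value chosen only because it rounds badly; it does not come from any of the data sets.

```python
from fractions import Fraction

third = Fraction(1, 3)                              # made-up exact proportion

# Round first (to two decimal places), then add three copies.
rounded_third = Fraction(round(third * 100), 100)   # 33/100
print(float(3 * rounded_third))                     # 0.99

# Add the exact fractions, then round once at the end.
print(round(float(3 * third), 2))                   # 1.0
```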

The words "proportions" and "probabilities" will be used interchangeably in the rest of this post.

Contingency tables and compound probabilities: Let's take the data from data set #2 regarding gender and left/right handedness and turn it into a contingency table.


___R__L_
M__9__3_
F_29__1_

What these numbers represent is there are 9 right-handed males, 3 left-handed males, 29 right-handed females and 1 left-handed female. We will now add the row totals, the column totals and the grand total, which will be marked in red.


___R__L_
M__9__3_ 12
F_29__1_ 30
__38__4_ 42 = grand total



This gives us the following probabilities.  We assume this is a sample so these values are p-hat.

p-hat(female) = 30/42
p-hat(male) = 12/42
p-hat(left) = 4/42
p-hat(right) = 38/42

We can also combine values from different variables as follows.

p-hat(female and left) = 1/42
p-hat(female and right) = 29/42
p-hat(male and left) = 3/42
p-hat(male and right) = 9/42

When we use the conjunction "and", we take the number of subjects that would answer yes to being both female and left handed, for example, and divide by the size of the data set. This means a single entry from the contingency table divided by the grand total.
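The totals and the "and" proportions can be reproduced with a short Python sketch. The nested dictionary and the helper name `p_hat_and` are illustrative choices, not part of the class materials.

```python
from fractions import Fraction

# Counts from the contingency table: table[gender][hand].
table = {
    "male":   {"right": 9,  "left": 3},
    "female": {"right": 29, "left": 1},
}

row_totals = {g: sum(hands.values()) for g, hands in table.items()}
col_totals = {h: sum(table[g][h] for g in table) for h in ("right", "left")}
grand_total = sum(row_totals.values())

def p_hat_and(gender, hand):
    """p-hat(gender and hand): one table entry over the grand total."""
    return Fraction(table[gender][hand], grand_total)

print(row_totals, col_totals, grand_total)
print(p_hat_and("female", "left"))   # 1/42
```

Note that `Fraction` reduces to lowest terms, so a value such as 30/42 would print as 5/7.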

The conjunction "or" means we want all the subjects that are in the combination of a row and a column together, while being careful not to count anyone twice. We use the principle of inclusion and exclusion when calculating this, which is as follows.

p(A or B) = p(A) + p(B) - p(A and B)

The rule is the same whether we are dealing with p or p-hat.

The reason we subtract is as follows. If I count all the women in the set, and then all the left-handed people in the set and add those together, any left-handed women were counted twice, so we subtract the total of left-handed women to get the count correct.

p-hat(female) = 30/42
p-hat(left) = 4/42
p-hat(female and left) = 1/42

p-hat(female or left) = 30/42 + 4/42 - 1/42 = 33/42
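Here is a minimal Python sketch of the inclusion and exclusion calculation, with a cross-check that counts the table cells directly. The variable names are only for illustration.

```python
from fractions import Fraction

n = 42                                   # grand total from the table
p_female = Fraction(30, n)
p_left = Fraction(4, n)
p_female_and_left = Fraction(1, n)

# Inclusion and exclusion: subtract the overlap that was counted twice.
p_female_or_left = p_female + p_left - p_female_and_left
print(p_female_or_left)                  # 11/14, the reduced form of 33/42

# Cross-check by counting cells: all females (29 + 1) plus the 3 left-handed males.
assert p_female_or_left == Fraction(29 + 1 + 3, n)
```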

Another way to combine proportions is the conjunction "given". The idea of p(female, given left) is the proportion of females within the subset of left-handed people, while p(left, given female) is the proportion of left-handers among the women. The formula for this is

p-hat(A, given B) = p-hat(A and B)/p-hat(B)

In a contingency table, the easiest way to calculate this is to find the place in the table that tells us how many people are in the row and column that correspond to A and B, then divide by the row or column total that corresponds to B. Here are the eight different values we have for the given probabilities.

p-hat(female, given left) = 1/4
p-hat(female, given right) = 29/38
p-hat(male, given left) = 3/4
p-hat(male, given right) = 9/38
p-hat(left, given female) = 1/30
p-hat(right, given female) = 29/30
p-hat(left, given male) = 3/12 = 1/4
p-hat(right, given male) = 9/12 = 3/4
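The "given" formula can be checked against the table shortcut with a short Python sketch. The helper names `p_and`, `p_gender` and `p_hand` are invented for this example.

```python
from fractions import Fraction

table = {
    "male":   {"right": 9,  "left": 3},
    "female": {"right": 29, "left": 1},
}
n = sum(sum(hands.values()) for hands in table.values())   # 42

def p_and(gender, hand):
    """Proportion in one cell of the table."""
    return Fraction(table[gender][hand], n)

def p_gender(gender):
    """Proportion in one row."""
    return Fraction(sum(table[gender].values()), n)

def p_hand(hand):
    """Proportion in one column."""
    return Fraction(sum(table[g][hand] for g in table), n)

# p(A, given B) = p(A and B) / p(B)
print(p_and("female", "left") / p_hand("left"))       # 1/4
print(p_and("female", "left") / p_gender("female"))   # 1/30
```

Dividing the two proportions cancels the grand total, which is why the shortcut of dividing the cell by the row or column total gives the same answer.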

Complementary events and their probabilities: The complement of a subset is all the elements that are in the whole set but not in the subset. In a variable with two values, the complement of one value is simply the other value, so the complement of lefthanders is righthanders, and the complement of men is women. In a variable with more than two values, the complement of a value is all the other values. The complement of the 20-29 value in age group would be the subjects 19 and under combined with the subjects 30 and over. Since we are using the words "and", "or" and "given" as our conjunctions, the word used for complement in such a setting is "not".

If we know p(A), the easiest way to calculate the probability of the complement is

p(not A) = 1 - p(A).

In some books and in my notes, the probability of the complementary event will be denoted by the letter q, defined by the equation p + q = 1, or q = 1 - p.

Again, the rules for p and q are the same for p-hat and q-hat.

Here are the complements of some of the ideas we have defined using the conjunctions.

female complement = male
male complement = female
left complement = right
right complement = left

female and left complement = male or right
male and left complement = female or right

female, given left complement = male, given left
male, given left complement = female, given left
female, given right complement = male, given right
male, given right complement = female, given right
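As a sanity check on the "and" complements listed above, this Python sketch computes the complement of "female and left" two ways: with the 1 - p rule, and by working out "male or right" through inclusion and exclusion. The variable names are only illustrative.

```python
from fractions import Fraction

n = 42                                     # grand total from the table
p_female_and_left = Fraction(1, n)

# Complement rule: p(not A) = 1 - p(A).
q = 1 - p_female_and_left
print(q)                                   # 41/42

# "male or right" by inclusion and exclusion should give the same value:
# p(male) + p(right) - p(male and right).
p_male_or_right = Fraction(12, n) + Fraction(38, n) - Fraction(9, n)
assert q == p_male_or_right
```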


Practice problems: (answers given in comments)

1) Here are the homicide numbers for Oakland, Richmond and San Francisco from earlier in this century.

Oakland: 96 homicides, 399,000 population
Richmond: 40 homicides, 99,000 population
San Francisco: 96 homicides, 775,000 population

Find the murder rates from these years, rounded to the nearest tenth per 100,000 population and rank them from lowest (1st) to highest (3rd).

2) Find the complements of the following sets and the probabilities for the set and the complement rounded to the nearest tenth of a percent. (If it rounds exactly to a percent, you can write the answer as 42% instead of 42.0%, to give an example.)

a) left, given female
b) left and female
c) left or female
d) female, given left

2 comments:

Prof. Hubbard said...

problem 1)

Oakland: 96/399000*100000 = 24.1 per 100,000
Richmond: 40/99000*100000 = 40.4 per 100,000
San Francisco: 96/775000*100000 = 12.4 per 100,000

The rankings:
S.F.: 1st
Oakland: 2nd
Richmond: 3rd

2)

complements:

left, given female -> right, given female
left and female -> right or male
left or female -> right and male
female, given left -> male, given left

proportions to the nearest tenth of a percent:

p(left, given female) = 3.3%
p(right, given female) = 96.7%

p(left and female) = 2.4%
p(right or male) = 97.6%

p(left or female) = 78.6%
p(right and male) = 21.4%

p(female, given left) = 25%
p(male, given left) = 75%

Anonymous said...

Can you Please post the NOTES from Wednesday, Feb 11, 2009. Thanks.