Statistics on a budget: Notes for 28 January 2014

Frequency tables

Last Thursday, we took a first look at stem-and-leaf plots, a method for writing a data set using less space and less ink (or, if we are on the Internet, less bandwidth). For data sets that have a lot of duplication, we can use a frequency table, where each numerical value x is matched up with its frequency f(x). Here is an example using hockey scores.

x || f(x)
7 || 1
6 || 1
5 || 3
4 || 4
3 || 6
2 || 2
1 || 3
0 || 2

This set isn't that big, so let me write it out as a list in order.

7, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1, 0, 0

Here are the methods for finding the length of the list n and the standard statistics: mean, median and mode.

Length of list n. We take the sum of the f(x) column.

x || f(x)
7 || 1
6 || 1
5 || 3
4 || 4
3 || 6
2 || 2
1 || 3
0 || 2
total || 22

The list has 22 values. We can use this to find mean and median.

Mode. The mode is easy to find using a frequency table. The biggest f(x) is 6, and it corresponds to x = 3. That means the mode is 3, the number that shows up most often.
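
If you would rather let a computer do the counting, here is a minimal sketch in Python of the two steps so far. The dictionary name freq is my own choice for illustration, and the max trick reports only one value, so it assumes there is no tie for the biggest f(x).

```python
# Frequency table from the hockey example: score x -> frequency f(x)
freq = {7: 1, 6: 1, 5: 3, 4: 4, 3: 6, 2: 2, 1: 3, 0: 2}

# Length of list n: the sum of the f(x) column
n = sum(freq.values())

# Mode: the x whose f(x) is largest (returns a single x, even if there is a tie)
mode = max(freq, key=freq.get)

print(n)     # 22
print(mode)  # 3
```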

Mean. We need to know the sum of all 22 values. For this we multiply x*f(x) in each row and add up the products. The idea is that there is one 7, so that adds 7 to the total, but there are three 5s, so that adds 15 to the total.

x || f(x) || x*f(x)
7 || 1 || 7
6 || 1 || 6
5 || 3 || 15
4 || 4 || 16
3 || 6 || 18
2 || 2 || 4
1 || 3 || 3
0 || 2 || 0
total || 22 || 69

The average or mean is 69/22 = 3.1363636... Since the data values are all whole numbers, we round the average to the nearest tenth, which we write as x-bar = 3.1.
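
The same arithmetic can be sketched in a few lines of Python. The names freq, total and mean are mine, chosen for illustration.

```python
# Frequency table from the hockey example: score x -> frequency f(x)
freq = {7: 1, 6: 1, 5: 3, 4: 4, 3: 6, 2: 2, 1: 3, 0: 2}

n = sum(freq.values())                        # sum of the f(x) column
total = sum(x * f for x, f in freq.items())   # sum of the x*f(x) column
mean = total / n

print(total)           # 69
print(round(mean, 1))  # 3.1
```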

Median. To find the median, we need to find the median position, which is always (n + 1)/2. In this case, that is (22+1)/2 = 23/2 = 11.5. The median will be the average of the 11th and 12th numbers on the list. Here's how we find those numbers using the frequency table.

x || f(x) || positions
7 || 1 || 1
6 || 1 || 2
5 || 3 || 3 through 5
4 || 4 || 6 through 9
3 || 6 || 10 through 15
2 || 2 || 16 through 17
1 || 3 || 18 through 20
0 || 2 || 21 through 22

I went to the end of the list, but I could have stopped as soon as we knew that the value 3 was in positions 10 through 15. That means the 11th number is a 3 and so is the 12th number. The average of 3 and 3 is

(3+3)/2 = 3.

In this case, the median is 3.
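
Here is one way the position method could be sketched in Python. The function name median_from_table is hypothetical, and I am assuming the table is read from the highest x down, the same way it is written in these notes.

```python
def median_from_table(freq):
    """Median of a data set given as {value: frequency}."""
    n = sum(freq.values())
    # Median position is (n + 1)/2: one position if n is odd,
    # the two positions on either side of it if n is even.
    if n % 2 == 1:
        targets = [(n + 1) // 2]
    else:
        targets = [n // 2, n // 2 + 1]

    found = []
    last = 0
    for x in sorted(freq, reverse=True):   # highest value first, as in the table
        first = last + 1                   # first position this value occupies
        last += freq[x]                    # last position this value occupies
        for t in targets:
            if first <= t <= last:
                found.append(x)
    return sum(found) / len(found)


freq = {7: 1, 6: 1, 5: 3, 4: 4, 3: 6, 2: 2, 1: 3, 0: 2}
print(median_from_table(freq))  # 3.0
```

Unlike working by hand, the loop does not stop early once the middle positions are found. On a table this small, that makes no practical difference.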

Relative frequencies

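We are now dealing with proportions, which are frequencies divided by the size of the whole group. The formulas are as follows:
Population: p = F/N, where F is the frequency in a population of size N
Sample: p-hat = f/n, where f is the frequency in a sample of size n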
We often want to compare one proportion to another: two from the same sample, comparable proportions from different samples, or a sample proportion and the population proportion. Because of this, it is better to write the numbers as decimals or on scales based on powers of 10.
Scales based on powers of 10: The most famous scale based on powers of ten is percentage, which really means "per 100". It is much more common to see "53% of the people agree with the president's plan" than ".53 of the people..." or "53 out of every 100 people...". Technically, all those phrases are saying the same thing, but percentage is the most popular.
One of the places where decimals are used for proportions is the sports pages. A batting average in baseball (hits/at-bats) is given as a decimal to three places, and likewise winning proportions (wins/total games) are written as .xxx. If a batter has 27 hits in 92 at-bats, the batting average 27/92 = .293478261... is shortened to .293 and pronounced "two ninety three". Likewise, a team that has won 17 games and lost 5 will have a winning proportion of 17/22 = .77272727..., which rounds to .773 and is often stated as "the team has a winning percentage of seven seventy three." Technically, this is a mistake, because "percentage" means out of 100. The correct word from the dictionary, which no one ever uses, is "permillage", which means out of 1,000. The team in question would have a winning percentage of 77 and a winning permillage of 773.
In both of the cases from the sports pages, the greater number of places after the decimal is used to break ties. For example, a team with 14 wins and 4 losses is at .778, which is better than 17 wins and 5 losses, while 20 wins and 6 losses is at .769, so is slightly worse.
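
To check the sports page numbers, here is a small sketch in Python. The function name three_place is hypothetical, and it simply rounds the proportion to three places and drops the leading zero the way the newspapers do.

```python
def three_place(successes, total):
    """Write a proportion sports-page style, for example '.293'."""
    p = successes / total
    return f"{p:.3f}".lstrip("0")   # "0.293" becomes ".293"


print(three_place(27, 92))   # .293  batting average, 27 hits in 92 at-bats
print(three_place(17, 22))   # .773  17 wins, 5 losses
print(three_place(14, 18))   # .778  14 wins, 4 losses
print(three_place(20, 26))   # .769  20 wins, 6 losses
```

Rounded to two places, the three teams come out as .77, .78 and .77, so the third decimal place is what separates 17 and 5 from 20 and 6.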
To get a number based on a power of 10 scale, you take the proportion and multiply by the power of ten, so it is either p*scale or p-hat*scale, depending on population or sample. Besides greater precision for breaking ties, sometimes we need greater precision because the proportions are so small.
When I ask a class what is the legal limit for blood alcohol while driving, invariably someone will say "point oh eight" and most people will agree. But .08 is wrong. .08 = 8%, and the correct answer is .08% = .0008. I don't blame the students. The number is badly represented and it is an easy mistake to make. Let's take a look at the number on other scales of 10.
.08 out of 100 is the same as
.8 out of 1,000 or
8 out of 10,000 or
80 out of 100,000
80 parts out of 100,000 is a tiny proportion. To give an idea, an ounce of pure alcohol mixed into ten gallons of blood would give you 78 parts out of 100,000, and most people have between a half gallon and a gallon and a half of blood in their body, between 4 and 12 pints. The amount of alcohol in a person's bloodstream that is over the legal limit is about the same amount of alcohol as found in a capful of mouthwash used after brushing your teeth.
We will be dealing with much smaller proportions later in the class, where there are things that can be hazardous to your health at ranges measured in parts per billion, but for now, we will look at the per 100,000 scale for another type of statistic, measurements of mortality rates.
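
The rescaling is just p*scale, which can be sketched in Python as below. The variable names are mine. For the ounce comparison I am assuming US measures, 128 fluid ounces to the gallon, and dividing 1 ounce by the 1,280 ounces in ten gallons.

```python
p = 0.08 / 100   # the legal limit, .08%, written as a plain proportion (.0008)

# Same proportion on scales of 100, 1,000, 10,000 and 100,000
for scale in (100, 1_000, 10_000, 100_000):
    print(f"{p * scale:g} out of {scale:,}")

# One ounce of alcohol in ten gallons of blood, per 100,000
ounces_in_ten_gallons = 10 * 128
ounce_rate = 1 / ounces_in_ten_gallons * 100_000
print(round(ounce_rate))   # 78
```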
Here are the numbers of homicides in some American cities in 2013.

Chicago: 415
New York City: 333
Detroit: 333
Washington, DC: 103
Oakland: 92

This makes it look like Chicago is most dangerous and Oakland is safest, but this doesn't take into account the populations of the cities. Here is the list with populations rounded to the nearest ten thousand.

Chicago: 415 murders, pop. 2,720,000
New York City: 333 murders, pop. 19,650,000
Detroit: 333 murders, pop. 700,000
Washington, DC: 103 murders, pop. 650,000
Oakland: 92 murders, pop. 390,000

Because the number of murders is much smaller than the populations, if we just divide we will get a very small decimal number. For example, in Chicago:

415/2720000 = 0.0001525735...

With death statistics, the scale of measurement is usually number per 100,000. To get that, we multiply the answer by the scale 100,000.

0.0001525735... * 100,000 = 15.25735...

Rounding to the nearest tenth, we would say the murder rate per 100,000 people in Chicago last year was 15.3. Here are all the numbers, rounded to the nearest tenth.

Chicago: 415 murders, pop. 2,720,000, rate 15.3 per 100,000
New York City: 333 murders, pop. 19,650,000, rate 1.7 per 100,000
Detroit: 333 murders, pop. 700,000, rate 47.6 per 100,000
Washington, DC: 103 murders, pop. 650,000, rate 15.8 per 100,000
Oakland: 92 murders, pop. 390,000, rate 23.6 per 100,000

Looking at the numbers on this scale, it is clear that Detroit has a much higher murder rate than the other cities listed. Oakland has the second highest murder rate. Chicago and DC are very similar, and New York City's rate is very low.
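
For anyone who wants to check the rates, here is a minimal sketch in Python using the murder counts and rounded populations exactly as listed in these notes. The names cities, SCALE and rates are mine.

```python
# (murders, population) as listed in the notes
cities = {
    "Chicago": (415, 2_720_000),
    "New York City": (333, 19_650_000),
    "Detroit": (333, 700_000),
    "Washington, DC": (103, 650_000),
    "Oakland": (92, 390_000),
}

SCALE = 100_000   # death statistics are usually reported per 100,000

# rate = (murders / population) * scale, rounded to the nearest tenth
rates = {
    city: round(murders / pop * SCALE, 1)
    for city, (murders, pop) in cities.items()
}

# Print from highest rate to lowest
for city, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{city}: {rate} per 100,000")
```

Sorting by rate instead of by raw count puts Detroit first and New York City last, which is the point of putting everything on the same scale.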

Wednesday, January 29, 2014