Thursday, January 29, 2009

Class notes for 1/28

More on frequency tables: When we have a lot of duplicate values in a data set, a frequency table lets us write that data more compactly, by listing each value x, followed by the positive number of times it shows up, f(x). Let's look at the data list from the 1/26 notes, the first quiz scores from one of the two classes this semester.

x || f(x)

20 || 2
19 || 1
18 || 2
17 || 5
16 || 7
15 || 2
14 || 5
13 || 5
12 || 3
11 || 5
_9 || 2
_8 || 1
_7 || 2
_4 || 3


We already discussed how to find the mode and median, but what about the mean? We need n, the number of values, and we need the sum of all the values. From a frequency table, n = sum(f(x)) and the sum of all the values is sum(x*f(x)). Here is the table again with another column of numbers added, each value multiplied by its frequency.

x || f(x) || x*f(x)

20 || 2 || 40
19 || 1 || 19
18 || 2 || 36
17 || 5 || 85
16 || 7 ||112
15 || 2 || 30
14 || 5 || 70
13 || 5 || 65
12 || 3 || 36
11 || 5 || 55
_9 || 2 || 18
_8 || 1 || _8
_7 || 2 || 14
_4 || 3 || 12

The sum of the second column is 45 and the sum of the third column is 600, so x-bar = 600/45 = 13.333..., or 13.3 rounded to the nearest tenth.
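
As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation. The names freq, n, total and mean are just labels chosen for this example. The numbers are the ones in the table above.

```python
# Frequency table from the notes: value x -> frequency f(x)
freq = {20: 2, 19: 1, 18: 2, 17: 5, 16: 7, 15: 2, 14: 5,
        13: 5, 12: 3, 11: 5, 9: 2, 8: 1, 7: 2, 4: 3}

n = sum(freq.values())                        # sum of the f(x) column
total = sum(x * f for x, f in freq.items())   # sum of the x*f(x) column
mean = total / n                              # x-bar

print(n, total, round(mean, 1))   # 45 600 13.3
```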


The ideas of John Tukey. John Tukey was a statistician and computer scientist who did a lot of his best work back in the 1960s. Many of his ideas in statistics are about easier and shorter ways to present data sets. We will study three of them: the stem and leaf plot, the five number summary, and its graphical representation, the box and whisker plot.

Stem and leaf plots. If we don't have a lot of duplication, a frequency table is not going to make presenting data any shorter. What Tukey did was split numbers into two parts, the stem and the leaf. The standard way to do this is to make the last digit the leaf and the rest of the number the stem. For example, the number 87 would have stem = 8 and leaf = 7. Tukey's idea was to put the stem numbers at the left and list the leaves from low to high, using a mono-spaced font, which back in the day of typewriters was the only choice available for someone typing. (Courier is a mono-spaced font. The letter i is the same width as the letter w or any other symbol. Most fonts used today aren't mono-spaced anymore, but in this situation, we want to use Courier.)

Here is a list of numbers. It is the number of wins by teams in the American League at the end of the 2008 regular season.

97, 95, 89, 86, 68, 89, 88, 81, 75, 74, 100, 79, 75, 61

The highest number is 100, so we make the stem 10 and the leaf 0. All the numbers on the list are put in stem and leaf form, where the stem is underlined and followed by a |, then the leaves are listed from low to high from left to right.

____
10 | 0
_9 | 57
_8 | 16899
_7 | 4559
_6 | 18
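
For anyone who wants to build this plot on a computer, here is a rough Python sketch of the method just described. The function name stem_and_leaf is made up for this example, and it pads short stems with a space where the plot above uses an underscore.

```python
from collections import defaultdict

# American League wins, 2008 regular season (the list from the notes)
wins = [97, 95, 89, 86, 68, 89, 88, 81, 75, 74, 100, 79, 75, 61]

def stem_and_leaf(values):
    """Last digit is the leaf; the rest of the number is the stem."""
    leaves_by_stem = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Stems from high to low (empty stems included), leaves from low to high
    for stem in range(max(leaves_by_stem), min(leaves_by_stem) - 1, -1):
        leaves = "".join(str(d) for d in sorted(leaves_by_stem[stem]))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

print("\n".join(stem_and_leaf(wins)))
```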


Sometimes data is too spread out to use the last digit as the leaf. Instead, the last two digits become the leaf, and we put a space between the two digit leaves for readability. Here is an example, starting with what happens if we keep one digit leaves. This data set is the number of points scored by each team in the AFC in the 2008 regular season.

Data set: 347, 448, 388, 440, 234, 298, 394, 367, 223, 244, 364, 350, 342, 356, 309, 317

44 | 08
43 |
42 |
41 |
40 |
39 | 4
38 | 8
37 |
36 | 47
35 | 06
34 | 27
33 |
32 |
31 | 7
30 | 9
29 | 8
28 |
27 |
26 |
25 |
24 | 4
23 | 4
22 | 3

If each stem is a span of ten values, this data is too spread out, but if the stems go from 200 to 299, 300 to 399 and 400 to 499, the stem and leaf method works very well.

___
4 | 40 48
3 | 09 17 42 47 50 56 64 67 88 94
2 | 23 34 44 98

Now the data is in a more compact list, and the big idea of stem and leaf, which is not only to give the values but to give the shape of the data, becomes clear. Only two values are in the 400-499 range, most (ten of the sixteen) are between 300 and 399, and four low values are in the 200s.
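
The choice between one digit and two digit leaves can be sketched in Python with a single setting. The parameter name leaf_digits and the function name are assumptions of this sketch, not standard names.

```python
# AFC points scored, 2008 regular season (the data set from the notes)
points = [347, 448, 388, 440, 234, 298, 394, 367,
          223, 244, 364, 350, 342, 356, 309, 317]

def stem_and_leaf(values, leaf_digits=1):
    """The last `leaf_digits` digits are the leaf; the rest is the stem."""
    base = 10 ** leaf_digits
    stems = {}
    for v in values:
        stems.setdefault(v // base, []).append(v % base)
    # Leaves wider than one digit are separated by a space for readability
    sep = " " if leaf_digits > 1 else ""
    lines = []
    for stem in range(max(stems), min(stems) - 1, -1):
        leaves = sep.join(f"{leaf:0{leaf_digits}d}"
                          for leaf in sorted(stems.get(stem, [])))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

for line in stem_and_leaf(points, leaf_digits=2):
    print(line)
```

Calling stem_and_leaf(points) with the default of one leaf digit gives the long 23 row version instead, which is a quick way to see whether a data set is too spread out for one digit leaves.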

It's possible to have the opposite problem where a stem has too many leaves. Here is the stem and leaf version of the data set we represented with a frequency table earlier.

___
2 | 00
1 | 11111222333334444455666666677777889
0 | 44477899

Obviously, the vast majority of the data is between 10 and 19. It makes sense to make more stems by splitting each category in half, going from 0 to 4, 5 to 9, 10 to 14, 15 to 19 and 20 to 24. Since there is no data at 25 or above, we don't have to create an empty stem from 25 to 29.


___
2 | 00
1 | 55666666677777889
1 | 111112223333344444
0 | 77899
0 | 444

This shows us that the most common values were between 10 and 14, with 15 to 19 almost as common, with the frequency trailing off as values get above 19 or under 10.
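
One way to sketch the split in Python is to group the values in blocks of five instead of ten. The frequency table is copied from earlier in these notes, and the function name split_stem_and_leaf is hypothetical.

```python
# Quiz scores, rebuilt from the frequency table: value x -> frequency f(x)
freq = {20: 2, 19: 1, 18: 2, 17: 5, 16: 7, 15: 2, 14: 5,
        13: 5, 12: 3, 11: 5, 9: 2, 8: 1, 7: 2, 4: 3}
scores = [x for x, f in freq.items() for _ in range(f)]

def split_stem_and_leaf(values):
    """Two rows per stem: leaves 5 to 9 on top, leaves 0 to 4 below."""
    rows = {}
    for v in values:
        # v // 5 numbers the half stems: 0-4 is 0, 5-9 is 1, 10-14 is 2, ...
        rows.setdefault(v // 5, []).append(v % 10)
    lines = []
    for half in range(max(rows), min(rows) - 1, -1):
        leaves = "".join(str(d) for d in sorted(rows.get(half, [])))
        lines.append(f"{half // 2} | {leaves}".rstrip())
    return lines

print("\n".join(split_stem_and_leaf(scores)))
```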


Five number summary. Both frequency tables and stem and leaf plots tell us about all the data. The five number summary, yet another idea of John Tukey, gives us an idea of how the data is distributed, but leaves a lot of information out. The five numbers are the highest value, the lowest value, and three intermediate numbers, Q1, Q2 and Q3. Q stands for quartile, so these numbers are respectively the cut-off points for the bottom 25% of the data, the bottom 50% of the data and the bottom 75% of the data. We already know Q2 by another name, the median. Once we split a set of data at the median, we have two new subsets, the bottom half and the top half. Q1 is the median of the bottom half, while Q3 is the median of the top half.

Let's do an example with a set of data we have already dealt with, the points scored by AFC teams. First, here is the list of 16 numbers put in order from high to low. The middle two values are 350 and 347.

448, 440, 394, 388, 367, 364, 356, 350, 347, 342, 317, 309, 298, 244, 234, 223

The median, or Q2, is (350+347)/2 = 348.5.

Top half: 448, 440, 394, 388, 367, 364, 356, 350

The median of the top half, or Q3, is (388+367)/2 = 377.5.

Bottom half: 347, 342, 317, 309, 298, 244, 234, 223

The median of the bottom half, or Q1, is (309+298)/2 = 303.5.

In order, the five number summary is:

High: 448
Q3: 377.5
Q2: 348.5
Q1: 303.5
Low: 223
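
Here is a minimal Python sketch of the median of halves method used above. It assumes that when the number of values is odd, the median itself is left out of both halves. Statistical software often uses other quartile rules and can give slightly different values for Q1 and Q3. The function name five_number_summary is hypothetical.

```python
from statistics import median

# AFC points scored, 2008 regular season (the data set from the notes)
points = [347, 448, 388, 440, 234, 298, 394, 367,
          223, 244, 364, 350, 342, 356, 309, 317]

def five_number_summary(values):
    data = sorted(values)            # low to high
    n = len(data)
    half = n // 2
    bottom = data[:half]
    # Assumption: with an odd count, skip the median itself
    top = data[half + 1:] if n % 2 else data[half:]
    return {
        "High": data[-1],
        "Q3": median(top),
        "Q2": median(data),
        "Q1": median(bottom),
        "Low": data[0],
    }

for name, value in five_number_summary(points).items():
    print(f"{name}: {value}")
```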

Box and whisker plots: The box and whisker plot is a graphical representation of the five number summary. Each of the five points is plotted against a number line for scale. The area between Q1 and Q3 is the box, and it represents the middle 50% of the data. The whiskers are lines from the edges of the box to the high and low values. Q2 is shown by a dotted line somewhere inside the box, though not necessarily exactly in the middle.

Here is a representation of the five number summary above as a box and whisker plot. It could either be on a vertical scale, as pictured here, or it could be represented on a horizontal scale. When financial data from a set of time periods is summarized, it is often drawn in a very similar way, though financial websites call these plots candlesticks and use the box to mark the opening and closing prices rather than Q1 and Q3. The vertical scale represents the value in currency (usually dollars), while the horizontal scale represents time, where the time periods can be any standardized amount, whether day to day, week to week or month to month.
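
Since the plot is just the five numbers placed on a scale, a rough text version of the horizontal form can be sketched in Python. This is only an illustration, not a standard plotting routine. The function name and the symbols are chosen for this sketch: | marks the high and low values, - draws the whiskers, = fills the box and : marks Q2. It assumes the high value is greater than the low value.

```python
def box_plot_line(low, q1, q2, q3, high, width=60):
    """One line of text, scaled so low is the first column and high the last."""
    def col(v):
        # Column position of value v on the scale (assumes high > low)
        return round((v - low) / (high - low) * (width - 1))

    chars = [" "] * width
    for i in range(col(low), col(high) + 1):
        chars[i] = "-"                       # whiskers
    for i in range(col(q1), col(q3) + 1):
        chars[i] = "="                       # box: the middle 50% of the data
    chars[col(low)] = chars[col(high)] = "|" # low and high values
    chars[col(q2)] = ":"                     # median
    return "".join(chars)

# Five number summary of the AFC points, from the notes above
print(box_plot_line(223, 303.5, 348.5, 377.5, 448, width=46))
```

With width=46, each column stands for five points on the scale from 223 to 448.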
