Monday, February 2, 2009

Class notes for 2/2

Graphical representation of data

Frequency tables and dotplots: A frequency table is a list of values, which we will call x, followed by the number of times each value shows up on the list, which is called the frequency, denoted by f or f(x). A dotplot is a graph that represents a frequency table. For example, let's take the frequency table from the last class notes and turn it into a dotplot with a vertical scale.

x || f(x)

20 || 2
19 || 1
18 || 2
17 || 5
16 || 7
15 || 2
14 || 5
13 || 5
12 || 3
11 || 5
_9 || 2
_8 || 1
_7 || 2
_4 || 3


For the dotplot version, I type in the information in the Courier font, which is mono-spaced. This way the same number of characters on a line will always take up the same width.

20| **
19| *
18| **
17| *****
16| *******
15| **
14| *****
13| *****
12| ***
11| *****
10|
9 | **
8 | *
7 | **
6 |
5 |
4 | ***
3 |
2 |
1 |

Notice that every value from 1 to 20 is included in the scale, even the ones that did not appear on the list. This is so the dotplot will be correctly spaced.
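The same dotplot can be produced by a short program instead of typing it by hand. The Python below is only a sketch: the function name dotplot_lines and its low and high arguments are made up for this illustration, and the frequencies are copied from the table above.

```python
# Frequency table from the notes: value x -> frequency f(x).
freq = {20: 2, 19: 1, 18: 2, 17: 5, 16: 7, 15: 2, 14: 5,
        13: 5, 12: 3, 11: 5, 9: 2, 8: 1, 7: 2, 4: 3}


def dotplot_lines(freq, low, high):
    """Return one text row per value from high down to low.

    Values that never appear in the table still get a row (with no
    stars), so the vertical scale stays evenly spaced.
    """
    width = len(str(high))  # pad labels so the bars line up
    rows = []
    for x in range(high, low - 1, -1):
        stars = "*" * freq.get(x, 0)
        rows.append(f"{x:<{width}}| {stars}".rstrip())
    return rows


for row in dotplot_lines(freq, 1, 20):
    print(row)
```

Using freq.get(x, 0) is what fills in the empty rows for 10, 6, 5, 3, 2 and 1.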

The five number summary, box and whiskers, IQR and outlying data: Examples of five number summaries and box and whisker plots can be seen in the previous class notes, but John Tukey added one more idea to the box and whisker definition, which would take into account the idea of outlying data, values that are far larger or far smaller than most of the other data. The question is how to define "far larger" and "far smaller", and we will see that in all the different systems we use, the cut-off point for these decisions is arbitrary. Tukey's arbitrary decision involves a number called the Inter-quartile Range or IQR for short. The formula is as follows:

IQR = Q3 - Q1

This is the distance between the values that define the cut-off points for the 75th percentile and the 25th percentile, so the middle half of the data lies inside that range. That is represented by the box, and Tukey decided that neither the left nor the right whisker should ever be more than one and a half times as long as the width of the box. If there is any data that is higher than Q3 + 1.5*IQR, it is represented by a dot. Likewise, any data that is lower than Q1 - 1.5*IQR is represented by a dot. The whiskers only extend to the highest data less than the high outlier threshold or the lowest data more than the low outlier threshold.

Example: Here is the number of times each of the thirteen most popular stories on the Huffington Post website was read today, where the scale is in thousands of readers.

759 165 162 152 132 129 128 118 113 65 55 44 35

Obviously, the most popular story* is being read by more people than the next five stories combined, so we would expect it to be outlying data. Let's create the five number summary, the IQR and find the outlier thresholds.

High: 759
Q3: 157 (The average of 162 and 152)
Q2: 128 (The middle value, seventh of the thirteen)
Q1: 60 (The average of 65 and 55)
Low: 35

IQR = 157 - 60 = 97

High outlier threshold = Q3 + 1.5*IQR = 157 + 1.5*97 = 157 + 145.5 = 302.5
Low outlier threshold = Q1 - 1.5*IQR = 60 - 1.5*97 = 60 - 145.5 = -85.5
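
The arithmetic above can be checked with a few lines of Python. This is a sketch, not the only way to do it: the function name tukey_fences is made up for this illustration, and it assumes Q1 and Q3 are found as the medians of the lower and upper halves of the sorted list, leaving the middle value out when the count is odd. Other quartile conventions can give slightly different answers.

```python
from statistics import median

# Readership counts (in thousands), exactly as listed in the notes.
reads = [759, 165, 162, 152, 132, 129, 128, 118, 113, 65, 55, 44, 35]


def tukey_fences(values):
    """Return Q1, Q3, IQR and the low and high outlier thresholds.

    Assumption: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data (middle value excluded for odd counts).
    """
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


q1, q3, iqr, low_fence, high_fence = tukey_fences(reads)
outliers = [x for x in reads if x < low_fence or x > high_fence]
inside = [x for x in reads if low_fence <= x <= high_fence]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("thresholds:", low_fence, high_fence)
print("outliers:", outliers)
print("whiskers reach:", min(inside), "and", max(inside))
```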

As we might expect, there is no low outlying data, but the highest value of 759 is much higher than any of the rest of the data. In the plot, the upper whisker extends only to 165, which is the highest non-outlying value, while the outlier is represented by a dot. Here is a horizontal box and whisker plot of this data.



(* The most popular story the day this data was taken was a set of pictures of Olympic champion Michael Phelps smoking a bong in the dorm room of a friend.)

Rounding: There are several different rules for rounding values, depending on what the data looks like. In general, when taking data from a list and finding the average, we round to one more decimal place than the data has. If numbers are given as whole numbers, we round to the nearest tenth. If numbers are given to the nearest penny, we round the average to the nearest tenth of a penny.
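
As a small illustration of the "one more decimal place" rule, here is a sketch in Python. The three scores are hypothetical values invented for this example, not data from the class.

```python
from statistics import mean

# Hypothetical whole-number data, not taken from the notes.
scores = [4, 7, 8]

average = mean(scores)       # 19 / 3 = 6.333...
rounded = round(average, 1)  # data are whole numbers, so keep one decimal place
print(rounded)
```

One caution: Python's round() works on binary floating point values and rounds exact ties to the nearest even digit, so it may occasionally disagree with the round-half-up rule taught in class.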

I decided to round the numbers I gathered from the website to the nearest thousand and then write them in thousands. For this decision, I used the significant digit method. If you report too many digits, people can get caught up in the precision of the number instead of its general size. For big numbers reported in newspapers and magazines, most writers use one, two or three significant digits, which is to say the number of leading digits that are reported before the rest of the number is rounded to zeroes.

Here are some examples. Let's say that the bailout of the banks cost $763,922,575,873.22. That's too much information, and who knows if being precise makes any sense, since the final numbers might change.

One significant digit: The bank bailout costs $800 billion.
Two significant digits: The bank bailout costs $760 billion.
Three significant digits: The bank bailout costs $764 billion.

All of these answers are correct, depending on how many significant digits we decide to use. The more significant digits, the more precise the answer is, but using more than three or so significant digits for data presented to the public is probably too much information. If we are using numbers in calculations, we usually want to be as precise as possible, and only do the rounding at the end, when the number is presented as the answer.
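
The three bailout figures above can be reproduced with a short Python sketch. The helper name round_sig is made up for this illustration. It finds the position of the leading digit with a base-10 logarithm and then rounds at the matching place.

```python
from math import floor, log10


def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    exponent = floor(log10(abs(value)))  # position of the leading digit
    return round(value, digits - 1 - exponent)


cost = 763_922_575_873.22
for digits in (1, 2, 3):
    billions = round_sig(cost, digits) / 1e9
    print(digits, f"${billions:.0f} billion")
```

Note that the full-precision value of cost is kept throughout and the rounding happens only when the number is printed, which matches the advice to round at the end.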

Order of operations: You may have learned the order of operations by remembering the word PEMDAS or the mnemonic device:

Please (parentheses)
Excuse (exponents)
My Dear (multiplication and division)
Aunt Sally (addition and subtraction)

With calculators like the TI-30X IIS or the TI-83 or TI-84, the display screen has a line which shows you the expression as you press the keys, so you can enter an entire expression just once and then press the [ENTER] key. The calculator understands the order of operations rules, but sometimes you still have to be careful how you type your formulas in.

A fraction is really two sets of parentheses. You might remember the formula for the slope of a line in algebra class as being the difference of the y values divided by the difference of the x values. Let's say the two points were (8, 7) and (4, 2). Because of the fraction sign we want to do 7-2 = 5 and 8-4 = 4 BEFORE we do the division, so to make that clear to the calculator, we have to put parentheses around the numerator, press the division key, put parentheses around the denominator, then press [ENTER]. In other words, we type (7-2)/(8-4).
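
Programming languages follow the same order of operations, so the effect of the parentheses can be seen in a short Python sketch. The variable names are chosen just for this illustration.

```python
# The two points from the slope example: (8, 7) and (4, 2).
x1, y1 = 8, 7
x2, y2 = 4, 2

# Parentheses force both subtractions to happen before the division.
slope = (y1 - y2) / (x1 - x2)   # 5 / 4

# Without parentheses the division happens first: 7 - (2 / 8) - 4.
no_parens = y1 - y2 / x1 - x2

print(slope, no_parens)
```

The two results differ, which is exactly the mistake the parentheses prevent.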

On your calculator, the square root is followed by an open parenthesis. In some texts, a formula that includes a square root might show two numbers being multiplied together by putting the numbers in parentheses and next to each other instead of using the multiplication sign. That can be a problem on the calculator, because the square root starts its own set of parentheses. Instead of typing (.3)(.7), on the calculator you should type .3*.7. The instructions for different equations are shown below.
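
Separately from the calculator keystrokes, the same habit of writing the multiplication sign explicitly applies when the calculation is done in a program. Here is a minimal Python sketch using the .3 and .7 from the paragraph above.

```python
from math import sqrt

# Explicit multiplication inside a single pair of parentheses.
result = sqrt(0.3 * 0.7)
print(round(result, 4))
```

Python will not accept (.3)(.7) as a product either. It reads the second pair of parentheses as an attempt to call .3 as a function and reports an error, so the * is required.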

On most calculators, you don't have to close every set of parentheses. The only calculator I know that is a stickler for parentheses agreeing with one another left and right is the TI-89. On the other hand, if you use a spreadsheet program, it will always complain if your left and right parentheses don't agree.

