Tuesday, January 27, 2009

Class notes for 1/26

In earlier classes, we discussed three measures of center.

Mean (average)
Type of data: numerical
Method: add up all the numbers and divide by n (or N), the number of things on the list.

Median
Type of data: numerical or ordered categorical
Method: Put the values in order and find the "middle value", which is the value in position (n+1)/2. If n is odd, there is a single median value on the list. If n is even, (n+1)/2 falls halfway between two positions, so there are two middle values on the list; if the data is numerical, take the average of the two. If the data is categorical and the two values aren't the same, the median lies between two categories.

Mode
Type of data: any data can be used, but only if there are duplicate values on the list.
Method: Find the most common value. If there is a tie for most common, there can be more than one mode.
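As a quick check on the three methods above, here is a minimal sketch using Python's standard statistics module. The list of scores is made up for illustration and is not from class.

```python
from statistics import mean, median, multimode

# Hypothetical list of quiz scores, invented for this example.
scores = [7, 9, 9, 10, 12, 15]

n = len(scores)                   # n = 6, so (n+1)/2 = 3.5: two middle values
average = mean(scores)            # add up all the numbers and divide by n
middle = median(scores)           # n is even: average of the 3rd and 4th values
most_common = multimode(scores)   # list of every value tied for most common

print(average, middle, most_common)
```

Here multimode returns a list because a tie for most common gives more than one mode.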

A fourth measure of center was introduced, the mid-range.

Mid-range
Type of data: numerical
Method: (high + low)/2

Sensitivity to outliers: Mid-range is especially sensitive to outlying values, meaning values much higher or much lower than the rest of the data. Mean is also sensitive, but not as sensitive. Median and mode are essentially unaffected by a few outliers.
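To see the difference in sensitivity, here is a small sketch that computes all four measures on a made-up list and then again after the highest value is replaced by an outlier. The data and the helper name mid_range are assumptions for illustration only.

```python
from statistics import mean, median, mode

def mid_range(values):
    """Mid-range as defined above: (high + low) / 2."""
    return (max(values) + min(values)) / 2

# Hypothetical data; the second list swaps the top value for an outlier.
data = [10, 11, 11, 12, 13]
with_outlier = [10, 11, 11, 12, 100]

for label, values in (("original", data), ("with outlier", with_outlier)):
    print(label, mean(values), median(values), mode(values), mid_range(values))
```

In this example the median and mode stay at 11 both times, the mean moves from 11.4 to 28.8, and the mid-range moves the most, from 11.5 to 55.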

When we should and shouldn't use average: Some types of numerical data do not give useful information when we take the average.

Coded numerical: Usually, with numbers there is a meaning we can give to the ideas of "more" and "less". With coded data, we don't necessarily have that. Examples are zip codes, social security numbers and driver's license numbers. Finding the average zip code of a group of people is meaningless, though the mode still tells us something: it is the most common zip code in the group.

Ordinal data: Here, the idea of a > b has meaning, but the distance between units isn't the same. In a ranking system, it's better to be first than it is to be second, but we can't say how much better, and we don't know if the difference between first and second is the same as the distance between second and third.

Often, when we switch from an ordered categorical system like grades (A, B, C, D, F) to the numbers used for grade points (4.0, 3.0, 2.0, 1.0, 0.0), the choice of what numbers to use is arbitrary. Is the distance from an A to a B really the same as the distance from a C to a D? Is getting an A in one class and a C in another really the same as getting two Bs, since both would be a 3.0 Grade Point Average (GPA)? How about 2 As and a D, which is 3.0, or 3 As and an F? Should all those situations be counted the same way?
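The arithmetic behind those GPA examples can be checked with a short sketch. It assumes every class carries the same number of credits, which the notes don't state; the function name gpa is made up for this example.

```python
# Grade points as listed above.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(grades):
    """Unweighted GPA: assumes every class counts equally (an assumption)."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

# Four very different report cards from the paragraph above.
for card in ("AC", "BB", "AAD", "AAAF"):
    print(card, gpa(card))
```

All four report cards come out to 3.0, which is exactly why the question of whether they should count the same is worth asking.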

Interval data: This is the minimum requirement needed for the mean to make sense: the idea that the distance between two numbers, a - b, has a consistent meaning, like degrees in temperature readings or the number of strokes taken to complete a round of golf. In these systems, the number zero does not mean the complete absence of a thing, so dividing one number by another from the data set doesn't give meaningful information. Taking an average, though, is about adding values together and dividing by the number of values, so an average temperature or an average of the scores in four rounds of golf does produce a useful statistic.

Rational data (often called ratio data): This is data where not only a - b means something, but also a/b. The difference between interval and rational data is the meaning of the number zero. If zero indicates the complete lack of a thing, then we can talk about something being twice as much as another thing, or 10% less. A lot of numerical systems of measurement are rational, but not all.
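One way to see why a/b is not meaningful for interval data is to write the same two temperatures in Celsius and in Fahrenheit. This is only a sketch; the two readings are made up, and c_to_f is a helper written for this example.

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Hypothetical readings: 10 and 20 degrees Celsius.
temps_c = [10.0, 20.0]
temps_f = [c_to_f(c) for c in temps_c]   # the same two days in Fahrenheit

mean_c = sum(temps_c) / len(temps_c)
mean_f = sum(temps_f) / len(temps_f)

# The average survives the change of units...
print(c_to_f(mean_c), mean_f)

# ...but the ratio a/b does not, because zero is not "no temperature".
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]
print(ratio_c, ratio_f)
```

The average is the same temperature in either scale, but "twice as hot" in Celsius is not twice as hot in Fahrenheit, so the ratio depends on an arbitrary choice of zero.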

===

Frequency tables

When we have a lot of repetition in a set of data, a way to write the information more compactly is a frequency table, where a value (either categorical or numerical) is followed by the number of times it shows up in a data set. Here is an example, where we will call the values x and their frequencies f(x).

x || f(x)

20 || 2
19 || 1
18 || 2
17 || 5
16 || 7
15 || 2
14 || 5
13 || 5
12 || 3
11 || 5
_9 || 2
_8 || 1
_7 || 2
_4 || 3

Because there is so much duplication, this is much easier to read than 20, 20, 19, 18, 18, 17, 17, 17, 17, 17, etc.

Finding the mode: Whichever value corresponds to the highest frequency is the mode. (Data with no mode would have all frequencies equal to 1, and it would not be a good candidate for being represented as a frequency table.) In the example above, the value 16 shows up seven times, more than any other, so it clearly is the mode.
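Here is a minimal sketch of that lookup, with the frequency table from the example stored as a Python dictionary. The variable names are chosen for this example only.

```python
# The frequency table from the example: value x -> frequency f(x).
freq = {20: 2, 19: 1, 18: 2, 17: 5, 16: 7, 15: 2, 14: 5,
        13: 5, 12: 3, 11: 5, 9: 2, 8: 1, 7: 2, 4: 3}

highest = max(freq.values())                           # the largest frequency
modes = [x for x, f in freq.items() if f == highest]   # every value that hits it

print(highest, modes)
```

The result is a list so that a tie for the highest frequency would show every mode.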

Finding the median: If we add up all the frequencies, we get n. We need to find (n+1)/2, and figure out which value is in that position. In the data above, n = 45, so (n+1)/2 = 23 and the value in position 23 is the median. We can mark the positions of the data on the list as follows.


x || f(x)

20 || 2 positions 1-2
19 || 1 position 3
18 || 2 positions 4-5
17 || 5 positions 6-10
16 || 7 positions 11-17
15 || 2 positions 18-19
14 || 5 positions 20-24
13 || 5
12 || 3
11 || 5
_9 || 2
_8 || 1
_7 || 2
_4 || 3

This means position 23 falls inside the run of values equal to 14 (positions 20-24), so 14 is the median.
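The position counting can be sketched in code by keeping a running total of the frequencies. The function names value_at and median_from_table are made up for this example, and the table is listed from high to low as it is in the notes.

```python
# The frequency table from the example, as (x, f(x)) pairs in the same order.
table = [(20, 2), (19, 1), (18, 2), (17, 5), (16, 7), (15, 2), (14, 5),
         (13, 5), (12, 3), (11, 5), (9, 2), (8, 1), (7, 2), (4, 3)]

def value_at(table, position):
    """Return the value at a 1-based position in the ordered data."""
    last = 0                      # last position covered so far
    for x, f in table:
        last += f                 # this value fills positions up to `last`
        if position <= last:
            return x
    raise IndexError("position is past the end of the data")

def median_from_table(table):
    """Median of numerical data given as an ordered frequency table."""
    n = sum(f for _, f in table)
    if n % 2 == 1:                # odd n: a single middle position (n+1)/2
        return value_at(table, (n + 1) // 2)
    # even n: average the two middle values
    return (value_at(table, n // 2) + value_at(table, n // 2 + 1)) / 2

print(median_from_table(table))
```

The even case isn't needed for this table, since n = 45, but it follows the rule for numerical data given earlier.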

Next class, we will show how to get an average using a frequency table.
