Wednesday, February 4, 2009

Class notes for 2/4

More on the five number summary and outlying data. Consider the number of wins for each team in the NFC at the end of the 2008 season. In order, the list looks like this.

12, 12, 11, 10, 9, 9, 9, 9, 9, 8, 8, 7, 6, 4, 2, 0

The five number summary is as follows.

High: 12
Q3 : 9.5
Q2 : 9
Q1 : 6.5
Low: 0
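As a rough check on the summary above, here is a minimal Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted list, which reproduces the numbers in these notes; other textbooks and software use slightly different quartile rules. The function names are made up for illustration.

```python
def median(sorted_values):
    """Median of an already sorted list: the middle value, or the mean of the two middle values."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (low, q1, q2, q3, high), assuming the median-of-halves rule for quartiles."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]         # bottom half of the data
    upper = data[(n + 1) // 2:]    # top half (the middle value is skipped when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]


nfc_wins = [12, 12, 11, 10, 9, 9, 9, 9, 9, 8, 8, 7, 6, 4, 2, 0]
low, q1, q2, q3, high = five_number_summary(nfc_wins)
print(low, q1, q2, q3, high)  # 0 6.5 9.0 9.5 12
```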

Now we check to see if any of the numbers are outliers.

IQR = 9.5 - 6.5 = 3
Q3 + 1.5*IQR = 9.5 + 4.5 = 14 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 6.5 - 4.5 = 2 (the value 0 is a low outlier; the value 2 sits exactly on the threshold, so it does not count as an outlier)

The Detroit Lions' total of no wins in 16 games was the low outlier in the NFC.
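The outlier check for the NFC can be sketched in a few lines of Python. The quartiles are typed in from the five number summary rather than recomputed, and the function names are hypothetical. The sketch assumes a value must be strictly beyond a threshold to count as an outlier, which is how the 2-win team is treated here.

```python
def outlier_fences(q1, q3):
    """Return (low_fence, high_fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return (low_outliers, high_outliers); values exactly on a fence are not outliers."""
    low_fence, high_fence = outlier_fences(q1, q3)
    low = [v for v in values if v < low_fence]
    high = [v for v in values if v > high_fence]
    return low, high


nfc_wins = [12, 12, 11, 10, 9, 9, 9, 9, 9, 8, 8, 7, 6, 4, 2, 0]
print(outlier_fences(6.5, 9.5))           # (2.0, 14.0)
print(find_outliers(nfc_wins, 6.5, 9.5))  # ([0], [])
```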

Here is the data for the AFC.

13, 12, 12, 11, 11, 11, 9, 8, 8, 8, 7, 5, 5, 4, 4, 2

The five number summary is as follows.

High: 13
Q3 : 11
Q2 : 8
Q1 : 5
Low: 2

Now we check to see if any of the numbers are outliers.

IQR = 11 - 5 = 6
Q3 + 1.5*IQR = 11 + 9 = 20 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 5 - 9 = -4 (no data lower than this, so no low outliers)

Here are the two conferences' box and whisker plots shown above and below a scale that goes from 0 to 13.





Parameters and statistics for categorical data. We have already defined some numbers that are associated with data sets, which we call parameters if the data set is a population and statistics if the data set is a sample. Statistics has the idea of reserved symbols, like N and n for the size of the data set (population and sample respectively) and mu_x and x-bar for the mean. With categorical data, the number of units that share a value for a given categorical variable is called the frequency of the value, denoted by F(value) in a population or f(value) in a sample. Two data sets with the same variable can be compared, but if the sizes of the sets are significantly different, it is fairer to compare the relative frequency, which is the frequency divided by the size of the data set, denoted by p in a population and p-hat in a sample.

For example, let's look at Data Set #2 and the variable of Major. Here are the frequencies of each value that has at least one subject.

f(AH) = 6
f(BF) = 13
f(CESM) = 2
f(HE) = 4
f(SS) = 4
f(Other) = 11
f(Und) = 2

To get the relative frequencies, we divide the frequencies by n, which in this case is 42. This gives us the p-hat values, which we will usually express as a percent. In the following equations, the symbol ~= means "approximately equal to".

p-hat(AH) = 6/42 ~= 14%
p-hat(BF) = 13/42 ~= 31%
p-hat(CESM) = 2/42 ~= 5%
p-hat(HE) = 4/42 ~= 10%
p-hat(SS) = 4/42 ~= 10%
p-hat(Other) = 11/42 ~= 26%
p-hat(Und) = 2/42 ~= 5%
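The same division can be sketched in Python. The dictionary name is made up for illustration, and the built-in round is enough here only because none of these percents lands exactly on a half.

```python
# Frequencies of each Major, copied from the list above.
major_counts = {"AH": 6, "BF": 13, "CESM": 2, "HE": 4, "SS": 4, "Other": 11, "Und": 2}

n = sum(major_counts.values())  # size of the sample

# Relative frequency of each value, rounded to the nearest whole percent.
percents = {major: round(100 * f / n) for major, f in major_counts.items()}

for major, pct in percents.items():
    print(f"p-hat({major}) ~= {pct}%")
print(f"n = {n}, total = {sum(percents.values())}%")
```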

The sum of all the relative frequencies should be 100%, but sometimes we get rounding error. In this case, the sum is 101%. When this happens, we round to the nearest tenth of a percent instead, hoping that the total will be either 99.9%, 100% or 100.1%. This is called the +/- .1% rule for rounding relative frequency.

p-hat(AH) = 6/42 ~= 14.3%
p-hat(BF) = 13/42 ~= 31.0%
p-hat(CESM) = 2/42 ~= 4.8%
p-hat(HE) = 4/42 ~= 9.5%
p-hat(SS) = 4/42 ~= 9.5%
p-hat(Other) = 11/42 ~= 26.2%
p-hat(Und) = 2/42 ~= 4.8%

The sum of the more closely rounded numbers is now 100.1%, which is close enough. Sets of fractions that add up to 1 are called probability distributions in general, and for some of them the rounded total will never be exactly 100% no matter how many digits we keep, which is why we settle for this "close enough" method. For instance, 1/3 + 1/3 + 1/3 = 1, but if we round 1/3 and then add the three copies up, we get 99% or 99.9% or 99.99%, etc., depending on how far we round. Because cases like this can show up, we decide on a fixed number we consider close enough, known in math as an epsilon value, and round all the numbers to the same number of digits after the decimal point, adding digits until the total is close enough.
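Here is one way the "round until close enough" idea might look in Python. This is only a sketch: the function names and the max_places cutoff are made up for illustration, epsilon is taken to be 0.1 percentage points, and halves are assumed to round up. Exact fractions and decimals are used so that the computer's own floating point error does not get mixed in with our rounding error.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction

EPSILON = Decimal("0.1")  # how far from 100% the total may be, in percentage points


def round_percent(fraction, places):
    """Turn a fraction into a percent rounded to `places` decimal places (halves round up)."""
    exact = Decimal(fraction.numerator * 100) / Decimal(fraction.denominator)
    return exact.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)


def round_until_close(fractions, max_places=6):
    """Add decimal places until the rounded percents total within EPSILON of 100."""
    for places in range(max_places + 1):
        rounded = [round_percent(f, places) for f in fractions]
        if abs(sum(rounded) - 100) <= EPSILON:
            return places, rounded
    raise ValueError("no precision up to max_places met the tolerance")


majors = [Fraction(f, 42) for f in (6, 13, 2, 4, 4, 11, 2)]
places, rounded = round_until_close(majors)
print(places, sum(rounded))  # 1 100.1

thirds = [Fraction(1, 3)] * 3
places3, rounded3 = round_until_close(thirds)
print(places3, sum(rounded3))  # 1 99.9
```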

Example of a worst case scenario: If we have 16 values for a variable and each value shows up exactly once, every number in our probability distribution is 1/16 = .0625 = 6.25%.

Rounded to nearest percent: 1/16 ~= 6%. 16*6% = 96%, way too low
Rounded to nearest tenth of a percent: 1/16 ~= 6.3%. 16*6.3% = 100.8%, too high
Rounded to nearest hundredth of a percent: 1/16 = 6.25%. Now the total is exactly 100%, since there is no rounding at all.
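The three totals above can be reproduced with a short Python sketch. It assumes halves round up, which is how 6.25% becomes 6.3%. Python's built-in round would give 6.2 here because it sends ties to the even digit, so the decimal module is used instead.

```python
from decimal import Decimal, ROUND_HALF_UP

share = Decimal(100) / Decimal(16)  # each value's relative frequency: exactly 6.25 percent

totals = []
for places in range(3):  # nearest percent, tenth of a percent, hundredth of a percent
    rounded = share.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)
    total = 16 * rounded
    totals.append(total)
    print(f"{places} decimal places: {rounded}% each, total {total}%")
```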
