Saturday, February 8, 2014

Notes for 4 February 2014

Two different versions of standard deviation of a numerical data set, sx and sigmax

The two formulas here are both standard deviations for a numerical data set. When the data set is a sample, we use sx. When the data set is a population, we use sigmax. The two values differ because the first formula divides by n-1 and the second divides by N. The reason for the n-1 is degrees of freedom.
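To see the difference between the two denominators, here is a small Python sketch. The data set is made up purely for illustration, and the function names sample_sd and population_sd are my own labels, not anything from the calculator or the textbook.

```python
import math
import statistics

# Hypothetical data set, made up purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def sample_sd(values):
    """Standard deviation with n - 1 in the denominator (sx)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - 1))

def population_sd(values):
    """Standard deviation with N in the denominator (sigmax)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_diffs / n)

print("sx     =", round(sample_sd(data), 3))      # 2.138
print("sigmax =", round(population_sd(data), 3))  # 2.0

# Python's statistics module has the same two versions built in.
print(statistics.stdev(data), statistics.pstdev(data))
```

Dividing by the smaller number n-1 always makes sx a little larger than sigmax for the same data.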

The idea of degrees of freedom is to count how much information you need to get an answer. For example, if a football game ends in regulation we know that

final score = score in 1st + score in 2nd + score in 3rd + score in 4th

There are five pieces of information, but if you have any four of these numbers, you can find the fifth. Here we would say the degrees of freedom are 4, which is 5-1. For example, if we know the Seahawks scored 43 points total in the Super Bowl, and they scored 8 points in the first quarter, 14 points in the second quarter and 14 points in the third, we can figure out how much they scored in the 4th quarter without being told.

score in 4th = final score - score in 1st - score in 2nd - score in 3rd = 43 - 8 - 14 - 14 = 7

This is a situation where the degrees of freedom are n-1, just like with the standard deviation of the sample. The idea is that if we know the average and somehow we get all the scores except for one, we can find the last score by multiplying the average by n then subtracting all the scores we know to find the one score we don't know.
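The "multiply the average by n, then subtract" step can be sketched in Python. The function name missing_value is hypothetical, and I am reusing the Seahawks numbers from above, treating the four quarters as the data set.

```python
def missing_value(average, n, known_values):
    """Recover the one unknown value from the average of n values
    and the n - 1 values that are already known."""
    return average * n - sum(known_values)

# Seahawks quarters from the example: 43 points total over 4 quarters.
average_per_quarter = 43 / 4   # 10.75
known_quarters = [8, 14, 14]   # 1st, 2nd and 3rd quarters

fourth_quarter = missing_value(average_per_quarter, 4, known_quarters)
print(fourth_quarter)  # 7.0
```

Once the average is fixed, only n-1 of the values are free to vary, which is where the n-1 comes from.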

A new way to measure data: z-scores
 
We can compare two sets of data against one another using the averages (x-bar or mux) and the standard deviations (sx or sigmax) with a formula known as the z-score. We subtract the average from the raw score x. If x is above average, we will get a positive number. If x is below average, we will get a negative number. (If x is exactly at the average, we'll get zero.) We then divide by the standard deviation to get the z-score. This tells us how many standard deviations we are away from average, either high or low.

Here is an example using the American and National League final standings last year. The data set we will check out is the number of wins. I will treat both leagues as samples of all of Major League Baseball. The average number of wins is different in the two leagues because of inter-league play.



American League:
x-bar = 81.3
sx = 13.7

National League:
x-bar = 80.7
sx = 11.1

The AL had slightly more wins on average, but its standard deviation is higher because the data is more spread out, largely because the Houston Astros were so terrible. In any data set, we can use z-scores to find outliers.

z is greater than 3: The score is very unusually high
z is greater than 2: The score is unusually high
z is less than -3: The score is very unusually low
z is less than -2: The score is unusually low

The Astros had 51 wins, so their z-score is (51-81.3)/13.7 = -2.21. They are the only team in either league that can be considered an outlier, either high or low. 
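Here is a short Python sketch of the outlier check, using the cutoffs of 2 and 3 from the list above and the American League numbers. The function names z_score and describe are my own, chosen just for this example.

```python
def z_score(x, mean, sd):
    """How many standard deviations x is above (+) or below (-) the mean."""
    return (x - mean) / sd

def describe(z):
    """Label a z-score using the cutoffs from the notes."""
    if z > 3:
        return "very unusually high"
    if z > 2:
        return "unusually high"
    if z < -3:
        return "very unusually low"
    if z < -2:
        return "unusually low"
    return "not unusual"

# American League: x-bar = 81.3, sx = 13.7; the Astros had 51 wins.
astros_z = z_score(51, 81.3, 13.7)
print(round(astros_z, 2), describe(astros_z))  # -2.21 unusually low
```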

We can also compare teams in the different leagues. For example, both the A's and the Braves had 96 wins, but because they are in different leagues they won't have the same z-score.

z-score for A's: (96-81.3)/13.7 = 1.07
z-score for Braves: (96-80.7)/11.1 = 1.38
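The same comparison can be done in Python by keeping each league's average and standard deviation together. The dictionary layout and the name league_z are just one way to set this up, not the only one.

```python
# (x-bar, sx) for wins in each league, taken from the numbers above.
leagues = {
    "AL": (81.3, 13.7),
    "NL": (80.7, 11.1),
}

def league_z(wins, league):
    """z-score of a win total measured against its own league."""
    mean, sd = leagues[league]
    return (wins - mean) / sd

athletics_z = league_z(96, "AL")
braves_z = league_z(96, "NL")

print(round(athletics_z, 2))  # 1.07
print(round(braves_z, 2))     # 1.38
```

The raw score is the same 96 wins both times. Only the league it is measured against changes.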

What this would say is that it was more impressive for the Braves to win 96 than for the A's. A big part of this is that the Athletics played 19 games against the Astros, winning 15 and losing 4. The Braves never played the Astros, so they got the same number of wins against a tougher set of opponents, which helps explain their higher z-score.

Z-scores in normally distributed sets

Not every set can be assumed to be normally distributed. Usually, we will be told a set is normally distributed and be given both mux and sigmax. If we have such a set, we can take a raw score and turn it into a percentile using the look-up table we got on the orange hand-out on Tuesday. Let's take an example.

It is assumed that IQ scores are normally distributed, where the average is 100 and the standard deviation is 15. This would say an IQ of 110 has a z-score of (110-100)/15 = 0.67. Because of our assumption, we can use the look-up table to find the percentile for an IQ of 110.

1. Use the Positive z-score side
2. Look in the row next to the 0.6 label in bold
3. Look in the column labeled 0.07.

The value at that position on the table is .7486. What this means is that 74.86% of the population has an IQ of 110 or less. If we subtract 74.86% from 100%, we get 25.14%, which is the part of the population with an IQ above 110. Rounding to the nearest percent, this means the 110 IQ is at about the cutoff for the 75th percentile.
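If you want to check a table look-up without the hand-out, Python's statistics module (version 3.8 or later) has a NormalDist class that does the same job. This is only a sketch for checking answers, under the same assumption that IQ is normally distributed with average 100 and standard deviation 15.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Same steps as the table: round the z-score to two places, then look it up.
z = round((110 - 100) / 15, 2)             # 0.67
table_value = round(standard_normal.cdf(z), 4)
print(z, table_value)                      # 0.67 0.7486

# Skipping the rounding step gives a slightly lower answer,
# because the exact z-score is 0.6666... instead of 0.67.
iq = NormalDist(mu=100, sigma=15)
exact_value = iq.cdf(110)
print(round(exact_value, 3))

# Percent of the population above an IQ of 110, using the table value.
percent_above = round((1 - table_value) * 100, 2)
print(percent_above)                       # 25.14
```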

We will discuss this in greater detail on Tuesday, February 11.
