Tuesday, January 21, 2014

Lecture notes for 21 Jan. 2014

Variables and values


Each element of a data set has different pieces of information that are collected. For example, let's say this is some of the information on a driver's license.

Height: 5'11"
Weight: 175 lbs.
Eyes: Brown
Hair: Black
Date of birth: 7/27/1985

The variables here are Height, Weight, Eyes, Hair and Date of birth. Height, Weight and Date of birth are numerical variables, since the answers are numbers. Eyes and Hair are categorical variables, since the answers are not numbers. "175 lbs." is the value associated with the variable Weight.
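As a rough sketch of the vocabulary, the license can be written as one record that maps each variable to its value. The names license_record, NUMERICAL and variable_type are made up for this illustration, and the numerical/categorical split is simply the one stated in these notes.

```python
# One element of a data set: the driver's license fields listed above.
# Each key is a variable, each entry is the value of that variable.
license_record = {
    "Height": "5'11\"",
    "Weight": "175 lbs.",
    "Eyes": "Brown",
    "Hair": "Black",
    "Date of birth": "7/27/1985",
}

# Classification copied from the notes (an assumption of this sketch,
# not something the code works out on its own).
NUMERICAL = {"Height", "Weight", "Date of birth"}


def variable_type(name):
    """Return 'numerical' or 'categorical' for a variable name."""
    return "numerical" if name in NUMERICAL else "categorical"


for variable, value in license_record.items():
    print(f"{variable}: value {value!r} is {variable_type(variable)}")
```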

Continuous variables versus discrete variables

Consider shoe sizes. In the United States, the sizes are ½ apart, so the count goes 5, 5½, 6, 6½, 7, 7½, 8, 8½, etc. What this means is that if you give me a shoe size, I can tell you the next highest and the next lowest. According to Wikipedia, the smallest shoe size is a 1, so for that size there is no next lowest, but the next highest is 1½. A situation like this means the variable is discrete.
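Here is a small sketch of what "next highest" and "next lowest" mean for shoe sizes. The function names are invented for the example, and the half-size step and the smallest size of 1 are taken from the paragraph above.

```python
from fractions import Fraction

STEP = Fraction(1, 2)    # U.S. sizes are half a size apart
SMALLEST = Fraction(1)   # smallest size mentioned in the notes


def next_highest(size):
    """Return the shoe size one step above the given size."""
    return Fraction(size) + STEP


def next_lowest(size):
    """Return the size one step below, or None at the smallest size."""
    size = Fraction(size)
    if size <= SMALLEST:
        return None
    return size - STEP


print(next_highest(7))   # 15/2, that is, size 7½
print(next_lowest(7))    # 13/2, that is, size 6½
print(next_lowest(1))    # None: nothing below the smallest size
```

Because every size has a definite neighbor, these functions can always give an answer (or say there is none). That is the mark of a discrete variable.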

Just because two people wear the same size shoe does not mean they have the same size feet. For example, if a man's foot is between 10.92" and 11.08" long, the most comfortable shoe should be a size 9. You cannot say that there is a size of foot that is "the next size up" or "the next size down" from any given foot. No matter how close two feet are in size, if they are not exactly the same, it should be possible to find a foot that is in between the two that are chosen. When that is the case, when there is no defined "next size" either up or down, we say a variable is continuous.
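The "in between" idea can be sketched in a few lines. The function name between is made up for this example, and exact fractions stand in for foot lengths so that rounding does not get in the way.

```python
from fractions import Fraction


def between(a, b):
    """Return a length strictly between two different lengths."""
    if a == b:
        raise ValueError("the two lengths must be different")
    return (a + b) / 2


# The two foot lengths (in inches) from the example above.
short_foot = Fraction("10.92")
long_foot = Fraction("11.08")

middle = between(short_foot, long_foot)
print(float(middle))                         # 11.0

# Repeat with the new, closer pair: there is still a length in between.
closer = between(short_foot, middle)
print(short_foot < closer < middle)          # True
```

The step can be repeated forever, which is another way of saying there is no "next size up" for a foot.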

Types of numerical data 
Coded numerical or nominal data: Usually, with numbers there is a meaning we can give to the ideas of "more" and "less". With coded data, we don't necessarily have that. Examples are zip codes, social security numbers and driver's license numbers. Finding the average zip code of a group of people is meaningless, as is finding the mid-range or the median. The mode, on the other hand, is the most common zip code in the group, and that does give valuable information.

Ordinal data: Here, the idea of a > b has meaning, but the distance between units isn't the same. In a ranking system, it's better to be first than it is to be second, but we can't say how much better, and we don't know if the difference between first and second is the same as the distance between second and third.

Often, when we switch from an ordered categorical system like grades (A, B, C, D, F) to the numbers used for grade points (4.0, 3.0, 2.0, 1.0, 0.0), the choice of what numbers to use is arbitrary. Is the distance from an A to a B really the same as the distance from a C to a D? Is getting an A in one class and a C in another really the same as getting two Bs, since both would be a 3.0 Grade Point Average (GPA)? How about 2 As and a D, which is 3.0, or 3 As and an F? Should all those situations be counted the same way?

The feeling of this instructor is that they should not. Again, like coded data, ordinal data can use some of the measures of center, like median and mode, but average and mid-range do not give useful information.
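The grade point arithmetic in those questions is easy to check. This is only a sketch: the name gpa is invented here, each class is assumed to carry the same weight, and the point values are the ones listed above.

```python
# Grade points as listed in the notes.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def gpa(grades):
    """Average the grade points for a string of letter grades."""
    points = [GRADE_POINTS[grade] for grade in grades]
    return sum(points) / len(points)


# The four situations from the questions above.
for grades in ("AC", "BB", "AAD", "AAAF"):
    print(grades, gpa(grades))   # every one of them comes out to 3.0
```

Four quite different report cards collapse to the same number, which is the reason for doubting that an average of ordinal data tells us much.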

Interval data: This is the minimum requirement needed for the mean to make sense, the idea that the distance between two numbers, a - b, has a consistent meaning, like degrees in temperature readings or the number of strokes taken to complete a round of golf. In these systems, the number zero does not mean the complete absence of a thing, so dividing one number by another from the data set doesn't give meaningful information. Taking an average, though, is about adding values together and dividing by the number of values, so an average temperature or an average of the scores in four rounds of golf does produce a useful statistic.

Rational data (more commonly called ratio data): This is data where not only a - b means something, but also a/b. The difference between interval and rational data is the meaning of the number zero. If zero indicates the complete lack of a thing, then we can talk about something being twice as much as another thing, or 10% less. A lot of numerical systems of measurement are rational, but not all.
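One way to see the difference between interval data and rational data is to change units and watch what survives. The sketch below uses two made-up temperatures and two made-up weights (they are not from any data set in these notes). The conversion formulas are the standard ones, and the function names are invented for the example.

```python
import math


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def lb_to_kg(pounds):
    """Convert pounds to kilograms."""
    return pounds * 0.45359237


# Interval data: hypothetical temperatures in degrees Celsius.
temps_c = [10.0, 20.0]
temps_f = [c_to_f(t) for t in temps_c]          # [50.0, 68.0]

mean_c = sum(temps_c) / len(temps_c)            # 15.0
mean_f = sum(temps_f) / len(temps_f)            # 59.0
print(c_to_f(mean_c) == mean_f)                 # True: the mean survives

ratio_c = temps_c[1] / temps_c[0]               # 2.0
ratio_f = temps_f[1] / temps_f[0]               # 1.36
print(ratio_c, ratio_f)                         # "twice as hot" does not

# Rational data: hypothetical weights in pounds, where 0 means no weight.
weights_lb = [87.5, 175.0]
weights_kg = [lb_to_kg(w) for w in weights_lb]
ratio_lb = weights_lb[1] / weights_lb[0]        # 2.0
ratio_kg = weights_kg[1] / weights_kg[0]
print(math.isclose(ratio_lb, ratio_kg))         # True: the ratio survives
```

The zero on the Celsius and Fahrenheit scales is just a mark on the thermometer, so the ratio depends on which scale is used. Zero pounds and zero kilograms both mean no weight at all, so "twice as heavy" means the same thing in either unit.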


Measures of center
 
Mode
Type of data: any data can be used, but only if there are duplicate values on the list.
Method: Find the most common value. If there is a tie for most common, there can be more than one mode.

Median
Type of data: numerical or ordered categorical
Method: Put the values in order and find the "middle value", which is the value in position (n+1)/2. If n is odd, there is a single median value on the list. If n is even, (n+1)/2 falls halfway between two positions, so there are two middle values on the list, and if the data is numerical, take the average of the two. If the data is categorical and the two values aren't the same, the median lies between two categories.

Mean (average)
Type of data: numerical
Method: add up all the numbers and divide by n (or N), the number of things on the list.

Mid-range
Type of data: numerical
Method: (high + low)/2
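The four methods above can be sketched as short functions for numerical lists. This is one possible way to write them, not the only one. The function names and the small sample list are made up for the illustration, and modes returns an empty list when nothing is duplicated, following the rule given under Mode.

```python
from collections import Counter


def modes(values):
    """Return every most common value; empty if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, count in counts.items() if count == top)


def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2          # position (n+1)/2, counted from 0, when n is odd
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean(values):
    """Add up all the numbers and divide by n."""
    return sum(values) / len(values)


def mid_range(values):
    """Return (high + low)/2."""
    return (max(values) + min(values)) / 2


sample = [3, 1, 4, 1, 5, 9]     # hypothetical list; in order: 1, 1, 3, 4, 5, 9
print(modes(sample))            # [1]
print(median(sample))           # 3.5, the average of 3 and 4
print(mean(sample))             # 23/6, about 3.83
print(mid_range(sample))        # 5.0
```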

Let's take the Games Behind data from today's handout and find the mode, median, mean and mid-range. The dash at the top (-) actually signifies 0 games behind.

Data set: 0, 4½, 12, 13, 13, 13, 15½, 16½, 16½, 18½, 18½, 20, 20½, 22½, 26
size of data set: n = 15

Mode: Mode is easy when the data is put in order like this. There are three 13 values, but only two 16½ values and two 18½ values. The mode is 13.

Median: Since n = 15, the middle position counting from left or right is (15+1)/2 = 16/2 = 8. There is a single number in the eighth position whether counting from the left or the right.

0, 4½, 12, 13, 13, 13, 15½, 16½, 16½, 18½, 18½, 20, 20½, 22½, 26

The first 16½ on the list is the median.

Mean: The total is 230, so the mean is 230/15 = 15.3333..., which we will round to the nearest tenth, giving a mean of 15.3.

Mid-range: The biggest number is 26, the smallest number is 0 and (26+0)/2 = 13, so the mid-range is 13.
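For anyone who wants to check the arithmetic, here is a sketch using Python's built-in statistics module on the same data set, with the halves written as decimals. The variable names are made up for this example.

```python
import statistics

# Games Behind data from the handout; the dash is entered as 0.
games_behind = [0, 4.5, 12, 13, 13, 13, 15.5, 16.5, 16.5,
                18.5, 18.5, 20, 20.5, 22.5, 26]

n = len(games_behind)
mode = statistics.mode(games_behind)
median = statistics.median(games_behind)
mean = statistics.mean(games_behind)
mid_range = (max(games_behind) + min(games_behind)) / 2

print("n =", n)                      # 15
print("total =", sum(games_behind))  # 230.0
print("mode =", mode)                # 13
print("median =", median)            # 16.5
print("mean =", round(mean, 1))      # 15.3
print("mid-range =", mid_range)      # 13.0
```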
