Tuesday, January 20, 2009

Class notes for 1/21


Parameters and statistics. Any number associated with a population is a parameter. Any number associated with a sample is a statistic. It's easy to remember because the associated words begin with the same letter. Methods for remembering things are called mnemonics, pronounced with the leading m silent, named for Mnemosyne, the Greek goddess of memory. The first parameter we learned about is the size of the population, which is always represented with N. The first statistic is the size of the sample, represented by the lowercase letter n.

Subscripts. In general math problems, the letters x and y are often used as variable names. Any letter can be used, and sometimes letters from other alphabets, most notably Greek, are used, especially in trigonometry. When a variable represents a quantity in the world, it makes sense mnemonically to use the letter the word begins with. For instance, if we want to represent the height of a flagpole, the letter h could be used. What if the problem has a second object whose height needs to be measured? Maybe we could call that second height a letter near to h in the alphabet, like g or i or j. What if there are three or four or even more things whose heights have to be kept track of? This is a situation where subscripts become handy.

We could call the height of the first thing h1, the second height h2, the third h3 and so on. We pronounce these names "h one", "h two", "h three", etc. and because there are infinitely many positive whole numbers, we don't have to worry about running out. If you have to write this in a text editor that does not let you make subscripts, the standard is to use an underscore, such as h_1, h_2, h_3, etc. There is more about this in the post about text editor workarounds.

What about data that has been left blank in a list? Sometimes when we have a data set, we have several pieces of information about each unit on the list, but some data has been left blank. There are a couple of things we can do.

Option #1: Change the size of the data set. If the variable is numerical, Option #1 is the only option. In the class survey handed out on Wednesday, for example, we have 38 people who responded to questions in Data Set #1, so n = 38. Three students did not give a response to height in inches, so for that information we have no choice but to change n to 35 for that particular variable. We will need to use n (or N in the case of a population) when calculating the mean and median, as shown below.
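Here is a small sketch of Option #1 in Python. The heights are made-up numbers, not the class survey data, and using None to mark a blank answer is just one possible convention.

```python
# Hypothetical responses to "height in inches"; None marks a blank answer.
heights = [66, None, 70, 64, None, 72, 68]

# Keep only the people who actually answered.
answered = [h for h in heights if h is not None]

n_surveyed = len(heights)   # everyone who turned in the survey
n = len(answered)           # the smaller n we use for this one variable

# The mean divides by the reduced n, not by the number of people surveyed.
mean_height = sum(answered) / n

print(n_surveyed, n, mean_height)
```

The other variables on the survey keep their own n, so the sample size can differ from one variable to the next.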

Option #2: Create a new categorical value. With categorical variables, we can either ignore the blanks, or create a new category called "left blank" or "did not respond" or "none of the above". For instance, when voting, leaving one field blank on a ballot does not invalidate the entire ballot. You can vote for president, but decide not to vote for anyone for city council, and the presidential vote still counts. There have been ballots made from time to time that gave the option of "none of the above", which is like the idea of "left blank".
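Option #2 might look something like this in Python. The ballots and the candidate names are invented for illustration, and the label "left blank" is only one of the possible names for the new category.

```python
from collections import Counter

# Hypothetical city council votes; None means the voter skipped this race.
council_votes = ["Smith", None, "Jones", "Smith", None]

# Turn each blank into its own categorical value instead of dropping it.
labeled = [v if v is not None else "left blank" for v in council_votes]

# Count how often each category shows up.
counts = Counter(labeled)

print(counts)
```

Because the blanks are counted as a category, the counts still add up to the full number of ballots.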

Mean, median and mode. There are several ways of stating a single number which gives an idea of the central measure of a set of numbers. The most used numbers are mean, median and mode.

Mean. The mean is also known as the average. To take the mean of a set of numbers (and this can only be done with numerical values), find the sum of all the numbers and divide by n. If the data set is a population, the mean is represented by the Greek letter mu, with a subscript of the letter of the variable. If the variable is called x, the mean is mu_x. If the variable is called d, the mean is mu_d. If the data set is a sample, we put a bar over the letter used for the variable name, like x-bar or d-bar. (In this text editor, there is no easy way to put a bar above a symbol, so I will type x-bar instead.)

Median. First, the numbers must be put in order, either from low to high or high to low. The median is the number "in the middle", which is to say position (n + 1)/2. If n is odd, then this will mean a specific single position. If n is even, then there are two things "in the middle", and the median will be the average of the two things.

Mode. The mode is the most common value, as long as there are any repeated values. If there are no repeats, there is no mode. If there are repeats and there is a tie for most common, there can be more than one mode.
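The three definitions above can be sketched as short Python functions. The function names are my own choices, and the sketch assumes the list is not empty and contains only numbers.

```python
from collections import Counter

def mean(values):
    # Sum of all the numbers divided by n.
    return sum(values) / len(values)

def median(values):
    # Put the numbers in order, then look at position (n + 1)/2.
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: (n + 1)/2 is a single position (counting from 1).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the two values on either side of the middle.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def modes(values):
    # Most common value or values; an empty list means no repeats, so no mode.
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(mean([1, 2, 3]), median([1, 2, 3, 4]), modes([1, 1, 2, 2, 3]))
```

Returning a list from modes is a design choice that covers all three cases: no mode, one mode, or a tie.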

Let's do some examples.

Data set #1: 11, 11, 9, 7, 13, 12, 8, 5, 12, 11, 4, 4, 8, 8, 5, 2
n = 16
Mean: The sum is 130, so the average is 130/16 = 8.125. The standard for rounding is to round the average to one place farther than the data, so this would round to 8.1.

Median: First, put the numbers in order.
13, 12, 12, 11, 11, 11, 9, 8, 8, 8, 7, 5, 5, 4, 4, 2

Because there are 16 things on the list, the middle position is (16+1)/2 = 8.5, which is to say we will take the average of the 8th and 9th values. Those two values are the first two 8s on the list. Obviously, the average of 8 and 8 is 8, so 8 is the median.

Mode: Both 11 and 8 show up on the list three times, which is the most, so both 8 and 11 are modes for this variable.
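If you want to check the work on Data set #1 with a computer, Python's built-in statistics module is one way to do it. This is only a quick check, not part of the class method, and multimode needs Python 3.8 or later.

```python
import statistics

# Data set #1, typed in the original unsorted order.
data_1 = [11, 11, 9, 7, 13, 12, 8, 5, 12, 11, 4, 4, 8, 8, 5, 2]

n = len(data_1)
mean_1 = statistics.mean(data_1)
median_1 = statistics.median(data_1)      # averages the two middle values when n is even
modes_1 = statistics.multimode(data_1)    # every value tied for most common

print(n, mean_1, round(mean_1, 1), median_1, sorted(modes_1))
```

One difference to watch for: when nothing repeats, multimode returns every value on the list, while in these notes a list with no repeats has no mode.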

Data set #2: 33, 32, 35, 25, 24, 22, 20, 21, 19, 18, 17, 17, 16, 15, 9
n = 15

Mean: The sum of the numbers is 323, so the mean is 323/15 = 21.5333..., which rounds to 21.5 if we round to the nearest tenth.

Median: First we put the numbers in order.

35, 33, 32, 25, 24, 22, 21, 20, 19, 18, 17, 17, 16, 15, 9

The middle position is (15+1)/2 = 8, so the median is the value in the eighth position, which is 20 whether we count left to right or right to left.

Mode: There is only one repeated value on the list, and that is 17.
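Here is a sketch in Python that starts from the unsorted list for Data set #2, sorts it both ways, and counts to the middle position from each end. The variable names are my own.

```python
# Data set #2, typed in the original unsorted order.
wins = [33, 32, 35, 25, 24, 22, 20, 21, 19, 18, 17, 17, 16, 15, 9]

n = len(wins)
position = (n + 1) // 2     # n is odd, so this is a single position

high_to_low = sorted(wins, reverse=True)
low_to_high = sorted(wins)

# Positions count from 1, Python lists count from 0, so subtract 1.
median_from_left = high_to_low[position - 1]
median_from_right = low_to_high[position - 1]

mean_wins = round(sum(wins) / n, 1)

print(position, median_from_left, median_from_right, mean_wins)
```

With an odd n, the same value sits in the middle position no matter which end we start counting from.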

(Data Set #1: Number of wins of the teams in the AFC at season's end.)
(Data Set #2: Number of wins of the teams in the Eastern Conference of the NBA as of the end of play on Jan. 21, 2009.)
