Statistics on a budget: January 2014

Wednesday, January 29, 2014

Notes for 28 January 2014

Frequency tables

Last Thursday, we took a first look at stem and leaf plots, a method for writing a data set using less space and less ink (or, if we are on the Internet, less bandwidth). For data sets that have a lot of duplication, we can use a frequency table, where the numerical values x are matched up with their corresponding frequencies f(x). Here is an example using hockey scores.

x || f(x)
7 || 1
6 || 1
5 || 3
4 || 4
3 || 6
2 || 2
1 || 3
0 || 2

This set isn't that big, so let me write it out as a list in order.

7, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1, 0, 0

If we want to find the standard statistics for mean, median and mode, and the length of the list n, here are the methods.

Length of list n. We take the sum of the f(x) column.

x || f(x)
7 || 1
6 || 1
5 || 3
4 || 4
3 || 6
2 || 2
1 || 3
0 || 2
total || 22

The list has 22 values. We can use this to find mean and median.

Mode. Mode is easy using a frequency table. We see the biggest f(x) is 6 and it corresponds to x = 3. That means the mode is 3, the number that shows up most often.

Mean. We need to know the sum of all 22 values. For this we need to multiply x*f(x). The idea is that there is one 7, so that adds 7 to the total, but there are three 5s, so that adds 15 to the total.

x || f(x) || x*f(x)
7 || 1 || 7
6 || 1 || 6
5 || 3 || 15
4 || 4 || 16
3 || 6 || 18
2 || 2 || 4
1 || 3 || 3
0 || 2 || 0
total || 22 || 69

The average or mean is 69/22 = 3.1363636... Since the data is all whole numbers, we round the average to the nearest tenth, which we would write as x-bar = 3.1
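
The column arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the class materials; the dictionary name `freq` and the other variable names are made up for the example.

```python
# Frequency table of the hockey scores from the notes: value x -> frequency f(x)
freq = {7: 1, 6: 1, 5: 3, 4: 4, 3: 6, 2: 2, 1: 3, 0: 2}

n = sum(freq.values())                        # length of the list: sum of the f(x) column
total = sum(x * f for x, f in freq.items())   # sum of the x*f(x) column
mean = total / n                              # x-bar before rounding
mode = max(freq, key=freq.get)                # value with the biggest f(x); first one found if tied

print("n =", n)
print("total =", total)
print("mean to the nearest tenth =", round(mean, 1))
print("mode =", mode)
```

Note that `max` reports only one value when two values tie for the biggest frequency, so a table with more than one mode would need a slightly different approach.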

Median. To find the median, we need to find the median position, which is always (n + 1)/2. In this case, that is (22+1)/2 = 23/2 = 11.5. The median will be the average of the 11th and 12th numbers on the list. Here's how we find those numbers using the frequency table.

x || f(x) || positions
7 || 1 || 1
6 || 1 || 2
5 || 3 || 3 through 5
4 || 4 || 6 through 9
3 || 6 || 10 through 15
2 || 2 || 16 through 17
1 || 3 || 18 through 20
0 || 2 || 21 through 22

I went to the end of the list, but I could have stopped as soon as we knew that the value 3 was in positions 10 through 15. That means the 11th number is a 3 and so is the 12th number. The average of 3 and 3 is

(3+3)/2 = 3.

In this case, the median is 3.
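
Here is a rough Python sketch of the same position-counting idea, stopping as soon as the running total of f(x) reaches the position we want. The function name `value_at` is hypothetical, chosen just for this example.

```python
# Frequency table of the hockey scores: value x -> frequency f(x)
freq = {7: 1, 6: 1, 5: 3, 4: 4, 3: 6, 2: 2, 1: 3, 0: 2}

def value_at(position, table):
    """Return the value at a 1-based position when the list is written high to low."""
    last = 0
    for x in sorted(table, reverse=True):
        last += table[x]            # x fills the positions up through `last`
        if position <= last:
            return x                # we can stop as soon as the position is covered
    raise ValueError("position is past the end of the list")

n = sum(freq.values())
middle = (n + 1) / 2                # median position, 11.5 for n = 22
lo_pos = int(middle)                # 11
hi_pos = lo_pos if middle == lo_pos else lo_pos + 1   # 12 when n is even
median = (value_at(lo_pos, freq) + value_at(hi_pos, freq)) / 2

print("median position:", middle)
print("median:", median)
```

When n is odd the two positions are the same, so the average is just the single middle value.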

Relative frequencies

We are now dealing with proportions, and the formulas are as follows:
Population: p = F/N
Sample: p-hat = f/n
We often want to compare one proportion to another, either two from the same sample or comparable proportions from different samples, or the sample proportion to the population proportion. Because of this, it is better to write the numbers as decimals or in scales based on powers of 10.
Scales based on powers of 10: The most famous scale based on powers of ten is percentage, which really means "per 100". It is much more common to see "53% of the people agree with the president's plan" than ".53 of the people..." or "53 out of every 100 people...". Technically, all those phrases are saying the same thing, but percentage is the most popular.
One of the places where decimals are used for proportions is in the sports pages. A batting average in baseball (hits/at bats) is given as a decimal to three places, and likewise winning proportions (wins/total games) are written as .xxx. If a batter has 27 hits in 92 at-bats, the batting average 27/92 = .293478261... is shortened to .293 and pronounced "two ninety three". Likewise, a team that has won 17 games and lost 5 will have a winning proportion of 17/22 = .77272727..., rounded to .773, and this is often stated as "the team has a winning percentage of seven seventy three." Technically, this is a mistake, because "percentage" means out of 100. The correct word from the dictionary, which no one ever uses, is "permillage", which means out of 1,000. The team in question would have a winning percentage of 77, and a winning permillage of 773.
In both of the cases from the sports pages, the greater number of places after the decimal is used to break ties. For example, a team with 14 wins and 4 losses is at .778, which is better than 17 wins and 5 losses, while 20 wins and 6 losses is at .769, so is slightly worse.
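
As a small illustration of the tie-breaking point, here is a Python sketch that rounds the three win-loss records above both ways. The function name `winning_proportion` is made up for the example.

```python
def winning_proportion(wins, losses):
    """Wins divided by total games, rounded to three decimal places."""
    return round(wins / (wins + losses), 3)

records = [(14, 4), (17, 5), (20, 6)]   # (wins, losses) from the paragraph above

for wins, losses in records:
    p_hat = winning_proportion(wins, losses)
    percent = round(wins / (wins + losses) * 100)   # nearest whole percent
    print(f"{wins}-{losses}: {p_hat:.3f} on the three-place scale, {percent}% as a whole percent")
```

Two of the three records land on the same whole percent, which is why the sports pages keep the third decimal place.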
To get a number based on a power of 10 scale, you take the proportion and multiply by the power of ten, so it is either p*scale or p-hat*scale, depending on population or sample. Besides greater precision for breaking ties, sometimes we need greater precision because the proportions are so small.
When I ask a class what is the legal limit for blood alcohol while driving, invariably someone will say "point oh eight" and most people will agree. But .08 is wrong. .08 = 8%, and the correct answer is .08% = .0008. I don't blame the students. The number is badly represented and it is an easy mistake to make. Let's take a look at the number on other scales of 10.
.08 out of 100 is the same as
.8 out of 1,000 or
8 out of 10,000 or
80 out of 100,000
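
The list above is just p*scale for four different scales. A minimal Python sketch, with the hypothetical helper name `on_scale`, looks like this.

```python
def on_scale(p, scale):
    """Express a proportion p as 'so many out of scale'."""
    return p * scale

legal_limit = 0.0008   # .08% written as a plain decimal proportion

for scale in (100, 1_000, 10_000, 100_000):
    # the :g format drops trailing zeros and tiny floating point noise
    print(f"{on_scale(legal_limit, scale):g} out of {scale:,}")
```
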
80 parts out of 100,000 is a tiny proportion. To give an idea, one ounce of pure alcohol mixed into ten gallons of blood would give you 78 parts out of 100,000, and most people have between a half gallon and a gallon and a half of blood in their body, between 4 and 12 pints. The amount of alcohol in a person's blood stream that is over the legal limit is about the same amount of alcohol as found in a capful of mouthwash used after brushing your teeth.
We will be dealing with much smaller proportions later in the class, where there are things that can be hazardous to your health at ranges measured in parts per billion, but for now, we will look at the per 100,000 scale for another type of statistic, measurements of mortality rates.
Here are the numbers of homicides in some American cities in 2013.

Chicago: 415
New York City: 333
Detroit: 333
Washington, DC: 103
Oakland: 92

This makes it look like Chicago is most dangerous and Oakland is safest, but this doesn't take into account the populations of the cities. Here is the list with populations rounded to the nearest ten thousand.

Chicago: 415 murders, pop. 2,720,000
New York City: 333 murders, pop. 19,650,000
Detroit: 333 murders, pop. 700,000
Washington, DC: 103 murders, pop. 650,000
Oakland: 92 murders, pop. 390,000

Because the number of murders is much smaller than the populations, if we just divide we will get a very small decimal number. For example, in Chicago:

415/2720000 = 0.0001525735...

With death statistics, the scale of measurement is usually number per 100,000. To get that, we multiply the answer by the scale 100,000.

0.0001525735... * 100,000 = 15.25735...

Rounding to the nearest tenth, we would say the murder rate per 100,000 people in Chicago last year was 15.3. Here are all the numbers, rounded to the nearest tenth.

Chicago: 415 murders, pop. 2,720,000, 15.3 per 100,000
New York City: 333 murders, pop. 19,650,000, 1.7 per 100,000
Detroit: 333 murders, pop. 700,000, 47.6 per 100,000
Washington, DC: 103 murders, pop. 650,000, 15.8 per 100,000
Oakland: 92 murders, pop. 390,000, 23.6 per 100,000
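
The same divide-then-scale step for every city can be sketched in Python, using the counts and rounded populations exactly as they are listed in these notes. The names `cities` and `rate_per_100k` are made up for the example.

```python
# (murders, population) as listed in the notes above
cities = {
    "Chicago": (415, 2_720_000),
    "New York City": (333, 19_650_000),
    "Detroit": (333, 700_000),
    "Washington, DC": (103, 650_000),
    "Oakland": (92, 390_000),
}

def rate_per_100k(count, population):
    """Proportion count/population moved onto the per 100,000 scale."""
    return count / population * 100_000

rates = {city: round(rate_per_100k(m, pop), 1) for city, (m, pop) in cities.items()}

# print from highest rate to lowest
for city, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{city}: {rate} per 100,000")
```

Sorting by the rate instead of the raw count is what changes the order of the cities.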

Looking at the numbers on this scale, it is clear that Detroit has a much higher murder rate than the other cities listed. Oakland has the second highest murder rate. Chicago and DC are very similar and New York City's rate is very low.

Friday, January 24, 2014

More practice:
The five number summary with outlier test with a set with positive and negative values

This data set is the scoring differences in the Western Conference of the NBA as of Friday, 24 January.

7.9 7.3 5.8 5.4 5.0 4.2 4.0 3.2 1.3 0.4 -0.7 -2.2 -2.4 -5.2 -6.7

Answers are in the comments.

Notes for 23 January 2014

Five number summary, IQR and outlier thresholds

Consider the number of wins for each team in the National League at the end of the 2013 season. In order, the list looks like this, with n = 15.

97, 96, 94, 92, 90, 86, 81, 76, 76, 74, 74, 74, 73, 66, 62

The five number summary is as follows.

High: 97
Q3 : 92
Q2 : 76
Q1 : 74
Low: 62

Now we check to see if any of the numbers are outliers.

IQR = 92 - 74 = 18
Q3 + 1.5*IQR = 92 + 27 = 119 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 74 - 27 = 47 (no data lower than this, so no low outliers)
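
Here is a Python sketch of the five number summary and the outlier thresholds for the National League list. Textbooks and software use several different quartile rules; this sketch assumes the one the worked example appears to follow, where Q1 and Q3 are the medians of the lower and upper halves and the middle value is left out of both halves when n is odd. All function names are made up for the example.

```python
def median(sorted_values):
    """Middle value of an already sorted list (average of the two middle values if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (low, Q1, Q2, Q3, high) using the median-of-halves rule assumed above."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]               # values below the median position
    upper = data[half + (n % 2):]     # values above it, skipping the median itself when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

def outlier_thresholds(q1, q3):
    """Return (low threshold, high threshold) based on 1.5 times the IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

nl_wins = [97, 96, 94, 92, 90, 86, 81, 76, 76, 74, 74, 74, 73, 66, 62]
low, q1, q2, q3, high = five_number_summary(nl_wins)
low_cut, high_cut = outlier_thresholds(q1, q3)
outliers = [w for w in nl_wins if w < low_cut or w > high_cut]

print("five number summary:", low, q1, q2, q3, high)
print("thresholds:", low_cut, high_cut)
print("outliers:", outliers)
```
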

Here is the data for the American League.

97, 96, 93, 92, 92, 91, 86, 85, 85, 78, 74, 71, 66, 63, 51

The five number summary is as follows.

High: 97
Q3 : 92
Q2 : 85
Q1 : 71
Low: 51

Now we check to see if any of the numbers are outliers.

IQR = 92 - 71 = 21
Q3 + 1.5*IQR = 92 + 31.5 = 123.5 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 71 - 31.5 = 39.5 (no data lower than this, so no low outliers)

The number that looks out of place is the 51, the number of wins for the Houston Astros, by far the worst team in the major leagues. But even though they won twelve fewer games than the next worst team, the data is so spread out that their very bad year doesn't count as a low outlier.

Stem and leaf format

Here are the numbers for both leagues in a stem and leaf format. Because these are all two digit numbers, the stem is the tens places and the leaves are the one places.

National League
9 | 02467
8 | 16
7 | 344466
6 | 26

American League
9 | 122367
8 | 556
7 | 148
6 | 36
5 | 1
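
For anyone who wants to build a plot like this on a computer, here is a short Python sketch. It assumes two digit whole numbers, as in this data, and it simply skips any stem that has no leaves. The function name `stem_and_leaf` is made up for the example, and the list holds the fifteen American League win totals as they appear in the plot above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot, highest stem first."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)    # tens digit is the stem, ones digit is the leaf
    return [
        f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}"
        for stem in sorted(stems, reverse=True)
    ]

al_wins = [97, 96, 93, 92, 92, 91, 86, 85, 85, 78, 74, 71, 66, 63, 51]

for row in stem_and_leaf(al_wins):
    print(row)
```
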

The National League has one clump of good teams with 90 wins or more and a clump of slightly below average teams between 73 and 76 wins. The American League has six teams with more than 90 wins, and the rest of the league is split fairly evenly among the 80s, 70s and 60s, with only Houston below 60 wins.

Frequencies and relative frequencies

The frequency of a value is how many times it shows up on the list and is denoted by either F if the set is a population or f if the set is a sample. We can also combine values as follows, looking at the data from the American League.

f(92) = 2
f(over 90) = 6
f(70 to 79) = 3

Frequencies are always whole numbers, either zero or positive integers.

Relative frequencies are numbers between 0 and 1 and are sometimes called proportions or probabilities. Sometimes the word percentages is used, but that should only be used if the number is represented with a percent sign. In a population, we use the lowercase letter p, and in a sample, the symbol is called p-hat. Let's take the relative frequencies for the f statistics above, writing them as fractions, decimals and percents.

p-hat(92) = 2/15 = .13333... or approximately 13.3%
p-hat(over 90) = 6/15 = .4 = 40%
p-hat(70 to 79) = 3/15 = .2 = 20%
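
The counting and dividing can be sketched in Python as well. The list below is the fifteen American League win totals read off the stem and leaf plot, and the variable names are made up for the example.

```python
al_wins = [97, 96, 93, 92, 92, 91, 86, 85, 85, 78, 74, 71, 66, 63, 51]
n = len(al_wins)

# frequencies: plain counts, always whole numbers
f_92 = al_wins.count(92)
f_over_90 = sum(1 for w in al_wins if w > 90)
f_70s = sum(1 for w in al_wins if 70 <= w <= 79)

# relative frequencies: p-hat = f/n, shown as fraction, decimal and percent
for label, f in [("92", f_92), ("over 90", f_over_90), ("70 to 79", f_70s)]:
    p_hat = f / n
    print(f"f({label}) = {f}, p-hat({label}) = {f}/{n} = {p_hat:.3f} = {p_hat:.1%}")
```
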

Practice for five number summary.

Practice for frequency and relative frequency
Here are the National League wins again. Find the following frequencies and relative frequencies, writing the relative frequencies as fractions, decimals and percentages. Round the decimals to the nearest thousandth and the percentages to the nearest tenth of a percent. For example, 2/15 would be .133 to the nearest thousandth and 13.3% to the nearest tenth of a percent.

97, 96, 94, 92, 90, 86, 81, 76, 76, 74, 74, 74, 73, 66, 62

f(74) = _________

p-hat(74) = _________

f(between 70 and 79) = _________
p-hat(between 70 and 79) = _________

f(over 90) = _________
p-hat(over 90) = _________

Answers to the frequency and relative frequency question in the comments.

Tuesday, January 21, 2014

Lecture notes for 21 Jan. 2014

Variables and values

Each element of a data set has different pieces of information that are collected. For example, let's say this is some of the information on a driver's license.

Height: 5'11"
Weight: 175 lbs.
Eyes: Brown
Hair: Black
Date of birth: 7/27/1985

The variables here are Height, Weight, Eyes, Hair and Date of birth. Height, Weight and Date of birth are numerical variables, since the answers are numbers. Eyes and hair are categorical variables, where the answers are not numbers. "175 lbs." is the value associated with the variable Weight.

Continuous variables versus discrete variables

Consider shoe sizes. In the United States, the sizes are ½ apart, so the count goes 5, 5½, 6, 6½, 7, 7½, 8, 8½, etc. What this means is if you give me a shoe size, I can tell you the next highest and the next lowest. According to Wikipedia, the smallest shoe size is a 1, so for that size there is no next lowest, but the next highest is 1½. A situation like this means the variable is discrete.

Just because two people wear the same size shoe does not mean they have the same size feet. For example, if a man's foot is between 10.92" and 11.08" long, the most comfortable shoe should be a size 9. You cannot say that there is a size of foot that is "the next size up" or "the next size down" from any given foot. No matter how close two feet are in size, if they are not exactly the same, it should be possible to find a foot that is in between the two that are chosen. When that is the case, when there is no defined "next size" either up or down, we say a variable is continuous.

Types of numerical data
Coded numerical or nominal data: Usually, with numbers there is a meaning we can give to the ideas of "more" and "less". With coded data, we don't necessarily have that. Examples are zip codes, social security numbers and driver's license numbers. Finding the average zip code of a group of people is meaningless, as is the mid-range and median. Finding the mode tells us the most popular of the possible zip codes, and that does give valuable information.

Ordinal data: Here, the idea of a > b has meaning, but the distance between units isn't the same. In a ranking system, it's better to be first than it is to be second, but we can't say how much better, and we don't know if the difference between first and second is the same as the distance between second and third.

Often, when we switch from an ordered categorical system like grades (A, B, C, D, F) to the numbers used for grade points (4.0, 3.0, 2.0, 1.0, 0.0), the choice of what numbers to use is arbitrary. Is the distance from an A to a B really the same as the distance from a C to a D? Is getting an A in one class and a C in another really the same as getting two Bs, since both would be a 3.0 Grade Point Average (GPA)? How about 2 As and a D, which is 3.0, or 3 As and an F? Should all those situations be counted the same way?

The feeling of this instructor is that they should not. Again, like coded data, ordinal data can use some of the measures of center, like median and mode, but average and mid-range do not give useful information.
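
To see that the four report cards in the GPA example really do come out to the same number, here is a quick Python check. It assumes every class carries the same number of units, which the example does not say one way or the other; the names `POINTS` and `gpa` are made up for the sketch.

```python
# Grade points as given above
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(grades):
    """Plain average of grade points, assuming equal units per class."""
    return sum(POINTS[g] for g in grades) / len(grades)

report_cards = ["AC", "BB", "AAD", "AAAF"]   # one letter per class

for card in report_cards:
    print(card, "->", gpa(card))
```
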

Interval data: This is the minimum requirement needed for the mean to make sense, the idea that the distance between two numbers, a - b, has a consistent meaning, like degrees in temperature readings or the number of strokes taken to complete a round of golf. In these systems, the number zero does not mean the complete absence of a thing, so dividing one number by another from the data set doesn't give meaningful information. Taking an average, though, is about adding values together and dividing by the number of values, so an average temperature or an average of the scores in four rounds of golf does produce a useful statistic.

Rational data (more commonly called ratio data): This is data where not only a - b means something, but also a/b. The difference between interval and rational data is the meaning of the number zero. If zero indicates the complete lack of a thing, then we can talk about something being twice as much as another thing, or 10% less. A lot of numerical systems of measurement are rational, but not all.

Measures of center
Mode
Type of data: any data can be used, but only if there are duplicate values on the list.
Method: Find the most common value. If there is a tie for most common, there can be more than one mode.

Median
Type of data: numerical or ordered categorical
Method: Put the values in order and find the "middle value" which is the value in position (n+1)/2. If n is odd, there is a single median value on the list. If n is even, there are two middle values on the list, and if numerical, take the average of the two. If the data is categorical and the two values aren't the same, the median lies between two categories.

Mean (average)
Type of data: numerical
Method: add up all the numbers and divide by n (or N), the number of things on the list.

Mid-range
Type of data: numerical
Method: (high + low)/2

Let's take the Games Behind data from today's handout and find the mode, median, mean and mid-range. The dash at the top (-) actually signifies 0 games behind.

Data set: 0, 4½, 12, 13, 13, 13, 15½, 16½, 16½, 18½, 18½, 20, 20½, 22½, 26
size of data set: n = 15

Mode: Mode is easy when the data is put in order like this. There are three 13 values, but only two 16½ values and two 18½ values. The mode is 13.

Median: Since n = 15, the middle position counting from left or right is (15+1)/2 = 16/2 = 8. There is a single number in the eighth position whether counting from the left or the right.

0, 4½, 12, 13, 13, 13, 15½, 16½, 16½, 18½, 18½, 20, 20½, 22½, 26

The first 16½ on the list is the median.

Mean: The total is 230, so the mean is 230/15 = 15.3333..., which we will round to the nearest tenth, so the mean is rounded to 15.3

Mid-range: The biggest number is 26, the smallest number is 0 and (26+0)/2 = 13, so the mid-range is 13.
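
The four measures of center for the Games Behind data can be double-checked with Python's built-in `statistics` module. This is only a sketch for checking the hand calculations; the variable names are made up, and `multimode` needs Python 3.8 or later.

```python
from statistics import mean, median, multimode

games_behind = [0, 4.5, 12, 13, 13, 13, 15.5, 16.5, 16.5, 18.5, 18.5, 20, 20.5, 22.5, 26]

n = len(games_behind)
modes = multimode(games_behind)      # a list, since ties for most common are possible
med = median(games_behind)           # value in position (n+1)/2 once sorted
avg = mean(games_behind)             # total divided by n
mid_range = (max(games_behind) + min(games_behind)) / 2

print("n =", n)
print("mode(s):", modes)
print("median:", med)
print("mean to the nearest tenth:", round(avg, 1))
print("mid-range:", mid_range)
```
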

Saturday, January 18, 2014

Syllabus for Spring 2014

Math 13: Introduction to Statistics - class code 22723
Spring 2014 – Laney College
Tuesday Thursday: 10:30 am-12:20 pm

Instructor: Matthew Hubbard
Email address: mhubbard@peralta.edu
No recommended textbook
class website: http://budgetstats.blogspot.com/

Office hours:
T-Th: 1:30-1:55 pm G-201 (Math Lab)
T-Th: 6:30-6:55 pm G-201 (Math Lab)

Add and drop class dates
Last date to add: Sat., Feb. 1
Last date to drop class without a “W”: Sat., Feb. 1
Last date to drop class with a “W”: Sat., May 3

Holiday schedule for Tuesday-Thursday classes
Spring break April 14 to 19

Test dates:
Midterm 1: Thurs., Feb. 27
Midterm 2: Thurs., Apr. 10
Comprehensive Final: Thursday, May 22 10:00 am to noon

Homework to be turned in: Assigned on Thursdays, due the next Tuesday.
Homework can be turned in to Mr. Hubbard’s box in the Math Lab G-201, open Tuesday through Thursday from 9:00 am to 7:00 pm
OR can be turned in by e-mail to mhubbard@peralta.edu.

Late homework accepted AT THE BEGINNING of the next class (usually Thursday)

Quizzes: One every Thursday, except the first and last week and weeks with a midterm

Grading system
20% Homework
5% Labs
25% Quizzes best 2 of 3
25% Midterm I best 2 of 3
25% Midterm II best 2 of 3
25% Final

Lowest two of the homework scores will be dropped from the total.
Lowest two of the quiz scores will be dropped from the total.
The lowest total out of 100 points among the quiz total and the two midterms will be dropped from the final grade.
Anyone who gets a higher grade out of 100 points on the final than the weighted average of all grades combined will have the final exam percentage decide the final grade instead. This option is only available to students who have missed at most two homework assignments.
There are no make-up quizzes. Midterms can be made up if the student gives prior notice and has time to take the test before the next class meeting.
Anyone whose class average is over 97% going into the final can skip the final and get an A in the class.

Class rules: All cell phones and electronic communication devices off during class.
No hats, hoodies or headphones worn during quizzes and exams.
No calculators that also combine a cell phone or text message machine.

Recommended calculator: TI-30XIIs (any calculator with at least two lines of output will do, the TI-30XIIs is the cheapest that does all the things you need to do in this class. If you need help with any Texas Instruments calculator, I should be able to steer you in the right direction. I haven’t used other brands of calculators as much.)

Academic honesty: All assignments you turn in, homework, exams and quizzes, must be your own work. Anyone caught cheating on these assignments will be punished, where the punishment can be as severe as getting a zero on the assignment.

Student learning outcomes

Math 13 — Introduction to Statistics
1. Describe numerical and categorical data using statistical terminology and notation.
2. Analyze and explain relationships between variables in a sample or a population.
3. Make inferences about populations based on data obtained from samples.
4. Given a particular statistical or probabilistic context, determine whether or not a particular analytical methodology is appropriate and explain why.

The reciprocal relationship

The teacher will be on time and prepared to teach the class.
The students will be on time and prepared to learn.

The teacher will present the material to the best of his ability.
The students will absorb the material to the best of their ability. They will ask questions when topics are not clear.

The teacher will do his best to answer the questions the students ask about the material, either by repeating an answer with more details included or by taking a different approach to the material that might be clearer to some students.
The students will understand if the teacher feels a topic has been covered enough for the majority of the class and will accept questions being answered outside the class, either in extra time or through written communication.

The teacher will do his best to keep the class about the material. Personal details and distractions that are not germane to the class should not be part of the class.
The students will do their best to keep the class about the material. Questions that are not about the topic should be avoided. Distractions like cell phones and texting are not welcome when the class is in session.

The teacher will give assignments that will help the students master the skills required to pass the course.
The students will put in their best efforts to complete the assignments.
When the assignments are completed, the teacher will make every effort to get the assignments graded and back to the students in a timely manner, by the next class session whenever possible.

The teacher will present real life situations where the skills being learned will be used when they exist. In math, sometimes a particular skill is needed in general to solve later problems that will have real life applications. Other skills have the application of “learning how to learn”, of committing an idea to memory so that committing other ideas to memory becomes easier in the long run.
The student has the right to ask “When will I use this?” when dealing with mathematical topics. Sometimes, the answer is “We need this skill for the next skill we will learn.” Other times, the answer is “We are learning how to learn.” Both of these answers are as valid in their way as “We will need this to understand perspective” or “We use this to balance our checkbooks” or “Ratios can be used to change a recipe that serves three people to one that serves ten people” or other real life applications.
