Statistics on a budget: February 2009

Tuesday, February 24, 2009

Class notes for 2/23

Sometimes, graphic presentation of data uses pictograms instead of the bars of a bar chart or the slices of a pie chart. It's best when the icons are scaled one-dimensionally instead of two-dimensionally. For example, using the record icon for albums sold is a one-dimensional representation, and gives the information correctly. One stack is about 8 records tall (7 full records and a partial record) while the other stack is 5 records tall.

The icon of a gold bar is not used as well, because the bar is made both wider and taller. The ratio between the prices is 741.25/265.50, which is about 2.79, which means the higher price is a little less than three times the lower price. But if we take the ratio of the rectangle sizes, (92*54)/(35*21), we find that the big gold bar is about 6.76 times the size of the small one, which is nearer to 2.79*2.79 than to 2.79. It's better to represent the information as just a height or just a width and not both, for exactly the reason presented here.
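To check the comparison above, here is a short Python sketch. It uses only the prices and icon dimensions quoted in the paragraph; the variable names are my own.

```python
# Prices and icon dimensions quoted in the notes above.
high_price, low_price = 741.25, 265.50
big_width, big_height = 92, 54      # big gold bar icon
small_width, small_height = 35, 21  # small gold bar icon

price_ratio = high_price / low_price
area_ratio = (big_width * big_height) / (small_width * small_height)

print(f"price ratio:         {price_ratio:.2f}")
print(f"icon area ratio:     {area_ratio:.2f}")
print(f"price ratio squared: {price_ratio ** 2:.2f}")
```

Scaling both the width and the height by the price ratio would make the area grow by roughly the square of that ratio, which is why the picture exaggerates the difference.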

The test will cover material from all the class so far, so reading your notes, reading the online notes and reviewing homework and quizzes is the best use of your time studying.

Wednesday, February 18, 2009

Class notes for 2/18

Increase and decrease, in absolute terms and in percentage: When we have numbers that have changed over time, we can get the absolute change by subtracting the old number from the new. If the numbers are rational data, which as you will recall has to do with the idea that zero means the complete lack of a thing, then we can also discuss the percentage change, which we get through the formula (new-old)/old*100. Don't forget to put the parentheses around the numbers in the numerator.

For instance, the illustration from the graphic presentation of data post shows the differing fortunes of two American corporations over the last eight years, by listing the market capitalization in 2000 compared to the same statistic in 2008. (Market capitalization means how much money it would cost to buy all the company's stock at the price it is trading for.)

Apple has had a good century so far, and their stock's value has risen from $5.56 billion in 2000 to $85.25 billion in 2008. In absolute terms that is an increase of (85.25-5.56) = 79.69 billion dollars.

The percentage increase is even more impressive as (85.25-5.56)/5.56*100 = 1433.27..., so the market capitalization has increased by a whopping 1433.3%.

General Motors has had a bad eight years, and their market capitalization has shrunk from $28.3 billion in 2000 to $2.99 billion in 2008. (2.99-28.3) = -25.31, so the absolute decrease is about $25.3 billion. The percentage decrease is (2.99-28.3)/28.3*100 = -89.43..., which means the percentage decrease for GM's market capitalization is about 89.4%. This illustrates an important point. It is possible for a percentage increase to be more than 100%, and this happens any time a number more than doubles. But unless a formerly positive number goes negative, and that is a rare thing, the percentage decrease will not be more than 100%.
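The two formulas can be written as small Python functions. This is only a sketch; the function names are made up for illustration, and the numbers are the market capitalizations given above, in billions of dollars.

```python
def absolute_change(old, new):
    """New value minus old value."""
    return new - old

def percent_change(old, new):
    """(new - old) / old * 100. Only meaningful when old is not zero."""
    return (new - old) / old * 100

companies = [("Apple", 5.56, 85.25), ("General Motors", 28.3, 2.99)]
for name, old, new in companies:
    print(f"{name}: {absolute_change(old, new):+.2f} billion, "
          f"{percent_change(old, new):+.1f}%")
```

A negative result means a decrease, so the sign does the work of the words "increase" and "decrease".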

Bar charts, also known as histograms: Bar charts are a popular way to represent categorical data, whether ordered or unordered. One of the great advantages of bar charts over pie charts is that it is easier to compare two data sets side by side. In the picture on the left, we see the percentages of people in four different demographic groups in the states of Florida (the yellow histograms) and Texas (the red histograms). What we see is that Texas has a slightly larger percentage of children under 5 years old (8% to 6%), of juveniles (19% to 16%), of adults under the age of 65 (63% to 61%), but Florida makes up for that gap by having a significantly higher percentage of senior citizens over the age of 65 (17% to 10%). Because Texas has a much higher population than Florida, it makes sense for us to look at relative frequency instead of frequency, so we can compare the data sets side by side and not have the Texas numbers completely overwhelm the Florida numbers.

Line charts: A very common use for line charts is to track numerical data over time. Newspapers and financial websites often show how the price of a commodity or stock is doing by using line charts, which look something like the profile of a mountain range. This graph was taken from kitco.com, a website devoted to trading metals, including gold, silver, copper and platinum. These are the prices of gold minute by minute for three days, Monday, Feb. 16, 2009 to Wednesday, Feb. 18. The Monday prices are shown in blue, the Tuesday prices in red and the Wednesday prices in green. What we can see is that prices rose on Monday from under $940 an ounce to nearly $960, Tuesday showed a steady climb from $960 to $970 an ounce, then Wednesday prices fell a bit early in the day, but rallied to finish near $980 an ounce.

Line charts can also be used to represent data that might be shown as histograms. This line chart in red and yellow shows the same demographic data we had in the bar chart section from above, with red showing Texas data and yellow showing Florida data.

There are two sets of lines in each color. The thin lines are the same as the histograms from the section above, while the thick lines that climb to 100% at the far right are ogives, pronounced "Oh-jives", which track the cumulative amount. Here are the numbers for each state, both from each demographic and the accumulation of demographic groups from youngest to oldest.

Florida:
Under 5 years old: 6.2%
6-18 years old: 16.0%
19-64 years old: 61.0%
65 and over: 16.8%

Florida (cumulative):
Under 5 years old: 6.2%
Under 18 years old: 22.2%
Under 64 years old: 83.2%
All age groups: 100%

Texas:
Under 5 years old: 8.1%
6-18 years old: 19.4%
19-64 years old: 62.5%
65 and over: 9.9%

Texas (cumulative):
Under 5 years old: 8.1%
Under 18 years old: 27.5%
Under 64 years old: 90.0%
All age groups: 100%
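The cumulative numbers for an ogive are running totals of the individual percentages. Here is a minimal Python sketch using the state percentages listed above; the helper name cumulative is my own.

```python
from itertools import accumulate

groups = ["Under 5", "6-18", "19-64", "65 and over"]
florida = [6.2, 16.0, 61.0, 16.8]
texas = [8.1, 19.4, 62.5, 9.9]

def cumulative(percentages):
    """Running totals, rounded to the nearest tenth of a percent."""
    return [round(total, 1) for total in accumulate(percentages)]

for state, data in [("Florida", florida), ("Texas", texas)]:
    print(state)
    for group, total in zip(groups, cumulative(data)):
        print(f"  through {group}: {total}%")
```

Because the listed percentages are themselves rounded, a running total can end a tenth of a percent away from 100.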

Pie charts: Pie charts work best with unordered categorical data. The pie slices are ordered from largest to smallest. The way most people set up the start of the data is to put a line segment pointing straight up and start putting the pie slices in clockwise from largest to smallest. This data shows the percentages of racial groups in the U.S. as of 2008. Besides the four main racial groups of whites (here identified as European Americans), Hispanics, African Americans and Asian Americans, all other racial groups account for less than 0.5%.

There is an alternative way of setting up pie charts where the "start line" points to 3 o'clock and the data is put from largest to smallest, moving counterclockwise. This picture is taken from the Sitemeter website attached to another blog that I maintain. 70% of the last 100 visitors are from the United States, 11% are from Canada, 10% cannot be recognized by the Sitemeter software by country of origin, followed by 3% United Kingdom, 2% Chile, and 1% each for Panama, Ireland, Guam and Germany.

Tuesday, February 17, 2009

graphic presentation of data

This is a collection of statistics taken from the Atlantic Monthly magazine showing before and after statistics from the beginning and the end of the Bush administration. Most of the before numbers come from 2000, though some come from 2001, and the after numbers are from 2006, 2007 or 2008. You can click on the picture to get a larger version. If you are on a computer where you are allowed to save files, you can right click on the picture and scroll down to "Save Image As..." to download the file to your computer. There will be questions on future practice problem sets that ask about the data presented here.

Friday, February 13, 2009

Class notes for 2/11

Venn diagrams and contingency tables: In other classes you have taken, you may have seen Venn diagrams. The idea is to represent the ideas of sets and subsets and intersections of subsets pictorially. In this picture, the rectangle represents the whole set of things we are considering, known as the universe, while the two circles represent subsets A and B. This splits the rectangle into four parts, colored in the picture in white, yellow, gray and blue. Here are the color combinations that represent some of the sets we discuss in probability.

A = yellow and gray
not A = white and blue
B = gray and blue
not B = white and yellow
A and B = gray
A or B = yellow, gray and blue
not (A and B) = not A or not B = white, yellow and blue
not (A or B) = not A and not B = white
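If you know a little programming, the same color combinations can be checked with Python's built-in sets. This sketch treats each colored region as one element of the universe; that encoding is my own choice for illustration.

```python
universe = {"white", "yellow", "gray", "blue"}
A = {"yellow", "gray"}
B = {"gray", "blue"}

not_A = universe - A
not_B = universe - B

print("A and B:", sorted(A & B))
print("A or B:", sorted(A | B))
print("not (A and B):", sorted(universe - (A & B)))
print("not (A or B):", sorted(universe - (A | B)))

# The last two lines of the list above, checked directly.
print(universe - (A & B) == not_A | not_B)
print(universe - (A | B) == not_A & not_B)
```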

When a variable has only two values, the way gender can be male or female or handedness can be left or right, then "not male" is the same as "female", and "not left" is the same as "right". Many variables have more than two values, so "not 20-29" is easier to write than "19 & under or 30-39 or 40-49 or 50 & over". There are problems often associated with Venn diagrams and figuring out how many subjects are in certain subsets that are easier to solve using contingency tables than using Venn diagrams. Here is an example.

In both data sets combined, there are 80 subjects. There are a total of 6 left handed subjects, 30 males. 4 of the males are left handed. How many females are right handed?

How to solve it: Since the total is 80, 30 males means 50 females and 6 left handers means 74 righthanders. This means we know the row and column totals of a contingency table.

_____|__M_|__F_|_total
R____|____|____|_74
L____|____|____|__6
total|_30_|_50_|_80 grand total

Because we had the grand total and the total number of males, we get the total number of females by subtracting. We call this idea degrees of freedom. Once we have the total, and we know that two numbers add up to that total, being given any single value means you can figure out the other value, so there is only one degree of freedom. If instead we were dealing with age groups, where we have five values, then we would have four degrees of freedom, meaning if you knew the frequencies of four values, you could add those up and subtract that sum from the size of the whole set to find the fifth frequency that wasn't given.

Once we have all the row and column totals in a 2x2 contingency table, we only need one value inside the box to get all the other three, so once again, we have one degree of freedom. There are four left handed males, which means a 4 is put in row 2, column 1, as follows:

_____|__M_|__F_|_total
R____|____|____|_74
L____|__4_|____|__6
total|_30_|_50_|_80 grand total

Using subtraction, we can fill in the rest of the values.

_____|__M_|__F_|_total
R____|_26_|_48_|_74
L____|__4_|__2_|__6
total|_30_|_50_|_80 grand total

This means there are 48 right handed females, which is what we were asked. We also know there are 2 female lefties and 26 male righties, though those questions were not asked.
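The subtraction steps can be sketched in Python. The function name and the way the table is stored are assumptions made for this example; the numbers are the ones from the problem above.

```python
def fill_table(total, males, lefties, left_males):
    """Fill a 2x2 table from the grand total, one row total,
    one column total and one inside value, using only subtraction."""
    females = total - males
    righties = total - lefties
    right_males = males - left_males
    left_females = lefties - left_males
    right_females = females - left_females
    # The right handed row has to add up to its row total.
    assert right_males + right_females == righties
    return {
        ("R", "M"): right_males, ("R", "F"): right_females,
        ("L", "M"): left_males, ("L", "F"): left_females,
    }

table = fill_table(total=80, males=30, lefties=6, left_males=4)
print("right handed females:", table[("R", "F")])
```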

Conditional probability: Besides asking for p-hat(females), p-hat(male and right) or p-hat(left or female), we have the idea of p-hat(female, given left), which means if we count only the left handed subjects, how many of them are female. If you have the information in contingency table form, what changes in such a question is the denominator of the fraction, which is a row total or a column total instead of the grand total. Here are three examples.

p-hat(female and left) = 2/80 = .025 = 2.5%
p-hat(female, given left) = 2/6 = .333... ~ 33.3%
p-hat(left, given female) = 2/50 = .04 = 4.0%

[Note: I will use ~ to mean approximately equal when typing on the blog.]

There is a formula for conditional probability if you don't have the information in contingency table form.

p(A, given B) = p(A and B)/p(B)
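To see that the formula agrees with the contingency table method, here is a small Python sketch using the 80 subject table from above. It uses the standard fractions module so nothing gets rounded; the variable names are my own.

```python
from fractions import Fraction

grand_total = 80
left_total = 6
female_total = 50
female_and_left = 2

p_female_and_left = Fraction(female_and_left, grand_total)
p_left = Fraction(left_total, grand_total)
p_female = Fraction(female_total, grand_total)

# p(A, given B) = p(A and B) / p(B)
p_female_given_left = p_female_and_left / p_left
p_left_given_female = p_female_and_left / p_female

print("p-hat(female, given left) =", p_female_given_left)
print("p-hat(left, given female) =", p_left_given_female)
```

Fraction reduces to lowest terms, so 2/6 prints as 1/3 and 2/50 prints as 1/25.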

Practice problems:

In a sample of 42 people, there are 4 left handed people. 19 people gave the answer of 3 on a scale from 1 to 5 for difficulty of the class. 2 of the left handed people gave the answer 3 to the difficulty question. Find the following probabilities, rounded to the nearest tenth of a percent.

p-hat(left and difficulty 3) =
p-hat(right or difficulty 3) =
p-hat(right, given difficulty 3) =
p-hat(difficulty 3, given right) =

Answer in the comments.

Tuesday, February 10, 2009

Class notes for 2/9

We are now dealing with proportions, and the formulas are as follows:

Population: p = F/N

Sample: p-hat = f/n

We often want to compare one proportion to another, either two from the same sample or comparable proportions from different samples, or the sample proportion to the population proportion. Because of this, it is better to write the numbers as decimals or in scales based on powers of 10.

Scales based on powers of 10: The most famous scale based on powers of ten is percentage, which really means "per 100". It is much more common to see "53% of the people agree with the president's plan" than ".53 of the people..." or "53 out of every 100 people...". Technically, all those phrases are saying the same thing, but percentage is the most popular.

One of the places where decimals are used for proportions is in the sports pages. A batting average in baseball (hits/at bats) is given as a decimal to three places, and likewise winning proportions (wins/total games) are written as .xxx. If a batter has 27 hits in 92 at-bats, the batting average 27/92 = .293478261... is shortened to .293 and pronounced "two ninety three". Likewise, a team that has won 17 games and lost 5 will have a winning proportion of 17/22 = .77272727... ~ .773, and this is often stated as "team has a winning percentage of seven seventy three." Technically, this is a mistake, because "percentage" means out of 100. The correct word from the dictionary, which no one ever uses, is "permillage", which means out of 1,000. The team in question would have a winning percentage of 77, and a winning permillage of 773.

In both of the cases from the sports pages, the greater number of places after the decimal is used to break ties. For example, a team with 14 wins and 4 losses is at .778, which is better than 17 wins and 5 losses, while 20 wins and 6 losses is at .769, so is slightly worse.
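Here is a sketch of the sports page format in Python. Both function names are invented for this example.

```python
def three_place(successes, attempts):
    """Format a proportion the way the sports pages do, like .293."""
    return f"{successes / attempts:.3f}".lstrip("0")

def permillage(successes, attempts):
    """The same proportion on the per 1,000 scale, as a whole number."""
    return round(successes / attempts * 1000)

print("batting average:", three_place(27, 92))
for wins, losses in [(14, 4), (17, 5), (20, 6)]:
    print(f"{wins}-{losses}: {three_place(wins, wins + losses)}")
print("permillage for 17-5:", permillage(17, 22))
```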

To get a number based on a power of 10 scale, you take the proportion and multiply by the power of ten, so it is either p*scale or p-hat*scale, depending on population or sample. Besides greater precision for breaking ties, sometimes we need greater precision because the proportions are so small.

When I ask a class what is the legal limit for blood alcohol while driving, invariably someone will say "point oh eight" and most people will agree. But .08 is wrong. .08 = 8%, and the correct answer is .08% = .0008. I don't blame the students. The number is badly represented and it is an easy mistake to make. Let's take a look at the number on other scales of 10.

.08 out of 100 is the same as
.8 out of 1,000 or
8 out of 10,000 or
80 out of 100,000

80 parts out of 100,000 is a tiny proportion. To give an idea, one ounce of pure alcohol mixed into ten gallons of blood would give you 78 parts out of 100,000, and most people have between a half gallon and a gallon and a half of blood in their body, between 4 and 12 pints. The amount of alcohol in a person's blood stream that is over the legal limit is about the same amount of alcohol as found in a capful of mouthwash used after brushing your teeth.

We will be dealing with much smaller proportions later in the class, where there are things that can be hazardous to your health at ranges measured in parts per billion, but for now, we will look at the per 100,000 scale for another type of statistic, measurements of mortality rates.

Here are the number of homicides in some local cities in 2007.

Oakland: 124 homicides
Richmond: 28 homicides
San Francisco: 98 homicides

Clearly, comparing these numbers is misleading, because we know these cities have very different numbers of citizens, so the standard way to measure these statistics is the per 100,000 population scale, which we find by the formula

f/n * scale

which in this case is

(# of homicides)/(city population) * 100,000

Oakland's population in 2007 is estimated at 415,000, Richmond at 106,000 and San Francisco at 825,000, so the murder rates on this standard scale are as follows

Oakland: 124/415000 * 100000 = 29.9
Richmond: 28/106000 * 100000 = 26.4
San Francisco: 98/825000 * 100000 = 11.9

So even though more people were murdered in San Francisco than in Richmond in 2007, the murder rate in Richmond was over twice as high, because Richmond has barely 1/8 of the population of San Francisco. (note: The trends for the three cities this decade are going in different directions. Oakland's murder rate is on the rise, while Richmond's is falling and San Francisco's has stayed about the same.)
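The f/n * scale calculation is easy to repeat for any list of cities. This Python sketch uses the 2007 homicide counts and population estimates given above; rate_per is a made-up helper name.

```python
def rate_per(count, population, scale=100_000):
    """count / population * scale, the f/n * scale formula."""
    return count / population * scale

cities = {
    "Oakland": (124, 415_000),
    "Richmond": (28, 106_000),
    "San Francisco": (98, 825_000),
}

rates = {city: round(rate_per(homicides, population), 1)
         for city, (homicides, population) in cities.items()}

for city, rate in rates.items():
    print(f"{city}: {rate} per 100,000")
```

Changing the scale argument to 100 or 1,000 gives the percentage or permillage version of the same proportion.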

Calculating proportions (probabilities): There are times when we will need to find new proportions from information previously calculated, either adding and subtracting old numbers or multiplying or dividing. It's best to use the fractional forms of the data when available, then round the answers after using the exact numbers instead of using answers that might have been rounded. Every time you use a rounded answer in a calculation, there is a chance to increase the rounding error even more.

The words "proportions" and "probabilities" will be used interchangeably in the rest of this post.

Contingency tables and compound probabilities: Let's take the data from data set #2 regarding gender and left/right handedness and turn it into a contingency table.

___R__L_
M__9__3_
F_29__1_

What these numbers represent is there are 9 right-handed males, 3 left-handed males, 29 right-handed females and 1 left-handed female. We will now add the row totals, the column totals and the grand total, which will be marked in red.

___R__L_
M__9__3_ 12
F_29__1_ 30
__38__4_ 42=grand total

This gives us the following probabilities. We assume this is a sample so these values are p-hat.

p-hat(female) = 30/42
p-hat(male) = 12/42
p-hat(left) = 4/42
p-hat(right) = 38/42

We can also combine values from different variables as follows.

p-hat(female and left) = 1/42
p-hat(female and right) = 29/42
p-hat(male and left) = 3/42
p-hat(male and right) = 9/42

When we use the conjunction "and", we take the number of subjects that would answer yes to being both female and left handed, for example, and divide by the size of the data set. This means a single entry from the contingency table divided by the grand total.

The conjunction "or" means we want all the subjects that are in the combination of a row and a column together, while being careful not to count anyone twice. We use the principle of inclusion and exclusion when calculating this, which is as follows.

p(A or B) = p(A) + p(B) - p(A and B)

The rule is the same whether we are dealing with p or p-hat.

The reason we subtract is as follows. If I count all the women in the set, and then all the left handed people in the set and add those together, any left handed women were counted twice, so we subtract the total of left handed women to get the count correct.

p-hat(female) = 30/42
p-hat(left) = 4/42
p-hat(female and left) = 1/42

p-hat(female or left) = 30/42 + 4/42 - 1/42 = 33/42
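One way to convince yourself that the subtraction is right is to compute the "or" two ways, once with the formula and once by counting the cells directly. This Python sketch does both with the 42 subject table above; the dictionary layout and the p_hat helper are my own choices for illustration.

```python
from fractions import Fraction

# Contingency table for data set #2: (gender, handedness) -> count.
counts = {("M", "R"): 9, ("M", "L"): 3, ("F", "R"): 29, ("F", "L"): 1}
n = sum(counts.values())

def p_hat(condition):
    """Add up the cells that satisfy the condition, then divide by n."""
    matching = sum(c for cell, c in counts.items() if condition(cell))
    return Fraction(matching, n)

p_female = p_hat(lambda cell: cell[0] == "F")
p_left = p_hat(lambda cell: cell[1] == "L")
p_female_and_left = p_hat(lambda cell: cell == ("F", "L"))

by_formula = p_female + p_left - p_female_and_left
by_counting = p_hat(lambda cell: cell[0] == "F" or cell[1] == "L")

print(by_formula, by_counting)
```

Fraction reduces to lowest terms, so the answer prints as an equivalent fraction rather than as 33/42.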

Another way to combine proportions is the conjunction "given". The idea of p(female, given left) means how many females are there in the subset of left handed people while p(left, given female) means how many left-handers are there among the women. The formula for this is

p-hat(A, given B) = p-hat(A and B)/p-hat(B)

In a contingency table, the easiest way to calculate this is to find the place in the table that tells us how many people are in the row and column that correspond to A and B, then divide by the row or column total that corresponds to B. Here are the eight different values we have for the given probabilities.

p-hat(female, given left) = 1/4
p-hat(female, given right) = 29/38
p-hat(male, given left) = 3/4
p-hat(male, given right) = 9/38
p-hat(left, given female) = 1/30
p-hat(right, given female) = 29/30
p-hat(left, given male) = 3/12 = 1/4
p-hat(right, given male) = 9/12 = 3/4

Complementary events and their probabilities: The complement of a subset is all the elements that are in the whole set but not in the subset. In a variable with two values, the complement of one value is simply the other value, so the complement of lefthanders is righthanders, and the complement of men is women. In a variable with more than two values, the complement of a value is all the other values. The complement of the 20-29 value in age group would be the subjects 19 and under combined with the subjects 30 and over. Since we are using the words "and", "or" and "given" as our conjunctions, the word used for complement in such a setting is "not".

If we know p(A), the easiest way to calculate the probability of the complement is

p(not A) = 1 - p(A).

In some books and in my notes, the probability of the complementary event will be denoted by the letter q, defined by the equation p + q = 1, or q = 1 - p.

Again, the rules for p and q are the same for p-hat and q-hat.

Here are some complements of some of the ideas we have defined using the conjunctions.

female complement = male
male complement = female
left complement = right
right complement = left

female and left complement = male or right
male and left complement = female or right

female, given left complement = male, given left
male, given left complement = female, given left
female, given right complement = male, given right
male, given right complement = female, given right

Practice problems: (answers given in comments)

1) Here are the homicide numbers for Oakland, Richmond and San Francisco from earlier in this century.

Oakland: 96 homicides, 399,000 population
Richmond: 40 homicides, 99,000 population
San Francisco: 96 homicides, 775,000 population

Find the murder rates from these years, rounded to the nearest tenth per 100,000 population and rank them from lowest (1st) to highest (3rd).

2) Find the complements of the following sets and the probabilities for the set and the complement rounded to the nearest tenth of a percent. (If it rounds exactly to a percent, you can write the answer as 42% instead of 42.0%, to give an example.)

a) left, given female
b) left and female
c) left or female
d) female, given left

Sunday, February 8, 2009

practice data

The following data set gives the values for the following variables.

Column #1: Abbreviation of state name
Column #2: Number of electoral votes in the state
Column #3: Political party that won that state's electoral vote in 2008
Column #4: Political party that won that state's electoral vote in 2004
Column #5: Political party that won that state's electoral vote in 2000

CA 55 D D D
TX 34 R R R
NY 31 D D D
FL 27 D R R
IL 21 D D D
PA 21 D D D
OH 20 D R R
MI 17 D D D
NJ 15 D D D
NC 15 D R R
GA 15 R R R
VA 13 D R R
MA 12 D D D
WA 11 D D D
IN 11 D R R
MO 11 R R R
TN 11 R R R
MD 10 D D D
MN 10 D D D
WI 10 D D D
AZ 10 R R R
CO _9 D R R
AL _9 R R R
LA _9 R R R
KY _8 R R R
SC _8 R R R
CT _7 D D D
OR _7 D D D
IA _7 D R D
OK _7 R R R
AR _6 R R R
KS _6 R R R
MS _6 R R R
NM _5 D R D
NV _5 D R R
NE _5 R R R
UT _5 R R R
WV _5 R R R
HI _4 D D D
ME _4 D D D
RI _4 D D D
NH _4 D D R
ID _4 R R R
DC _3 D D D
DE _3 D D D
VT _3 D D D
AK _3 R R R
MT _3 R R R
ND _3 R R R
SD _3 R R R
WY _3 R R R

Answer the following questions.
1) Give the five number summary for the number of electoral votes. Also find the IQR, the high and low thresholds for outliers and note what states, if any, have an outlying number of electoral votes.

2) Find the relative frequencies for groups of states with the following values. Use the +/- .1% rule to decide how far to round the numbers.

a) 3 D's in the last three columns
b) 2 D's and 1 R in the last three columns
c) 1 D and 2 R's in the last three columns
d) 3 R's in the last three columns

3) Given the groups of states in part 2), instead of counting the number of states, count instead the number of electors in each group and give the frequency and relative frequency for the elector totals (Note: n = 538 for the total number of electors. To make the problem easier, we will ignore the fact that Maine and Nebraska do not use the winner take all system).

Answers in the comments. The box and whiskers for problem #1 is as follows.

Wednesday, February 4, 2009

Class notes for 2/4

More on the five number summary and outlying data. Consider the number of wins for each team in the NFC at the end of the 2008 season. In order, the list looks like this.

12, 12, 11, 10, 9, 9, 9, 9, 9, 8, 8, 7, 6, 4, 2, 0

The five number summary is as follows.

High: 12
Q3 : 9.5
Q2 : 9
Q1 : 6.5
Low: 0

Now we check to see if any of the numbers are outliers.

IQR = 9.5 - 6.5 = 3
Q3 + 1.5*IQR = 9.5 + 4.5 = 14 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 6.5 - 4.5 = 2 (the value 0 is a low outlier, the value 2 is just inside the threshold)

The Detroit Lions' total of no wins in 16 games was the low outlier in the NFC.

Here is the data for the AFC.

13, 12, 12, 11, 11, 11, 9, 8, 8, 8, 7, 5, 5, 4, 4, 2

The five number summary is as follows.

High: 13
Q3 : 11
Q2 : 8
Q1 : 5
Low: 2

Now we check to see if any of the numbers are outliers.

IQR = 11 - 5 = 6
Q3 + 1.5*IQR = 11 + 9 = 20 (no data higher than this, so no high outliers)
Q1 - 1.5*IQR = 5 - 9 = -4 (no data lower than this, so no low outliers)

Here are the two conferences' box and whisker plots shown above and below a scale that goes from 0 to 13.
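The steps above can be sketched in Python. One assumption to flag: the quartiles here are the medians of the lower and upper halves of the sorted list, and when the list has an odd number of values the middle value is left out of both halves. Other textbooks and software use slightly different quartile rules. The function names are my own.

```python
from statistics import median

def five_number_summary(data):
    """Return (low, Q1, Q2, Q3, high) using medians of the two halves."""
    values = sorted(data)
    if len(values) < 2:
        raise ValueError("need at least two values")
    half = len(values) // 2
    lower, upper = values[:half], values[-half:]
    return values[0], median(lower), median(values), median(upper), values[-1]

def outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    low, q1, q2, q3, high = five_number_summary(data)
    iqr = q3 - q1
    low_threshold, high_threshold = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [x for x in data if x < low_threshold or x > high_threshold]
    return low_threshold, high_threshold, flagged

nfc = [12, 12, 11, 10, 9, 9, 9, 9, 9, 8, 8, 7, 6, 4, 2, 0]
afc = [13, 12, 12, 11, 11, 11, 9, 8, 8, 8, 7, 5, 5, 4, 4, 2]

for name, wins in [("NFC", nfc), ("AFC", afc)]:
    print(name, five_number_summary(wins), outliers(wins))
```

A value exactly on a threshold is not flagged, which matches the treatment of the 2 win team in the NFC list.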

Parameters and statistics for categorical data. We have already defined some numbers that are associated with data sets, which we call parameters if the data set is a population and statistics if the data set is a sample. Statistics has the idea of reserved symbols, like N and n for size of the data set, population and sample respectively, and mu_x and x-bar for the mean. With categorical data, the number of units that share a value for a given categorical variable is called the frequency of the value, denoted by F(value) in a population or f(value) in a sample. Two data sets with the same variable can be compared, but if the size of the sets is significantly different, it is fairer to compare the relative frequency, which is the frequency divided by the size of the data set, denoted by p in a population and p-hat in a sample.

For example, let's look at Data Set #2 and the variable of Major. Here are the frequencies of each value that has at least one subject.

f(AH) = 6
f(BF) = 13
f(CESM) = 2
f(HE) = 4
f(SS) = 4
f(Other) = 11
f(Und) = 2

To get the relative frequencies, we divide the frequencies by n, which in this case is 42. This gives us the p-hat values, which we will usually express as a percent. In the following equations, the symbol ~= means "approximately equal to".

p-hat(AH) = 6/42 ~= 14%
p-hat(BF) = 13/42 ~= 31%
p-hat(CESM) = 2/42 ~= 5%
p-hat(HE) = 4/42 ~= 10%
p-hat(SS) = 4/42 ~= 10%
p-hat(Other) = 11/42 ~= 26%
p-hat(Und) = 2/42 ~= 5%

The sum of all the relative frequencies should give us 100%, but sometimes we get rounding error. In this case, the sum is 101%. When this happens, we round to the nearest tenth of a percent, hoping that the total will be either 99.9%, 100% or 100.1%. This is called the +/- .1% rule for rounding relative frequency.

p-hat(AH) = 6/42 ~= 14.3%
p-hat(BF) = 13/42 ~= 31.0%
p-hat(CESM) = 2/42 ~= 4.8%
p-hat(HE) = 4/42 ~= 9.5%
p-hat(SS) = 4/42 ~= 9.5%
p-hat(Other) = 11/42 ~= 26.2%
p-hat(Und) = 2/42 ~= 4.8%

The sum of the more closely rounded numbers is now 100.1%, which is close enough. With some sets of fractions that add up to 1, which in general are called probability distributions, it doesn't matter how far you go when rounding, you will never get exactly 100%, which is why we decide on this "close enough" method. For instance 1/3 + 1/3 + 1/3 = 1, but if we round 1/3 and then add them up, we get 99% or 99.9% or 99.99%, etc. depending on how far we round the numbers. Because these kinds of cases can show up, we decide on a fixed number we consider close enough, known in math as an epsilon value, and round all the numbers to the same number of digits after the decimal until the total is close enough.
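Here is a rough Python sketch of the "round one more place until the total is close enough" idea, using the Major frequencies from Data Set #2. The function name, the epsilon default and the cap on decimal places are assumptions for this example.

```python
freqs = {"AH": 6, "BF": 13, "CESM": 2, "HE": 4, "SS": 4, "Other": 11, "Und": 2}

def round_until_close(freqs, epsilon=0.1, max_places=4):
    """Round every percentage to the same number of places, adding a
    place until the total is within epsilon of 100."""
    n = sum(freqs.values())
    for places in range(max_places + 1):
        rounded = {value: round(100 * f / n, places) for value, f in freqs.items()}
        total = sum(rounded.values())
        # Tiny tolerance so floating point noise does not reject 100.1.
        if abs(total - 100) <= epsilon + 1e-9:
            break
    return places, rounded, round(total, places)

places, rounded, total = round_until_close(freqs)
print("decimal places used:", places)
print(rounded)
print("total:", total)
```

One caution: Python's built-in round sends exact ties to the nearest even digit, so a value sitting exactly on a tie can round differently than it would by hand.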

Example of a worst case scenario: If we have 16 values for a variable and all the values show up once, every number in our probability distribution is 1/16 = .0625 = 6.25%

Rounded to nearest percent: 1/16 ~= 6%. 16*6% = 96%, way too low
Rounded to nearest tenth of a percent: 1/16 ~= 6.3%. 16*6.3% = 100.8%, too high
Rounded to nearest hundredth of a percent: 1/16 = 6.25%. Now the total is exactly 100%, since there is no rounding at all.

Monday, February 2, 2009

Class notes for 2/2

Graphical representation of data

Frequency tables and dotplots: A frequency table is a list of values, which we will call x, followed by the number of times each value shows up on the list, which is called the frequency, denoted by f or f(x). A dotplot is a graph that represents a frequency table. For example, let's take the frequency table from the last class notes and turn it into a dotplot with a vertical scale.

x || f(x)

20 || 2
19 || 1
18 || 2
17 || 5
16 || 7
15 || 2
14 || 5
13 || 5
12 || 3
11 || 5
_9 || 2
_8 || 1
_7 || 2
_4 || 3

For the dot plot version, I type in the information in the Courier font, which is mono-spaced. This way the same number of characters on a line will take up the same width.

20| **
19| *
18| **
17| *****
16| *******
15| **
14| *****
13| *****
12| ***
11| *****
10|
9 | **
8 | *
7 | **
6 |
5 |
4 | ***
3 |
2 |
1 |

Notice that every value from 1 to 20 is included in the scale, even the ones that did not appear on the list. This is so the dotplot will be correctly spaced.
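A text dotplot like this one can also be generated from the frequency table. This is a small Python sketch; the dotplot function is made up for this example, and it fills in the missing values with empty rows as described above.

```python
freq = {20: 2, 19: 1, 18: 2, 17: 5, 16: 7, 15: 2, 14: 5,
        13: 5, 12: 3, 11: 5, 9: 2, 8: 1, 7: 2, 4: 3}

def dotplot(freq, low, high):
    """One line per value from high down to low, one * per occurrence."""
    lines = []
    for x in range(high, low - 1, -1):
        line = f"{x:<2}| " + "*" * freq.get(x, 0)
        lines.append(line.rstrip())
    return lines

for line in dotplot(freq, 1, 20):
    print(line)
```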

The five number summary, box and whiskers, IQR and outlying data. Examples of five number summaries and box and whisker plots can be seen in the previous class notes, but John Tukey added one more idea to the box and whisker definition, which would take into account the idea of outlying data, values that are far larger or far smaller than most of the other data. The question is how to define "far larger" and "far smaller", and we will see that in all the different systems we use, the cut-off point for these decisions is arbitrary. Tukey's arbitrary decision involves a number called the Inter-quartile Range or IQR for short. The formula is as follows:

IQR = Q3 - Q1

This is the distance between the values that define the cut-off points for the 75th percentile and the 25th percentile, so the middle half of the data lies inside that range. That is represented by the box, and Tukey decided that neither the left nor the right whisker should ever be more than one and a half times as long as the width of the box. If there is any data that is higher than Q3 + 1.5*IQR, it is represented by a dot. Likewise, any data that is lower than Q1 - 1.5*IQR is represented by a dot. The whiskers only extend to the highest data less than the high outlier threshold or the lowest data more than the low outlier threshold.

Example: Here are the number of times the most popular twelve stories on the Huffington Post website were read today, where the scale is in thousands of readers.

759 165 162 152 132 129 128 118 113 65 55 44 35

Obviously, the most popular story* is being read by more people than the next five stories combined, so we would expect it to be outlying data. Let's create the five number summary, the IQR and find the outlier thresholds.

High: 759
Q3: 157 (The average of 162 and 152)
Q2: 128.5 (The average of 129 and 128)
Q1: 60 (The average of 65 and 55)
Low: 35

IQR = 157 - 60 = 97

High outlier threshold = Q3 + 1.5*IQR = 157 + 1.5*97 = 157 + 145.5 = 302.5
Low outlier threshold = Q1 - 1.5*IQR = 60 - 1.5*97 = 60 - 145.5 = -85.5

As we might expect, there is no low outlying data, but the highest value of 759 is much higher than any of the rest of the data. The way this will be presented is by the upper whisker only extending to the value of 165, which is the highest non-outlying value, while the outlier will be represented by a dot. Here is a horizontal box and whisker plot of this data. You can click on this picture to get a larger version.

(* The most popular story the day this data was taken was a set of pictures of Olympic champion Michael Phelps smoking a bong in the dorm room of a friend.)
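The arithmetic in this example can be checked with a short Python sketch. It assumes the quartile method used in these notes, where Q1 and Q3 are the medians of the values below and above the overall median (other textbooks and programs use slightly different rules). The function name five_number_summary and the variable names are made up for illustration.

```python
from statistics import median


def five_number_summary(data):
    """Return (low, Q1, Q2, Q3, high), leaving the median out of both halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # With an odd count, skip the middle value when building the upper half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return values[0], median(lower), median(values), median(upper), values[-1]


# Readers in thousands, as listed in the example.
readers = [759, 165, 162, 152, 132, 129, 128, 118, 113, 65, 55, 44, 35]

low, q1, q2, q3, high = five_number_summary(readers)
iqr = q3 - q1
high_threshold = q3 + 1.5 * iqr
low_threshold = q1 - 1.5 * iqr

# Values outside the thresholds are drawn as dots.
outliers = [x for x in readers if x > high_threshold or x < low_threshold]
# The whiskers stop at the most extreme values that are not outliers.
inside = [x for x in readers if low_threshold <= x <= high_threshold]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Thresholds:", low_threshold, high_threshold)
print("Outliers:", outliers)
print("Whiskers run from", whisker_low, "to", whisker_high)
```

The sketch splits the sorted list by hand instead of calling a library quartile routine, because different routines use different conventions and might not match the hand calculation.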

Rounding: There are several different rules for rounding values, depending on what the data looks like. In general, when taking data from a list and finding the average, we round to one more decimal place than the data has. If numbers are given as whole numbers, we round the average to the nearest tenth. If numbers are given to the nearest penny, we round the average to the nearest tenth of a penny.
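As a small illustration of this rule, here is a Python sketch. The whole-number list is the set of Huffington Post reader counts from the example in these notes; the list of prices is invented purely to show the penny case.

```python
from statistics import mean

# Whole-number data: report the average to the nearest tenth.
readers = [759, 165, 162, 152, 132, 129, 128, 118, 113, 65, 55, 44, 35]
average_readers = round(mean(readers), 1)

# Data given to the nearest penny (hypothetical prices):
# report the average to the nearest tenth of a penny, three decimal places.
prices = [1.99, 2.49, 3.25]
average_price = round(mean(prices), 3)

print(average_readers)
print(average_price)
```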

I decided to round the numbers I gathered from the website to the nearest thousand and then write them in thousands. For this decision, I used the significant digit method. If you report too many digits, people can get caught up in the precision of the number instead of its general size. For big numbers reported in newspapers and magazines, most writers use one, two or three significant digits, which is to say how many digits are reported before the rest of the number is rounded to zeroes.

Here are some examples. Let's say that the bailout of the banks cost $763,922,575,873.22. That's too much information, and who knows if being precise makes any sense, since the final numbers might change.

One significant digit: The bank bailout costs $800 billion.
Two significant digits: The bank bailout costs $760 billion.
Three significant digits: The bank bailout costs $764 billion.

All of these answers are correct, depending on how many significant digits we decide to use. The more significant digits, the more precise the answer is, but using more than three or so significant digits for data presented to the public is probably too much information. If we are using numbers in calculations, we usually want to be as precise as possible, and only do the rounding at the end, when the number is presented as the answer.
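The three versions of the bailout figure can be reproduced with a short Python sketch. The helper name round_to_significant is made up for illustration; it relies on the fact that Python's built-in round accepts a negative number of decimal places to round to tens, hundreds and so on.

```python
from math import floor, log10


def round_to_significant(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0
    # Position of the leading digit: 11 for a number in the hundreds of billions.
    exponent = floor(log10(abs(value)))
    return round(value, digits - 1 - exponent)


bailout = 763_922_575_873.22

for digits in (1, 2, 3):
    rounded = round_to_significant(bailout, digits)
    print(digits, "significant digit(s):", f"${rounded / 1e9:.0f} billion")
```

The rounding is done once, on the full-precision number, which matches the advice to keep calculations precise and round only when presenting the answer.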

Order of operations: You may have learned the order of operations by remembering the word PEMDAS or the mnemonic device:

Please (parentheses)
Excuse (exponents)
My Dear (multiplication and division)
Aunt Sally (addition and subtraction)

With calculators like the TI-30X IIS or the TI-83 or TI-84, the display screen has a line which shows you the expression as you press the keys, so you can enter an entire expression just once and then press the [ENTER] key. The calculator understands the order of operations rules, but sometimes you still have to be careful how you type your formulas in.

A fraction is really two sets of parentheses. You might remember the formula for the slope of a line in algebra class as being the difference of the y values divided by the difference of the x values. Let's say the two points were (8, 7) and (4, 2). Because of the fraction sign we want to do 7-2 = 5 and 8-4 = 4 BEFORE we do the division, which gives a slope of 5/4 = 1.25. To make that clear to the calculator, we have to put parentheses around the numerator expression, press the division key, put parentheses around the denominator expression, then press [ENTER].
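The same thing happens in any language that follows the order of operations. Here is the slope calculation as a Python sketch, with and without the parentheses; the variable names are made up for illustration.

```python
# The two points from the example: (8, 7) and (4, 2).
x1, y1 = 8, 7
x2, y2 = 4, 2

# Parentheses around the numerator and the denominator: the subtractions
# happen first, then the division.
slope = (y1 - y2) / (x1 - x2)

# Without parentheses the division happens first: 7 - (2 / 8) - 4.
no_parentheses = y1 - y2 / x1 - x2

print(slope)
print(no_parentheses)
```

The second result is not the slope, which is the kind of mistake the parentheses are there to prevent.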

On your calculator, the square root is followed by an open parenthesis. In some texts, a formula that includes a square root might show two numbers being multiplied together by putting the numbers in parentheses and next to each other instead of using the multiplication sign. That can be a problem on the calculator, because the square root starts its own set of parentheses. Instead of typing (.3)(.7), on the calculator you should type .3*.7. The instructions for different equations are shown below.
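The advice about writing the multiplication sign carries over to programming languages as well. A minimal Python sketch, with p and q as made-up names for the two numbers:

```python
import math

p = 0.3
q = 0.7

# Write the multiplication sign explicitly inside the square root.
root = math.sqrt(p * q)

# Python has no implied multiplication: math.sqrt((p)(q)) would be read
# as an attempt to call p like a function and would raise a TypeError.
print(round(root, 4))
```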

On most calculators, you don't have to close every set of parentheses. The only calculator I know that is a stickler for parentheses agreeing with one another left and right is the TI-89. On the other hand, if you use a spreadsheet program, it will always complain if your left and right parentheses don't agree.
