Tuesday, April 22, 2014

Notes for April 22 and April 24




Confidence interval for standard deviation of population


The confidence interval formula for sigma_x is completely different from the formula for mu_x, and uses values from a new table, Table A-4, known as the chi-square table. Chi is pronounced like the chi in chiropractor, "kai", not "chee" or "chai".

Note that we don't have to test for normal distribution of the underlying data.

Let's give an example. If we have a set where n = 13 and s_x = 3.21, we use n - 1 = 12 as our degrees of freedom. Here is the line from Table A-4 for the row that corresponds to d.f. = 12.

d.f.__0.995___0.99___0.975___0.95___0.90___0.10____0.05___0.025____0.01___0.005
_12___3.074__3.571___4.404__5.226__6.304__18.549__21.026__23.337__26.217__28.299

The confidence interval is

s_x*sqrt((n-1)/chi²Right) < sigma_x < s_x*sqrt((n-1)/chi²Left)

where the values of chi²Right and chi²Left are taken from the table as follows.

90% confidence: The right value is from the 0.05 column, the left value is from the 0.95 column. (Note that 0.95 - 0.05 = 0.90 or 90%.)

95% confidence: The right value is from the 0.025 column, the left value is from the 0.975 column. (Note that 0.975 - 0.025 = 0.95 or 95%.)

99% confidence: The right value is from the 0.005 column, the left value is from the 0.995 column. (Note that 0.995 - 0.005 = 0.99 or 99%.)

In this instance, here are the confidence intervals for each of the standard percentages of confidence.

90% confidence: 3.21*sqrt(12/21.026) < sigma_x < 3.21*sqrt(12/5.226)
2.425 < sigma_x < 4.864

95% confidence: 3.21*sqrt(12/23.337) < sigma_x < 3.21*sqrt(12/4.404)
2.302 < sigma_x < 5.299

99% confidence: 3.21*sqrt(12/28.299) < sigma_x < 3.21*sqrt(12/3.074)
2.090 < sigma_x < 6.342
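If you would rather let software do the table lookups, here is a minimal Python sketch, assuming the scipy library is available (scipy is not part of this class, so treat this as an extra), that recomputes all three intervals from the chi-square inverse CDF instead of Table A-4.

from math import sqrt
from scipy.stats import chi2

n, s = 13, 3.21   # sample size and sample standard deviation
df = n - 1
for conf in (0.90, 0.95, 0.99):
    alpha = 1 - conf
    chi2_left = chi2.ppf(alpha / 2, df)        # e.g. 5.226 when conf = 0.90
    chi2_right = chi2.ppf(1 - alpha / 2, df)   # e.g. 21.026 when conf = 0.90
    low = s * sqrt(df / chi2_right)
    high = s * sqrt(df / chi2_left)
    print("%.0f%%: %.3f < sigma < %.3f" % (conf * 100, low, high))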


Like confidence intervals for other parameters such as proportion and average, the 99% interval is the largest, the 95% is nested inside the 99%, and the 90% is nested inside both the others. Unlike other confidence intervals, we get the endpoints by multiplying our statistic s_x by two positive numbers, one less than 1 and the other greater than 1. The other way this differs from the earlier confidence intervals is that our statistic is not at the exact center. For example, in the 90% interval, 3.21 - 2.425 = 0.785, while 4.864 - 3.21 = 1.654, so there is a greater distance to the high end of the interval than there is to the low end.

Goodness of fit

If we want to know if a coin is fair, we can do a two-tailed z-score test of the proportion of heads to tails, where the null hypothesis is that both should be 50%. For example, if I flip a coin 100 times and get 52 heads and 48 tails, that isn't very far off from 50-50. The test statistic would be

z = (.52-.5)/sqrt(.5*.5/100) = .4

A z-score of .4 corresponds to a proportion of .6554. Because this is a two-tailed test, we use the higher of the two proportions (heads or tails), and the corresponding table value has to be fairly high before we reject the null hypothesis.

Two-tailed 90% confidence: over .9500   
Two-tailed 95% confidence: over .9750   
Two-tailed 99% confidence: over .9950   

If instead we got 60 heads and 40 tails, the z-score would produce a much higher proportion.

z = (.60-.5)/sqrt(.5*.5/100) = 2.0

A z-score of 2.0 corresponds to a proportion of .9772. If we wanted proof to 90% confidence or 95% confidence, we would say this isn't a fair coin, rejecting the null hypothesis. If we want the proof to 99% confidence, the coin would have to be even more unfair, at least 63 to 37 in 100 flips (z = 2.6, which corresponds to a proportion of .9953).
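For those who like to check these numbers in software, here is a minimal Python sketch (again assuming scipy, which we don't use in class) that computes the z-score and its table proportion for several head counts.

from math import sqrt
from scipy.stats import norm

def coin_z(heads, flips=100, p0=0.5):
    # z-score for the observed proportion of heads against p0 = 50%
    return (heads / flips - p0) / sqrt(p0 * (1 - p0) / flips)

for heads in (52, 60, 63):
    z = coin_z(heads)
    # 52 -> z = 0.4, .6554; 60 -> z = 2.0, .9772; 63 -> z = 2.6, .9953
    print(heads, round(z, 2), round(norm.cdf(z), 4))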

We can't use this method to figure out if a six-sided die is fair, because we need to test that all six possibilities are coming up an equal number of times. For example, if we rolled the die 60 times, we would expect every number to show up exactly 10 times each. In the real world, that's not likely to happen, but we should expect something close.  Here is the result of an experiment done with the random number generator in Excel.

number:    1  2  3  4  5  6
expected: 10 10 10 10 10 10
observed: 14  9 10 10  7 10

Our test statistic will be the sum of (Observed - Expected)²/Expected. Here are the six values.

1st: (14-10)²/10 = 16/10 = 1.6 
2nd: (9-10)²/10 = 1/10 = 0.1
3rd: (10-10)²/10 = 0/10 = 0.0
4th: (10-10)²/10 = 0/10 = 0.0
5th: (7-10)²/10 = 9/10 = 0.9
6th: (10-10)²/10 = 0/10 = 0.0

The sum is 2.6. This is our test statistic. The degrees of freedom is the number of categories minus 1, which in this case is 5. Look on the right side of the chi-square table for the thresholds in the row d.f. = 5.

90% threshold: The 0.10 column, which has 9.236 in row 5
95% threshold: The 0.05 column, which has 11.071 in row 5
99% threshold: The 0.01 column, which has 15.086 in row 5

Our test statistic is much lower than even the 90% confidence threshold, so we fail to reject the null hypothesis. In other words, while the results weren't a perfect match, they were much too close to what we expected for us to conclude the die was unfair.
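Here is a minimal Python sketch of the same goodness-of-fit test (assuming scipy, an optional extra); the statistic matches the hand computation.

from scipy.stats import chisquare

observed = [14, 9, 10, 10, 7, 10]
expected = [10] * 6   # 60 rolls, six equally likely faces

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)  # 2.6, the same as the sum above
print(result.pvalue)     # about 0.76, nowhere near significance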

Independence of contingency tables

You may remember the idea of dependent probability, where p(A given B) would not be equal to p(A). It is possible to make a contingency table where p(A given B) = p(A) for any A and B, where one is a row and the other is a column. Here is an example of a team whose road record and home record are exactly the same.



___H || A|| Total
W |16||16|| 32
L |10||10|| 20
  |26||26|| 52 = grand total

 We see that p(Wins) = p(Wins GIVEN Home) = p(Wins GIVEN Away). This hypothetical win-loss record is at the same proportion whether on the road or at home.



When we look at a contingency table, it most likely won't be exactly independent, but we have a new test, using the chi-square table once again and a test statistic with the same formula as goodness of fit, the sum of (Observed - Expected)²/Expected. This time, the table we are given is the Observed, and we must create the Expected using the formula

(row total) * (column total) / (grand total).

Here is an example.

___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
  |30||30|| 60 = grand total



Now the home and road records aren't identical. We create the expected by keeping the row and column totals and blanking out the values inside.

___H || A|| Total
W |__||__|| 40
L |__||__|| 20
  |30||30|| 60 = grand total



In the box for Home Wins, the upper left-hand corner, we put


40 * 30/60 = 20. (Usually this is a decimal number, but I designed this example to come out to whole numbers to make it easier.)


___H || A|| Total
W |20||__|| 40
L |__||__|| 20
  |30||30|| 60 = grand total



We could use the (row total) * (column total) / (grand total) method three more times to fill in the rest, but we don't have to. We can simply subtract the number in the box we just filled in from the totals to get the rest of the top row and the left column.


___H || A|| Total
W |20||20|| 40
L |10||__|| 20
  |30||30|| 60 = grand total



Now it's easy to fill in the last box as well.


___H || A|| Total
W |20||20|| 40
L |10||10|| 20
  |30||30|| 60 = grand total



Now we get the four values of (Observed - Expected)²/Expected. Here is the Observed contingency table once again.


___H || A|| Total
W |22||18|| 40
L | 8||12|| 20
  |30||30|| 60 = grand total



Top left: (22-20)²/20 = 4/20 = 0.2
Top right: (18-20)²/20 = 4/20 = 0.2
Bottom left: (8-10)²/10 = 4/10 = 0.4
Bottom right: (12-10)²/10 = 4/10 = 0.4

Sum = 1.2

The degrees of freedom for a contingency table is (# of rows - 1) * (# of columns - 1), which in this case is (2-1)*(2-1) = 1*1 = 1. (Notice that we only had to fill in one box in the contingency table and the rest could be found by subtraction.) When degrees of freedom = 1, our thresholds are as follows.

90% confidence level (column 0.10): 2.706
95% confidence level (column 0.05): 3.841
99% confidence level (column 0.01): 6.635

Our test statistic of 1.2 does not let us reject the null hypothesis. This means that while the home and road records are different, we do not consider them to be statistically significantly different.
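For reference, here is a minimal Python sketch of the same independence test, assuming scipy; correction=False turns off a continuity correction scipy applies by default to 2x2 tables, so the statistic matches the hand sum.

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[22, 18],    # wins at home, wins away
                     [ 8, 12]])   # losses at home, losses away

stat, pvalue, df, expected = chi2_contingency(observed, correction=False)
print(stat, df)    # 1.2 with 1 degree of freedom
print(expected)    # [[20. 20.] [10. 10.]], just as we computed
print(pvalue)      # about 0.27, so we fail to reject H0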


Restriction of range and correlation

 
If we take two matched sets of data, we can create a trendline, yp = ax + b, which is also known as the predictor line, the line of regression, or the line of least squares. This example uses a set of cars, with miles per gallon highway on the x axis and weight on the y axis. Not surprisingly, lighter cars get better gas mileage in general and heavier cars get fewer miles per gallon. Here is the set of 18 matched pairs, written as (mpg_highway, weight).

(24, 3930)
(24, 3985)
(26, 3995)
(26, 4020)
(27, 3515)
(27, 3175)
(27, 3225)
(28, 3220)
(29, 3115)
(29, 3450)
(30, 3525)
(30, 3245)
(30, 3115)
(31, 2795)
(32, 3235)
(34, 2500)
(37, 2440)
(37, 2290)

The correlation coefficient R² isn't a perfect 1.000, but according to the table we use to check how strong the correlation is, for 18 points the 95% confidence threshold is 0.2190 and the 99% threshold is 0.3841. So a value of R² = 0.7643 for these two variables shows they have very strong correlation, and that isn't surprising. The correlation here is strongly negative (we can tell it's negative because the trendline slopes downward), which means lighter cars generally get better gas mileage than heavier cars. That makes sense.
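If you want to reproduce the fit without a calculator, here is a minimal Python sketch, assuming scipy, that fits the trendline to these 18 pairs and reports R².

from scipy.stats import linregress

mpg = [24, 24, 26, 26, 27, 27, 27, 28, 29,
       29, 30, 30, 30, 31, 32, 34, 37, 37]
weight = [3930, 3985, 3995, 4020, 3515, 3175, 3225, 3220, 3115,
          3450, 3525, 3245, 3115, 2795, 3235, 2500, 2440, 2290]

fit = linregress(mpg, weight)
print(fit.slope)        # negative, since heavier cars get fewer mpg
print(fit.rvalue ** 2)  # about 0.76, the R² quoted above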



The statistical concept that is new here is restriction of range. This says that if you only look at data where the x values are limited to a narrower range, the R² value will tend to go down. Let's say we only look at the first eight cars on the list, the ones that get under 30 mpg highway. We see that the trend is still downward and the slope is steeper, but as the rule predicts, R² will usually be less, and it is less here: 0.60711 instead of 0.7643. For 8 points, this level of correlation surpasses the 95% threshold of 0.4998, but does not surpass the 99% threshold of 0.6956.


We see the same tendency come true when we look only at the cars getting 30 mpg highway or better. Here there are ten data points and the R² value is 0.61952. The thresholds for 10 pairs are 0.3994 and 0.5852 for 95% and 99% confidence, respectively. By our measure, this set has better correlation than the heavier cars do, but not as good as the set of all 18 cars.



The Monty Hall Problem (or for younger people, The Game Show Problem)

Way back in the day, there was a game show called Let's Make a Deal and the host was Monty Hall. (As a survey in class showed, only one student was aware of this, watching it in re-runs on the cable channel The Game Show Network.) There were many different games played in a half hour using many rules, but one of the famous ones is called The Monty Hall Problem. We can call it The Game Show Problem or, more descriptively, One Brand New Car and Two Goats.

The rules of the game are as follows. There are three closed doors and the contestant must choose one. Behind two of the doors there are bad prizes and behind the last door there is a great prize. Usually but not always, the bad prizes were goats. Usually but not always, the great prize was a brand new car.

(Note from a different perspective: If you have a place to raise them, goats are excellent producers of meat and milk. The most famous cheese of Greece, known as Feta Cheese, is almost always made from goat's milk. A new car, on the other hand, usually means higher insurance rate and even though it's a prize, the winner has to pay the sales tax.)

Back to the game. After the contestant chooses a door, not knowing if the prize is good or bad, the game show host (Monty Hall) shows that there's a goat behind some unchosen door and asks the contestant if he (or she) wants to switch to another door. (In the three door game, switching means over to the only other door available.)

The math question here is this. Does it make sense to stick with your original door or switch?

Explaining the simplest situation, the three door, one car, two goat version: Okay, on the contestant's first choice, there is a 1/3 chance of getting the car and 2/3 probability of getting a goat. If the contestant doesn't switch, the chance of winning is 1/3.

If the contestant switches, we have to look at two possible situations.

1. The contestant picked the car in the first place. There is a 1/3 chance of this, and if the contestant switches, there are no other cars, so the contestant will lose.

2. The contestant picked a goat in the first place. There was a 2/3 chance this would happen. If the contestant's door has a goat and Monty shows the second goat, the door available to switch to must have the car, so if the contestant switches, the chance of winning is 2/3.

Generalizing the problem. Let's call the number of doors D, the number of bad prizes B and the number of good prizes G, where G + B = D.


Change the rules so that there can be three doors or more, with at least two bad prizes. (Whatever door the contestant picks, the host still needs a bad prize he can show behind some other door.)

If you don't switch, the odds are G/D.

After doing some algebra, we see that the odds if you switch are G/D * (D-1)/(D-2). (The switcher wins with probability [G/D * (G-1) + B/D * G]/(D-2), and since G - 1 + B = D - 1, this simplifies to G/D * (D-1)/(D-2).) The second fraction is of the form big/little, so it must be more than 1. What this means is that no matter how many goats (at least two, or the game doesn't work) and how many cars (at least one, or you can only pick goats), it always makes sense to switch.
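A simulation is a nice sanity check on that algebra. Here is a minimal Python sketch (an illustration, not anything from the show) that plays the generalized game many times.

import random

def play(D=3, G=1, switch=True, trials=100000):
    # D doors, G good prizes; returns the fraction of games won
    wins = 0
    for _ in range(trials):
        doors = [True] * G + [False] * (D - G)  # True means good prize
        random.shuffle(doors)
        pick = random.randrange(D)
        # the host opens a bad-prize door the contestant didn't pick
        host = random.choice([i for i in range(D)
                              if i != pick and not doors[i]])
        if switch:
            pick = random.choice([i for i in range(D)
                                  if i not in (pick, host)])
        wins += doors[pick]
    return wins / trials

print(play(switch=False))           # about 1/3
print(play(switch=True))            # about 2/3 = (1/3)(2/1)
print(play(D=5, G=2, switch=True))  # about 8/15 = (2/5)(4/3)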


 

Tuesday, April 8, 2014

Notes for April 8

The relations between variables in the line of regression

When we input the points into our calculators for two-variable statistics, there are a lot of numbers produced. Here is the list from the TI-30XIIs.

n
x-bar
s_x
sigma_x


y-bar
s_y
sigma_y

sum(x)
sum(x²)

sum(y)
sum(y²)
sum(xy)

a
b
r

On the take-home section of the second midterm, you see the messy formula for r that you need to use if you don't have a calculator to do it for you. There is a relationship between a (the slope) and r that goes as follows.

a = r × s_y/s_x

Remember that a is the slope of the trendline, which means the rise over run. The two standard deviations become our scaling factor and r decides if the line slopes positively (uphill from left to right) or negatively (downhill from left to right).

The formula yp = ax + b is in slope-intercept form, which means when x = 0, yp = b. The only x values we can plug into the formula are ones between the min and max values of x. We have a workaround for this, which is that the centroid (x-bar, y-bar) is always on the line. This means we can change the formula to point-slope form.
 
yp - y-bar = a(x - x-bar)

Getting rid of the parentheses it becomes

yp - y-bar = ax - a×x-bar

Adding y-bar to both sides we get

yp = ax - a×x-bar + y-bar

What this means is b = y-bar - a×x-bar
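Both relationships are easy to verify numerically. Here is a small Python sketch, assuming scipy and numpy are available, that checks a = r × s_y/s_x and b = y-bar - a×x-bar on the five-point set from the correlation notes below.

import numpy as np
from scipy.stats import linregress

x = np.array([1, 2, 3, 4, 6], dtype=float)
y = np.array([1, 2, 4, 4, 5], dtype=float)

fit = linregress(x, y)
a, b, r = fit.slope, fit.intercept, fit.rvalue

print(np.isclose(a, r * y.std(ddof=1) / x.std(ddof=1)))  # True
print(np.isclose(b, y.mean() - a * x.mean()))            # True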

Note: on the midterm and on the board, I gave the residuals as yp - y. In point of fact, it should be the other way around: y - yp. It's okay to use the formula given in class on the test.
 

Notes for April 1st and 3rd

Two averages from two populations


In the tests to see if the average of some numerical value is significantly different when comparing two populations, we need the averages, standard deviations and sizes of both samples. The score we use is a t-score and the degrees of freedom is the smaller of the two sample sizes minus 1.

Question: Do female Laney students sleep a number of hours each night different from male Laney students?

This uses data sets from a previous class. Here are the numbers for the students who submitted data, with the males listed as group #1. Again, let's assume a two-tailed test, since we don't have any information going in which should be greater, and let's do this test to 90% level of confidence.

With a test like this, we can arbitrarily choose which set is the first set and which is the second. Let's do it so x-bar1 > x-bar2. This way, our t-score will be positive, which is what the table expects.

H0: mu1 = mu2 (average hours of sleep are the same for males and females at Laney)

x-bar1 = 7.54
s1 = 1.47
n1 = 12


x-bar2 = 7.31
s2 = .94
n2 = 26

The degrees of freedom will be 12-1=11, and 10% in two tails gives us the thresholds of +/-1.796. Here is what to type into the calculator.

(7.54-7.31)/sqrt(1.47^2/12+.94^2/26)[enter]

0.4971...

2 tails__0.01_______0.02_____0.05_______0.10______0.20
11_______3.106_____2.718_____2.201_____1.796_____1.383


This number is less than every threshold, and so does not impress us enough to make us reject the null hypothesis. It's possible that larger samples would give us numbers that would show a difference, which if true would mean this example produced a Type II error, but we have no proof of that.
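Here is a minimal Python sketch, assuming scipy, that reproduces the t statistic from the summary statistics alone. One caution: scipy's unequal-variance test computes its own degrees of freedom (the Welch formula) rather than the "smaller sample size minus 1" rule we use in class, so the p-value it reports is based on a slightly different threshold.

from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(mean1=7.54, std1=1.47, nobs1=12,
                              mean2=7.31, std2=0.94, nobs2=26,
                              equal_var=False)
print(result.statistic)  # about 0.497, matching the hand computation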

Matched pairs


Was the price of silver in 2007 significantly different than it was in 2008?

Side by side, we have two lists of prices of silver, the highest price in a given month in 2007, followed by the highest price in that same month in 2008. Take the differences in the prices and find the average and standard deviation. The size of the list is 12, so the degrees of freedom are 11. If we assume we did not know which year showed higher prices when we started this experiment, it makes sense to make this a two-tailed test. Just for a change of pace, let us use the 90% confidence level.

Mo.___2007___2008
Jan.__13.45__16.23
Feb.__14.49__19.81
Mar.__13.34__20.67
Apr.__14.01__17.74
May___12.90__18.19
Jun.__13.19__17.50
Jul.__12.86__18.84
Aug.__12.02__15.27
Sep.__12.77__12.62
Oct.__14.17__11.16
Nov.__14.69__10.26
Dec.__14.76__10.66

Find the test statistic t, the threshold from Table A-3 and determine if we should reject H0, which in matched pairs tests is always that mu1 = mu2.

Answers in the comments.
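If you want to check your work in software first, here is a minimal Python sketch of the same matched-pairs test, assuming scipy; compare the absolute value of the statistic against the 1.796 threshold for d.f. = 11.

from scipy.stats import ttest_rel

high_2007 = [13.45, 14.49, 13.34, 14.01, 12.90, 13.19,
             12.86, 12.02, 12.77, 14.17, 14.69, 14.76]
high_2008 = [16.23, 19.81, 20.67, 17.74, 18.19, 17.50,
             18.84, 15.27, 12.62, 11.16, 10.26, 10.66]

result = ttest_rel(high_2007, high_2008)
print(result.statistic, result.pvalue)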

Correlation
When we have a data set, sometimes we collect more than one variable of information about the units. For example, in class surveys taken in previous classes, among the numerical variables were the height in inches, the GPA, the opinion about the difficulty of the class, age and average hours of sleep per night.

A question about two variables is if they are related to one another in some simple way. One simple way is correlation, which can be positive or negative. Here is a general definition of each.

Positive correlation between two numerical variables, call them x and y, means that the high values of x tend to be paired with the high values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the low values of y.

The variables x and y show negative correlation if the high values of x tend to be paired with the low values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the high values of y.

If we pick two variables at random, we do not expect to see correlation. We can write this as a null hypothesis, where the test statistic is r², the square of the correlation coefficient. The sign of low correlation is r² near 0. The values of rx,y are always between -1, which means perfect negative correlation, and +1, which means perfect positive correlation. This means 0 <= r² <= 1.

The second orange sheet gives us threshold numbers for the 99% confidence level and 95% confidence level for correlation given the number of points n. For instance, when n = 5, the thresholds are .7709 for 95% confidence and .9197 for 99% confidence. This splits up the numbers from 0 to 1 into three regions.

0 <= r² < .7709: We fail to reject the null hypothesis, which means the correlation is not strong.
.7709 < r² < .9197: We reject the null hypothesis with 95% confidence, but not 99% confidence. This is fairly strong correlation.
.9197 < r²: We reject the null hypothesis with 99% confidence; this is very strong correlation.

Just like with any hypothesis test, we should decide the confidence level before testing. This is a two-tailed test, because whether correlation is positive or negative, the relationships between number sets can often give us vital scientific information.

There is an important warning: Correlation is not causation. Just because two number sets have a relation, it doesn't mean that x causes y or y causes x. Sometimes there is a hidden third factor that is the cause of both of the things we are looking at. Sometimes, it's random chance and there is no causative agent at all.


Here is a set of five points, listed as (x,y) in each case.

(1,1)
(2,2)
(3,4)
(4,4)
(6,5)

As we can see, the points are ordered from low to high in both coordinates, so we expect some correlation. If we input the points into our calculator, we get a value for r (which is the same as rx,y) of .933338696..., and r² = .8712, which is strong positive correlation, but not very strong positive correlation. Assuming the 95% confidence level is good enough for us, we can use the a and b variables from our calculator to give us the equation of the line

yp = .797x + .649

This is called the predictor line (that's where the p comes from) or the line of regression or the line of least squares or the trendline. Any such line for a given data set meets two important criteria: it passes through the centroid (x-bar, y-bar), the center point of all the data, and it minimizes the sum of the squares of the residuals, where the residual of a point is y - yp. (That minimized sum of squares is where the name "least squares" comes from.)

Let's find the absolute values of the residuals for each of the five points, using the rounded values of a and b.

Point (1,1): |1 - .797*1 - .649| = 0.446
Point (2,2): |2 - .797*2 - .649| = 0.243
Point (3,4): |4 - .797*3 - .649| = 0.960
Point (4,4): |4 - .797*4 - .649| = 0.163
Point (6,5): |5 - .797*6 - .649| = 0.431

As we can see, the point (3,4) is farthest from the line, while the point (4,4) is the closest. The centroid (3.2, 3.2) is exactly on the line if you use the un-rounded values of a and b, and even using the rounded values, the centroid only misses the line by .0006.

In class, we used five points, but the last point was (5,6) instead of (6,5). This changes the numbers. rx,y goes up to .973328527..., which is above the 99% confidence threshold. The formula for the new predictor line is

yp = 1.2x - .2

Where we see the difference in these two different examples is in the residuals.

Point (1,1): |1 - 1.2*1 + .2| = 0
Point (2,2): |2 - 1.2*2 + .2| = 0.2
Point (3,4): |4 - 1.2*3 + .2| = 0.6
Point (4,4): |4 - 1.2*4 + .2| = 0.6
Point (5,6): |6 - 1.2*5 + .2| = 0.2

The closest point is now exactly on the line, which is a rarity, but even the farthest point is only .6 units away, closer than the farthest point was from the line with the lower correlation coefficient.
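Here is a tiny Python sketch that recomputes both residual tables from the rounded slopes and intercepts above.

def abs_residuals(points, a, b):
    # absolute residual |y - yp| for each (x, y) pair
    return [round(abs(y - (a * x + b)), 3) for x, y in points]

first = [(1, 1), (2, 2), (3, 4), (4, 4), (6, 5)]
second = [(1, 1), (2, 2), (3, 4), (4, 4), (5, 6)]

print(abs_residuals(first, 0.797, 0.649))  # largest is 0.960, at (3,4)
print(abs_residuals(second, 1.2, -0.2))    # largest is 0.6; (1,1) is exactly on the line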

As we get more points in our data set, the threshold that shows correlation strength gets lower. This way, a few outlier points do not completely ruin the chances of the data showing correlation, though sometimes strong outliers can mess up the data set so much that the correlation coefficient gets too close to zero for us to reject the null hypothesis that the two variables are not simply related.