Tuesday, April 8, 2014

Notes for April 1st and 3rd

Two averages from two populations


In tests to see if the average of some numerical value is significantly different when comparing two populations, we need the averages, standard deviations and sizes of both samples. The score we use is a t-score, and the number of degrees of freedom is the smaller of the two sample sizes minus 1.

Question: Do female Laney students sleep a number of hours each night different from male Laney students?

This uses data sets from a previous class. Here are the numbers for the students who submitted data, with the males listed as group #1. Again, let's assume a two-tailed test, since going in we don't have any information about which average should be greater, and let's do this test to the 90% level of confidence.

With a test like this, we can arbitrarily choose which set is the first set and which is the second. Let's do it so that x-bar1 > x-bar2. This way, our t-score will be positive, which is what the table expects.

H0: mu1 = mu2 (average hours of sleep are the same for males and females at Laney)

x-bar1 = 7.54
s1 = 1.47
n1 = 12


x-bar2 = 7.31
s2 = .94
n2 = 26

The degrees of freedom will be 12-1=11, and 10% in two tails gives us the thresholds of +/-1.796. Here is what to type into the calculator.

(7.54-7.31)/sqrt(1.47^2/12+.94^2/26)[enter]

0.4971...
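For readers following along in Python rather than on a calculator, the same arithmetic can be sketched as follows (the function name is mine; each standard deviation is squared and divided by its own sample size):

```python
from math import sqrt

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    """t-score for comparing two sample averages.
    Each s_i^2 is divided by its own sample size n_i."""
    return (xbar1 - xbar2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Sleep data: males are group #1, females group #2.
t = two_sample_t(7.54, 1.47, 12, 7.31, 0.94, 26)
print(round(t, 4))  # 0.4971
```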

2 tails__0.01_______0.02_____0.05_______0.10______0.20
11_______3.106_____2.718_____2.201_____1.796_____1.383


This number is less than every threshold, and so does not impress us enough to make us reject the null hypothesis. It's possible that larger samples would give us numbers that would show a difference, which if true would mean this example produced a Type II error, but we have no proof of that.

Matched pairs


Was the price of silver in 2007 significantly different than it was in 2008?

Side by side, we have two lists of prices of silver: the highest price in a given month in 2007, followed by the highest price in that same month in 2008. Take the differences in the prices and find their average and standard deviation. The size of the list is 12, so the degrees of freedom are 11. If we assume we did not know which year showed higher prices when we started this experiment, it makes sense to make this a two-tailed test. Again, let us use the 90% confidence level.

Mo.___2007___2008
Jan.__13.45__16.23
Feb.__14.49__19.81
Mar.__13.34__20.67
Apr.__14.01__17.74
May___12.90__18.19
Jun.__13.19__17.50
Jul.__12.86__18.84
Aug.__12.02__15.27
Sep.__12.77__12.62
Oct.__14.17__11.16
Nov.__14.69__10.26
Dec.__14.76__10.66

Find the test statistic t, the threshold from Table A-3 and determine if we should reject H0, which in matched pairs tests is always that mu1 = mu2.

Answers in the comments.

Correlation
When we have a data set, sometimes we collect more than one variable of information about the units. For example, in class surveys taken in previous classes, among the numerical variables were the height in inches, the GPA, the opinion about the difficulty of the class, age and average hours of sleep per night.

A question about two variables is if they are related to one another in some simple way. One simple way is correlation, which can be positive or negative. Here is a general definition of each.

Positive correlation between two numerical variables, call them x and y, means that the high values of x tend to be paired with the high values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the low values of y.

The variables x and y show negative correlation if the high values of x tend to be paired with the low values of y, the middle values of x tend to be paired with the middle values of y and the low values of x tend to be paired with the high values of y.

If we pick two variables at random, we do not expect to see correlation. We can write this as a null hypothesis, where the test statistic is r², the square of the correlation coefficient r (also written rx,y). The sign of low correlation is r² near 0. The values of rx,y are always between -1, which means perfect negative correlation, and +1, which means perfect positive correlation. This means 0 <= r² <= 1.

The second orange sheet gives us threshold numbers for the 99% confidence level and 95% confidence level for correlation given the number of points n. For instance, when n = 5, the thresholds are .7709 for 95% confidence and .9197 for 99% confidence. This splits up the numbers from 0 to 1 into three regions.

0 < r² < .7709: We fail to reject the null hypothesis, which means no strong correlation.
.7709 < r² < .9197: We reject the null hypothesis with 95% confidence, but not 99% confidence. This is fairly strong correlation.
.9197 < r²: We reject the null hypothesis with 99% confidence. This is very strong correlation.
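The three-region decision for n = 5 can be sketched as a small Python function (the thresholds are copied from the orange sheet; the function name is mine):

```python
def correlation_strength(r_squared, lo=0.7709, hi=0.9197):
    """Classify r^2 against the n = 5 thresholds from the orange sheet."""
    if r_squared > hi:
        return "very strong (reject H0 at 99% confidence)"
    if r_squared > lo:
        return "fairly strong (reject H0 at 95% confidence)"
    return "not strong (fail to reject H0)"

print(correlation_strength(0.8711))  # fairly strong (reject H0 at 95% confidence)
```

Remember that the thresholds change with n; for a different number of points, look up the correct row on the sheet first.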

Just like with any hypothesis test, we should decide the confidence level before testing. This is a two-tailed test, because whether correlation is positive or negative, the relationships between number sets can often give us vital scientific information.

There is an important warning: Correlation is not causation. Just because two number sets have a relation, it doesn't mean that x causes y or y causes x. Sometimes there is a hidden third factor that is the cause of both of the things we are looking at. Sometimes, it's random chance and there is no causative agent at all.


Here is a set of five points, listed as (x,y) in each case.

(1,1)
(2,2)
(3,4)
(4,4)
(6,5)

As we can see, the points are ordered from low to high in both coordinates, so we expect some correlation. If we input the points into our calculator, we get a value for r (which is the same as rx,y) of .933338696..., and r² = .8711, which is strong positive correlation, but not very strong positive correlation. Assuming the 95% confidence level is good enough for us, we can use the a and b variables from our calculator to get the equation of the line

yp = .797x + .649

This is called the predictor line (that's where the p comes from), the line of regression, the line of least squares, or the trendline. Any such line for a given data set meets two important criteria. It passes through the centroid (x-bar, y-bar), the center point of all the data, and it minimizes the sum of the squares of the residuals, where the residual y - yp is measured at every point.

Let's find the absolute values of the residuals for each of the five points, using the rounded values of a and b.

Point (1,1): |1 - .797*1 - .649| = 0.446
Point (2,2): |2 - .797*2 - .649| = 0.243
Point (3,4): |4 - .797*3 - .649| = 0.960
Point (4,4): |4 - .797*4 - .649| = 0.163
Point (6,5): |5 - .797*6 - .649| = 0.431

As we can see, the point (3,4) is farthest from the line, while the point (4,4) is the closest. The centroid (3.2, 3.2) is exactly on the line if you use the un-rounded values of a and b, and even using the rounded values, the centroid only misses the line by .0006.
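Readers without the calculator can check these numbers from the standard formulas; here is a sketch in Python (the variable names sxx, syy and sxy are mine):

```python
from math import sqrt

xs = [1, 2, 3, 4, 6]
ys = [1, 2, 4, 4, 5]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Sums of squares and cross products about the means
sxx = sum((x - xbar)**2 for x in xs)
syy = sum((y - ybar)**2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)   # correlation coefficient
b = sxy / sxx               # slope (the calculator's b)
a = ybar - b * xbar         # intercept (the calculator's a)

print(round(r, 4), round(r**2, 4))  # 0.9333 0.8711
print(round(b, 3), round(a, 3))     # 0.797 0.649

# Absolute residuals |y - yp| using the un-rounded line
for x, y in zip(xs, ys):
    print((x, y), round(abs(y - (a + b * x)), 3))
```

Because this uses the un-rounded a and b, the residuals it prints differ from the table above only in the third decimal place; the table was built from the rounded values .797 and .649.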

In class, we used five points, but the last point was (5,6) instead of (6,5). This changes the numbers. rx,y goes up to .973328527..., which is above the 99% confidence threshold. The formula for the new predictor line is

yp = 1.2x - .2

Where we see the difference in these two different examples is in the residuals.

Point (1,1): |1 - 1.2*1 + .2| = 0
Point (2,2): |2 - 1.2*2 + .2| = 0.2
Point (3,4): |4 - 1.2*3 + .2| = 0.6
Point (4,4): |4 - 1.2*4 + .2| = 0.6
Point (5,6): |6 - 1.2*5 + .2| = 0.2

The closest point is now exactly on the line, which is a rarity, but even the farthest away point is only .6 units away, closer than the farthest away on the line with the lower correlation coefficient.

As we get more points in our data set, we lower our threshold that shows correlation strength. This way, a few points that are outliers do not completely ruin the chances of the data showing correlation, though sometimes strong outliers can mess up the data set and the correlation coefficient gets so close to zero that we cannot reject the null hypothesis that the two variables are not simply related.

1 comment:

Prof. Hubbard said...

Here are the differences,

Mo.___2007___2008
Jan.__13.45__16.23 -2.78
Feb.__14.49__19.81 -5.32
Mar.__13.34__20.67 -7.33
Apr.__14.01__17.74 -3.73
May___12.90__18.19 -5.29
Jun.__13.19__17.50 -4.31
Jul.__12.86__18.84 -5.98
Aug.__12.02__15.27 -3.25
Sep.__12.77__12.62 +0.15
Oct.__14.17__11.16 +3.01
Nov.__14.69__10.26 +4.43
Dec.__14.76__10.66 +4.10

d-bar = -2.19 (absolute value 2.19)
s_d = 4.09

When Area in Two Tails = .10 and d.f. = 11, our threshold is 1.796. We will use the absolute value to get a positive t-score.

t = 2.19/4.09*sqrt(12) = 1.855

(If you don't round the answers first, it's 1.854.)
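For anyone who wants to verify this in something other than a calculator, here is a sketch in Python (the list entries are the differences above, 2007 minus 2008):

```python
from math import sqrt

# Monthly high price differences, 2007 minus 2008, Jan. through Dec.
d = [-2.78, -5.32, -7.33, -3.73, -5.29, -4.31,
     -5.98, -3.25, 0.15, 3.01, 4.43, 4.10]
n = len(d)
dbar = sum(d) / n                                    # average difference
s_d = sqrt(sum((x - dbar)**2 for x in d) / (n - 1))  # sample standard deviation
t = dbar / (s_d / sqrt(n))                           # matched pairs t-score

print(round(dbar, 2), round(s_d, 2), round(abs(t), 3))  # -2.19 4.09 1.854
```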

In either case, we are above the threshold for 90% confidence, so we reject the null hypothesis that the average monthly high prices were the same in the two years.