Monday, May 18, 2009

Class notes for 5/18: Regression and correlation

When we have a data set, sometimes we collect more than one variable of information about the units. For example, in our class survey, the numerical variables included height in inches, GPA, opinion about the difficulty of the class, age and average hours of sleep per night.

A natural question about two variables is whether they are related to one another in some simple way. One simple relationship is correlation, which can be positive or negative. Here is a general definition of each.

Positive correlation between two numerical variables, call them x and y, means that high values of x tend to be paired with high values of y, middle values of x with middle values of y, and low values of x with low values of y.

The variables x and y show negative correlation if high values of x tend to be paired with low values of y, middle values of x with middle values of y, and low values of x with high values of y.

If we pick two variables at random, we do not expect to see correlation. We can write this as a null hypothesis, where the test statistic is rx,y, the correlation coefficient. No correlation corresponds to rx,y = 0. The values of rx,y always lie between -1, which means perfect negative correlation, and +1, which means perfect positive correlation.
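
For anyone working without a calculator, here is a minimal Python sketch of the standard Pearson formula for rx,y; the function name pearson_r is just a label for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data; always in [-1, 1]."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations, plus the cross-product of deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)
```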

The seventh page of the yellow sheets gives us threshold numbers for correlation at the 95% and 99% confidence levels, given the number of points n. For instance, when n = 5, the thresholds are .878 for 95% confidence and .959 for 99% confidence. These thresholds split the numbers from -1 to 1 into five regions.

-1 <= rx,y <= -.959: very strong negative correlation
-.959 < rx,y <= -.878: strong negative correlation
-.878 < rx,y < .878: the correlation is not particularly strong, whether positive or negative
.878 <= rx,y < .959: strong positive correlation
.959 <= rx,y <= 1: very strong positive correlation
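
To make the regions concrete, here is a small sketch that classifies a value of r using these n = 5 thresholds; the function name and hard-coded cutoffs are just for this example.

```python
def correlation_strength(r):
    """Classify r against the n = 5 thresholds from the yellow sheets."""
    if abs(r) >= 0.959:
        strength = "very strong"
    elif abs(r) >= 0.878:
        strength = "strong"
    else:
        return "not particularly strong"
    sign = "positive" if r > 0 else "negative"
    return strength + " " + sign + " correlation"

print(correlation_strength(0.9))    # strong positive correlation
print(correlation_strength(-0.96))  # very strong negative correlation
```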

Just like with any hypothesis test, we should decide the confidence level before testing. This is a two-tailed test, because a strong relationship in either direction, positive or negative, can give us vital scientific information.

There is an important warning: correlation is not causation. Just because two number sets are related, it doesn't mean that x causes y or that y causes x. Sometimes there is a hidden third factor that is the cause of both of the things we are looking at. Sometimes it's random chance, and there is no causative agent at all.


Here is a set of five points, listed as (x,y) in each case.

(1,1)
(2,2)
(3,4)
(4,4)
(6,5)

As we can see, the points are ordered from low to high in both coordinates, so we expect some correlation. If we input the points into our calculator, we get a value for r (which is the same as rx,y) of .933338696..., which is strong positive correlation, but not very strong positive correlation. Assuming the 95% confidence level is good enough for us, we can use the a and b variables from our calculator to get the equation of the line

yp = .797x + .649

This is called the predictor line (that's where the p comes from), the line of regression, or the line of least squares. Any such line for a given data set meets two important criteria. It passes through the centroid (x-bar, y-bar), the center point of all the data, and it minimizes the sum of the squares of the residuals, where the residual of a point is y - yp. That is where the name "least squares" comes from.
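
As a sketch of what the calculator does under the hood, assuming the usual least-squares formulas (slope a = Sxy/Sxx, intercept b = y-bar - a*x-bar):

```python
xs = [1, 2, 3, 4, 6]
ys = [1, 2, 4, 4, 5]
n = len(xs)

x_bar = sum(xs) / n  # 3.2
y_bar = sum(ys) / n  # 3.2

# Sum of squared deviations in x and the cross-product of deviations.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

a = sxy / sxx          # slope: 0.79729..., rounds to .797
b = y_bar - a * x_bar  # intercept: 0.64864..., rounds to .649

# The centroid lies on the un-rounded line: prints 3.2 (up to float rounding).
print(a * x_bar + b)
```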

Let's find the absolute values of the residuals for each of the five points, using the rounded values of a and b.

Point (1,1): |1 - .797*1 - .649| = 0.446
Point (2,2): |2 - .797*2 - .649| = 0.243
Point (3,4): |4 - .797*3 - .649| = 0.960
Point (4,4): |4 - .797*4 - .649| = 0.163
Point (6,5): |5 - .797*6 - .649| = 0.431
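
These values can be double-checked with a quick loop over the points, using the same rounded a and b:

```python
points = [(1, 1), (2, 2), (3, 4), (4, 4), (6, 5)]
a, b = 0.797, 0.649  # rounded slope and intercept from above

for x, y in points:
    yp = a * x + b  # predicted y on the line
    print(f"Point ({x},{y}): |y - yp| = {abs(y - yp):.3f}")
```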

As we can see, the point (3,4) is farthest from the line, while the point (4,4) is the closest. The centroid (3.2, 3.2) is exactly on the line if you use the un-rounded values of a and b, and even using the rounded values, the centroid only misses the line by .0006.

In class, we used five points, but the last point was (5,6) instead of (6,5). This changes the numbers: rx,y goes up to .973328527..., which is above the 99% confidence threshold. The equation of the new predictor line is

yp = 1.2x - .2
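
As a check, Python 3.10 and later ship correlation and linear_regression functions in the standard statistics module, and they reproduce both numbers:

```python
from statistics import correlation, linear_regression

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 4, 4, 6]

print(correlation(xs, ys))        # 0.97332..., above the .959 threshold
print(linear_regression(xs, ys))  # slope ~ 1.2, intercept ~ -0.2
```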

Where we see the difference between these two examples is in the residuals.

Point (1,1): |1 - 1.2*1 + .2| = 0
Point (2,2): |2 - 1.2*2 + .2| = 0.2
Point (3,4): |4 - 1.2*3 + .2| = 0.6
Point (4,4): |4 - 1.2*4 + .2| = 0.6
Point (5,6): |6 - 1.2*5 + .2| = 0.2

The closest point is now exactly on the line, which is a rarity, but even the farthest point is only .6 units from the line, closer than the farthest point in the example with the lower correlation coefficient.

As we get more points in our data set, the thresholds that indicate correlation strength get lower. This way, a few outlying points do not completely ruin the chances of the data showing correlation, though sometimes strong outliers can push the correlation coefficient so close to zero that we cannot reject the null hypothesis that the two variables are not related in any simple way.
