Tuesday, March 24, 2009

Class notes for 3/23

So far, we have been using numerical data and coming up with confidence intervals for the average or standard deviation of the underlying population using formulas involving n, x-bar and sx, as well as multipliers from the Student's t-score tables (Table A-3) and the Chi square table (Table A-4). We use Student's t-scores instead of z-scores because the standard deviation of the sample (sx) might not be very close to the standard deviation of the population (sigmax), and the t-scores numerically deal with the extra uncertainty created by this variation.

When we use categorical data, the most important value we try to estimate is the proportion of a given category in the population, which we call p. We estimate it using the proportion from the sample, known as p-hat.

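Again we will create a confidence interval, but the formula for the standard deviation is very different. It uses p-hat and q-hat, where q-hat = 1 - p-hat is the proportion of the sample that is not in the category.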

sp-hat = sqrt(p-hat * q-hat/n)
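
If you would rather check the formula on a computer than a calculator, here is a minimal Python sketch. The function name proportion_std_error is made up for these notes, and the sample values are the male counts from data set #1 used in the example further down.

```python
import math

def proportion_std_error(p_hat, n):
    """Estimated standard deviation of p-hat: sqrt(p-hat * q-hat / n)."""
    q_hat = 1 - p_hat  # proportion of the sample NOT in the category
    return math.sqrt(p_hat * q_hat / n)

# Data set #1: 18 males out of 38 students
print(round(proportion_std_error(18 / 38, 38), 3))
```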

The confidence level multipliers (CLM) for xx% are taken from the z-score table (Table A-2) instead of the t-score table, because the standard deviation for the sample and the standard deviation for the population are expected to be relatively close to one another. The values for the CLMxx% are given on the first page of your yellow sheets in the lower left hand corner.

CLM90% = 1.645
CLM95% = 1.96
CLM99% = 2.575
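
These multipliers can also be recovered from the standard normal distribution instead of the table. The sketch below assumes Python 3.8 or later, which provides statistics.NormalDist; the helper name clm is made up for these notes. The computed 99% value comes out slightly above the 2.575 listed above, which is a rounding convention of the table rather than a disagreement.

```python
from statistics import NormalDist

def clm(confidence):
    """Two-sided z multiplier for a confidence level given as a fraction, e.g. 0.95."""
    tail = (1 - confidence) / 2           # area left in each tail
    return NormalDist().inv_cdf(1 - tail)  # z-score with that much area above it

for level in (0.90, 0.95, 0.99):
    print(f"CLM{level:.0%} ~= {clm(level):.3f}")
```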

Example: Consider data sets #1 and #2, and the proportion of males. Let's find the 95% confidence interval for the underlying population, which we will limit to students at Laney who take statistics.

Data set #1:
n = 38
f(males) = 18
p-hat(males) = 18/38 ~= .474
q-hat(males) = 1 - p-hat(males) ~= .526

.474 - 1.96*sqrt(.474*.526/38) < p < .474 + 1.96*sqrt(.474*.526/38)
.315 < p < .633

Given this sample of 38 students, we are 95% confident the percentage of male students taking statistics at Laney is between 31.5% and 63.3%.


Data set #2:
n = 42
f(males) = 12
p-hat(males) = 12/42 ~= .286
q-hat(males) = 1 - p-hat(males) ~= .714

.286 - 1.96*sqrt(.286*.714/42) < p < .286 + 1.96*sqrt(.286*.714/42)
.149 < p < .423

Given this sample of 42 students, we are 95% confident the percentage of male students taking statistics at Laney is between 14.9% and 42.3%.

Data sets #1 and #2 combined:
n = 80
f(males) = 30
p-hat(males) = 30/80 = .375
q-hat(males) = 1 - p-hat(males) = .625

.375 - 1.96*sqrt(.375*.625/80) < p < .375 + 1.96*sqrt(.375*.625/80)
.269 < p < .481


Given this sample of 80 students, we are 95% confident the percentage of male students taking statistics at Laney is between 26.9% and 48.1%.
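
The three intervals for the proportion of males can be reproduced with a short Python sketch. The function name proportion_ci is made up for these notes. It keeps p-hat unrounded, so an endpoint may differ from the hand calculations by .001, since those round p-hat to three places before plugging it in.

```python
import math

def proportion_ci(count, n, z=1.96):
    """Confidence interval for a proportion: p-hat +/- z * sqrt(p-hat * q-hat / n)."""
    p_hat = count / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# (label, number of males, sample size) for the three samples
samples = [("#1", 18, 38), ("#2", 12, 42), ("combined", 30, 80)]
for label, males, n in samples:
    low, high = proportion_ci(males, n)
    print(f"Data set {label}: {low:.3f} < p < {high:.3f}")
```

The default z=1.96 gives the 95% interval; pass z=1.645 or z=2.575 for 90% or 99%.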

Notice how much our intervals disagree with one another. This is because our best point estimates from the three sets are .474, .286 and .375. Also notice that the width of the 95% confidence interval tends to be smaller as n gets bigger. When the sample size is 38, the width of the confidence interval is .318. At n = 42, it is .274 wide. At n = 80, the width is .212. The most common way to make a confidence interval narrower is to increase the size of the sample.
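
To see the effect of sample size by itself, the sketch below holds p-hat fixed at .375 and changes only n. The function name ci_width and the sample sizes 20 and 320 are made up for illustration; they are not from our data sets. Because n sits under a square root, making the sample four times as large cuts the width in half.

```python
import math

def ci_width(p_hat, n, z=1.96):
    """Full width of the interval: twice the margin of error."""
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

# Same p-hat each time, only the sample size changes
for n in (20, 80, 320):
    print(f"n = {n:3d}: width = {ci_width(0.375, n):.3f}")
```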

There are two other ways to change the width. If you ask for a higher confidence level, the interval will get wider. If p-hat and q-hat are close to 50%, the confidence interval will be wider than if they are both far away from 50%.
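
Both effects can be checked numerically. In the sketch below the sample size of 80 is held fixed, the function name interval_width is made up for these notes, and the p-hat values .10, .30 and .50 are hypothetical, chosen only to show the pattern.

```python
import math

def interval_width(p_hat, n, z):
    """Full width of the confidence interval for a proportion."""
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

n = 80

# Higher confidence level means a bigger multiplier and a wider interval
for label, z in [("90%", 1.645), ("95%", 1.96), ("99%", 2.575)]:
    print(f"{label} level, p-hat = .375: width = {interval_width(0.375, n, z):.3f}")

# p-hat closer to 50% makes p-hat * q-hat bigger, so the interval is wider
for p_hat in (0.10, 0.30, 0.50):
    print(f"95% level, p-hat = {p_hat:.2f}: width = {interval_width(p_hat, n, 1.96):.3f}")
```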

Questions:

1. Use data set #1 to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

2. Use data set #2 to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

3. Use data sets #1 and #2 combined to find the 95% confidence interval for the proportion of 20-29 year olds among Laney statistics students. Write the sentence explaining the confidence interval.

Answers in the comments.

1 comment:

Prof. Hubbard said...

#1

n = 38
p-hat = 20/38 ~= .526
q-hat = 18/38 ~= .474

.526 - 1.96*sqrt(.474*.526/38) < p < .526 + 1.96*sqrt(.474*.526/38)
.367 < p < .685

Given this sample of 38 students, we are 95% confident the percentage of twentysomething students among all students taking statistics at Laney this term is between 36.7% and 68.5%.

#2

n = 42
p-hat = 26/42 ~= .619
q-hat = 16/42 ~= .381

.619 - 1.96*sqrt(.619*.381/42) < p < .619 + 1.96*sqrt(.619*.381/42)
.472 < p < .766

Given this sample of 42 students, we are 95% confident the percentage of twentysomething students among all students taking statistics at Laney this term is between 47.2% and 76.6%.

#3

n = 80
p-hat = 46/80 = .575
q-hat = 34/80 = .425

.575 - 1.96*sqrt(.575*.425/80) < p < .575 + 1.96*sqrt(.575*.425/80)
.467 < p < .683

Given this sample of 80 students, we are 95% confident the percentage of twentysomething students among all students taking statistics at Laney this term is between 46.7% and 68.3%.