Tuesday, September 29, 2015

Notes for September 29 and October 1


Here are links to posts about how to get average and both standard deviations on the TI-30XIIs.

Here is a link to a post about t-scores and their use in confidence intervals.

Here is a link to the posts about Confidence of Victory.

A list of the major things that can go wrong with samples.

Too small a sample size. A very small sample will have huge confidence intervals for the proportions of categorical variables, which should be a red flag for anyone reading the results. But often, people mention n and the confidence intervals only as afterthoughts, and many papers have been published and quoted in much larger publications before anyone noticed how small the samples were.
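
To see how quickly the interval widens as n shrinks, here is a minimal Python sketch. It assumes the usual normal-approximation margin of error for a proportion, z * sqrt(p(1 - p)/n), with z = 1.96 for 95% confidence. The function name and the sample sizes are made up for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Illustrative sample sizes, not taken from any real study.
for n in (25, 100, 1000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:4d}: 50% plus or minus {100 * moe:.1f} percentage points")
```

With only 25 subjects, the interval is wide enough that the result says very little about the population.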

Convenience sampling: Our class could be considered a sample of students at Laney, but is it representative? It's convenient for me to get information from the students, but groups of students who would be ignored include:

1. Students whose majors do not require statistics
2. Students who primarily take night classes or distance learning classes
3. Students who primarily take Monday and Wednesday classes

It's not inevitable that excluding these groups would change the proportions of males and females, for example, but a convenience sample is always suspect.

Self-selection. Internet polls on websites might ask you about politics or sports or entertainment. You are under no compulsion to answer the questions and you do so only because the topic interests you. Instead of being convenient for the researcher, self-selection polls are convenient for the responders. Almost every such poll will have a disclaimer stating "not a scientific poll", and the numbers aren't a good starting point for using statistical methods to find out about the underlying population.

Leading questions. In polling data for opinions, leading questions can create bias.

Under-sampling and over-sampling of demographic groups. I have been following polls for several elections now and in nearly every poll, someone will complain that some group is misrepresented: too many conservatives or too many liberals, too many men or too many women, not enough people from outside major cities or too many, or some age group that is under- or over-represented.

No sample is completely perfect, but honest sampling companies do work at using acceptable methods.
 

Tuesday, September 15, 2015

Notes for September 15th and 17th

Link to a post about the shared birthday problem.

Link to a post about the Game Show problem, a.k.a. the Monty Hall problem. (Many topics are discussed in that post; this one is at the bottom.)

Probability of r successes in n dependent trials using sampling without replacement, which is like drawing cards from a deck.
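
As a rough illustration of sampling without replacement, here is a short Python sketch. It assumes the standard counting formula (ways to pick the successes times ways to pick the failures, divided by the total number of possible hands). The function name and the aces example are my own choices, not from the linked post.

```python
from math import comb

def prob_without_replacement(successes_in_pop, pop_size, draws, r):
    """P(exactly r successes) when drawing `draws` items without replacement."""
    ways_success = comb(successes_in_pop, r)
    ways_failure = comb(pop_size - successes_in_pop, draws - r)
    return ways_success * ways_failure / comb(pop_size, draws)

# Illustrative example: exactly 2 aces in a 5-card hand from a standard 52-card deck.
p_two_aces = prob_without_replacement(4, 52, 5, 2)
print(f"P(exactly 2 aces in 5 cards) = {p_two_aces:.4f}")
```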

A new use for independent probability: Missing a rare side effect. Let us consider a drug company running tests on a new drug. The tests are designed to check the drug's effectiveness in comparison to other drugs on the market, but they are also designed to see if the subjects experience side effects. If you've ever listened to a drug commercial on TV, you know that some side effects can be quite dangerous. If the probability of a side effect is p and the size of the sample is n, the expected value for the frequency is np.

Example: Let's say the drug company is testing a new drug on 500 subjects. Let's also stipulate there is a fairly rare side effect that we should see in 1% of the population, so p = .01. 500 * .01 = 5, so the expected number of people with the side effect in the sample is 5. Since the expected value is a whole number, this means the most likely number of people with the side effect is 5. Let's do the binomial distribution for 4, 5 and 6, rounding to four places after the decimal.

Probability of exactly 4 people out of 500 having the side effect:

500 nCr 4 * .01 ^ 4 * .99 ^ 496 = .1760 or 17.60%

Probability of exactly 5 people out of 500 having the side effect:

500 nCr 5 * .01 ^ 5 * .99 ^ 495 = .1764 or 17.64%

Probability of exactly 6 people out of 500 having the side effect:

500 nCr 6 * .01 ^ 6 * .99 ^ 494 = .1470 or 14.70%

As we can see, the probability of 5 out of 500 is slightly greater than that of 4 out of 500, and about 3 percentage points higher than that of 6 out of 500. No other outcome is more likely than 5 out of 500.
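
For anyone who wants to check these values away from the calculator, here is a minimal Python sketch of the same binomial formula. The function name is my own; `math.comb` plays the role of the nCr key.

```python
from math import comb

def binomial_prob(n, r, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 500, 0.01
for r in (4, 5, 6):
    print(f"P({r} out of {n}) = {binomial_prob(n, r, p):.4f}")
```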

Here's a different question: what are the chances of 0 out of 500? The reason to ask is that if the trial misses the side effect completely and the drug goes to market, the company could face a lot of lawsuits it didn't expect when the side effect starts showing up in the much larger population of patients taking the drug.

Probability of 0 people out of 500 having the side effect:
500 nCr 0 * .01 ^ 0 * .99 ^ 500 = .0066 or 0.66%

(Note: when we have "n choose 0" the answer is always 1, and likewise any non-zero number raised to the power of 0 is 1. For the zero-successes case only, we can just type in the last term, (1 - p)^n.)
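
The shortcut in the note can be sketched in a couple of lines of Python. The function name is hypothetical; it simply computes (1 - p)^n and compares it with the full binomial expression for r = 0.

```python
from math import comb

def prob_of_missing(n, p):
    """Probability that none of n independent subjects shows the side effect."""
    return (1 - p) ** n

n, p = 500, 0.01
shortcut = prob_of_missing(n, p)
full_formula = comb(n, 0) * p**0 * (1 - p)**n  # the first two factors are both 1

print(f"shortcut:     {shortcut:.4f}")
print(f"full formula: {full_formula:.4f}")
```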

Because the sample was large enough and the side effect was not all that rare, the odds of a sample missing this side effect are relatively low. But what if the side effect were rarer, say 1 in 400, which is the decimal .0025? This changes the numbers, of course. The expected value is now 500 * .0025 = 1.25, which means the most likely event should be either 1 person or maybe 2 people showing the side effect. Let's look at 0, 1 and 2 people having the side effect.

Probability of exactly 0 people out of 500 having the side effect:
500 nCr 0 * .0025 ^ 0 * .9975 ^ 500 = .2861 or 28.61%

Probability of exactly 1 person out of 500 having the side effect:
500 nCr 1 * .0025 ^ 1 * .9975 ^ 499 = .3585 or 35.85%

Probability of exactly 2 people out of 500 having the side effect:
500 nCr 2 * .0025 ^ 2 * .9975 ^ 498 = .2242 or 22.42%

So the most likely event is to have one person showing the side effect, which will happen about 36% of the time. But the next most likely event is not 2 out of 500 but 0 out of 500, which happens over 28% of the time. 1 in 400 people showing a side effect might not seem that high, but a successful drug can be given to hundreds of thousands of patients, possibly more, and having 1 in every 400 showing a very bad side effect could get very expensive for the company.
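
One way to confirm the ordering of these outcomes is to compute several of them and sort by probability. This is only a sketch; the helper name and the choice to stop at 5 people are my own.

```python
from math import comb

def binomial_prob(n, r, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 500, 0.0025
probs = {r: binomial_prob(n, r, p) for r in range(6)}

# Sort the outcomes from most likely to least likely.
ranked = sorted(probs, key=probs.get, reverse=True)
for r in ranked:
    print(f"{r} out of {n}: {100 * probs[r]:.2f}%")
```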

Here are some practice problems. Assume the sample size is n = 1000 and we are interested in 0 people showing the side effect. Round the answers to the nearest tenth of a percent.

a) the side effect shows up in 1 in 500 patients

b) the side effect shows up in 1 in 1,000 patients

c) the side effect shows up in 1 in 1,500 patients

Answers in the comments.

Thursday, September 3, 2015

Homework 2 (due September 8)
Last four points of homework and older posts about z-score and lookup tables.


4 points for Homework 2

Here is the list of wins for the 30 NBA teams in the 2014-15 season, including playoff wins. Find the z-score for the highest number of wins and the lowest number of wins (note: the list is not in order) and determine if these two numbers of wins count as outliers by the following method.

If z >= 3, the value is very unusually high.
If z >= 2, the value is unusually high.
If z <= -2, the value is unusually low.
If z <= -3, the value is very unusually low. 
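
The rule above can be sketched in Python as follows. The numbers in the example are made up (mean 50, standard deviation 10), not the homework data, and the label "not unusual" for everything in between is my own wording. Note that the tests for 3 and -3 come first, since any z of 3 or more also satisfies z >= 2.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

def classify(z):
    """Apply the outlier rule, checking the more extreme cutoffs first."""
    if z >= 3:
        return "very unusually high"
    if z >= 2:
        return "unusually high"
    if z <= -3:
        return "very unusually low"
    if z <= -2:
        return "unusually low"
    return "not unusual"

# Hypothetical data set with mean 50 and standard deviation 10.
z = round(z_score(75, 50, 10), 2)
print(z, classify(z))
```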

List

68 76 56 49 52 43 40 40 38 37 33 32 25 18 17 83 65 62 52 61 58 51 45 45 39 38 30 29 21 16 

Average = 43.97
Standard deviation = 16.99
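
If you want to double-check the average and standard deviation without the TI-30XIIs, here is a small sketch using Python's standard `statistics` module. It prints both standard deviations; the value 16.99 given above appears to be the one that divides by n (the population version) rather than by n - 1.

```python
import statistics

wins = [68, 76, 56, 49, 52, 43, 40, 40, 38, 37, 33, 32, 25, 18, 17,
        83, 65, 62, 52, 61, 58, 51, 45, 45, 39, 38, 30, 29, 21, 16]

mean = statistics.mean(wins)
pop_sd = statistics.pstdev(wins)    # divides by n
sample_sd = statistics.stdev(wins)  # divides by n - 1

print(f"n = {len(wins)}")
print(f"average = {mean:.2f}")
print(f"population standard deviation = {pop_sd:.2f}")
print(f"sample standard deviation = {sample_sd:.2f}")
```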

Round z-score to the nearest hundredth.

z(high value) = ______________

Is this any kind of outlier, and if so, which one? ____________


z(low value) = ______________
Is this any kind of outlier, and if so, which one? ____________

===============

If you are looking for more information on look-up tables, follow this link to a previous post.

Here is a link to some practice problems for raw scores to z-scores to proportions and vice versa.