Friday, June 26, 2009

Five number summary and box and whiskers plot

The five number summary does not list all the data, but instead lists the low value, Q1, Q2, Q3 and the high value. The Q's stand for quartiles, which means Q1 is the split for the low 25% of the data, Q2 is the median and Q3 is the split for the high 25% of the data. You can think of Q1 and Q3 as the medians of the low half and the high half, respectively. The interquartile range or IQR is the distance from the first quartile to the third, which is Q3 - Q1.

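Let's look at the list. Each row gives the tens digit, then the ones digits of every entry in that row, so 7 | 678 stands for 76, 77 and 78.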

7 | 678
7 | 00111224
6 | 556666666688999
6 | 000122334

With 35 entries, the middle entry is the 18th, counting either from top to bottom or from bottom to top. That entry is one of the 6s in the third row, so Q2 = 66.

There are 17 entries in the top half and 17 in the low half, so the high and low quartiles are the 9th entry from the top and the 9th entry from the bottom, respectively. That gives Q1 = 64 and Q3 = 71.

The five number summary is as follows:

High = 78
Q3 = 71
Q2 = 66
Q1 = 64
Low = 60
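
For readers who want to check the counting, here is a minimal Python sketch of the same calculation. It assumes the convention used above: for an odd number of entries, the middle entry is left out of both halves, and Q1 and Q3 are the medians of those halves. The names rows, median and five_number_summary are made up for this illustration, and other quartile conventions can give slightly different answers.

```python
# Stem-and-leaf rows from the post: (tens digit, string of ones digits).
rows = [(7, "678"), (7, "00111224"), (6, "556666666688999"), (6, "000122334")]

# Expand every leaf into a full value and sort from low to high.
data = sorted(10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves)


def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(sorted_values):
    """Return (low, Q1, Q2, Q3, high) using medians of the two halves.

    Assumption: with an odd count, the middle entry belongs to neither half.
    """
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + 1:] if n % 2 == 1 else sorted_values[half:]
    return (
        sorted_values[0],
        median(lower),
        median(sorted_values),
        median(upper),
        sorted_values[-1],
    )


summary = five_number_summary(data)
print(len(data))  # 35
print(summary)    # (60, 64, 66, 71, 78)
```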

Now we draw the box and whiskers in three steps, shown in the picture.


The first step is to draw the box between the values for the first and third quartiles, with a dotted line at the second quartile. We can also put dots to mark the high and low values.

The next step is computing the interquartile range, IQR = Q3 - Q1, which in this case is 71 - 64 = 7. We mark boundaries called thresholds, which sit 1.5 times the IQR above the third quartile and 1.5 times the IQR below the first quartile. In our case, these would be at 71 + 1.5*7 = 81.5 and 64 - 1.5*7 = 53.5.
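
The threshold arithmetic is short enough to sketch in a few lines of Python. The function name thresholds and the parameter k are made up for this illustration; k = 1.5 is the multiplier used in this post.

```python
def thresholds(q1, q3, k=1.5):
    """Return (low threshold, high threshold) for the given quartiles."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


low_threshold, high_threshold = thresholds(64, 71)
print(low_threshold, high_threshold)  # 53.5 81.5
```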

The third step is checking for outliers. If all the data is inside the thresholds, the whiskers extend all the way to the high value on the right and the low value on the left. (Box and whiskers can also be drawn vertically, in which case right and left become top and bottom.) If there is data outside, those values count as outliers and the whiskers are drawn to the highest data value inside the high threshold and the lowest data value inside the low threshold. Once this has been done, we erase the thresholds.
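
The outlier check can be sketched the same way. This is only an illustration under a few assumptions: the name whisker_ends is made up, values that land exactly on a threshold are treated as inside, and the quartiles are passed in rather than recomputed. For the data in this post the thresholds are 64 - 1.5*7 = 53.5 and 71 + 1.5*7 = 81.5, so nothing falls outside and the whiskers run from 60 to 78.

```python
# Stem-and-leaf rows from the post: (tens digit, string of ones digits).
rows = [(7, "678"), (7, "00111224"), (6, "556666666688999"), (6, "000122334")]
data = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]


def whisker_ends(values, q1, q3, k=1.5):
    """Return (low whisker end, high whisker end, sorted list of outliers)."""
    iqr = q3 - q1
    low_threshold = q1 - k * iqr
    high_threshold = q3 + k * iqr
    # Assumption: a value exactly on a threshold counts as inside.
    inside = [v for v in values if low_threshold <= v <= high_threshold]
    outliers = sorted(v for v in values if v < low_threshold or v > high_threshold)
    return min(inside), max(inside), outliers


print(whisker_ends(data, q1=64, q3=71))  # (60, 78, [])
```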
