Tuesday, September 25, 2007

Calculating Percentile Ranks

Looks like I've had this wrong in my head for a while now.

Let k be the percentile rank in question, divided by 100, let S be the data set in question, and let N be the number of readings in S, i.e. N = #(S).

Vic and I were saying this afternoon that in the case when (1/N)|k (i.e. k is a multiple of 1/N) the (k*100)th percentile, P(k), should be the (k*N)th reading when they're ordered. A quick check in Excel shows immediately that this is not the case: P(0.9) of the set S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} is actually 91, not 90, which is the (k*N)th = (0.9*10)th = 9th reading.

This is less of a blow to our instincts when you consider P(0.5), the median, of the above data set S. Our procedure as stated above would suggest that the median value was the 5th reading, 50. But clearly the median of this set (if it could be said to exist) is half way between the two centre readings, i.e. 55. For comparison, 50 is actually the median P(0.5) of the set S = {10, 20, 30, 40, 50, 60, 70, 80, 90}.
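To make the mismatch concrete, here is a minimal Python sketch of the rule we had in our heads. The function name naive_percentile is made up for illustration, and the code assumes that k*N comes out as a whole number.

```python
import statistics

def naive_percentile(readings, k):
    """Return the (k*N)th reading (counting from 1) of the ordered data."""
    ordered = sorted(readings)
    position = round(k * len(ordered))  # assumes k*N is a whole number
    return ordered[position - 1]        # convert the 1-based position to a 0-based index

S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(naive_percentile(S, 0.9))  # 90, whereas Excel reports 91
print(naive_percentile(S, 0.5))  # 50
print(statistics.median(S))      # 55.0, halfway between the two centre readings
```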

Instead, we need to refer to the exact definition of P(k), which is the number below which exactly (k*100)% of the readings fall.

Sticking to the example of P(0.5) for the moment, let's consider S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, which has an even number of readings. There are 10 readings, so we want a value that is greater than exactly k*10 = 5 of those readings. Any value between 50 and 60 would do by that definition. We've taken the midpoint of the two values in our example above.

The calculation of P(0.5) with respect to S = {10, 20, 30, 40, 50, 60, 70, 80, 90} (note that there is an odd number of readings) is actually a little more subtle using this definition, as opposed to our usual definition for the median: "just take the middle one". We want the value below which exactly k*#(S) = 0.5*9 = 4.5 readings lie. How can we get 0.5 of a reading to be below the figure we come up with?

For a discrete application, the convention is that half of any reading equal to X lies below X, and the other half above it. So, with that in mind, 4.5 readings lie below 50 (including half of 50 itself), and 4.5 above (including the other half of 50).

Take another example, k = 0.5, S = {10, 20, 30, 40, 50, 50, 60, 70, 80, 90}, #(S) = 10. We want the value below which k*#(S) = 0.5*10 = 5 readings lie. 50 fits that definition, because half of the first 50 is below 50 itself, and half of the second one is too. (I've kind of cheated here, by simply justifying my answer instead of showing how I came to it.)
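The "half a reading" convention is easy to check with a few lines of Python. This is only a sketch of the counting rule described above; the function name readings_below is invented for the purpose.

```python
def readings_below(readings, x):
    """Count readings below x, treating each reading equal to x as half below."""
    strictly_below = sum(1 for r in readings if r < x)
    equal = sum(1 for r in readings if r == x)
    return strictly_below + 0.5 * equal

odd_set = [10, 20, 30, 40, 50, 60, 70, 80, 90]
tied_set = [10, 20, 30, 40, 50, 50, 60, 70, 80, 90]
even_set = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(readings_below(odd_set, 50))   # 4.5, i.e. 0.5 * 9
print(readings_below(tied_set, 50))  # 5.0, i.e. 0.5 * 10
print(readings_below(even_set, 55))  # 5.0, as for any value between 50 and 60
```

Note that this only verifies a candidate answer, in the same spirit as the cheat above; it doesn't find the percentile for us.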

For a continuous application such as ours, Vic G suggests we consider the matter in the following way:

Take all the values in question and assign each of them a percentile ranking. To find a percentile that isn't given by that mapping, we do a linear interpolation between the two bounding percentiles.

By definition the lowest reading is the 0th percentile and the highest reading is the 100th percentile. The readings in between are spaced evenly, 1/(N-1) apart; for our ten-reading set S that is 1/9, i.e. 11.1..% each.

e.g.

10 = 0%
20 = 11.1..%
30 = 22.2..%
40 = 33.3..%
50 = 44.4..%
60 = 55.5..%
70 = 66.6..%
80 = 77.7..%
90 = 88.8..%
100 = 100%

The 90th percentile is to be found between the 88.8..th (90) and 100th (100) percentiles, which we already have. From the 88.8..th we need to get to the 90th, that is, another 1.1.. percentile points. As 11.1.. percentile points correspond to a step of 10 in the readings (from 90 to 100), 1.1.. percentile points correspond to a step of 1. That gives us a 90th percentile of 90+1 = 91.
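The same procedure can be sketched in Python. This is one way of writing the interpolation down, under the assumption stated above that the lowest reading is the 0th percentile and the highest the 100th; the function name interpolated_percentile is my own.

```python
def interpolated_percentile(readings, k):
    """Linearly interpolate P(k), with the min at k = 0 and the max at k = 1."""
    ordered = sorted(readings)
    n = len(ordered)
    position = k * (n - 1)       # fractional 0-based position among the readings
    lower = int(position)        # index of the reading at or below that position
    fraction = position - lower  # how far along we are towards the next reading
    if lower + 1 >= n:           # k = 1 lands exactly on the highest reading
        return ordered[-1]
    step = ordered[lower + 1] - ordered[lower]
    return ordered[lower] + fraction * step

S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(round(interpolated_percentile(S, 0.9), 6))  # 91.0
print(round(interpolated_percentile(S, 0.5), 6))  # 55.0
```

The rounding in the print calls is only there to hide floating-point noise.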

To use another example, say we have test scores of S = {75, 76, 78.5, 79, 80, 83, 83, 90, 92}. The lowest reading is the 0th percentile and the highest reading is the 100th percentile. N=9, so each interval represents 1/(N-1) points, i.e. 1/8 = 12.5%.

75 = 0%
76 = 12.5%
78.5 = 25%
79 = 37.5%
80 = 50%
83 = 62.5%
83 = 75%
90 = 87.5%
92 = 100%

The 90th percentile lies between the 87.5th percentile (which is 90) and the 100th percentile (which is 92). We need to go another 2.5 points above the 87.5th percentile to get to the 90th.

Since (92-90) = 2 marks represents 12.5 percentile points, 1 percentile point is represented by (92-90)/12.5 = 0.16 marks. 2.5 points is therefore represented by 0.16*2.5 = 0.4 marks, making the 90th percentile 90 + 0.4 = 90.4 marks.
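As a cross-check, Python's standard library (3.8 and later) has statistics.quantiles, whose method='inclusive' option treats the lowest and highest readings as the 0th and 100th percentiles. As far as I can tell that is the same scheme as the one worked through here, but treat the equivalence as my assumption rather than a guarantee.

```python
import statistics

scores = [75, 76, 78.5, 79, 80, 83, 83, 90, 92]

# n=100 asks for the 99 cut points between percentiles; index 89 is the 90th.
cut_points = statistics.quantiles(scores, n=100, method='inclusive')
p90 = cut_points[89]
print(round(p90, 6))  # 90.4

# The same call on the earlier ten-reading set.
S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(round(statistics.quantiles(S, n=100, method='inclusive')[89], 6))  # 91.0
```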

Note that we're assuming that any (theoretical) marks between 90 and 92 would be evenly distributed, which is a bit of an assumption, but it's the best we can do.
