Mathematical Ramblings

Tuesday, September 25, 2007

Calculating Percentile Ranks

Looks like I've had this wrong in my head for a while now.

Let k be the percentile rank in question, divided by 100, let S be the data set in question, and let N be the number of readings in S, i.e. N = #(S).

Vic and I were saying this afternoon that in the case when (1/N)|k (i.e. k is a multiple of 1/N) then the (k*100)th percentile, P(k), should be the (k*N)th reading when they're ordered. A quick check in Excel shows immediately that this is not the case, as P(0.9) of the the set S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} is actually 91, not 90, the k*N = 0.9*10 = 9th reading.

This is less of a blow to our instincts when you consider P(0.5), the median, of the above data set S. Our procedure as stated above would suggest that the median value was the 5th reading, 50. But clearly the median of this set (if it could be said to exist) is half way between the two centre readings, i.e. 55. For comparison, 50 is actually the median P(0.5) of the set S = {10, 20, 30, 40, 50, 60, 70, 80, 90}.

Instead, we need to refer to the exact definition of P(k), which is the number below which exactly (k*100)% of the readings fall.

Sticking to the example of P(0.5) for the moment, let's consider S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, which has an even number of readings). There are 10 readings, so we want a value this is greater than exactly k*10 = 5 of those readings. Any value between 50 and 60 would do by that definition. We've taken the midpoint of the two values in our example above.

The calculation of P(0.5) with respect to S = {10, 20, 30, 40, 50, 60, 70, 80, 90} (note that there are an odd number of readings) is actually a little more subtle using this definition, as opposed to our usual definition for the median: "just take the middle one". We want the value below which exactly k*#(S) 0.5*9 = 4.5 readings lie. How can we get 0.5 of a reading to be below the figure we come up with?

For a discrete application, it is said that half of that reading X at a point lies below X, and the other half above it. So, with that in mind, 4.5 readings lie below 50 (including half of 50 itself), and 4.5 above (including the other half of 50).

Take another example, k = 0.5, S = {10, 20, 30, 40, 50, 50, 60, 70, 80, 90}, #(S) = 10. We want the value below which k*#(S) = 0.5*10 = 5 readings lie. 50 fits that definition, because half of the first 50 is below 50 itself, and half of the second one is too. (I've kind of cheated here, by simply justifying my answer, instead of showing how I came to it).

For a continuous application such as ours, Vic G suggests we consider the matter in the following way:

Take all the values in question and assign each of them a percentile ranking. To find a percentile that isn't given by that mapping, we do a linear interpolation between the two bounding percentiles.

By definition the lowest reading is the 0th percentile and the highest reading is the 100th percentile. The figures in between are mapped equally, at 1/(N-1) points, i.e. 11.1..% each.

e.g.

10 = 0%
20 = 11.1..%
30 = 22.2..%
40 = 33.3..%
50 = 44.4..%
60 = 55.5..%
70 = 66.6..%
80 = 77.7..%
90 = 88.8..%
100 = 100%

The 90th percentile is to be found between the 88.8..th (90) and 100th (100) percentiles, which we already have. From the 88.8..th we need to get to 90, that is, another 1.1.. percentile points. As 11.1... percentile points go 10 readings, 1.1.. percentile points will go 1 reading. That gives us a 90th percentile of 90+1 = 91.

To use another example, say we have test scores of S = {75, 76, 78.5, 79, 80, 83, 83, 90, 92}. The lowest reading is the 0th percentile and the highest reading is the 100th percentile. N=9, so each interval represents 1/(N-1) points, i.e. 1/8 = 12.5%.

75 = 0%
76 = 12.5%
78.5 = 25%
79 = 37.5%
80 = 50%
83 = 62.5%
83 = 75%
90 = 87.5%
92 = 100%

The 90th percentile lies between the 87.5th percentile (which is 90) and the 100th percentile (which is 92). We need to go another 2.5 points above the 87.5th percentile to get the the 90th.

Since (90-92) = 2 marks represents 12.5 percentile points, 1 percentile point is represented by (90-92)/12.5 = 0.16 marks. 2.5 points is therefore represented by 0.16*2.5 = 0.4 marks, making the 90th percentile 90 + 0.4 = 90.4 marks.

Note that we're assuming that any (theoretical) marks between 90 and 92 would be evenly distributed, which is a bit of an assumption, but it's the best we can do.

Labels: statistics

Monday, April 30, 2007

The Chomsky Hierarchy

A formal language consists of a finite set of terminal symbols, a finite set of non-terminal symbols, a finite set of production rules, and a start symbol. They can be organised into a hierarchical structure like so:

Type 0 - Unrestricted grammars
Unrestricted grammars generate all languages that can be recognised by a Turing machine, which can also be called recursively enumerable langauges.

Type 1 - Context-sensitive grammars
Context-sensitive grammars generate context-sensitive languages, and have rules of the form αA&beta &rarr αγβ with A non-terminal, α, β, γ strings of terminals and non-terminals. These languages can be recognised by a linear bounded automaton.

Type 2 - Context-free grammars
These have rules of the form A &rarr &gamma, with A non-terminal, γ a string of terminals and non-terminals. Languages generated by these grammars can be recognised by a non-deterministic push-down automaton. These grammars are the theoretical basis for the syntax of most programming languages.

Type 3 - Regular grammars
Regular languages can be decided by a finite-state automaton. They have rules of the form A &rarr &epsilon, A &rarr αB

Labels: formal grammars, NLP

Thursday, June 08, 2006

Cosets are either identical or disjoint

In the same way that equivalence classes are either identical or disjoint, so are cosets.

Consider a subgroup H of G, and elements a, b &isin G. Then H has right cosets Ha and Hb.

Either:

Ha &cap Hb = ∅, in which case there's nothing more to show

OR:

∃c &isin G s.t. c ∈ Ha &cap Hb

In the second case,

c = h₁a = h₂b for some h₁, h₂ &isin H

and so

a = h₁^-1h₂b = h₃b, where h₃ = h₁^-1h₂ &isin H

and

b = h₂^-1h₁a = h₄a, where h₄ = h₂^-1h₁ &isin H

Saturday, June 03, 2006

A coset of a subgroup has the same order as the subgroup.

This one is just for practice at constructing a proof. The result is trivial.

Consider a finite subgroup H of G with order n. Then

H = {h₁, h₂, ..., h_n}

and ∀g &isin G,

Hg = {h₁g, h₂g, ..., h_ng}

where h₁g, h₂g, ..., h_ng &isin G.

Assume ∃a,b<n, a &ne b such that

h_ag = h_bg

Then by the cancellation laws,

h_a = h_b

a contradiction. Therefore, Hg has n distinct elements. That is, the coset has the same order as the subgroup.

Friday, June 02, 2006

A closed subset of a group is a group

Let (G, ⋅) be a group, and let H ⊂ G. If H is closed under ⋅, then H is a subgroup.

Proof:

Let H have m elements, so that

H = {a₁, a₂, ..., a_m}

Then

Ha₁ = {a₁², a₂a₁, ..., a_ma₁}

By the cancellation laws, Ha₁ has m elements as well, and since H is closed and a₁ ∈ H, Ha₁ = H.

Since a₁ &isin H = Ha₁, ∃a_j &isin H such that a₁ = a_ja₁. Therefore, a_j = e (∈H). __(1)

Also, ∃a_k &isin H such that a_ka₁ = a_j. That is, a_k = a₁^-1 &isin H.

Similarly, by looking at Ha_i, for i = 2..m, a_i^-1 &isin H. __(2)

So, H:

is closed (hypothesis)
is associative (inherited)
has identity (1), and
contains inverses (2)

Therefore, H is a group, and so H is a subgroup of G.

Equivalence classes are either identical or disjoint

Let S be a non-empty set with equivalence relation ~, and ∀a ∈ S let

[a] = {x∈S | a ~ x}

be the equivalence class of a.

Then two equivalence classes [a] and [b] are either identical or disjoint.

Proof:

Consider [a] ∩ [b]. If [a] ∩ [b] = ∅, well and good. Otherwise, ∃c ∈ S such that c ∈ [a] ∩ [b].

That is, c ~ a and c ~ b. By transitivity of equivalence relations, a ~ b. If x ∈ [a] then x ∈ [b], so [a] ⊆ [b].

Similarly, and by reflexivity of equivalence relations, [b] ⊆ [a].

Since [a] ⊆ [b] and [b] ⊆ [a], [a] = [b].

Thus, equivalence classes are either identical or disjoint.

Wednesday, May 31, 2006

An attempt at writing maths in HTML

Some definitions:

A group is a non-empty set G with an operation ⋅ such that:

the set is closed under the operation (closure)
a⋅(b⋅c) = (a⋅b)⋅c (associativity)
∃e ∈ G such that ∀a ∈ G a⋅e = e⋅a = a (identity)
∀g ∈ G ∃g^-1 ∈ G such that g⋅g^-1 = g^-1⋅g = e (inverses)

A semi-group is a non-empty set with an operation such that

the set is closed under the operation (closure)
a⋅(b⋅c) = (a⋅b)⋅c (associativity)

Or, a semi-group is a non-empty set which is closed under an associative operation, or a semi-group is an associative groupoid.

A groupoid or magma is a non-empty set equipped with a single binary operation M × M → M. An operation is closed by definition (so I could have removed that axiom from each of the definitions above).

A monoid is a semi-group with identity.

A ring is a non-empty set S with two operations (called addition (+) and multiplication (×), however they might be defined) such that:

(S, +) is an Abelian group with identity 0.
(S, ×) is a monoid with identity 1.
∀a,b,c ∈ S, a(b+c) = ab + ac

A commutative ring is a ring where (S, ×) is commutative.