Statistics: computing quantiles with Python

Nov. 30, 2009 Visit the link Quantile function in R for more information. It is said that the quantile function we implemented cannot do the type #2. We will investigate this when we have the time.

I used to open R just to compute summary statistics on data. R is big and so is Python. But I find it strange when people finds Python lacking on some statistical features, other people say just use R. You can interface to R using Rpy. But for a few calculations, it is just much more efficient when you have Python code handy for doing statistics. Scipy comes with a stats module but there is no function for computing quantiles from an array of data.

Inspired by R, and helped by the clear documentation from the Mathematica page, I am able to replicate the quantile function in Python. Here it is (Again, I find it crazy that my Python indents are not being shown! [Sep. 29, 2009, fixed by using CodeBox]

Given an array x, and a quantile value from from 0 to 1.0, the function will return a value of x which may not be an element of the array such that P(X <= x_q) < q, i.e. q * 100 percent of the data are less than or equal to x_q. Quantiles are similar to percentiles.

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
"""
File    quantile.py
Desc    computes sample quantiles 
Author  Ernesto P. Adorio, PhD.
        UPDEPP (U.P. at Clarkfield)
Version 0.0.1 August 7. 2009
"""
 
from math import modf, floor
 
def quantile(x, q,  qtype = 7, issorted = False):
    """
    Args:
       x - input data
       q - quantile
       qtype - algorithm
       issorted- True if x already sorted.
 
    Compute quantiles from input array x given q.For median,
    specify q=0.5.
 
    References:
       http://reference.wolfram.com/mathematica/ref/Quantile.html
       http://wiki.r-project.org/rwiki/doku.php?id=rdoc:stats:quantile
 
    Author:
	Ernesto P.Adorio Ph.D.
	UP Extension Program in Pampanga, Clark Field.
    """
    if not issorted:
        y = sorted(x)
    else:
        y = x
    if not (1 <= qtype <= 9): 
       return None  # error!
 
    # Parameters for the Hyndman and Fan algorithm
    abcd = [(0,   0, 1, 0), # inverse empirical distrib.function., R type 1
            (0.5, 0, 1, 0), # similar to type 1, averaged, R type 2
            (0.5, 0, 0, 0), # nearest order statistic,(SAS) R type 3
 
            (0,   0, 0, 1), # California linear interpolation, R type 4
            (0.5, 0, 0, 1), # hydrologists method, R type 5
            (0,   1, 0, 1), # mean-based estimate(Weibull method), (SPSS,Minitab), type 6 
            (1,  -1, 0, 1), # mode-based method,(S, S-Plus), R type 7
            (1.0/3, 1.0/3, 0, 1), # median-unbiased ,  R type 8
            (3/8.0, 0.25, 0, 1)   # normal-unbiased, R type 9.
           ]
 
    a, b, c, d = abcd[qtype-1]
    n = len(x)
    g, j = modf( a + (n+b) * q -1)
    if j < 0:
        return y[0]
    elif j >= n:           
        return y[n-1]   # oct. 8, 2010 y[n]???!! uncaught  off by 1 error!!!
 
    j = int(floor(j))
    if g ==  0:
       return y[j]
    else:
       return y[j] + (y[j+1]- y[j])* (c + d * g)    
 
def Test():
    x = [11.4, 17.3, 21.3, 25.9, 40.1, 50.5, 60.0, 70.0, 75]
 
    for qtype in range(1,10):
        print qtype, quantile(x, 0.35, qtype)
 
if __name__ == "__main__":    
    Test()

When the test code runs, it outputs

1 25.9
2 25.9
3 21.3
4 21.99
5 24.29
6 23.6
7 24.98
8 24.06
9 24.1175

This matches the output of the following R code:

>> x <- c(11.4, 17.3, 21.3, 25.9, 40.1, 50.5, 60.0, 70.0, 75)
> for (i in seq(1,9)) { print(c( i, quantile(x, 0.35, type=i))) }
      35%
 1.0 25.9
      35%
 2.0 25.9
      35%
 3.0 21.3
        35%
 4.00 21.99
        35%
 5.00 24.29
      35%
 6.0 23.6
        35%
 7.00 24.98
        35%
 8.00 24.06
            35%
 9.0000 24.1175
>                                                
"""

Feel free to comment and feel free to use the above code for educational research purposes.

Review problems.
1. What are the values returned for q = 0 and q = 1.0?

2. The data can be separated into 4 blocks and boundaries are marked by delimiters called quartiles, labelled Q1, Q2, Q3. (Q0 and Q4 are by convention the minimum and maximum of the array values. Determine Q1, Q2, Q3 MANUALLY for the array x = [11.4, 17.3, 21.3, 25.9, 40.1].

3. The median corresponds to the 0.5 quantile. By convention if the number of data points is odd, the median is the middle value otherwise when the number of data points is even, the median will be the
unique middle value. Compute the median for the previous array data.

4. Try to get access to legal copies of SPSS, SAS, STATA. What do they return as default values for the quartiles for the array in Problem 3?

References: Interesting discussions on the computation of quantiles:

Scipy now has a quantile function. Click on the (dated sep 11, 2010)Scipy mstats mquantile documentation .

From time to time we check on our Python codes. It is not fun to find errors, even if impossible or rare in practice when running the actual code.

3 Responses to “Statistics: computing quantiles with Python”

  1. Всё о строительстве Says:

    Спасибо наконец то нашла то что хотела прочитать тут. Кстати у меня есть рисунки на эту тему. Куда можно скинуть? Ещё раз спасибо ! :)

  2. ernie Says:

    This is a temporary approval. Wonder what the Russian comment means in English?

  3. ernie Says:

    August 1, 2009

    From time to time and today again, I received comments in russian. be informed that I have written a blog entry requiing that all responses must be in English.

    I don't have the luxury of time to translate foreign languages to english.