## Statistics: computing quantiles with Python

Nov. 30, 2009 Visit the link Quantile function in R for more information. It is said that the quantile function we implemented cannot do the type #2. We will investigate this when we have the time.

I used to open R just to compute summary statistics on data. R is big and so is Python. But I find it strange when people finds Python lacking on some statistical features, other people say just use R. You can interface to R using Rpy. But for a few calculations, it is just much more efficient when you have Python code handy for doing statistics. Scipy comes with a stats module but there is no function for computing quantiles from an array of data.

Inspired by R, and helped by the clear documentation from the Mathematica page, I am able to replicate the quantile function in Python. Here it is (Again, I find it crazy that my Python indents are not being shown! [Sep. 29, 2009, fixed by using CodeBox]

Given an array x, and a quantile value from from 0 to 1.0, the function will return a value of x which may not be an element of the array such that P(X <= x_q) < q, i.e. q * 100 percent of the data are less than or equal to x_q. Quantiles are similar to percentiles.

^{?}View Code PYTHON

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | """ File quantile.py Desc computes sample quantiles Author Ernesto P. Adorio, PhD. UPDEPP (U.P. at Clarkfield) Version 0.0.1 August 7. 2009 """ from math import modf, floor def quantile(x, q, qtype = 7, issorted = False): """ Args: x - input data q - quantile qtype - algorithm issorted- True if x already sorted. Compute quantiles from input array x given q.For median, specify q=0.5. References: http://reference.wolfram.com/mathematica/ref/Quantile.html http://wiki.r-project.org/rwiki/doku.php?id=rdoc:stats:quantile Author: Ernesto P.Adorio Ph.D. UP Extension Program in Pampanga, Clark Field. """ if not issorted: y = sorted(x) else: y = x if not (1 <= qtype <= 9): return None # error! # Parameters for the Hyndman and Fan algorithm abcd = [(0, 0, 1, 0), # inverse empirical distrib.function., R type 1 (0.5, 0, 1, 0), # similar to type 1, averaged, R type 2 (0.5, 0, 0, 0), # nearest order statistic,(SAS) R type 3 (0, 0, 0, 1), # California linear interpolation, R type 4 (0.5, 0, 0, 1), # hydrologists method, R type 5 (0, 1, 0, 1), # mean-based estimate(Weibull method), (SPSS,Minitab), type 6 (1, -1, 0, 1), # mode-based method,(S, S-Plus), R type 7 (1.0/3, 1.0/3, 0, 1), # median-unbiased , R type 8 (3/8.0, 0.25, 0, 1) # normal-unbiased, R type 9. ] a, b, c, d = abcd[qtype-1] n = len(x) g, j = modf( a + (n+b) * q -1) if j < 0: return y[0] elif j >= n: return y[n-1] # oct. 8, 2010 y[n]???!! uncaught off by 1 error!!! j = int(floor(j)) if g == 0: return y[j] else: return y[j] + (y[j+1]- y[j])* (c + d * g) def Test(): x = [11.4, 17.3, 21.3, 25.9, 40.1, 50.5, 60.0, 70.0, 75] for qtype in range(1,10): print qtype, quantile(x, 0.35, qtype) if __name__ == "__main__": Test() |

When the test code runs, it outputs

1 25.9

2 25.9

3 21.3

4 21.99

5 24.29

6 23.6

7 24.98

8 24.06

9 24.1175

This matches the output of the following R code:

>> x <- c(11.4, 17.3, 21.3, 25.9, 40.1, 50.5, 60.0, 70.0, 75) > for (i in seq(1,9)) { print(c( i, quantile(x, 0.35, type=i))) } 35% 1.0 25.9 35% 2.0 25.9 35% 3.0 21.3 35% 4.00 21.99 35% 5.00 24.29 35% 6.0 23.6 35% 7.00 24.98 35% 8.00 24.06 35% 9.0000 24.1175 > """

Feel free to comment and feel free to use the above code for educational research purposes.

Review problems.

1. What are the values returned for q = 0 and q = 1.0?

2. The data can be separated into 4 blocks and boundaries are marked by delimiters called quartiles, labelled Q1, Q2, Q3. (Q_{0} and Q_{4} are by convention the minimum and maximum of the array values. Determine Q1, Q2, Q3 MANUALLY for the array x = [11.4, 17.3, 21.3, 25.9, 40.1].

3. The median corresponds to the 0.5 quantile. By convention if the number of data points is odd, the median is the middle value otherwise when the number of data points is even, the median will be the

unique middle value. Compute the median for the previous array data.

4. Try to get access to legal copies of SPSS, SAS, STATA. What do they return as default values for the quartiles for the array in Problem 3?

References: Interesting discussions on the computation of quantiles:

- Weisstein, Eric W. "Quantile." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Quantile.html
- Stackoverflow
- Wikipedia article on quantiles

Scipy now has a quantile function. Click on the (dated sep 11, 2010)Scipy mstats mquantile documentation .

From time to time we check on our Python codes. It is not fun to find errors, even if impossible or rare in practice when running the actual code.

March 6th, 2009 at 12:55 pm

Спасибо наконец то нашла то что хотела прочитать тут. Кстати у меня есть рисунки на эту тему. Куда можно скинуть? Ещё раз спасибо !

March 7th, 2009 at 11:37 am

This is a temporary approval. Wonder what the Russian comment means in English?

August 1st, 2009 at 2:13 pm

August 1, 2009

From time to time and today again, I received comments in russian. be informed that I have written a blog entry requiing that all responses must be in English.

I don't have the luxury of time to translate foreign languages to english.