Python, Econometrics: Diagnostic measures, studentized deleted residuals, dffits and Cook distances.
This blog entry replaces parts of an old blog entry 2916
Along the way in writing this entry, we make some revisions to lm.py, and matlib.py. All files are available for downloading as a single tarred file diagnostics.tar.gz.
The hat matrix 
The hat matrix relates the observed Ys to fitted Ys. It is defined in the equation
and is computed by the formula involving only X as

.
Studentized deleted residuals.
The formula is given on page 400 of Neter. This measure is useful for identifying outlying Y observations.
![d_i^* = e_i [\frac{n-p-1}{RSS(1 -h_{ii}) - e_i^2}]^{1/2} d_i^* = e_i [\frac{n-p-1}{RSS(1 -h_{ii}) - e_i^2}]^{1/2}](http://adorio-research.org/wordpress/wp-content/plugins/easy-latex/cache/tex_ba735474ec8d8710cbc50e8d7ad08c42.png)
|
ith residual |
![]() |
ith diagonal of hat matrix |
![]() |
number of cases or observations |
![]() |
number of columns of model matrix X |
|
regression sum of squares = ![]() |
dffits
The formula is given by

where the subscript (i) means that the ith case is omitted when performing the least squares fit. Fortunately, this diagnostic measure can be computed using the full data set and is given by

Cook’s distance
This is an overall measure of the aggregate effect os the ith observation on all of the estimated regression coefficients. The computing formula used is given by
![D_i = \frac{e_i^2}{p MRSS[\frac{h_{ii}}{1-h_{ii}}]^2} D_i = \frac{e_i^2}{p MRSS[\frac{h_{ii}}{1-h_{ii}}]^2}](http://adorio-research.org/wordpress/wp-content/plugins/easy-latex/cache/tex_26756344164a2bc98dd6ebe3b5d755a6.png)
Here is the code for diagnostics.py incorporating Python code for computing the various measures defined above.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 | # -*- coding: utf-8 -*- """ File diagnostics.py Author: Dr. Ernesto Adorio University of the Philippines, Clarkfield Angeles City, Pampanga. Version 0.0.1 december 25 release License: noncommercial academic use only """ from matlib import * from lm import * # add the following simple function to lm.py: def hatmatrix (X): """ Computes the hat matrix. """ VC= matinv(matxtx(X)) B = matmatt(VC, X) return matmul(X, B) def diaghatmatrix(X): """ returns diagonal elements of hatmatrix as a vector. """ VC = matinv(matxtx(X)) hii = [0] * len(X) for i in range(len(X)): row = X[i][:] hii[i] = matvectmatvec(row, VC) return hii def VIF(X): """ Returns the vector of VIF for each independent variable. X is the model matrix and may include leading constant column of all 1s. revised: dec.25,2009 tested with Neter's example. """ rows = len(X) cols = len(X[0]) #Check if first column of X is a column of all ones. startcol = 0 for x in X: if x[0]-1.0 < 1.0e-5: startcol = 1 break # Compute ALL VIFs. out = [] for j in range(startcol, cols): Z = matDelCols(X, [j]) Xj = [x[j] for x in X] fit = ols(Z, Xj) out.append(1.0/(1.0 - fit.R2)) return out def studdelresids(X,Y, hii): """ Computes the studentized deleted residuals for model matrix X. Reference formula 11.25 of Neter, Kutner and Wassermann, page 400 n -number of data p - number of columns of X hii -leverage values (diagonal of hat matrix) ei - error at ith observation. d_i* = e_i [(n-p-1)/{SSE(1-hii) - e_i^2}]**(1/2) """ n = len(X) fit = ols(X,Y) dstar = fit.resids if hii is None: hii = diaghatmatrix(X) SSE = fit.RSS num = n -len(X[0])-1 for i in range(n): dstar[i] *= sqrt(num/(SSE*(1- hii[i]) - (dstar[i])**2)) return dstar def dffits(X,Y, hii): """ Computes the dffits vector for model matrix X. Reference formula 11.25 of Neter, Kutner and Wassermann, page 400 n -number of data p - number of variables hii -leverage values (diagonal of hat matrix) ei - error at ith observation. d_i* = e_i [(n-p-1)/{SSE(1-hii) - e_i^2}]**(1/2) dffits = d_i* * ( hii/ (1 -hii)) """ n = len(X) fit = ols(X,Y) dstar = fit.resids if hii is None: hii = diaghatmatrix(X) SSE = fit.RSS num = n -len(X[0])-1 for i in range(n): dstar[i] *= sqrt(num/(SSE*(1- hii[i]) - (dstar[i])**2)) dstar[i] *= sqrt(hii[i]/ (1 -hii[i])) return dstar def CookDistance(X,Y, hii=None,resids=None,MRSS = None): """ Computes Cook's Distance using formula 11.29a p. 403 """ n = len(X) p = len(X[0]) fit = None if resids is None: fit = ols(X,Y) ei = fit.resids else: ei = resids if hii is None: hii = diaghatmatrix(X) if MRSS is None: if fit is None: fit = ols(X,Y) MRSS = fit.MRSS Di = [0] * n invden = 1.0 / (p * MRSS) for i in range(n): Di[i] = ei[i]**2 * invden * hii[i]/ (1 - hii[i])**2 return Di def Test(): print "reading bodyfat data file body.fat" # Read data file "bodyfat.dat" sdata = open("data/bodyfat.dat").read() data = matreadstring(sdata) print"splitting matrix:" X, Y = matsplitbycolumn(data, -1) mataugprint(X, Y) """ Alternately you can use: X = matDelCols(data, [-1]) Y = [d[-1] for d in data] print "Input matrix" mataugprint(X,Y) """ Xmodel = matInsertConstCol(X, 0, 1,inplace=False) print "Variance Inflation factors for full model" print "matches with p. 410, Table 11.4 of Neter's text." print VIF(Xmodel) # Neters's first two variable example: #print "Hat matrix for two variable example of Neter." Xmodel = matSelectCols(Xmodel, [0,1,2]) fit = ols(Xmodel, Y) resids = fit.resids MRSS = fit.MRSS hii = diaghatmatrix(Xmodel) dstars = studdelresids(Xmodel, Y, hii) DFFITS = dffits(Xmodel, Y, hii) D = CookDistance(X,Y, resids= resids, hii = hii, MRSS = MRSS) print print "Diagnostics for Neter's two variable model example. See pp. " print "%2s %8s %8s %8s %8s %8s" % ("i", "ei", "dffit", "dstar", "hii", "Cook") for i, (e,dffit, dstar,h,d) in enumerate(zip(resids,DFFITS,dstars,hii,D)): print "%2d %8.4f %8.4f %8.4f %8.4f %8.4f" %(i,e,dffit, dstar,h,d) if __name__ == "__main__": Test() |
When the above code is run, it outputs the following:
$ python diagnostics.py reading bodyfat data file body.fat splitting matrix: 19.5000 43.1000 29.1000 | 11.9000 24.7000 49.8000 28.2000 | 22.8000 30.7000 51.9000 37.0000 | 18.7000 29.8000 54.3000 31.1000 | 20.1000 19.1000 42.2000 30.9000 | 12.9000 25.6000 53.9000 23.7000 | 21.7000 31.4000 58.5000 27.6000 | 27.1000 27.9000 52.1000 30.6000 | 25.4000 22.1000 49.9000 23.2000 | 21.3000 25.5000 53.5000 24.8000 | 19.3000 31.1000 56.6000 30.0000 | 25.4000 30.4000 56.7000 28.3000 | 27.2000 18.7000 46.5000 23.0000 | 11.7000 19.7000 44.2000 28.6000 | 17.8000 14.6000 42.7000 21.3000 | 12.8000 29.5000 54.4000 30.1000 | 23.9000 27.7000 55.3000 25.7000 | 22.6000 30.2000 58.6000 24.6000 | 25.4000 22.7000 48.2000 27.1000 | 14.8000 25.2000 51.0000 27.5000 | 21.1000 Variance Inflation factors for full model matches with p. 410, Table 11.4 of Neter's text. [708.84291410637888, 564.3433856203801, 104.60600492589333] Diagnostics for Neter's two variable model example. See pp. i ei dffit dstar hii Cook 0 -1.6827 -0.3661 -0.7300 0.2010 0.0460 1 3.6429 0.3838 1.5343 0.0589 0.0455 2 -3.1760 -1.2731 -1.6543 0.3719 0.4902 3 -3.1585 -0.4763 -1.3485 0.1109 0.0722 4 -0.0003 -0.0001 -0.0001 0.2480 0.0000 5 -0.3608 -0.0567 -0.1475 0.1286 0.0011 6 0.7162 0.1279 0.2981 0.1555 0.0058 7 4.0147 0.5745 1.7601 0.0963 0.0979 8 2.6551 0.4022 1.1176 0.1146 0.0531 9 -2.4748 -0.3639 -1.0337 0.1102 0.0440 10 0.3358 0.0505 0.1367 0.1203 0.0009 11 2.2255 0.3233 0.9232 0.1093 0.0352 12 -3.9469 -0.8508 -1.8259 0.1784 0.2122 13 3.4475 0.6355 1.5248 0.1480 0.1249 14 0.5706 0.1889 0.2672 0.3332 0.0126 15 0.6423 0.0838 0.2581 0.0953 0.0025 16 -0.8509 -0.1184 -0.3445 0.1056 0.0049 17 -0.7829 -0.1655 -0.3344 0.1968 0.0096 18 -2.8573 -0.3151 -1.1762 0.0670 0.0324 19 1.0404 0.0940 0.4094 0.0501 0.0031 toto@toto-laptop:~/Blogs/lm/diagnostics$
The contents of the datafile bodyfat.dat is shown here for convenience.
# Table 8-1 p.272 of NWK reference. # triceps thigh circumference midarm circumference body fat. #vars X1 X2 X3 Y 19.5 43.1 29.1 11.9 24.7 49.8 28.2 22.8 30.7 51.9 37.0 18.7 29.8 54.3 31.1 20.1 19.1 42.2 30.9 12.9 25.6 53.9 23.7 21.7 31.4 58.5 27.6 27.1 27.9 52.1 30.6 25.4 22.1 49.9 23.2 21.3 25.5 53.5 24.8 19.3 31.1 56.6 30.0 25.4 30.4 56.7 28.3 27.2 18.7 46.5 23.0 11.7 19.7 44.2 28.6 17.8 14.6 42.7 21.3 12.8 29.5 54.4 30.1 23.9 27.7 55.3 25.7 22.6 30.2 58.6 24.6 25.4 22.7 48.2 27.1 14.8 25.2 51.0 27.5 21.1
Reference: Neter, Wassermann and Kutner, Applied Linear Regression Models, 2ed., 1989, Richard Irwin, Inc.
The newest version of matlib.py may be viewed in the more recent blog entry.
http://adorio-research.org/wordpress/?p=4353











