Python, Econometrics: Diagnostic measures, studentized deleted residuals, dffits and Cook distances.

This blog entry replaces parts of an old blog entry 2916
Along the way in writing this entry, we make some revisions to lm.py, and matlib.py. All files are available for downloading as a single tarred file diagnostics.tar.gz.

The hat matrix H

The hat matrix relates the observed Ys to fitted Ys. It is defined in the equation Y = H \hat{Y} and is computed by the formula involving only X as

 H = X (X^t X)^{-1} X^t
. If there are n cases, the dimension of the hat matrix would be n by n. In practice we are interested only in the diagonal elements. In this case, if X_i is the ith row vector of the model X matrix, then h_{ii} = X_i (X^t X)^{-1} X_i^t.

Studentized deleted residuals.

The formula is given on page 400 of Neter. This measure is useful for identifying outlying Y observations.

d_i^* = e_i [\frac{n-p-1}{RSS(1 -h_{ii}) - e_i^2}]^{1/2}

e_i ith residual
h_{ii} ith diagonal of hat matrix
n number of cases or observations
p number of columns of model matrix X
RSS regression sum of squares = \sum (\hat{y_i} - y_i)^2

dffits

The formula is given by

\frac{\hat{Y_i}-\hat{Y_{i(i)}}}       {\sqrt{MRSS_{(i)} h_{ii}}}

where the subscript (i) means that the ith case is omitted when performing the least squares fit. Fortunately, this diagnostic measure can be computed using the full data set and is given by

 d_i^* \sqrt{\frac{h_{ii}}{1-h_{ii}}}.
The authors of our reference, NWK, recommends a case is influential ofn the fitted values if the absolute value of dffits exceeds 1 for small to medium size data sets and 2 \sqrt{p/n} for large data sets.

Cook’s distance

This is an overall measure of the aggregate effect os the ith observation on all of the estimated regression coefficients. The computing formula used is given by

D_i = \frac{e_i^2}{p MRSS[\frac{h_{ii}}{1-h_{ii}}]^2}

Here is the code for diagnostics.py incorporating Python code for computing the various measures defined above.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# -*- coding: utf-8 -*-
"""
File  diagnostics.py
 
Author: Dr. Ernesto Adorio
           University of the Philippines, Clarkfield
           Angeles City, Pampanga.
Version 0.0.1 december 25 release
License: noncommercial academic use only
"""
from matlib import *
from lm     import *
 
# add the following simple function to lm.py:
def hatmatrix (X):
    """
    Computes the hat matrix.
    """
    VC= matinv(matxtx(X))
    B = matmatt(VC, X)
    return matmul(X, B) 
 
def diaghatmatrix(X):
    """
    returns diagonal elements of hatmatrix as a vector.
    """
    VC  = matinv(matxtx(X))
    hii = [0] * len(X)
    for i in range(len(X)):
       row = X[i][:]
       hii[i] = matvectmatvec(row, VC)
    return hii      
 
 
def VIF(X):
    """
    Returns the vector of VIF for each independent variable.
    X is the model matrix and may include leading constant column
    of all 1s.
 
    revised: dec.25,2009 tested with Neter's example.
    """
    rows = len(X)
    cols = len(X[0])
 
    #Check if first column of X is a column of all ones.
    startcol = 0
    for x in X:
       if x[0]-1.0 < 1.0e-5:
         startcol = 1
         break
 
    # Compute ALL VIFs.
    out = []
    for j in range(startcol, cols):
       Z   = matDelCols(X, [j])
       Xj  = [x[j] for x in X]
       fit = ols(Z, Xj)
       out.append(1.0/(1.0 - fit.R2))
    return out
 
def studdelresids(X,Y, hii):
    """
    Computes the studentized deleted residuals for model matrix X.
 
    Reference formula 11.25 of Neter, Kutner and Wassermann, page 400
    n -number of data
    p - number of columns of X
    hii -leverage values (diagonal of hat matrix) 
    ei - error at ith observation.
    d_i* = e_i [(n-p-1)/{SSE(1-hii) - e_i^2}]**(1/2)
    """
    n     = len(X)
    fit   = ols(X,Y)
    dstar = fit.resids
    if hii is None:
       hii  = diaghatmatrix(X)
    SSE   = fit.RSS
    num = n -len(X[0])-1
    for i in range(n):
       dstar[i] *= sqrt(num/(SSE*(1- hii[i]) - (dstar[i])**2))
    return dstar
 
 
def dffits(X,Y, hii):
    """
    Computes the dffits vector for model matrix X.
    Reference formula 11.25 of Neter, Kutner and Wassermann, page 400
    n -number of data
    p - number of variables
    hii -leverage values (diagonal of hat matrix) 
    ei - error at ith observation.
    d_i* = e_i [(n-p-1)/{SSE(1-hii) - e_i^2}]**(1/2)
    dffits = d_i* * ( hii/ (1 -hii))
    """
    n     = len(X)
    fit   = ols(X,Y)
    dstar = fit.resids
    if hii is None:
       hii  = diaghatmatrix(X)
    SSE   = fit.RSS
    num = n -len(X[0])-1
    for i in range(n):
       dstar[i] *= sqrt(num/(SSE*(1- hii[i]) - (dstar[i])**2))
       dstar[i] *= sqrt(hii[i]/ (1 -hii[i]))
    return dstar
 
def CookDistance(X,Y, hii=None,resids=None,MRSS = None):
    """
    Computes Cook's Distance using formula 11.29a p. 403
    """
    n = len(X)
    p = len(X[0])
    fit = None
    if resids is None:
       fit = ols(X,Y)
       ei  = fit.resids
    else:
       ei = resids
 
    if hii is None:
       hii  = diaghatmatrix(X)
    if MRSS is None:
       if fit is None:
          fit = ols(X,Y)
       MRSS = fit.MRSS
 
 
    Di = [0] * n
    invden = 1.0 / (p * MRSS)
    for i in range(n):
        Di[i] = ei[i]**2 * invden * hii[i]/ (1 - hii[i])**2   
    return Di
 
 
 
def Test():
   print "reading bodyfat data file body.fat"
   # Read data file "bodyfat.dat"
   sdata = open("data/bodyfat.dat").read()
   data = matreadstring(sdata)
   print"splitting matrix:"
   X, Y = matsplitbycolumn(data, -1)
   mataugprint(X, Y)
   """
   Alternately you can use:
   X = matDelCols(data, [-1])
   Y = [d[-1] for d in data]
   print "Input matrix"
   mataugprint(X,Y)
   """
 
   Xmodel = matInsertConstCol(X, 0, 1,inplace=False)
   print "Variance Inflation factors for full model"
   print "matches with p. 410, Table  11.4  of Neter's text."
   print VIF(Xmodel)
 
 
   # Neters's first two variable example:
   #print "Hat matrix for two variable example of Neter."
   Xmodel = matSelectCols(Xmodel, [0,1,2])
   fit = ols(Xmodel, Y)
   resids = fit.resids
   MRSS = fit.MRSS
   hii =   diaghatmatrix(Xmodel)
   dstars = studdelresids(Xmodel, Y, hii)
   DFFITS = dffits(Xmodel, Y, hii)
   D = CookDistance(X,Y, resids= resids, hii = hii, MRSS = MRSS)
   print
   print "Diagnostics for Neter's two variable model example. See pp. "
   print "%2s %8s %8s %8s %8s %8s" % ("i", "ei", "dffit", "dstar", "hii", "Cook")
   for i, (e,dffit, dstar,h,d) in enumerate(zip(resids,DFFITS,dstars,hii,D)):
      print "%2d %8.4f %8.4f %8.4f %8.4f %8.4f" %(i,e,dffit, dstar,h,d)
 
 
if __name__ == "__main__":
   Test()

When the above code is run, it outputs the following:

$ python diagnostics.py
reading bodyfat data file body.fat
splitting matrix:
 19.5000  43.1000  29.1000 |  11.9000
 24.7000  49.8000  28.2000 |  22.8000
 30.7000  51.9000  37.0000 |  18.7000
 29.8000  54.3000  31.1000 |  20.1000
 19.1000  42.2000  30.9000 |  12.9000
 25.6000  53.9000  23.7000 |  21.7000
 31.4000  58.5000  27.6000 |  27.1000
 27.9000  52.1000  30.6000 |  25.4000
 22.1000  49.9000  23.2000 |  21.3000
 25.5000  53.5000  24.8000 |  19.3000
 31.1000  56.6000  30.0000 |  25.4000
 30.4000  56.7000  28.3000 |  27.2000
 18.7000  46.5000  23.0000 |  11.7000
 19.7000  44.2000  28.6000 |  17.8000
 14.6000  42.7000  21.3000 |  12.8000
 29.5000  54.4000  30.1000 |  23.9000
 27.7000  55.3000  25.7000 |  22.6000
 30.2000  58.6000  24.6000 |  25.4000
 22.7000  48.2000  27.1000 |  14.8000
 25.2000  51.0000  27.5000 |  21.1000

Variance Inflation factors for full model
matches with p. 410, Table  11.4  of Neter's text.
[708.84291410637888, 564.3433856203801, 104.60600492589333]

Diagnostics for Neter's two variable model example. See pp.
 i       ei    dffit    dstar      hii     Cook
 0  -1.6827  -0.3661  -0.7300   0.2010   0.0460
 1   3.6429   0.3838   1.5343   0.0589   0.0455
 2  -3.1760  -1.2731  -1.6543   0.3719   0.4902
 3  -3.1585  -0.4763  -1.3485   0.1109   0.0722
 4  -0.0003  -0.0001  -0.0001   0.2480   0.0000
 5  -0.3608  -0.0567  -0.1475   0.1286   0.0011
 6   0.7162   0.1279   0.2981   0.1555   0.0058
 7   4.0147   0.5745   1.7601   0.0963   0.0979
 8   2.6551   0.4022   1.1176   0.1146   0.0531
 9  -2.4748  -0.3639  -1.0337   0.1102   0.0440
10   0.3358   0.0505   0.1367   0.1203   0.0009
11   2.2255   0.3233   0.9232   0.1093   0.0352
12  -3.9469  -0.8508  -1.8259   0.1784   0.2122
13   3.4475   0.6355   1.5248   0.1480   0.1249
14   0.5706   0.1889   0.2672   0.3332   0.0126
15   0.6423   0.0838   0.2581   0.0953   0.0025
16  -0.8509  -0.1184  -0.3445   0.1056   0.0049
17  -0.7829  -0.1655  -0.3344   0.1968   0.0096
18  -2.8573  -0.3151  -1.1762   0.0670   0.0324
19   1.0404   0.0940   0.4094   0.0501   0.0031
toto@toto-laptop:~/Blogs/lm/diagnostics$

The contents of the datafile bodyfat.dat is shown here for convenience.

# Table 8-1 p.272 of NWK reference.
# triceps thigh circumference midarm circumference body fat.
#vars X1 X2 X3 Y
19.5 43.1 29.1 11.9
24.7 49.8 28.2 22.8
30.7 51.9 37.0 18.7
29.8 54.3 31.1 20.1
19.1 42.2 30.9 12.9
25.6 53.9 23.7 21.7
31.4 58.5 27.6 27.1
27.9 52.1 30.6 25.4
22.1 49.9 23.2 21.3
25.5 53.5 24.8 19.3
31.1 56.6 30.0 25.4
30.4 56.7 28.3 27.2
18.7 46.5 23.0 11.7
19.7 44.2 28.6 17.8
14.6 42.7 21.3 12.8
29.5 54.4 30.1 23.9
27.7 55.3 25.7 22.6
30.2 58.6 24.6 25.4
22.7 48.2 27.1 14.8
25.2 51.0 27.5 21.1

Reference: Neter, Wassermann and Kutner, Applied Linear Regression Models, 2ed., 1989, Richard Irwin, Inc.

The newest version of matlib.py may be viewed in the more recent blog entry.
http://adorio-research.org/wordpress/?p=4353

  • Share/Bookmark

Leave a Reply

Digital explorations is Digg proof thanks to caching by WP Super Cache