
Additive Models

36-402, Advanced Data Analysis

17 February 2011

Readings: Faraway, ch. 12; sections 2.6–2.9 of Berk

Contents

1 Partial Residuals and Backfitting for Linear Models
2 Additive Models
3 The Curse of Dimensionality
4 Example: California House Prices Revisited
5 Closing Modeling Advice
A Big O and Little o Notation

1 Partial Residuals and Backfitting for Linear Models

The general form of a linear regression model is

$$\mathbb{E}\left[Y \mid \vec{X} = \vec{x}\right] = \beta_0 + \vec{\beta} \cdot \vec{x} = \sum_{j=0}^{p} \beta_j x_j \qquad (1)$$

where for $j \in 1:p$, the $x_j$ are the components of $\vec{x}$, and $x_0$ is always the constant 1. (Adding a fictitious constant "feature" like this is a standard way of handling the intercept just like any other regression coefficient.)

Suppose we don't condition on all of $\vec{X}$ but just on one component of it, say $X_k$. What is the conditional expectation of $Y$?

$$\begin{aligned}
\mathbb{E}\left[Y \mid X_k = x_k\right] &= \mathbb{E}\left[\,\mathbb{E}\left[Y \mid X_1, X_2, \ldots, X_k, \ldots, X_p\right] \mid X_k = x_k\,\right] \qquad (2)\\
&= \mathbb{E}\left[\sum_{j=0}^{p} \beta_j X_j \,\Big|\, X_k = x_k\right]\\
&= \beta_k x_k + \mathbb{E}\left[\sum_{j \neq k} \beta_j X_j \,\Big|\, X_k = x_k\right]
\end{aligned}$$

where the first line uses the law of total expectation,^1 and the second line uses Eq. 1. Turned around,

$$\begin{aligned}
\beta_k x_k &= \mathbb{E}\left[Y \mid X_k = x_k\right] - \mathbb{E}\left[\sum_{j \neq k} \beta_j X_j \,\Big|\, X_k = x_k\right]\\
&= \mathbb{E}\left[\,Y - \sum_{j \neq k} \beta_j X_j \,\Big|\, X_k = x_k\right]
\end{aligned}$$

The expression inside the expectation is the $k$th partial residual — the (total) residual is the difference between $Y$ and its expectation; the partial residual is the difference between $Y$ and what we expect it to be ignoring the contribution from $X_k$. Let's introduce a symbol for this, say $Y^{(k)}$.

$$\beta_k x_k = \mathbb{E}\left[Y^{(k)} \mid X_k = x_k\right]$$

In words, if the overall model is linear, then the partial residuals are linear. And notice that $X_k$ is the only input feature appearing here — if we could somehow get hold of the partial residuals, then we could find $\beta_k$ by doing a simple regression, rather than a multiple regression. Of course, to get the partial residuals we need to know all the other $\beta_j$'s...

This suggests the following estimation scheme for linear models, known as the Gauss-Seidel algorithm, or more commonly and transparently as backfitting; the pseudo-code is in Example 1. This is an iterative approximation algorithm. Initially, we look at how far each point is from the global mean, and do simple regressions of those deviations on the input features. This then gives us a better idea of what the regression surface really is, and we use the deviations from that surface in our next set of simple regressions. At each step, each coefficient is adjusted to fit in with what we already know about the other coefficients — that's why it's called "backfitting". It is not obvious^2 that this converges, but it does, and the fixed point on which it converges is the usual least-squares estimate of $\beta$.

^1 As you learned in baby prob., this is the fact that $\mathbb{E}[Y \mid X] = \mathbb{E}\left[\mathbb{E}[Y \mid X, Z] \mid X\right]$ — that we can always condition on another variable or variables ($Z$), provided we then average over those extra variables when we're done.

^2 Unless, I suppose, you're Gauss.
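To make the procedure concrete, here is a minimal R sketch of backfitting a linear model. This is my own illustration, not the handout's Code Example 1; the function name backfit.lm, the centering of the variables, and the arguments are my choices, and x is assumed to be a numeric matrix.

# Minimal sketch of backfitting for a linear model (illustration only).
# x: numeric n x p matrix of features; y: numeric response vector of length n.
backfit.lm <- function(x, y, tol = 1e-6, max.iter = 100) {
  p <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)   # center each feature
  yc <- y - mean(y)                              # deviations from the global mean
  beta <- rep(0, p)
  for (iter in 1:max.iter) {
    beta.old <- beta
    for (k in 1:p) {
      # k-th partial residual: remove the current fits of all the other features
      resid.k <- yc - xc[, -k, drop = FALSE] %*% beta[-k]
      # simple regression of the partial residual on x_k alone
      beta[k] <- sum(xc[, k] * resid.k) / sum(xc[, k]^2)
    }
    if (max(abs(beta - beta.old)) < tol) break
  }
  alpha <- mean(y) - sum(colMeans(x) * beta)     # recover the intercept
  list(alpha = alpha, beta = beta)
}
# At convergence the coefficients match coef(lm(y ~ x)) up to the tolerance.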

straightforward to interpret: if you want to know how the prediction changes as you change $x_j$, you just need to know $\beta_j$. The partial response function $f_j$ plays the same role in an additive model: of course the change in prediction from changing $x_j$ will generally depend on the level $x_j$ had before perturbation, but since that's also true of reality, that's really a feature rather than a bug. It's true that a set of plots of the $f_j$'s takes more room than a table of the $\beta_j$'s, but it's also nicer to look at. There are really very, very few situations when, with modern computing power, linear models are superior to additive models.

Now, one of the nice properties which additive models share with linear ones has to do with the partial residuals. Defining

$$Y^{(k)} = Y - \left(\alpha + \sum_{j \neq k} f_j(x_j)\right)$$

a little algebra along the lines of the last section shows that

$$\mathbb{E}\left[Y^{(k)} \mid X_k = x_k\right] = f_k(x_k)$$

If we knew how to estimate arbitrary one-dimensional regressions, we could now use backfitting to estimate additive models. But we have spent a lot of time talking about how to use smoothers to fit one-dimensional regressions! Our new, improved backfitting algorithm is in Example 2. Once again, while it's not obvious that this converges, it does converge.

Also, the backfitting procedure works well with some complications or refinements of the additive model. If we know the functional form of one or another of the $f_j$, we can fit those parametrically (rather than with the smoother) at the appropriate points in the loop. (This would be a semiparametric model.) If we think that there is an interaction between $x_j$ and $x_k$, rather than their making separate additive contributions, we can smooth them together; etc.

There are actually two standard packages for fitting additive models in R: gam and mgcv. Both have commands called gam, which fit generalized additive models — the generalization is to use the additive model for things like the probabilities of categorical responses, rather than the response variable itself. If that sounds obscure right now, don't worry — we'll come back to this after we've looked at generalized linear models. The last section of this hand-out illustrates using these packages to fit an additive model.
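As a small preview of that last section, here is a minimal sketch of such a fit using mgcv. The data frame df and the variables y, x1, x2 are placeholders of my own, not objects from the handout.

library(mgcv)                              # provides gam() and s() smooth terms
fit <- gam(y ~ s(x1) + s(x2), data = df)   # one smooth partial response per input
summary(fit)                               # effective degrees of freedom, etc.
plot(fit, pages = 1)                       # the estimated partial response functions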

Given: $n \times p$ inputs $x$
       $n \times 1$ responses $y$
       tolerance $1 \gg \delta > 0$
       one-dimensional smoother $S$

$\hat{\alpha} \leftarrow n^{-1} \sum_{i=1}^{n} y_i$
$\hat{f}_j \leftarrow 0$ for $j \in 1:p$
until (all $|\hat{f}_j - g_j| \leq \delta$) {
    for $k \in 1:p$ {
        $y_i^{(k)} = y_i - \sum_{j \neq k} \hat{f}_j(x_{ij})$
        $g_k \leftarrow S(y^{(k)} \sim x_{\cdot k})$
        $g_k \leftarrow g_k - n^{-1} \sum_{i=1}^{n} g_k(x_{ik})$
        $\hat{f}_k \leftarrow g_k$
    }
}
Return: $(\hat{\alpha}, \hat{f}_1, \ldots, \hat{f}_p)$

Code Example 2: Pseudo-code for backfitting additive models. Notice the extra step, as compared to backfitting linear models, which keeps each partial response function centered.

3 The Curse of Dimensionality

Before illustrating how additive models work in practice, let's talk about why we'd want to use them. So far, we have looked at two extremes for regression models; additive models are somewhere in between.

On the one hand, we had linear regression, which is a parametric method (with $p + 1$ parameters). Its weakness is that the true regression function $r$ is hardly ever linear, so even with infinite data it will always make systematic mistakes in its predictions — there's always some approximation bias, bigger or smaller depending on how non-linear $r$ is. The strength of linear regression is that it converges very quickly as we get more data. Generally speaking,

$$MSE_{\mathrm{linear}} = \sigma^2 + a_{\mathrm{linear}} + O(n^{-1})$$

where the first term is the intrinsic noise around the true regression function, the second term is the (squared) approximation bias, and the last term is the estimation variance. Notice that the rate at which the estimation variance shrinks doesn't depend on $p$ — factors like that are all absorbed into the big $O$.^4 Other parametric models generally converge at the same rate.

At the other extreme, we've seen a number of completely non-parametric regression methods, such as kernel regression, local polynomials, $k$-nearest neighbors, etc. Here the limiting approximation bias is actually zero, at least for any reasonable regression function $r$. The problem is that they converge more slowly, because we need to use the data not just to figure out the coefficients of a parametric model, but the sheer shape of the regression function. Again generally speaking, the rate of convergence for these models is (Wasserman, 2006, §5.12)

$$MSE_{\mathrm{nonpara}} - \sigma^2 = O\left(n^{-4/(p+4)}\right)$$

There’s no approximation bias term here, just estimation variance.^5 Why does

^4 See the appendix if you are not familiar with "big O" notation.

^5 To be careful: if we use, say, kernel regression, then at any finite $n$ and bandwidth there is some approximation bias, but this can be made arbitrarily small, and is actually absorbed into the remaining big-$O$.
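To get a feel for these rates, here is a small R calculation of my own (not from the handout). Ignoring constants, the nonparametric fit needs a sample size $m$ satisfying $m^{-4/(p+4)} = n^{-1}$, i.e. $m = n^{(p+4)/4}$, to match the estimation variance a parametric model attains with $n$ observations.

# Equivalent sample size m for the nonparametric rate to match the
# parametric 1/n rate at sample size n, ignoring constants: m = n^((p+4)/4).
n <- 100
for (p in c(1, 2, 5, 10)) {
  m <- n^((p + 4) / 4)
  cat("p =", p, "  equivalent sample size m =", signif(m, 3), "\n")
}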

4 Example: California House Prices Revisited

As an example, we’ll revisit the California house price data from the homework.

calif = read.table("~/teaching/402/hw/01/cadata.dat",header=TRUE)

Fitting a linear model is very fast (about 1/5 of a second on my laptop). Here are the summary statistics:

linfit = lm(log(MedianHouseValue) ~ ., data=calif)
print(summary(linfit), signif.stars=FALSE)

Call:
lm(formula = log(MedianHouseValue) ~ ., data = calif)

Residuals:
      Min        1Q    Median        3Q   Max
-2.517974 -0.203797  0.001589  0.194947    3.

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1.180e+01  3.059e-01 -38.570   < 2e-
MedianIncome     1.782e-01  1.639e-03 108.753   < 2e-
MedianHouseAge   3.261e-03  2.111e-04  15.446   < 2e-
TotalRooms      -3.186e-05  3.855e-06  -8.265   < 2e-
TotalBedrooms    4.798e-04  3.375e-05  14.215   < 2e-
Population      -1.725e-04  5.277e-06 -32.687   < 2e-
Households       2.493e-04  3.675e-05   6.783  1.21e-
Latitude        -2.801e-01  3.293e-03 -85.078   < 2e-
Longitude       -2.762e-01  3.487e-03 -79.212   < 2e-

Residual standard error: 0.34 on 20631 degrees of freedom
Multiple R-squared: 0.6432, Adjusted R-squared: 0.
F-statistic: 4648 on 8 and 20631 DF, p-value: < 2.2e-

Figure 1 plots the predicted prices, ±2 standard errors, against the actual prices. The predictions are not all that accurate — the RMS residual is 0.340 on the log scale (i.e., 41%), and only 3.3% of the actual prices fall within the prediction bands.^7 On the other hand, they are quite precise, with an RMS standard error of 0.0071 (i.e., 0.71%). This linear model is pretty thoroughly converged.

^7 You might worry that the top-coding of the prices — all values over $500,000 are recorded as $500,001 — means we're not being fair to the model. After all, if we see $500,001 and the model predicts $600,000, the prediction might be right — it's certainly right that it's over $500,000. To deal with this, I tried top-coding the predicted values, but it didn't change much — the RMS error for the linear model only went down to 0.332, and it was similarly inconsequential for the others. Presumably this is because only about 5% of the records are top-coded.
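For reference, here is a sketch of how numbers like these can be computed from the fitted model. This is my own reconstruction, not code from the handout; the object names predictions and inside are mine, while predict(), residuals(), and the calif columns come from the text above.

predictions <- predict(linfit, se.fit = TRUE)
sqrt(mean(residuals(linfit)^2))       # RMS residual on the log scale
sqrt(mean(predictions$se.fit^2))      # RMS standard error of the fitted values
# fraction of actual log prices inside the +/- 2 standard error band
inside <- abs(log(calif$MedianHouseValue) - predictions$fit) <= 2 * predictions$se.fit
mean(inside)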

predictions = predict(linfit, se.fit=TRUE)
plot(calif$MedianHouseValue, exp(predictions$fit), cex=0.1,
  xlab="Actual price", ylab="Predicted")
segments(calif$MedianHouseValue, exp(predictions$fit - 2*predictions$se.fit),
  calif$MedianHouseValue, exp(predictions$fit + 2*predictions$se.fit),
  col="grey")
abline(a=0, b=1, lty=2)

Figure 1: Actual median house values (horizontal axis) versus those predicted by the linear model (black dots), plus or minus two standard errors (grey bars). The dashed line shows where actual and predicted prices would be equal. Note that I’ve exponentiated the predictions so that they’re comparable to the original values.
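The call that defines addfit, used below, falls in a portion of the text not shown in this preview. A plausible reconstruction, assuming the mgcv package and one smooth term per predictor (eight in all, matching the eight panels of Figure 3), would be something like the following; this is a hypothetical sketch, not the handout's actual code.

library(mgcv)
# Hypothetical reconstruction of the additive-model fit used below
addfit <- gam(log(MedianHouseValue) ~ s(MedianIncome) + s(MedianHouseAge)
              + s(TotalRooms) + s(TotalBedrooms) + s(Population)
              + s(Households) + s(Latitude) + s(Longitude), data = calif)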

predictions = predict(addfit, se.fit=TRUE)
plot(calif$MedianHouseValue, exp(predictions$fit), cex=0.1,
  xlab="Actual price", ylab="Predicted")
segments(calif$MedianHouseValue, exp(predictions$fit - 2*predictions$se.fit),
  calif$MedianHouseValue, exp(predictions$fit + 2*predictions$se.fit),
  col="grey")
abline(a=0, b=1, lty=2)

Figure 2: Actual versus predicted prices for the additive model, as in Figure 1.

plot(addfit,scale=0,se=2,shade=TRUE,resid=TRUE,pages=1)

Figure 3: The estimated partial response functions for the additive model, with a shaded region showing ±2 standard errors, and dots for the actual partial residuals. The tick marks along the horizontal axis show the observed values of the input variables (a rug plot); note that the error bars are wider where there are fewer observations. Setting pages=0 (the default) would produce eight separate plots, with the user prompted to cycle through them. Setting scale=0 gives each plot its own vertical scale; the default is to force them to share the same one. Finally, note that here the vertical scale is logarithmic.
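Similarly, the definition of addfit2, the model with a joint smooth of longitude and latitude, is not shown in this preview. A hypothetical reconstruction, again assuming mgcv, replaces the two separate geographic smooths with a single bivariate term; with six univariate smooths before it, the joint smooth is then the seventh term, which is why select=7 picks it out in the plots below.

# Hypothetical reconstruction: six univariate smooths plus a joint
# smooth of Longitude and Latitude as the seventh term
addfit2 <- gam(log(MedianHouseValue) ~ s(MedianIncome) + s(MedianHouseAge)
               + s(TotalRooms) + s(TotalBedrooms) + s(Population)
               + s(Households) + s(Longitude, Latitude), data = calif)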


plot(addfit2,select=7,phi=60,pers=TRUE)

Figure 5: The result of the joint smoothing of longitude and latitude.


plot(addfit2,select=7,se=FALSE)

Figure 6: The result of the joint smoothing of longitude and latitude. Setting se=TRUE, the default, adds standard errors for the contour lines in multiple colors. Again, note that these are log units.

Figure 7: Maps of real or fitted prices: actual, top left; linear model, top right; first additive model, bottom right; additive model with interaction, bottom left. Categories are deciles of the actual prices; since those are the same for all four plots, it would have been nicer to make one larger legend, but that was beyond my graphical abilities.

A Big O and Little o Notation

It is often useful to talk about the rate at which some function changes as its argument grows (or shrinks), without worrying too much about the detailed form. This is what the $O(\cdot)$ and $o(\cdot)$ notation lets us do. A function $f(n)$ is "of constant order", or "of order 1", when there exists some non-zero constant $c$ such that

$$\frac{f(n)}{c} \to 1$$

as $n \to \infty$; equivalently, since $c$ is a constant, $f(n) \to c$ as $n \to \infty$. It doesn't matter how big or how small $c$ is, just so long as there is some such constant. We then write

$$f(n) = O(1)$$

and say that "the proportionality constant $c$ gets absorbed into the big $O$". For example, if $f(n) = 37$, then $f(n) = O(1)$. But if $g(n) = 37(1 - \frac{2}{n})$, then $g(n) = O(1)$ also.

The other orders are defined recursively. Saying

$$g(n) = O(f(n))$$

means

$$\frac{g(n)}{f(n)} = O(1)$$

or

$$\frac{g(n)}{f(n)} \to c$$

as $n \to \infty$ — that is to say, $g(n)$ is "of the same order" as $f(n)$, and they "grow at the same rate", or "shrink at the same rate". For example, a quadratic function $a_1 n^2 + a_2 n + a_3 = O(n^2)$, no matter what the coefficients are. On the other hand, $b_1 n^{-2} + b_2 n^{-1}$ is $O(n^{-1})$.

Big-$O$ means "is of the same order as". The corresponding little-$o$ means "is ultimately smaller than": $f(n) = o(1)$ means that $f(n)/c \to 0$ for any constant $c$. Recursively, $g(n) = o(f(n))$ means $g(n)/f(n) = o(1)$, or $g(n)/f(n) \to 0$. We also read $g(n) = o(f(n))$ as "$g(n)$ is ultimately negligible compared to $f(n)$".

There are some rules for arithmetic with big-$O$ symbols:

  • If $g(n) = O(f(n))$, then $c\,g(n) = O(f(n))$ for any constant $c$.
  • If $g_1(n)$ and $g_2(n)$ are both $O(f(n))$, then so is $g_1(n) + g_2(n)$.
  • If $g_1(n) = O(f(n))$ but $g_2(n) = o(f(n))$, then $g_1(n) + g_2(n) = O(f(n))$.
  • If $g(n) = O(f(n))$, and $f(n) = o(h(n))$, then $g(n) = o(h(n))$.

These are not all of the rules, but they’re enough for most purposes.
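For instance, to verify the earlier claim that any quadratic is $O(n^2)$ (assuming $a_1 \neq 0$), divide by $n^2$ and take the limit:

$$\frac{a_1 n^2 + a_2 n + a_3}{n^2} = a_1 + \frac{a_2}{n} + \frac{a_3}{n^2} \to a_1 \quad \text{as } n \to \infty,$$

so the ratio tends to the non-zero constant $a_1$, which is exactly the definition of being $O(n^2)$. The same division shows $b_1 n^{-2} + b_2 n^{-1} = O(n^{-1})$, since $(b_1 n^{-2} + b_2 n^{-1})/n^{-1} = b_1 n^{-1} + b_2 \to b_2$ (provided $b_2 \neq 0$).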