Prepare for your exams
Get points
Guidelines and tips

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Go to the blog

Data Analysis Introduction to class 1, Exercises - Engineering, Exercises of Advanced Data Analysis

Carnegie Mellon University (CMU)Advanced Data Analysis

Data Analysis Introduction to class 1, Exercises - Engineering, Advanced Data Analysis

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

bridge 🇺🇸

4.9

(13)

287 documents

1 / 21

This page cannot be seen from the preview

Don't miss anything!

Homework Assignment 1: What’s That Got to

Do with the Price of Condos in California?

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

The easiest way to load the data is with read.table, but you have to tell

R that the first line names the variables:

> calif = read.table("~/teaching/350/hw/06/cadata.dat",header=TRUE)

> dim(calif)

[1] 20640 9

> is.data.frame(calif)

[1] TRUE

> summary(calif)

MedianHouseValue MedianIncome MedianHouseAge TotalRooms

Min. : 14999 Min. : 0.4999 Min. : 1.00 Min. : 2

1st Qu.:119600 1st Qu.: 2.5634 1st Qu.:18.00 1st Qu.: 1448

Median :179700 Median : 3.5348 Median :29.00 Median : 2127

Mean :206856 Mean : 3.8707 Mean :28.64 Mean : 2636

3rd Qu.:264725 3rd Qu.: 4.7432 3rd Qu.:37.00 3rd Qu.: 3148

Max. :500001 Max. :15.0001 Max. :52.00 Max. :39320

TotalBedrooms Population Households Latitude

Min. : 1.0 Min. : 3 Min. : 1.0 Min. :32.54

1st Qu.: 295.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:33.93

Median : 435.0 Median : 1166 Median : 409.0 Median :34.26

Mean : 537.9 Mean : 1425 Mean : 499.5 Mean :35.63

3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.:37.71

Max. :6445.0 Max. :35682 Max. :6082.0 Max. :41.95

Longitude

Min. :-124.3

1st Qu.:-121.8

Median :-118.5

Mean :-119.6

3rd Qu.:-118.0

Max. :-114.3

So far nothing looks crazy.1

1But, what’s up with the smallest census tracts being so small? We’d better hope that the

1

Partial preview of the text

Download Data Analysis Introduction to class 1, Exercises - Engineering and more Exercises Advanced Data Analysis in PDF only on Docsity!

Homework Assignment 1: What’s That Got to

Do with the Price of Condos in California?

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

The easiest way to load the data is with read.table, but you have to tell R that the first line names the variables:

calif = read.table("~/teaching/350/hw/06/cadata.dat",header=TRUE) dim(calif) [1] 20640 9 is.data.frame(calif) [1] TRUE summary(calif) MedianHouseValue MedianIncome MedianHouseAge TotalRooms Min. : 14999 Min. : 0.4999 Min. : 1.00 Min. : 2 1st Qu.:119600 1st Qu.: 2.5634 1st Qu.:18.00 1st Qu.: 1448 Median :179700 Median : 3.5348 Median :29.00 Median : 2127 Mean :206856 Mean : 3.8707 Mean :28.64 Mean : 2636 3rd Qu.:264725 3rd Qu.: 4.7432 3rd Qu.:37.00 3rd Qu.: 3148 Max. :500001 Max. :15.0001 Max. :52.00 Max. : TotalBedrooms Population Households Latitude Min. : 1.0 Min. : 3 Min. : 1.0 Min. :32. 1st Qu.: 295.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:33. Median : 435.0 Median : 1166 Median : 409.0 Median :34. Mean : 537.9 Mean : 1425 Mean : 499.5 Mean :35. 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.:37. Max. :6445.0 Max. :35682 Max. :6082.0 Max. :41. Longitude Min. :-124. 1st Qu.:-121. Median :-118. Mean :-119. 3rd Qu.:-118. Max. :-114.

So far nothing looks crazy.^1

(^1) But, what’s up with the smallest census tracts being so small? We’d better hope that the

Answer: See figures. I am not very satisfied with either of the 3D plots — it’s too hard to see what’s going on, except for the fact that the coastal areas are expensive, especially around San Francisco and LA — but maybe you can do better.
Answer:

fit = lm(MedianHouseValue ~ ., data=calif)

Handily, there is a function to extract the coefficients:

signif(coefficients(fit),3) (Intercept) MedianIncome MedianHouseAge TotalRooms -3.59e+06 4.02e+04 1.16e+03 -8.18e+ TotalBedrooms Population Households Latitude 1.13e+02 -3.85e+01 4.83e+01 -4.26e+ Longitude -4.28e+

where I’ve rounded to three significant digits.^2 The coefficients mean that house prices increase with income; that older housing is more expensive; that tracts with more rooms have lower prices, but ones with more bed- rooms have higher prices; that tracts with more people have lower prices, but ones with more families have higher prices. Finally, prices increase from north to south (towards the big cities), and from east to west (to- wards the coast). To get R^2 I use the summary function:

signif(summary(fit)$r.squared,digits=3) [1] 0.

Finally, the mean squared error:

signif(mean((residuals(fit))^2),digits=3) [1] 4.83e+

This is somewhat more comprehensible when we take the square root, so it’s on the same scale as the response variable (dollars, rather than dollars squared):

signif(sqrt(mean((residuals(fit))^2)),digits=3) [1] 69500

one household, the population 3, the 1 bedroom and 2 rooms all come from the same record. But in fact they don’t, and are almost certainly data-entry errors. For example, the row with TotalBedrooms==1 is 16172, which records a population of 13 in one household — with a house value of $500,001. This isn’t mathematically impossible, but it’s remarkably implausible. (^2) Why three? Because one expects the relative error in parametric estimates to be ∝ 1 /√n, and with n = 2 × 104 , this gives a relative error of about 1 in 100.

plot(calif$Longitude,calif$Latitude,pch=21, col=terrain.colors(11)[1+floor(calif$MedianHouseValue/50e3)], bg=terrain.colors(11)[1+floor(calif$MedianHouseValue/50e3)], cex=sqrt(calif$Population/median(calif$Population)), xlab="Longitude",ylab="Latitude",main="Median House Prices", sub="Circle area proportional to population") legend(x="topright",legend=(50*(1:11)),fill=terrain.colors(11))

Figure 2: Map with price indicated by color and population by the area of the circle. (To make area proportional to population, radius must be proportional to √ (^) population. Making radius proportional to population drastically exaggerates

the differences; see Huff’s How to Lie with Statistics.) This shows that the census tracts are not precisely equal in population, though the range doesn’t seem unreasonable.

plot(calif$Longitude,calif$Latitude,pch=21, col=grey(0:10/11)[1+floor(calif$MedianHouseValue/50e3)], bg=grey(0:10/11)[1+floor(calif$MedianHouseValue/50e3)], xlab="Longitude",ylab="Latitude",main="Median House Prices") legend(x="topright",legend=(50*(1:11)),fill=grey(0:10/11))

Figure 3: Greyscale. See help(grey) for the behavior of this color-making command, which differs from that of the terrain.colors family.

scatterplot3d(calif$Longitude,calif$Latitude,calif$MedianHouseValue, pch=20,scale.y=0.4,type="h",cex.symbol=0.01, xlab="Longitude",ylab="Latitude",zlab="Median House Value")

Figure 5: As in the previous plot, but with vertical lines drawn from the x − y plane to the appropriate z value.

Many people used instead the estimate of the random variance provided by the linear regression:

signif((summary(fit)$sigma)^2,digits=3) signif(summary(fit)$sigma,digits=3)

Strictly speaking, what summary(fit)$sigma gives is not the in-sample mean-squared error, but an estimate of what the generalization MSE should be,

M SÊ generalization = n n − (p + 1)

M SEin−sample

where p is the number of input variables to the regression. The logic here is the same as that for scaling up the sample variance by n/(n − 1) when estimaing the population variance: by estimating p + 1 parameters, we’ve used up p + 1 degrees of freedom, and really have only n − (p + 1) independent measurements of the residuals. With p = 8 and n = 20640, the difference is a factor of 1.000436, i.e, much too small to bother with.

Answer: To answer this, it’s not sufficient to just look at the size of the coefficients, because they’re being multiplied by features which are on very different scales! A reasonable thing to do to compare variables’ importance in a linear model is to multiply the coefficients by the standard deviations — the standard deviation is in some sense a typical or expected-sized change in the variable, so this compares the change in the prediction for typical changes in the inputs. (Of course there’s nothing magic about the standard deviation here, as opposed to say the inter-quartile range.) Let’s do that here:

input.sds = apply(calif[,-1],2,sd) signif(input.sds,3) MedianIncome MedianHouseAge TotalRooms TotalBedrooms 1.90 12.60 2180.00 421. Population Households Latitude Longitude 1130.00 382.00 2.14 2. prediction.changes = input.sds * abs(coefficients(fit)[-1]) signif(prediction.changes,3) MedianIncome MedianHouseAge TotalRooms TotalBedrooms 76500 14600 17800 47800 Population Households Latitude Longitude 43600 18500 90900 85800

(I exclude the first column of calif from calculating the standard devia- tions because that’s the response variable, and exclude the first coefficient of fit because that’s the intercept.) By this standard, the two most im- portant variables are latitude and longitude, followed by median income.

qqnorm(residuals(fit)) qqline(residuals(fit))

Figure 6: Q − Q plot for the residuals against a Gaussian distribution. If the residuals were Gaussian, they should fall approximately on the straight line.

plot(density(residuals(fit))) curve(dnorm(x,mean(residuals(fit)),sd(residuals(fit))),add=TRUE,lty=2)

Figure 7: Estimated probability density function of the residuals (solid line), overlaid with the pdf of a Gaussian with the same mean and variance (dashed line).

plot.calif(resid,cex=0.2,main="Residuals")

Figure 8: Map of the residuals. Reducing the point size keeps them from overlap- ping and clarifies the patterns. Notice that the darker colors (highest residuals) cluster around the coast, especially around the big cities.

plot.calif(resid,subset=(resid>0),cex=0.4,main="Positive-only residuals")

Figure 9: Positive residuals only. Fewer points so I can make each one larger.

plot.calif(abs(resid),cex=0.2,main="Residual magnitude")

Figure 11: Absolute value of residuals.

calif.plot.residuals <- function(i,f=1.06(nrow(calif)^(-0.2))) { x <- calif[,i+1] plot(x,resid,xlab=colnames(calif)[i+1],ylab="Residuals") lines(ksmooth(x,resid,bandwidth=sd(x)f)),col="red",lty=2) }

This is less than ideal programming practice, because it presumes the objects calif and resid exist and have the right properties, and will failure obscurely if they don’t, but it’s better than typing everything out longhand. The f argument gives the bandwidth in terms of the standard deviation. The default is a “reference rule” which works well when every- thing is Gaussian, but playing around with it shows that it doesn’t matter too much. There is a clear downward trend for the residuals against income; also a narrowing of the range of residuals. In fact, for all of the variables except house age, latitude and longitude, the range and variance of the residuals shrinks as we go to larger values of the feature. This means that the residuals are not independent of the features, as they should be if the linear model were correct.

The standard errors are

cbind(signif((summary(fit)$coef)[-1,2],3)) [,1] MedianIncome 335. MedianHouseAge 43. TotalRooms 0. TotalBedrooms 6. Population 1. Households 7. Latitude 673. Longitude 713.

The number of degres of freedom is 20631 = 20640 − (8 + 1) = n − (p + 1). So, if the standard formulas hold, we can get confidence intervals from the t distribution. These are

signif( cbind( (summary(fit)$coef)[-1,1] - qt(.975,20631) * (summary(fit)$coef)[-1,2], (summary(fit)$coef)[-1,1] + qt(.975,20631) * (summary(fit)$coef)[-1,2]) ,3)

[,1] [,2] MedianIncome 39600.00 40900. MedianHouseAge 1070.00 1240. TotalRooms -9.73 -6. TotalBedrooms 99.90 127.

Population -40.60 -36. Households 33.60 63. Latitude -43900.00 -41300. Longitude -44200.00 -41400.

But the residuals are not Gaussian, and they are not independent of the input variables, so the distributional assumptions for the confidence inter- val formulas aren’t met. We can’t say whether this makes them too big, too small, skewed to one side, etc.

Answer: The coefficients:

signif(coefficients(logfit),3) (Intercept) MedianIncome MedianHouseAge TotalRooms -1.18e+01 1.78e-01 3.26e-03 -3.19e- TotalBedrooms Population Households Latitude 4.80e-04 -1.72e-04 2.49e-04 -2.80e- Longitude -2.76e-

The MSE (and RMS error as well):

signif(summary(logfit)$sigma^2,3) [1] 0. signif(summary(logfit)$sigma,3) [1] 0.

Notice that this is on a log scale, so the claimed accuracy is to within a factor of exp(0.34) = 1.4, i.e., ±40%. Finally, R^2 :

signif(summary(logfit)$r.square,3) [1] 0.

One good reason to prefer the log-scale regression is that it always pre- dicts a positive price, while the untransformed regression predicts negative median prices for 99 census tracts. Even in the least appealing parts of California, even today, people will not actually pay you to take houses off their hands, so that suggests something is seriously wrong with the untransformed model. (See Figure 13.) A bad reason to prefer one regression over the other is to compare their R^2 or their MSEs; these are not comparable, because the response variables are different.

As you may have noticed, there are some weird features to the data. Median house prices never exceed $500,001, and there are 965 tracts with that house price. This is not because of an astonishing numerical coincidence, but because

plot(calif[,1],fitted(fit),xlab="Real price",ylab="Predicted price") abline(h=0,lty=2); abline(a=0,b=1,lty=2) plot(calif[,1],exp(fitted(logfit)),xlab="Real price",ylab="Predicted price") abline(a=0,b=1,lty=2)

Figure 13: Upper panel: actual median prices (horizontal axis) vs. those pre- dicted by the untransformed regression (vertical axis). The dashed horizontal line is a zero price, the dashed diagonal the x = y line. Lower panel: the same for the transformed regression; note that all predicted prices are positive.

Data Analysis Introduction to class 1, Exercises - Engineering, Exercises of Advanced Data Analysis

Related documents

Partial preview of the text

Download Data Analysis Introduction to class 1, Exercises - Engineering and more Exercises Advanced Data Analysis in PDF only on Docsity!

Homework Assignment 1: What’s That Got to

Do with the Price of Condos in California?

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS