













Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
Data Analysis Introduction to class 1, Exercises - Engineering, Advanced Data Analysis
Typology: Exercises
1 / 21
This page cannot be seen from the preview
Don't miss anything!
The easiest way to load the data is with read.table, but you have to tell R that the first line names the variables:
calif = read.table("~/teaching/350/hw/06/cadata.dat",header=TRUE) dim(calif) [1] 20640 9 is.data.frame(calif) [1] TRUE summary(calif) MedianHouseValue MedianIncome MedianHouseAge TotalRooms Min. : 14999 Min. : 0.4999 Min. : 1.00 Min. : 2 1st Qu.:119600 1st Qu.: 2.5634 1st Qu.:18.00 1st Qu.: 1448 Median :179700 Median : 3.5348 Median :29.00 Median : 2127 Mean :206856 Mean : 3.8707 Mean :28.64 Mean : 2636 3rd Qu.:264725 3rd Qu.: 4.7432 3rd Qu.:37.00 3rd Qu.: 3148 Max. :500001 Max. :15.0001 Max. :52.00 Max. : TotalBedrooms Population Households Latitude Min. : 1.0 Min. : 3 Min. : 1.0 Min. :32. 1st Qu.: 295.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:33. Median : 435.0 Median : 1166 Median : 409.0 Median :34. Mean : 537.9 Mean : 1425 Mean : 499.5 Mean :35. 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.:37. Max. :6445.0 Max. :35682 Max. :6082.0 Max. :41. Longitude Min. :-124. 1st Qu.:-121. Median :-118. Mean :-119. 3rd Qu.:-118. Max. :-114.
So far nothing looks crazy.^1
(^1) But, what’s up with the smallest census tracts being so small? We’d better hope that the
fit = lm(MedianHouseValue ~ ., data=calif)
Handily, there is a function to extract the coefficients:
signif(coefficients(fit),3) (Intercept) MedianIncome MedianHouseAge TotalRooms -3.59e+06 4.02e+04 1.16e+03 -8.18e+ TotalBedrooms Population Households Latitude 1.13e+02 -3.85e+01 4.83e+01 -4.26e+ Longitude -4.28e+
where I’ve rounded to three significant digits.^2 The coefficients mean that house prices increase with income; that older housing is more expensive; that tracts with more rooms have lower prices, but ones with more bed- rooms have higher prices; that tracts with more people have lower prices, but ones with more families have higher prices. Finally, prices increase from north to south (towards the big cities), and from east to west (to- wards the coast). To get R^2 I use the summary function:
signif(summary(fit)$r.squared,digits=3) [1] 0.
Finally, the mean squared error:
signif(mean((residuals(fit))^2),digits=3) [1] 4.83e+
This is somewhat more comprehensible when we take the square root, so it’s on the same scale as the response variable (dollars, rather than dollars squared):
signif(sqrt(mean((residuals(fit))^2)),digits=3) [1] 69500
one household, the population 3, the 1 bedroom and 2 rooms all come from the same record. But in fact they don’t, and are almost certainly data-entry errors. For example, the row with TotalBedrooms==1 is 16172, which records a population of 13 in one household — with a house value of $500,001. This isn’t mathematically impossible, but it’s remarkably implausible. (^2) Why three? Because one expects the relative error in parametric estimates to be ∝ 1 /√n, and with n = 2 × 104 , this gives a relative error of about 1 in 100.
plot(calif$Longitude,calif$Latitude,pch=21, col=terrain.colors(11)[1+floor(calif$MedianHouseValue/50e3)], bg=terrain.colors(11)[1+floor(calif$MedianHouseValue/50e3)], cex=sqrt(calif$Population/median(calif$Population)), xlab="Longitude",ylab="Latitude",main="Median House Prices", sub="Circle area proportional to population") legend(x="topright",legend=(50*(1:11)),fill=terrain.colors(11))
Figure 2: Map with price indicated by color and population by the area of the circle. (To make area proportional to population, radius must be proportional to √ (^) population. Making radius proportional to population drastically exaggerates
the differences; see Huff’s How to Lie with Statistics.) This shows that the census tracts are not precisely equal in population, though the range doesn’t seem unreasonable.
plot(calif$Longitude,calif$Latitude,pch=21, col=grey(0:10/11)[1+floor(calif$MedianHouseValue/50e3)], bg=grey(0:10/11)[1+floor(calif$MedianHouseValue/50e3)], xlab="Longitude",ylab="Latitude",main="Median House Prices") legend(x="topright",legend=(50*(1:11)),fill=grey(0:10/11))
Figure 3: Greyscale. See help(grey) for the behavior of this color-making command, which differs from that of the terrain.colors family.
scatterplot3d(calif$Longitude,calif$Latitude,calif$MedianHouseValue, pch=20,scale.y=0.4,type="h",cex.symbol=0.01, xlab="Longitude",ylab="Latitude",zlab="Median House Value")
Figure 5: As in the previous plot, but with vertical lines drawn from the x − y plane to the appropriate z value.
Many people used instead the estimate of the random variance provided by the linear regression:
signif((summary(fit)$sigma)^2,digits=3) signif(summary(fit)$sigma,digits=3)
Strictly speaking, what summary(fit)$sigma gives is not the in-sample mean-squared error, but an estimate of what the generalization MSE should be,
M SÊ generalization = n n − (p + 1)
M SEin−sample
where p is the number of input variables to the regression. The logic here is the same as that for scaling up the sample variance by n/(n − 1) when estimaing the population variance: by estimating p + 1 parameters, we’ve used up p + 1 degrees of freedom, and really have only n − (p + 1) independent measurements of the residuals. With p = 8 and n = 20640, the difference is a factor of 1.000436, i.e, much too small to bother with.
input.sds = apply(calif[,-1],2,sd) signif(input.sds,3) MedianIncome MedianHouseAge TotalRooms TotalBedrooms 1.90 12.60 2180.00 421. Population Households Latitude Longitude 1130.00 382.00 2.14 2. prediction.changes = input.sds * abs(coefficients(fit)[-1]) signif(prediction.changes,3) MedianIncome MedianHouseAge TotalRooms TotalBedrooms 76500 14600 17800 47800 Population Households Latitude Longitude 43600 18500 90900 85800
(I exclude the first column of calif from calculating the standard devia- tions because that’s the response variable, and exclude the first coefficient of fit because that’s the intercept.) By this standard, the two most im- portant variables are latitude and longitude, followed by median income.
qqnorm(residuals(fit)) qqline(residuals(fit))
Figure 6: Q − Q plot for the residuals against a Gaussian distribution. If the residuals were Gaussian, they should fall approximately on the straight line.
plot(density(residuals(fit))) curve(dnorm(x,mean(residuals(fit)),sd(residuals(fit))),add=TRUE,lty=2)
Figure 7: Estimated probability density function of the residuals (solid line), overlaid with the pdf of a Gaussian with the same mean and variance (dashed line).
plot.calif(resid,cex=0.2,main="Residuals")
Figure 8: Map of the residuals. Reducing the point size keeps them from overlap- ping and clarifies the patterns. Notice that the darker colors (highest residuals) cluster around the coast, especially around the big cities.
plot.calif(resid,subset=(resid>0),cex=0.4,main="Positive-only residuals")
Figure 9: Positive residuals only. Fewer points so I can make each one larger.
plot.calif(abs(resid),cex=0.2,main="Residual magnitude")
Figure 11: Absolute value of residuals.
calif.plot.residuals <- function(i,f=1.06(nrow(calif)^(-0.2))) { x <- calif[,i+1] plot(x,resid,xlab=colnames(calif)[i+1],ylab="Residuals") lines(ksmooth(x,resid,bandwidth=sd(x)f)),col="red",lty=2) }
This is less than ideal programming practice, because it presumes the objects calif and resid exist and have the right properties, and will failure obscurely if they don’t, but it’s better than typing everything out longhand. The f argument gives the bandwidth in terms of the standard deviation. The default is a “reference rule” which works well when every- thing is Gaussian, but playing around with it shows that it doesn’t matter too much. There is a clear downward trend for the residuals against income; also a narrowing of the range of residuals. In fact, for all of the variables except house age, latitude and longitude, the range and variance of the residuals shrinks as we go to larger values of the feature. This means that the residuals are not independent of the features, as they should be if the linear model were correct.
cbind(signif((summary(fit)$coef)[-1,2],3)) [,1] MedianIncome 335. MedianHouseAge 43. TotalRooms 0. TotalBedrooms 6. Population 1. Households 7. Latitude 673. Longitude 713.
The number of degres of freedom is 20631 = 20640 − (8 + 1) = n − (p + 1). So, if the standard formulas hold, we can get confidence intervals from the t distribution. These are
signif( cbind( (summary(fit)$coef)[-1,1] - qt(.975,20631) * (summary(fit)$coef)[-1,2], (summary(fit)$coef)[-1,1] + qt(.975,20631) * (summary(fit)$coef)[-1,2]) ,3)
[,1] [,2] MedianIncome 39600.00 40900. MedianHouseAge 1070.00 1240. TotalRooms -9.73 -6. TotalBedrooms 99.90 127.
Population -40.60 -36. Households 33.60 63. Latitude -43900.00 -41300. Longitude -44200.00 -41400.
But the residuals are not Gaussian, and they are not independent of the input variables, so the distributional assumptions for the confidence inter- val formulas aren’t met. We can’t say whether this makes them too big, too small, skewed to one side, etc.
signif(coefficients(logfit),3) (Intercept) MedianIncome MedianHouseAge TotalRooms -1.18e+01 1.78e-01 3.26e-03 -3.19e- TotalBedrooms Population Households Latitude 4.80e-04 -1.72e-04 2.49e-04 -2.80e- Longitude -2.76e-
The MSE (and RMS error as well):
signif(summary(logfit)$sigma^2,3) [1] 0. signif(summary(logfit)$sigma,3) [1] 0.
Notice that this is on a log scale, so the claimed accuracy is to within a factor of exp(0.34) = 1.4, i.e., ±40%. Finally, R^2 :
signif(summary(logfit)$r.square,3) [1] 0.
One good reason to prefer the log-scale regression is that it always pre- dicts a positive price, while the untransformed regression predicts negative median prices for 99 census tracts. Even in the least appealing parts of California, even today, people will not actually pay you to take houses off their hands, so that suggests something is seriously wrong with the untransformed model. (See Figure 13.) A bad reason to prefer one regression over the other is to compare their R^2 or their MSEs; these are not comparable, because the response variables are different.
As you may have noticed, there are some weird features to the data. Median house prices never exceed $500,001, and there are 965 tracts with that house price. This is not because of an astonishing numerical coincidence, but because
plot(calif[,1],fitted(fit),xlab="Real price",ylab="Predicted price") abline(h=0,lty=2); abline(a=0,b=1,lty=2) plot(calif[,1],exp(fitted(logfit)),xlab="Real price",ylab="Predicted price") abline(a=0,b=1,lty=2)
Figure 13: Upper panel: actual median prices (horizontal axis) vs. those pre- dicted by the untransformed regression (vertical axis). The dashed horizontal line is a zero price, the dashed diagonal the x = y line. Lower panel: the same for the transformed regression; note that all predicted prices are positive.