




Advanced Data Analysis (Prof. Cosma Shalizi), Exercise Solutions: Fair's Affairs
library(AER)
data(Affairs)
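One optional step, not required by the problem, is to look at the structure of the data frame before fitting anything, to see which variables the discussion below refers to:

# Affairs: 601 survey respondents; affairs records the number of affairs in the past year
str(Affairs)
summary(Affairs$affairs)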
(a) When dealing with a counting variable Y with a known (not estimated) upper limit m, we can try to model it as having a binomial distribution, with m trials and some success probability p which we need to estimate. In binomial logistic regression, the log odds of success is still linear in the predictors, but we use this binomial distribution to build the likelihood, instead of the Bernoulli distribution we used for binary logistic regression. (Actually, binary is just the special case where m = 1.) In this case, we can roughly think of this as having started with 12 monthly, binary observations for each subject ("did they have an affair in January?", "did they have an affair in February?", etc.), and collapsing them by summing the observations. The fitted value from the logistic regression is then a prediction of the probability of having an affair in any given month. In R, we need to provide the glm() function with two columns of responses bound together: the first counts successes and the second counts failures. (There are multiple examples of this in the assigned reading from Faraway.)

log.r.num <- glm(cbind(affairs, 12 - affairs) ~ ., data = Affairs,
                 family = binomial)
summary(log.r.num)

Notice that setting the right-hand side of the formula, after the ~, to be just "." means "all the other variables in the data frame". (age) is significant with a negative coefficient, suggesting that older people are less likely to have affairs; specifically, every year of age lowers the predicted log odds of an affair in any given month by 0.043. (yearsmarried) is positively associated with affairs; each year married increases the monthly log odds by 0.15. (religiousness) has the second-largest-in-absolute-value coefficient; every step on the 5-point scale lowers the log odds by 0.44. (occupation) has a small but significant positive effect on affairs. (rating) has the largest-in-absolute-value coefficient, which is negative, i.e., people who give high ratings to their marriage are predicted to be unlikely to have affairs.
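One optional check, not part of the original writeup, is to put these effects on the odds scale rather than the log-odds scale by exponentiating the coefficients of the model fitted above:

# Multiplicative change in the monthly odds of an affair per unit of each predictor
round(exp(coef(log.r.num)), 3)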
high ratings to their marriage are predicted to be unlikely to have affairs. (b) Here we are just predicting the whether they had any affairs at all, and ignoring how many. We need to make sure now that the response variable is binary; here’s one way to do it: log.r = glm((affairs>0) ~ ., data = Affairs, family=binomial) summary(log.r) This logistic regression gives mostly the same conclusions as the last one, but with less statistical significance. (rating) and (religiousness) are still the two most important predictors, while (age) and (yearsmarried) are hardly significant. Additionally, (occupation) is not significant.
(a) For each individual, the first model estimates the probability of having an affair in a given month. Then the probability of having no affairs at all over 12 months is the probability of no affair in one month raised to the 12th power:

p.none.log.r.num <- (1 - fitted(log.r.num))^12

The second logistic regression directly fits the probability of having an affair over the whole year, so the probability of having no affair is estimated as 1 minus the probability of an affair:

p.none.log.r <- 1 - fitted(log.r)

(b)

plot(x = p.none.log.r.num, y = p.none.log.r,
     main = "Comparing estimated probabilities of no affairs",
     xlab = "First model; response as number of affairs",
     ylab = "Second model; response as affair/no-affair")
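A quick numerical comparison of the two sets of estimated no-affair probabilities complements the scatterplot; this check is an addition, using the vectors defined just above:

# How far apart are the two estimates of the probability of no affair, person by person?
summary(p.none.log.r.num - p.none.log.r)
cor(p.none.log.r.num, p.none.log.r)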
frequency.vs.probability <- function(p.lower, p.upper = p.lower + 0.1,
                                     fitted.probs, responses) {
  # Which cases have fitted probabilities in [p.lower, p.upper)?
  indices <- (fitted.probs >= p.lower) & (fitted.probs < p.upper)
  # Average predicted probability within the bracket
  ave.prob <- mean(fitted.probs[indices])
  # Binomial standard error for that average probability
  se <- sqrt(ave.prob * (1 - ave.prob) / sum(indices))
  # Observed frequency of the event within the bracket
  frequency <- mean(responses[indices])
  out <- list(frequency = frequency, ave.prob = ave.prob, se = se)
  return(out)
}
(a) Use the function on the lowest bracket of predicted probabilities.
frequency.vs.probability(p.lower = 0, p.upper = 0.1,
                         fitted.probs = 1 - p.none.log.r.num,
                         responses = (Affairs$affairs > 0))

$frequency
[1] 0

$ave.prob
[1] 0.

$se
[1] 0.

None of the individuals predicted to have less than 10% annual chance of an affair did so. On the other hand, one can check that there was only one such person!

(b) Again, take code from lecture 16:
f.vs.p <- sapply((0:9)/10, frequency.vs.probability,
                 fitted.probs = 1 - p.none.log.r.num,
                 responses = (Affairs$affairs > 0))
f.vs.p <- data.frame(frequency = unlist(f.vs.p["frequency", ]),
                     ave.prob = unlist(f.vs.p["ave.prob", ]),
                     se = unlist(f.vs.p["se", ]))

This applies the function to the probability brackets [0, 0.1), [0.1, 0.2), etc., through [0.9, 1.0), and then stores the results in a data frame. Now plot it:

plot(f.vs.p$ave.prob, f.vs.p$frequency, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probabilities", ylab = "Observed frequencies",
     main = "Calibration of first model (response as number of affairs)")
rug(1 - p.none.log.r.num, col = "grey")
abline(0, 1, col = "grey")
segments(x0 = f.vs.p$ave.prob, y0 = f.vs.p$ave.prob - 1.96 * f.vs.p$se,
         y1 = f.vs.p$ave.prob + 1.96 * f.vs.p$se)
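One optional side check, not part of the original code, is to count how many observations land in each bracket; sparse brackets are why some error bars in the calibration plots are so wide:

# Number of observations per predicted-probability bracket (same brackets as above)
table(cut(1 - p.none.log.r.num, breaks = (0:10)/10, right = FALSE))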
[Figure 2 appears about here: predicted probabilities on the horizontal axis, observed frequencies on the vertical axis.]
Figure 2: Calibration plot for the first model. Dots show the actual, observed frequency of affairs in each bracket of predicted probabilities; vertical lines are approximate 95% sampling intervals around the predicted probability.
(c) The only thing we have to change is the vector of fitted probabilities.
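The code for this step is not reproduced in what survives of the writeup; a minimal sketch, reusing the pattern above but with the second model's fitted probabilities (the name f.vs.p2 is introduced here for illustration), would be the following. Its output corresponds to the plot the discussion below calls Figure 3.

# Calibration of the second model: same brackets, fitted probabilities from log.r
f.vs.p2 <- sapply((0:9)/10, frequency.vs.probability,
                  fitted.probs = 1 - p.none.log.r,
                  responses = (Affairs$affairs > 0))
f.vs.p2 <- data.frame(frequency = unlist(f.vs.p2["frequency", ]),
                      ave.prob = unlist(f.vs.p2["ave.prob", ]),
                      se = unlist(f.vs.p2["se", ]))
plot(f.vs.p2$ave.prob, f.vs.p2$frequency, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probabilities", ylab = "Observed frequencies",
     main = "Calibration of second model (response as affair/no-affair)")
rug(1 - p.none.log.r, col = "grey")
abline(0, 1, col = "grey")
segments(x0 = f.vs.p2$ave.prob, y0 = f.vs.p2$ave.prob - 1.96 * f.vs.p2$se,
         y1 = f.vs.p2$ave.prob + 1.96 * f.vs.p2$se)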
Comparing the two calibration plots: in most brackets for the first model, the actual frequency is much lower than the predicted probability, so that model strongly over-estimates the probability of having at least one affair. The second model does rather better, and seems to be at least roughly calibrated. (Notice that the error bars on the far right in Figure 3 are very wide, because there are very few observations out there.) It is important to remember that while we want our models to be calibrated, calibration is not enough. For example, a model which predicted a 24.9% probability of anyone having an affair, no matter what, would be calibrated (since that is the actual frequency of having at least one affair), but would clearly miss a lot.
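The 24.9% figure is just the overall frequency of affairs in the sample, which can be checked in one line (an optional check, not in the original writeup):

# Overall fraction of respondents reporting at least one affair (about 0.249)
mean(Affairs$affairs > 0)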
(a) The data is probably better than nothing, but it consists of (i) self-reports about (ii) a sensitive topic which people often lie about, collected from (iii) a self-selected subset (survey-answerers) of (iv) a non-random convenience sample (readers of Psychology Today). Moreover, it was collected over forty years ago, in a very different social context. In particular, in the late 1960s and early 1970s it became vastly easier to get a legal divorce, which presumably changed the incentives to stay married to one person but have sex with someone else.
(b) If it comes down to a choice between these two models, the second (binary response) one is at least pretty well calibrated, while the first is not. The fact that the variables can all be seen as categorical (some, more precisely, as ordinal) suggests using conditional density estimation with appropriate kernels (as in Lecture 6). Doing so^1 produces a model more accurate than either logistic regression. However, the only variables it says are relevant are religiousness and the rating of the marriage.
(^1) See Qi Li and Jeff Racine, “Predictor Relevance and Extramarital Affairs”, Journal of Applied Econometrics 19 (2004): 533–535; on the class website. You can reproduce their analysis using npcdens in the np package.
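A minimal sketch of such a fit with the np package, using only the two predictors the text singles out (this simplified specification is an assumption, not Li and Racine's exact model, and the variable name ever is introduced here), might look like:

library(np)   # nonparametric kernel methods, including npcdensbw() and npcdens()
# Indicator of having had any affair, as a factor response
Affairs$ever <- factor(Affairs$affairs > 0)
# Bandwidth selection for the conditional density of ever given two ordinal predictors
# (cross-validated bandwidth selection can take a little while to run)
affair.bw <- npcdensbw(ever ~ ordered(religiousness) + ordered(rating),
                       data = Affairs)
affair.cdens <- npcdens(affair.bw)
summary(affair.cdens)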