
Homework Assignment 8: Fair’s Affairs

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

library(AER)
data(Affairs)

1. Answer:

(a) When dealing with a count variable Y with a known (not estimated) upper limit m, we can try to model it as having a binomial distribution, with m trials and some success probability p which we need to estimate. In binomial logistic regression, the log odds of success is still linear in the predictors, but we use this binomial distribution to build the likelihood, instead of the Bernoulli distribution we used for binary logistic regression. (Actually, binary is just the special case where m = 1.) In this case, we can roughly think of this as having started with 12 monthly, binary observations for each subject ("did they have an affair in January?", "did they have an affair in February?", etc.), and collapsing them by summing the observations. The fitted value from the logistic regression is then a prediction of the probability of having an affair in any given month.

In R, we need to provide the glm() function with two columns of responses bound together: the first counts successes and the second counts failures. (There are multiple examples of this in the assigned reading from Faraway.)

log.r.num = glm(cbind(affairs, 12-affairs) ~ ., data=Affairs, family=binomial)
summary(log.r.num)

Notice that setting the right-hand side of the formula, after the ~, to be just "." means "all the other variables in the data frame".

(age) is significant with a negative coefficient, suggesting that older people are less likely to have affairs; specifically, every year of age lowers the predicted log odds of an affair in any given month by 0.043. (yearsmarried) is positively associated with affairs; each year married increases the monthly log odds by 0.15. (religiousness) has the second-largest-in-absolute-value coefficient; every step on the 5-point scale lowers the log odds by 0.44. (occupation) has a small but significant positive effect on affairs. (rating) has the largest-in-absolute-value coefficient, which is negative, i.e., people who give high ratings to their marriage are predicted to be unlikely to have affairs.

(b) Here we are just predicting whether they had any affairs at all, and ignoring how many. We need to make sure that the response variable is binary; here's one way to do it:

log.r = glm((affairs > 0) ~ ., data=Affairs, family=binomial)
summary(log.r)

This logistic regression gives mostly the same conclusions as the last one, but with less statistical significance. (rating) and (religiousness) are still the two most important predictors, while (age) and (yearsmarried) are only marginally significant. Additionally, (occupation) is not significant.
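The coefficients above are on the log-odds scale; a quick way to read them as multiplicative effects on the odds (this check is an addition, not part of the original solution) is to exponentiate them:

exp(coef(log.r.num))  # multiplicative effect of each predictor on the monthly odds of an affair
exp(coef(log.r))      # same, for the odds of having any affair during the year

For instance, exp(-0.043) is about 0.96, so each additional year of age multiplies the predicted monthly odds of an affair by roughly 0.96.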

2. Answer: There is some variation between models both in coefficient sign and in significance levels. For example, (childrenyes) and (education) had negative coefficients in the first model and positive coefficients in the second one, though none of these four coefficients was significant. Perhaps the largest single difference is in the evaluation of whether (occupation) is significant; the first model rates it highly significant while the second model finds it not significant at all.

Differences between these models arise because they are being asked to predict different, though related, things. Variables which are unimportant for whether someone has an affair could be important in determining how many affairs they have, if they have any. It is common to be more confident about the importance of variables which show up as significant in multiple related models (here, say, the rating of the marriage), as these variables are "robust with respect to model specification". This seems reasonable, but it is hard to make the argument precise.
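Since both models were fit with the formula ~ ., their coefficient vectors share the same names, so the differences can be lined up side by side. A small sketch of such a comparison (this table is an addition, not part of the original solution):

round(cbind(count.model = coef(log.r.num),
            binary.model = coef(log.r)), 3)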
3. Answer:

(a) For each individual, the first model estimates the probability of having an affair in a given month. Then the probability of having no affairs at all over 12 months is the probability of no affair in one month raised to the 12th power:

p.none.log.r.num = (1 - fitted(log.r.num))^12

The second logistic regression directly fits the probability of having an affair over the whole year, so the probability of having no affair is estimated as 1 minus the probability of an affair:

p.none.log.r = 1 - fitted(log.r)

(b)

plot(x=p.none.log.r.num, y=p.none.log.r,
     main="Comparing estimated probabilities of no affairs",
     xlab="First model; response as number of affairs",
     ylab="Second model; response as affair/no-affair")
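One optional line, not in the original code, makes the comparison plot easier to read by marking where the two models would agree exactly:

abline(0, 1, col="grey")  # the diagonal: points here get the same probability from both models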

4. Answer: Take the following function from the notes for lecture 16:

# Inputs: lower and upper limits of the probability range to check
#         vector of predicted probabilities
#         vector of binary success/fail responses
# Presumes: fitted.probs and responses have the same length
#           fitted.probs and responses are in the same order
#           fitted.probs contains probabilities for the type of event in responses
# Outputs: actual frequency of positive responses
#          average of predicted probabilities
#          rough standard error for predicted probabilities
frequency.vs.probability <- function(p.lower, p.upper=p.lower+0.1,
                                     fitted.probs, responses) {
  # Which rows of the data have fitted probabilities in our range of interest?
  indices <- (fitted.probs >= p.lower) & (fitted.probs < p.upper)
  # For plotting purposes, what's our average predicted probability over this
  # range?
  ave.prob <- mean(fitted.probs[indices])
  # How big a standard error should we see for binomial data with this
  # predicted probability?
  # Rough calculation; could improve by taking account of variation in
  # predicted probabilities; not reliable if the difference between p.lower
  # and p.upper is substantial
  se <- sqrt(ave.prob*(1-ave.prob)/sum(indices))
  # How often does the event actually happen?
  # Presumes data is either 0/1 or TRUE/FALSE valued
  frequency <- mean(responses[indices])
  out <- list(frequency=frequency, ave.prob=ave.prob, se=se)
  return(out)
}

(a) Use the function:

frequency.vs.probability(p.lower=0, p.upper=0.1,
                         fitted.probs=1-p.none.log.r.num,
                         responses=(Affairs$affairs > 0))

$frequency
[1] 0

$ave.prob
[1] 0.

$se
[1] 0.

None of the individuals predicted to have less than a 10% annual chance of an affair did so. On the other hand, one can check that there was only one such person!
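That count is easy to verify (this one-line check is an addition, not part of the original solution):

sum((1 - p.none.log.r.num) < 0.1)  # number of people with predicted annual probability below 0.1; returns 1 here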

(b) Again, take code from lecture 16:

f.vs.p <- sapply((0:9)/10, frequency.vs.probability,
                 fitted.probs=1-p.none.log.r.num,
                 responses=(Affairs$affairs > 0))
f.vs.p <- data.frame(frequency=unlist(f.vs.p["frequency",]),
                     ave.prob=unlist(f.vs.p["ave.prob",]),
                     se=unlist(f.vs.p["se",]))

This applies the function to the probability brackets [0, 0.1), [0.1, 0.2), etc., through [0.9, 1.0), and then stores the results in a data frame. Now plot it:

plot(f.vs.p$ave.prob, f.vs.p$frequency, xlim=c(0,1), ylim=c(0,1),
     xlab="Predicted probabilities", ylab="Observed frequencies",
     main="Calibration of first model (response as number of affairs)")
rug(1-p.none.log.r.num, col="grey")
abline(0, 1, col="grey")
segments(x0=f.vs.p$ave.prob, y0=f.vs.p$ave.prob-1.96*f.vs.p$se,
         y1=f.vs.p$ave.prob+1.96*f.vs.p$se)

Figure 2: Calibration plot for the first model. Dots show the actual, observed frequency of affairs in each bracket of predicted probabilities; vertical lines are approximate 95% sampling intervals around the predicted probability.

(c) The only thing we have to change is the vector of fitted probabilities.
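Concretely, the repeated code mirrors part (b), with the second model's fitted probabilities substituted in (f.vs.p.2 is a name introduced here; this spelled-out version is an addition, not part of the original text):

f.vs.p.2 <- sapply((0:9)/10, frequency.vs.probability,
                   fitted.probs=1-p.none.log.r,
                   responses=(Affairs$affairs > 0))
f.vs.p.2 <- data.frame(frequency=unlist(f.vs.p.2["frequency",]),
                       ave.prob=unlist(f.vs.p.2["ave.prob",]),
                       se=unlist(f.vs.p.2["se",]))
plot(f.vs.p.2$ave.prob, f.vs.p.2$frequency, xlim=c(0,1), ylim=c(0,1),
     xlab="Predicted probabilities", ylab="Observed frequencies",
     main="Calibration of second model (response as affair/no-affair)")
rug(1-p.none.log.r, col="grey")
abline(0, 1, col="grey")
segments(x0=f.vs.p.2$ave.prob, y0=f.vs.p.2$ave.prob-1.96*f.vs.p.2$se,
         y1=f.vs.p.2$ave.prob+1.96*f.vs.p.2$se)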

For the first model, the actual frequency is much lower than the predicted probability; it strongly over-estimates the probability of having at least one affair. The second model does rather better, and seems to be at least roughly calibrated. (Notice that the error bars on the far right in Figure 3 are very wide, because there are very few observations out there.)

It is important to remember that while we want our models to be calibrated, calibration is not enough. For example, a model which predicted a 24.9% probability of anyone having an affair, no matter what, would be calibrated (since that is the actual frequency of having at least one affair), but would clearly miss a lot.
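The 24.9% figure is just the overall frequency of affairs in the data, which can be confirmed directly (this check is an addition, not part of the original solution):

mean(Affairs$affairs > 0)  # fraction reporting at least one affair, about 0.249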

5. Answer: Two variables absolutely must be treated as categorical, sex and childrenyes. However, coding them as 0 or 1 and treating them as numerical values has the same effect as treating them as categorical variables and introducing dummy/indicator variables.

Three variables, religiousness, occupation, and the rating of the marriage, while coded with numbers, cannot be regarded as measured on a linear scale. That is, we have no reason to think that the difference between a religiousness of 5 ("very religious") and of 4 ("somewhat religious") is as big as the difference between a 2 ("not at all religious") and a 1 ("anti-religious"). These are ordinal variables, because they do fall in a definite order, but treating them the same as age or the number of years married is dubious. Using dummies for these categories seems more appropriate.

Close examination of the table shows that while age, number of years married, years of education, and indeed even the number of affairs are all, in principle, numerical variables, they have each been coded into a few values. So when we see "7" for the number of years someone has been married, that could mean anything from 6 to 8 years, while "15" for that variable could mean 12, 13, or 40 or more. Treating these as ordinary numbers which we can do arithmetic on is somewhat dubious, and it might be better to treat them, too, as ordinal variables.

So, in the end, we can make a case that all of the variables in this data set should be treated as categorical.
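As an illustrative sketch of the all-categorical approach (an addition, not the original author's code; log.r.cat is a name introduced here, and the column names assumed are those of the AER Affairs data frame), one can wrap each predictor in factor(), so every coded value gets its own dummy variable:

# Binary logistic regression with every predictor treated as categorical.
# factor() is redundant for columns that are already factors (gender, children)
# but harmless, and it converts the numerically-coded ordinal variables.
log.r.cat <- glm((affairs > 0) ~ factor(gender) + factor(age)
                 + factor(yearsmarried) + factor(children)
                 + factor(religiousness) + factor(education)
                 + factor(occupation) + factor(rating),
                 data = Affairs, family = binomial)
summary(log.r.cat)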
6. Answer:

(a) The data is probably better than nothing, but it consists of (i) self-reports about (ii) a sensitive topic which people often lie about, collected from (iii) a self-selected subset (survey-answerers) of (iv) a non-random convenience sample (readers of Psychology Today). Moreover, it was collected over forty years ago, in a very different social context. In particular, in the late 1960s and early 1970s it became vastly easier to get a legal divorce, which presumably changed the incentives to stay married to one person but have sex with someone else.

(b) If it comes down to a choice between these two models, the second (binary response) one is at least pretty well calibrated, while the first is not. The fact that the variables can all be seen as categorical (some, more precisely, as ordinal) suggests using conditional density estimation with appropriate kernels (as in Lecture 6). Doing so^1 produces a model more accurate than either logistic regression. However, the only variables it says are relevant are religiousness and the rating of the marriage.
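A rough sketch of how such a fit might begin (a hedged illustration, not Li and Racine's full specification: only the two predictors the text identifies as relevant are included, and the variable treatments and object names are assumptions introduced here):

library(np)
# Copy of the data with the response and the two relevant predictors
# treated as ordered categorical variables
Affairs.np <- Affairs
Affairs.np$affairs <- ordered(Affairs.np$affairs)
Affairs.np$religiousness <- ordered(Affairs.np$religiousness)
Affairs.np$rating <- ordered(Affairs.np$rating)
# Cross-validated bandwidth selection, then conditional density estimation
affair.bw <- npcdensbw(affairs ~ religiousness + rating, data = Affairs.np)
affair.cdens <- npcdens(affair.bw)
summary(affair.cdens)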

(^1) See Qi Li and Jeff Racine, “Predictor Relevance and Extramarital Affairs”, Journal of Applied Econometrics 19 (2004): 533–535; on the class website. You can reproduce their analysis using npcdens in the np package.