Estimating Variable Dependence in DAGs: Cancer & Risk Factors Case Study, Exercises of Advanced Data Analysis

Solutions to homework assignment 10 in the Advanced Data Analysis course, focusing on determining the dependence and independence of variables in directed acyclic graphs (DAGs). The case study revolves around the relationship between cancer and risk factors such as smoking, asbestos, and dental care, and how to estimate the conditional risk of cancer given these factors using logistic regression models.

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

Uploaded by bridge

Homework Assignment 10: Estimating with DAGs
36-402, Advanced Data Analysis, Spring 2011
Solutions
1. (a) Answer:

   Variable          Parents
   cancer            cellular damage
   cellular damage   tar, asbestos
   tar               smoking
   teeth             smoking, dental care
   dental care       occupation
   smoking           occupation
   asbestos          occupation
   occupation        None

   (b) Answer:

   Variable          Children
   cancer            None
   cellular          cancer
   tar               cellular
   teeth             None
   dental            teeth
   smoking           tar, teeth
   asbestos          cellular
   occupation        asbestos, smoking, dental

2. Answer: In any graphical model, the joint distribution “factors according to the graph”:

$$p(X_1, X_2, \ldots, X_p) = \prod_{i=1}^{p} p(X_i \mid X_{\mathrm{parents}(i)})$$

where if $X_i$ has no parents, we read $p(X_i \mid X_{\mathrm{parents}(i)})$ as just the marginal distribution $p(X_i)$. Here, there is only one variable with no parents, “occupation”, so we start with its marginal distribution: p(occupation). Then we need the conditional distributions of its children, p(dental|occupation), p(smoking|occupation), and p(asbestos|occupation). Next we move on to the children of these variables: p(tar|smoking) and p(teeth|smoking, dental).

(Notice that since “teeth” has two parents, we need to condition on both of them.) Then p(cellular|asbestos, tar), and finally p(cancer|cellular).
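The same factorization can be generated mechanically from the parent table in 1(a). The following is a minimal R sketch, not part of the original solutions: the `parents` list and the helper `factor_term` are names introduced here for illustration, and "cellular" and "dental" abbreviate "cellular damage" and "dental care".

```r
# Parent sets transcribed from the table in 1(a), listed parents-first.
parents <- list(
  occupation = character(0),
  dental     = "occupation",
  smoking    = "occupation",
  asbestos   = "occupation",
  tar        = "smoking",
  teeth      = c("smoking", "dental"),
  cellular   = c("asbestos", "tar"),
  cancer     = "cellular"
)

# One factor per variable: p(X) if X has no parents,
# otherwise p(X | parents of X). (Hypothetical helper.)
factor_term <- function(v, pa) {
  if (length(pa) == 0) {
    sprintf("p(%s)", v)
  } else {
    sprintf("p(%s|%s)", v, paste(pa, collapse = ","))
  }
}

terms <- mapply(factor_term, names(parents), parents, USE.NAMES = FALSE)

# The joint distribution is the product of these factors.
cat(paste(terms, collapse = " * "), "\n")
```

Each variable contributes exactly one factor, so a variable with two parents, such as teeth, shows up as a single term conditioned on both.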

3. (a) Answer: Teeth and cancer share occupation and smoking as ancestors, so they are dependent. All paths from teeth to cancer have a positive product of signs, so these two are positively associated.

   (b) Answer: Still dependent (unblocked path through dental to occupation to asbestos, positive product of signs).

   (c) Answer: Dependent, unblocked path through smoking, positive association.

   (d) Answer: All paths are blocked, therefore they are independent and there is no association.

   (e) Answer: Dependent, path from smoking to occupation to asbestos to cancer, positive sign.

   (f) Answer: Independent (therefore no association), as conditioning on cellular damage blocks the asbestos ← occupation → smoking → tar → damage path.

   (g) Answer: The path through tar and cellular damage is open, so dependent, with positive association.

   (h) Answer: Conditioning on a common effect makes them dependent, and would produce a negative association (see the first full paragraph on page 8, lecture 21), but they are already dependent with a positive association from occupational prestige, so while they are dependent the sign of the association is indeterminate.

   (i) Answer: Dependent, positive (only the straight path from tar to cancer is unblocked; teeth is a collider, and conditioning on it activates it, but all paths which involve it are blocked by conditioning on the non-colliders at smoking and asbestos).

   (j) Answer: Dependent, positive. Conditioning on occupation blocks all paths from smoking to cancer except smoking → tar → damage → cancer.
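Path-blocking arguments like these can be checked mechanically. The sketch below, in base R, tests d-separation with the standard moralized-ancestral-graph construction: restrict to the ancestors of the variables involved, link parents that share a child, drop arrow directions, delete the conditioning set, and see whether the two variables are still connected. It is an illustration under assumptions, not the method used in the solutions: the function names `ancestors` and `d_separated` are introduced here, the graph is the one from the table in 1(a), and the conditioning sets in the example queries are chosen for illustration rather than copied from the problem statement.

```r
# DAG from problem 1(a) as a named list of parent sets.
parents <- list(
  occupation = character(0),
  dental = "occupation", smoking = "occupation", asbestos = "occupation",
  tar = "smoking", teeth = c("smoking", "dental"),
  cellular = c("asbestos", "tar"), cancer = "cellular"
)

# All ancestors of the given nodes, including the nodes themselves.
ancestors <- function(nodes, parents) {
  seen <- character(0)
  todo <- nodes
  while (length(todo) > 0) {
    v <- todo[1]
    todo <- todo[-1]
    if (!(v %in% seen)) {
      seen <- c(seen, v)
      todo <- c(todo, parents[[v]])
    }
  }
  seen
}

# TRUE if x and y are d-separated given the set z.
d_separated <- function(x, y, z, parents) {
  keep <- ancestors(c(x, y, z), parents)
  adj <- matrix(FALSE, length(keep), length(keep), dimnames = list(keep, keep))
  for (v in keep) {
    pa <- parents[[v]]
    for (p in pa) {            # undirected parent-child links
      adj[p, v] <- TRUE
      adj[v, p] <- TRUE
    }
    if (length(pa) > 1) {      # moralize: link parents sharing a child
      pairs <- combn(pa, 2)
      for (k in seq_len(ncol(pairs))) {
        adj[pairs[1, k], pairs[2, k]] <- TRUE
        adj[pairs[2, k], pairs[1, k]] <- TRUE
      }
    }
  }
  # Search outward from x, never passing through a node in z.
  reached <- character(0)
  todo <- x
  while (length(todo) > 0) {
    v <- todo[1]
    todo <- todo[-1]
    if (v %in% reached || v %in% z) next
    reached <- c(reached, v)
    todo <- c(todo, keep[adj[v, ]])
  }
  !(y %in% reached)
}

none <- character(0)
d_separated("teeth", "cancer", none, parents)                        # FALSE: dependent
d_separated("teeth", "cancer", "smoking", parents)                   # FALSE: path via dental, occupation, asbestos
d_separated("teeth", "cancer", c("smoking", "occupation"), parents)  # TRUE: all paths blocked
d_separated("smoking", "cancer", "occupation", parents)              # FALSE: smoking -> tar -> damage -> cancer
d_separated("smoking", "asbestos", "occupation", parents)            # TRUE
d_separated("smoking", "asbestos", c("occupation", "cellular"), parents)  # FALSE: collider opened
```

The last two queries show the collider effect: smoking and asbestos are separated by occupation alone, but additionally conditioning on their common effect, cellular damage, reconnects them. The check only reports dependence or independence; the sign of an association still has to be worked out from the path signs, as in the answers above.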
4. (a) Answer: To estimate the conditional risk of cancer given smoking, we merely regress cancer on smoking using our favorite regression technique (logistic regression might work, or a generalized additive model). A more interesting question is whether smoking causes cancer. To answer this, we should control for any other factor that could potentially allow us to predict cancer from smoking via an indirect chain of dependence. We should not control for any variables on the directed path from smoking to cancer. Examining the figure (and our previous answers), every indirect path from smoking to cancer goes through asbestos, so conditioning on that would

            Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.99767    4.66365  -1.715 0..
cellular    -2.28934    3.03429  -0.754 0.
tar         24.01178   23.38684   1.027 0.
teeth        9.87549    7.41977   1.331 0.
dental       2.35654    2.24426   1.050 0.
smoking     -0.04837    1.23852  -0.039 0.
asbestos     0.16040    0.15257   1.051 0.
occupation  -0.11653    0.28330  -0.411 0.

    Null deviance: 714.09 on 600 degrees of freedom
Residual deviance: 334.28 on 593 degrees of freedom

The model fit as assessed by deviance is approximately the same as before. This time, however, the estimated effect of smoking is a slight reduction in the risk of cancer. This effect is not even close to being statistically significant. In fact, nothing is, not even cellular damage, which (in the model) is the true direct cause of cancer.

   (d) Answer: The insurance company should use the model from part (c). Assuming that the insurance company will not involve itself with the personal health decisions of its clients (telling them whether or not to smoke), the company is interested only in estimating the cancer risk for each client given all factors. What causes the risk is not important to them.

   (e) Answer: The doctor should use the model from part (b), which attempts to reveal the causal effect of smoking on cancer. Normally the patient is interested in knowing how to change their lifestyle to minimize cancer risk, and less interested in indirect statistical clues about their risk factors.
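The difference between a model for prediction and a model for causal adjustment can be sketched on simulated data. The code below is an illustration only: the homework data set is not reproduced here, so the data are drawn from hypothetical structural equations that follow the DAG in problem 1(a), and every coefficient, the seed, and the names `sim`, `fit_risk`, `fit_adjusted`, and `fit_all` are assumptions made for this example. Only the sample size (601) and the use of logistic regression are taken from the solutions.

```r
set.seed(402)  # arbitrary seed, for reproducibility only
n <- 601       # same number of observations as the homework data

# Hypothetical structural equations following the DAG in 1(a);
# all coefficients are made up for illustration.
occupation <- rnorm(n)
dental     <- 0.8 * occupation + rnorm(n)
smoking    <- 0.8 * occupation + rnorm(n)
asbestos   <- 0.8 * occupation + rnorm(n)
tar        <- 1.0 * smoking + rnorm(n, sd = 0.5)
teeth      <- 0.7 * smoking + 0.7 * dental + rnorm(n, sd = 0.5)
cellular   <- 0.6 * tar + 0.6 * asbestos + rnorm(n, sd = 0.5)
cancer     <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * cellular))

sim <- data.frame(cellular, tar, teeth, dental, smoking,
                  asbestos, occupation, cancer)

# Conditional risk of cancer given smoking alone.
fit_risk <- glm(cancer ~ smoking, data = sim, family = binomial)

# Adjusting for asbestos, the variable the answer to (a) singles out
# as lying on every indirect path from smoking to cancer.
fit_adjusted <- glm(cancer ~ smoking + asbestos, data = sim, family = binomial)

# Conditioning on every recorded variable, as a pure prediction model would.
fit_all <- glm(cancer ~ ., data = sim, family = binomial)

# Compare the smoking coefficient across the three fits.
print(coef(summary(fit_risk))["smoking", ])
print(coef(summary(fit_adjusted))["smoking", ])
print(coef(summary(fit_all))["smoking", ])
```

With 601 observations and seven predictors, `fit_all` has 600 null and 593 residual degrees of freedom, the same as in the output above. Because tar and cellular damage sit on the directed path from smoking to cancer, including them in `fit_all` absorbs the effect of smoking, which is one reason a smoking coefficient from such a model should not be read causally.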

5. (a) Answer:

      (a) Teeth and cancer are still positively associated through occupation. Difference: the path through smoking is now blocked.
      (b) Same as part (3b).
      (c) Difference: all paths blocked, no association.
      (d) Same as part (3d).
      (e) Same as part (3e).
      (f) Same as part (3f), minus the explanation for the path through tar and damage, which is now missing.
      (g) Difference: now independent, without the tar → damage path.
      (h) Difference: still dependent, but they are dependent and positively associated regardless of whether we control for damage.
      (i) Difference: independent, since we are controlling for asbestos and since the tar → damage path is gone.
      (j) Difference: independent, all paths are blocked.

(b) Answer: Any of the relations which switched from being dependent in problem 3 to being independent in problem 5 could be used to distinguish between the two DAGs. In particular, in the first DAG, cancer is dependent on tar after controlling for smoking, but in the second DAG cancer and tar are independent given smoking. This thought is continued in the extra credit.

6. Extra Credit: To distinguish between the two DAGs, we look at whether $\text{cancer} \not\perp \text{tar} \mid \text{smoking}$ (DAG 1), or whether $\text{cancer} \perp \text{tar} \mid \text{smoking}$ (DAG 2). Specifically, let’s look at the conditional distribution p(cancer|tar, smoking): in DAG 2, tar should drop out of this as irrelevant, but not in DAG 1. To avoid parametric specification issues, I’ll use a non-parametric conditional density estimator, as in Lecture 6:

library(np)
tar.npc <- npcdens(factor(cancer) ~ smoking + tar, data = smoke)

(We need to let np know that cancer is a categorical variable, hence the factor() wrapper.) After a little thought, we get a fitted conditional density function. Examining the bandwidths shows that smoking, rather than tar, has been smoothed away almost entirely:

summary(tar.npc)

Conditional Density Data: 601 training points, in 3 variable(s)
(1 dependent variable(s), and 2 explanatory variable(s))

                        factor(cancer)
Dep. Var. Bandwidth(s):     3.200268e-
                        smoking tar
Exp. Var. Bandwidth(s): 1149858 0.

Bandwidth Type: Fixed
Log Likelihood: -169.

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Dependent Vars.: 1

The bandwidth of smoking is over a million, while its standard deviation is 1.4. The bandwidth of tar, on the other hand, is quite small compared to its own standard deviation. Plotting like so (after Lecture 6)

[Plot: tar.grid$tar on the horizontal axis, tar.grid$smoking on the vertical axis, with a color bar marked 0.05, 0.10, 0.15, 0.20, 0.25, 0.30.]

Figure 1: Conditional probability of cancer (indicated by color, see bar at right) as a function of tar levels (horizontal axis) and smoking (vertical axis).
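To see why a bandwidth of over a million amounts to smoothing smoking away, it helps to look at the kernel weights directly. The following is a minimal sketch, assuming an unnormalized Gaussian kernel; the function `kernel_weight` is a name introduced here, and the small bandwidth of 0.5 is a hypothetical value chosen for contrast, since the fitted tar bandwidth is not legible in the output above.

```r
# Unnormalized Gaussian kernel weight for a point at distance d
# from the evaluation point, under bandwidth h (weight 1 at d = 0).
kernel_weight <- function(d, h) exp(-0.5 * (d / h)^2)

# Distances of 0, 1, 2 and 3 standard deviations of smoking (sd = 1.4).
d <- c(0, 1.4, 2.8, 4.2)

w_huge  <- kernel_weight(d, h = 1149858)  # bandwidth reported for smoking
w_small <- kernel_weight(d, h = 0.5)      # hypothetical small bandwidth

print(w_huge)   # all weights essentially equal to 1
print(w_small)  # weights fall off sharply with distance
```

With the huge bandwidth every observation gets the same weight whatever its smoking value, so the estimate cannot vary with smoking; with a small bandwidth only nearby observations count, which is how the estimator keeps its sensitivity to tar.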