Estimating Variable Dependence in DAGs: Cancer & Risk Factors Case Study, Exercises of Advanced Data Analysis

Carnegie Mellon University (CMU), Advanced Data Analysis

Solutions to homework assignment 10 in the Advanced Data Analysis course, focusing on determining the dependence and independence of variables in directed acyclic graphs (DAGs). The case study concerns the relationship between cancer and risk factors such as smoking, asbestos, and dental care, and how to estimate the conditional risk of cancer given these factors using logistic regression models.

Typology: Exercises

2010/2011

Uploaded on 11/03/2011


Homework Assignment 10: Estimating with DAGs

36-402, Advanced Data Analysis, Spring 2011

Solutions

1. (a) Answer:

       Variable           Parents
       cancer             cellular damage
       cellular damage    tar, asbestos
       tar                smoking
       teeth              smoking, dental care
       dental care        occupation
       smoking            occupation
       asbestos           occupation
       occupation         None

   (b) Answer:

       Variable           Children
       cancer             None
       cellular           cancer
       tar                cellular
       teeth              None
       dental             teeth
       smoking            tar, teeth
       asbestos           cellular
       occupation         asbestos, smoking, dental

2. Answer: In any graphical model, the joint distribution “factors according to the graph”:

$$p(X_1, X_2, \ldots, X_p) = \prod_{i=1}^{p} p(X_i \mid X_{\mathrm{parents}(i)})$$

where if $X_i$ has no parents, we read $p(X_i \mid X_{\mathrm{parents}(i)})$ as just the marginal distribution $p(X_i)$. Here, there is only one variable with no parents, “occupation”, so we start with its marginal distribution: p(occupation). Then we need the conditional distributions of its children, p(dental|occupation), p(smoking|occupation), and p(asbestos|occupation). Next we move on to the children of these variables: p(tar|smoking) and p(teeth|smoking, dental).


(Notice that since “teeth” has two parents, we need to condition on both of them.) Then p(cellular|asbestos, tar), and finally p(cancer|cellular).
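The parents-before-children ordering used in this answer can be sketched in a few lines of Python (used here only for illustration; the course itself works in R). The `PARENTS` dictionary is copied from the table in answer 1(a), with "cellular" and "dental" abbreviated as in the text; the function names `topological_order` and `factorization` are hypothetical helpers, not part of the assignment.

```python
# Parent sets copied from the table in answer 1(a).
PARENTS = {
    "occupation": [],
    "dental": ["occupation"],
    "smoking": ["occupation"],
    "asbestos": ["occupation"],
    "tar": ["smoking"],
    "teeth": ["smoking", "dental"],
    "cellular": ["tar", "asbestos"],
    "cancer": ["cellular"],
}


def topological_order(parents):
    """Return the variables so that every parent precedes its children."""
    order, placed = [], set()
    remaining = dict(parents)
    while remaining:
        # A variable is ready once all of its parents have been placed.
        ready = [v for v, ps in remaining.items() if all(p in placed for p in ps)]
        if not ready:
            raise ValueError("graph has a cycle, so it is not a DAG")
        for v in ready:
            order.append(v)
            placed.add(v)
            del remaining[v]
    return order


def factorization(parents):
    """Write the joint distribution as a product of p(child|parents) terms."""
    factors = []
    for v in topological_order(parents):
        ps = parents[v]
        factors.append(f"p({v}|{','.join(ps)})" if ps else f"p({v})")
    return " ".join(factors)


print(factorization(PARENTS))
```

The printed product starts with the marginal p(occupation) and ends with p(cancer|cellular), with one factor per variable, matching the order in which the answer builds it up.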

3. (a) Answer: Teeth and Cancer share Occupation and Smoking as ancestors, so they are dependent. All paths from teeth to cancer have a positive product of signs, so these two are positively associated.

   (b) Answer: Still dependent (unblocked path through dental to occupation to asbestos, positive product of signs).

   (c) Answer: Dependent, unblocked path through smoking, positive association.

   (d) Answer: All paths are blocked, therefore they are independent and there is no association.

   (e) Answer: Dependent, path from smoking to occupation to asbestos to cancer, positive sign.

   (f) Answer: Independent (therefore no association), as conditioning on cellular damage blocks the asbestos ← occupation → smoking → tar → damage path.

   (g) Answer: The path through tar and cellular damage is open, so dependent, with positive association.

   (h) Answer: Conditioning on a common effect makes them dependent, and would produce a negative association (see the first full paragraph on page 8, lecture 21), but they are already dependent with a positive association from occupational prestige, so while they are dependent the sign of the association is indeterminate.

   (i) Answer: Dependent, positive (only the straight path from tar to cancer is unblocked; teeth is a collider, and conditioning on it activates it, but all paths which involve it are blocked by conditioning on the non-colliders at smoking and asbestos).

   (j) Answer: Dependent, positive. Conditioning on occupation blocks all paths from smoking to cancer except smoking → tar → damage → cancer.
4. (a) Answer: To estimate the conditional risk of cancer given smoking, we merely regress cancer on smoking using our favorite regression technique (logistic regression might work, or a generalized additive model). A more interesting question is to ask whether smoking causes cancer. [...] of cancer on smoking. To answer this, we should control for any other factor that could potentially allow us to predict cancer from smoking via an indirect chain of dependence. We should not control for any variables on the directed path from smoking to cancer. Examining the figure (and our previous answers), every indirect path from smoking to cancer goes through asbestos, so conditioning on that would [...]

                   Estimate  Std. Error  z value  Pr(>|z|)
       (Intercept) -7.99767     4.66365   -1.715  0..
       cellular    -2.28934     3.03429   -0.754  0.
       tar         24.01178    23.38684    1.027  0.
       teeth        9.87549     7.41977    1.331  0.
       dental       2.35654     2.24426    1.050  0.
       smoking     -0.04837     1.23852   -0.039  0.
       asbestos     0.16040     0.15257    1.051  0.
       occupation  -0.11653     0.28330   -0.411  0.

       Null deviance: 714.09 on 600 degrees of freedom
       Residual deviance: 334.28 on 593 degrees of freedom

   The model fit as assessed by deviance is approximately the same as before. This time, however, the estimated effect of smoking is that it slightly reduces the risk of cancer. This effect is not even close to being statistically significant. In fact, nothing is, not even cellular damage, which (in the model) is the true direct cause of cancer.

   (d) Answer: The insurance company should use the model from part (c). Assuming that the insurance company will not involve itself with the personal health decisions of its clients (telling them whether or not to smoke), the company is interested only in estimating the cancer risk for each client given all factors. What causes the risk is not important to them.

   (e) Answer: The doctor should use the model from part (b), which attempts to reveal the causal effect of smoking on cancer. Normally the patient is interested in knowing how to change their lifestyle to minimize cancer risk, and less interested in indirect statistical clues about their risk factors.
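The claim that nothing in the part (c) coefficient table is statistically significant can be checked by hand, since each test statistic is just the estimate divided by its standard error. The sketch below (Python, for illustration only) recomputes those ratios from the printed estimates and standard errors. It assumes the usual large-sample normal reference distribution for a logistic regression, which is an assumption of this example; the p-values it prints are approximations under that assumption, not the values cut off in the output above.

```python
import math

# (estimate, standard error, reported test statistic), copied from the table.
COEFS = {
    "(Intercept)": (-7.99767, 4.66365, -1.715),
    "cellular": (-2.28934, 3.03429, -0.754),
    "tar": (24.01178, 23.38684, 1.027),
    "teeth": (9.87549, 7.41977, 1.331),
    "dental": (2.35654, 2.24426, 1.050),
    "smoking": (-0.04837, 1.23852, -0.039),
    "asbestos": (0.16040, 0.15257, 1.051),
    "occupation": (-0.11653, 0.28330, -0.411),
}


def wald_z(estimate, std_error):
    """Wald statistic: the estimate measured in standard-error units."""
    return estimate / std_error


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


for name, (est, se, reported) in COEFS.items():
    z = wald_z(est, se)
    print(f"{name:12s} z = {z:7.3f} (reported {reported:7.3f})  p ~ {two_sided_p(z):.3f}")

# Drop in deviance from the null model, and the degrees of freedom it uses.
deviance_drop = 714.09 - 334.28
df_used = 600 - 593
print(f"deviance drop = {deviance_drop:.2f} on {df_used} degrees of freedom")
```

Every recomputed statistic is smaller than 1.96 in absolute value, which is the conventional 5% cutoff under the normal approximation, so the table is consistent with the statement that no single coefficient is significant even though the deviance drops substantially.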

5. (a) Answer:

       (a) Teeth and cancer are still positively associated through occupation. Difference: the path through smoking is now blocked.
       (b) Same as part (3b).
       (c) Difference: all paths blocked, no association.
       (d) Same as part (3d).
       (e) Same as part (3e).
       (f) Same as part (3f), minus the explanation for the path through tar and damage, which is now missing.
       (g) Difference: now independent, without the tar → damage path.
       (h) Difference: still dependent, but they are dependent and positively associated regardless of whether we control for damage.
       (i) Difference: independent, since we are controlling for asbestos and since the tar → damage path is gone.
       (j) Difference: independent, all paths are blocked.

(b) Answer: Any of the relations which switched from being dependent in problem 3 to being independent in problem 5 could be used to distinguish between the two DAGs. In particular, in the first DAG, cancer is dependent on tar after controlling for smoking, but in the second DAG cancer and tar are independent given smoking. This thought is continued in the extra credit.
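The dependence claims used to tell the two DAGs apart can be checked mechanically with a d-separation test. The sketch below (Python, for illustration only) uses the standard moralization criterion: restrict the graph to the ancestors of the variables involved, connect parents that share a child, drop edge directions, delete the conditioning set, and see whether any path remains. `DAG1` is copied from answer 1(a). `DAG2` is an assumption: it is taken to be the first DAG with the tar → cellular damage edge deleted, which is what answers 5(g) and 5(i) suggest, since the problem statement itself is not part of this text. The function names are hypothetical.

```python
from itertools import combinations

# Parent sets from answer 1(a).
DAG1 = {
    "occupation": [],
    "dental": ["occupation"],
    "smoking": ["occupation"],
    "asbestos": ["occupation"],
    "tar": ["smoking"],
    "teeth": ["smoking", "dental"],
    "cellular": ["tar", "asbestos"],
    "cancer": ["cellular"],
}
# ASSUMPTION: the second DAG is DAG1 without the tar -> cellular edge.
DAG2 = dict(DAG1, cellular=["asbestos"])


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`."""
    keep = ancestors(parents, {x, y, *given})
    # Moralize the ancestral subgraph: link each node to its parents and
    # link every pair of parents that share a child, ignoring direction.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = parents[child]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    # Delete the conditioning set, then search for any path from x to y.
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(adj[v] - blocked)
    return True


for name, dag in (("DAG 1", DAG1), ("DAG 2", DAG2)):
    print(name, "cancer indep. of tar given smoking:",
          d_separated(dag, "cancer", "tar", ["smoking"]))
    print(name, "cancer indep. of smoking given occupation:",
          d_separated(dag, "cancer", "smoking", ["occupation"]))
```

Under these assumptions the check agrees with the written answers: cancer and tar are d-connected given smoking in the first DAG and d-separated in the second, and the same switch happens for smoking and cancer given occupation (parts 3(j) and 5(j)).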

Extra Credit

To distinguish between the two DAGs, we look at whether $\text{cancer} \not\perp \text{tar} \mid \text{smoking}$ (DAG 1), or whether $\text{cancer} \perp \text{tar} \mid \text{smoking}$ (DAG 2). Specifically, let’s look at the conditional distribution p(cancer|tar, smoking): in DAG 2, tar should drop out of this as irrelevant, but not in DAG 1. To avoid parametric specification issues, I’ll use a non-parametric conditional density estimator, as in Lecture 6:

    library(np)
    # Wrap cancer in factor() so np treats the response as categorical.
    tar.npc <- npcdens(factor(cancer) ~ smoking + tar, data = smoke)

(We need to let np know that cancer is a categorical variable, hence the factor() wrapper.) After a little thought, we get a fitted conditional density function. Examining the bandwidths shows that smoking, rather than tar, has been smoothed away almost entirely:

    summary(tar.npc)

    Conditional Density Data: 601 training points, in 3 variable(s)
    (1 dependent variable(s), and 2 explanatory variable(s))

                           factor(cancer)
    Dep. Var. Bandwidth(s):  3.200268e-
                             smoking  tar
    Exp. Var. Bandwidth(s):  1149858  0.

    Bandwidth Type: Fixed
    Log Likelihood: -169.

    Continuous Kernel Type: Second-Order Gaussian
    No. Continuous Explanatory Vars.: 2

    Unordered Categorical Kernel Type: Aitchison and Aitken
    No. Unordered Categorical Dependent Vars.: 1

The bandwidth of smoking is over a million, while its standard deviation is 1.4. The bandwidth of tar, on the other hand, is quite small compared to its own standard deviation. Plotting like so (after Lecture 6)

Figure 1: Conditional probability of cancer (indicated by color, see bar at right) as a function of tar levels (horizontal axis, tar.grid$tar) and smoking (vertical axis, tar.grid$smoking).
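To see why a bandwidth of over a million means smoking has been "smoothed away", it helps to look at the kernel weights directly. The sketch below (Python, for illustration only) evaluates an unnormalized Gaussian kernel weight for an observation three standard deviations away from the evaluation point. The smoking bandwidth and standard deviation come from the text above; the narrow comparison bandwidth of 0.1 on a variable with standard deviation 0.1 is purely hypothetical, since the tar bandwidth is cut off in the output.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Unnormalized Gaussian kernel weight for a point `distance` away."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


SMOKING_SD = 1.4       # standard deviation of smoking, from the text
SMOKING_BW = 1149858   # selected bandwidth for smoking, from the summary

# With the huge bandwidth, a point 3 standard deviations away still gets
# essentially full weight, so smoking has no influence on the estimate.
w_smoking = gaussian_weight(3 * SMOKING_SD, SMOKING_BW)

# HYPOTHETICAL comparison: a variable with sd 0.1 and bandwidth 0.1.
w_narrow = gaussian_weight(3 * 0.1, 0.1)

print(f"weight with bandwidth {SMOKING_BW}: {w_smoking:.12f}")
print(f"weight with a bandwidth equal to one sd: {w_narrow:.4f}")
```

When the weight is essentially 1 at every distance that occurs in the data, the estimate averages over all values of that variable equally, which is the same as ignoring it. A bandwidth comparable to the variable's own spread, by contrast, sharply downweights distant observations, so the estimate genuinely depends on that variable.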
