



Solutions to homework assignment 10 in the advanced data analysis course, focusing on determining the dependence and independence of variables in directed acyclic graphs (DAGs). The case study concerns the relationship between cancer and risk factors such as smoking, asbestos, and dental care, and how to estimate the conditional risk of cancer given these factors using logistic regression models.
$$
p(X_1, X_2, \ldots, X_p) = \prod_{i=1}^{p} p\left(X_i \mid X_{\mathrm{parents}(i)}\right)
$$
where if Xi has no parents, we read p(Xi|Xparents(i)) as just the marginal distribution p(Xi). Here, there is only one variable with no parents, “occupation”, so we start with its marginal distribution: p(occupation). Then we need the conditional distributions of its children, p(dental|occupation), p(smoking|occupation), and p(asbestos|occupation). Next we move on to the children of these variables: p(tar|smoking) and p(teeth|smoking, dental).
(Notice that since “teeth” has two parents, we need to condition on both of them.) Then p(cellular|asbestos, tar), and finally p(cancer|cellular).
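Multiplying the factors listed above gives the joint distribution for this DAG. Written out in full, using the same variable names, it would read:

$$
\begin{aligned}
p(\text{all variables}) ={}& p(\text{occupation})\, p(\text{dental} \mid \text{occupation})\, p(\text{smoking} \mid \text{occupation})\, p(\text{asbestos} \mid \text{occupation}) \\
& \times p(\text{tar} \mid \text{smoking})\, p(\text{teeth} \mid \text{smoking}, \text{dental}) \\
& \times p(\text{cellular} \mid \text{asbestos}, \text{tar})\, p(\text{cancer} \mid \text{cellular})
\end{aligned}
$$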
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.99767    4.66365  -1.715  0..
cellular     -2.28934    3.03429  -0.754  0.
tar          24.01178   23.38684   1.027  0.
teeth         9.87549    7.41977   1.331  0.
dental        2.35654    2.24426   1.050  0.
smoking      -0.04837    1.23852  -0.039  0.
asbestos      0.16040    0.15257   1.051  0.
occupation   -0.11653    0.28330  -0.411  0.

Null deviance: 714.09 on 600 degrees of freedom
Residual deviance: 334.28 on 593 degrees of freedom

The model fit, as assessed by deviance, is approximately the same as before. This time, however, the estimated effect of smoking is a slight reduction in the risk of cancer. This effect is not even close to being statistically significant. In fact, nothing is, not even cellular damage, which (in the model) is the true direct cause of cancer.

(d) Answer: The insurance company should use the model from part (c). Assuming that the insurance company will not involve itself with the personal health decisions of its clients (telling them whether or not to smoke), the company is interested only in estimating the cancer risk for each client given all factors. What causes the risk is not important to them.

(e) Answer: The doctor should use the model from part (b), which attempts to reveal the causal effect of smoking on cancer. Normally the patient is interested in knowing how to change their lifestyle to minimize cancer risk, and less interested in indirect statistical clues about their risk factors.
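As a rough illustration of the kind of model in part (c) and the predictive use described in part (d), here is a minimal sketch in base R. The homework's data set is not reproduced here, so the code simulates a hypothetical data frame `sim` whose structure follows the DAG; every coefficient in the simulation is invented for illustration, and the fitted numbers will not match the table above.

```r
# HYPOTHETICAL simulated data: the arrows follow the DAG in the text,
# but all effect sizes below are invented for illustration only.
set.seed(1)
n <- 601
occupation <- rnorm(n)
dental     <- 0.5 * occupation + rnorm(n)
smoking    <- 0.5 * occupation + rnorm(n)
asbestos   <- 0.5 * occupation + rnorm(n)
tar        <- 0.8 * smoking + rnorm(n, sd = 0.5)
teeth      <- 0.5 * smoking - 0.5 * dental + rnorm(n)
cellular   <- 0.6 * asbestos + 0.6 * tar + rnorm(n, sd = 0.5)
cancer     <- rbinom(n, 1, plogis(-2 + 1.5 * cellular))
sim <- data.frame(occupation, dental, smoking, asbestos,
                  tar, teeth, cellular, cancer)

# Logistic regression of cancer on every other variable, as in part (c)
fit.all <- glm(cancer ~ cellular + tar + teeth + dental + smoking +
                 asbestos + occupation,
               data = sim, family = binomial)

# Predicted cancer probability for each (simulated) client: the quantity
# an insurer would care about, regardless of which variables are causes
risk <- predict(fit.all, type = "response")
summary(fit.all)$coefficients
```

With 601 rows and eight estimated coefficients, the null and residual degrees of freedom come out as 600 and 593, the same as in the output above.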
(a) Teeth and cancer are still positively associated through occupation. Difference: the path through smoking is now blocked.
(b) Same as part (3b).
(c) Difference: all paths blocked, no association.
(d) Same as part (3d).
(e) Same as part (3e).
(f) Same as part (3f), minus the explanation for the path through tar and damage, which is now missing.
(g) Difference: now independent, without the tar → damage path.
(h) Difference: still dependent, but they are dependent and positively associated regardless of whether we control for damage.
(i) Difference: independent, since we are controlling for asbestos and since the tar → damage path is gone.
(j) Difference: independent, all paths are blocked.
(b) Answer: Any of the relations which switched from being dependent in problem 3 to being independent in problem 5 could be used to distinguish between the two DAGs. In particular, in the first DAG, cancer is dependent on tar after controlling for smoking, but in the second DAG cancer and tar are independent given smoking. This thought is continued in the extra credit.
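The tar/cancer comparison can be checked mechanically. The sketch below is a small base-R d-separation test using the standard moralized-ancestral-graph criterion. The helper `d_separated` is written here for illustration only. The edge list `dag1` is read off the factorization given earlier. `dag2` rests on an assumption, since the second DAG is not reproduced in this text: it is taken to differ only by dropping the tar → cellular arrow.

```r
# First DAG, as (from, to) pairs read off the factorization in the text
dag1 <- rbind(
  c("occupation", "dental"), c("occupation", "smoking"),
  c("occupation", "asbestos"), c("smoking", "tar"),
  c("smoking", "teeth"), c("dental", "teeth"),
  c("asbestos", "cellular"), c("tar", "cellular"),
  c("cellular", "cancer")
)
# ASSUMPTION: the second DAG only lacks the tar -> cellular arrow
dag2 <- dag1[!(dag1[, 1] == "tar" & dag1[, 2] == "cellular"), ]

# TRUE if x and y are d-separated given the set z (illustrative helper)
d_separated <- function(edges, x, y, z = character(0)) {
  # 1. Restrict to x, y, z and all of their ancestors
  keep <- c(x, y, z)
  repeat {
    new <- setdiff(edges[edges[, 2] %in% keep, 1], keep)
    if (length(new) == 0) break
    keep <- c(keep, new)
  }
  e <- edges[edges[, 1] %in% keep & edges[, 2] %in% keep, , drop = FALSE]
  # 2. Moralize: link parents of a common child, then ignore directions
  und <- e
  for (child in unique(e[, 2])) {
    pa <- e[e[, 2] == child, 1]
    if (length(pa) > 1) und <- rbind(und, t(combn(pa, 2)))
  }
  # 3. Delete the conditioning variables
  und <- und[!(und[, 1] %in% z | und[, 2] %in% z), , drop = FALSE]
  # 4. x and y are d-separated iff no undirected path remains
  reached <- x
  repeat {
    nb <- c(und[und[, 1] %in% reached, 2], und[und[, 2] %in% reached, 1])
    new <- setdiff(nb, reached)
    if (length(new) == 0) break
    reached <- c(reached, new)
  }
  !(y %in% reached)
}

sep1 <- d_separated(dag1, "tar", "cancer", "smoking")  # first DAG
sep2 <- d_separated(dag2, "tar", "cancer", "smoking")  # second DAG
```

Under these assumptions `sep1` is `FALSE` (tar and cancer remain dependent given smoking) and `sep2` is `TRUE`, which is the distinguishing relation described in the answer.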
library(np)
tar.npc <- npcdens(factor(cancer) ~ smoking + tar, data = smoke)
(We need to let np know that cancer is a categorical variable, hence the factor() wrapper.) After a little thought, we get a fitted conditional density function. Examining the bandwidths shows that smoking, rather than tar, has been smoothed away almost entirely:
summary(tar.npc)
Conditional Density Data: 601 training points, in 3 variable(s)
(1 dependent variable(s), and 2 explanatory variable(s))

                        factor(cancer)
Dep. Var. Bandwidth(s):  3.200268e-
                        smoking   tar
Exp. Var. Bandwidth(s): 1149858   0.

Bandwidth Type: Fixed
Log Likelihood: -169.

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Dependent Vars.: 1
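The bandwidth of smoking is over a million, while its standard deviation is 1.4, so the kernel treats all observed smoking values as essentially identical. The bandwidth of tar, on the other hand, is quite small compared to its own standard deviation. Plotting the fitted conditional probabilities over a grid of tar and smoking values (after Lecture 6) gives Figure 1.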
Figure 1: Conditional probability of cancer (indicated by color, see bar at right) as a function of tar levels (horizontal axis) and smoking (vertical axis).
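To see why a bandwidth of over a million amounts to smoothing smoking away, it may help to look at the weights a Gaussian kernel assigns at different bandwidths. The sketch below is a base-R illustration, not np's internal code. The grid `smoking.grid`, the target point `x0`, and the small bandwidth `0.3` are hypothetical values chosen for the example; only the bandwidth 1149858 comes from the summary above.

```r
# Hypothetical smoking values and a target point to predict at
smoking.grid <- 0:4
x0 <- 2

# Normalized Gaussian kernel weights of each data point relative to x0
kernel.weights <- function(x, x0, h) {
  w <- dnorm((x - x0) / h)
  w / sum(w)
}

# Bandwidth reported for smoking: every point gets (almost) equal weight,
# so the estimate no longer depends on the smoking value
w.huge <- kernel.weights(smoking.grid, x0, h = 1149858)

# A small, hypothetical bandwidth: nearly all weight sits at x0 itself
w.small <- kernel.weights(smoking.grid, x0, h = 0.3)

round(rbind(w.huge, w.small), 4)
```

With the huge bandwidth the five weights are all indistinguishable from 1/5, which is what it means for a variable to be smoothed away. With the small bandwidth the weight concentrates on the point nearest `x0`.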