
Causal Modeling, Especially Graphical Causal Models

36-402, Advanced Data Analysis

12 April 2011

Contents

1 Causation and Counterfactuals
2 Causal Graphical Models
   2.1 Calculating the “effects of causes”
3 Conditional Independence and d-Separation
   3.1 D-Separation Illustrated
   3.2 Linear Graphical Models and Path Coefficients
   3.3 Positive and Negative Associations
4 Further Reading
A Independence, Conditional Independence, and Information Theory

1 Causation and Counterfactuals

Take a piece of cotton, say an old rag. Apply flame to it; the cotton burns. We say the fire caused the cotton to burn. The flame is certainly correlated with the cotton burning, but, as we all know, correlation is not causation (Figure 1). Perhaps every time we set rags on fire we handle them with heavy protective gloves; the gloves don’t make the cotton burn, but the statistical dependence is strong. So what is causation?

We do not have to settle 2500 years (or more) of argument among philosophers and scientists. For our purposes, it’s enough to realize that the concept has a counter-factual component: if, contrary to fact, the flame had not been applied to the rag, then the rag would not have burned.^1 On the other hand, the fire makes the cotton burn whether we are wearing protective gloves or not.

^1 If you immediately start thinking about quibbles, like “What if we hadn’t applied the flame, but the rag was struck by lightning?”, then you may have what it takes to be a philosopher.

Figure 1: “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.” (Image and text copyright by Randall Munroe, used here under a Creative Commons attribution-noncommercial license; see http://xkcd.com/552/.)

To say it a somewhat different way, the distributions we observe in the world are the outcome of complicated stochastic processes. The mechanisms which set the value of one variable inter-lock with those which set other variables. When we make a probabilistic prediction by conditioning — whether we predict E[Y | X = x] or P(Y | X = x) or something more complicated — we are just filtering the output of those mechanisms, picking out the cases where they happen to have set X to the value x, and looking at what goes along with that. When we make a causal prediction, we want to know what would happen if the usual mechanisms controlling X were suspended and it was set to x. How would this change propagate to the other variables? What distribution would result for Y? This is often, perhaps even usually, what people really want to know from a data analysis, and they settle for statistical prediction either because they think it is causal prediction, or for lack of a better alternative.

Causal inference is the undertaking of trying to answer causal questions from empirical data. Its fundamental difficulty is that we are trying to derive counter-factual conclusions with only factual premises. As a matter of habit, we come to expect cotton to burn when we apply flames. We might even say, on the basis of purely statistical evidence, that the world has this habit. But as a matter of pure logic, no amount of evidence about what did happen can compel beliefs about what would have happened under non-existent circumstances.^2 (For all my data shows, all the rags I burn just so happened to be on the verge of spontaneously bursting into flames anyway.) We must supply some counter-factual or causal premise, linking what we see to what we could have seen, to …

^2 The first person to really recognize this was the medieval Muslim theologian and anti-philosopher al Ghazali (1100/1997). (See Kogan (1985) for some of the history.) Very similar arguments were revived centuries later by Hume (1739); whether there was some line of intellectual descent linking them, I don’t know.

… set Xc. The other mechanisms in the assemblage are left alone, however, and so step (iii) propagates the fixed values of Xc through them. We are not selecting a sub-population, but producing a new one. If setting Xc to different values, say xc and x′c, leads to different distributions for XE, then we say that Xc has an effect on XE — or, slightly redundantly, has a causal effect on XE. Sometimes^4 “the effect of switching from xc to x′c” specifically refers to a change in the expected value of XE, but since profoundly different distributions can have the same mean, this seems needlessly restrictive.^5 If one is interested in average effects of this sort, they are computed by the same procedure.

It is convenient to have a short-hand notation for this procedure of causal conditioning. One more-or-less standard idea, introduced by Judea Pearl, is to introduce a do operator which encloses the conditioning variable and its value. That is,

P(XE | Xc = xc)

is probabilistic conditioning, or selecting a sub-ensemble from the old mechanisms; but

P(XE | do(Xc = xc))

is causal conditioning, or producing a new ensemble. Sometimes one sees this written as P(XE | Xc = x̂c), or even P(XE | x̂c). I am actually fond of the do notation and will use it.

Suppose that P(XE | Xc = xc) = P(XE | do(Xc = xc)). This would be extremely convenient for causal inference. The conditional distribution on the right is the causal, counter-factual distribution which tells us what would happen if xc was imposed. The distribution on the left is the ordinary probabilistic distribution we have spent years learning how to estimate from data. When do they coincide? One time when they would is if Xc contains all the parents of XE, and none of its descendants. Then, by the Markov property, XE is independent of all other variables given Xc, and removing the arrows into Xc will not change that, or the conditional distribution of XE given its parents. Doing causal inference for other choices of Xc will demand other conditional independence relations implied by the Markov property.
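To make the contrast concrete, here is a minimal simulation sketch in Python; the linear mechanisms, coefficients, and variable names are invented for illustration, not taken from the text. U is a confounder and the only parent of X: conditioning on X = 1 selects a sub-population in which U is skewed upward, while do(X = 1) replaces X’s mechanism and leaves U alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Invented structural model:  U ~ N(0,1);  X = U + noise;  Y = X + 2U + noise.
U = rng.normal(size=n)
X = U + rng.normal(size=n)
Y = X + 2 * U + rng.normal(size=n)

# Probabilistic conditioning: filter for the sub-population where X is near 1.
sel = np.abs(X - 1.0) < 0.05
print("E[Y | X = 1]     ~", Y[sel].mean())   # ~2.0, since E[U | X = 1] = 0.5

# Causal conditioning: suspend X's mechanism, set X = 1 everywhere, and let
# the remaining mechanisms run with U untouched.
Y_do = 1.0 + 2 * U + rng.normal(size=n)
print("E[Y | do(X = 1)] ~", Y_do.mean())     # ~1.0, the direct effect alone
```

Conditioning on all of Y’s parents — here X together with U — would make the two calculations agree, which is exactly the Markov-property condition just described.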

3 Conditional Independence and d-Separation

It is clearly very important to us to be able to deduce when two sets of variables are conditionally independent of each other given a third. One of the great uses of DAGs is that they give us a fairly simple criterion for this, in terms of the graph itself. All distributions which conform to a given DAG share a common set of conditional independence relations, implied by the Markov

(^4) Especially in economics. (^5) Economists are also fond of the horribly misleading usage of talking about “an X effect” or “the effect of X” when they mean the regression coefficient of X. Don’t do this.

Figure 2: Four DAGs for three linked variables. The first two (a and b) are called chains; c is a fork; d is a collider. If these were the whole of the graph, we would have X ⊥̸ Y and X ⊥ Y | Z. For the collider, however, we would have X ⊥ Y while X ⊥̸ Y | Z.

property, no matter what their parameters or the form of the distributions. Faithful distributions have no other conditional independence relations.

Let us think this through. Our starting point is that while causal influence flows one way through the graph, along the directions of arrows from parents to children, statistical information can flow in either direction. We can certainly make inferences about an effect from its causes, but we can equally make inferences about causes from their effects. It might be harder to actually do the calculations,^6 and we might be left with more uncertainty, but we could do it. While we can do inference in either direction across any one edge, we do have to worry about whether we can propagate this information further.

Consider the four graphs in Figure 2. In every case, we condition on X, which acts as the source of information. In the first three cases, we can (in general) propagate the information from X to Z to Y — the Markov property tells us that Y is independent of its non-descendants given its parents, but in none of those cases does that make X and Y independent. In the last graph, however, what’s called a collider,^7 we cannot propagate the information, because Y has no parents, and X is not its descendant, hence they are independent. We learn about Z from X, but this doesn’t tell us anything about Z’s other cause, Y.

All of this flips around when we condition on the intermediate variable (Z in Figure 2). In the chains (Figures 2a and b), conditioning on the intermediate variable blocks the flow of information from X to Y — we learn nothing more about Y from X and Z than from Z alone, at least not along this path. This is also true of the fork (Figure 2c) — conditional on their common cause, the two effects are uninformative about each other. But in a collider, conditioning on the intermediate variable unblocks the path: the two causes, independent beforehand, become dependent once we condition on their common effect.

^6 Janzing (2007) makes the very interesting suggestion that the direction of causality can be discovered by using this — roughly speaking, that if X|Y is much harder to compute than Y|X, we should presume that X → Y rather than the other way around.

^7 Because two incoming arrows “collide” there.
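This pattern of (in)dependence is easy to check by simulation. The following sketch uses invented linear-Gaussian mechanisms; for jointly Gaussian variables, conditional independence given Z is equivalent to zero partial correlation, i.e. zero correlation between the residuals after regressing Z out of each variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of each."""
    zz = np.column_stack([np.ones_like(z), z])
    rx = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
    ry = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Chain (Figure 2a): X -> Z -> Y
X = rng.normal(size=n); Z = X + rng.normal(size=n); Y = Z + rng.normal(size=n)
print("chain   :", np.corrcoef(X, Y)[0, 1], partial_corr(X, Y, Z))

# Fork (Figure 2c): X <- Z -> Y
Z = rng.normal(size=n); X = Z + rng.normal(size=n); Y = Z + rng.normal(size=n)
print("fork    :", np.corrcoef(X, Y)[0, 1], partial_corr(X, Y, Z))

# Collider (Figure 2d): X -> Z <- Y
X = rng.normal(size=n); Y = rng.normal(size=n); Z = X + Y + rng.normal(size=n)
print("collider:", np.corrcoef(X, Y)[0, 1], partial_corr(X, Y, Z))
```

The chain and the fork print a clearly non-zero marginal correlation and a partial correlation near zero; the collider prints the reverse, a marginal correlation near zero and a distinctly negative partial correlation.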

Figure 3: Example DAG used to illustrate d-separation. (Its edges, read off from the discussion below: X3 → X1, X2 → X1, X2 → X4, X5 → X4, X1 → Y, X5 → Y, X4 → X6.)

There are three exogenous variables in the graph, X2, X3 and X5. Because they have no parents, any path from one to another must go over a collider.^11 If we do not condition on anything, therefore, we find that the exogenous variables are d-separated and thus independent. Since X3 is not on any path linking X2 and X5, if we condition only on X3, X2 and X5 are still d-separated, so X2 ⊥ X5 | X3. There are two paths linking X3 to X5: X3 → X1 ← X2 → X4 ← X5, and X3 → X1 → Y ← X5. Conditioning on X2 blocks the first path (since X2 is part of it, but is a fork), and also blocks the second path (since X2 is not part of it, and Y is a blocked collider). Thus, X3 ⊥ X5 | X2. Similarly,^12 X3 ⊥ X2 | X5.

For a slightly more interesting example, let’s look at the relation between X3 and Y. There are two paths here: X3 → X1 → Y, and X3 → X1 ← X2 → X4 ← X5 → Y. If we condition on nothing, the first path, which is a simple chain, is open, so X3 and Y are d-connected and dependent. If we condition on X1, we block the first path. X1 is a collider on the second path, so conditioning on it activates that collider — but the path also runs through the collider X4, which stays blocked until we condition on X4 or on its descendant X6. Conditioning on X1 thus blocks the causal pathway; conditioning, in addition, on X4 (or X6) opens the second path, creating a channel for inference without any causal connection. So we have

Y ⊥̸ X3    (1)
Y ⊥ X3 | X1    (2)
Y ⊥̸ X3 | X1, X4    (3)

To block the second path again, it would be enough to condition on X1 and X2 (since conditioning on a fork blocks the path), or on X1 and X5 (ditto), or, in fact, on any super-set of these. So

Y ⊥ X3 | X1, X2    (4)
Y ⊥ X3 | X1, X5    (5)
Y ⊥ X3 | X1, X2, X5    (6)
Y ⊥ X3 | X1, X2, X4    (7)
Y ⊥ X3 | X1, X2, X6    (8)
Y ⊥ X3 | X1, X2, X5, X6    (9)

etc., etc.

Let’s look at the relationship between X4 and Y. X4 is not an ancestor of Y, or a descendant of it, but they do share common ancestors, X5 and X2. Unconditionally, Y and X4 are dependent, both through the path going X4 ← X5 → Y, and through that going X4 ← X2 → X1 → Y. Along both paths, the exogenous variables are forks, so not conditioning on them leaves the path unblocked. X4 and Y become d-separated when we condition on X5 and X2.

X6 and X3 have no common ancestors. Unconditionally, they should be independent, and indeed they are: the two paths are X6 ← X4 ← X2 → X1 ← X3, and X6 ← X4 ← X5 → Y ← X1 ← X3. Both paths contain a single collider (X1 on the first, Y on the second); since we condition on neither collider, nor on any of their descendants, both paths are blocked, and X6 and X3 are indeed independent.

^11 Exercise: Convince yourself of this for this graph by finding all the paths between the exogenous variables. Can you prove this as a general fact about DAGs?

^12 Exercise!
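All of these relations can be verified mechanically. Below is a sketch of a d-separation checker in Python; the edge list is read off from the paths discussed above, and the function names are mine. It enumerates every undirected path between two nodes and tests whether each is blocked, treating a collider as open exactly when the collider itself, or one of its descendants, is in the conditioning set.

```python
# Edge list read off from the paths discussed above (the DAG of Figure 3).
EDGES = [("X3", "X1"), ("X2", "X1"), ("X2", "X4"), ("X5", "X4"),
         ("X1", "Y"), ("X5", "Y"), ("X4", "X6")]

PARENTS, CHILDREN = {}, {}
for a, b in EDGES:
    PARENTS.setdefault(b, set()).add(a)
    CHILDREN.setdefault(a, set()).add(b)

def descendants(v):
    """All strict descendants of node v."""
    found, stack = set(), [v]
    while stack:
        for c in CHILDREN.get(stack.pop(), ()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def undirected_paths(s, t, path=None):
    """Yield every simple path from s to t, ignoring arrow directions."""
    path = path or [s]
    if path[-1] == t:
        yield path
        return
    for nbr in PARENTS.get(path[-1], set()) | CHILDREN.get(path[-1], set()):
        if nbr not in path:
            yield from undirected_paths(s, t, path + [nbr])

def blocked(path, Z):
    """Is this path blocked by the conditioning set Z?"""
    for a, m, b in zip(path, path[1:], path[2:]):
        if m in CHILDREN.get(a, set()) and m in CHILDREN.get(b, set()):
            # m is a collider: open only if m or one of its descendants is in Z.
            if m not in Z and not (descendants(m) & Z):
                return True
        elif m in Z:  # chain or fork: blocked exactly when its middle is in Z
            return True
    return False

def d_separated(s, t, Z=frozenset()):
    return all(blocked(p, set(Z)) for p in undirected_paths(s, t))

print(d_separated("X2", "X5", {"X3"}))        # True
print(d_separated("X3", "X5", {"X2"}))        # True
print(d_separated("Y", "X3"))                 # False: chain X3 -> X1 -> Y open
print(d_separated("Y", "X3", {"X1"}))         # True: the collider X4 still blocks
print(d_separated("Y", "X3", {"X1", "X4"}))   # False: both colliders now open
print(d_separated("Y", "X3", {"X1", "X2"}))   # True: the fork X2 re-blocks the path
```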

  • Find all of the undirected paths between Xi and Xj.
  • Discard all of the paths which go through colliders.
  • For each remaining path, multiply all the path coefficients along the path.
  • Sum up these products over paths.

These rules were introduced by the great geneticist and mathematical biologist Sewall Wright in the early 20th century, in a series of papers culminating in Wright (1934).^14 These “Wright path rules” often seem mysterious, particularly the bit where paths with colliders are thrown out. But from our perspective, we can see that what Wright is doing is finding all of the unblocked paths between Xi and Xj. Each path is a channel along which information (here, correlation) can flow, and so we add across channels.

It is frequent, and customary, to assume that all of the variables are Gaussian. (We saw this in factor models as well.) With this extra assumption, the joint distribution of all the variables is a multivariate Gaussian, and the correlation matrix (which we find from the path coefficients) gives us the joint distribution. If we want to find conditional correlations, corr(Xi, Xj | Xk, Xl, ...), we still sum up over the unblocked paths. If we have avoided conditioning on colliders, then this is just a matter of dropping the now-blocked paths from the sum. If on the other hand we have conditioned on a collider, that path does become active (unless blocked elsewhere), and we in fact need to modify the path weights. Specifically, we need to work out the correlation induced between the two parents of the collider, by conditioning on that collider. This can be calculated from the path weights, and some fairly tedious algebra.^15 The important thing is to remember that the rule of d-separation still applies, and that conditioning on a collider can create correlations.
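As a numerical check on the path rules, here is a small sketch with an invented standardized linear-Gaussian model on the graph X ← Z → Y plus a direct edge X → Y; Wright’s rules predict corr(X, Y) = c + ab, the direct path plus the product along the path through the fork.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Invented path coefficients; each variable is standardized to unit variance.
a, b, c = 0.5, 0.4, 0.3          # Z -> X, Z -> Y, X -> Y
Z = rng.normal(size=n)
X = a * Z + np.sqrt(1 - a**2) * rng.normal(size=n)
Y = b * Z + c * X + np.sqrt(1 - b**2 - c**2 - 2*a*b*c) * rng.normal(size=n)

# Wright's rules: sum over unblocked paths of the product of coefficients.
print("path rules:", c + a * b)                # direct, plus X <- Z -> Y
print("simulation:", np.corrcoef(X, Y)[0, 1])  # should agree, ~0.50
```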

3.3 Positive and Negative Associations

We say that variables X and Y are positively associated if increasing X predicts, on average, an increase in Y, and vice versa;^16 if increasing X predicts a decrease in Y, then they are negatively associated. If this holds when conditioning out other variables, we talk about positive and negative partial associations. Heuristically, positive association means positive correlation in the neighborhood of any given x, though the magnitude of the positive correlation need not be constant. Note that not all pairs of dependent variables have an association of definite sign.

We can multiply together the signs of positive and negative partial associations along a path in a graphical model, the same way we can multiply together path

^14 That paper is now freely available online, and worth reading. See also http://www.ssc.wisc.edu/soc/class/soc952/Wright/wright_biblio.htm for references to, and in some cases copies of, related papers by Wright.

^15 See for instance Li et al. (1975).

^16 I.e., if dE[Y | X = x]/dx ≥ 0.

coefficients in a linear graphical model. Paths which contain colliders should be neglected. If all the paths connecting X and Y have the same sign, then we know that the over-all association between X and Y must have that sign. If different paths have different signs, however, then signs alone are not enough to tell us about the over-all association.

If we are interested in conditional associations, we have to consider whether our conditioning variables block paths or not. Paths which are blocked by conditioning should be dropped from consideration. If a path contains an activated collider, we need to include it, but we reverse the sign of one arrow into the

collider. That is, if X → Z ← Y with both arrows positive, and we condition on Z, we need to replace one of the plus signs with a minus sign, because the two parents now have an over-all negative association.^17 If on the other hand one of the incoming arrows had a positive association and the other was negative, we need to flip one of them so they are both positive or both negative; it doesn’t matter which, since either choice creates a positive association between the parents.^18
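The footnote’s logic is easy to reproduce numerically. A minimal sketch with invented standard-normal mechanisms: two independent causes, each with a positive arrow into a collider, acquire a negative association once we condition on the collider’s value.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# X and Y are independent; both arrows into the collider Z are positive.
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = X + Y + rng.normal(size=n)

print("corr(X, Y)        :", np.corrcoef(X, Y)[0, 1])            # ~0
# Conditioning on Z (slicing near a value): within the slice, a high X
# must be offset by a low Y to produce the same Z, and vice versa.
sel = np.abs(Z) < 0.1
print("corr(X, Y | Z ~ 0):", np.corrcoef(X[sel], Y[sel])[0, 1])  # ~ -0.5
```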

4 Further Reading

The two foundational books on graphical causal models are Spirtes et al. (2001) and Pearl (2009b). Both are excellent and recommended in the strongest possible terms; but if you had to read just one, I would recommend Spirtes et al. (2001). If on the other hand you do not feel up to reading a book at all, then Pearl (2009a) is much shorter, and covers most of the high points. (Also, it’s free online.) The textbook by Morgan and Winship (2007) is much less demanding mathematically, which also means it is less complete conceptually, but it does explain the crucial ideas clearly, simply, and with abundant examples.^19 Lauritzen (1996) has a mathematically rigorous treatment of d-separation (among many other things), but de-emphasizes causality.

Linear path models have a very large literature, going back to the early 20th century. Loehlin (1992) is user-friendly, though aimed at psychologists with less mathematical sophistication than students taking this course. Li (1975), while older, is very enthusiastic and has many interesting applications. Many software packages for linear structural equation models and path analysis offer options to search for models; these are not, in general, reliable (Spirtes et al., 2001).

^17 If both smoking and asbestos are positively associated with lung cancer, and we know the patient does not have lung cancer, then high levels of smoking must be compensated for by low levels of asbestos, and vice versa.

^18 If yellow teeth are positively associated with smoking and negatively associated with dental insurance, and we know the patient does not have yellow teeth, then high levels of smoking must be compensated for by excellent dental care, and conversely poor dental care must be compensated for by low levels of smoking.

^19 This textbook also discusses an alternative formalism for counterfactuals, due to Donald Rubin. While Rubin has done very distinguished work in causal inference, his formalism is vastly harder to manipulate than are graphical models, but has no more expressive power. (Pearl (2009a) has a convincing discussion of this point.) I have accordingly skipped the Rubin formalism here, but good accounts are available in Morgan and Winship (2007, ch. 2), and in Rubin’s collected papers (Rubin, 2006).

… matters is that if we test the hypothesis that X and Y are independent, with joint distribution p(x)p(y), against the hypothesis that they are dependent, with joint distribution p(x, y), then our power to detect dependence grows exponentially with the number of samples, and the exponential rate at which it grows is I[X; Y]. More exactly, if βn is the error probability with n samples,

−(1/n) log βn → I[X; Y]

(See Cover and Thomas (2006) again, or Kullback (1968).) So positive mutual information means dependence, and the magnitude of mutual information tells us how detectable the dependence is.

Suppose we condition X and Y on a third variable (or variables) Z. For each realization z, we can calculate the mutual information,

I[X; Y | Z = z] ≡ Σ_{x,y} p(x, y | z) log [ p(x, y | z) / ( p(x | z) p(y | z) ) ]

And we can average over z,

I[X; Y | Z] ≡ Σ_z p(z) I[X; Y | Z = z]

This is the conditional mutual information. It will not surprise you at this point to learn that X ⊥ Y | Z if and only if I[X; Y | Z] = 0. The magnitude of the conditional mutual information tells us how easy it is to detect conditional dependence.
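As a final sketch, the plug-in estimator below computes these quantities from samples for an invented binary chain X → Z → Y, where X ⊥ Y | Z holds by the Markov property; the conditional version is exactly the average over z defined above.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
n = 200_000

# Invented binary chain X -> Z -> Y: each step copies its parent 90% of the time.
X = rng.integers(0, 2, size=n)
Z = np.where(rng.random(n) < 0.9, X, 1 - X)
Y = np.where(rng.random(n) < 0.9, Z, 1 - Z)

def mi(x, y):
    """Plug-in estimate of I[X;Y] (in nats) from paired discrete samples."""
    m, pxy, px, py = len(x), Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / m) * np.log(c * m / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def cmi(x, y, z):
    """I[X;Y|Z]: average the within-slice mutual information over values of z."""
    return sum((z == v).mean() * mi(x[z == v], y[z == v]) for v in np.unique(z))

print("I[X;Y]   ~", mi(X, Y))      # clearly positive: X and Y are dependent
print("I[X;Y|Z] ~", cmi(X, Y, Z))  # near zero: independent given Z
```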

References

al Ghazali, Abu Hamid Muhammad ibn Muhammad at-Tusi (1100/1997). The Incoherence of the Philosophers = Tahafut al-Falasifah: A Parallel English-Arabic Text. Provo, Utah: Brigham Young University Press. Translated by Michael E. Marmura.

Cover, Thomas M. and Joy A. Thomas (2006). Elements of Information Theory. New York: John Wiley, 2nd edn.

Hume, David (1739). A Treatise of Human Nature: Being an Attempt to Introduce the Experimental Method of Reasoning into Moral Subjects. London: John Noon. Reprint (Oxford: Clarendon Press, 1951) of original edition, with notes and analytical index.

Janzing, Dominik (2007). “On causally asymmetric versions of Occam’s Razor and their relation to thermodynamics.” E-print, arxiv.org. URL http://arxiv.org/abs/0708.3411.

Kogan, Barry S. (1985). Averroes and the Metaphysics of Causation. Albany, New York: State University of New York Press.

Kullback, Solomon (1968). Information Theory and Statistics. New York: Dover Books, 2nd edn.

Lauritzen, Steffen L. (1996). Graphical Models. New York: Oxford University Press.

Li, Ching Chun (1975). Path Analysis: A Primer. Pacific Grove, California: The Boxwood Press.

Li, Ching Chun, Sati Mazumdar and B. Raja Rao (1975). “Partial Correlation in Terms of Path Coefficients.” The American Statistician, 29: 89–90. URL http://www.jstor.org/stable/2683271.

Loehlin, John C. (1992). Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 2nd edn.

Morgan, Stephen L. and Christopher Winship (2007). Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge, England: Cambridge University Press.

Pearl, Judea (2000). Causality: Models, Reasoning, and Inference. Cambridge, England: Cambridge University Press.

— (2009a). “Causal inference in statistics: An overview.” Statistics Surveys, 3: 96–146. URL http://projecteuclid.org/euclid.ssu/1255440554.

— (2009b). Causality: Models, Reasoning, and Inference. Cambridge, England: Cambridge University Press, 2nd edn.

Rubin, Donald B. (2006). Matched Sampling for Causal Effects. Cambridge, England: Cambridge University Press.

Shannon, Claude E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27: 379–423. URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html. Reprinted in Shannon and Weaver (1963).

Shannon, Claude E. and Warren Weaver (1963). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Spirtes, Peter, Clark Glymour and Richard Scheines (2001). Causation, Prediction, and Search. Cambridge, Massachusetts: MIT Press, 2nd edn.

Wright, Sewall (1934). “The Method of Path Coefficients.” Annals of Mathematical Statistics, 5: 161–215. URL http://projecteuclid.org/euclid.aoms/1177732676.