Graphical criteria for efficient total effect estimation via adjustment in causal linear models

Covariate adjustment is a commonly used method for total causal effect estimation. In recent years, graphical criteria have been developed to identify all valid adjustment sets, that is, all covariate sets that can be used for this purpose. Different valid adjustment sets typically provide total causal effect estimates of varying accuracies. Restricting ourselves to causal linear models, we introduce a graphical criterion to compare the asymptotic variances provided by certain valid adjustment sets. We employ this result to develop two further graphical tools. First, we introduce a simple variance decreasing pruning procedure for any given valid adjustment set. Second, we give a graphical characterization of a valid adjustment set that provides the optimal asymptotic variance among all valid adjustment sets. Our results depend only on the graphical structure and not on the specific error variances or edge coefficients of the underlying causal linear model. They can be applied to directed acyclic graphs (DAGs), completed partially directed acyclic graphs (CPDAGs) and maximally oriented partially directed acyclic graphs (maximal PDAGs). We present simulations and a real data example to support our results and show their practical applicability.


Introduction
Covariate adjustment is a popular method for estimating total causal effects from observational data. Given a causal graph, with nodes representing covariates and edges direct effects, graphical criteria have been developed to read off covariate sets that can be used for this purpose. We refer to such sets as valid adjustment sets. The best-known such criterion is probably the back-door criterion (Pearl, 1993), which is sufficient for adjustment. A necessary and sufficient criterion was developed by Shpitser et al. (2010) and Perković et al. (2018).
Given the complete identification of all valid adjustment sets, the following question naturally arises: If more than one valid adjustment set is available, which one should be used for estimation? In practice this choice will often be affected by considerations such as ease and cost of data collection. On the other hand, statistical aspects should also be taken into account as different valid adjustment sets provide estimates with varying accuracy. Restricting ourselves to causal linear models, we develop graphical tools to leverage the information encoded in the causal graph to identify adjustment sets that are not only valid but also efficient.
As of now, efficiency considerations have not featured prominently in adjustment set selection. When the treatment is a single variable X, the parent set of X, i.e. the set of direct causes of X, is often used as an adjustment set (e.g., Williamson et al., 2014; Gascon et al., 2015; Sunyer et al., 2015). Although easy to compute and guaranteed to satisfy the back-door criterion, the parents of X are typically quite inefficient in terms of the asymptotic variance, as they are usually strongly correlated with X (see Example 3.5). Another approach to choosing an adjustment set is to adjust for as few covariates as possible (e.g., De Luna et al., 2011; Jonker et al., 2012; Schliep et al., 2015). For efficiency, this is also sub-optimal in general, as adjusting for certain additional covariates that explain variance in the outcome Y, sometimes called precision variables or risk factors, can be beneficial (see Example 3.5).
Literature on variable selection for efficient total causal effect estimation has been growing in recent years, particularly in the area of propensity score methods. For example there are simulation studies (Brookhart et al., 2006;Lefebvre et al., 2008), results regarding minimum asymptotic variance bounds (Rotnitzky and Robins, 1995;Hahn, 2004;Rotnitzky et al., 2010) and theoretical results for certain estimators (Robinson and Jewell, 1991;Lunceford and Davidian, 2004;Schnitzer et al., 2016;Wooldridge, 2016). These results indicate that the following two notions hold: First, adding instrumental variables to a given valid adjustment set harms the efficiency and second, adding precision variables improves the efficiency. Model selection procedures taking these notions into account have also been developed (VanderWeele and Shpitser, 2011;Shortreed and Ertefaie, 2017).
While the above notions provide useful heuristics, there are pitfalls to the approach of labeling individual covariates as either good or bad for efficiency. Whether adding a given covariate to an adjustment set is harmful or beneficial can vary depending on the starting adjustment set, i.e. is generally speaking a conditional property. Furthermore, adding or removing a covariate might render a valid adjustment set invalid. As a result some care must be taken in sequentially applying these heuristics. Kuroki and Miyakawa (2003) and Kuroki and Cai (2004) circumvent these difficulties by comparing the efficiency of certain pairs of valid adjustment sets, rather than considering the behavior of individual covariates. Both introduce graphical criteria that identify which of two valid adjustment sets provides the smaller asymptotic variance in causal linear models. The criterion from Kuroki and Miyakawa (2003) compares adjustment sets of size two and the criterion from Kuroki and Cai (2004) compares disjoint adjustment sets. Furthermore, both criteria require a directed acyclic graph (DAG) and a multivariate Gaussian distribution. We extend these results in various directions.
Our first result is a new graphical criterion (see Theorem 3.4) that can compare more pairs of valid adjustment sets than the existing criteria (Kuroki and Miyakawa, 2003; Kuroki and Cai, 2004). Our result holds for causal linear models with arbitrary error distributions as well as single and joint interventions. It can also be applied to graph types other than DAGs. We note, however, that we still cannot compare all pairs of valid adjustment sets. This is in fact impossible with the graph alone (see Example 3.5).
Building on Theorem 3.4, we introduce two further results. First, we provide a simple order invariant pruning procedure that, given a valid adjustment set, returns a subset that is also valid and provides equal or smaller asymptotic variance (see Algorithm 1 and Theorem 3.9). Our procedure is similar to that of VanderWeele and Shpitser (2011, Propositions 1 and 2), who conjectured a resulting efficiency gain. Our main contribution is that we formally establish this efficiency gain for causal linear models and show the order invariance of such a procedure.
Second, we define a valid adjustment set that provides the smallest possible asymptotic variance among all valid adjustment sets relative to (X, Y) in the underlying causal graph G (see Theorem 3.13). We denote this adjustment set by O(X, Y, G) and refer to it as asymptotically optimal. The fact that such an asymptotically optimal set can be defined is perhaps surprising, considering that Theorem 3.4 only allows for the comparison of certain pairs of valid adjustment sets. Our results depend only on the structure of the causal graph and not on the specific edge weights or error distributions of the underlying causal linear model. We also discuss how our results can be applied to cases with unmeasured covariates in the Discussion (Section 6).
We also provide numerical experiments to quantify the efficiency that is gained by using O(X, Y, G) in finite samples (see Section 4), and also apply our methods to single cell data from Sachs et al. (2005) (see Section 5). All proofs can be found in the Supplement (Henckel et al., 2020). We have also made our code available at https://github.com/henckell/CodeEfficientVAS. Independent follow-up research has already expanded upon our results in various directions. In particular, Rotnitzky and Smucler (2020) show that our results on the asymptotic optimality of O(X, Y, G) extend to a broad class of non-parametric estimators. Building on this, Smucler et al. (2020) consider even more general settings and construct adjustment sets with efficiency guarantees other than asymptotic optimality. van der Zander and Liskiewicz (2019) provide a polynomial time algorithm to compute O(X, Y, G). Witte et al. (2020) provide an alternative characterization of O(X, Y, G) and also integrate O(X, Y, G) into the IDA algorithm by Maathuis et al. (2009). Kuipers and Moffa (2020) investigate the theoretical finite sample performance of O(X, Y, G) in a specific non-linear example and discuss how O(X, Y, G)'s performance relates to causal discovery considerations.

Preliminaries
In this paper we use graphs where nodes represent random variables, and edges represent conditional dependencies and direct causal effects. We now give an overview of the main graphical objects used in this paper. We give the usual graphical definitions and define these objects more formally in Section A.1 of the Supplement.
We consider three classes of acyclic graphs: directed acyclic graphs (DAGs), completed partially directed acyclic graphs (CPDAGs) and maximally oriented partially directed acyclic graphs (maximal PDAGs) (see Example A.3 in the Supplement). DAGs are directed graphs, i.e. graphs with all edges of the form → and without directed cycles. They arise naturally to describe causal relationships under the assumption of no feedback loops (cf. Pearl, 2009). Generally it is not possible to learn the causal DAG from observational data alone. Under the assumptions of causal sufficiency and faithfulness, one can, however, learn a Markov equivalence class of DAGs, which can be uniquely represented by a CPDAG (Meek, 1995; Andersson et al., 1997; Spirtes et al., 2000; Chickering, 2002). Given explicit knowledge of some causal relationships between variables, access to interventional data, or some model restrictions, one can obtain a refinement of this class, uniquely represented by a maximal PDAG (Meek, 1995; Scheines et al., 1998; Hoyer et al., 2008; Hauser and Bühlmann, 2012; Eigenmann et al., 2017; Wang et al., 2017). All three graph types encode conditional independence relationships that can be read off the graph by applying the well known d-separation criterion (see Definition 1.2.3 in Pearl (2009) for DAGs, Definition 3.5 in Maathuis and Colombo (2015) for CPDAGs and Lemma C.1 of the Supplement for maximal PDAGs). We use the notation X ⊥_G Y | Z to denote that Z d-separates X from Y in G, with X, Y and Z pairwise disjoint node sets in a graph G.
Remark 2.1. DAGs and CPDAGs are special cases of maximal PDAGs. In the remainder of the paper results are generally stated in terms of maximal PDAGs. Readers unfamiliar with CPDAGs and maximal PDAGs may also disregard this and simply think of all results as being with respect to DAGs.
We now introduce causal linear models and total effects, and define some notation.
Causal DAGs, CPDAGs, maximal PDAGs. We consider interventions do(x) (for X ⊆ V), which represent outside interventions that set X to x uniformly for the entire population (Pearl, 1995). A density f of V = {V 1 , . . . , V p } is compatible with a causal DAG G = (V, E) if all post-intervention densities f (v|do(x)) factorize as:
$$f(v \mid do(x)) = \begin{cases} \prod_{V_i \in V \setminus X} f(v_i \mid pa(v_i, G)) & \text{if } v \text{ is consistent with } x, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
Equation (1) is known as the truncated factorization formula (Pearl, 2009), manipulated density formula (Spirtes et al., 2000) or the g-formula (Robins, 1986). A density f of V = {V 1 , . . . , V p } is compatible with a causal maximal PDAG or a causal CPDAG G if it is compatible with a causal DAG D ∈ [G], where [G] is the class of DAGs represented by G.
Causal linear model. Let G = (V, E) be a DAG. Then V = (V 1 , . . . , V p ) T , p ≥ 1 follows a causal linear model compatible with G if the following two conditions hold:
1. The distribution f of V is compatible with the causal DAG G.
2. V follows a set of linear equations
$$V_i = \sum_{V_j \in pa(V_i, G)} \alpha_{ij} V_j + \epsilon_{v_i}, \tag{2}$$
where α ij ∈ R and ε v 1 , . . . , ε v p are jointly independent random variables with mean 0 and finite variance.
V follows a causal linear model compatible with a maximal PDAG or CPDAG G, if it follows a causal linear model compatible with a DAG D ∈ [G]. We refer to ε v 1 , . . . , ε v p as errors and emphasize that we do not require them to be Gaussian. Furthermore, by construction E[V] = 0. The coefficient α ij corresponding to the edge V j → V i in the causal DAG G can be interpreted as the direct effect of V j on V i with respect to V.
A do intervention, for example do(V 4 = 1), then corresponds to replacing the generating mechanism of the intervened on variables with the fixed intervention value, e.g. V 4 ← 1.
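As a minimal numerical sketch of these definitions (our own illustration, not code from the paper), the snippet below assumes the usual linear structural equations V = AV + ε with A[i, j] = α ij and independent errors. It computes the implied covariance matrix for a hypothetical chain V 1 → V 2 → V 3 and the mean of V under do(V 2 = 1), obtained by replacing the equation of the intervened node with a constant. The function names and the example graph are assumptions made for illustration.

```python
import numpy as np

def sem_covariance(A, error_var):
    """Covariance of V for V = A V + eps with independent, mean-zero errors."""
    p = A.shape[0]
    B = np.linalg.inv(np.eye(p) - A)      # V = B eps
    return B @ np.diag(error_var) @ B.T

def mean_under_do(A, target, value):
    """E[V | do(V_target = value)]: replace the target's equation by a constant."""
    p = A.shape[0]
    A_do = A.copy()
    A_do[target, :] = 0.0                 # drop all edges into the target
    shift = np.zeros(p)
    shift[target] = value                 # V_target <- value
    return np.linalg.solve(np.eye(p) - A_do, shift)

# Hypothetical chain V1 -> V2 -> V3 (0-based indices), unit error variances.
A = np.zeros((3, 3))
A[1, 0] = 0.5   # alpha_21
A[2, 1] = 2.0   # alpha_32
Sigma = sem_covariance(A, np.ones(3))
mean_do = mean_under_do(A, target=1, value=1.0)
```

In this toy model the intervention leaves the mean of V 1 at zero and shifts the mean of V 3 by the coefficient of the edge V 2 → V 3.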

Causal and proper paths
Let X and Y be disjoint node sets in a causal maximal PDAG G. A path p from X to Y is proper if only the first node on p is in X.
Total effects. We define the total effect of X on Y as the matrix τ yx whose (j, i)-th entry
$$(\tau_{yx})_{j,i} = \frac{\partial}{\partial x_i} E[Y_j \mid do(x_1, \dots, x_{k_x})]$$
(Pearl, 2009) represents the effect of X i on Y j in the joint intervention of X on Y. In general, τ yx is a matrix of functions, but in causal linear models the partial derivatives do not depend on x i . Hence, τ yx reduces to a matrix of numbers, whose values are determined by the coefficients in Equation (2) (Wright, 1934; Nandy et al., 2017). We can thus give an equivalent definition of the total effect specific to this setting. Consider disjoint node sets X = {X 1 , . . . , X kx } and Y = {Y 1 , . . . , Y ky } in a causal DAG G = (V, E), such that V follows a causal linear model compatible with G. The total effect along a causal path p from X to Y in G is the product of the edge coefficients along p. The total effect of X on Y is then the matrix τ yx ∈ R ky×kx whose (j, i)-th value (τ yx ) j,i is equal to the sum of the total effects along all proper causal paths from X to Y j starting with X i in G.
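The path-based definition of the total effect can be sketched numerically as follows. This is an illustrative sketch under our own conventions, not the authors' implementation: A[i, j] holds the coefficient of the edge V j → V i, and the helper name total_effect and the three-node DAG are hypothetical. Deleting all edges into X leaves only directed paths that never re-enter X, so the entries of (I − A)^{-1} for the modified matrix sum the coefficient products over proper causal paths.

```python
import numpy as np

def total_effect(A, x_idx, y_idx):
    """Total effect matrix tau_yx (rows: Y, columns: X) in a linear SEM.

    A[i, j] is the coefficient of the edge V_j -> V_i. Removing all edges
    into X keeps only directed paths that start in X and never re-enter X,
    and (I - A)^{-1} sums coefficient products over directed paths.
    """
    A_do = A.copy()
    A_do[x_idx, :] = 0.0
    B = np.linalg.inv(np.eye(A.shape[0]) - A_do)
    return B[np.ix_(y_idx, x_idx)]

# Hypothetical DAG: X -> M -> Y and X -> Y, with indices X=0, M=1, Y=2.
A = np.zeros((3, 3))
A[1, 0] = 0.5   # X -> M
A[2, 1] = 2.0   # M -> Y
A[2, 0] = 0.3   # X -> Y

tau_single = total_effect(A, [0], [2])      # single intervention on X
tau_joint = total_effect(A, [0, 1], [2])    # joint intervention on X and M
```

In the joint intervention the entry for X only counts the direct edge X → Y, since the path through M is not proper.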
If V follows a causal linear model compatible with a causal CPDAG or maximal PDAG G, the total effect of X on Y is identifiable if it is the same for every DAG in [G].
Remark 2.3. Consider the total effect τ yx of X on Y. If for some Y j ∈ Y and X i ∈ X, Y j is a non-descendant of X i then (τ yx ) j,i = 0. Further, the total effect τ y j x i of X i on Y j will generally differ from the partial total effect (τ yx ) j,i in the joint intervention on X. This is due to the latter effect not considering causal paths from X i to Y j that contain other nodes in X \ {X i }. The total effect of X on any Y j , however, does not depend on the remaining Y \ {Y j }.
Notation for covariance matrices and regression coefficients. Consider random vectors S, T, and W 1 , W 2 , . . . , W m , and let W = (W 1 T , . . . , W m T ) T , k s = |S| and k t = |T|. We denote the covariance matrix of S with Σ ss ∈ R ks×ks and the covariance matrix between S and T with Σ st ∈ R ks×kt , where its (i, j)-th element equals Cov(S i , T j ). We further define Σ ss.t = Σ ss − Σ st Σ −1 tt Σ ts . If |S| = 1, we write σ ss.t instead. Let β st.w ∈ R ks×kt represent the least squares regression coefficient matrix whose (i, j)-th element is the regression coefficient of T j in the regression of S i on T and W, with β̂ st.w denoting the corresponding estimator. We also use the notation that β st.w 1 w 2 ···wm = β st.w and Σ st.w 1 w 2 ···wm = Σ st.w . Given a set X = {X 1 , . . . , X k } we use the notation X −i to denote X \ {X i }.
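To make this notation concrete, the following sketch computes Σ st.w and the population regression coefficient β st.w from a covariance matrix. The helper names and the 3×3 example matrix (the covariance of a hypothetical chain V 1 → V 2 → V 3 with coefficients 0.5 and 2 and unit error variances) are our own assumptions for illustration.

```python
import numpy as np

def cond_cov(Sigma, s, t, w):
    """Sigma_{st.w} = Sigma_st - Sigma_sw Sigma_ww^{-1} Sigma_wt for index lists s, t, w."""
    out = Sigma[np.ix_(s, t)].copy()
    if len(w) > 0:
        out -= Sigma[np.ix_(s, w)] @ np.linalg.solve(Sigma[np.ix_(w, w)], Sigma[np.ix_(w, t)])
    return out

def reg_coef(Sigma, s, t, w):
    """Population coefficient beta_{st.w} of T in the regression of S on T and W."""
    return cond_cov(Sigma, s, t, w) @ np.linalg.inv(cond_cov(Sigma, t, t, w))

# Covariance of a hypothetical chain V1 -> V2 -> V3 (0-based indices 0, 1, 2).
Sigma = np.array([[1.0, 0.5, 1.0],
                  [0.5, 1.25, 2.5],
                  [1.0, 2.5, 6.0]])

sigma_33_2 = cond_cov(Sigma, [2], [2], [1])   # residual variance of V3 given V2
beta_32_1 = reg_coef(Sigma, [2], [1], [0])    # coefficient of V2 when regressing V3 on V2 and V1
```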

Total effect estimation via covariate adjustment
In causal linear models total effects can be estimated via OLS regression given an appropriate adjustment set. This result is well known in the Gaussian case with Shpitser et al. (2010) and Perković et al. (2018) having fully characterized the class of valid adjustment sets (see Definition A.4).
The fact that total effects can be estimated via OLS regression has been shown to generalize to causal linear models with arbitrary error distributions for a singleton X with the adjustment set pa(X, G) (Proposition 3.1 from the supplement of Nandy et al., 2017). We now extend this property to arbitrary valid adjustment sets and derive the estimator's asymptotic distribution.
Proposition 3.1. Let X = {X 1 , . . . , X kx } and Y = {Y 1 , . . . , Y ky } be disjoint node sets in a causal DAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z be a valid adjustment set relative to (X, Y) in G.
The key aspect of Proposition 3.1 is that it does not require the considered regression of Y on X and Z to be well-specified, in the sense of being linear and having homoskedastic residuals. One may think this generality is not needed, given that we consider causal linear models. However, in a causal linear model with non-Gaussian errors, adjusted regressions other than that of a node on its parents are not generally well-specified (see Example A.7). We note that for Proposition 3.1 to hold for misspecified regressions, it is essential that Z is a valid adjustment set. Of course, Proposition 3.1 corresponds to what we know for well-specified regressions, in which case the restriction to valid adjustment sets is not needed. For causal linear models with Gaussian errors all regressions are well-specified.
Due to the result in Proposition 3.1, we use the notation τ̂ z yx to denote the least squares estimate β̂ yx.z of τ yx , for any valid adjustment set Z relative to (X, Y). We also write a.var(τ̂ z yx ) to denote the matrix with entries a.var(τ̂ z yx ) j,i = a.var((τ̂ z yx ) j,i ), i = 1, . . . , k x and j = 1, . . . , k y .
Remark 3.2. The terms in Equation (3) depend on the distribution of V = {V 1 , . . . , V p } only through the covariance matrix Σ vv , which in turn only depends on the underlying causal linear model through the edge coefficients α ij and error variances var(ε v i ) (Nandy et al., 2017). In particular, this implies that the asymptotic variance a.var(τ̂ z yx ) does not depend on the error distribution families.
Example 3.3. Consider the causal DAG G = (V, E) in Figure 1 and assume that V follows the causal linear model from Example 2.2. The total effect of V 4 on V 6 in G is τ 64 = α 64 + α 65 α 54 . By Proposition 3.1, τ 64 also equals the population level regression coefficient of V 4 in the regression of V 6 on V 4 and any adjustment set of the form A ∪ B, with A ⊆ {V 1 , V 2 } non-empty and B ⊆ {V 3 } possibly empty (see Definition A.4).
Figure 2: (a) Causal DAG from Examples 3.5, 3.10 and 3.15, (b) causal DAG from Examples 3.11 and 3.15.

Comparing valid adjustment sets
We now introduce a new graphical criterion for qualitative comparisons between the asymptotic variances resulting from certain pairs of valid adjustment sets, which is more general than the criteria of Kuroki and Miyakawa (2003) and Kuroki and Cai (2004).
Theorem 3.4. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E), such that V follows a causal linear model that is compatible with G. Let Z 1 and Z 2 be two valid adjustment sets relative to (X, Y) in G and let with the matrix inequality denoting entry wise inequality.
The proof of Theorem 3.4 relies on equation (3). The intuition behind it is that the more information a conditioning set B contains on a target variable A the smaller σ aa.b . Thus, the assumed conditional independence statements imply that We stress that when a causal linear model with non-Gaussian errors is considered, Theorem 3.4 holds only for pairs of valid adjustment sets. This is due to Proposition 3.1 only holding for misspecified regressions when a valid adjustment set is considered.
We can thus apply Theorem 3.4 to the following pairs of valid adjustment sets: We can conclude that adding A or D to any conditioning set worsens the asymptotic variance, while a converse statement holds for C. Consequently, {B, C} provides the best asymptotic variance, while the set pa(X, G) = {A, B} does not fare well.
In order to empirically verify these results, we randomly drew six causal linear models compatible with G and computed the asymptotic variances a.var(τ̂ z yx ) for each valid adjustment set Z relative to (X, Y) in G, across these six models. Specifically, we did the following for each model. We drew error variances σ vv for each node V ∈ V independently from a standard uniform distribution and edge coefficients α vw for each edge (W, V) ∈ E independently from a standard normal distribution. From these parameters we computed the causal linear model's covariance matrix and then, in accordance with Proposition 3.1, the asymptotic variances corresponding to each valid adjustment set. We did not consider error properties other than the variance (and mean 0) as they are irrelevant for the asymptotic variances (see Remark 3.2).
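The verification procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the DAG is a hypothetical five-node graph (P → X, B → X, B → Y, R → Y, X → Y), not the graph of Figure 2, and we assume that for singleton X and Y the asymptotic variance takes the familiar OLS form σ yy.xz / σ xx.z . In this toy graph P plays the role of a superfluous parent of X and R the role of a parent of Y.

```python
import numpy as np

def cond_var(Sigma, a, cond):
    """Conditional variance sigma_{aa.cond} computed from a covariance matrix."""
    if not cond:
        return Sigma[a, a]
    c = list(cond)
    return Sigma[a, a] - Sigma[a, c] @ np.linalg.solve(Sigma[np.ix_(c, c)], Sigma[c, a])

def avar(Sigma, x, y, z):
    """Assumed asymptotic variance sigma_{yy.xz} / sigma_{xx.z} for singleton X and Y."""
    return cond_var(Sigma, y, [x] + list(z)) / cond_var(Sigma, x, list(z))

P, B, R, X, Y = range(5)
edges = [(P, X), (B, X), (B, Y), (R, Y), (X, Y)]   # (parent, child), hypothetical DAG
valid_sets = {"B": [B], "B,P": [B, P], "B,R": [B, R], "B,P,R": [B, P, R]}

rng = np.random.default_rng(0)
results = []
for _ in range(6):
    A = np.zeros((5, 5))
    for parent, child in edges:
        A[child, parent] = rng.standard_normal()   # edge coefficients ~ N(0, 1)
    err_var = rng.uniform(0.0, 1.0, size=5)        # error variances ~ U(0, 1)
    M = np.linalg.inv(np.eye(5) - A)
    Sigma = M @ np.diag(err_var) @ M.T             # model covariance matrix
    results.append({name: avar(Sigma, X, Y, z) for name, z in valid_sets.items()})
```

Each entry of results holds the asymptotic variances of the four valid adjustment sets for one random model, so the orderings implied by the d-separation statements can be checked model by model.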
The asymptotic variances obtained in this way are given in Table 1. They show that the three proven trends do in fact hold and that {B, C} provides the best asymptotic variance in the considered models. Interestingly, the order of the asymptotic variances corresponding to any two sets that cannot be compared using Theorem 3.4, such as {A, B, C} and {B}, or {A, B} and {B, D}, is in fact inconsistent across the considered models.
We now give two simple corollaries of Theorem 3.4. The first one shows that superfluous parents of X are harmful for the asymptotic variance, while the second one shows that parents of Y are beneficial.
Corollary 3.6. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z be a valid adjustment set relative to (X, Y) in G and let P ∈ pa(X, G).
Corollary 3.7. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z be a valid adjustment set relative to (X, Y) in G and let R ∈ pa(Y, G).
We now give a third corollary of Theorem 3.4, especially relevant for randomized trials, where pa(X, G) = ∅. It shows that, when restricting oneself to covariates not in de(X, G), enlarging an adjustment set can only be beneficial for the asymptotic variance. In particular this implies that adjusting for additional pre-treatment covariates in a randomized trial can only be beneficial for the asymptotic variance.
Corollary 3.8. Let X and Y be disjoint node sets in a DAG G = (V, E), such that pa(X, G) = ∅ and let V follow a causal linear model compatible with G. Let Z and Z be two node sets in G, such that Z

Pruning procedure
The result from Theorem 3.4 can be used to prune a valid adjustment set to obtain a subset that is still valid and yields a smaller asymptotic variance. Generally, which of the subsets Z̃ ⊆ Z provides the optimal asymptotic variance depends on the edge coefficients in the underlying causal linear model (see Example 3.11). However, we can use Theorem 3.4 to identify a subset such that there is no other subset for which Theorem 3.4 guarantees a better asymptotic variance. This is formalized in Algorithm 1 whose soundness is stated in Theorem 3.9.
In practice, such pruning is advisable as it reduces the number of variables that need to be measured while also improving precision. Although similar pruning procedures exist (Hahn, 2004;VanderWeele and Shpitser, 2011), we believe Theorem 3.4 and Theorem 3.9 to be the first theoretical guarantees for such pruning in causal linear models.
Theorem 3.9. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z be a valid adjustment set relative to (X, Y) in G. Applying Algorithm 1 then yields a valid adjustment set Z′ ⊆ Z, such that a.var(τ̂ z′ yx ) ≤ a.var(τ̂ z yx ) and there is no other subset of Z for which Theorem 3.4 guarantees a better asymptotic variance than Z′. Further, Algorithm 1 outputs the same set Z′, regardless of the order in which the nodes in Z are considered.
Algorithm 1: Pruning procedure
input: Causal maximal PDAG G and disjoint node sets X, Y and Z in G, such that Z is a valid adjustment set relative to (X, Y) in G
output: Valid adjustment set Z′ ⊆ Z relative to (X, Y) in G
Example 3.10. We now return to Example 3.5 and the DAG G = (V, E) in Figure 2(a). Given a valid adjustment set Z, Algorithm 1 will discard the nodes A and D, while keeping the nodes B and C whenever these nodes are in Z. This is done independently of the order in which the nodes are considered. Hence, Z will either be pruned to {B} or {B, C}. Both these sets are valid adjustment sets relative to (X, Y) in G and {B, C} yields the optimal asymptotic variance of all valid adjustment sets, while {B} yields the optimal asymptotic variance of all valid adjustment sets that do not contain C.
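A pruning pass of this kind can be sketched for DAGs as follows. The steps of Algorithm 1 are not reproduced in the text above, so the removal rule used here is an assumption: a node Z i is dropped whenever Y is d-separated from Z i given X and the remaining nodes, which is one natural reading of Theorem 3.4. The d-separation check uses the standard moralized ancestral graph construction, and the example DAG (P → X, B → X, B → Y, R → Y, X → Y, P → D) and all function names are hypothetical.

```python
from itertools import permutations

def ancestors(parents, nodes):
    """All ancestors of `nodes` in the DAG, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """Check whether zs d-separates xs from ys (pairwise disjoint sets) in a DAG."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    adj = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, ()))
        for i, p in enumerate(ps):
            adj[v].add(p)
            adj[p].add(v)
            for q in ps[i + 1:]:          # marry co-parents (moralization)
                adj[p].add(q)
                adj[q].add(p)
    seen, stack = set(xs), list(xs)
    while stack:                           # undirected search that avoids zs
        v = stack.pop()
        if v in ys:
            return False
        for u in adj[v] - zs - seen:
            seen.add(u)
            stack.append(u)
    return True

def prune(parents, x, y, z, order=None):
    """Assumed rule: drop Z_i if Y is d-separated from Z_i given X and the rest."""
    current = set(z)
    for node in (order if order is not None else sorted(z)):
        rest = current - {node}
        if d_separated(parents, set(y), {node}, set(x) | rest):
            current = rest
    return current

# Hypothetical DAG given as a child -> parents map.
parents = {"X": ["P", "B"], "Y": ["X", "B", "R"], "D": ["P"]}
z_start = ["P", "B", "R", "D"]
pruned = prune(parents, {"X"}, {"Y"}, z_start)
outputs = {frozenset(prune(parents, {"X"}, {"Y"}, z_start, order=o))
           for o in permutations(z_start)}
```

In this toy graph every processing order yields the same pruned set, which mirrors the order invariance stated in Theorem 3.9.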
Example 3.11. We now give an example in which one cannot use Theorem 3.4 to decide which subset Z̃ ⊆ Z of a valid adjustment set Z provides the optimal asymptotic variance. Instead, the optimal subset depends on the edge coefficients and error variances of the underlying causal linear model. Consider the DAG G in Figure 2(b) and two sets of possible edge coefficients for G. Let all edge coefficients that are not explicitly mentioned be 1 and let α ba = 0.5, α xa = 0.25 and α yx = 2 in Case i), while α xa = 0.7 and α yc = 0.5 in Case ii). With all error variances equal to 1 in both cases, one obtains the asymptotic variances shown in Table 2, where we ignore error properties other than variance (and mean 0) in accordance with Remark 3.2.
The set {C} provides the smallest asymptotic variance in both cases and will also be the output of Algorithm 1 applied to any valid adjustment set containing C. If we instead consider valid adjustment sets that do not contain C the situation is more complex. If, for example, we apply Algorithm 1 to {A, B, D}, the output is {A, B}, which is the subset that yields the optimal asymptotic variance in Case i), but is bested by the empty set in Case ii). These two sets cannot be compared with Theorem 3.4. However, Theorem 3.4 still implies that the valid adjustment sets {A}, {D}, {A, D} and {A, B, D} provide worse asymptotic variances than both the empty set and {A, B}. Algorithm 1 will prune these sets to either {A, B} or the empty set, depending on whether they originally included {A, B}.

The optimal adjustment set
We will now define a set that provides the optimal asymptotic variance among all valid adjustment sets. This is remarkable since Theorem 3.4 can only compare the asymptotic variance provided by certain valid adjustment sets (see Examples 3.5 and 3.11). Nevertheless, it allows us to define this optimal set, whose optimality only depends on the underlying causal graph. We first give some preparatory definitions which for simplicity we restrict to the DAG setting. The general definitions for maximal PDAGs are given in Section A.1. This includes the definition of the set of possible descendants possde(X, G), which in the case that G is a DAG, reduces to the set of descendants de(X, G).
Causal and forbidden nodes. Consider a DAG G and two disjoint node sets X, Y in G. We define causal nodes relative to (X, Y) in G, denoted cn(X, Y, G), as all nodes on proper causal paths from X to Y, excluding nodes in X. For singleton X, causal nodes are also called mediating nodes. We then define the forbidden nodes relative to (X, Y) in G as forb(X, Y, G) = de(cn(X, Y, G), G) ∪ X. Note that, differently from Perković et al. (2018), we also include X in forb(X, Y, G) to simplify notation. The forbidden set characterizes those covariates that may never be included in a valid adjustment set (see Definition A.4).
Definition 3.12. Let X and Y be disjoint node sets in a maximal PDAG G. We define O(X, Y, G) as:
$$O(X, Y, G) = pa(cn(X, Y, G), G) \setminus forb(X, Y, G).$$
(ii) Let Z be a valid adjustment set relative to (X, Y) in G. If V follows a causal linear model compatible with G then a.var(τ̂ o yx ) ≤ a.var(τ̂ z yx ).
(iii) Let Z be a valid adjustment set relative to (X, Y) in G, such that a.var(τ̂ o yx ) = a.var(τ̂ z yx ). If V follows a causal linear model compatible with G and f is faithful to G then O ⊆ Z.
Remark 3.14. In Theorem 3.13 we assume that Y ⊆ possde(X, G). If Y ⊈ possde(X, G) we can instead consider the total effect of X on Ỹ = Y ∩ possde(X, G), since the total effect of X on Y \ Ỹ is 0 (see Remark 2.3). Hence, this restriction only prevents us from superfluously estimating some zero values.
Statement (i) implies that our optimal set, similarly to the Adjust(X, Y, G) set from Definition 12 in Perković et al. (2018), can be used to check if there exists a valid adjustment set, albeit with the added qualifier that Y has to be appropriately pruned in advance (see Remark 3.14). Due to statement (ii) in Theorem 3.13 we call O(X, Y, G) asymptotically optimal. Statement (iii) implies that in case of faithfulness no other asymptotically optimal set is of smaller or equal size than O(X, Y, G).
As a corollary to Theorem 3.13 jointly with Theorem 3.9, the output of Algorithm 1 is O(X, Y, G), whenever the starting valid adjustment set Z is a superset of O(X, Y, G). It is of course simpler to compute O directly rather than via pruning.
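A direct computation of the optimal set for DAGs can be sketched as follows. This is an illustrative sketch, not the authors' code, and it assumes the DAG-case formulas cn(X, Y, G) as the nodes on proper causal paths, forb(X, Y, G) = de(cn(X, Y, G), G) ∪ X and O(X, Y, G) = pa(cn(X, Y, G), G) \ forb(X, Y, G). The example graph is loosely modelled on the description of G 1 in Example 3.15; the orientation of the edges at V 1 and all helper names are assumptions.

```python
def reachable(adj, sources, avoid=()):
    """Nodes reachable from `sources` along `adj` without stepping onto `avoid`."""
    avoid = set(avoid)
    seen, stack = set(), list(sources)
    while stack:
        v = stack.pop()
        for u in adj.get(v, ()):
            if u not in seen and u not in avoid:
                seen.add(u)
                stack.append(u)
    return seen

def optimal_set(parents, x, y):
    """Return (O, cn, forb) for disjoint node sets x, y in a DAG (child -> parents map)."""
    x, y = set(x), set(y)
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    below_x = reachable(children, x, avoid=x)      # directed paths leaving X, never re-entering X
    above_y = reachable(parents, y, avoid=x) | y   # nodes reaching Y without passing through X
    cn = below_x & above_y                         # nodes on proper causal paths
    forb = cn | reachable(children, cn) | x        # descendants of causal nodes, plus X
    pa_cn = {p for v in cn for p in parents.get(v, ())}
    return pa_cn - forb, cn, forb

# Hypothetical DAG: V1 -> V2, V1 -> V4, {V2, V3, V4} -> V5, {V4, V5} -> V6.
parents = {"V2": ["V1"], "V4": ["V1"], "V5": ["V2", "V3", "V4"], "V6": ["V4", "V5"]}
O, cn, forb = optimal_set(parents, {"V4"}, {"V6"})
```

For this graph the sketch returns the non-forbidden parents of the causal node V 5 , in line with the discussion in Example 3.15.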
Remarkably, given a maximal PDAG G amenable relative to some tuple of node sets (X, Y), such that Y ⊆ possde(X, G), O(X, Y, G) is not only the optimal set amongst all valid adjustment sets in G but also among all valid adjustment sets in any DAG D ∈ [G]. In fact forb(X, Y, D ) = forb(X, Y, G) and O(X, Y, D ) = O(X, Y, G) for all DAGs D ∈ [G] (see Lemmas E.7 and E.8).
Intuitively, O(X, Y, G) is constructed to maximize information on Y, while minimizing information on X and preserving validity. Although one may think that a simpler set, such as pa(Y, G) \ (X ∪ Y) would suffice for this purpose, this is not generally the case. We illustrate this in Example 3.15. Interestingly, Witte et al. (2020) have shown that O(X, Y, G) can indeed be characterized as pa(Y, G̃) \ (X ∪ Y), in a specific latent projection graph G̃ of G.
Example 3.15. Consider the DAG in Figure 1 and the two DAGs from Figure 2, denoted, respectively, as G 1 , G 2.a and G 2.b . We now illustrate how to construct O(X, Y, G) and the results from Theorem 3.13.
Consider first G 1 and suppose we are interested in the total effect of V 4 on V 6 as in Example 3.3. Here pa(cn(V 4 , V 6 , G 1 ), G 1 ) = {V 2 , V 3 , V 4 , V 5 } and forb(V 4 , V 6 , G 1 ) = {V 4 , V 5 , V 6 }, so that O(V 4 , V 6 , G 1 ) = {V 2 , V 3 }. As shown in Example 3.3, {V 2 , V 3 } is a valid adjustment set relative to (V 4 , V 6 ) in G 1 . It can also easily be verified with Theorem 3.4 that {V 2 , V 3 } provides a smaller asymptotic variance than any of the alternative valid adjustment sets. Hence, it is asymptotically optimal as claimed in Theorem 3.13.
In these two cases the results from Theorem 3.13 are corroborated by Example 3.5 and Example 3.10, respectively. We now discuss why O(X, Y, G) takes its distinctive form by considering these three examples. By the result from Theorem 3.4, an asymptotically optimal valid adjustment set with respect to (X, Y) in G must contain less or equal information on X and more or equal information on Y than any other valid adjustment set.
One might intuitively expect that pa(Y, G) \ (X ∪ Y) satisfies these properties. This is indeed the case for two of the three examples considered here. This pattern, however, fails to hold for G 1 . Here, pa(V 6 , G 1 ) \ {V 4 , V 6 } contains the causal node V 5 , which is forbidden. The construction of O(V 4 , V 6 , G 1 ) solves this problem by using the next-closest non-forbidden nodes instead, that is, the non-forbidden parents {V 2 , V 3 } of the causal node V 5 . This ensures validity, while maximizing information on V 6 and not providing unnecessary information on V 4 . Specifically, V 2 , as the non-forbidden node closest to V 6 (and furthest from V 4 ) on the non-causal path (V 4 , V 1 , V 2 , V 5 , V 6 ), is the most efficient choice to block this path. Moreover, V 3 , although superfluous for validity, contains only information on V 6 and therefore improves precision. Interestingly, it does so even though V 3 ∉ pa(V 6 , G 1 ).

Simulation study
We investigate the finite sample performance of adjusting for O(X, Y, G) by sampling data from randomly generated causal linear models and comparing the empirical mean squared error provided by O(X, Y, G) to three alternative adjustment sets. A detailed explanation of our simulation setup is given in Section F.1 of the Supplement (Henckel et al., 2020). We randomly generate a total of 10,000 DAGs, with the number of nodes chosen from {10, 20, 50, 100} and the expected neighborhood size from {2, 3, 4, 5}. Each graph is associated with a causal linear model. The edge coefficients of the model are drawn independently from a uniform distribution on [−2, −0.1] ∪ [0.1, 2], and the errors are either drawn from a Gaussian distribution, a t-distribution with 5 degrees of freedom, a logistic distribution or a uniform distribution, with variances in the range of [0.5, 1.5]. For each DAG D, we randomly draw (X, Y) such that |X| ∈ {1, 2, 3} and Y ∈ ⋂_{X i ∈ X} de(X i , D). We do this for the following two reasons. First, the restriction to a singleton Y is sensible by Remark 2.3. Second, if Y ∉ de(X i , D) for some X i ∈ X then the corresponding entry of the total effect (τ yx ) i = 0 (see Remark 2.3). We then verify whether there exists a valid adjustment set with respect to (X, Y) in both the DAG D and its CPDAG C. If not, we resample X and Y.
For each causal linear model we generate 100 data sets with sample sizes n ∈ {125, 500, 2000, 10000}. We then consider two settings: (i) we suppose knowledge of the true causal DAG D and (ii) we estimate a graph G from the data. If the errors are drawn from a Gaussian distribution, G is estimated with the Greedy Equivalence Search (GES) algorithm (Chickering, 2002), otherwise with the Linear Non-Gaussian Acyclic Models (LiNGAM) algorithm (Shimizu et al., 2006). In both cases we use the algorithms as implemented in the pcalg R-package (Kalisch et al., 2012).
We then compute total effect estimates by adjusting for O(X, Y, G) and three alternative adjustment sets. This is done with respect to both the true causal DAG D and the estimated causal graph G, with two special cases for the estimates with respect to G. First, no estimate is returned if there is no valid adjustment set relative to (X, Y ) in G, i.e., these cases are discarded for the mean squared error computation. In such cases, we recommend the use of alternative total effect estimators such as the IDA algorithm by Maathuis et al. (2009) and the jointIDA algorithm by Nandy et al. (2017). Second, 0 is returned as the estimate whenever Y ∉ possde(X, G), since the total effect on a non-descendant is 0. The pair (X, Y ) is sampled such that these two special cases do not occur in either the true DAG or its corresponding CPDAG.
The three alternative adjustment sets are: (i) The empty set, representing a non-causal baseline. It is generally not a valid adjustment set and is denoted by "em".
(ii) The set pa(X, G)\forb(X, Y, G), which in the setting |X| = 1 is the valid adjustment set pa(X, G). If |X| > 1, it is not generally a valid adjustment set. It is denoted by "pa".
(iii) The set Adjust(X, Y, G), a superset of O(X, Y, G).
For each causal linear model, we thus have four adjustment sets in two graphical settings. In each of these cases, we compute the empirical mean squared error of our respective estimates with respect to the true total effect. We emphasize that we do not consider the estimated standard errors or residuals from the regression analyses. To quantify the advantage of O(X, Y, G), we compute the ratio of the mean squared error corresponding to O(X, Y, G) and each of the three alternative adjustment sets. This is done separately for the two graphical settings. Figure 3 is a violin plot of these ratios. We see that O(X, Y, G) provides consistently smaller mean squared errors than any of the considered alternatives. Except for Adjust(X, Y, G), all alternative sets are clearly outperformed by O(X, Y, G), with geometric averages below 0.5. As might be expected, the gain becomes smaller when the underlying causal DAG has to be estimated, but it remains respectable. Notably, the proportion of ratios larger than 1.5 is small, even negligible when the true DAG is known. For a more thorough discussion of how the ratios behave depending on the parameters, see Section F.2 of the Supplement. The bulges at 1 have two causes. First, there are cases in which the compared sets are similar or identical. Second, Y ∉ possde(X, G) occurs for a considerable number of the estimated graphs G (see Section F.3 of the Supplement), in which case 0 is returned as the estimate for every adjustment set.
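The summary statistics used above are simple to compute. The following sketch shows one way to obtain a mean squared error ratio and the geometric average of several ratios; the function names are hypothetical and the snippet makes no claim to reproduce the plotting code behind Figure 3.

```python
import numpy as np

def empirical_mse(estimates, true_effect):
    """Mean squared error of repeated estimates around the known true effect."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_effect) ** 2))

def mse_ratio(est_o, est_alt, true_effect):
    """MSE under O(X, Y, G) divided by MSE under an alternative set.
    Values below 1 favour O(X, Y, G)."""
    return empirical_mse(est_o, true_effect) / empirical_mse(est_alt, true_effect)

def geometric_mean(ratios):
    """Geometric average: the exponential of the mean log ratio."""
    ratios = np.asarray(ratios, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))
```

The geometric average is a natural summary for ratios because it treats a ratio of 0.5 and a ratio of 2 symmetrically, which an arithmetic average does not.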
The only true contender to O(X, Y, G) in terms of performance is Adjust(X, Y, G). It should be noted, however, that Adjust(X, Y, G) is a superset of O(X, Y, G) and hence will be more cumbersome to measure in practice (see also Figure 16 of the Supplement).
Another point worth noting is the bad performance of pa(X, G) \ forb(X, Y, G). Even though this set is a valid adjustment set if |X| = 1, it only provides a small gain compared to the empty set, especially when the graphical structure has to be estimated. This aptly illustrates the importance of taking efficiency considerations into account when choosing a valid adjustment set.
In summary, these results indicate that there are benefits to using O(X, Y, G). These benefits decrease when the underlying causal structure is not known in advance, but do remain respectable.

Real data example
Our result can easily be integrated into existing approaches to covariate adjustment. Using O(X, Y, G) to estimate the total effect τ yx instead of, for example, pa(X, G) only requires the minimal additional effort of computing O(X, Y, G) from the causal graph G. And yet, replacing pa(X, G) with O(X, Y, G) can only improve the asymptotic variance when the true causal graph G is used. Of course, errors in the used graph G and finite sample considerations might lead to cases where the use of O(X, Y, G) actually leads to a loss of efficiency in practice, but our simulations (Section 4) indicate that this risk is manageable and that an overall efficiency gain, at essentially no cost, is the norm.
To investigate this further, we apply our results to the single cell data of Sachs et al. (2005). This data set consists of flow cytometry measurements of 11 phosphorylated proteins and phospholipids in human T-cells, collected under 14 different experimental conditions. Each experimental condition corresponds to a different intervention on the abundance or activity of the proteins. We chose this data set due to the availability of a consensus graph (see Figure 5(a) in Mooij and Heskes, 2013) and the large sample size.
Given that there is some uncertainty regarding the consensus graph, we apply our results using the following three different graphs: the consensus graph, the DAG estimated by Sachs et al. (2005) and the DAG estimated under the restriction to at most 17 edges by Mooij and Heskes (2013). These three DAGs are given in Figure 5 of Mooij and Heskes (2013) and we denote them by G C , G S and G M , respectively.
Our data analysis is as follows. We first log transform the data as it is heavily right skewed. For each of the three DAGs we then do the following: Restricting ourselves to the 8 experimental conditions for which Mooij and Heskes (2013) provide a graphical interpretation of the condition's effect, we adjust our starting DAG accordingly. For each such adjusted DAG G, we then compute all pairs (X, Y ) of nodes in G, such that Y ∈ de(X, G), to ensure that there is a non-trivial total effect to estimate, and O(X, Y, G) ≠ pa(X, G), to ensure that we compare different estimators. For each such pair, we compute the least squares regressions of Y on X and O = O(X, Y, G) as well as of Y on X and P = pa(X, G). We note that β̂ yx.o and β̂ yx.p are estimators for the total effect of X on Y in the considered data regime, not necessarily in the observational regime. As the true total effects are unknown, we compare the least squares coefficient variance estimates var(β̂ yx.o ) and var(β̂ yx.p ) by considering their ratio var(β̂ yx.o )/var(β̂ yx.p ). Note that, differently from Section 4, where we are able to compute the empirical mean squared error with respect to the known true total effect, this approach raises some concerns regarding post-selection inference for the two estimated graphs G S and G M , which we disregard here.

Figure 4 shows violin plots of these ratios, aggregated over all considered (X, Y ) pairs in the 8 experimental settings, with one plot for each of the graphs G C , G S and G M . The plots in Figure 4 show that using O(X, Y, G) instead of pa(X, G) results in geometric means smaller than 1 for all three graphs. In particular, only a few of the ratios are larger than 1, with only one larger than 1.2, showing that this gain is obtained at little risk of a potential downside. The gains are rather modest, however. This is likely due to the small size and sparsity of the considered graphs. In such settings, even when O(X, Y, G) ≠ pa(X, G), the two sets will often share nodes and only differ in minor ways. As a result they provide similar asymptotic variances.

Figure 4: Violin plots of the ratios var(β̂ yx.o )/var(β̂ yx.p ) (vertical axis: ratio of estimated variances) for the data of Sachs et al. (2005), obtained under the assumption that G C , G S or G M , respectively, is the true underlying causal graph. Here, G C denotes the consensus graph, G S the acyclic graph estimated by Sachs et al. (2005) and G M the acyclic graph estimated by Mooij and Heskes (2013). The red squares show the geometric average of the ratios, the black squares the median.
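The comparison of the two least squares variance estimates can be illustrated on synthetic data. The sketch below assumes a hypothetical toy DAG with edges P → X → Y and O → Y, for which pa(X, G) = {P} and {O} are both valid adjustment sets, and uses the classical homoskedastic variance estimate. It is not the analysis pipeline applied to the Sachs et al. (2005) data.

```python
import numpy as np

def ols_coef_and_var(y, x, covariates):
    """OLS of y on (intercept, x, covariates). Returns the coefficient of x
    and its classical variance estimate s^2 * [(D^T D)^{-1}]_xx."""
    n = y.shape[0]
    D = np.column_stack([np.ones(n), x, covariates])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    s2 = resid @ resid / (n - D.shape[1])
    cov = s2 * np.linalg.inv(D.T @ D)
    return coef[1], cov[1, 1]

# Toy data from the assumed DAG P -> X -> Y, O -> Y (all choices illustrative).
rng = np.random.default_rng(1)
n = 2000
P = rng.normal(size=n)
O = rng.normal(size=n)
X = P + rng.normal(size=n)
Y = X + 2.0 * O + rng.normal(size=n)

beta_o, var_o = ols_coef_and_var(Y, X, O)   # adjust for the parent of Y
beta_p, var_p = ols_coef_and_var(Y, X, P)   # adjust for the parent of X
ratio = var_o / var_p                       # below 1 favours adjusting for O
```

In this toy model, adjusting for O removes variation in Y without removing variation in X, whereas adjusting for P does the opposite, so the ratio is expected to fall well below 1.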
In fact, this behaviour can also be seen in our simulations where small graph sizes and small expected neighborhood sizes result in mean squared error ratios closer to 1 (see Figure 15 of the Supplement).
It is also interesting to consider how the gain in efficiency differs between the three graphs. It is smallest for G C and largest for G S . As discussed by Mooij and Heskes (2013), several strong faithfulness violations seem to be present in the considered data set. While Theorem 3.13 does not require faithfulness to hold, faithfulness violations can lead to cases where O(X, Y, G) and pa(X, G) differ and yet provide the same asymptotic variance, making them both asymptotically optimal. For an example, consider the DAG G in Figure 5 and the corresponding causal linear model with the edge weights indicated on the edges (and arbitrary error variances). Here, O(X, Y, G) = {P, O} and pa(X, G) = {P }, but due to the non-faithfulness of the causal linear model, O ⊥⊥ Y | {X, P }, even though O ∈ pa(Y, G). Since O ⊥ G X|P , we can thus conclude with Lemma C.2 that adjusting for {P, O} and adjusting for {P } provide the same asymptotic variance for the considered causal linear model.
It appears that this issue is most prevalent for G C , as it is the densest of the three considered graphs and faithfulness violations require multiple connecting paths between nodes. In G S , the least dense of the three considered graphs, this issue appears to be less prominent, leading to a larger gain.

Discussion
We provide a series of results on graphical criteria for efficient total effect estimation via adjustment in causal linear models. Specifically, we present a graphical criterion to qualitatively compare the asymptotic variance that many pairs of valid adjustment sets provide. Further, supposing the existence of a valid adjustment set, we provide a variance reducing pruning procedure as well as an asymptotically optimal valid adjustment set. These results formalize and strengthen existing intuition regarding efficiency. They form a versatile tool set for choosing among valid adjustment sets, a choice that can have a significant impact on the mean squared error. We do, however, only consider total effect estimation via covariate adjustment. Other estimators, such as ensemble estimators or estimators based on the front-door criterion (Hayashi and Kuroki, 2014), may be more efficient.
Our results require an in-depth understanding of the causal structure in the form of a causal DAG or an amenable maximal PDAG. However, our results are not considerably more affected by this difficulty than covariate adjustment as a whole. For example, suppose that we consider singleton X and Y , such that Y ∈ possde(X, G) (see Remark 3.14). Then knowledge of an amenable maximal PDAG G is required both for pa(X, G) to be a valid adjustment set and for O(X, Y, G) to be identifiable. In practice, O(X, Y, G) may be more sensitive to graph estimation errors than pa(X, G), since pa(X, G) only relies on estimating the local neighborhood of X accurately. Nonetheless, our simulations indicate that even when the underlying causal DAG has to be estimated, O(X, Y, Ĝ) typically provides a smaller mean squared error than pa(X, G) or ∅.
Since our results cover DAGs, CPDAGs and maximal PDAGs, it is natural to ask: can they be extended to settings with latent variables? The answer is: partially. Theorem 3.4 extends to settings with latent variables and without selection bias, by simply changing d-separation to (definite) m-separation (Richardson and Spirtes, 2002; Zhang, 2008a) in the latent variable graph (MAG or PAG) and then using Theorem 4.18 from Richardson and Spirtes (2002) (see also Lemma 20 in Zhang (2008a)) and Lemma 26 from Zhang (2008a). However, Theorem 3.13 does not extend to latent variable models, as can be seen in Example 3.11. If we suppose here that C is latent and only consider the valid adjustment sets that do not contain C, then the valid adjustment set providing the optimal asymptotic variance depends on the edge coefficients and error variances.
There is a caveat to this partial extension. Since all our results are with respect to valid adjustment sets, they do not apply if there is unmeasured confounding, i.e., no valid adjustment set is fully observed. The one partial exception to this is Theorem 3.4, which, under the assumption of Gaussian errors for the causal linear model, does hold for arbitrary adjustment sets. Interestingly, there is research indicating that in the presence of unmeasured confounding, the broad guideline of choosing the adjustment set Z in a way that minimizes information on X while maximizing information on Y should still be followed to minimize bias amplification (Pearl, 2010; Wooldridge, 2016; Ding et al., 2017). As a result, a pruning procedure along the lines of Algorithm 1 might still be warranted in this setting.
Our results do not apply to non-amenable CPDAGs and non-amenable maximal PDAGs. In this setting one can use the IDA algorithm from Maathuis et al. (2009) and Maathuis et al. (2010) for CPDAGs or the modified version by Perković et al. (2017) for maximal PDAGs. Both output a list of possible total effect estimates by adjusting for the possible parent sets of X, one for each DAG compatible with the considered causal graph. As the parents are often an inefficient valid adjustment set, one may wonder whether it is possible to apply our results to improve the IDA algorithm's efficiency. This is indeed possible, as shown by Witte et al. (2020).
Finally, another possible generalization is to consider settings with selection bias. Correa et al. (2018) give a necessary and sufficient graphical criterion for causal effect estimation under confounding and selection bias. It remains to be investigated whether the results presented in this paper generalize to this setting.

Shpitser, I., VanderWeele, T., and Robins, J. (2010). On the validity of covariate adjustment for estimating causal effects. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 527-536, Corvallis, Oregon. AUAI Press.

Supplement
This is the Supplement to "Graphical criteria for efficient total effect estimation via adjustment in causal linear models", which we will refer to as the Main paper. Results from the Main paper are referred to by their original numbering (e.g., Proposition 3.1) whereas references to results in the Supplement begin with a letter (e.g., Lemma A.8).
A Graphical preliminaries and existing results

A.1 Graphical preliminaries and examples
Graphs. We consider simple graphs with a finite node set V and an edge set E, where edges can be either directed (→) or undirected (−). If all edges in E are directed (→),

Paths. Two nodes are adjacent if there exists an edge between them. A path p from a node X to a node Y in a graph G is a sequence of distinct nodes ( Perković et al. (2017)).
Remark A.1. Our definition of a possibly directed path is non-standard, as j − i > 1 is allowed. This is required for its use in maximal PDAGs, as we show in Example A.3.
Ancestry. If X → Z, then X is a parent of Z and Z is a child of X. If there is a directed path from X to Y , then X is an ancestor of Y and Y a descendant of X. If there is a possibly directed path from X to Y , then X is a possible ancestor of Y and Y a possible descendant of X. We use the convention that every node is an ancestor, possible ancestor, descendant and possible descendant of itself. The sets of parents, ancestors, descendants, possible ancestors and possible descendants of X in G are denoted by pa(X, G), an(X, G), de(X, G), possan(X, G) and possde(X, G), respectively. For sets X, we let pa(X,  Meek, 1995).
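For a DAG, the ancestry relations above reduce to reachability along directed edges. The following sketch uses a hypothetical representation of the graph as a collection of (tail, head) pairs and covers directed edges only, so it does not handle possible ancestors or possible descendants in CPDAGs or maximal PDAGs.

```python
def parents(edges, node):
    """pa(node, G): all nodes with a directed edge into node."""
    return {a for a, b in edges if b == node}

def descendants(edges, node):
    """de(node, G), including node itself by the convention stated above."""
    seen, stack = {node}, [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def ancestors(edges, node):
    """an(node, G): the descendants of node in the graph with reversed edges."""
    return descendants({(b, a) for a, b in edges}, node)
```

For example, with the edges A → B, B → C and D → C, the parents of C are {B, D}, the descendants of A are {A, B, C} and the ancestors of C are all four nodes.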

Colliders, definite status paths and v-structures.
every non-endpoint node on p is either a collider or a non-collider.
Directed cycles, DAGs and PDAGs. A directed path from X to Y , together with the edge Y → X forms a directed cycle. A directed graph without directed cycles is called a directed acyclic graph (DAG) and a partially directed graph without directed cycles is called a partially directed acyclic graph (PDAG).
Blocking and d-separation in PDAGs. (Cf. Definition 1.2.3 in Pearl (2009) and Definition 3.5 in Maathuis and Colombo (2015)). Let Z be a set of nodes in a PDAG.
status path between any node in X and any node in Y in G. We then write X ⊥ G Y|Z.
If Z does not block every definite status path between any node in X and any node in Y in G, we write X ⊥̸ G Y|Z.
Remark A.2. We use the convention that for any two disjoint node sets X and Y it holds that ∅ ⊥ G X|Y.
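In the special case of a DAG, d-separation can be checked mechanically. The sketch below does not enumerate definite status paths. Instead it uses the classical moralization criterion, which is equivalent to d-separation in DAGs: restrict the graph to the ancestors of X ∪ Y ∪ Z, connect parents that share a child, drop edge directions, remove Z, and test whether X and Y are still connected. The function name and the edge-list representation are assumptions made for the example, and the code does not cover PDAGs.

```python
def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG given by `edges`
    (a collection of (tail, head) pairs). The three sets are assumed disjoint."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Keep only ancestors of xs, ys and zs (including those nodes themselves).
    keep = xs | ys | zs
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            if b in keep and a not in keep:
                keep.add(a)
                changed = True
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    # 2. Moralize: undirected skeleton plus edges between parents of a common child.
    nbrs = {v: set() for v in keep}
    for a, b in sub:
        nbrs[a].add(b)
        nbrs[b].add(a)
    for v in keep:
        pa = [a for a, b in sub if b == v]
        for i in pa:
            for j in pa:
                if i != j:
                    nbrs[i].add(j)
    # 3. Remove zs and search for a connection from xs to ys.
    stack = list(xs)
    seen = set(stack)
    while stack:
        cur = stack.pop()
        if cur in ys:
            return False
        for w in nbrs[cur]:
            if w not in zs and w not in seen:
                seen.add(w)
                stack.append(w)
    return True
```

With an empty X the search never starts and the function returns True, which matches the convention of Remark A.2.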
Markov property and faithfulness. (Cf. Definition 1.2.2 in Pearl (2009)). Let X, Y and Z be disjoint sets of random variables. We use the notation X ⊥⊥ Y|Z to denote that X is conditionally independent of Y given Z. A density f is called Markov with respect to a DAG G if X ⊥ G Y|Z implies X ⊥⊥ Y|Z in f . If this implication also holds in the other direction, then f is faithful with respect to G.
Markov equivalence and CPDAGs. (Cf. Meek, 1995; Andersson et al., 1997)

Maximal PDAGs. A PDAG G is maximally oriented (maximal PDAG) if and only if the graphs in Figure 6 are not induced subgraphs of G.
In general, a maximal PDAG G describes a subset of a Markov equivalence class of DAGs, denoted by [G]. A maximal PDAG G has the same adjacencies and v-structures as any DAG in [G]. Moreover, a directed edge X → Y in G corresponds to a directed edge X → Y in every DAG in [G], and for any undirected edge X − Y in G, [G] contains a DAG with X → Y and a DAG with X ← Y .
Example A.3. Consider the DAG D in Figure 8(d). The CPDAG C of D is given in Figure 8(a) and two maximal PDAGs G and G′ of D are given in Figures 8(b) and 8(c). The CPDAG C represents eight DAGs, the maximal PDAG G represents five DAGs and the maximal PDAG G′ represents two DAGs (see Figure 10).
illustrating that G and G′ represent refinements of the Markov equivalence class [C].
We now consider possibly directed paths in C and G. According to our definition the path V 3 − V 4 − V 1 is possibly directed from V 3 to V 1 in C, but not in G, since G contains the edge V 1 → V 3 . As a result, V 1 ∈ possde(V 3 , C) but V 1 ∉ possde(V 3 , G). The rationale behind these definitions is that there is a DAG in [C] containing V 3 → V 4 → V 1 , but there is no such DAG in [G] (see Figure 10).
Causal, non-causal and possibly causal paths and nodes. A directed path from X to Y in a causal graph is also called a causal path from X to Y . Analogously, a possibly directed path from X to Y is called a possibly causal path. A non-causal path from X to Y is a path that is not possibly directed from X to Y . Let X and Y be disjoint node sets in a causal maximal PDAG G. We define causal nodes relative to (X, Y) in G, denoted cn(X, Y, G), as all nodes on proper causal paths from X to Y, excluding nodes in X. For singleton X the causal nodes are also called the mediating nodes. Analogously, we define possible causal nodes relative to (X, Y) in G, denoted posscn(X, Y, G), as all nodes on proper possibly causal paths from X to Y, excluding nodes in X.
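For a DAG, the causal nodes can be computed with two reachability searches. The sketch below rests on the reading that a proper causal path is a directed path whose only node in X is its first node. The function name and the edge-list representation are hypothetical, and the code handles directed edges only, so it does not compute posscn(X, Y, G).

```python
def causal_nodes(edges, xs, ys):
    """cn(X, Y, G) in a DAG given as (tail, head) pairs: all nodes outside X
    that lie on a directed path from X to Y whose only node in X is the first."""
    xs, ys = set(xs), set(ys)
    children, parents = {}, {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)

    def reach(starts, step):
        # Nodes reachable from `starts` along `step`, never entering a node in X.
        seen, stack = set(), list(starts)
        while stack:
            cur = stack.pop()
            for nxt in step.get(cur, ()):
                if nxt not in xs and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    forward = reach(xs, children)              # below X, avoiding X on the way
    backward = reach(ys, parents) | (ys - xs)  # above Y, avoiding X, plus Y itself
    return forward & backward
```

In a DAG, joining a directed path from X to a node N with a directed path from N to Y always yields a directed path, which is why intersecting the two searches suffices.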
Forbidden nodes. (Cf. Perković et al., 2018) Let X and Y be disjoint node sets in a maximal PDAG G. We define forbidden nodes relative to (X, Y) in G as for any density f compatible with G if and only if Z satisfies the generalized adjustment criterion relative to (X, Y) in G (Def. A.4).
Theorem A.5 establishes that Definition A.4 characterizes all covariate sets that can be used for causal effect estimation via adjustment. It is as a consequence of this Theorem that we refer to sets that satisfy the generalized adjustment criterion as valid adjustment sets.
Example A.6. Consider the DAG G in Figure 7. Since G is a DAG, it is trivially amenable and Further, any valid adjustment set needs to contain at least one node from {A 1 , A 2 } and one node from {B 1 , B 2 } to satisfy the blocking criterion. The remaining nodes V, D, R are neither required nor forbidden. This shows that any valid adjustment set has to be of the following form: Z = A ∪ B ∪ C, where A ⊆ {A 1 , A 2 } and B ⊆ {B 1 , B 2 } are non-empty and C ⊆ {V, D, R} is possibly empty.
As an example of a joint intervention, let X = {X, A 2 } and Y = {Y, F }. The amenability follows trivially from the fact that G is a DAG and Further, any valid adjustment set here must block the two proper non-causal paths (X, B 1 , B 2 , Y ) and (X, B 1 , B 2 , Y, F ) from X to Y. Thus, any valid adjustment set has to be of the following form: Z = B ∪ C, where B ⊆ {B 1 , B 2 } is non-empty and C ⊆ {A 1 , V, D, R} is possibly empty.
Unshielded paths, corresponding paths and path concatenation. A path (V i , V j , V k ) in a partially directed graph G is an unshielded triple if V i and V k are not adjacent in G. A path is unshielded if all successive triples on the path are unshielded. If G and G * are two graphs with identical adjacencies and p is a path in G, then the corresponding path p * in G * consists of the same node sequence as p. We denote the concatenation of paths by ⊕. For example given a path p = (X 1 , . . . , X k ) it holds that p = p(X 1 , X m ) ⊕ p(X m , X k ) for 1 ≤ m ≤ k. In general, concatenating paths does not result in a path, but we only use the symbol ⊕ if the result is in fact a path.
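The unshielded-path condition is a purely local check on successive triples. A minimal sketch, assuming the adjacencies are stored as a set of unordered pairs (a representation chosen only for this example):

```python
def is_unshielded(path, adjacent):
    """True if every successive triple (V_i, V_j, V_k) on `path` is unshielded,
    i.e. V_i and V_k are not adjacent. `adjacent` is a set of frozenset pairs."""
    return all(
        frozenset((path[i], path[i + 2])) not in adjacent
        for i in range(len(path) - 2)
    )
```

Paths with fewer than three nodes contain no triples and are therefore unshielded under this check.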
Partial total effects in a causal linear model. Consider a causal DAG G = (V, E), such that V follows a causal linear model compatible with G. Let Z be a node set and X, Y ∉ Z be two nodes in G. Then the partial total effect τ yx.z of X on Y given Z is defined as the sum of the total effect along all causal paths from X to Y that do not contain nodes from Z.
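The path-sum definition can be made concrete with a short recursion. The sketch assumes that the total effect along a single causal path is the product of the edge coefficients on that path, and it stores the model as a hypothetical dictionary mapping each edge (tail, head) to its coefficient.

```python
def partial_total_effect(coef, x, y, z=()):
    """Sum, over all directed paths from x to y that avoid the nodes in z,
    of the products of the edge coefficients along the path.
    `coef` maps (tail, head) to the edge coefficient; the graph must be a DAG."""
    blocked = set(z)

    def effect_from(node):
        if node == y:
            return 1.0                      # empty product at the end of a path
        total = 0.0
        for (a, b), weight in coef.items():
            if a == node and b not in blocked:
                total += weight * effect_from(b)
        return total

    return effect_from(x)
```

With edges X → M (coefficient 2), M → Y (coefficient 3) and X → Y (coefficient 1), the total effect of X on Y is 2 · 3 + 1 = 7, while the partial total effect given {M} keeps only the direct edge and equals 1.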
We now give a small simulation study to illustrate that Proposition 3.1 also covers regressions with heteroskedastic residuals and how these arise in causal linear models with non-Gaussian errors. For an example of non-linearity we refer to Example 1 from the supplement of Nandy et al. (2017).
Example A.7. Consider the DAG G given in Figure 7, with the nodes V, D, R, F dropped for simplicity. We sampled data from the causal linear model compatible with G, such that all errors are uniformly distributed on [−1, 1] and all edge coefficients except α yb 2 are 1, with α yb 2 = 20. We used this data to estimate the total effect of X on Y , which equals 1, by adjusting for the valid adjustment set {A 2 , B 1 }. More precisely, we computed the ordinary least squares (OLS) regression of Y on X, A 2 , B 1 for one generated data set of 1000 points. Figure A.7 shows the resulting Residuals vs Fitted plot as well as the Residuals vs X plot. As we can see, the residuals are clearly heteroskedastic, with their distribution depending on both the fitted values and X.
To verify that Proposition 3.1 nonetheless holds, we then repeated this procedure 1000 times, collecting the estimated coefficients. Based on these we computed empirical means and variances for all three coefficients in the regression of Y on X, A 2 , B 1 . We also computed theoretical asymptotic variances in accordance with the formula given in Proposition 3.1 from the known true underlying causal linear model. The results are given in Table 3. We see that β̂ yx.a 2 b 1 does appear to estimate the total effect 1 and that it does so with the claimed asymptotic variance from Proposition 3.1. Interestingly, the same is true for β̂ ya 2 .xb 1 , which is due to the fact that B 1 is a valid adjustment set for the joint effect of {X, A 2 } on Y (see Example A.6) and therefore Proposition 3.1 also holds with respect to β̂ ya 2 .xb 1 . In contrast, the theoretical and empirical variances do not match for β̂ yb 1 .xa 2 . This is due to the fact that given {X, A 2 } the non-causal path B 1 ← B 2 → Y remains open and therefore β̂ yb 1 .xa 2 is not covered by Proposition 3.1. This illustrates that the asymptotic variance result from Proposition 3.1 does not necessarily cover all coefficients in the considered OLS regression.

A.2 Existing results
For completeness we first give the well known asymptotic behavior of the least squares estimator in the Gaussian setting. The next three lemmas give theoretical properties of the possibly misspecified least squares regression. They form the foundation for the extension of our results to causal linear models with non-Gaussian errors in Proposition 3.1.
Lemma A.10. Let (X T , Y T , Z T ) T be a mean 0 random vector with finite variance such that X ⊥ ⊥ Y|Z. Then β yx.z = 0.
Proof. This result is well known in the Gaussian setting (cf. Section 2.5 of Anderson, 1958). It generalizes to random vectors with finite variance by the result from Lemma A.9, as β yx.z is fully determined by the covariance matrix alone.
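Because β yx.z depends on the joint distribution only through the covariance matrix, Lemma A.10 can be checked numerically on a small example. The sketch below computes the population least squares coefficient directly from a covariance matrix. The function name, the index-based interface and the chain X → Z → Y with unit coefficients and unit error variances are assumptions chosen for illustration.

```python
import numpy as np

def population_coef(cov, y, x, z=()):
    """beta_{yx.z}: coefficient of x in the population least squares regression
    of y on (x, z), computed from the covariance matrix `cov` alone.
    y and x are row/column indices, z is a sequence of indices."""
    idx = [x] + list(z)
    S = cov[np.ix_(idx, idx)]      # covariance of the regressors
    s_y = cov[idx, y]              # covariance of the regressors with y
    return np.linalg.solve(S, s_y)[0]

# Chain X -> Z -> Y with unit coefficients and unit error variances,
# variables ordered as (X, Z, Y). Here X is independent of Y given Z.
cov_chain = np.array([[1.0, 1.0, 1.0],
                      [1.0, 2.0, 2.0],
                      [1.0, 2.0, 3.0]])
beta_given_z = population_coef(cov_chain, y=2, x=0, z=[1])
beta_marginal = population_coef(cov_chain, y=2, x=0)
```

In this chain the coefficient of X vanishes once Z is included in the regression, as the lemma asserts, while the marginal coefficient equals the total effect of X on Y.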
Lemma A.11. (Cf. Corollary 11.1 of Buja et al., 2014) Let (X, Y, Z T ) T be a mean 0 random vector with finite variance and let Z′ = (X, Z T ) T . Then where δ yz′ = Y − β yz′ Z′ and δ xz = X − β xz Z denote the population error terms of the least squares regressions specified by their respective subscripts.
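For orientation, results of this type for a single coefficient of a possibly misspecified least squares regression usually take a sandwich form. The display below is a sketch of that standard form in the notation of the lemma, written under the assumption that the required higher moments exist. It is our reading of the standard result and not a quotation of the lemma, so it should be checked against Buja et al. (2014).

```latex
% Assumed notation: Z' = (X, Z^T)^T, \delta_{yz'} = Y - \beta_{yz'} Z',
% \delta_{xz} = X - \beta_{xz} Z.
\[
  \sqrt{n}\,\bigl(\hat{\beta}_{yx.z} - \beta_{yx.z}\bigr)
  \;\xrightarrow{d}\;
  \mathcal{N}\!\left(0,\;
    \frac{E\bigl[\delta_{yz'}^{2}\,\delta_{xz}^{2}\bigr]}
         {E\bigl[\delta_{xz}^{2}\bigr]^{2}}\right)
  \quad \text{as } n \to \infty .
\]
% If the two residuals are independent, the numerator factorizes and the
% asymptotic variance reduces to a ratio of residual variances:
\[
  \delta_{yz'} \perp\!\!\!\perp \delta_{xz}
  \;\Longrightarrow\;
  \frac{E\bigl[\delta_{yz'}^{2}\,\delta_{xz}^{2}\bigr]}
       {E\bigl[\delta_{xz}^{2}\bigr]^{2}}
  = \frac{E\bigl[\delta_{yz'}^{2}\bigr]}{E\bigl[\delta_{xz}^{2}\bigr]} .
\]
```

The second display shows why independence of the two residuals is the key property: it removes any dependence of the limit on higher moments of the error distributions.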
The following corollary gives a graphical criterion that is necessary and given a small restriction (see Remark 2.3) also sufficient for the existence of a valid adjustment set in a DAG. Recall that δ yz = Y − β yz Z denotes the population error term of the least squares regression specified by its subscripts. Moreover, we use the notation Z −x to denote Z \ {X}, for any set Z. . . , Y ky } and Z be pairwise disjoint node sets in G, such that Z is a valid adjustment set relative to (X, Y) in G.
In the proof of Proposition 3.1 from the Supplement of Nandy et al. (2017), it is shown that τ yx = β yx.z for singleton X and Y , and Z = pa(X, G), independently of the error distributions in the causal linear model. Their argument relies on the fact that both terms of interest depend on the distribution of V only through Σ vv . The general case hence follows from the Gaussian one.
Using Lemma A.9, this argument directly extends to a joint intervention of X on Y when Z is a valid adjustment set relative to (X, Y) in G, implying that τ yx = β yx.z . Finally, by Lemma A.11, β̂ yx.z is a consistent estimator of β yx.z for any conditioning set Z.
We now show that our asymptotic variance statement holds. Fix X i ∈ X and Y j ∈ Y and let Z′ = X ∪ Z. We first show that the result holds if $\delta_{y_j z'} \perp\!\!\!\perp \delta_{x_i z'_{-x_i}}$. In this case, the statement of Lemma A.11 simplifies to as n → ∞.
Since Z is a valid adjustment set relative to (X, Y) in G, we have already shown that $(\tau_{yx})_{j,i} = \beta_{y_j x_i . z'_{-x_i}}$. Our claim then follows, as It is left to show that $\delta_{y_j z'} \perp\!\!\!\perp \delta_{x_i z'_{-x_i}}$. By Lemma B.1, all non-causal paths from X i to Y j are blocked by X −i ∪ Z and de(D, G) ∩ Z′ −x i = ∅. Hence, we can apply Lemma B.3 with X = X i , Y = Y j and Z′ = X ∪ Z to conclude that in fact $\delta_{y_j z'} \perp\!\!\!\perp \delta_{x_i z'_{-x_i}}$.

Lemma B.1 relies on the technical Lemma B.2 that will be used throughout this Supplement. For any given pair X ∈ X and Y ∈ Y, it gives some properties of the set Z′ −x = X −x ∪ Z. Although Z′ −x is not necessarily a valid adjustment set relative to (X, Y ), it behaves similarly to one. Lemma B.3 then relies on these properties and the technical Lemma B.4 to show that our two residuals of interest are in fact independent. A full summary of how the following lemmas relate to each other is given in Figure 11.

Lemma B.1. Let X, Y and Z be pairwise disjoint node sets in a causal DAG G, such that Z is a valid adjustment set with respect to (X, Y) and let Z′ = X ∪ Z. Consider any pair of nodes X ∈ X, Y ∈ Y and let D = cn(X, Y, G) ∩ cn(Z′, Y, G) be the set of all nodes N ∈ cn(X, Y, G), such that there exists a directed path from N to Y which contains no nodes from Z′. Then the following two statements hold:
1. All non-causal paths from X to Y are blocked by Z′ −x .
2. de(D, G) ∩ Z′ −x = ∅.
Proof. We first prove Statement 1. Fix X ∈ X and Y ∈ Y. We will prove our claim by contradiction, so suppose that there exists a non-causal path p from X to Y that is open given Z −x and assume that Z is a valid adjustment set relative to (X, Y) in G. We will show that this implies the existence of a proper non-causal path from X to Y that is open given Z, contradicting our assumption that Z is a valid adjustment set.
Suppose first that no non-endpoint node of p is in X, so that p is a proper non-causal path from X to Y. By the assumption that Z is a valid adjustment set, p must then be blocked by Z. Since p is assumed to be open given Z′ −x it is clearly also open given Z′ = X ∪ Z and hence, Lemma B.2 with A = X implies that there exists a proper non-causal path from X to Y that is open given Z.
Next, suppose that p contains some non-endpoint node in X. Let X′ be the node in X on p that is closest to Y . Then p′ = p(X′, Y ) is a proper subpath of p. Since p is open given Z′ −x , X′ must be a collider on p. But then p′ is both proper and non-causal, and we can repeat the argument from the previous paragraph.
We now prove Statement 2. We will first show that D ⊆ cn(X, Y, G). Consider a node D ∈ D. Then D must lie on at least one causal path p from X to Y , where we can choose p such that p(D, Y ) contains no node in Z′. Let X′ ∈ X be the node closest to D on p(X, D). Then p(X′, Y ) is a proper causal path from X to Y containing D and hence D ∈ cn(X, Y, G).
We will now prove the statement by contradiction, so assume that there exists a node F ∈ de(D, G) ∩ Z′ −x and that Z is a valid adjustment set. Assume first that F ∈ X −x . But this implies that de(cn(X, Y, G)) ∩ X ≠ ∅, which contradicts our assumption that Z is a valid adjustment set by Corollary A.12. Now assume that F ∈ Z. In this case F ∈ forb(X, Y, G), again contradicting our assumption that Z is a valid adjustment set.
Lemma B.2. Let X, Y and Z be pairwise disjoint node sets in a causal DAG G. Let A ∉ Y be a node and consider a path p from A to Y in G. If p is blocked by Z and open given X ∪ Z, then there exists a proper non-causal path from X to Y that is open given Z.
Proof. Let Y ∈ Y be the endpoint of p. By assumption p is blocked by Z, while being open given X ∪ Z. This requires the following three statements to hold:
1. For any non-collider N on p, N ∉ X ∪ Z.
2. For any collider C on p, de(C, G) ∩ (X ∪ Z) ≠ ∅.
3. There exists at least one collider C′ on p such that de(C′, G) ∩ Z = ∅.
Let C′ be the node closest to Y fulfilling Statement (3). By choice of C′ and the assumption that p is open given X ∪ Z, p′ = p(C′, Y ) is open given Z and X ∪ Z. If p′ contains a node in X, any such node must be a collider. Assume that there is such a node and let X′ ∈ X be the one closest to Y on p′. Since X′ is a collider, p′(X′, Y ) is non-causal and by choice of X′, p′(X′, Y ) is also a proper path from X to Y. Since p′ is open given Z so is p′(X′, Y ) and hence we are done. Therefore, we will from now on suppose that no node on p′ is in X.
As C′ fulfills the requirements of both Statement (2) and (3), there exists a path p″ of the form C′ → · · · → X′ ∈ X, that is open given Z and where we choose X′ so that p″ contains no other node in X. Let I be the node closest to X′ on p″ that is also on p′ and consider q = p″(X′, I) ⊕ p′(I, Y ). The path q is a path from X to Y and we now show that it is proper and non-causal. Since I lies on p′ and p′ contains no node in X by assumption, I ≠ X′. Thus, q is a proper and, by nature of p″ being directed towards X′, non-causal path from X to Y.
It now only remains to show that q is open given Z. As the two constituent paths are both open given Z it suffices to consider I. By choice of C′, no node in p′ may be in Z and hence neither is I. Furthermore, by nature of p″ being directed towards X′, I cannot be a collider on q and it thus follows that q is open given Z.

Lemma B.3. Let X and Y be two nodes in a causal DAG G = (V, E), let V follow a causal linear model compatible with G and let Z′ be a node set, such that X ∈ Z′ and Y ∉ Z′. If Z′ −x is a set that 1. blocks all non-causal paths from X to Y in G and

Proof. In order to simplify notation we refer to Z′ −x as Z throughout this proof. Let ε = {ε v 1 , . . . , ε v p } be the set of errors from the underlying causal linear model and consider δ xz and δ yz′ as functions of ε. Then there are minimally sized subsets ε xz and ε yz′ of ε, such that δ xz and δ yz′ are functions of ε xz and ε yz′ respectively. It suffices to show that ε xz ∩ ε yz′ = ∅, as then δ yz′ ⊥⊥ δ xz follows from the joint independence of the errors in ε. We will prove our claim by contraposition, so assume that there exists a node N such that ε n ∈ ε xz ∩ ε yz′ . We now show that the existence of such a node N implies either the existence of a non-causal path from X to Y that is open given Z or that de(D, G) ∩ Z ≠ ∅. We will do so by going through a series of cases: (i) N = X ∈ Z′, (ii) N ∈ Z and (iii) N ∉ Z′.

Case (i): By Lemma B.4 with A = Y and W = Z′ there exists a non-causal path p from X to Y that is open given Z′. Suppose that p is blocked by Z. Then by Lemma B.2 with A = X, X = X, Y = Y and Z = Z there exists a non-causal path from X to Y that is open given Z. Otherwise, p is itself such a path.
Case (ii): By applying Lemma B.4 twice, once with A = X and W = Z and once with A = Y and W = Z′, we deduce that there exists i) a path p of the form X · · · → N that is open given Z, and ii) a path p′ of the form N ← · · · Y that is open given Z′. If p′ is blocked by Z, we can apply Lemma B.2 with A = N, X = X, Y = Y and Z = Z to conclude that there exists a non-causal path from X to Y that is open given Z. For the remainder of Case (ii) we will suppose that p′ is open given Z.
Let I be the node closest to X on p that is also on p′. We now show that either q = p(X, I) ⊕ p′(I, Y) is a non-causal path from X to Y that is open given Z or de(D, G) ∩ Z ≠ ∅. Since both p(X, I) and p′(I, Y) are open given Z, it suffices to consider I to decide whether q is open given Z, so we will now sequentially consider the cases that I is N, X, Y, a non-collider on q and a collider on q.
First suppose that I = N. Then q is of the form X · · · → N ← · · · Y and I = N ∈ Z is a collider on q. Hence, q is open given Z as well as non-causal.
Now, consider the case that I ∈ {X, Y}. Then q is a subpath of p′ or p respectively and hence trivially open given Z. If q is non-causal, the first possible claim of our contrapositive statement holds true. Hence, suppose that q is a causal path from X to Y. If I = X and q is causal, p′ is of the form N . . . X → · · · → Y and hence X is a non-collider on p′. But this contradicts that p′ is open given Z′. If I = Y, then q is a subpath of p and obviously p(X, Y) = q is causal. Since p is open given Z and Y ∉ Z it follows that either Y is a collider on p, with de(Y, G) ∩ Z ≠ ∅, or that Y is a non-collider on p and hence p(Y, N) is of the form Y → . . . N. In the latter case, since N ∈ de(Z, G) and p(Y, N) is open given Z, p(Y, N) either contains a collider C, such that C ∈ de(Y, G) and de(C, G) ∩ Z ≠ ∅, or N ∈ de(Y, G). Thus, in either case de(Y, G) ∩ Z ≠ ∅. Further, since q is a causal path from X to Y that is open given Z, q cannot contain nodes from Z and therefore Y ∈ D. Thus, de(D, G) ∩ Z ≠ ∅.
We now suppose that I ∉ {N, X, Y} is a non-collider on q. Then I cannot be a collider on both p and p′. Since p and p′ are open given Z it thus follows that I ∉ Z. Therefore, q is also open given Z. If q is non-causal, the first possible claim of our contrapositive statement holds true. Hence, suppose that q is a causal path from X to Y. As q is causal, p(X, I) is directed towards I and we have already shown that I ∉ Z. By the same argument as in the case I = Y, it then follows that I ∈ D and de(I, G) ∩ Z ≠ ∅.
Consider now the case that I ∉ {N, X, Y} is a collider on q. Clearly, q is non-causal. Further, if I is also a collider on either p or p′, it follows that de(I, G) ∩ Z ≠ ∅ and hence, q is open given Z in this case. Suppose that I is a collider on q, while being a non-collider on both p and p′. Then p must be of the form X · · · → I → . . . N and since p(I, N) is open given Z and N ∈ de(Z, G), it follows that de(I, G) ∩ Z ≠ ∅. Hence, q is non-causal and open given Z.
Case (iii): Let us first suppose that Z ∩ de(N, G) = ∅. By Lemma B.4 with A = Y and W = Z′, it follows that ε_x ∈ ε_{yz′} or ε_y ∈ ε_{yz′}. If ε_x ∈ ε_{yz′}, then we are done by Case (i), so suppose that only the latter statement is true. By Lemma B.4 there then exists a directed path p′ from N to Y that is open given {X, Y} ∪ Z and, as a directed path, is therefore also open given Z. By Lemma B.4 with A = X and W = Z, it follows that there exists a path p from X to N that is directed towards X and open given {X} ∪ Z and hence is also open given Z.
Let I be the node closest to X on p that is also on p′. We now show that q = p(X, I) ⊕ p′(I, Y) is a non-causal path from X to Y that is open given Z. Since both p(X, I) and p′(I, Y) are open given Z, it suffices to consider I to decide whether q is open given Z, so we will now sequentially consider the possible properties of I.
If I = X, then q is a subpath of p′ and X lies on p′. But as p′ is directed this contradicts that it is open given {X} ∪ Z. If I = Y, then q is a subpath of p and hence a non-causal path from X to Y that is open given Z. If I ∉ {Y, X}, then I is a non-collider on q and since no node in p may be in Z, I ∉ Z and it thus follows that q is open given Z. As p(X, I) is directed towards X, q is non-causal. Thus, q is a non-causal path from X to Y that is open given Z.
For the remainder of Case (iii), we suppose that Z ∩ de(N, G) ≠ ∅. By applying Lemma B.4 twice, once with A = X and W = Z and once with A = Y and W = Z′, we deduce that there exists i) a path p from X to N that is open given Z and ii) a path p′ from N to Y that is open given Z′. If p′ is blocked by Z, we can conclude with Lemma B.2, as in Case (ii), that there exists a non-causal path from X to Y that is open given Z and we are done. For the remainder of Case (iii), we suppose that p′ is open given Z.
Let I be the node closest to X on p that is also on p′. As in Case (ii), we will now show that either q = p(X, I) ⊕ p′(I, Y) is a non-causal path from X to Y that is open given Z or de(D, G) ∩ Z ≠ ∅. We now sequentially consider the possible properties of I.
First suppose that I = N. Since Z ∩ de(N, G) ≠ ∅ and N ∉ Z we can immediately conclude that q is open given Z independently of whether N is a collider or a non-collider on q. Further, if q is causal it follows that N ∈ D and hence de(D, G) ∩ Z ≠ ∅.
Suppose now I ≠ N. Recall that Z ∩ de(N, G) ≠ ∅ and that we have already shown that there exists a path p from X to N and another p′ from N to Y, such that the former is open given Z and the latter open given both Z and Z′. But these are exactly the assumptions required to show our claim in the corresponding case I ≠ N in Case (ii). Hence, our claim follows with the same argument.
Proof. We first give some preparatory thoughts on how to write δ_{aw} as a function in the errors from the causal linear model. Consider a random vector V that follows a causal linear model compatible with a causal DAG G = (V, E) and fix some node V_i ∈ V. We can then write V_i as a linear function of the generating errors in the following way: where we use the convention that τ_{v_i v_i} = 1 for any V_i ∈ V and make use of the fact that Bollen, 1989). Consider the equation Applying equation (5) to the A and the W_i terms in equation (6), we can write δ_{aw} as a linear function in the error terms of the generating causal linear model of the form with coefficients γ_{v_j} ∈ R.
We now prove Statement (1) by contraposition. So assume that M ∈ W and that there exists no path p from A to M whose last edge points into M and which is open given W. We will now show that this implies that ε_m ∉ ε_{aw}. It is sufficient to show that the coefficient γ_m corresponding to ε_m in equation (7) is equal to 0. The value of γ_m is Our claim thus follows, if we show that where we have simplified the sum by removing those W_i ∈ W with τ_{w_i m} = 0. Let W′ = de(M, G) ∩ W, W′′ = W \ de(M, G) and W′′′ = pa(M, G) ∪ W′′. By construction, W′′′ contains all parents of M while containing no descendants of M. It thus follows with Lemma B.5 that W′′′ is a valid adjustment set relative to M and any node that is not in W′′′. We note that W′ ∩ W′′′ = ∅ by construction. Further, A ∉ pa(M, G) by assumption and hence A ∉ W′′′.
Using the already proven first half of Proposition 3.1 to replace the total effects with appropriate regression coefficients and vice versa, it follows that Here, we use firstly that W is a valid adjustment set with respect to M and any node not in W to conclude that β w i m.w = τ w i m and β am.w = τ am . Secondly, we use Lemma C.3 in the second step, with T = A, W = M, Z = W and S = W −m . Lastly, P = pa(M, G) \ W ⊥ G A|W by Lemma B.6 and we use this result to simplify the conditioning sets in step three by invoking the first statement from Lemma C.5, with T = P, S = ∅, X = W and Y = {A}, allowing us to drop all nodes in P.
We now prove Statement (2). For M = A the statement is trivial. Hence consider a node M ∉ W′ = W ∪ {X} and its corresponding coefficient γ_m. For ease of notation let W_{k+1} = A. We will now show that for M ∉ W′, it holds that γ_m = Σ_{W_j ∈ W′} τ_{w_j m.w′_{−j}} γ_{w_j}. Using equation (8), this claim is equivalent to By Lemma B.7, τ_{w_i m} = Σ_{W_j ∈ W′} τ_{w_j m.w′_{−j}} τ_{w_i w_j} for any W_i ∈ W′ and thus, our claim follows. The coefficient γ_m can therefore only be non-zero if at least one of the terms τ_{w_j m.w′_{−j}} γ_{w_j} is also non-zero. Let M′ ∈ W′ be a node, such that τ_{m′ m.w′_{−m′}} ≠ 0 and γ_{m′} ≠ 0. The first term being non-zero implies the existence of a directed path p′ from M to M′ that contains no additional nodes from W′ and is hence open given W′ and W. The second term being non-zero implies that ε_{m′} ∈ ε_{aw}, which by Statement (1) requires that there exists a path p of the form A · · · → M′ that is open given W, with possibly A = M′. Hence the first part of Statement (2) holds.
We now prove the second part of Statement (2). If M′ = A, p′ is a path of the claimed form, so suppose that M′ ∈ W. Let I be the node closest to A at which p and p′ intersect, and consider the path q = p(A, I) ⊕ p′(I, M). We will now show that q is open given W. If I = M or I = A, q is a subpath of p or p′ respectively and as both p and p′ are open given W we are done.
Hence, suppose that I ∉ {A, M}. As p and p′ are open given W, it suffices to consider I. Suppose first that I ∈ W. Since p′ is directed and open given W it thus follows that I = M′. Then M′ ∈ W is a collider on q and it follows that q is open given W. Suppose now that I ∉ W. Since p′ is directed towards M′ and M′ ∈ W it follows that de(I, G) ∩ W ≠ ∅. Hence, q is open given W independently of whether I is a collider or a non-collider.
Lemma B.5. Let X and Y be nodes in a causal DAG G. Let Z be a node set in G, such that Y ∉ Z, pa(X, G) ⊆ Z and de(X, G) ∩ Z = ∅. Then Z is a valid adjustment set relative to (X, Y).
Proof. As a DAG, G is trivially amenable relative to (X, Y ). Further, forb(X, Y, G) ⊆ de(X, G) and therefore Z fulfills the forbidden set condition (2) from Definition A.4.
It only remains to show that Z blocks all non-causal paths from X to Y, so let p be such a path. Assume that p is of the form X → . . . Y. Then p must contain a collider C, such that C ∈ de(X, G). Since, by assumption, de(X, G) ∩ Z = ∅, it follows that p is blocked by Z. Now, assume that p is of the form X ← . . . Y. Then p contains a non-collider N ∈ pa(X, G) and is thus blocked by Z.
Lemma B.6. Let A and M be two nodes and W a node set in a DAG G, such that A ∉ W and M ∈ W. Let P = pa(M, G) \ W. If no path from A to M that ends with an edge into M and is open given W exists, then P ⊥_G A | W.
Proof. We prove the claim by contraposition, so assume that a path p from A to some node P ∈ P that is open given W exists. By choice of P there exists a path p′ of the form P → M. Let I be the node closest to A on p that is also on p′ and consider q = p(A, I) ⊕ p′(I, M). If I = M, then q is a subpath of p and hence open given W. Further, M ∈ W must be a collider on p. But that implies that q is a path from A to M, ending with an edge into M, that is open given W. If I = P, then I ∉ W is a non-collider on q and our claim again follows.
Lemma B.7. Consider a causal DAG G = (V, E) and let V follow a causal linear model compatible with G. Let N be a node and A = {A_1, . . . , A_k} be a node set in G, such that N ∉ A. Then τ_{a_i n} = Σ_{A_j ∈ A} τ_{a_j n.a_{−j}} τ_{a_i a_j} (9) for any A_i ∈ A.
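As a numerical illustration, the following sketch checks the decomposition used with Lemma B.7 in the proof of Lemma B.4, namely τ_{a_i n} = Σ_{A_j ∈ A} τ_{a_j n.a_{−j}} τ_{a_i a_j}, on a randomly weighted DAG. It assumes that τ_{a_j n.a_{−j}} is the sum of path products over directed paths from N to A_j that avoid A_{−j}, as in the definition of P_{ab.c} in the proof. The coefficient matrix B, the helper names and the chosen node indices are illustrative assumptions and not part of the paper.

```python
import numpy as np

def total_effects(B):
    """Entry [i, j] is the total effect of node j on node i (sum of path products)."""
    return np.linalg.inv(np.eye(B.shape[0]) - B)

def total_effects_avoiding(B, avoid):
    """Total effects along directed paths that contain no node in `avoid`."""
    B_cut = B.copy()
    B_cut[avoid, :] = 0.0  # drop edges into the avoided nodes
    B_cut[:, avoid] = 0.0  # drop edges out of the avoided nodes
    return total_effects(B_cut)

rng = np.random.default_rng(0)
d = 6
# Strictly lower-triangular B: an edge j -> i exists only if j < i, so the graph is a DAG.
# B[i, j] is the edge coefficient of j -> i (hypothetical values).
B = np.tril(rng.uniform(0.5, 1.5, size=(d, d)), k=-1)

n = 0          # the node N
A = [2, 3, 5]  # the node set A, with N not in A
T = total_effects(B)

def both_sides(i):
    """Left and right hand side of the decomposition for A_i = node i."""
    lhs = T[i, n]
    rhs = sum(
        total_effects_avoiding(B, [a for a in A if a != j])[j, n] * T[i, j]
        for j in A
    )
    return lhs, rhs

for i in A:
    lhs, rhs = both_sides(i)
    print(i, np.isclose(lhs, rhs))
```

Each directed path from N to A_i is counted exactly once, according to the first node of A it reaches, which is the partition argument used in the proof.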
Proof. We first define two objects. Given two nodes A and B and a node set C, let P_{ab.c} denote the set of all directed paths from B to A not containing any nodes in C. Further, given a directed path p, let τ_p denote the total effect along p, i.e. the product of the edge coefficients along p. We now prove our claim. Using the definition of the total effect via the path method we can rewrite the left hand term of equation (9) as τ_{a_i n} = Σ_{p ∈ P_{a_i n}} τ_p, and similarly, the right hand term as Clearly, the total effect along a directed path q = p ⊕ p′ is equal to the product of the total effect along p and the total effect along p′. Using this we can rewrite equation (10) as where P_{a_i n a_j.a_{−j}} is the set of all directed paths p from N to A_i, such that A_j lies on p and p(N, A_j) contains no node from A_{−j}. Clearly, for any two nodes A_j, A_k ∈ A, P_{a_i n a_j.a_{−j}} ∩ P_{a_i n a_k.a_{−k}} = ∅. Since every directed path from N to A_i contains a node in A, it follows that ∪_{A_j ∈ A} P_{a_i n a_j.a_{−j}} is a partition of P_{a_i n} and therefore,

C Proof of Theorem 3.4
Proof of Theorem 3.4. Let X = {X_1, . . . , X_{k_x}} and Y = {Y_1, . . . , Y_{k_y}} be disjoint node sets in a causal maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z_1 and Z_2 be two valid adjustment sets relative to (X,
We first consider the case that G is a causal DAG. By applying Lemma C.2 with T = Z_1 \ Z_2, S = Z_2 \ Z_1 and W = Z_1 ∩ Z_2, it follows that for all X_i ∈ X and Y_j ∈ Y. Using the asymptotic variance formula from Proposition 3.1 it follows that The proof then extends to the causal maximal PDAG setting with the fact that by Lemma C.1, d-separation in a maximal PDAG implies d-separation in every represented DAG, including the true underlying one.
In the multivariate Gaussian setting, the result of Theorem 3.4 follows by Lemma C.2 and the well known asymptotic variance formula from Lemma A.8 directly and does not require the new result from Proposition 3.1. In this setting it also holds for a larger class of sets, since Lemma A.8, as opposed to Proposition 3.1, does not require the conditioning set to be a valid adjustment set.
Lemma C.1 shows that our definition of d-separation in maximal PDAGs is sensible, in the sense that it is compatible with d-separation in the DAGs represented by a maximal PDAG. It is analogous to Theorem 4.18 in Richardson and Spirtes (2002) for m-separation in maximal ancestral graphs (see also Lemma 20 in Zhang, 2008a) and Lemma 26 in Zhang (2008a) for m-separation in partial ancestral graphs.
Lemma C.1. Let X, Y and Z be pairwise disjoint node sets in a maximal PDAG G. Then X ⊥ D Y|Z in every DAG D ∈ [G], if and only if Z blocks every definite status path between any node in X and any node in Y in G.
Proof of Lemma C.1. We prove this statement by showing that the contrapositive statement is true.
Consider a definite status path p from X to Y that is open given Z in G and a DAG D ∈ [G]. Since D is in the equivalence class described by G it follows that D has the same adjacencies as G, and every edge A → B in G is also in D. Let p* be the corresponding path to p in D. Since every node on p is of definite status in G, every node on p* is of the same definite status in D. Since additionally, p is open given Z and since for every
Conversely, if there is a path q from X to Y that is open given Z in every DAG D ∈ [G], then by the proof of Lemma 26 in Zhang (2008a) (see also the proof of Lemma 5.1.7 in Zhang, 2006), there is a definite status path q* from X to Y that is open given Z in the CPDAG C of any such D. Since G describes a subset of the Markov equivalence class of [C] (Meek, 1995), we can conclude with the same argument as above that the corresponding path q** of q* in G is a definite status path from X to Y in G that is open given Z.

C.1 Residual linear variance inequalities
By Lemmas A.9 and A.11 the asymptotic limit of a least squares regression is a function of the covariance matrix only, even when the regression is misspecified. Hence, the following statements and proofs are essentially linear algebra formulated in statistical terms. They do not depend on any property of the Gaussian distribution. The following four Lemmas are simple generalizations of already existing results and are primarily given for completeness and conciseness; especially the latter three. Lemma C.2 is a simple extension of Lemma 4 in Kuroki and Cai (2004) from random variables X and Y to random vectors X and Y, while additionally allowing W to be non-empty. Lemma C.3 is an extension of a well known result from Cochran (1938) to vectors X, Y and Z. Lemma C.4 is a simple generalization of a result by Kuroki and Miyakawa (2003, Lemma 1) from random variables Z and S to random vectors Z and S. Note that it is quite similar to a result presented in Section 2.5 of Anderson (1958). Lemma C.5 is a generalization of results by Wermuth (1989, Results 1.2 and 5.2) from random variables X, Y, S and T to random vectors X, Y, S and T. Further, all of these Lemmas are also generalizations to non-Gaussian random variables using the result from Lemma A.9.
Proof. We first assume that T = ∅. Since S ⊥ ⊥ X|W it must also hold that S ⊥ ⊥ X i |(W T , X −i T ) T for all X i ∈ X, by the weak union property of conditional independence from Dawid (1979). Then, by Lemma C.4, σ y j y j .xws ≤ σ y j y j .xw and by Lemma C.5, We now assume that S = ∅. By Lemma C.4, σ x −i w and by Lemma C.5, σ y j y j .xw = σ y j y j .xwt .
We now assume T ≠ ∅ and S ≠ ∅. First, we show inequality (a). Since S ⊥⊥ X | (W^T, T^T)^T it also holds that S ⊥⊥ X_i | (W^T, T^T, X_{−i}^T)^T for all X_i ∈ X by the weak union property of conditional independence. Thus, β_{x_i s.t x_{−i} w} = 0 by Lemma A.10 and Then by Lemma C.4, For inequality (b) we use that by Lemma C.5, Y ⊥⊥ T | (W^T, S^T, X^T)^T implies that β_{y_j t.xws} = 0 by Lemma A.10 and β_{y_j s.txw} = β_{y_j s.xw} by Lemma C.5. Hence, by Lemma C.3, β_{y_j t.xw} = β_{y_j t.sxw} + β_{y_j s.txw} β_{st.xw} = β_{y_j s.xw} β_{st.xw}.
Then by Lemma C.4,
Lemma C.3. Let V = (S^T, T^T, W^T, Z^T)^T, with Z possibly of length zero, be a mean 0 random vector with finite variance. Then β_{tw.z} = β_{tw.sz} + β_{ts.wz} β_{sw.z}.
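Since Lemma C.3 is a statement about covariance matrices only, the identity β_{tw.z} = β_{tw.sz} + β_{ts.wz} β_{sw.z} can be checked numerically. The sketch below does so for an arbitrary positive definite covariance matrix. It assumes that β_{ab.c} denotes the block of population least squares coefficients on B when A is regressed on B and C jointly; the helper `coef` and the index layout are illustrative assumptions.

```python
import numpy as np

def coef(Sigma, target, regressors, keep):
    """Population least squares coefficients of `target` on `regressors`,
    restricted to the columns that belong to the variables in `keep`."""
    beta = Sigma[np.ix_(target, regressors)] @ np.linalg.inv(
        Sigma[np.ix_(regressors, regressors)]
    )
    return beta[:, [regressors.index(k) for k in keep]]

rng = np.random.default_rng(1)
M = rng.normal(size=(7, 7))
Sigma = M @ M.T + np.eye(7)  # an arbitrary positive definite covariance matrix

# Hypothetical index sets for the vectors S, T, W and Z.
s, t, w, z = [0, 1], [2], [3, 4], [5, 6]

lhs = coef(Sigma, t, w + z, w)                  # beta_{tw.z}
rhs = (
    coef(Sigma, t, w + s + z, w)                # beta_{tw.sz}
    + coef(Sigma, t, w + s + z, s)              # beta_{ts.wz}
    @ coef(Sigma, s, w + z, w)                  # beta_{sw.z}
)
print(np.allclose(lhs, rhs))
```

No distributional assumption enters the check, in line with the remark that these results do not depend on any property of the Gaussian distribution.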
Proof. This proof is based on the uniqueness of the least squares regression. Precisely, by a projection argument, it holds that for any random vector (Y, X T ) T with Y = β yx X + , the least squares regression coefficient β yx is characterized by the property that E[ with E[ t (Z T , W T , S T )] = 0. Substituting equation (13) into equation (14) gives On the other hand, regressing T on (Z T , W T ) T directly yields with E[ t (Z T , W T )] = 0.
If Z = ∅, one can simply drop all terms involving Z.
Lemma C.4. Let V = (S^T, W^T, Z^T)^T, with S possibly of length zero, be a mean 0 random vector with finite variance. Then
Proof. Let R = (S^T, W^T)^T. By Lemma A.9 it holds that Σ_{zr} = β_{zr} Σ_{rr}. Combining this with the fact that Σ_{zz.r} = Σ_{zz} − Σ_{zr} Σ_{rr}^{−1} Σ_{zr}^T, it follows that If S = ∅ then R = W and our claim follows. So suppose that S ≠ ∅. Since R = (S^T, W^T)^T, Σ_{zz.sw} = Σ_{zz.r}. Note that Σ_{rr} is the block matrix with rows (Σ_{ss}, Σ_{sw}) and (Σ_{ws}, Σ_{ww}), and β_{zr} = (β_{zs.w}, β_{zw.s}).
Plugging this into equation (17) yields Using Σ_{ws} = β_{ws} Σ_{ss} and Σ_{sw} = Σ_{ss} β_{ws}^T, we can rewrite equation (18) as By Lemma C.3 it holds that β_{zs} = β_{zs.w} + β_{zw.s} β_{ws}. Plugging this into equation (19) and then using equation (17) twice, we arrive at

C.2 Proof of Corollaries 3.6, 3.7 and 3.8
Proof of Corollary 3.6. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z and Z′ = Z \ {P}, with P ∈ (pa(X, G) ∩ Z), each be a valid adjustment set relative to (X, Y) in G. Consider a DAG D compatible with G. By the completeness of the adjustment criterion, Z and Z′ are also valid adjustment sets relative to (X, Y) in D. Since pa(X, G) ⊆ pa(X, D) it also follows that P ∈ (pa(X, D) ∩ Z). We can therefore without loss of generality consider D rather than G.
We apply Theorem 3.4 with Z_1 = Z and Z_2 = Z′. Since Z′ ⊂ Z it only remains to show that {P} ⊥_D Y | X ∪ Z′. We prove this by showing that the existence of a path from P to Y that is open given X ∪ Z′ in D contradicts the assumption that Z′ is a valid adjustment set. Let p be such a path and p′ be the path X ← P, X ∈ X, which exists by construction. Let I be the node closest to X on p′ which also lies on p and consider q = p′(X, I) ⊕ p(I, Y). If I = X, then q is a subpath of p and is hence open given X ∪ Z′. As we assume Z′ to be a valid adjustment set, q must also be open given Z′ by Lemma B.2 and therefore has to be causal. This, however, implies that X is a non-collider on p, which contradicts our starting assumption that p is open given X ∪ Z′. If I = P, then q is a non-causal path from X to Y. Since P ∉ X ∪ Z′ is a non-collider on q, q is open given X ∪ Z′ and hence by Lemma B.2 and our assumptions, it must also be open given Z′. Further, q is either proper or any node from X it contains is a collider. Let X′ be the node in X closest to Y on q. Then q(X′, Y) is a proper, non-causal path from X to Y that is open given Z′.
Proof of Corollary 3.7. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let V follow a causal linear model compatible with G. Let Z be a valid adjustment set relative to (X, Y) in G and let Z′ = Z ∪ {R}, with R ∈ pa(Y, G).
Consider a DAG D compatible with G. By the completeness of the adjustment criterion, Z and Z′ are also valid adjustment sets relative to (X, Y) in D. Since pa(Y, G) ⊆ pa(Y, D), it follows that R ∈ pa(Y, D) \ Z. We can therefore without loss of generality consider D rather than G.
We now apply Theorem 3.4 with Z_1 = Z and Z_2 = Z′. Since Z ⊂ Z′ it only remains to show that {R} ⊥_G X | Z. We will now show that the existence of a non-causal path p from X to R that is open given Z contradicts the assumption that Z is a valid adjustment set. Since the existence of a causal path would contradict that R ∉ forb(X, Y, D), which is implicit in the assumption that Z′ is a valid adjustment set, this suffices to show our claim.
Consider a non-causal path p from X to R and suppose that it is open given the valid adjustment set Z. Let X ∈ X be the last such node on p. Suppose that p(X, R) is directed from X to R and therefore R ∈ de(X, G). Since R ∈ pa(Y, G) this contradicts that R / ∈ forb(X, Y, G). Hence, p = p(X, . . . , R) must be a proper, non-causal path from X to R that is open given Z. Let p be the path R → Y that exists by assumption and let I be the node closest to Y on p which also lies on p and consider q = p (X, I)⊕p (I, Y ).
If I = Y , then q is a subpath of p and hence open given Z. Therefore, it has to be causal as otherwise its existence would contradict the assumption that Z is a valid adjustment set. Since R ∈ pa(Y, G), Y or one of its descendants is then a collider on p . By the causality of q, de(Y, G) ⊆ forb(X, Y, G) and thus, p may not be open given Z yielding a contradiction.
If I = R, then R / ∈ Z is a non-collider on q and hence q is a proper path from X to Y that is open given Z. Since p is non-causal and a subpath of q, it follows that q is non-causal. Hence, q is a proper non-causal path from X to Y that is open given Z again yielding a contradiction.
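The effect of Corollaries 3.6 and 3.7 can be illustrated on a toy model. The sketch below uses the DAG P → X → Y ← R with unit error variances and compares four valid adjustment sets. It assumes that for a single treatment and outcome the asymptotic variance from Proposition 3.1 takes the ratio form σ_{yy.xz} / σ_{xx.z}; this formula is not restated in this appendix, and the graph, the edge coefficients and the helper names are hypothetical.

```python
import numpy as np

names = ["P", "R", "X", "Y"]
idx = {v: k for k, v in enumerate(names)}

# B[i, j] is the coefficient of the edge j -> i (hypothetical values).
B = np.zeros((4, 4))
B[idx["X"], idx["P"]] = 0.8
B[idx["Y"], idx["X"]] = 1.0
B[idx["Y"], idx["R"]] = 1.5

# Covariance implied by V = B V + eps with independent unit-variance errors.
inv = np.linalg.inv(np.eye(4) - B)
Sigma = inv @ inv.T

def cond_var(S, target, given):
    """Residual variance of `target` after linear regression on `given`."""
    if not given:
        return S[target, target]
    g = list(given)
    s_tg = S[target, g]
    return S[target, target] - s_tg @ np.linalg.solve(S[np.ix_(g, g)], s_tg)

def avar(z_names):
    """Assumed asymptotic variance sigma_{yy.xz} / sigma_{xx.z} for adjustment set z."""
    z = [idx[v] for v in z_names]
    x, y = idx["X"], idx["Y"]
    return cond_var(Sigma, y, [x] + z) / cond_var(Sigma, x, z)

for z_names in (["P"], [], ["P", "R"], ["R"]):
    print(z_names, avar(z_names))
```

Under these assumptions, dropping the parent P of X lowers the variance and adding the parent R of Y lowers it further, so {R} has the smallest value of the four sets.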
Proof of Corollary 3.8. Let X and Y be disjoint node sets in a DAG G = (V, E), such that pa(X, G) = ∅ and let V follow a causal linear model compatible with G. Let Z and Z′ be node sets in G, such that Z ∩ (de(X, G)
We first show that Z and Z′ are valid adjustment sets relative to (X, Y) in G. Clearly, forb(X, Y, G) ⊆ de(X, G) and therefore Z ∩ forb(X, Y, G) = ∅. Let p be a proper non-causal path from some node X ∈ X to some node Y ∈ Y in G. By the assumption that pa(X, G) = ∅ it follows that p must be of the form X → . . . Y. As p is non-causal this implies the existence of a collider C on p, such that C ∈ de(X, G). As Z ∩ de(X, G) = ∅ it follows that Z ∩ de(C, G) = ∅ and thus p is blocked by Z. The same reasoning holds for Z′.
Suppose now that Z ⊆ Z′. We will show that Z′ \ Z ⊥_G X | Z. Consider a proper path p from some node Z ∈ Z′ \ Z to some node X ∈ X. As pa(X, G) = ∅, the path p must be of the form Z · · · ← X. Since Z ∉ de(X, G), p must contain a collider C, such that C ∈ de(X, G). But since Z ∩ de(X, G) = ∅ it follows that Z ∩ de(C, G) = ∅ and therefore p is blocked given Z. Our claim then follows by Theorem 3.4.

D Proof of Theorem 3.9
Proof of Theorem 3.9. Let X and Y be disjoint node sets in a maximal PDAG G = (V, E) and let Z be a valid adjustment set relative to (X, Y) in G.
As a simple corollary of Lemma D.2 there is a unique partition Z = Z_1 ∪ Z_2 such that Y ⊥_G Z_1 | Z_2 ∪ X and Z_1 is of maximal size or equivalently Z_2 is of minimal size. By Lemma D.3, Y ⊥_G Z′_1 | Z′′_1 ∪ Z_2 ∪ X for any two disjoint subsets Z′_1, Z′′_1 ⊆ Z_1. Jointly, these two results imply that given any subset Z′ ⊆ Z, fulfilling Z_2 ⊆ Z′, it holds that for any node Z ∈ Z′ From this it follows that the output of Algorithm 1 is Z_2 ⊆ Z, independently of the order in which the nodes in Z are considered. By the same d-separation result and Lemma D.1, both Z_2 and any possible intermediate Z′_2 ⊇ Z_2 are valid adjustment sets relative to (X, Y) in G.
If V follows a causal linear model compatible with G we can apply Theorem 3.4 to conclude that a.var(τ z 2 xy ) ≤ a.var(τ z xy ). Further, by the minimality of Z 2 , no other subset of Z is guaranteed to have a better asymptotic variance than Z 2 by Theorem 3.4.
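The pruning idea can be sketched in code for the DAG case. Algorithm 1 itself is not reproduced in this appendix, so the loop below is written under the assumption that it removes a node Z from the current set Z′ whenever Y ⊥_G Z | X ∪ (Z′ \ {Z}). The d-separation test uses the standard moralized ancestral graph construction, and the example graph and all function names are hypothetical.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """d-separation in a DAG via the moralized ancestral graph."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    anc = ancestors(parents, xs | ys | zs)
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = list(parents.get(v, ()))
        for p in ps:                      # keep parent-child adjacencies
            nbrs[v].add(p)
            nbrs[p].add(v)
        for a, b in combinations(ps, 2):  # marry the parents of v
            nbrs[a].add(b)
            nbrs[b].add(a)
    seen = set(xs - zs)
    stack = list(seen)
    while stack:                          # search that never enters zs
        v = stack.pop()
        if v in ys:
            return False
        for w in nbrs[v]:
            if w not in zs and w not in seen:
                seen.add(w)
                stack.append(w)
    return True

def prune(parents, x, y, z):
    """Assumed pruning loop: drop a node if Y is d-separated from it given the rest."""
    kept = list(z)
    for node in list(z):
        rest = [k for k in kept if k != node]
        if d_separated(parents, {node}, y, set(x) | set(rest)):
            kept = rest
    return set(kept)

# Hypothetical DAG: P -> X, C -> X, C -> Y, X -> Y, R -> Y.
parents = {"X": {"P", "C"}, "Y": {"X", "C", "R"}}
print(prune(parents, {"X"}, {"Y"}, ["P", "C", "R"]))
```

In this example only P is removed: C and R are adjacent to Y and can therefore never be d-separated from it.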
The following result is very similar to Proposition 2 from VanderWeele and Shpitser (2011), albeit in the graphical rather than the potential outcomes framework.
Lemma D.1. Consider pairwise disjoint node sets X, Y, Z 1 and Z 2 in a causal maximal PDAG G, such that Z = Z 1 ∪ Z 2 is a valid adjustment set relative to (X, Y) in G. If Y ⊥ G Z 1 |Z 2 ∪ X, then Z 2 is a valid adjustment set relative to (X, Y).
Proof. We prove the statement by contraposition, so assume that Z is a valid adjustment set relative to (X, Y) in G, whereas Z_2 is not. Since Z_2 ⊆ Z, Z_2 ∩ forb(X, Y, G) = ∅ trivially holds. Thus, there must be a proper, non-causal, definite status path p from X to Y that is open given Z_2, while being blocked by Z. Therefore, p contains at least one non-collider N ∈ (Z \ Z_2) = Z_1, where we choose the N closest to the endpoint node Y ∈ Y. Then the subpath p(N, Y) of p is a definite status path from Z_1 to Y that is open given Z_2. As p is proper, p(N, Y) does not contain any nodes in X and is therefore also open given Z_2 ∪ X. Thus, Y and Z_1 are not d-separated given Z_2 ∪ X in G.
Lemma D.2. Consider pairwise disjoint node sets X, Y and Z in a causal maximal PDAG G, such that Y ⊥ G Z 1 |Z 2 ∪ X. Then given any other partition Proof. We first show that Y ⊥ G Z 2 ∩ Z 1 |(Z 2 ∩ Z 2 ) ∪ X by contradiction. Assume that Y ⊥ G Z 1 |Z 2 ∪ X and Y ⊥ G Z 1 |Z 2 ∪ X. Let p be a proper, definite status path from Z 2 ∩ Z 1 ⊆ Z 1 to some node Y ∈ Y and assume that p is open given (Z 2 ∩ Z 2 ) ∪ X. By assumption p is blocked by Z 2 ∪ X. As ((Z 2 ∩ Z 2 ) ∪ X) ⊆ (Z 2 ∪ X), this can only be the case if there exists a non-collider N ∈ (Z 2 \ Z 2 ) ⊆ Z 1 on p. As a subpath of p, p(N, Y ) is trivially open given (Z 2 ∩ Z 2 ) ∪ X and since p is proper, p(N, Y ) contains no node in Z 2 ∩ Z 1 . Therefore, p(N, Y ) is also open given (Z 2 ∩ Z 1 ) ∪ (Z 2 ∩ Z 2 ) ∪ X = Z 2 ∪ X. But this contradicts that Y ⊥ G Z 1 |Z 2 ∪ X. Thus, any such p must be blocked by (Z 2 ∩ Z 2 ) ∪ X.
By Lemma D.4, the two d-separation statements The two following Lemmas are general properties of the d-separation criterion from Verma and Pearl (1988). They extend to the causal maximal PDAG setting with the fact that by Lemma C.1, d-separation in a maximal PDAG implies d-separation in every represented DAG and vice versa. (c) Let Z be a valid adjustment set relative to (X, Y) in G, such that a.var(τ o yx ) = a.var(τ z yx ).
If V follows a causal linear model compatible with G and f is faithful to G then O ⊆ Z.
We prove each of the three Statements of Proposition E.1 separately, due to the complexity of the proofs.
Proof of Statement (a) of Proposition E.1. As we are considering a DAG, the amenability condition is trivially fulfilled. By construction O fulfills the forbidden set condition relative to (X, Y) in G in Definition A.4. By the assumption that there exists a valid adjustment set and Corollary A.12 it follows that X ∩ cn(X, Y, G) = ∅. Hence, we can invoke Lemma E.3 with T = X, to conclude that O fulfills the blocking condition relative to (X, Y) in G from Definition A.4.
We now show that S and Y are not d-separated by W ∪ X. Let S ∈ S ⊆ O. Then there exists a directed path p from S to Y that consists of causal nodes, except for S. Hence, p cannot be blocked by X ∪ W, as (X ∪ W) ∩ (cn(X, Y, G) ∪ S) = ∅. As the underlying distribution f is assumed to be faithful to G it thus follows that β_{ys.xw} ≠ 0.

Proof of Statement
Within the proof of Statement (b) of Proposition E.1 we have already shown that for all X_i ∈ X, for any valid adjustment set. Therefore, for Z to be asymptotically optimal, σ_{y_j y_j.xz} ≤ σ_{y_j y_j.xo} has to hold for all Y_j ∈ Y. By equation (11), σ_{y_j y_j.xz} − σ_{y_j y_j.xo} = β_{y_j s.xw} Σ_{ss.xwt} β_{y_j s.xw}^T for all Y_j ∈ Y, where T = Z \ O ≠ ∅. The case T = ∅ is the equivalent statement, with the convention that empty conditioning sets are omitted, and follows directly from Lemma C.4 jointly with the fact that in this case O = Z ∪ S. As the conditional distributions are assumed to be non-degenerate, Σ_{ss.xwt} is positive definite. We have already shown that β_{ys.xw} ≠ 0, so it follows that σ_{y_j y_j.xz} > σ_{y_j y_j.xo} for some Y_j ∈ Y, which yields a contradiction.
While Lemma E.3 is a rather technical result, Lemma E.4 and Lemma E.5 have an intuitive interpretation. Lemma E.4 states that given a valid adjustment set Z, this set may not contain more information on Y conditionally on X than O. Lemma E.5 states that Z cannot contain less information on X than O. The surprising fact that O has both of these properties simultaneously is the key to its asymptotic optimality.
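For concreteness, the set O can be computed directly from a DAG. The sketch below assumes the main-text definitions O(X, Y, G) = pa(cn(X, Y, G), G) \ forb(X, Y, G) and forb(X, Y, G) = de(cn(X, Y, G), G) ∪ X, where cn(X, Y, G) collects the nodes other than X on proper causal paths from X to Y. These definitions are not restated in this appendix, and the example graph and function names are hypothetical.

```python
def reach(adj, start, blocked=frozenset()):
    """Nodes reachable from `start` by one or more steps along `adj`,
    never entering a node in `blocked`."""
    seen, stack = set(), list(start)
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return seen

def o_set(edges, X, Y):
    """Assumed definition: O = pa(cn) minus forb, with forb = de(cn) united with X."""
    X, Y = set(X), set(Y)
    ch, pa = {}, {}
    for a, b in edges:  # each pair (a, b) is a directed edge a -> b
        ch.setdefault(a, set()).add(b)
        pa.setdefault(b, set()).add(a)
    down = reach(ch, X, blocked=X)     # reached from X without re-entering X
    up = reach(pa, Y, blocked=X) | Y   # reach Y without passing through X
    cn = down & up                     # nodes on proper causal paths, X excluded
    forb = cn | reach(ch, cn) | X
    parents_of_cn = set().union(*(pa.get(v, set()) for v in cn))
    return parents_of_cn - forb

# Hypothetical DAG with a mediator M, a confounder C, an instrument-like node P,
# a parent R of Y and a descendant F of the mediator.
edges = [("P", "X"), ("C", "X"), ("C", "Y"), ("X", "M"),
         ("M", "Y"), ("R", "Y"), ("M", "F")]
print(o_set(edges, {"X"}, {"Y"}))
```

Here cn = {M, Y}, so the parents of causal nodes are X, M, C and R; removing the forbidden nodes leaves C and R, while P and F are excluded.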
We now introduce an object that is used in the proof of Lemma E.3.
Definition E.2. Consider two disjoint node sets X and Y in a causal DAG G, such that Y ⊆ de(X, G) and let p be a path to some Y ∈ Y. Then the maximal causal segment of p with respect to (X, Y), is the longest subpath p(C, Y ) of p, such that all nodes on p(C, Y ) are in cn(X, Y, G) and p(C, Y ) is directed towards Y .
For an example consider Figure 7 and consider the path p = (X, A_1, A_2, Y, F). The longest maximal causal segment with respect to (X, F) of p is Y → F. It is the longest directed subpath of p that consists of causal nodes and ends in F.
Lemma E.3. Let X and Y be disjoint node sets in a causal DAG G, such that Y ⊆ de(X, G) and let O = O(X, Y, G). Let T be a node set such that T ∩ de(cn(X, Y, G), G) = ∅ and T ∩ O = ∅. If an adjustment set relative to (X, Y) in G exists, then all proper non-causal paths from T to Y in G that contain no nodes from X \ T are blocked by O ∪ (X \ T).
Proof. Let p be a proper non-causal path from T ∈ T to Y ∈ Y that is open given O ∪ (X \ T) and contains no nodes in X \ T. Let c_Y = p(C_1, Y) be the maximal causal segment of p with respect to (X, Y), where we use that Y ⊆ de(X, G) implies that Y ⊆ cn(X, Y, G). Then p is of the form (a) T · · · V → C_1 → · · · → Y or (b) T · · · V ← C_1 → · · · → Y. If p is of the form (a) and V ∈ O, then p is blocked by O ∪ (X \ T). We now show that V ∈ O does in fact hold by contradiction, so suppose that V ∉ O. Note that V ≠ T, as otherwise p would be a causal path from T to Y. By assumption p is proper and contains no nodes from X \ T and hence V ∉ X. As V is a parent of the causal node C_1, V ∉ O can only hold if V ∈ forb(X, Y, G). As V ∉ X it follows that V ∈ de(cn(X, Y, G), G). Then there is a proper directed path from X to V. Additionally, p(V, Y) is a directed path towards Y that does not contain a node in X, so V ∈ cn(X, Y, G), which contradicts that c_Y is the maximal causal segment of p with respect to (X, Y).
If p is of the form (b), V ∈ de(cn(X, Y, G), G), which implies V ≠ T. Further, p has to contain at least one collider, as otherwise T ∈ de(V, G) ⊆ de(cn(X, Y, G), G). Let V′ be the collider on p that is closest to V. Then V′ ∈ de(V, G), so V′ ∈ forb(X, Y, G). Therefore, de(V′, G) ∩ O = ∅. By the assumption that there is an adjustment set relative to (X, Y) in G and Corollary A.12 it follows that X ∩ de(cn(X, Y, G), G) = ∅. Therefore, it also holds that de(V′, G) ∩ (X \ T) = ∅. Since V′ is a collider on p and de(V′, G) ∩ (O ∪ (X \ T)) = ∅, p is blocked by O ∪ (X \ T).
Lemma E.4. Let X and Y be disjoint node sets in a causal DAG G, such that Y ⊆ de(X, G) and let O = O(X, Y, G). Let T be a set such that T ∩ (forb(X, Y, G) ∪ O) = ∅. If there exists a valid adjustment set relative to (X, Y) in G, then Y ⊥_G T | O ∪ X.
Proof. It is enough to show that all proper paths from T to Y are blocked by O ∪ X. Let p be such a path from T ∈ T to Y ∈ Y.
First, suppose that no node from X lies on p. If p is a non-causal path from T to Y, then by Lemma E.3, p is blocked by O ∪ X. If p is causal from T to Y, then by the fact that an(Y, G) ∩ forb(X, Y, G) = cn(X, Y, G), the non-forbidden node O closest to Y on p is in O. Since T ∉ (forb(X, Y, G) ∪ O), such a node O exists and it holds that O ≠ T. Clearly, O ≠ Y and therefore O is a non-collider on p. Hence p is blocked given O ∪ X.

Now, assume that p contains at least one node from X. If a node in X is a non-collider on p, p is blocked by O ∪ X. So we can assume that all nodes on p that are in X are colliders. Let X ∈ X be the collider on p that is closest to Y. Then p(X, Y) is a proper non-causal path from X to Y and since, by the already proven Statement (a) in Theorem E.1, O is a valid adjustment set relative to (X, Y) in G, O blocks p(X, Y). Now assume that p(X, Y) is open given O ∪ X while being blocked by O. By Lemma B.2 this contradicts that O is a valid adjustment set. Hence, we can conclude that p is blocked by O ∪ X.
Lemma E.5. Let X, Y, S and W be pairwise disjoint node sets in a causal DAG G,

Proof. For contraposition, suppose that X ⊥̸_G S | W and that there exists a valid adjustment set relative to (X, Y) in G. We will show that this implies the existence of a proper non-causal path from X to Y that is open given W and hence that W is not a valid adjustment set.
Let p be a proper path from X ∈ X to S ∈ S that is open given W. Since S ∈ S ⊆ O, there exists a directed path p′ from S to some Y ∈ Y that consists of S and nodes in cn(X, Y, G). As W ∩ cn(X, Y, G) = ∅, p′ must be open given W.
Let I be the node closest to X on p that is also on p′ and consider the path q = p(X, I) ⊕ p′(I, Y). Since I is on p′ and (S ∪ cn(X, Y, G)) ∩ X = ∅, I ≠ X. Suppose now that I = Y. Then q is a subpath of p and since p is open given W, q is a path from X to Y that is open given W. Suppose now that I ≠ Y. As both p(X, I) and p′(I, Y) are open given W and since p′(I, Y) is directed towards Y, I is a non-collider on q. With the fact that I is on p′ and p′ contains no node in W we can thus conclude that q is a path from X to Y that is open given W.
We now show that q is proper. By the assumption that there is an adjustment set relative to (X, Y) in G and Corollary A.12 it follows that X ∩ de(cn(X, Y, G), G) = ∅. Hence, p′ does not contain a node in X and as p is proper, it follows that q is a proper path from X to Y.
It is left to show that q is a non-causal path. For contradiction, suppose that q is a causal path. Then p must be of the form X → · · · → I · · · S. Since S ∈ O ⊆ an(Y, G) \ Lemma E.6 and the fact that Y ∈ cn(X, Y, G) it follows that A ∈ de(cn(X, Y, D), D) in this case.
Lastly, suppose q corresponds to A ← ··· ← V_j → ··· → Y in D. We will show that V_j ∈ de(cn(X, Y, D), D), since then from A ∈ de(V_j, D), it follows that A ∈ de(cn(X, Y, D), D). Since V_j ∈ possde(A, G), V_j ∈ forb(X, Y, G) and since V_j is on q, V_j ≠ X. Hence, V_j ∈ forb(X, Y, G) \ X. Let p_{V_j} be a proper causal path from X to V_j in G (Lemma E.6) and let p*_{V_j} and q* be the paths corresponding to p_{V_j} and q in D. Then we can concatenate p*_{V_j} and q*(V_j, Y) so that V_j ∈ de(cn(X, Y, D), D).

let p_{D_1} be a proper causal path from X to D_1. In order to prove that D_1 is in the set de(posscn(X, Y, G), G), we only need to show that there is a possibly causal path from D_1 to Y that does not contain a node in X. Hence, let −r = (A, ..., D_1, ..., B). Since r is a possibly causal unshielded path from B to A, r and −r are paths of definite status. Then since −r(D_1, B) is of the form D_1 − ··· − B, −r(D_1, B) is a possibly causal path from D_1 to B. Since B ∈ posscn(X, Y, G), let s be a possibly causal path from B to Y ∈ Y that does not contain a node in X. Let D be the node on −r(D_1, B) closest to D_1 that is also on s. Then q = −r(D_1, D) ⊕ s(D, Y) is a possibly causal path from D_1 to Y in G.
Lastly, since s does not contain a node in X, any node in X on q would need to be on D_1 − ··· − D. However, this would contradict the amenability of G relative to (X, Y). Hence, q does not contain a node in X.

Lemma E.9. (Cf. Lemma B.1 in Zhang, 2008b and Lemma 3.6 in Perković et al., 2017) Let X and Y be two nodes in a maximal PDAG G. If p is a directed causal path from X to Y in G, then a subsequence p′ of p forms an unshielded directed causal path from X to Y in G.
F Supplement: Simulation study

F.1 Setup
Technical details: For our simulation we use R (3.5.2) and the R-package pcalg (2.6-11) by Kalisch et al. (2012).

Graph: We uniformly draw the number of nodes in the DAG from {10, 20, 50, 100}, the expected neighborhood size from {2, 3, 4, 5} and the graph type to be either Erdős-Rényi or power law. We use the function randDAG from the R-package pcalg to generate 10'000 such graphs.

Causal linear model: To each graph G we associate a causal linear model in the following manner. We draw the model's error distribution to be either normal, logistic, uniform or a t-distribution with 5 degrees of freedom, with probabilities 1/2, 1/6, 1/6 and 1/6. For each node, we then draw error parameters ensuring that the mean is 0 and the variance between 0.5 and 1.5. Specifically, for the normal distribution, the variance parameter is drawn uniformly from [0.5, 1.5] and the mean set to 0. For the logistic distribution, the scale parameter is drawn uniformly from [0.4, 0.7] and the location parameter is set to 0. For the uniform distribution, the sampling interval is of the form [−a, a], with a drawn uniformly from [1.2, 2.1]. The t-distribution is first normalized and then multiplied with the square root of a parameter v uniformly drawn from [0.5, 1.5]. Finally, we draw coefficients for each edge in G from a uniform distribution on [−2, −0.1] ∪ [0.1, 2].

The pair (X, Y): For each DAG D, we draw one pair (X, Y) in the following manner. We first draw |X| from {1, 2, 3}, with probabilities 1/2, 1/4 and 1/4. We then uniformly draw node sets X of the specified size until ⋃_{X_i ∈ X} de(X_i, D) ≠ ∅ and then draw Y uniformly from ⋃_{X_i ∈ X} de(X_i, D). We then verify whether X ∩ de(cn(X, Y, D), D) = ∅ and whether the true CPDAG is amenable with respect to (X, Y). This is done to ensure that valid adjustment sets exist in both the true DAG and CPDAG. If either is not the case, we discard (X, Y) and repeat the procedure.
If no (X, Y) is found, we draw a new DAG with the same parameters as the original.

Total effect estimation: For each DAG D we uniformly draw a sample size n ∈ {125, 500, 2000, 10000}. We then sample 100 data sets of size n from the causal linear model corresponding to D. For each data set we compute a graph estimate G. When the error distribution is normal, a CPDAG is estimated with the Greedy Equivalence Search (GES) algorithm by Chickering (2002). This is done with the ges function from the R-package pcalg without tuning any of the parameters. When the error distribution is not normal, we estimate a DAG with the Linear Non-Gaussian Acyclic Models (LiNGAM) algorithm by Shimizu et al. (2006). This is done with the lingam function from the R-package pcalg.
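The sampling scheme of this setup can be summarized in a short sketch. The following Python code is only an illustration of the stated ranges and probabilities, not the R code used for the study; all function names are hypothetical. It assumes that "normalized" for the t-distribution means scaled to unit variance, and it reports the variance implied by each drawn error parameter using the closed forms s²π²/3 for a logistic distribution with scale s and a²/3 for a uniform distribution on [−a, a].

```python
import math
import random

ERROR_DISTS = ["normal", "logistic", "uniform", "t5"]
ERROR_PROBS = [1 / 2, 1 / 6, 1 / 6, 1 / 6]


def draw_setting(rng):
    """Draw one simulation setting from the ranges stated in the setup."""
    error_dist = rng.choices(ERROR_DISTS, weights=ERROR_PROBS)[0]
    return {
        "n_nodes": rng.choice([10, 20, 50, 100]),
        "neighborhood": rng.choice([2, 3, 4, 5]),
        "graph_type": rng.choice(["erdos-renyi", "power-law"]),
        "error_dist": error_dist,
        "size_x": rng.choices([1, 2, 3], weights=[1 / 2, 1 / 4, 1 / 4])[0],
        "sample_size": rng.choice([125, 500, 2000, 10000]),
        # GES (CPDAG) for normal errors, LiNGAM (DAG) otherwise.
        "graph_estimator": "GES" if error_dist == "normal" else "LiNGAM",
    }


def draw_error_variance(error_dist, rng):
    """Draw a node's error parameter and return the variance it implies."""
    if error_dist == "normal":
        return rng.uniform(0.5, 1.5)          # variance drawn directly
    if error_dist == "logistic":
        scale = rng.uniform(0.4, 0.7)
        return scale ** 2 * math.pi ** 2 / 3  # logistic variance
    if error_dist == "uniform":
        a = rng.uniform(1.2, 2.1)
        return a ** 2 / 3                     # variance of U[-a, a]
    if error_dist == "t5":
        # Assumption: unit-variance t_5 multiplied by sqrt(v).
        return rng.uniform(0.5, 1.5)
    raise ValueError(f"unknown error distribution: {error_dist}")


def draw_edge_coefficient(rng):
    """Uniform draw on [-2, -0.1] U [0.1, 2] via a random sign."""
    magnitude = rng.uniform(0.1, 2.0)
    return magnitude if rng.random() < 0.5 else -magnitude


rng = random.Random(0)
setting = draw_setting(rng)
print(setting, draw_error_variance(setting["error_dist"], rng))
```

Drawing the sign and the magnitude separately is one simple way to sample from a union of two symmetric intervals of equal length.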
We then proceed to estimate the total effect of X on Y via covariate adjustment. We consider the graphical adjustment sets Adjust(X, Y, G), pa(X, G) \ forb(X, Y, G) and O(X, Y, G), computing them both from the true DAG D and the graph estimate G. Further, we consider the non-graphical empty set. The sets Adjust(X, Y, G) and O(X, Y, G) are guaranteed to be valid adjustment sets and the set pa(X, G) \ forb(X, Y, G) is guaranteed to be a valid adjustment set for the case |X| = 1, but not if |X| > 1. The empty set is generally not a valid adjustment set. We then estimate the total effect in the following manner:
1. When the considered adjustment set Z is computed from D, $\hat{\tau}_{yx}^{z} = \hat{\beta}_{yx.z}$.
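2. When the considered adjustment set Z is computed from G,
$$\hat{\tau}_{yx}^{z} = \begin{cases} 0, & \text{if } Y \notin \mathrm{possde}(X, G), \quad (21) \\ \text{NA}, & \text{if no valid adjustment set relative to } (X, Y) \text{ exists in } G, \quad (22) \\ \hat{\beta}_{yx.z}, & \text{else.} \end{cases}$$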
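As an illustration of the two procedures, the following Python sketch estimates a total effect as the coefficient of X in an ordinary least squares regression of Y on X and the adjustment set, and applies the rule for estimated graphs: return 0 when Y is not a possible descendant of X and NA (here `None`) when no valid adjustment set exists. It is a minimal sketch for a single treatment variable, not the code used in the study. The function names, the two boolean flags (which would come from graph queries that are not implemented here), the order in which the flags are checked and the toy model with true total effect 0.5 are all assumptions made for illustration.

```python
import random


def ols_coefficients(rows, y):
    """Least-squares fit of y on an intercept plus the columns in rows."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented normal equations [X'X | X'y].
    a = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * yi for d, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def adjustment_estimate(x, y, z_columns,
                        y_possible_descendant=True, valid_set_exists=True):
    """Coefficient of x when regressing y on x and the adjustment set."""
    if not y_possible_descendant:
        return 0.0   # total effect on a non-descendant is 0
    if not valid_set_exists:
        return None  # plays the role of "NA"
    rows = [[xi] + [col[i] for col in z_columns] for i, xi in enumerate(x)]
    return ols_coefficients(rows, y)[1]  # index 0 is the intercept


def simulate_confounded(n, seed):
    """Toy model (hypothetical): Z -> X, Z -> Y, X -> Y with effect 0.5."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [0.5 * xi + 1.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z


x, y, z = simulate_confounded(5000, seed=1)
print("adjusting for Z:", adjustment_estimate(x, y, [z]))
print("empty set:      ", adjustment_estimate(x, y, []))
```

In this toy model the empty set is not a valid adjustment set, so the second estimate is biased away from 0.5, while adjusting for Z recovers the total effect up to sampling error.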
For comparability, we also treat the non-graphical empty set as if it were graphical, coming from both D and G. Accordingly, we estimate total effects with both procedure (1) and (2). In total, we thus obtain 8 total effect estimates. The difference between procedure (1) and (2) arises from the fact that (X, Y) is sampled in a manner ensuring that the two cases (21) and (22) do not occur for the true DAG and the corresponding true CPDAG. The decision to return 0 in (21) is based on the fact that the total effect on a non-descendant is 0. Since it affects all estimates with respect to G equally, it has the effect of making their average output more similar. We chose to return "NA" in (22), effectively discarding it, as in this case no valid adjustment set exists. When this occurs, we recommend the use of alternative total effect estimators such as the IDA algorithm by Maathuis et al. (2009) and the jointIDA algorithm by Nandy et al. (2017).

Mean squared error computation: For each graph G, associated causal linear model and each of the corresponding 100 data sets, we compute 8 different estimates of $\tau_{yx}$ as explained above. We compute the empirical mean squared error with respect to the true total effect of each estimator over these 100 estimates, where we look at each $(\tau_{yx})_i$ for X_i ∈ X separately. Here, the true total effect of X on Y is computed from the path coefficients of the causal linear model.

Figure 15: Boxplots of the ratios of the mean squared errors provided by O(X, Y, G) and the three alternative adjustment sets ∅, pa(X, G) \ forb(X, Y, G) and Adjust(X, Y, G), denoted respectively by "em", "pa" and "adj", as a function of sample size, expected neighborhood size, graph size, size of X, error distribution and graph type. The cases where the true causal DAG is used are given on the left and the ones where the causal graph is estimated on the right.

Figure 16: Average size of the adjustment sets pa(X, G) \ forb(X, Y, G), O(X, Y, G) and Adjust(X, Y, G), denoted "pa", "O" and "adj" respectively, in the true causal DAG and the estimated graphs as a function of sample size, expected neighborhood size, graph size, size of X, error distribution and graph type.

Figure 17: The average percentage of estimated causal graphs G that do not have a valid adjustment set relative to (X, Y), denoted "no VAS", or where Y ∉ possde(X, G), denoted "Y not desc.", as a function of sample size, expected neighborhood size, graph size, size of X, error distribution and graph type.
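The mean squared error computation described in the setup can be sketched as follows. For a single treatment node in a causal linear model, the total effect equals the sum over all directed paths from X to Y of the products of the edge coefficients along each path. The Python code below is a hedged illustration restricted to this single-node case; the function names, the dictionary encoding of the DAG and the example coefficients are hypothetical.

```python
def total_effect(coef, source, target):
    """Sum over directed paths source -> ... -> target of coefficient products.

    coef maps each node to a dict {child: edge coefficient}; the graph is
    assumed to be acyclic, so the recursion terminates.
    """
    if source == target:
        return 1.0
    return sum(
        weight * total_effect(coef, child, target)
        for child, weight in coef.get(source, {}).items()
    )


def empirical_mse(estimates, truth):
    """Mean squared error of the estimates with respect to the true effect."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


def mse_ratio(estimates_o, estimates_alt, truth):
    """MSE of the O-set estimates divided by the MSE of an alternative set."""
    return empirical_mse(estimates_o, truth) / empirical_mse(estimates_alt, truth)


# Hypothetical example DAG: Z -> X, Z -> Y, X -> Y and X -> M -> Y.
coef = {
    "Z": {"X": 1.0, "Y": 1.5},
    "X": {"Y": 0.5, "M": 2.0},
    "M": {"Y": -1.0},
}
truth = total_effect(coef, "X", "Y")  # 0.5 + 2.0 * (-1.0)
print(truth)
```

A ratio below 1 from `mse_ratio` means that the estimates based on O(X, Y, G) have the smaller empirical mean squared error.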

F.2 Additional results
To understand how different settings impact the performance of each considered adjustment set, Figure 15 shows boxplots of the MSE ratios as a function of sample size, expected neighborhood size, graph size, size of X, error distribution and graph type. This plot reveals some interesting patterns.
When the true DAG is used, the ratios are generally stable across varying settings, with one major exception. The more complex the graph, i.e., the larger the graph and expected neighborhood size, the smaller the MSE ratios. This is probably due to the fact that all three alternative adjustment sets are less likely to be similar to O(X, Y, D) for larger or denser DAGs. Since O(X, Y, D) is guaranteed to be unbiased and to provide the smallest asymptotic variance, this leads to smaller ratios.
When the graph is estimated, some of these effects disappear, since graph estimation is more challenging for larger and denser graphs, especially when the sample size is small. This indicates that O(X, Y, G) is more affected by graph estimation errors than the alternative adjustment sets. Even with these difficulties, there are few especially large ratios, while, except for the comparison with Adjust(X, Y, G), there is a respectable number of ratios smaller than 0.5 in all settings.
The only set competitive with the optimal set is Adjust(X, Y, G). However, the set Adjust(X, Y, G) is also by construction a superset of the optimal set. This is reflected in the average sizes plotted in Figure 16. The average size of Adjust(X, Y, G) is nearly twice the size of O(X, Y, G). Similarly, O(X, Y, G) is on average larger than both pa(X, G) \ forb(X, Y, G) and of course also the empty set. However, while there is a gain in moving from the empty set to pa(X, G) \ forb(X, Y, G) and from there to O(X, Y, G), there is no corresponding gain in any setting from the increased size of Adjust(X, Y, G) compared to O(X, Y, G).

F.3 Issues related to graph estimation
In the course of the mean squared error simulations we had to estimate graphs. As explained in the setup, there are two issues that may arise. Firstly, there may not be a valid adjustment set relative to (X, Y) in the estimated causal graph G, in which case we return "NA" for all adjustment sets. Secondly, it can happen that Y ∉ possde(X, G), in which case we return 0 for all adjustment sets. Figure 17 shows the average percentage of estimated graphs affected by either of the two issues. We see that they occur mostly for i) low sample sizes, ii) normal errors, as we can only estimate a CPDAG in this case, and iii) the case |X| > 1, as only in this case X ∩ forb(X, Y, G) ≠ ∅ may occur (see Corollary A.12).