Causal Effect Estimation

2021-06-29 shudaxu

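Causality & Correlation

"Correlation (prediction, association) does not imply causation." Yet people often add control variables to a regression equation in the hope of estimating a causal effect. Doing so without a causal model that justifies the choice of controls is a mistake. [1]
That is, in general: $P(Y|X) \neq P(Y|do(x))$
Data alone cannot reveal causal relationships. We need prior knowledge to build a causal graph and then run inference on it [1]. Given the same (observationally equivalent) data, different causal graphs lead to different conclusions. Causal inference is therefore not a purely data-driven science; it is a discipline that relies on domain knowledge for modeling. (Thus, causal effects cannot be estimated from the data itself without a causal story.)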

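The three basic patterns of D-Separation [2]

Lemma 1: Y as Mediator (chain):
$X \rightarrow Y \rightarrow Z$: given Y, X and Z are conditionally independent (head-to-tail at Y).
Lemma 2: Y as Confounder (fork):
$X \leftarrow Y \rightarrow Z$: given Y, X and Z are conditionally independent (tail-to-tail at Y).
Lemma 3: Y as Collider:
$X \rightarrow Y \leftarrow Z$: given Y, X and Z are conditionally dependent; without conditioning on Y, X and Z are independent (head-to-head at Y).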
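The collider pattern is the least intuitive of the three, so here is a minimal simulation sketch of it. The linear-Gaussian data-generating process, the noise scale, and the trick of approximating "given Y" by a narrow slice of Y values are all illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Collider: X -> Y <- Z, with X and Z generated independently (assumed toy model).
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + z + rng.normal(scale=0.5, size=n)

# Marginally, X and Z are (nearly) uncorrelated.
corr_marginal = np.corrcoef(x, z)[0, 1]

# "Conditioning on Y" is approximated by keeping a narrow slice of Y values.
mask = np.abs(y - 1.0) < 0.1
corr_given_y = np.corrcoef(x[mask], z[mask])[0, 1]

print(f"corr(X, Z)          = {corr_marginal:+.3f}")
print(f"corr(X, Z | Y ~= 1) = {corr_given_y:+.3f}")
```

Within the slice, a large X has to be compensated by a small Z to produce the same Y, which is why a strong negative correlation appears once Y is held (approximately) fixed.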

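The general form of D-Separation and the definition of open/blocked paths [2]

Lemma 1: Take a causal graph and three sets of nodes A, B and C. If every path between A and B (regardless of arrow direction) is blocked given C, then $A \perp B | C$.
Lemma 2: Definition of a blocked path
A path is blocked given C if either of the following holds:
1. The path contains a node Z that is tail-to-tail or head-to-tail (a confounder or a mediator), and $Z \in C$.
2. The path contains a node Z that is head-to-head (a collider), and $Z \not\in C$ and $Z_{desc} \not\in C$ (no descendant of Z is in C either).
Example: $G_s: X \rightarrow Z \leftarrow C \rightarrow Y$
This causal graph implies the following:
1. $X \perp Y$ holds without any conditioning (Z is a collider, so it blocks the path by itself).
2. Conditioning on C keeps them independent: $X \perp Y | C$ (C is a confounder on the path; adding it blocks the path as well, so X and Y remain independent given C).
3. Conditioning on Z makes them dependent: $X \not\perp Y | Z$ (note that X and Y were independent to begin with; observing Z makes them dependent because Z is a collider).
4. Conditioning on both Z and C restores independence: $X \perp Y | C, Z$ (conditioning on the collider Z opens the path between X and C, so we must re-block it by conditioning on another mediator or confounder on the path, here C).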
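The four implications for $G_s$ can be checked mechanically. The sketch below is not taken from the cited posts: it uses the standard "moralized ancestral graph" test for d-separation, and the function name `d_separated` and the parent-list graph encoding are choices made here for illustration.

```python
from itertools import combinations


def d_separated(parents, a, b, given=()):
    """Return True if nodes a and b are d-separated by `given` in the DAG.

    `parents` maps every node to the list of its parents.
    """
    given = set(given)
    # Step 1: keep only a, b, the conditioning set and all of their ancestors.
    keep, stack = set(), [a, b, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # Step 2: moralize (link parents that share a child) and drop directions.
    adj = {node: set() for node in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Step 3: ignore conditioned nodes and search for any path from a to b.
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(adj[node] - seen)
    return True


# G_s: X -> Z <- C -> Y
G_S = {"X": [], "C": [], "Z": ["X", "C"], "Y": ["C"]}
for given in [(), ("C",), ("Z",), ("C", "Z")]:
    print(f"X _||_ Y | {given}: {d_separated(G_S, 'X', 'Y', given)}")
```

The loop prints True, True, False, True, matching implications 1 to 4 in order.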
Use: the conditional independencies that D-Separation reads off the causal graph are testable assumptions, so we can check whether our data agree with them. Therefore, we can check whether our graphical model is consistent with a given dataset by comparing the implied conditional independence with the observed conditional independence.
That is, if the model implies $Y \perp X | Z$, we check whether the following equalities hold:
$\rightarrow P(Y|X,Z) = P(Y|Z)$
$\rightarrow P(Y,X|Z) = P(Y|Z) * P(X|Z)$
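One hedged way to run such a check on data is sketched below for the example graph $G_s$. It assumes linear-Gaussian relationships, under which a zero partial correlation corresponds to conditional independence; for other kinds of data a different conditional-independence test would be needed. The coefficients and the helper `partial_corr` are illustrative choices, not from the original post.

```python
import numpy as np


def partial_corr(a, b, controls=()):
    """Correlation of a and b after regressing out `controls` (plus an intercept)."""
    design = np.column_stack([np.ones_like(a), *controls])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


rng = np.random.default_rng(1)
n = 100_000
# Simulate G_s: X -> Z <- C -> Y (coefficients are arbitrary illustrative values).
x = rng.normal(size=n)
c = rng.normal(size=n)
z = x + c + rng.normal(size=n)
y = 2.0 * c + rng.normal(size=n)

results = {
    "X, Y": partial_corr(x, y),
    "X, Y | C": partial_corr(x, y, [c]),
    "X, Y | Z": partial_corr(x, y, [z]),
    "X, Y | C, Z": partial_corr(x, y, [c, z]),
}
for name, value in results.items():
    print(f"partial corr({name}) = {value:+.3f}")
```

Only the third quantity should be clearly away from zero, which is the pattern the graph predicts. A dataset that violated this pattern would be evidence against the graph.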
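Limitation: can we then use data to pin down the model? We cannot. It is easy to construct causal graphs that behave identically on data; such graphs are called observationally equivalent, and data alone cannot tell them apart. The simpler the model, the more acute the problem. For example, $X \rightarrow Z \rightarrow Y$, $X \leftarrow Z \rightarrow Y$ and $X \leftarrow Z \leftarrow Y$ all imply the same conditional independence, $X \perp Y | Z$, so data alone cannot distinguish them. We therefore need to extend our theory beyond conditional probabilities (through domain or business modeling).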

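Formal definition of the Causal Effect [4]

Everything below is based on the most basic causal graph $G$, which we assume has the following DAG structure:
$X \rightarrow Y$, $X \leftarrow Z \rightarrow Y$ (that is, Z is a confounder).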

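1. The raw $P(Y|X)$ does not directly express a causal effect.
$P(Y|X)$ can be read as the result of interactions among many variables. Some of these are causal, while others are purely observational associations. We can say that any statistically meaningful association is the result of a causal relationship somewhere in the system, but not necessarily of the causal effect of interest $X \rightarrow Y$.

2. Define the causal effect $P(Y|do(x))$.
Definition: $P(Y|do(x))$ describes Y when we intervene on X exogenously. If such an intervention changes Y, that change is the causal effect we care about. This means X must be changed from outside the system, so that the other variables in the system do not contribute to the observed change in Y.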

3. How to obtain $P(Y|do(x))$:

Intervention: intervene to obtain $do(x)$ [4]
Eliminating incoming arrows: remove the edges that define X (the arrows pointing into X). Once they are removed, X becomes exogenous. An intervention is equivalent to eliminating arrows in a causal graph. Writing probabilities in the intervened graph as $P_{intervention}$, we get:
$P(Y|do(x)) := P_{intervention}(Y|X)$ (note that this is a definition, not a derived equality).
Invariant probabilities [4]
a. Our intervention is atomic: nodes that are not descendants of X are unaffected (no side effects). Here $Y$ is a descendant of X and Z is not, so:
$P_{intervention}(Z|X) = P_{intervention}(Z) = P(Z)$
b. The conditional probability of Y given its parents is invariant:
$P_{intervention}(Y|X,Z) = P(Y|X,Z)$
Linking the distributions before and after the intervention through the invariants [4]
$P(Y|do(x)) = P_{intervention}(Y|X)$
$=\sum_{Z} P_{intervention}(Y,Z|X)$ *law of total probability
$=\sum_{Z} P_{intervention}(Y|X,Z) * P_{intervention}(Z|X)$ *chain rule of conditional probability
$=\sum_{Z} P(Y|X,Z) * P(Z)$ *the invariants above

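4. The generalized Adjustment Formula (adjust the pre-intervention distribution to obtain the causal effect):
$P(Y|do(x)) =\sum_{z} P(Y|X,PA=z) * P(PA=z)$, where PA denotes the parents of X.
That is: find the parents PA of X, condition on them to get the conditional probability $P(Y|X,PA=z)$, then take its average weighted by $P(PA=z)$.
In essence, we account for the different values of X's parents and take a weighted average. Computing $P(Y|X)$ directly ignores the parents and can even lead to the opposite conclusion; [4] walks through such an example.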
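To make the difference concrete, here is a small exact calculation on the graph $G$ with binary variables. The probability tables are invented for illustration (they are not the numbers used in [4]); they are chosen so that the naive comparison and the adjusted comparison disagree in sign.

```python
import numpy as np

# Hypothetical probability tables for binary Z (confounder), X (treatment), Y (outcome).
p_z = np.array([0.5, 0.5])                 # P(Z=z)
p_x1_given_z = np.array([0.2, 0.8])        # P(X=1 | Z=z)
p_y1_given_xz = np.array([[0.8, 0.2],      # P(Y=1 | X=0, Z=z)
                          [0.9, 0.3]])     # P(Y=1 | X=1, Z=z)


def p_x_given_z(x):
    """P(X=x | Z=z) as a vector over z."""
    return p_x1_given_z if x == 1 else 1.0 - p_x1_given_z


def observational(x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | x, z) P(z | x)."""
    joint_xz = p_x_given_z(x) * p_z
    p_z_given_x = joint_xz / joint_xz.sum()
    return float(p_y1_given_xz[x] @ p_z_given_x)


def interventional(x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z), the adjustment formula."""
    return float(p_y1_given_xz[x] @ p_z)


for x in (0, 1):
    print(f"x={x}: P(Y=1|X=x)={observational(x):.2f}  "
          f"P(Y=1|do(x))={interventional(x):.2f}")
```

With these tables the observational contrast is 0.42 - 0.68 = -0.26, while the adjusted contrast is 0.60 - 0.50 = +0.10: within each stratum of Z the treatment helps, but treated units are concentrated in the stratum with the worse baseline.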

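5. Another route: obtaining the causal effect through randomization
A fully randomized experiment removes, by design, the arrows from Z into X. The treatment and control groups differ only in X and are otherwise comparable, so X is not determined by any other variable in the system. Randomization is itself an exogenous way of setting X.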

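Backdoor Criterion [5]

Let us generalize the problem above. If we want to obtain $P(Y|do(x))$ from a causal graph, which variables do we need to condition on?

Following the definition of the causal effect: keep the causal paths and block every other open path
The causal effect we care about travels along paths such as $X \rightarrow M \rightarrow Y$ that contain only mediators; these are called causal paths. Effects carried by other paths are not of interest and should be eliminated, so every other open path has to be blocked: we must make sure that all the other non-causal paths (the backdoor paths that have arrows into X) are blocked. For the meaning of "blocked", see the D-Separation definitions above.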

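Lemma 1: Backdoor Criterion
Take sets Y, X and Z in a causal graph. If no descendant of X is in Z, and the nodes in Z block every path between X and Y that contains an arrow into X, then Z satisfies the Backdoor Criterion. Given such a Z, we can compute the causal effect of $X \rightarrow Y$ from the data at hand:
$P(Y|do(x)) = \sum_{z} P(Y|X=x,Z=z)P(Z=z)$
The backdoor criterion tells us which variables we should include and which we should not.
For an example, see [5].
Note that in the example from that post, conditioning on a collider reopens a path that was originally blocked, so controlling for z alone is not enough. (Z appears as a confounder or mediator on a path that was open, and as a collider on a path that was blocked.)
Note 1:
In a causal graph, statistical information ignores arrow direction: the statistical information flows freely in the Graph, regardless of the direction of the arrows.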
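The criterion can be turned into a small checker. The sketch below relies on a standard equivalent formulation (Z must contain no descendant of X, and Z must d-separate X and Y once the arrows leaving X are deleted). The helper names and the second example graph are hypothetical and are not the example from [5].

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, a, b, given):
    """d-separation test via the moralized ancestral graph."""
    keep = ancestors(parents, {a, b} | set(given))
    adj = {node: set() for node in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    seen, stack = set(given), [a]   # conditioned nodes are never expanded
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return True


def satisfies_backdoor(parents, x, y, adjustment):
    """Check the backdoor criterion; `parents` must list every node as a key."""
    adjustment = set(adjustment)
    # (i) No member of the adjustment set may be a descendant of x.
    descendants = {n for n in parents if n != x and x in ancestors(parents, [n])}
    if adjustment & descendants:
        return False
    # (ii) Delete the arrows leaving x, so only backdoor paths can connect x and y.
    pruned = {n: [p for p in ps if p != x] for n, ps in parents.items()}
    return d_separated(pruned, x, y, adjustment)


# The basic graph G: X -> Y, X <- Z -> Y.
G = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}
# A hypothetical graph with a collider M: X <- A -> M <- B -> Y, plus X -> Y.
G_M = {"A": [], "B": [], "M": ["A", "B"], "X": ["A"], "Y": ["X", "B"]}

print(satisfies_backdoor(G, "X", "Y", {"Z"}))         # adjust for the confounder
print(satisfies_backdoor(G, "X", "Y", set()))         # no adjustment
print(satisfies_backdoor(G_M, "X", "Y", {"M"}))       # collider only
print(satisfies_backdoor(G_M, "X", "Y", {"M", "A"}))  # collider plus A
```

In the second graph, adjusting for the collider M alone opens the path through A and B, while adding A blocks it again, which is the situation the note above describes.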

In Practice

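Scenarios [5]
1. Suppose we want to decide whether smoking X causes lung cancer Y, and we parameterize the effect of X on Y by $\theta$, which we then need to estimate. (This is Causal Inference.)
2. Given a set of variables, discover the causal relationships among them. (This is Causal Discovery. By the observational-equivalence argument above, it cannot in general be solved from observational data alone.)
Conclusion: our focus is therefore Causal Inference, that is, estimating $\theta$:
$\theta = E(Y_1) - E(Y_0)$
$= E(Y|do(x_1)) - E(Y|do(x_0))$ (Causality)
$\neq E(Y|X=x_1) - E(Y|X=x_0)$ (Prediction)
In other words, prediction is not causality, so prediction techniques do not directly give us causal relationships.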
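Methods for estimating $\theta$
1. Run a randomized experiment on X.
Then $Y_1, Y_0 \perp X$, so $\theta = E(Y|X=x_1) - E(Y|X=x_0)$.
That is, under randomization, Correlation = Causality. If X is continuous, we can model $E(Y(x))=E(Y|X=x)$ directly with a standard regression model.
2. Adjust for confounders.
A fully randomized experiment on X is not always possible (for example, when the sample is small, or in an observational study). In that case we can use the Backdoor Criterion above to find the set Z we need to adjust for.
Given Z: $Y_1, Y_0 \perp X | Z$
That is: $E(Y|do(x_1)) = \int_z E(Y|X=x_1,Z=z)p(z) dz$
So $\theta = \int_z E(Y|X=x_1,Z=z)p(z) dz - \int_z E(Y|X=x_0,Z=z)p(z) dz$
Estimating $\theta$ thus reduces to estimating the regression function $E(Y|X,Z)$ and averaging it over $p(z)$.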
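Both routes to $\theta$ can be compared in a simulation where the true effect is known. The sketch below assumes a toy structural model (binary X and Z, a linear outcome, true effect 1.0); all of the numbers are illustrative assumptions rather than anything from the cited notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
theta_true = 1.0


def outcome(x, z, rng):
    # Assumed structural equation for Y: the effect of X is theta_true.
    return theta_true * x + 2.0 * z + rng.normal(size=x.shape[0])


z = rng.binomial(1, 0.5, size=n)

# (1) Randomized experiment: X is a coin flip, independent of Z.
x_rand = rng.binomial(1, 0.5, size=n)
y_rand = outcome(x_rand, z, rng)
theta_randomized = y_rand[x_rand == 1].mean() - y_rand[x_rand == 0].mean()

# (2) Observational data: Z influences X, so Z confounds X -> Y.
x_obs = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y_obs = outcome(x_obs, z, rng)
theta_naive = y_obs[x_obs == 1].mean() - y_obs[x_obs == 0].mean()

# Backdoor adjustment: within-stratum contrasts, weighted by P(Z=z).
theta_adjusted = 0.0
for value in (0, 1):
    stratum = z == value
    contrast = (y_obs[stratum & (x_obs == 1)].mean()
                - y_obs[stratum & (x_obs == 0)].mean())
    theta_adjusted += contrast * stratum.mean()

print(f"randomized: {theta_randomized:.3f}")
print(f"naive:      {theta_naive:.3f}")
print(f"adjusted:   {theta_adjusted:.3f}")
```

The randomized and adjusted estimates should land near 1.0, while the naive observational contrast is inflated because treated units are more likely to have Z = 1.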

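Choosing an estimator

A. Be especially careful when using nonparametric methods (methods that make no parametric assumption about the distribution [7]; a neural network is a nonparametric regression, and linear regression is a restricted case of nonparametric regression [8]). Unlike ordinary prediction, the goal here is not a bias-variance trade-off (in prediction, bias and variance are equally important, but not here [6]). Instead we want to reduce bias as much as possible, because bias has the more serious consequences. There is a line of research on this called semiparametric inference [9].
B. In particular, when $E(Y|X,Z)$ is linear, we can analyze it directly with linear regression: In a linear regression, the coefficient in front of x is the causal effect of x if (i) the model is correct and (ii) all confounding variables are included in the regression. [6]
TODO: Semiparametric Estimator [6][9]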
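Point B can be illustrated with ordinary least squares on simulated data. This is a sketch under an assumed linear model with a single confounder Z and a true effect of 2.0; the coefficients and the helper `ols` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Assumed linear structural model: Z confounds X and Y.
z = rng.normal(size=n)
x = 1.5 * z + rng.normal(size=n)
y = 2.0 * x - 3.0 * z + rng.normal(size=n)   # true causal effect of X is 2.0


def ols(columns, target):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


coef_with_z = ols([x, z], y)[1]      # confounder included
coef_without_z = ols([x], y)[1]      # confounder omitted

print(f"coefficient of x, Z included: {coef_with_z:.3f}")
print(f"coefficient of x, Z omitted:  {coef_without_z:.3f}")
```

With Z in the regression, the coefficient of x recovers the causal effect; with Z omitted it is badly biased, which is condition (ii) failing.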

Cautions: how this differs from prediction [6]

Prediction: Predict Y after observing X = x
Causation: Predict Y after setting X = x.
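From the Backdoor Criterion above, the correct way to estimate the effect of $do(x)$ is:
$E(Y|do(x_1)) = \int_z E(Y|X=x_1,Z=z)p(z) dz$ (1)
Whereas in prediction:
$E(Y|X=x_1) = \int_z \int_y y \, p(y,z|X=x_1) \, dy \, dz$
$=\int_z E(Y|X=x_1,Z=z)p(z|X=x_1) dz$ (2)
Note the difference between (1) and (2): (1) weights by $p(z)$, while (2) weights by $p(z|X=x_1)$. Under the Backdoor Criterion, Z and X are clearly not independent in general, so the two are not equal.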

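How confounding shows up in a big-data setting

A common question in production settings: when big data covers every aspect of behavior and we can seemingly describe everything with features, does confounding still exist, and if so, in what form?
The answer is simple: unobserved features. For example, suppose our features $X$ include a user's browsing and click history, and there is a feature we do not observe, such as the user's recent financial situation $Z$. Clearly $Z$ affects the user's click behavior $Y$ and also the historical feedback features $X$. Features like this usually go unobserved, so our estimates potentially carry confounding bias. (Because most systems are closed-loop ecosystems, these biases also contribute, to some extent, to homogenized recommendations, the Matthew effect, and so on.)
Feedback Loop Amplifies Biases [10]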

References
[1] Causal Effect:
Index: https://david-salazar.github.io/post/
See: https://david-salazar.github.io/2020/07/22/causality-invariance-under-interventions/
See: https://blog.csdn.net/wangyf112/article/details/109347121#d-%E5%88%86%E5%89%B2%EF%BC%9A%E4%B8%AD%E6%96%AD%E4%BF%A1%E6%81%AF%E7%9A%84%E6%B5%81%E5%8A%A8
An earlier, rough discussion of bias: Causal Bias

[2] D-Separation
See: https://david-salazar.github.io/2020/07/18/causality-bayesian-networks/
Brief version: https://blog.csdn.net/u014717398/article/details/53559247

[4] Causal Effect: defining the intervention
See: https://david-salazar.github.io/2020/07/22/causality-invariance-under-interventions/
Brief version: https://blog.csdn.net/wangyf112/article/details/109482192

[5] Backdoor Criterion
See: https://david-salazar.github.io/2020/07/25/causality-to-adjust-or-not-to-adjust/
Brief version: https://blog.csdn.net/wangyf112/article/details/109661332

[6] Causal Inference CMU
http://www.stat.cmu.edu/~larry/=stat401/Causal.pdf
For the choice of estimator, see section 2.1. In prediction, bias and variance are equally important: optimizing the loss function in effect optimizes bias and variance together. In causal estimation they are not.

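[7] In a small-sample A/B test, we can combine randomized A/A experiments with stratified significance tests, checking within each stratum whether $Y_A, Y_B$ differ significantly.
Interventional distribution: Identification of Conditional Interventional Distributions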

[8]Nonparametric_regression
[Linear regression] is a restricted case of nonparametric regression where $f(x)$ is assumed to be affine.
https://en.wikipedia.org/wiki/Nonparametric_regression

[9]Semiparametric_model
https://en.wikipedia.org/wiki/Semiparametric_model

[10]
Feedback Loop and Bias Amplification in Recommender Systems