Stata: How to Handle Interaction Terms Involving Endogenous Variables?

2019-06-22  stata连享会

Author: 崔娜 (Cui Na), Xi'an International Studies University

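1. Background

In empirical work, econometric models frequently contain endogenous variables. What does an endogeneity problem do to the estimates?

Identifying the causal relationship between two variables requires the explanatory variable to be exogenous. In many cases, however, the explanatory and dependent variables are both driven by unobserved factors, or the dependent variable feeds back into the explanatory variable. The explanatory variable then fails the exogeneity requirement, and its estimated coefficient is biased.

For example, holding everything else constant, demand for umbrellas rises when the rainy season arrives. Facing this anticipated positive demand shock, an umbrella seller decides to raise its price to capture the gain. Because demand is strong, the firm's sales volume still increases substantially.

What one observes, then, is that umbrella prices and sales move in the same direction. An empirical analysis that ignores the seasonal shock would wrongly conclude that raising the price increases sales.

So if the seasonal demand shock is not controlled for, the price elasticity estimated by ordinary least squares (OLS) is biased (overestimated in this example). In addition, the residuals are mis-estimated, so the model's goodness of fit can no longer be trusted (here the residual variance is underestimated, which inflates the fit).

In practice, demand shocks of this kind are hard to control for. Data are limited, and adding enough control variables is difficult, so we turn to instrumental variables (IV) to deal with endogeneity.

Further, how should one proceed when the model contains an interaction term involving an endogenous variable? When studying a moderation effect, for instance, the marginal effect of the key explanatory variable on the dependent variable depends on the value of a third variable, which usually calls for an interaction term. This article focuses on how to handle interaction terms that involve endogenous variables.

Note: The coefficient is biased because the assumption that the explanatory variable is uncorrelated with the error term fails, so the Gauss-Markov theorem, under which OLS is the best (minimum-variance) linear unbiased estimator, no longer applies. A firm that sets its price in response to seasonal demand makes the explanatory variable (price) correlated with the demand shock. Because the uncontrolled demand shock is absorbed into the error term, price is correlated with the error term.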
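
To make the note concrete, here is a minimal simulation sketch of the umbrella example. Everything in it is hypothetical: the variable names (shock, cost, price, sales), the coefficients and the sample size are invented for illustration, and cost is a valid instrument only because the simulated data are built that way.

```stata
* Hypothetical simulation of the umbrella example (all numbers are made up)
clear
set seed 12345
set obs 1000
gen shock = rnormal()                            // unobserved seasonal demand shock
gen cost  = rnormal()                            // cost shifter, unrelated to demand
gen price = 10 + 1.5*shock + cost + rnormal()    // the firm raises its price when demand is high
gen sales = 100 - 2*price + 5*shock + rnormal()  // true price effect is -2

reg sales price                        // shock omitted: price coefficient is biased upward
reg sales price shock                  // shock controlled: coefficient close to -2
ivregress 2sls sales (price = cost)    // cost as instrument: coefficient close to -2
```

The first regression mimics the situation in the text, where the shock sits in the error term; the other two show the two remedies discussed here, controlling for the shock and instrumenting the price.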


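2. Models with interaction terms involving endogenous variables

2.1 Only one variable in the interaction term is endogenous

The model is specified as:

y=\beta_{0}+\beta_{1} x+\beta_{2} w+\beta_{3} xw+\nu \quad(1)

Here y is the dependent variable, x and w are explanatory variables, and xw is the interaction of x and w. Suppose x is endogenous and w is exogenous. Because x is endogenous, the interaction xw is endogenous as well.

In this case, if z is a valid instrument for x, then zw is a valid instrument for the interaction xw. A valid instrument is one that satisfies both the relevance condition and the exclusion restriction.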
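
Why zw inherits exogeneity from z can be sketched in one line. The sketch rests on an assumption slightly stronger than the one stated in the text: that the error has conditional mean zero given both z and w, E[\nu | z, w] = 0, rather than being merely uncorrelated with each of them.

```latex
\mathrm{E}[zw\,\nu]
  = \mathrm{E}\big[\, zw \,\mathrm{E}[\nu \mid z, w] \,\big]
  = \mathrm{E}[\, zw \cdot 0 \,]
  = 0
```

The relevance of zw for xw does not follow from this argument and still has to be checked in the first-stage regressions.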

The model is estimated by two-stage least squares (2SLS):

First stage:

x=\gamma_{0}+\gamma_{1} z+\gamma_{2} zw+\gamma_{3} w+\nu^{x}

xw=\eta_{0}+\eta_{1} z+\eta_{2} zw+\eta_{3} w+\nu^{x w}

Second stage: substitute the first-stage fitted values \widehat{x} and \widehat{xw} into the original model and estimate it by OLS.

y=\vartheta_{0}+\vartheta_{1} \widehat{x}+\vartheta_{2} w+\vartheta_{3} \widehat{xw}+\varepsilon

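2.2 Both variables in the interaction term are endogenous

Model (2) is specified as:

y=\alpha_{0}+\alpha_{1} x_{1}+\alpha_{2}x_{2}+\alpha_{3}x_{1}x_{2}+\alpha_{4}w+v \quad(2)

Here y is the dependent variable, x_{1}, x_{2} and w are explanatory variables, and x_{1}x_{2} is the interaction of x_{1} and x_{2}.

We can distinguish the following cases: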


3. Stata in practice

3.1 Entering the data

clear
input float(y x x2 w) int(z z2) byte z3
89.1  99.6  96.7  101  12  28  1
99.2 102.6  98.1 100.1  15  35  2
  99 125.6  100  100  17  37  3
  100 130.1 104.9  90.6  22  42  4
111.6 135.6 104.9  86.5  36  47  5
122.2 142.2 109.5  89.7  45  51  6
117.6 157.6 110.8  90.6  66  56  7
121.1 125.2 112.3  82.8  89  60  8
  136  136 109.3  70.1  99  65  9
154.2 154.2 105.3  65.4 118  69 10
153.6 153.6 101.7  61.3 134  74 11
158.5 155.5  95.4  62.5 151  78 12
140.6 140.7  96.4  63.6 167  83 13
136.2 176.2  97.6  52.6 184  87 14
  168 185.8 102.4  59.7 200  92 15
154.3 186.3 101.6  59.5 217  96 16
  149  189 103.8  61.3 233 101 17
end

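Continuing with the assumptions of model (1), we first handle the case where only one variable in the interaction is endogenous. The interaction of the endogenous variable x and the exogenous variable w is xw, written c.x#c.w in Stata's factor-variable notation. The c. prefix declares a variable as continuous; it does not mean-center it. Centering the two variables before forming the product is optional and has to be done by hand: it changes the interpretation of the coefficients on x and w but leaves the coefficient on the interaction unchanged.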
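
The following sketch illustrates what c.x#c.w does and what manual centering would look like. It assumes the data from section 3.1 are in memory; the names xw_manual, x_c and w_c are made up for this illustration and are not used elsewhere in the article.

```stata
* c.x#c.w is simply the product of the two continuous variables
gen double xw_manual = x*w
reg y x w xw_manual            // same coefficients as: reg y x w c.x#c.w

* optional mean-centering, done by hand (the c. prefix does not do this)
summarize x, meanonly
gen double x_c = x - r(mean)
summarize w, meanonly
gen double w_c = w - r(mean)
reg y c.x_c##c.w_c             // ## adds both main effects and the interaction
```

In the centered regression the coefficient on x_c is the effect of x at the sample mean of w, while the interaction coefficient should match the uncentered one.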

3.2 Estimates ignoring endogeneity

If endogeneity is ignored and the model is estimated directly by OLS, the command and results are:

. reg y x w c.x#c.w

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  3,    13) =   28.54
       Model |  8379.49124     3  2793.16375           Prob > F      =  0.0000
    Residual |  1272.46727    13  97.8820979           R-squared     =  0.8682
-------------+------------------------------           Adj R-squared =  0.8377
       Total |  9651.95851    16  603.247407           Root MSE      =  9.8935

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.3163248   .4977039    -0.64   0.536    -1.391549    .7588991
           w |  -2.034815   .9218448    -2.21   0.046     -4.02634   -.0432906
             |
     c.x#c.w |   .0065936   .0062886     1.05   0.314    -.0069921    .0201793
             |
       _cons |   260.1575   79.66213     3.27   0.006     88.05793    432.2571
------------------------------------------------------------------------------

. est store ols

3.3 Handling endogeneity with instrumental variables

3.3.1 Two-stage estimation by hand with OLS

First stage:

. reg x z c.z#c.w w

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  3,    13) =   14.24
       Model |  8967.24481     3   2989.0816           Prob > F      =  0.0002
    Residual |   2729.3535    13  209.950269           R-squared     =  0.7667
-------------+------------------------------           Adj R-squared =  0.7128
       Total |  11696.5983    16  731.037395           Root MSE      =   14.49

------------------------------------------------------------------------------
           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |   .2768927   .3651089     0.76   0.462    -.5118772    1.065663
             |
     c.z#c.w |   .0006577   .0056221     0.12   0.909     -.011488    .0128034
             |
           w |    .008591   .6070539     0.01   0.989    -1.302869    1.320051
       _cons |   112.1664    58.4022     1.92   0.077    -14.00388    238.3367
------------------------------------------------------------------------------

. predict xhat,xb

and:

. gen xw = x*w
. reg xw z c.z#c.w w

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  3,    13) =    3.32
       Model |  14559695.6     3  4853231.87           Prob > F      =  0.0538
    Residual |    19023954    13  1463381.08           R-squared     =  0.4335
-------------+------------------------------           Adj R-squared =  0.3028
       Total |  33583649.6    16   2098978.1           Root MSE      =  1209.7

------------------------------------------------------------------------------
          xw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |  -13.93205   30.48196    -0.46   0.655    -79.78433    51.92023
             |
     c.z#c.w |   .5283734   .4693708     1.13   0.281    -.4856405    1.542387
             |
           w |   103.5223    50.6813     2.04   0.062    -5.968033    213.0126
       _cons |   746.8227   4875.843     0.15   0.881    -9786.796    11280.44
------------------------------------------------------------------------------

. predict xwhat,xb

Second stage:

. reg y xhat w xwhat 

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  3,    13) =   33.76
       Model |  8554.00892     3  2851.33631           Prob > F      =  0.0000
    Residual |  1097.94959    13  84.4576604           R-squared     =  0.8862
-------------+------------------------------           Adj R-squared =  0.8600
       Total |  9651.95851    16  603.247407           Root MSE      =  9.1901

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        xhat |  -.6431676   .5010309    -1.28   0.222    -1.725579    .4392439
           w |  -2.812121   .9459745    -2.97   0.011    -4.855775   -.7684673
       xwhat |   .0147201   .0072755     2.02   0.064    -.0009977    .0304378
       _cons |   279.2446   81.99616     3.41   0.005     102.1026    456.3865
------------------------------------------------------------------------------

. est store stage2

3.3.2 Estimation with ivregress 2sls

The command ivregress 2sls y w (x c.x#c.w = z c.z#c.w) gives the following results:

. ivregress 2sls y w (x c.x#c.w= z c.z#c.w)

Instrumental variables (2SLS) regression               Number of obs =      17
                                                       Wald chi2(3)  =   82.74
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.8179
                                                       Root MSE      =  10.168

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   -.643167   .5543448    -1.16   0.246    -1.729663    .4433289
             |
     c.x#c.w |   .0147201   .0080497     1.83   0.067     -.001057    .0304971
             |
           w |   -2.81212   1.046634    -2.69   0.007    -4.863486   -.7607545
       _cons |   279.2445   90.72123     3.08   0.002     101.4341    457.0548
------------------------------------------------------------------------------
Instrumented:  x c.x#c.w
Instruments:   w z c.z#c.w

. est sto sls

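After this estimation, estat overid performs the overidentification test (it needs more instruments than endogenous regressors, so it is uninformative for the exactly identified specification above), and estat firststage, all forcenonrobust checks the correlation between the instruments and the endogenous variables.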
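
A minimal sketch of these postestimation commands is shown below. It assumes the data from section 3.1 are in memory and, so that the overidentification test has restrictions to test, it uses the larger instrument set (z, z3 and their interactions with w) that appears in the ivreg2 example later in the article.

```stata
* Overidentified specification: four excluded instruments, two endogenous regressors
ivregress 2sls y w (x c.x#c.w = z z3 c.z#c.w c.z3#c.w)

estat overid                            // Sargan and Basmann overidentification tests
estat firststage, all forcenonrobust    // first-stage fit statistics for each endogenous regressor
```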

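3.3.3 Estimation with ivreg2

The command ivreg2 y w (x c.x#c.w = z c.z#c.w) gives the following results (ivreg2 is a user-written command; if it is not installed, ssc install ivreg2 fetches it):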

. ivreg2 y w (x c.x#c.w= z c.z#c.w)

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =       17
                                                      F(  3,    13) =    21.09
                                                      Prob > F      =   0.0000
Total (centered) SS     =  9651.958509                Centered R2   =   0.8179
Total (uncentered) SS   =  297003.9601                Uncentered R2 =   0.9941
Residual SS             =  1757.595436                Root MSE      =    10.17

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   -.643167   .5543448    -1.16   0.246    -1.729663    .4433289
             |
     c.x#c.w |   .0147201   .0080497     1.83   0.067     -.001057    .0304971
             |
           w |   -2.81212   1.046634    -2.69   0.007    -4.863486   -.7607545
       _cons |   279.2445   90.72123     3.08   0.002     101.4341    457.0548
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):           4.061
                                                   Chi-sq(1) P-val =    0.0439
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                2.040
Stock-Yogo weak ID test critical values: 10% maximal IV size              7.03
                                         15% maximal IV size              4.58
                                         20% maximal IV size              3.95
                                         25% maximal IV size              3.63
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         x c.x#c.w
Included instruments: w
Excluded instruments: z c.z#c.w
------------------------------------------------------------------------------

. est store ivreg2

The first-stage summary statistics are as follows:

Summary results for first-stage regressions
-------------------------------------------
                                           (Underid)            (Weak id)
Variable     | F(  2,    13)  P-val | SW Chi-sq(  1) P-val | SW F(  1,    13)
x            |       3.02    0.0838 |       89.65   0.0000 |       68.56
c.x#c.w      |       2.05    0.1679 |       14.15   0.0002 |       10.82
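
Adding z3 and its interaction with w as further excluded instruments makes the model overidentified:

. ivreg2 y w (x c.x#c.w = z z3 c.z#c.w c.z3#c.w)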

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =       17
                                                      F(  3,    13) =    27.63
                                                      Prob > F      =   0.0000
Total (centered) SS     =  9651.958509                Centered R2   =   0.8645
Total (uncentered) SS   =  297003.9601                Uncentered R2 =   0.9956
Residual SS             =   1307.69705                Root MSE      =    8.771

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.5392977   .4735302    -1.14   0.255      -1.4674    .3888045
             |
     c.x#c.w |    .008479   .0058378     1.45   0.146    -.0029629     .019921
             |
           w |   -2.40654   .8677687    -2.77   0.006    -4.107335   -.7057444
       _cons |   300.7927   77.29763     3.89   0.000     149.2921    452.2932
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):          12.331
                                                   Chi-sq(3) P-val =    0.0063
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                7.263
Stock-Yogo weak ID test critical values:  5% maximal IV relative bias    11.04
                                         10% maximal IV relative bias     7.56
                                         20% maximal IV relative bias     5.57
                                         30% maximal IV relative bias     4.73
                                         10% maximal IV size             16.87
                                         15% maximal IV size              9.93
                                         20% maximal IV size              7.54
                                         25% maximal IV size              6.28
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.734
                                                   Chi-sq(2) P-val =    0.1546
------------------------------------------------------------------------------
Instrumented:         x c.x#c.w
Included instruments: w
Excluded instruments: z z3 c.z#c.w c.z3#c.w
------------------------------------------------------------------------------

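In the results above, the underidentification test (Anderson LM statistic 12.331, p = 0.0063) rejects the null that the model is underidentified, so the instruments are correlated with the endogenous regressors. In the weak identification test, the Cragg-Donald Wald F statistic of 7.263 exceeds the Stock-Yogo critical value for 25% maximal IV size (6.28) but not the one for 20% (7.54), so a Wald test with a nominal 5% level has a true size of at most 25%; the instruments remain fairly weak. The overidentification test gives a Sargan statistic of 3.734 with a p-value of 0.1546, so we cannot reject the null hypothesis that all instruments are exogenous.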

3.4 Comparing estimates with and without accounting for endogeneity

. est table ols stage2 sls ivreg2,b se

------------------------------------------------------------------
    Variable |    ols         stage2        sls         ivreg2    
-------------+----------------------------------------------------
           x | -.31632478                -.64316696   -.64316696  
             |  .49770389                 .55434482    .55434482  
           w | -2.0348154   -2.8121211   -2.8121202   -2.8121202  
             |  .92184485    .94597454    1.0466344    1.0466344  
             |
     c.x#c.w |  .00659358                 .01472006    .01472006  
             |  .00628861                 .00804967    .00804967  
             |
        xhat |              -.64316759                            
             |               .50103093                            
       xwhat |               .01472006                            
             |               .00727549                            
       _cons |  260.15749    279.24458    279.24446    279.24446  
             |  79.662127    81.996163    90.721228    90.721228  
------------------------------------------------------------------
                                                      legend: b/se

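As the table shows, ivregress 2sls (third column) and ivreg2 (fourth column) give identical results. The manual two-step OLS (second column) reproduces the same point estimates to about six significant digits, but its standard errors differ, because the second-stage OLS computes its residuals from the fitted values xhat and xwhat rather than from the observed x and xw; the standard errors in the third and fourth columns are the ones to report. If our instruments are valid, the OLS estimates that ignore endogeneity (first column) are biased, although their standard errors are the smallest.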
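
The gap between the manual second-stage standard errors and those of ivregress 2sls can be reproduced by hand. The sketch below assumes the data from section 3.1 and the fitted values xhat and xwhat from section 3.3.1 are in memory; e_iv, e_iv2 and s2_iv are names invented for this illustration.

```stata
* Manual second stage, as in section 3.3.1
quietly reg y xhat w xwhat

* 2SLS residuals use the observed x and x*w, not the fitted values
gen double e_iv  = y - (_b[_cons] + _b[xhat]*x + _b[w]*w + _b[xwhat]*x*w)
gen double e_iv2 = e_iv^2
quietly summarize e_iv2
scalar s2_iv = r(sum)/r(N)      // residual variance with divisor N, the ivregress default

* Rescale the manual standard error by the ratio of the two residual standard deviations
display "rescaled se(x) = " _se[xhat]*sqrt(s2_iv)/e(rmse)
```

The displayed value should agree with the standard error of x reported in the third and fourth columns of the table, which is why the two-step route is useful for intuition but not for inference.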

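3.5 Estimation with an interaction of two endogenous variables

Corresponding to model (2), the interaction x_{1}x_{2} consists of two endogenous variables. Suppose z_{1} and z_{2} are valid instruments for x_{1} and x_{2}, respectively. Estimating with ivreg2: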

. rename x x1 
. rename z z1
. ivreg2 y w (x1 x2 c.x1#c.x2 = z1 z2 c.z1#c.z2)

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =       17
                                                      F(  4,    12) =     9.49
                                                      Prob > F      =   0.0011
Total (centered) SS     =  9651.958509                Centered R2   =   0.7207
Total (uncentered) SS   =  297003.9601                Uncentered R2 =   0.9909
Residual SS             =  2695.484402                Root MSE      =    12.59

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -13.95301   15.31361    -0.91   0.362    -43.96713    16.06112
          x2 |  -16.57089     19.215    -0.86   0.388     -54.2316    21.08982
             |
   c.x1#c.x2 |    .134195   .1468872     0.91   0.361    -.1536986    .4220886
             |
           w |  -2.219712   1.045534    -2.12   0.034     -4.26892   -.1705032
       _cons |    2024.78   2077.805     0.97   0.330    -2047.643    6097.204
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):           1.146
                                                   Chi-sq(1) P-val =    0.2843
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                0.289
Stock-Yogo weak ID test critical values:                       <not available>
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         x1 x2 c.x1#c.x2
Included instruments: w
Excluded instruments: z1 z2 c.z1#c.z2
------------------------------------------------------------------------------
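
For reference, the first stage implied by this command can be written out in the same way as in section 2.1. Each of the three endogenous regressors is projected on all excluded instruments and the exogenous w; the coefficient symbols \pi and the error terms u are introduced here only for illustration.

```latex
\begin{aligned}
x_{1}      &= \pi_{10}+\pi_{11} z_{1}+\pi_{12} z_{2}+\pi_{13} z_{1}z_{2}+\pi_{14} w+u_{1} \\
x_{2}      &= \pi_{20}+\pi_{21} z_{1}+\pi_{22} z_{2}+\pi_{23} z_{1}z_{2}+\pi_{24} w+u_{2} \\
x_{1}x_{2} &= \pi_{30}+\pi_{31} z_{1}+\pi_{32} z_{2}+\pi_{33} z_{1}z_{2}+\pi_{34} w+u_{3}
\end{aligned}
```

With three excluded instruments for three endogenous regressors the model is exactly identified, which matches the Sargan line in the output above.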

Further, if z_{3} is also an instrument for x_{2}, the instrument set can be extended to the full factorial of z_{1}, z_{2} and z_{3} with the ## operator:

. ivreg2 y w (x1 x2 c.x1#c.x2 = c.z1##c.z2##c.z3)

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =       17
                                                      F(  4,    12) =    20.27
                                                      Prob > F      =   0.0000
Total (centered) SS     =  9651.958509                Centered R2   =   0.8724
Total (uncentered) SS   =  297003.9601                Uncentered R2 =   0.9959
Residual SS             =  1231.520009                Root MSE      =    8.511

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   -2.32377   3.287648    -0.71   0.480    -8.767442    4.119901
          x2 |  -2.578948   4.341722    -0.59   0.553    -11.08857    5.930671
             |
   c.x1#c.x2 |   .0230522   .0318354     0.72   0.469    -.0393441    .0854485
             |
           w |  -1.409116   .3043243    -4.63   0.000    -2.005581   -.8126513
       _cons |   495.6395   459.5336     1.08   0.281    -405.0299    1396.309
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):          10.433
                                                   Chi-sq(5) P-val =    0.0639
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):                1.816
Stock-Yogo weak ID test critical values:  5% maximal IV relative bias    13.95
                                         10% maximal IV relative bias     8.50
                                         20% maximal IV relative bias     5.56
                                         30% maximal IV relative bias     4.44
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.522
                                                   Chi-sq(4) P-val =    0.4746
------------------------------------------------------------------------------
Instrumented:         x1 x2 c.x1#c.x2
Included instruments: w
Excluded instruments: z1 z2 c.z1#c.z2 z3 c.z1#c.z3 c.z2#c.z3 c.z1#c.z2#c.z3
------------------------------------------------------------------------------

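4. Conclusion

  1. Include as many control variables as possible, so that variables correlated with the explanatory variables do not end up in the error term and cause endogeneity; that is, estimate the model with comprehensive data (a data-rich approach). OLS with the omitted variables included gives the smallest standard errors for the explanatory variables, IV estimation gives the largest, and OLS that leaves the omitted variables out and does not address endogeneity falls in between.

  2. If IV is used to address endogeneity, make sure the instruments are valid before running 2SLS. An instrument must satisfy two conditions. First, it must be significantly correlated with the endogenous variable, which can be supported with relevance tests. Second, it must satisfy the exclusion restriction. This exogeneity condition cannot be tested directly, so the literature mostly argues for it on theoretical grounds.

  3. If the model contains an interaction term involving an endogenous variable (as in a moderation effect), one likewise has to argue, before running 2SLS, that both the instrument and its interaction term are valid. In practice, the Stata command ivreg2 can be used; its output also reports a series of tests of instrument validity. In addition, researchers should take particular care when giving the interaction coefficient an economic interpretation, that is, when examining the marginal effect of the explanatory variable on the dependent variable.

  4. If the model serves normative or descriptive purposes, such as causal identification, addressing endogeneity is essential. If the main purpose is prediction, endogeneity is not necessarily a serious problem: especially when the data-generating process is the same in and out of sample, predictions from plain OLS are at least as good as those from endogeneity-corrected estimates ("as well as or better").

  5. IV methods presuppose that an endogenous explanatory variable exists. This can be checked with a Hausman test, whose null hypothesis is that all explanatory variables are exogenous. The Hausman test is itself only valid if the instruments are valid.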
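
As an illustration of point 3, the marginal effect of x in model (1) is \beta_{1}+\beta_{3} w, so it has to be evaluated at a chosen value of w. The sketch below assumes the ivregress results stored under the name sls in section 3.3.2 are still in memory; the values w = 60 and w = 80 are arbitrary choices within the sample range, picked only for illustration.

```stata
* Bring back the 2SLS results stored in section 3.3.2
estimates restore sls

* Marginal effect of x on y at two hypothetical values of w
lincom _b[x] + 60*_b[c.x#c.w]
lincom _b[x] + 80*_b[c.x#c.w]
```

Each lincom call reports the point estimate of the marginal effect together with its standard error and confidence interval, which is more informative than reading the coefficient on x alone.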


5. References

  1. Ebbes, P., D. Papies, H. van Heerde. 2016. Dealing with endogeneity: A nontechnical guide for marketing researchers. Handbook of Market Research, 1-37.
  2. Endogenous Variable in Interaction, Simultaneous Equations, Use of IV
  3. How to do an instrumental variable regression with an instrumented interaction term in Stata
  4. Multiple endogenous variables - now what?!
  5. Chen, Qiang (陈强). 2014. 高级计量经济学及Stata应用 (Advanced Econometrics and Stata Applications), 2nd ed., Chapter 10, p. 153. Beijing: Higher Education Press.
