Model Illusion

2020-12-05, by 水之心

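When building a model, we often derive new features from the known ones and use them to build a more complex model. The more complex the model, however, the more easily it slips into a loop of "self-hypnosis and reinforced bias," which leads to overfitting. If a completely irrelevant variable is added to the model, it still receives a parameter estimate, and that estimate is almost never exactly 0. This is the so-called "model illusion." Model illusion makes the parameter estimates unreliable and, worse, can distort an estimate that was roughly correct into a wrong one, for example by turning a variable's positive effect into an estimated negative effect (a variable is said to have a positive effect when its parameter is positive, and a negative effect otherwise).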
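To make the claim concrete, here is a minimal simulation sketch. It is not from the original article: the function names `fit_with_irrelevant_variable` and `simulate`, the pure-noise regressor, the seed, and the number of trials are all illustrative assumptions. It repeatedly fits y on x plus an unrelated variable using plain NumPy least squares and collects the estimates.

```python
import numpy as np


def fit_with_irrelevant_variable(rng, n=20, true_slope=0.05):
    """Fit y on x and a pure-noise regressor z; return (coef_x, coef_z)."""
    x = np.arange(n) / 2
    y = true_slope * x + rng.standard_normal(n)
    z = rng.standard_normal(n)  # assumption: z is unrelated to y by construction
    design = np.column_stack([x, z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]


def simulate(trials=2000, seed=0):
    """Repeat the fit on fresh data and collect the estimated coefficients."""
    rng = np.random.default_rng(seed)
    return np.array([fit_with_irrelevant_variable(rng) for _ in range(trials)])


if __name__ == "__main__":
    coefs = simulate()
    print("share of z estimates exactly 0:", np.mean(coefs[:, 1] == 0.0))
    print("mean z estimate:", coefs[:, 1].mean())
    print("share of negative x estimates:", np.mean(coefs[:, 0] < 0))
```

Across repeated draws the estimate for z should scatter around 0 without landing exactly on it. A nonzero estimate on its own therefore says nothing about whether the variable matters.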

!pip install statsmodels

import statsmodels.api as sm
import numpy as np
import pandas as pd


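def generateData():
    """
    Generate the data for the model: y = 0.05 * x + noise
    """
    np.random.seed(5320)
    x = np.array(range(0, 20))/2
    error = np.round(np.random.randn(20), 2)
    y = 0.05*x + error
    # Add an irrelevant variable z that is always equal to 1
    # (a constant column, i.e. an intercept whose true value is 0)
    z = np.zeros(20) + 1
    return pd.DataFrame({"x": x, "z": z, "y": y})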


def wrongCoef():
    """
    Adding the new variable turns the estimated positive effect into a negative one
    """
    features = ["x", "z"]
    labels = ["y"]
    data = generateData()
    X = data[features]
    Y = data[labels]
    # Without the extra variable, the sign of x's coefficient is estimated correctly (positive)
    model = sm.OLS(Y, X["x"])
    res = model.fit()
    print("Without the new variable")
    print(res.summary())
    # With the extra variable, the sign of x's coefficient is estimated incorrectly (negative)
    model1 = sm.OLS(Y, X)
    res1 = model1.fit()
    print("With the new variable")
    print(res1.summary())


wrongCoef()

Output:

Without the new variable
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.204
Model:                            OLS   Adj. R-squared (uncentered):              0.162
Method:                 Least Squares   F-statistic:                              4.878
Date:                Sat, 05 Dec 2020   Prob (F-statistic):                      0.0397
Time:                        09:24:08   Log-Likelihood:                         -29.583
No. Observations:                  20   AIC:                                      61.17
Df Residuals:                      19   BIC:                                      62.16
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.0969      0.044      2.209      0.040       0.005       0.189
==============================================================================
Omnibus:                        0.871   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.647   Jarque-Bera (JB):                0.815
Skew:                           0.275   Prob(JB):                        0.665
Kurtosis:                       2.179   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

With the new variable
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.050
Method:                 Least Squares   F-statistic:                   0.09171
Date:                Sat, 05 Dec 2020   Prob (F-statistic):              0.765
Time:                        09:24:08   Log-Likelihood:                -27.982
No. Observations:                  20   AIC:                             59.96
Df Residuals:                      18   BIC:                             61.96
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x             -0.0243      0.080     -0.303      0.765      -0.193       0.144
z              0.7873      0.445      1.768      0.094      -0.148       1.723
==============================================================================
Omnibus:                        0.939   Durbin-Watson:                   2.375
Prob(Omnibus):                  0.625   Jarque-Bera (JB):                0.886
Skew:                           0.338   Prob(JB):                        0.642
Kurtosis:                       2.221   Cond. No.                         11.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
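
As a cross-check that does not depend on statsmodels, the same two fits can be solved directly with NumPy. This is a sketch, not part of the original script: `make_data` and `ols` are hypothetical helper names, and the data are rebuilt with the same seed as `generateData()`. If the random stream reproduces, the printed values should agree with the coef columns above.

```python
import numpy as np


def make_data():
    """Recreate the article's data: y = 0.05 * x + noise, plus a column of ones."""
    np.random.seed(5320)  # same seed as generateData()
    x = np.arange(20) / 2
    error = np.round(np.random.randn(20), 2)
    y = 0.05 * x + error
    z = np.ones(20)
    return x, z, y


def ols(design, y):
    """Least-squares coefficients for y ~ design (no intercept is added automatically)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


x, z, y = make_data()
coef_x_only = ols(x[:, None], y)                 # regression through the origin
coef_x_and_z = ols(np.column_stack([x, z]), y)   # z acts as an intercept
print("x only: ", coef_x_only)
print("x and z:", coef_x_and_z)
```

Because z is a column of ones, the second fit is an ordinary regression with an intercept, so its slope equals cov(x, y) / var(x). With a true slope as small as 0.05 and only 20 noisy points, that quantity can easily come out negative.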

The regression output above may not be very intuitive, so here is a more visual example built from two random walks.

"""
This script demonstrates the model illusion caused by random variables
"""
import numpy as np
import matplotlib.pyplot as plt


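def generate_data(seed, num):
    """
    Generate a random walk of length num: each value is the previous one plus a standard normal step
    """
    x = 0
    np.random.seed(seed)
    data = []
    for _ in range(num):
        x += np.random.normal()
        data.append(x)
    return data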


def visualize_data(series1, series2):
    """
    Plot the two given series side by side
    """
    # Use a font that can render Chinese text in Matplotlib
    plt.rcParams["font.sans-serif"] = ["SimHei"]
    # Render minus signs correctly in Matplotlib
    plt.rcParams['axes.unicode_minus'] = False
    # Create a figure
    fig = plt.figure(figsize=(12, 6), dpi=80)
    # Draw two plots in the figure, one per series
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(series1)
    ax1 = fig.add_subplot(1, 2, 2)
    ax1.plot(series2)
    plt.show()


if __name__ == "__main__":
    series1 = generate_data(4096, 200)
    series2 = generate_data(2046, 200)
    visualize_data(series1, series2)

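Resulting plot: two panels, one per series.

As the plots show, both series come from the same trendless random process, yet different observation windows (ranges of x) suggest two completely different models.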
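
The effect of the observation window can also be put into numbers. The following is an illustrative sketch only: `random_walk` and `window_slopes` are hypothetical helpers, the window length of 50 is an arbitrary choice, and the walk is drawn with NumPy's `default_rng`, so it is not the same path as the series plotted by the script. It fits a straight line to each non-overlapping window of one random walk and prints the slopes.

```python
import numpy as np


def random_walk(seed, num):
    """Cumulative sum of standard normal steps (a random walk with no drift)."""
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.standard_normal(num))


def window_slopes(series, window):
    """Fit a straight line to each non-overlapping window and return the slopes."""
    slopes = []
    for start in range(0, len(series) - window + 1, window):
        segment = series[start:start + window]
        t = np.arange(start, start + window)
        slope, _intercept = np.polyfit(t, segment, deg=1)
        slopes.append(slope)
    return np.array(slopes)


walk = random_walk(seed=4096, num=200)
slopes = window_slopes(walk, window=50)
print(slopes)
```

The walk has no trend by construction, so any slope a window reports reflects where the window happened to fall, not the process that generated the data.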
