Python Visualization: How Do You Handle Data Errors?

2019-02-15 1a076099f916

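What Is Data Error?

In any scientific measurement, accurately quantifying the error is critically important, arguably even more important than the measured value itself.

For example, suppose I use some astronomical observations to estimate the Hubble constant, the ratio between the recession velocity of extragalactic objects relative to Earth and their distance. I know that the currently accepted value is about 71 (km/s)/Mpc, and my own method gives 74 (km/s)/Mpc. Can my measurement be trusted?

With only a single number, there is no way to tell. Now suppose I also know the uncertainties: the accepted value is about 71 ± 2.5 (km/s)/Mpc, and my measurement is 74 ± 5 (km/s)/Mpc. Is my value consistent with the accepted one now?

That question can be answered quantitatively. In the same way, showing the errors effectively in a plot gives the reader much more complete information.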
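One common way to make the comparison quantitative is to express the difference between the two values in units of their combined uncertainty. The sketch below assumes that the quoted ± values are independent 1-sigma Gaussian errors; the article does not say so, and the helper name `tension_sigma` is made up for illustration.

```python
import math

def tension_sigma(value_a, err_a, value_b, err_b):
    """Difference between two measurements in units of their combined 1-sigma error.

    Assumes independent Gaussian errors, so the uncertainties add in quadrature.
    """
    combined = math.hypot(err_a, err_b)  # sqrt(err_a**2 + err_b**2)
    return abs(value_a - value_b) / combined

# My measurement (74 +/- 5) versus the accepted value (71 +/- 2.5), in (km/s)/Mpc
z = tension_sigma(74.0, 5.0, 71.0, 2.5)
print(f"{z:.2f} sigma")  # prints "0.54 sigma"
```

Under these assumptions the two values differ by roughly half a standard deviation, so they would usually be considered consistent.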


Basic Error Bars

A basic error bar plot (errorbar) can be created with a single Matplotlib function call.

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import matplotlib.pyplot as plt
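# Matplotlib 3.6+ names this style 'seaborn-v0_8-whitegrid'; older releases use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')
import numpy as np

x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)

# fmt='.k' draws black dots; yerr=dy gives every point the same vertical error
plt.errorbar(x, y, yerr=dy, fmt='.k')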
</pre>


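Here, fmt is a format code that controls the appearance of lines and points; it uses the same shorthand syntax as plt.plot.

Customizing Error Bars

Beyond these basics, errorbar has many options for fine-tuning the output. With them you can easily customize the look of an error bar plot.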

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0)
</pre>


In addition to these options, you can also specify horizontal error bars (xerr), one-sided error bars, and many other variants.
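As a rough illustration of those variants, the sketch below draws horizontal plus asymmetric vertical error bars on one axes and one-sided bars on another. The data and error sizes are invented for the example. It relies on the fact that `yerr` accepts a `(2, N)` array whose first row holds the lower errors and whose second row holds the upper errors; a one-sided bar is obtained here simply by setting one row to zero.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 12)
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)

# Asymmetric vertical errors: row 0 = distance below each point, row 1 = above
yerr_asym = np.vstack([np.full(x.size, 0.1), np.full(x.size, 0.4)])

# One-sided vertical errors: nothing below the point, 0.4 above
yerr_upper_only = np.vstack([np.zeros(x.size), np.full(x.size, 0.4)])

fig, (ax_both, ax_one) = plt.subplots(1, 2, figsize=(10, 4))

# Horizontal (xerr) and asymmetric vertical (yerr) error bars together
both = ax_both.errorbar(x, y, xerr=0.3, yerr=yerr_asym, fmt='o',
                        color='black', ecolor='lightgray', elinewidth=3, capsize=0)
ax_both.set_title('xerr + asymmetric yerr')

# One-sided error bars
one = ax_one.errorbar(x, y, yerr=yerr_upper_only, fmt='o',
                      color='black', ecolor='lightgray', elinewidth=3, capsize=0)
ax_one.set_title('one-sided yerr')

# Call plt.show() or fig.savefig(...) to view the result
```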

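Continuous Errors

Sometimes you need to show the error on a continuous quantity. Matplotlib has no built-in convenience routine for this, but you can get the result by combining plt.plot with plt.fill_between.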
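Before bringing in a fitted model, the combination can be sketched with a known curve and a made-up uncertainty. Both the curve and `sigma` below are assumptions chosen for illustration: the line comes from `plot`, and the shaded band between `y - 2*sigma` and `y + 2*sigma` comes from `fill_between`.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
y = np.sin(x)
sigma = 0.1 + 0.05 * x  # hypothetical 1-sigma uncertainty that grows with x

fig, ax = plt.subplots()

# Central curve
(line,) = ax.plot(x, y, '-', color='gray')

# Shaded region between the lower and upper bounds (here +/- 2 sigma)
band = ax.fill_between(x, y - 2 * sigma, y + 2 * sigma, color='gray', alpha=0.2)

# Call plt.show() or fig.savefig(...) to view the result
```

A low alpha keeps the band from hiding the curve or any data points drawn on top of it.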

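Demonstrating continuous errors with Gaussian process regression:

We will use a simple Gaussian process regression from the Scikit-Learn API for the demonstration. It is a method for fitting a very flexible nonparametric function to data with a continuous measure of uncertainty. We will not go into Gaussian process regression itself here and will focus on the visualization instead.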

<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from sklearn.gaussian_process import GaussianProcessRegressor

# Define the model and the data to plot
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit.
# Note: the GaussianProcess class used originally (corr='cubic', theta0=1e-2, ...)
# was removed in scikit-learn 0.20. GaussianProcessRegressor with its default
# kernel is used here as a stand-in, so the curve may differ from the original figure.
gp = GaussianProcessRegressor()
gp.fit(xdata[:, np.newaxis], ydata)
xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # 2*sigma ~ 95% confidence interval

# We could pass these values to plt.errorbar, but we do not really want to draw
# 1,000 error bars for 1,000 points. Instead, plt.fill_between with a light
# color shows the continuous error.
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')

# Gray fill with transparency (alpha) set to 0.2
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit, color='gray', alpha=0.2)
</pre>


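The resulting figure gives an intuitive view of how the Gaussian process fit behaves: near the sample points the model is strongly constrained, so the fit error is small and the curve stays close to the true values; far from the sample points the model is unconstrained and the error grows.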
