DATA ANALYSIS PROCESS

EDA: Identifying Outliers

2018-10-08  IntoTheVoid
Visualizing single variables with histograms

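In the IPython Shell, first call the .describe() method on the 'Existing Zoning Sqft' column to compute its summary statistics. You will notice a huge difference between the min and max values, so the plot needs to be adjusted accordingly. In cases like this, it is better to view the plot on a log scale. The keyword arguments logx=True or logy=True can be passed to .plot().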

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

As you saw here, you still needed to look at the summary statistics to understand your data better. You expected a large number of counts on the left side of the plot because the 25th, 50th, and 75th percentiles are all 0. The plot shows barely any counts near the max value, which suggests an outlier.
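To make the link between the summary statistics and the plot concrete, here is a minimal sketch on made-up data. The Series below is a hypothetical stand-in for the 'Existing Zoning Sqft' column (mostly zeros plus one very large value); the real values in the dataset are not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the 'Existing Zoning Sqft' column:
# nine zeros and one very large value (values invented for illustration).
sqft = pd.Series([0] * 9 + [2_000_000], name='Existing Zoning Sqft')

# Summary statistics: count, mean, std, min, quartiles, max
stats = sqft.describe()
print(stats)

# All three quartiles sit at 0 while the max is far larger,
# which is the pattern that hints at an outlier.
print(stats[['25%', '50%', '75%', 'max']])
```

When the quartiles and the max are this far apart, a linear axis squeezes almost everything into one bin, which is why a log scale helps.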

Visualizing multiple variables with boxplots

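Histograms are a great way to visualize a single variable. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical.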

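Use a boxplot to compare the values of the 'initial_cost' column (a numeric variable) across the different values of 'Borough' (a categorical variable). The pandas .boxplot() method is a quick way to do this; you must specify the column and by parameters. Here, you visualize how 'initial_cost' varies across the categories of 'Borough'.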

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df.boxplot(column='initial_cost',by='Borough', rot=90)

# Display the plot
plt.show()

You can see that the 2 extreme outliers are in the borough of Manhattan. An initial guess could be that since land in Manhattan is extremely expensive, these outliers may be valid data points. Again, further investigation is needed to decide whether to drop or keep those points in your data.
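As one possible starting point for that further investigation, the sketch below flags points that fall outside 1.5 times the interquartile range within each borough, which is the default whisker rule matplotlib boxplots use. The DataFrame df_demo and its values are hypothetical; only the column names are taken from the article.

```python
import pandas as pd

# Hypothetical data: column names match the article, values are invented.
df_demo = pd.DataFrame({
    'Borough': ['MANHATTAN'] * 6 + ['BROOKLYN'] * 5,
    'initial_cost': [1000.0, 1200.0, 1100.0, 900.0, 1300.0, 500000.0,
                     800.0, 950.0, 1000.0, 1050.0, 900.0],
})

# Per-borough quartiles, broadcast back to every row of that borough
grouped = df_demo.groupby('Borough')['initial_cost']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values beyond the 1.5 * IQR whiskers of their own borough
cost = df_demo['initial_cost']
is_outlier = (cost < q1 - 1.5 * iqr) | (cost > q3 + 1.5 * iqr)

# Rows to inspect by hand before deciding to drop or keep them
print(df_demo[is_outlier])
```

Flagging rather than dropping keeps the decision separate: the flagged rows can be checked against domain knowledge (for example, Manhattan land prices) before anything is removed.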

Visualizing multiple variables with scatter plots

When comparing two numeric columns, a scatter plot is the better choice.

# Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

It seems like there is a strong correlation between 'initial_cost' and 'total_est_fee'. In addition, take note of the large number of points that have an 'initial_cost' of 0.
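Both observations can also be checked numerically. The sketch below, on a small hypothetical DataFrame (df_demo and its values are invented; only the column names come from the article), computes the Pearson correlation between the two columns and counts the rows where 'initial_cost' is 0.

```python
import pandas as pd

# Hypothetical data: column names match the article, values are invented.
df_demo = pd.DataFrame({
    'initial_cost': [0.0, 0.0, 0.0, 1000.0, 2000.0, 4000.0],
    'total_est_fee': [100.0, 130.0, 90.0, 210.0, 290.0, 520.0],
})

# Pearson correlation between the two numeric columns
corr = df_demo['initial_cost'].corr(df_demo['total_est_fee'])
print('correlation:', corr)

# Number of rows with an initial cost of exactly 0
n_zero = int((df_demo['initial_cost'] == 0).sum())
print('rows with initial_cost == 0:', n_zero)
```

A high correlation figure alone does not explain the cluster at 0, so counting those rows separately shows how much of the data they account for.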
