使用Pandas熟悉数据

2019-04-12 本文已影响0人随侯珠

The first step in any machine learning project is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command

任何机器学习的项目开始第一步都是让你自己熟悉数据。Pandas将是你需要使用的模块。Pandas是在数据科学领域探索和操作数据最重要的工具，多数人在他们代码里将pandas缩写为'pd'，我们在命令行这样写：

import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.

Pandas模块最重要的部分是DataFrame，一个DataFrame可以想象成一张表，用来保存数据。这跟Excel里的sheet和数据库的table类似。

Pandas has powerful methods for most things you'll want to do with this type of data.

Pandas拥有强大的方法几乎能解决所有你处理数据可能用到的事。

As an example, we'll looking at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.

举个例子，我们来看一下澳大利亚·墨尔本的房价数据。在这个实际操作里，对于新的数据集，例如爱荷华州的房价处理也可以使用相同的处理过程

The example (Melbourne) data is at the file path ../input/melbourne-housing-snapshot/melb_data.csv.

墨尔本的样例数据在路径../input/melbourne-housing-snapshot/melb_data.csv下

We load and explore the data with the following commands:

我们用如下的命令来加载和探索数据：

输入：

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

输出:

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	13580.000000	1.358000e+04	13580.000000	13580.000000	13580.000000	13580.000000	13518.000000	13580.000000	7130.000000	8205.000000	13580.000000	13580.000000	13580.000000
mean	2.937997	1.075684e+06	10.137776	3105.301915	2.914728	1.534242	1.610075	558.416127	151.967650	1964.684217	-37.809203	144.995216	7454.417378
std	0.955748	6.393107e+05	5.868725	90.676964	0.965921	0.691712	0.962634	3990.669241	541.014538	37.273762	0.079260	0.103916	4378.581772
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1196.000000	-38.182550	144.431810	249.000000
25%	2.000000	6.500000e+05	6.100000	3044.000000	2.000000	1.000000	1.000000	177.000000	93.000000	1940.000000	-37.856822	144.929600	4380.000000
50%	3.000000	9.030000e+05	9.200000	3084.000000	3.000000	1.000000	2.000000	440.000000	126.000000	1970.000000	-37.802355	145.000100	6555.000000
75%	3.000000	1.330000e+06	13.000000	3148.000000	3.000000	2.000000	2.000000	651.000000	174.000000	1999.000000	-37.756400	145.058305	10331.000000
max	10.000000	9.000000e+06	48.100000	3977.000000	20.000000	8.000000	10.000000	433014.000000	44515.000000	2018.000000	-37.408530	145.526350	21650.000000

Interpreting Data Description 数据含义描述

The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.

结果显示了原始数据集每一列都有8行数据

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

会有很多种原因导致数据丢失。举例来说，当调查一个只有一间卧室的房子时，就不能采集到第二间卧室的大小数据。我们晚些会有专题介绍有关数据丢失的。

The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.

第二个数据是 mean，也就是平均值，在这之下std是标准偏差，用于度量值在数值上的分布情况。

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.

为了说明min, 25%, 50%, 75%, max这些值的含义，想象一下将每一列的数据按照从低到高排成一个列表，第一个数据就是min(最小值)，当你在列表中前进1/4时，你将会找到一个大于25%和小于75%的值，这个值就代表25%，50%和75%按照相同的方式定义，max就是最大值

Your Turn 轮到你了

Get started with your first coding exercise

开始你的第一次编程练习

使用Pandas熟悉数据

Interpreting Data Description 数据含义描述

Your Turn 轮到你了

猜你喜欢

热点阅读