Pandas 数据添加清洗截取查询

2019-07-22  本文已影响0人  Bounty_Hunter

Dataframe

DataFrame 是一种二维的数据结构,非常接近于excel形式。它的竖行称之为columns,横行称之为index,也就是说可以通过 columnsindex 来确定一个元素的位置

构建dataframe

pd.dataframe( columns = ('第一列','第二列','第三列') )

生成空的pandas表

 left = pd.DataFrame({'id':[0,1,2],'name':['Tom','Jack','Mary']})
 right = pd.DataFrame({'id':[0,1,2],'year':[23,34,34],'parent_id':[1,2,4]})
 print(left)
 print(right)

数据添加

数据清洗

数据截取

dataframe.iloc[]

"integer-location based indexing" 切片函数,返回一个 Series

参数由想截取行和列组成,

逗号前面的截取的是某行,后面截取的是某列

返回的是series 类型

dataframe.log()

数据查询

Dataframe 有很多筛选数据的功能,数据分析往往都是从对原始数据筛选这一步开始

筛选返回的仍然是dataframe类型

索引查询

  marketing[['Age','Gender']][ (marketing['AmountSpent' ] > 1000) & (marketing['OwnHome'] == 'Own')]

Challenge

Printing select columns and rows

从文件读取数据至 Dataframe,截取目标数据

pd.read_csv()函数和dataframe.iloc[]函数

import pandas as pd
marketing = pd.read_csv('DirectMarketing.csv')
print(marketing.iloc[:10,[2,3,7]])

Value counts

询问用户想找那一列,返回该列的每个唯一值对应的数据个数

input()函数,从dataframe中根据列名抽取 seriesseriesvalue_count()函数

import pandas as pd
marketing = pd.read_csv('DirectMarketing.csv')
name = input('Enter a column name: ')
print(marketing[ name ].value_counts())

Slicing

根据索引截取某几行,从第某列到第某列

dataframe.iloc[]函数

import pandas as pd
marketing = pd.read_excel('DirectMarketing.xlsx') 
print(marketing.iloc[[3,42,67,101],:8])

Query

Outputs the Salary, Catalogs and AmountSpent of the first 10 people where their AmountSpent is less than 1000.

筛选数据中的Salary, Catalogs 和 AmountSpent 列的前10行,条件是AmountSpent 小于 1000

数据筛选

import pandas as pd 
marketing = pd.read_excel('DirectMarketing.xlsx') 
print(marketing[['Salary','Catalogs','AmountSpent']][marketing.AmountSpent < 1000 ][:10])

Query with input

Ask for a salary and then output the salary of the first 5 people with a salary greater than that specified.

要求用户输入薪水,筛选数据所有列的前5行,条件是薪水比用户输入的多

数据筛选

import pandas as pd 
marketing = pd.read_excel('DirectMarketing.xlsx') 
#print(marketing)
num = int(input('Enter a Salary: '))
s = marketing[['Salary']][marketing.Salary > num][:5]['Salary']
print(s)

Adding a new column

loads in the Excel file AmountSpent_13to17.xlsx and add in a column called ' Difference' which contains the amount spent in 2017 - amount spend in 2013. You should print out the first 5 lines (lines 0 - 4 inclusive).

读取文件,在表后面添加一列,并取前五行

在原数据后添加新列df['新列名'] = 数据,因为输出结果是dataframe 因此用df.head(5) 而不是截取函数

import pandas as pd
spent = pd.read_excel('AmountSpent_13to17.xlsx')
spent['Difference'] = spent['2017']-spent['2013']
print(spent.head(5))
上一篇 下一篇

猜你喜欢

热点阅读