pandas练习

2019-01-27 本文已影响0人小T数据站

https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles.ipynb

导入pandas

将pandas以pd的名字引入

import pandas

查看导入的pandas的版本

pd.version

打印出pandas库需要的库的所有版本信息

pd.show_versions()

数据库基础
一些基础的选择、排序、添加、聚合数据的方法

import numpy as  np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

创建索引为labels的数据框df

df = pd.DataFrame(data=data,index=labels)

展示df基础信息和它的数据

df.describe()

返回df的前三行

df.head(3)

只选择 'animal' 和'age' 列

df[['animal','age']]

选择['animal', 'age']的 [3, 4, 8]行

df.loc[df.index[[3,4,8]],['animal','age']]

选择visits大于 2的数据

df.query('visits > 2')

选择有缺失值的行, i.e. is NaN.

df[df['age'].isnull()]

选择 animal 是cat和age小于3.

df.query('animal=="cat" and age < 3') # 单=是赋值，双==才是判别

选择 age is between 2 and 4 (inclusive)的行

df.query('age>= 2 and age<=4')

将age里的'f'改成1.5.

df.loc['f','age'] = 1.5

计算visits的和 (the total number of visits).

df['visits'].sum()

计算每一个动物的平均年龄

df.groupby(by='animal')['age'].mean()
16.添加新的一行 'k' ，数据自己填，然后再将此行删除返回原数据框
df.loc['k'] = [5,'dog',2,'no']
df = df.drop('k')
df

对每个动物的数量进行计数

df['animal'].value_counts()

Sort df first by the values in the 'age' in decending order, then by the value in the 'visit' column in ascending order.

df.sort_values(by=['age','visits'],ascending=[False,True])

The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

In the 'animal' column, change the 'snake' entries to 'python'.

df['animal']=df['animal'].replace('snake','python')
df

For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

df.pivot_table(values='age',index='animal',columns='visits',aggfunc='mean')

数据框：超越基础
你可能需要结合使用两个或者更多的方法去得到正确的答案

You have a DataFrame df with a column 'A' of integers. For example:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
How do you filter out rows which contain the same integer as the row immediately above?

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.drop_duplicates(subset='A',keep='first')

Given a DataFrame of numeric values, say
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
how do you subtract the row mean from each element in the row?

df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1),axis=0)

Suppose you have DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? (Find that column's label.)

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

len(df.drop_duplicates(keep=False))

You have a DataFrame that consists of 10 columns of floating--point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
(You should return a Series of column labels.)

(df.isnill().cumsum(axis=1)==3).idxmax(axis=1)

A DataFrame has a column of groups 'grps' and and column of numbers 'vals'. For example:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
For each group, find the sum of the three greatest values.

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df.groupby(by='grps')['vals'].nlargest(3).sum(level=0)

A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.

df.groupby(pd.cut(df['A'],b ins=np.arange(0,101,10)))['B'].sum()
29~32 hard部分先pass

Series and DatetimeIndex

Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.

time_index = pd.date_range('2015-01-01','2015-12-31',freq='B')
s = pd.Series(np.random.rand(len(time_index)),index=time_index)
s.head()

Find the sum of the values in s for every Wednesday.

s[s.index.weekday == 2].sum()

For each calendar month in s, find the mean of values.

s.resample('M').mean()

For each group of four consecutive calendar months in s, find the date on which the highest value occurred.

s.groupby(pd.Grouper(freq='4M')).idxmax()

Create a DateTimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

pandas练习

猜你喜欢

热点阅读