pandas练习
https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles.ipynb
- 导入pandas
- 将pandas以pd的名字引入
import pandas
- 查看导入的pandas的版本
pd.version
- 打印出pandas库需要的库的所有版本信息
pd.show_versions()
- 数据库基础
一些基础的选择、排序、添加、聚合数据的方法
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
- 创建索引为labels的数据框df
df = pd.DataFrame(data=data,index=labels)
- 展示df基础信息和它的数据
df.describe()
- 返回df的前三行
df.head(3)
- 只选择 'animal' 和'age' 列
df[['animal','age']]
- 选择['animal', 'age']的 [3, 4, 8]行
df.loc[df.index[[3,4,8]],['animal','age']]
- 选择visits大于 2的数据
df.query('visits > 2')
- 选择有缺失值的行, i.e. is NaN.
df[df['age'].isnull()]
- 选择 animal 是cat和age小于3.
df.query('animal=="cat" and age < 3') # 单=是赋值,双==才是判别
- 选择 age is between 2 and 4 (inclusive)的行
df.query('age>= 2 and age<=4')
- 将age里的'f'改成1.5.
df.loc['f','age'] = 1.5
- 计算visits的和 (the total number of visits).
df['visits'].sum()
- 计算每一个动物的平均年龄
df.groupby(by='animal')['age'].mean()
16.添加新的一行 'k' ,数据自己填,然后再将此行删除返回原数据框
df.loc['k'] = [5,'dog',2,'no']
df = df.drop('k')
df
- 对每个动物的数量进行计数
df['animal'].value_counts()
- Sort df first by the values in the 'age' in decending order, then by the value in the 'visit' column in ascending order.
df.sort_values(by=['age','visits'],ascending=[False,True])
- The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
df
- In the 'animal' column, change the 'snake' entries to 'python'.
df['animal']=df['animal'].replace('snake','python')
df
- For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
df.pivot_table(values='age',index='animal',columns='visits',aggfunc='mean')
- 数据框:超越基础
你可能需要结合使用两个或者更多的方法去得到正确的答案
- You have a DataFrame df with a column 'A' of integers. For example:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
How do you filter out rows which contain the same integer as the row immediately above?
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.drop_duplicates(subset='A',keep='first')
- Given a DataFrame of numeric values, say
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
how do you subtract the row mean from each element in the row?
df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1),axis=0)
- Suppose you have DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? (Find that column's label.)
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()
- How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?
len(df.drop_duplicates(keep=False))
- You have a DataFrame that consists of 10 columns of floating--point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
(You should return a Series of column labels.)
(df.isnill().cumsum(axis=1)==3).idxmax(axis=1)
- A DataFrame has a column of groups 'grps' and and column of numbers 'vals'. For example:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
For each group, find the sum of the three greatest values.
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df.groupby(by='grps')['vals'].nlargest(3).sum(level=0)
- A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.
df.groupby(pd.cut(df['A'],b ins=np.arange(0,101,10)))['B'].sum()
29~32 hard部分先pass
- Series and DatetimeIndex
- Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.
time_index = pd.date_range('2015-01-01','2015-12-31',freq='B')
s = pd.Series(np.random.rand(len(time_index)),index=time_index)
s.head()
- Find the sum of the values in s for every Wednesday.
s[s.index.weekday == 2].sum()
- For each calendar month in s, find the mean of values.
s.resample('M').mean()
- For each group of four consecutive calendar months in s, find the date on which the highest value occurred.
s.groupby(pd.Grouper(freq='4M')).idxmax()
- Create a DateTimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.
pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')