
pandas exercises

2019-01-27  小T数据站

https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles.ipynb

  1. Import pandas under the name pd

import pandas as pd

  1. Check the version of the imported pandas

pd.__version__

  1. Print the version information of all the libraries that pandas depends on

pd.show_versions()

import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

  1. Create a DataFrame df from data, using labels as its index

df = pd.DataFrame(data=data, index=labels)

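  1. Display a summary of the basic information about df and its data

df.info()      # index, column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns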

  1. Return the first three rows of df

df.head(3)

  1. Select only the 'animal' and 'age' columns

df[['animal','age']]

  1. Select the rows at positions [3, 4, 8] from the ['animal', 'age'] columns

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
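
The same mix of row positions and column labels can also go through `iloc` if the column labels are converted to positions first. A minimal sketch on a cut-down frame (`demo` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the full data set used in the exercises
demo = pd.DataFrame(
    {'animal': ['cat', 'snake', 'dog', 'dog', 'cat'],
     'age': [2.5, 0.5, np.nan, 5.0, 2.0],
     'visits': [1, 2, 3, 2, 3]},
    index=list('abcde'))

# Label-based: turn the row positions into labels, then use loc
by_loc = demo.loc[demo.index[[1, 3, 4]], ['animal', 'age']]

# Position-based: turn the column labels into positions, then use iloc
col_pos = demo.columns.get_indexer(['animal', 'age'])
by_iloc = demo.iloc[[1, 3, 4], col_pos]
```

Both routes should pick the same cells; which one reads better is mostly a matter of taste.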

  1. Select the rows where visits is greater than 2

df.query('visits > 2')

  1. Select the rows where age is missing, i.e. is NaN.

df[df['age'].isnull()]

  1. Select the rows where animal is cat and age is less than 3.

df.query('animal == "cat" and age < 3')  # a single = is assignment; a double == tests equality

  1. Select the rows where age is between 2 and 4 (inclusive)

df.query('age >= 2 and age <= 4')
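
An inclusive range test can also be written with `Series.between`, which includes both endpoints by default. A small sketch comparing the two forms on the age column alone (the frame name `ages` is just for illustration):

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame(
    {'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3]},
    index=list('abcdefghij'))

# String expression evaluated by query
via_query = ages.query('age >= 2 and age <= 4')

# Boolean mask; NaN ages compare as False and are dropped
via_between = ages[ages['age'].between(2, 4)]
```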

  1. Change the age in row 'f' to 1.5.

df.loc['f','age'] = 1.5

  1. Calculate the sum of visits (the total number of visits).

df['visits'].sum()

  1. Calculate the mean age for each animal

df.groupby(by='animal')['age'].mean()
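
  16. Add a new row 'k' with values of your choice, then delete that row to return the original DataFrame

df.loc['k'] = ['dog', 5, 2, 'no']  # values follow the column order: animal, age, visits, priority
df = df.drop('k')
df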
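
A plain list assigned through `loc` is matched to the columns by position, so a value in the wrong slot silently lands in the wrong column. One way to avoid that, sketched here on a two-row frame made up for illustration, is to pass a Series keyed by column name, which pandas aligns on the column labels:

```python
import pandas as pd

demo = pd.DataFrame(
    {'animal': ['cat', 'dog'], 'age': [2.5, 5.0],
     'visits': [1, 2], 'priority': ['yes', 'no']},
    index=['a', 'b'])

# Keys are deliberately in a different order from the columns
new_row = pd.Series({'age': 5.5, 'animal': 'dog', 'priority': 'no', 'visits': 2})
demo.loc['k'] = new_row                   # aligned on column labels, not position
added_animal = demo.loc['k', 'animal']
added_age = demo.loc['k', 'age']

demo = demo.drop('k')                     # back to the original two rows
```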

  1. Count the number of each type of animal

df['animal'].value_counts()

  1. Sort df first by the values in the 'age' column in descending order, then by the values in the 'visits' column in ascending order.

df.sort_values(by=['age','visits'],ascending=[False,True])

  1. The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

  1. In the 'animal' column, change the 'snake' entries to 'python'.

df['animal'] = df['animal'].replace('snake', 'python')
df

  1. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

df.pivot_table(values='age',index='animal',columns='visits',aggfunc='mean')
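
A pivot table of means is closely related to a two-key `groupby` followed by `unstack`. The sketch below uses a small made-up frame (`pets`) to show the two side by side; combinations that never occur come out as NaN.

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame({
    'animal': ['cat', 'cat', 'dog', 'dog', 'cat'],
    'visits': [1, 3, 1, 1, 3],
    'age': [2.0, 4.0, 3.0, 5.0, 6.0],
})

# Rows are animals, columns are visit counts, cells are mean ages
table = pets.pivot_table(values='age', index='animal',
                         columns='visits', aggfunc='mean')

# Same numbers via groupby on both keys, then moving 'visits' into the columns
same = pets.groupby(['animal', 'visits'])['age'].mean().unstack()
```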

  1. You have a DataFrame df with a column 'A' of integers. For example:
    df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
    How do you filter out rows which contain the same integer as the row immediately above?

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]  # keep a row only if it differs from the row directly above
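
The difference between dropping adjacent repeats and dropping all duplicates only shows up when a value comes back later in the column. A short sketch with a made-up column where 1 reappears after 2:

```python
import pandas as pd

runs = pd.DataFrame({'A': [1, 2, 2, 1, 1, 3]})

# Compare each value with the one directly above it
no_adjacent_repeats = runs.loc[runs['A'].shift() != runs['A']]

# drop_duplicates also removes the later, non-adjacent 1
globally_unique = runs.drop_duplicates(subset='A')
```

With the sorted example column from the puzzle the two give the same rows, which is why the difference is easy to miss.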

  1. Given a DataFrame of numeric values, say
    df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
    how do you subtract the row mean from each element in the row?

df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1),axis=0)
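
The `axis=0` argument matters here: plain subtraction of a Series from a DataFrame aligns the Series with the column labels instead of the row labels. A quick check, with a seeded generator added only to make the sketch repeatable, is that every row of the result averages to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # seed chosen arbitrarily
frame = pd.DataFrame(rng.random(size=(5, 3)))

row_means = frame.mean(axis=1)            # one mean per row
centered = frame.sub(row_means, axis=0)   # align the means with the row index

row_means_after = centered.mean(axis=1)   # should be ~0 for every row
```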

  1. Suppose you have DataFrame with 10 columns of real numbers, for example:
    df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
    Which column of numbers has the smallest sum? (Find that column's label.)

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()

  1. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

len(df.drop_duplicates(keep=False))
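
Note that `keep=False` discards every row that has a copy anywhere, so the count is of rows that occur exactly once. If the goal is instead the number of distinct rows, the default `keep='first'` is the one to use. A sketch on a made-up frame showing both counts:

```python
import pandas as pd

pairs = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                      'y': ['a', 'a', 'b', 'c', 'c', 'd']})

# Rows that appear exactly once: (2, 'b') and (3, 'd')
rows_without_any_copy = len(pairs.drop_duplicates(keep=False))

# Distinct rows: (1, 'a'), (2, 'b'), (3, 'c'), (3, 'd')
distinct_rows = len(pairs.drop_duplicates())
```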

  1. You have a DataFrame that consists of 10 columns of floating-point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
    (You should return a Series of column labels.)

(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
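
The idea is that the running count of NaNs along a row first reaches 3 exactly at the third NaN, and `idxmax` returns the first column where that comparison is True. A small sketch with two hand-written rows (values invented for illustration) makes this easy to check by eye:

```python
import numpy as np
import pandas as pd

nan = np.nan
gaps = pd.DataFrame(
    [[0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan],
     [nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16]],
    columns=list('abcdefghij'))

nan_count = gaps.isnull().cumsum(axis=1)     # running NaN count per row
third_nan = (nan_count == 3).idxmax(axis=1)  # first column where the count hits 3
```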

  1. A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For example:
    df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
    'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
    For each group, find the sum of the three greatest values.

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
# sum(level=0) was removed in pandas 2.0; group on the first index level instead
df.groupby(by='grps')['vals'].nlargest(3).groupby(level=0).sum()
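
The `level` argument of `Series.sum` was removed in pandas 2.0, so on current versions the per-group total has to be taken another way. A sketch of two equivalent options on the example data: regrouping on the first index level, or applying `nlargest(3).sum()` to each group directly.

```python
import pandas as pd

groups = pd.DataFrame({
    'grps': list('aaabbcaabcccbbc'),
    'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87],
})

# nlargest yields a (group, original index) MultiIndex; sum over the group level
top3_regrouped = (groups.groupby('grps')['vals']
                        .nlargest(3)
                        .groupby(level=0)
                        .sum())

# Same totals, computed inside each group
top3_applied = groups.groupby('grps')['vals'].apply(lambda v: v.nlargest(3).sum())
```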

  1. A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.

df.groupby(pd.cut(df['A'], bins=np.arange(0, 101, 10)))['B'].sum()
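
`pd.cut` builds right-closed intervals by default, which matches the (0, 10], (10, 20], ... grouping in the puzzle. The sketch below uses a tiny made-up frame with values on the bin edges; `observed=False` is passed explicitly so that empty bins are kept and the result does not depend on the default, which differs across pandas versions.

```python
import numpy as np
import pandas as pd

ab = pd.DataFrame({'A': [1, 10, 11, 20, 95, 100],
                   'B': [1, 2, 3, 4, 5, 6]})

bins = np.arange(0, 101, 10)          # edges 0, 10, ..., 100
binned = ab.groupby(pd.cut(ab['A'], bins=bins), observed=False)['B'].sum()
# 10 falls in (0, 10] and 20 in (10, 20], because the right edge is included
```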

Puzzles 29 to 32 (the hard section) are skipped for now.

  1. Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.

time_index = pd.date_range('2015-01-01','2015-12-31',freq='B')
s = pd.Series(np.random.rand(len(time_index)),index=time_index)
s.head()

  1. Find the sum of the values in s for every Wednesday.

s[s.index.weekday == 2].sum()
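
`weekday` numbers the days from Monday = 0, so 2 is Wednesday. One way to sanity-check the filter is to replace the random values with ones, so that the sum simply counts the Wednesdays in the index (the Series name `ones` is only for this sketch):

```python
import pandas as pd

idx = pd.date_range('2015-01-01', '2015-12-31', freq='B')
ones = pd.Series(1.0, index=idx)

# Every selected entry contributes 1, so the sum is the number of Wednesdays
wednesday_total = ones[ones.index.weekday == 2].sum()

# Cross-check using day names instead of day numbers
wednesday_count = (ones.index.day_name() == 'Wednesday').sum()
```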

  1. For each calendar month in s, find the mean of values.

s.resample('M').mean()  # pandas 2.2+ deprecates the 'M' alias here in favour of 'ME'
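
Because the month-end resampling alias has changed between pandas versions, a version-independent alternative is to group on calendar-month periods. The result should hold the same monthly means, indexed by period rather than by month-end date. A sketch using increasing values instead of random ones so the January mean is easy to verify:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', '2015-12-31', freq='B')
vals = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# One group per calendar month, e.g. Period('2015-01', 'M')
monthly_mean = vals.groupby(vals.index.to_period('M')).mean()

# January 2015 has 22 business days holding 0..21, so its mean is 10.5
january_mean = monthly_mean.iloc[0]
```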

  1. For each group of four consecutive calendar months in s, find the date on which the highest value occurred.

s.groupby(pd.Grouper(freq='4M')).idxmax()  # pandas 2.2+ deprecates 'M' here; use '4ME' instead

  1. Create a DateTimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')
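
The `WOM-3THU` alias stands for "week of month, third Thursday". A quick way to convince yourself the result is right is to check that every date is a Thursday falling on day 15 to 21 of its month, and that there is one per month:

```python
import pandas as pd

third_thursdays = pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

all_thursdays = bool((third_thursdays.weekday == 3).all())   # Monday = 0, so 3 is Thursday
in_third_week = bool(((third_thursdays.day >= 15) & (third_thursdays.day <= 21)).all())
one_per_month = len(third_thursdays) == 24                   # 12 months x 2 years
```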
