Analyzing Two-Dimensional Data with NumPy and Pandas
1. Data Description
- UNIT: Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.
- DATEn: Date in "yyyymmdd" (20110521) format.
- TIMEn: Time in "hh:mm:ss" (08:05:02) format.
- ENTRIESn: Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.
- EXITSn: Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.
- ENTRIESn_hourly: Difference in ENTRIESn from the previous REGULAR reading.
- EXITSn_hourly: Difference in EXITSn from the previous REGULAR reading.
- datetime: Date and time in "yyyymmdd hh:mm:ss" format (20110501 00:00:00). Can be parsed into a Pandas datetime object without modifications.
- hour: Hour of the timestamp from TIMEn. Truncated rather than rounded.
- day_week: Integer (0-6, Monday-Sunday) corresponding to the day of the week.
- weekday: Indicator (0 or 1) of whether the date is a weekday (Monday-Friday).
- station: Subway station corresponding to the remote unit.
- latitude: Latitude of the subway station corresponding to the remote unit.
- longitude: Longitude of the subway station corresponding to the remote unit.
- conds: Categorical variable of the weather conditions (Clear, Cloudy, etc.) for the time and location.
- fog: Indicator (0 or 1) of whether there was fog at the time and location.
- precipi: Precipitation in inches at the time and location.
- pressurei: Barometric pressure in inches Hg at the time and location.
- rain: Indicator (0 or 1) of whether rain occurred within the calendar day at the location.
- tempi: Temperature in ℉ at the time and location.
- wspdi: Wind speed in mph at the time and location.
- meanprecipi: Daily average of precipi for the location.
- meanpressurei: Daily average of pressurei for the location.
- meantempi: Daily average of tempi for the location.
- meanwspdi: Daily average of wspdi for the location.
- weather_lat: Latitude of the weather station the weather data is from.
- weather_lon: Longitude of the weather station the weather data is from.
2. Questions to Explore
- What variables are related to subway ridership?
-- Which stations have the most riders?
-- What are the ridership patterns over time?
-- How does the weather affect ridership?
- What patterns can I find in the weather?
-- Is the temperature rising throughout the month?
-- How does the weather vary across the city?
3. Two-Dimensional NumPy Arrays
Two-dimensional data structures:
- Python: list of lists
- NumPy: 2D array
- Pandas: DataFrame

A NumPy 2D array, as opposed to an array of arrays:
- is more memory efficient
- element access is a bit different: a[1, 3] instead of a[1][3]
- mean() and std() operate on the entire array
import numpy as np
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
print(ridership)
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])
# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])
# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)
Write a function that:
- finds the station with the maximum riders on the first day
- finds the mean riders per day for that station, as well as the overall mean

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    # Index of the station (column) with the most riders on the first day
    max_station = ridership[0, :].argmax()
    mean_for_max = ridership[:, max_station].mean()
    return (overall_mean, mean_for_max)
4. NumPy Axes
Mean of each row:
ridership.mean(axis=1)
Mean of each column:
ridership.mean(axis=0)
import numpy as np
a = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(a.sum())
print(a.sum(axis=0))
print(a.sum(axis=1))
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
def min_and_max_riders_per_day(ridership):
    # Mean ridership per day for each station (column)
    mean_ridership_for_station = ridership.mean(axis=0)
    max_daily_ridership = mean_ridership_for_station.max()
    min_daily_ridership = mean_ridership_for_station.min()
    return (max_daily_ridership, min_daily_ridership)
5. NumPy and Pandas Data Types
Each column of a Pandas DataFrame can have a different data type.
dataframe.mean() computes the mean of each column.
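A minimal sketch (column names made up for illustration) of a DataFrame with a different dtype per column; note that recent pandas versions need numeric_only=True on mean() when non-numeric columns are present:

```python
import pandas as pd

# Each column of a DataFrame can hold its own dtype.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],     # object (string) column
    'count': [1, 2, 3],          # integer column
    'ratio': [0.5, 1.5, 2.5]     # float column
})
print(df.dtypes)

# mean() averages each numeric column separately.
print(df.mean(numeric_only=True))
```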
6. Accessing DataFrame Elements
.loc['index_name']  # access the row with that index label
.iloc[9]            # access a row by position
.iloc[1, 3]         # access a single element by position
df['column_name']   # access a column
df.values           # the underlying 2D NumPy array of values, without column names or row index; useful for computing statistics over the whole DataFrame
import pandas as pd
# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
data=[[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]],
index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
# Accessing elements
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])
print(ridership_df[['R003', 'R005']])

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())
print(df.sum(axis=1))
print(df.values.sum())
def mean_riders_for_max_station(ridership):
    overall_mean = ridership.values.mean()
    # idxmax() returns the NAME of the column with the maximum value
    max_station = ridership.iloc[0].idxmax()
    mean_for_max = ridership.loc[:, max_station].mean()
    return (overall_mean, mean_for_max)
7. Loading Data into a DataFrame
A DataFrame is a natural representation of a CSV file's contents, since each column can have its own data type.
df = pd.read_csv('filename.csv')
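A quick self-contained sketch, using an in-memory string in place of the hypothetical 'filename.csv', showing that read_csv infers a dtype per column:

```python
import io
import pandas as pd

# A small CSV in memory stands in for a file on disk (made-up data).
csv_text = """UNIT,DATEn,ENTRIESn_hourly
R003,05-01-11,1478
R004,05-01-11,3877
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # UNIT and DATEn become object columns, ENTRIESn_hourly numeric
```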
8. Computing Correlation
By default, Pandas' std() applies Bessel's correction when computing the standard deviation. Calling std(ddof=0) disables the correction.
When computing the Pearson correlation coefficient, use ddof=0.
NumPy's corrcoef() function computes the Pearson product-moment correlation coefficient, often just called the "correlation coefficient."
import pandas as pd
def correlation(x, y):
    # Standardize each variable using the population std (ddof=0),
    # then average the products of the standardized values.
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()
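The hand-rolled correlation() above should agree with NumPy's corrcoef(); a quick check on made-up data:

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Average of the products of the standardized values (population std).
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])
# corrcoef() returns a 2x2 matrix; the off-diagonal entry is the coefficient.
print(correlation(x, y))
print(np.corrcoef(x, y)[0, 1])
```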
9. Pandas Axis Names
axis=1 or axis='columns': operate across the columns (one result per row)
axis=0 or axis='index': operate down the rows (one result per column)
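A tiny example of the two axis names in action:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis='index' (axis=0): collapse the rows, one result per column
print(df.mean(axis='index'))     # a: 1.5, b: 3.5
# axis='columns' (axis=1): collapse the columns, one result per row
print(df.mean(axis='columns'))   # 2.0, 3.0
```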
10. Vectorized Operations on DataFrames
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
index=['row4', 'row3', 'row2'])
print(df1 + df2)
# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
3144808, 3144895, 3144905, 3144941, 3145094],
'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
1088317, 1088328, 1088331, 1088420, 1088753]
})
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits (entries in the
    first column, exits in the second) and return a DataFrame with hourly
    entries and exits in the same column order.
    '''
    return entries_and_exits - entries_and_exits.shift(1)
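shift(1) moves every row down by one, so the subtraction yields period-over-period differences; pandas' built-in diff() computes the same thing, as this sketch with made-up readings shows:

```python
import pandas as pd

df = pd.DataFrame({'ENTRIESn': [10, 40, 100], 'EXITSn': [5, 25, 60]})

# Subtracting the shifted frame and calling diff() give identical results.
by_shift = df - df.shift(1)
by_diff = df.diff()
print(by_shift)  # first row is NaN; the rest are the deltas
print(by_shift.equals(by_diff))
```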
11. DataFrame applymap()
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [10, 20, 30],
'c': [5, 10, 15]
})
def add_one(x):
    return x + 1

print(df.applymap(add_one))
grades_df = pd.DataFrame(
data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio',
'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

def convert_grades(grades):
    return grades.applymap(convert_grade)
12. DataFrame apply()

def standardize_column(column):
    return (column - column.mean()) / column.std(ddof=0)

def standardize(df):
    return df.apply(standardize_column)
The default standard deviation differs between NumPy's .std() and Pandas' .std(). By default, NumPy computes the population standard deviation (ddof=0), while Pandas computes the sample standard deviation (ddof=1). If we know all of the scores, then we have the full population, so to standardize with Pandas we need to set ddof to 0.
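The ddof difference is easy to demonstrate directly:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to the population standard deviation (ddof=0)...
print(np.std(values))
# ...while pandas defaults to the sample standard deviation (ddof=1).
print(pd.Series(values).std())
# Passing ddof=0 explicitly makes pandas agree with NumPy.
print(pd.Series(values).std(ddof=0))
```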
13. DataFrame apply(), Use Case 2
Reducing each column to a single value:
def column_second_largest(column):
    sorted_values = column.sort_values(ascending=False)
    return sorted_values.iloc[1]

def second_largest(df):
    '''
    Return the second-largest value of each column of the input DataFrame.
    '''
    return df.apply(column_second_largest)
14. Adding a DataFrame and a Series
import pandas as pd
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
0: [10, 20, 30, 40],
1: [50, 60, 70, 80],
2: [90, 100, 110, 120],
3: [130, 140, 150, 160]
})
# Adding a Series to a square DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
# Adding a Series to a one-row DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# Adding a Series to a one-column DataFrame
print(df + s)
# Adding when DataFrame column names match Series index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
# Adding when DataFrame column names don't match Series index
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
df.add(s) is equivalent to df + s
df.add(s, axis='columns')  # match the Series index against the column names (the default)
df.add(s, axis='index')    # match the Series index against the row index instead
Adding a DataFrame and a Series adds the Series to every row of the DataFrame, matching the Series' index against the DataFrame's column names. With axis='index', the Series is instead added to every column, matched against the row index.
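A small sketch with made-up values showing how the axis argument changes which labels are matched:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['r1', 'r2'])

col_s = pd.Series({'a': 1, 'b': 2})        # index matches the column names
row_s = pd.Series({'r1': 100, 'r2': 200})  # index matches the row labels

# Default: align the Series index with the DataFrame's columns.
print(df.add(col_s))                 # adds 1 to column a, 2 to column b
# axis='index': align the Series index with the DataFrame's rows instead.
print(df.add(row_s, axis='index'))   # adds 100 to row r1, 200 to row r2
```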
15. Standardizing Each Column, Revisited

def standardize(df):
    '''
    Standardize each column.
    '''
    return (df - df.mean()) / df.std(ddof=0)

def standardize_rows(df):
    '''
    Standardize each row.
    '''
    mean = df.mean(axis='columns')
    # A plain df - mean would match the Series against the COLUMN names,
    # so use sub/div with axis='index' to align on the rows instead.
    mean_difference = df.sub(mean, axis='index')
    std = df.std(axis='columns', ddof=0)
    return mean_difference.div(std, axis='index')
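A quick self-contained check (with made-up numbers) that sub/div with axis='index' standardizes rows, leaving each row with mean 0 and population std 1:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 6], 'c': [5, 10]})

row_mean = df.mean(axis='columns')
row_std = df.std(axis='columns', ddof=0)
# sub/div with axis='index' align the per-row statistics with the rows.
standardized = df.sub(row_mean, axis='index').div(row_std, axis='index')

print(standardized.mean(axis='columns'))            # ~0 for every row
print(standardized.std(axis='columns', ddof=0))     # ~1 for every row
```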
16. Pandas groupby()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
'value': values,
'even': values % 2 == 0,
'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(example_df)

grouped_data = example_df.groupby('even')
print(grouped_data.groups)

# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print(grouped_data.groups)

# Get sum of each group
grouped_data = example_df.groupby('even')
print(grouped_data.sum())

grouped_data = example_df.groupby('even')
print(grouped_data.sum()['value'])
print(grouped_data['value'].sum())
17. Hourly Entries and Exits
def hourly(column):
    return column - column.shift(1)

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits and return a
    DataFrame with hourly entries and exits. The hourly values are
    calculated separately for each station (the 'UNIT' column).
    '''
    return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
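groupby(...).diff() does the per-group differencing in one call; a minimal sketch with made-up readings for two units:

```python
import pandas as pd

df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R004', 'R004'],
    'ENTRIESn': [100, 130, 500, 560],
    'EXITSn': [50, 70, 300, 340]
})

# diff() restarts within each group: the first row of each UNIT is NaN,
# so differences never span two different stations.
result = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff()
print(result)
```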
18. Merging Pandas DataFrames
import pandas as pd
subway_df = pd.DataFrame({
'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
'R004', 'R004'],
'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'ENTRIESn': [ 4388333, 4388348, 4389885, 4391507, 4393043, 14656120,
14656174, 14660126, 14664247, 14668301],
'EXITSn': [ 2911002, 2911036, 2912127, 2913223, 2914284, 14451774,
14451851, 14454734, 14457780, 14460818],
'latitude': [ 40.689945, 40.689945, 40.689945, 40.689945, 40.689945,
40.69132 , 40.69132 , 40.69132 , 40.69132 , 40.69132 ],
'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
-73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})
weather_df = pd.DataFrame({
'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
'05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'latitude': [ 40.689945, 40.69132 , 40.689945, 40.69132 , 40.689945,
40.69132 , 40.689945, 40.69132 , 40.689945, 40.69132 ],
'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
-73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
'pressurei': [ 30.24, 30.24, 30.32, 30.32, 30.14, 30.14, 29.98, 29.98,
30.01, 30.01],
'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'tempi': [ 52. , 52. , 48.9, 48.9, 54. , 54. , 57.2, 57.2, 48.9, 48.9],
'wspdi': [ 8.1, 8.1, 6.9, 6.9, 3.5, 3.5, 15. , 15. , 15. , 15. ]
})
def combine_dfs(subway_df, weather_df):
    '''
    Take two DataFrames, one with subway data and one with weather data,
    and return a single DataFrame with one row for each date, hour, and
    location. Only include times and locations that have both subway data
    and weather data available.
    '''
    return subway_df.merge(weather_df,
                           on=['DATEn', 'hour', 'latitude', 'longitude'],
                           how='inner')
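A toy example (made-up columns) illustrating why how='inner' matters here, compared with how='left':

```python
import pandas as pd

left = pd.DataFrame({'DATEn': ['05-01-11', '05-02-11'], 'riders': [100, 200]})
right = pd.DataFrame({'DATEn': ['05-02-11', '05-03-11'], 'tempi': [52.0, 48.9]})

# how='inner' keeps only keys present in BOTH frames...
print(left.merge(right, on='DATEn', how='inner'))  # just 05-02-11
# ...while how='left' keeps every left row, filling missing weather with NaN.
print(left.merge(right, on='DATEn', how='left'))
```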