Analyzing Two-Dimensional Data with NumPy and Pandas
1. Data Description
- UNIT: Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.
- DATEn: Date in "yyyymmdd" (20110521) format.
- TIMEn: Time in "hh:mm:ss" (08:05:02) format.
- ENTRIESn: Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.
- EXITSn: Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.
- ENTRIESn_hourly: Difference in ENTRIESn from the previous REGULAR reading.
- EXITSn_hourly: Difference in EXITSn from the previous REGULAR reading.
- datetime: Date and time in "yyyymmdd hh:mm:ss" format (20110501 00:00:00). Can be parsed into a Pandas datetime object without modifications.
- hour: Hour of the timestamp from TIMEn. Truncated rather than rounded.
- day_week: Integer (0-6, Monday-Sunday) corresponding to the day of the week.
- weekday: Indicator (0 or 1) of whether the date is a weekday (Monday-Friday).
- station: Subway station corresponding to the remote unit.
- latitude: Latitude of the subway station corresponding to the remote unit.
- longitude: Longitude of the subway station corresponding to the remote unit.
- conds: Categorical variable of the weather conditions (Clear, Cloudy, etc.) for the time and location.
- fog: Indicator (0 or 1) of whether there was fog at the time and location.
- precipi: Precipitation in inches at the time and location.
- pressurei: Barometric pressure in inches Hg at the time and location.
- rain: Indicator (0 or 1) of whether rain occurred within the calendar day at the location.
- tempi: Temperature in ℉ at the time and location.
- wspdi: Wind speed in mph at the time and location.
- meanprecipi: Daily average of precipi for the location.
- meanpressurei: Daily average of pressurei for the location.
- meantempi: Daily average of tempi for the location.
- meanwspdi: Daily average of wspdi for the location.
- weather_lat: Latitude of the weather station the weather data is from.
- weather_lon: Longitude of the weather station the weather data is from.
2. Questions to Explore
- What variables are related to subway ridership?
-- Which stations have the most riders?
-- What are the ridership patterns over time?
-- How does the weather affect ridership?
- What patterns can I find in the weather?
-- Is the temperature rising throughout the month?
-- How does the weather vary across the city?
3. Two-Dimensional NumPy Arrays
Two-dimensional data structures:
- Python: list of lists
- NumPy: 2D array
- Pandas: DataFrame

A NumPy 2D array, as opposed to an array of arrays:
- is more memory efficient
- element access is a bit different: a[1, 3] instead of a[1][3]
- mean() and std() operate on the entire array
import numpy as np
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
print(ridership)
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])
# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])
# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)
Write a function that:
- finds the station with the maximum riders on the first day
- finds the mean riders per day for that station, as well as the overall mean

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    # Index of the station (column) with the most riders on the first day
    max_station = ridership[0, :].argmax()
    mean_for_max = ridership[:, max_station].mean()
    return (overall_mean, mean_for_max)
4. NumPy Axes
Mean of each row:
ridership.mean(axis=1)
Mean of each column:
ridership.mean(axis=0)
import numpy as np
a = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(a.sum())
print(a.sum(axis=0))
print(a.sum(axis=1))
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
def min_and_max_riders_per_day(ridership):
    # Mean ridership per day for each station (column)
    mean_ridership_for_station = ridership.mean(axis=0)
    max_daily_ridership = mean_ridership_for_station.max()
    min_daily_ridership = mean_ridership_for_station.min()
    return (max_daily_ridership, min_daily_ridership)
5. NumPy and Pandas Data Types
Each column of a Pandas DataFrame can have a different data type.
dataframe.mean() computes the mean of each column.
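A minimal sketch (column names made up for illustration) of a DataFrame with a different dtype per column; note that recent pandas versions need numeric_only=True on mean() when non-numeric columns are present:

```python
import pandas as pd

# Each column of a DataFrame can hold its own dtype.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],     # object (string) column
    'count': [1, 2, 3],          # integer column
    'ratio': [0.5, 1.5, 2.5]     # float column
})
print(df.dtypes)

# mean() averages each numeric column separately.
print(df.mean(numeric_only=True))
```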
6. Accessing DataFrame Elements
.loc['index_name']  # access the row with that index label
.iloc[9]            # access a row by position
.iloc[1, 3]         # access a single element by position
df['column_name']   # access a column
df.values           # the underlying 2D NumPy array of values, without column names or row index; useful for computing statistics over the whole DataFrame
import pandas as pd
# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
data=[[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]],
index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
# Accessing elements
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])
print(ridership_df[['R003', 'R005']])

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())
print(df.sum(axis=1))
print(df.values.sum())
def mean_riders_for_max_station(ridership):
    overall_mean = ridership.values.mean()
    # idxmax() returns the NAME of the column with the maximum value
    max_station = ridership.iloc[0].idxmax()
    mean_for_max = ridership.loc[:, max_station].mean()
    return (overall_mean, mean_for_max)
7. Loading Data into a DataFrame
A DataFrame is a natural representation of a CSV file's contents, since each column can have its own data type.
df = pd.read_csv('filename.csv')
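A quick self-contained sketch, using an in-memory string in place of the hypothetical 'filename.csv', showing that read_csv infers a dtype per column:

```python
import io
import pandas as pd

# A small CSV in memory stands in for a file on disk (made-up data).
csv_text = """UNIT,DATEn,ENTRIESn_hourly
R003,05-01-11,1478
R004,05-01-11,3877
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # UNIT and DATEn become object columns, ENTRIESn_hourly numeric
```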
8. Computing Correlation
By default, Pandas' std() applies Bessel's correction when computing the standard deviation. Calling std(ddof=0) disables the correction.
When computing the Pearson correlation coefficient, use ddof=0.
NumPy's corrcoef() function computes the Pearson product-moment correlation coefficient, often just called the "correlation coefficient."
import pandas as pd
def correlation(x, y):
    # Standardize each variable using the population std (ddof=0),
    # then average the products of the standardized values.
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()
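The hand-rolled correlation() above should agree with NumPy's corrcoef(); a quick check on made-up data:

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Average of the products of the standardized values (population std).
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])
# corrcoef() returns a 2x2 matrix; the off-diagonal entry is the coefficient.
print(correlation(x, y))
print(np.corrcoef(x, y)[0, 1])
```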
9. Pandas Axis Names
axis=1 or axis='columns': operate across the columns (one result per row)
axis=0 or axis='index': operate down the rows (one result per column)
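A tiny example of the two axis names in action:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis='index' (axis=0): collapse the rows, one result per column
print(df.mean(axis='index'))     # a: 1.5, b: 3.5
# axis='columns' (axis=1): collapse the columns, one result per row
print(df.mean(axis='columns'))   # 2.0, 3.0
```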
10. Vectorized Operations on DataFrames
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
index=['row4', 'row3', 'row2'])
print(df1 + df2)
# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
3144808, 3144895, 3144905, 3144941, 3145094],
'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
1088317, 1088328, 1088331, 1088420, 1088753]
})
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits (entries in the
    first column, exits in the second) and return a DataFrame with hourly
    entries and exits in the same column order.
    '''
    return entries_and_exits - entries_and_exits.shift(1)
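shift(1) moves every row down by one, so the subtraction yields period-over-period differences; pandas' built-in diff() computes the same thing, as this sketch with made-up readings shows:

```python
import pandas as pd

df = pd.DataFrame({'ENTRIESn': [10, 40, 100], 'EXITSn': [5, 25, 60]})

# Subtracting the shifted frame and calling diff() give identical results.
by_shift = df - df.shift(1)
by_diff = df.diff()
print(by_shift)  # first row is NaN; the rest are the deltas
print(by_shift.equals(by_diff))
```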
11. DataFrame applymap()
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [10, 20, 30],
'c': [5, 10, 15]
})
def add_one(x):
    return x + 1

print(df.applymap(add_one))
grades_df = pd.DataFrame(
data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio',
'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

def convert_grades(grades):
    return grades.applymap(convert_grade)
12. DataFrame apply()

def standardize_column(column):
    return (column - column.mean()) / column.std(ddof=0)

def standardize(df):
    return df.apply(standardize_column)
The default standard deviation differs between NumPy's .std() and Pandas' .std(). By default, NumPy computes the population standard deviation (ddof=0), while Pandas computes the sample standard deviation (ddof=1). If we know all of the scores, then we have the full population, so to standardize with Pandas we need to set ddof to 0.
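The ddof difference is easy to demonstrate directly:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to the population standard deviation (ddof=0)...
print(np.std(values))
# ...while pandas defaults to the sample standard deviation (ddof=1).
print(pd.Series(values).std())
# Passing ddof=0 explicitly makes pandas agree with NumPy.
print(pd.Series(values).std(ddof=0))
```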
13. DataFrame apply(), Use Case 2
Reducing each column to a single value:
def column_second_largest(column):
    sorted_values = column.sort_values(ascending=False)
    return sorted_values.iloc[1]

def second_largest(df):
    '''
    Return the second-largest value of each column of the input DataFrame.
    '''
    return df.apply(column_second_largest)
14. Adding a DataFrame and a Series
import pandas as pd
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
0: [10, 20, 30, 40],
1: [50, 60, 70, 80],
2: [90, 100, 110, 120],
3: [130, 140, 150, 160]
})
# Adding a Series to a square DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
# Adding a Series to a one-row DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# Adding a Series to a one-column DataFrame
print(df + s)
# Adding when DataFrame column names match Series index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
# Adding when DataFrame column names don't match Series index
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
df.add(s) is equivalent to df + s
df.add(s, axis='columns')  # match the Series index against the column names (the default)
df.add(s, axis='index')    # match the Series index against the row index instead
Adding a DataFrame and a Series adds the Series to every row of the DataFrame, matching the Series' index against the DataFrame's column names. With axis='index', the Series is instead added to every column, matched against the row index.
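A small sketch with made-up values showing how the axis argument changes which labels are matched:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['r1', 'r2'])

col_s = pd.Series({'a': 1, 'b': 2})        # index matches the column names
row_s = pd.Series({'r1': 100, 'r2': 200})  # index matches the row labels

# Default: align the Series index with the DataFrame's columns.
print(df.add(col_s))                 # adds 1 to column a, 2 to column b
# axis='index': align the Series index with the DataFrame's rows instead.
print(df.add(row_s, axis='index'))   # adds 100 to row r1, 200 to row r2
```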
15. Standardizing Each Column, Revisited

def standardize(df):
    '''
    Standardize each column.
    '''
    return (df - df.mean()) / df.std(ddof=0)

def standardize_rows(df):
    '''
    Standardize each row.
    '''
    mean = df.mean(axis='columns')
    # A plain df - mean would match the Series against the COLUMN names,
    # so use sub/div with axis='index' to align on the rows instead.
    mean_difference = df.sub(mean, axis='index')
    std = df.std(axis='columns', ddof=0)
    return mean_difference.div(std, axis='index')
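A quick self-contained check (with made-up numbers) that sub/div with axis='index' standardizes rows, leaving each row with mean 0 and population std 1:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 6], 'c': [5, 10]})

row_mean = df.mean(axis='columns')
row_std = df.std(axis='columns', ddof=0)
# sub/div with axis='index' align the per-row statistics with the rows.
standardized = df.sub(row_mean, axis='index').div(row_std, axis='index')

print(standardized.mean(axis='columns'))            # ~0 for every row
print(standardized.std(axis='columns', ddof=0))     # ~1 for every row
```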
16. Pandas groupby()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
'value': values,
'even': values % 2 == 0,
'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(example_df)

grouped_data = example_df.groupby('even')
print(grouped_data.groups)

# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print(grouped_data.groups)

# Get sum of each group
grouped_data = example_df.groupby('even')
print(grouped_data.sum())

grouped_data = example_df.groupby('even')
print(grouped_data.sum()['value'])
print(grouped_data['value'].sum())
17. Hourly Entries and Exits
def hourly(column):
    return column - column.shift(1)

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits and return a
    DataFrame with hourly entries and exits. The hourly values are
    calculated separately for each station (the 'UNIT' column).
    '''
    return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
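groupby(...).diff() does the per-group differencing in one call; a minimal sketch with made-up readings for two units:

```python
import pandas as pd

df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R004', 'R004'],
    'ENTRIESn': [100, 130, 500, 560],
    'EXITSn': [50, 70, 300, 340]
})

# diff() restarts within each group: the first row of each UNIT is NaN,
# so differences never span two different stations.
result = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff()
print(result)
```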
18. Merging Pandas DataFrames
import pandas as pd
subway_df = pd.DataFrame({
'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
'R004', 'R004'],
'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'ENTRIESn': [ 4388333, 4388348, 4389885, 4391507, 4393043, 14656120,
14656174, 14660126, 14664247, 14668301],
'EXITSn': [ 2911002, 2911036, 2912127, 2913223, 2914284, 14451774,
14451851, 14454734, 14457780, 14460818],
'latitude': [ 40.689945, 40.689945, 40.689945, 40.689945, 40.689945,
40.69132 , 40.69132 , 40.69132 , 40.69132 , 40.69132 ],
'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
-73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})
weather_df = pd.DataFrame({
'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
'05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'latitude': [ 40.689945, 40.69132 , 40.689945, 40.69132 , 40.689945,
40.69132 , 40.689945, 40.69132 , 40.689945, 40.69132 ],
'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
-73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
'pressurei': [ 30.24, 30.24, 30.32, 30.32, 30.14, 30.14, 29.98, 29.98,
30.01, 30.01],
'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'tempi': [ 52. , 52. , 48.9, 48.9, 54. , 54. , 57.2, 57.2, 48.9, 48.9],
'wspdi': [ 8.1, 8.1, 6.9, 6.9, 3.5, 3.5, 15. , 15. , 15. , 15. ]
})
def combine_dfs(subway_df, weather_df):
    '''
    Take two DataFrames, one with subway data and one with weather data,
    and return a single DataFrame with one row for each date, hour, and
    location. Only include times and locations that have both subway data
    and weather data available.
    '''
    return subway_df.merge(weather_df,
                           on=['DATEn', 'hour', 'latitude', 'longitude'],
                           how='inner')
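A toy example (made-up columns) illustrating why how='inner' matters here, compared with how='left':

```python
import pandas as pd

left = pd.DataFrame({'DATEn': ['05-01-11', '05-02-11'], 'riders': [100, 200]})
right = pd.DataFrame({'DATEn': ['05-02-11', '05-03-11'], 'tempi': [52.0, 48.9]})

# how='inner' keeps only keys present in BOTH frames...
print(left.merge(right, on='DATEn', how='inner'))  # just 05-02-11
# ...while how='left' keeps every left row, filling missing weather with NaN.
print(left.merge(right, on='DATEn', how='left'))
```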