
Doing Data Work -- How to Process Data to Meet the Needs of Machine Learning Techniques (1): Mo

2016-10-29  Views: 2595  认真学计算机

Tags (space-separated): data-analysis python data-mining


The MovieLens 1M Dataset

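A collection of movie ratings provided by MovieLens users from the late 1990s to the early 2000s. The data include movie ratings, movie metadata (genres and release year), and demographic data about the users (age, zip code, gender, occupation, and so on).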

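The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from roughly 6,000 users. It is split into three tables: ratings, user information, and movie information.
The code below uses pandas.read_table to load each table into its own pandas DataFrame: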

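import pandas as pd

# '::' is a multi-character separator, which pandas treats as a regular
# expression; engine='python' selects the parser that supports it.
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('G:\\lcw\\movielens\\users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('G:\\lcw\\movielens\\ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
# Some titles in movies.dat are not valid UTF-8; Latin-1 decoding is assumed here.
movies = pd.read_table('G:\\lcw\\movielens\\movies.dat', sep='::', header=None, names=mnames, engine='python', encoding='latin-1')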
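For readers who do not have the .dat files at hand, here is a minimal sketch of the same parsing step on an in-memory string. The three sample rows are copied from the merged output shown later in this post; the names sample and sample_ratings are made up for illustration.

```python
import io
import pandas as pd

# Three rating rows in the '::'-separated layout of ratings.dat
# (values copied from the merged output shown in this post).
sample = (
    "1::1193::5::978300760\n"
    "2::1193::5::978298413\n"
    "12::1193::4::978220179\n"
)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

# A separator longer than one character is interpreted as a regular
# expression, so the Python parsing engine is requested explicitly.
sample_ratings = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                             names=rnames, engine='python')

print(sample_ratings.dtypes)
print(sample_ratings)
```

All four columns come back as integers, so no further type conversion is needed before merging.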

# For data in CSV format, use read_csv instead:
import pandas as pd
  
# Reading data locally
df = pd.read_csv('/Users/al-ahmadgaidasaad/Documents/d.csv')
  
# Reading data from web
data_url = "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)

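Problem: analyzing data spread across three tables at the same time is difficult, so everything should first be combined into a single table. Below, the pandas merge function joins ratings with users and then joins movies onto the result. pandas infers which columns to use as the merge (join) keys from the column names the tables have in common: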

>>> data = pd.merge(pd.merge(ratings,users),movies)
>>> data
         user_id  movie_id  rating   timestamp gender  age  occupation    zip  \
0              1      1193       5   978300760      F    1          10  48067   
1              2      1193       5   978298413      M   56          16  70072   
2             12      1193       4   978220179      M   25          12  32793   
3             15      1193       4   978199279      M   25           7  22903   
4             17      1193       5   978158471      M   50           1  95350   
5             18      1193       4   978156168      F   18           3  95825   
6             19      1193       5   982730936      M    1          10  48073   
7             24      1193       5   978136709      F   25           7  10023   
8             28      1193       3   978125194      F   25           1  14607   
9             33      1193       5   978557765      M   45           3  55421   
10            39      1193       5   978043535      M   18           4  61820   
11            42      1193       3   978038981      M   25           8  24502   
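Relying on inferred keys is convenient, but naming them makes the join easier to audit. The sketch below uses three tiny stand-in tables (ratings_s, users_s and movies_s are hypothetical names with made-up values) and shows that spelling out on= gives the same result as the nested merge above.

```python
import pandas as pd

# Toy stand-ins for the three tables; values are illustrative only.
ratings_s = pd.DataFrame({'user_id': [1, 2, 2],
                          'movie_id': [10, 10, 20],
                          'rating': [5, 4, 3]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Keys inferred from the shared column names, as in the post.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys and the join type written out.
explicit = (ratings_s
            .merge(users_s, on='user_id', how='inner')
            .merge(movies_s, on='movie_id', how='inner'))

print(explicit)
```

With explicit keys, an accidental extra shared column (for example a second column named zip in both tables) would not silently become part of the join condition.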

Next, aggregate the data with pandas:
1. Compute the mean rating of each movie by gender, using the pivot_table method:

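A DataFrame has a .pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True) method for building pivot tables; pd.pivot_table() is also available as a top-level function that takes the DataFrame as its first argument. (Older pandas versions called the index and columns parameters rows and cols.)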

>>> df
   A   B   C      D
0  foo one small  1
1  foo one large  2
2  foo one large  2
3  foo two small  3
4  foo two small  3
5  bar one large  4
6  bar one small  5
7  bar two small  6
8  bar two large  7
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
>>> table
          small  large
foo  one  1      4
     two  6      NaN
bar  one  5      4
     two  6      7

Applied to the movie data:

>>> mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
>>> mean_ratings[:5]
gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
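To make the pivot less of a black box, the following sketch computes the same kind of table two ways on a made-up five-row frame: once with pivot_table and once with groupby followed by unstack. The frame toy and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings: Movie A has one F and two M ratings, Movie B one of each.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, each cell a mean rating.
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# The same aggregation spelled out: group, average, then move the
# gender level of the index into the columns.
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_pivot)
```

The two M ratings for Movie A (2 and 4) average to 3.0, which is the value in the Movie A row of the M column in both tables.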

2. Filter out movies with fewer than 250 ratings. To do this, first group by title, then use size() to get a Series holding the size of each movie's group:

>>> ratings_by_title = data.groupby('title').size()
>>> ratings_by_title[:10]
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
>>> active_titles = ratings_by_title.index[ratings_by_title >= 250]
>>> active_titles
Index(["'burbs, The (1989)", '10 Things I Hate About You (1999)', '101 Dalmatians (1961)', ...], dtype='object')
>>> mean_ratings = mean_ratings.loc[active_titles]
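The count-then-filter pattern can be tried on a small made-up frame. In this sketch the threshold min_ratings = 2 is a hypothetical stand-in for 250, and the groupby(...).filter(...) call at the end is an alternative that drops the rows themselves rather than restricting an already aggregated table.

```python
import pandas as pd

# Made-up data: Movie A has 3 ratings, Movie B has 1, Movie C has 2.
toy = pd.DataFrame({
    'title':  ['Movie A'] * 3 + ['Movie B'] * 1 + ['Movie C'] * 2,
    'rating': [4, 2, 4, 5, 3, 1],
})
min_ratings = 2  # hypothetical threshold standing in for 250

# Number of ratings per title, then the titles that meet the threshold.
counts = toy.groupby('title').size()
active = counts.index[counts >= min_ratings]

# Restrict an aggregated result to the active titles by label.
means = toy.groupby('title')['rating'].mean()
filtered_means = means.loc[active]

# Alternative: keep only the rows that belong to active titles.
filtered_rows = toy.groupby('title').filter(lambda g: len(g) >= min_ratings)

print(filtered_means)
```

Movie B, with a single rating, is dropped by both approaches.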
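As an aside, here is how label-based selection works on a small DataFrame. Older pandas used .ix for this; it was removed in pandas 1.0 in favor of .loc. The dictionary is named pop_data so that it does not overwrite the merged data table:

>>> pop_data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year': [2000, 2001, 2002, 2001, 2002],'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
>>> frame2 = pd.DataFrame(pop_data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])
>>> frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
>>> obj = frame2.index
>>> frame2.loc[obj]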
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

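This involves slicing. The standard slicing syntax for a DataFrame used to be .ix[rows, columns]. Since .ix was removed in pandas 1.0, its job is split between .loc (selection by label) and .iloc (selection by integer position). Each accepts two slices, one along the rows (axis=0) and one along the columns (axis=1):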

>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]

>>> df.iloc[:2, :2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]

>>> df.loc['a', 'Ohio']
0
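The post does not show how df was built. One construction that reproduces the values above is sketched here (np.arange is an assumption, and df_s is a made-up name), together with the main difference between the two indexers: positional slices exclude the end point, label slices include it.

```python
import numpy as np
import pandas as pd

# Assumed construction that reproduces the values shown for df.
df_s = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])

# Positional slicing: like Python lists, the end position is excluded.
by_position = df_s.iloc[:2, :2]

# Label slicing: both end labels are included.
by_label = df_s.loc[:'c', :'Texas']

# A single row label and column label return one scalar.
scalar = df_s.loc['a', 'Ohio']

print(by_position)
print(scalar)
```

Both slices select rows a and c and the columns Ohio and Texas.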

Indexing a DataFrame directly, without an indexer, is a special case: a single label selects a column, whereas a slice selects rows.

>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]

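With Boolean arrays, note that rows and columns are selected differently (when selecting columns, the leading ":" cannot be omitted):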

>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.loc[:, df.loc['c'] >= 4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
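A row mask and a column mask can also be combined in a single .loc call. This is a small sketch on the same assumed 3x3 frame (df_s, row_mask and col_mask are made-up names):

```python
import numpy as np
import pandas as pd

# Same assumed construction as the df shown in the post.
df_s = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])

row_mask = df_s['Texas'] >= 4    # one Boolean per row, indexed by row label
col_mask = df_s.loc['c'] >= 4    # one Boolean per column, indexed by column label

rows_only = df_s[row_mask]           # rows where Texas >= 4
cols_only = df_s.loc[:, col_mask]    # columns whose value in row 'c' is >= 4
both = df_s.loc[row_mask, col_mask]  # both conditions at once

print(both)
```

Here both keeps rows c and d and the columns Texas and California.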
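Returning to the movie data, sorting by the F column in descending order lists the movies rated highest by female viewers:

>>> top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
>>> top_female_ratings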

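3. Measure rating disagreement: find the movies on which male and female viewers differ the most. Add a column to mean_ratings holding the difference between the two mean ratings, then sort by it: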

>>> mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
>>> sort_by_diff = mean_ratings.sort_values(by='diff')
>>> sort_by_diff[:15]
gender                                        F         M      diff
title                                                              
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536

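Reversing the sorted result and taking the first 15 rows gives the movies that male viewers rated higher than female viewers did; [::-1] reverses the order:

>>> sort_by_diff[::-1][:15]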
gender                                         F         M      diff
title                                                               
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
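The difference-and-sort step is easy to check on a three-row made-up table. The sketch also ranks by the absolute difference, which is one hypothetical way to list the largest gaps in either direction in a single table; it is not something the post itself does.

```python
import pandas as pd

# Made-up mean ratings per gender for three titles.
mean_s = pd.DataFrame(
    {'F': [3.8, 3.5, 2.3], 'M': [3.0, 4.2, 2.8]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Positive diff: men rated the movie higher; negative: women did.
mean_s['diff'] = mean_s['M'] - mean_s['F']

preferred_by_women = mean_s.sort_values(by='diff')                  # most negative first
preferred_by_men = mean_s.sort_values(by='diff', ascending=False)   # most positive first

# Largest gap regardless of direction.
largest_gap = mean_s['diff'].abs().sort_values(ascending=False)

print(preferred_by_women)
print(largest_gap)
```

Sorting with ascending=False gives the same ordering as reversing the ascending sort when there are no tied values.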

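If instead you only want the movies that divide viewers the most, regardless of gender, compute the variance or standard deviation of the ratings: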

>>> rating_std_by_title = data.groupby('title')['rating'].std()
>>> rating_std_by_title
title
$1,000,000 Duck (1971)                 1.092563
'Night Mother (1986)                   1.118636
'Til There Was You (1997)              1.020159
'burbs, The (1989)                     1.107760
...And Justice for All (1979)          0.878110
1-900 (1994)                           0.707107
10 Things I Hate About You (1999)      0.989815
101 Dalmatians (1961)                  0.982103
101 Dalmatians (1996)                  1.098717
12 Angry Men (1957)                    0.812731
13th Warrior, The (1999)               1.140421
187 (1997)                             1.057919
2 Days in the Valley (1996)            0.921592
20 Dates (1998)                        1.151943
20,000 Leagues Under the Sea (1954)    0.869685
...
Name: rating, Length: 3706, dtype: float64

Next, filter out the movies with fewer than 250 ratings:

>>> rating_std_by_title = rating_std_by_title.loc[active_titles]
>>> rating_std_by_title.sort_values(ascending=False)[:10]
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
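As a sanity check on what std() measures here, the sketch below uses made-up ratings where the answer is easy to reason about: Movie A splits viewers between 1 and 5, Movie B is rated almost uniformly, and Movie C has a single rating. The threshold of 2 ratings is a hypothetical stand-in for 250.

```python
import pandas as pd

# Made-up ratings: A is polarizing, B is consistent, C has one rating.
toy = pd.DataFrame({
    'title':  ['Movie A'] * 4 + ['Movie B'] * 4 + ['Movie C'] * 1,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3, 5],
})

# Count and sample standard deviation (pandas uses ddof=1) per title.
stats = toy.groupby('title')['rating'].agg(['count', 'std'])

# Drop titles with too few ratings, then rank by disagreement.
result = stats[stats['count'] >= 2].sort_values(by='std', ascending=False)

print(result)
```

A title with a single rating has an undefined sample standard deviation (NaN), which is another reason to filter by count before sorting.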
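A related aside: Series.rank assigns a rank to each value, and its method parameter controls how ties are handled:

>>> ser = pd.Series([3, 2, 0, 3], index=list('abcd'))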
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64
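The tie-breaking methods are easier to compare side by side. This sketch puts them in one table for the same Series and adds 'dense', a method the post does not show, in which tied values share a rank and the next rank is not skipped.

```python
import pandas as pd

ser = pd.Series([3, 2, 0, 3], index=list('abcd'))

# One column per tie-breaking method; 'a' and 'd' are tied at the value 3.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: ser.rank(method=m) for m in methods})

print(ranks)
```

With 'first', ties are broken by order of appearance, so a (which comes before d) gets rank 3 and d gets rank 4.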