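Data Analysis Example (USA.gov)
USA.gov Data from Bitly
In 2011, the URL shortening service Bitly partnered with the US government website USA.gov to provide a feed of anonymous data gathered from users who shorten links ending in .gov or .mil.
Taking an hourly snapshot as an example, each line of the file is a JSON object: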
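Use the json module and its loads function to load the downloaded data file line by line
import json
# path is the location of the downloaded data file
with open(path) as f:
    records = [json.loads(line) for line in f]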
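The loading step can be tried without the real file. The following is a minimal sketch that parses an in-memory string instead; the sample lines and the helper name load_json_lines are invented for illustration and are not taken from the actual data set.

```python
import io
import json

# Hypothetical sample lines, invented for illustration only.
sample_text = (
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
    '{"a": "GoogleMaps/RochesterNY"}\n'
    '\n'
)

def load_json_lines(file_obj):
    """Parse one JSON object per non-blank line of a file-like object."""
    return [json.loads(line) for line in file_obj if line.strip()]

sample_records = load_json_lines(io.StringIO(sample_text))
print(len(sample_records))
print(sample_records[0]["tz"])
```

Skipping blank lines is an extra precaution in this sketch; json.loads raises an error on an empty string.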
Counting Time Zones in Pure Python
Find the time zones that occur most often in the data set (the tz field)
time_zones = [rec['tz'] for rec in records]
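This raises a KeyError, because not all of the records have a time zone field. Adding the check if 'tz' in rec at the end of the list comprehension handles this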
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
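As an alternative to filtering, dict.get returns None for a missing key, which keeps the result aligned one-to-one with the records. A small sketch, using made-up records rather than the real data:

```python
# Made-up records for illustration; the second one has no 'tz' key.
sample_records = [{"tz": "America/New_York"}, {"a": "x"}, {"tz": ""}]

# Filtering drops records that lack the key.
present = [rec["tz"] for rec in sample_records if "tz" in rec]

# dict.get keeps one entry per record and uses None for missing keys.
tz_or_none = [rec.get("tz") for rec in sample_records]

print(present)
print(tz_or_none)
```

Note that an empty string is a present but unknown time zone, which is different from a missing one.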
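There are two ways to count the time zones: a harder way (using only the Python standard library) and an easier way (using pandas)
One way to count is to store the counts in a dictionary while iterating through the time zones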
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
# The same counting, written more briefly with collections.defaultdict
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
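The point of defaultdict is that when a key that is not in the dictionary is looked up with counts[key], it does not raise a KeyError but returns a default value instead. The default is whatever the factory function returns: list gives [], str gives the empty string, set gives set(), and int gives 0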
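The behavior of the different factory functions can be checked directly. This is a small sketch with made-up keys; it also shows that an index lookup inserts the missing key, while .get does not call the factory.

```python
from collections import defaultdict

# int factory: a missing key starts at int() == 0, so += 1 works immediately.
int_counts = defaultdict(int)
int_counts["a"] += 1

# list factory: a missing key starts as an empty list.
groups = defaultdict(list)
groups["x"].append(1)

# Indexing a missing key inserts it with the default value.
probe = defaultdict(str)
value = probe["missing"]

# .get does not invoke the factory and does not insert the key.
other = probe.get("other")

print(dict(int_counts), dict(groups), repr(value), other)
```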
To get the top 10 time zones and their counts
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
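top_counts returns the pairs in ascending order, with the largest last. If a descending list is preferred, one possible variant uses heapq.nlargest from the standard library. The function name top_counts_desc and the sample dictionary are assumptions made for this sketch.

```python
import heapq

def top_counts_desc(count_dict, n=10):
    """Return the n (key, count) pairs with the largest counts, largest first."""
    return heapq.nlargest(n, count_dict.items(), key=lambda kv: kv[1])

# Made-up counts for illustration.
sample_counts = {"a": 3, "b": 1, "c": 5}
top_two = top_counts_desc(sample_counts, 2)
print(top_two)
```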
The collections.Counter class from the standard library makes this task simpler
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
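A quick way to see what Counter does is to run it on a short list. The values below are placeholders, not real time zones from the data set.

```python
from collections import Counter

# Placeholder labels standing in for time zone strings.
zones = ["A", "B", "A", "", "A", "B"]
zone_counts = Counter(zones)

# most_common(n) returns (value, count) pairs, largest count first.
top_two = zone_counts.most_common(2)

# A key that was never seen counts as 0 instead of raising KeyError.
unseen = zone_counts["never seen"]

print(top_two, unseen)
```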
Counting Time Zones with pandas
import pandas as pd
frame = pd.DataFrame(records)
frame.info()
Use the value_counts method on the Series
tz_counts = frame['tz'].value_counts()
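This data can be visualized with matplotlib. First, fill in a substitute value for unknown or missing time zones in the records. The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing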
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
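The two replacement steps treat different cases, which is easier to see on a tiny Series. This sketch uses made-up values and assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Made-up values: two real zones, two empty strings, one missing value.
tz = pd.Series(["America/New_York", "", np.nan, "America/New_York", ""])

# Step 1: missing values (NA) become 'Missing'.
clean = tz.fillna("Missing")

# Step 2: empty strings are selected with a boolean mask and become 'Unknown'.
clean[clean == ""] = "Unknown"

counts = clean.value_counts()
print(counts)
```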
Create a horizontal bar plot with the seaborn package
import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)
The a field contains information about the browser, device, or application used to perform the URL shortening
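Parsing all of the information in these "agent" strings is tedious work. One strategy is to split off the first token of the string (which corresponds roughly to the browser) and build another summary of user behavior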
results = pd.Series([x.split()[0] for x in frame.a.dropna()])
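The same first-token extraction can also be written with pandas string methods instead of a list comprehension. A sketch with made-up agent strings, assuming pandas is installed:

```python
import pandas as pd

# Made-up agent strings for illustration; one entry is missing.
agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1)",
    None,
    "GoogleMaps/RochesterNY",
    "Mozilla/5.0 (Macintosh)",
])

# Drop missing agents, split on whitespace, keep the first token.
first_tokens = agents.dropna().str.split().str[0]
browser_counts = first_tokens.value_counts()
print(browser_counts)
```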
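Suppose you want to decompose the time zone statistics into Windows and non-Windows users. For simplicity, assume that a user is on Windows if the string "Windows" appears in the agent string. Since some of the agents are missing, first exclude them from the data (taking a copy avoids assigning into a view of frame)
cframe = frame[frame.a.notnull()].copy()
Then compute a value for whether or not each row is Windows
import numpy as np
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')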
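The labeling step can be checked on a small frame. This sketch uses invented agent strings and assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Invented agent strings; the last row has no agent.
df = pd.DataFrame({
    "a": ["Mozilla/5.0 (Windows NT 6.1)", "Mozilla/5.0 (Macintosh)", None],
})

# Keep rows with an agent; copy so the new column is added to an independent frame.
sub = df[df["a"].notna()].copy()

# np.where picks 'Windows' where the condition is True, 'Not Windows' elsewhere.
sub["os"] = np.where(sub["a"].str.contains("Windows"), "Windows", "Not Windows")
print(sub)
```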
Group the data by its time zone column and the new operating system column
by_tz_os = cframe.groupby(['tz', 'os'])
The group counts, analogous to the value_counts function, can be computed with size. The result is then reshaped into a table with unstack
agg_counts = by_tz_os.size().unstack().fillna(0)
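To see what size followed by unstack produces, here is a sketch on a four-row frame with placeholder labels. It assumes pandas is installed.

```python
import pandas as pd

# Placeholder time zone labels 'A' and 'B'.
df = pd.DataFrame({
    "tz": ["A", "A", "B", "A"],
    "os": ["Windows", "Windows", "Not Windows", "Not Windows"],
})

# size() counts rows per (tz, os) pair; unstack() moves 'os' into the columns.
# Pairs that never occur become NaN, which fillna(0) turns into 0.
agg = df.groupby(["tz", "os"]).size().unstack().fillna(0)
print(agg)
```

Passing fill_value=0 to unstack is another option; it avoids the intermediate NaN values, so the counts stay integers.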
To select the top overall time zones, construct an indirect index array from the row counts in agg_counts
indexer = agg_counts.sum(axis=1).argsort()
Use take to select the rows in that order, then slice off the last 10 rows (the largest values)
count_subset = agg_counts.take(indexer[-10:])
pandas has a convenience method called nlargest that does the same thing
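The argsort/take approach and nlargest can be compared on a small table. This is a sketch with made-up counts, not the original code; it assumes pandas is installed. Note that the two approaches return the rows in opposite orders.

```python
import pandas as pd

# Made-up counts for three placeholder time zones.
agg = pd.DataFrame(
    {"Not Windows": [1.0, 5.0, 0.0], "Windows": [2.0, 1.0, 4.0]},
    index=["A", "B", "C"],
)
totals = agg.sum(axis=1)  # A: 3, B: 6, C: 4

# Indirect sort: positions that would sort the totals in ascending order.
order = totals.to_numpy().argsort()
top_take = agg.take(order[-2:])  # last two positions, largest last

# nlargest returns the largest totals directly, largest first.
top_nlargest = totals.nlargest(2)

print(top_take)
print(top_nlargest)
```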
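To plot the counts broken down by operating system, rearrange the data and pass an additional argument (hue) to seaborn's barplot function. With hue, seaborn draws the bars for each group side by side rather than stacked
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]
# hue='os' is the additional argument; it splits each time zone's bar by operating system
sns.barplot(x='total', y='tz', hue='os', data=count_subset)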
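This plot does not make it easy to see the relative percentage of Windows users in the smaller groups, so normalize the group percentages to sum to 1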
def norm_total(group):
    group['normed_total'] = group.total / group.total.sum()
    return group
results = count_subset.groupby('tz').apply(norm_total)
sns.barplot(x='normed_total', y='tz', hue='os', data=results)
The normalized sum can also be computed more efficiently with the groupby transform method
g = count_subset.groupby('tz')
results2 = count_subset.total / g.total.transform('sum')
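The transform approach works because transform('sum') returns one value per original row (the total of that row's group), so the division lines up row by row. A sketch on made-up totals, assuming pandas is installed:

```python
import pandas as pd

# Made-up totals for two placeholder time zones.
counts = pd.DataFrame({
    "tz": ["A", "A", "B", "B"],
    "os": ["Not Windows", "Windows", "Not Windows", "Windows"],
    "total": [1.0, 3.0, 2.0, 2.0],
})

# Group total broadcast back to every row of the group.
group_totals = counts.groupby("tz")["total"].transform("sum")

# Each row's share of its own time zone; shares within a zone sum to 1.
counts["normed_total"] = counts["total"] / group_totals
print(counts)
```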