[Python/R语言] 用R和python解决数据分析120题
2020-04-16 本文已影响0人
半为花间酒
转载请注明:陈熹 chenx6542@foxmail.com (简书号:半为花间酒)
若公众号内转载请联系公众号:早起Python
题源:
早起python 《Pandas进阶修炼120题》数据:
https://pan.baidu.com/s/1UoN4qcj4Fzbk1EWBagd-8g
提取码:b7h7数据分析120题系列:
为什么出这个专题:
R语言和pandas都是数据处理的重要工具
而二者的高下争论时有存在
我相信对于数据而言没有绝对的孰优孰劣
需要做的应该是在必要时权衡最合适的办法感谢 公众号
早起python
提供数据分析120题
这些题目是一个契机
帮助我比较了两种语言处理不同问题的共性
当然也发现了各自的灵活和缺陷它们覆盖多数数据分析初期可能遇到的问题
无论是对R语言还是对python技能的提升
相信都有很大帮助(陈熹 2020年4月)
![](https://img.haomeiwen.com/i22672581/d51ff36d6c407497.png)
- python解法
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\chenx\Documents\Data Analysis\Pandas51-80.xls')
- R解法
可以用openxlsx包或java环境的xlsx包读,详见21题
也可以用非常智能的rio包
还能简化直接用readr包读csv
rio::import('C:/Users/chenx/Documents/Data Analysis/Pandas51-80.xls')
# 也可以转存csv后再读
library(readr)
df <- read_csv('C:/Users/chenx/Documents/Data Analysis/Pandas51-80.csv')
# rio包支持文件转化,但数据中的中文会出现乱码,推荐英文可以用以下方法
rio::convert('test.xlsx','test.csv')
![](https://img.haomeiwen.com/i22672581/e49e82349fefd284.png)
- python解法
df.head(3)
- R解法
head(df,3)
![](https://img.haomeiwen.com/i22672581/8e8c6d10feab7b4f.png)
![](https://img.haomeiwen.com/i22672581/bda4567401febaed.png)
- python解法
df.isnull().sum()
- R解法
colSums(is.na(df))
![](https://img.haomeiwen.com/i22672581/fc4b8d56b2e2a238.png)
![](https://img.haomeiwen.com/i22672581/91d07a7dc6c8a910.png)
- python解法
df[df['日期'].isnull()]
- R解法
df[is.na(df$日期),]
![](https://img.haomeiwen.com/i22672581/c483c2f784d2351e.png)
![](https://img.haomeiwen.com/i22672581/80c86dc2a9a264be.png)
- python解法
for i in df.columns:
if df[i].count() != len(df):
row = df[i][df[i].isnull().values].index.tolist()
print('列名:"{}", 第{}行位置有缺失值'.format(i,row))
- R解法
library(glue)
for (i in names(df)){
if(sum(is.na(df[,'日期'])) != 0){
res1 <- which(is.na(df[,i]))
res2 <- paste(res1,collapse = ',')
print(glue('列名:"{i}", 第[{res2}]行有缺失值'))
}
}
(注:pandas的行数是从0开始计算的)
![](https://img.haomeiwen.com/i22672581/3f2231ded98abaad.png)
![](https://img.haomeiwen.com/i22672581/85dc8a6fed36200a.png)
- python解法
df.dropna(axis=0, how='any', inplace=True)
- R解法
df <- na.omit(df)
![](https://img.haomeiwen.com/i22672581/b0dc0b6cdff2d302.png)
![](https://img.haomeiwen.com/i22672581/ad40ff27235dd2f0.png)
- python解法
# Jupyter运行matplotlib
%matplotlib inline
df['收盘价(元)'].plot()
# 等价于
import matplotlib.pyplot as plt
plt.plot(df['收盘价(元)'])
- R解法
library(ggplot2)
df %>%
ggplot(aes(日期,`收盘价(元)`)) +
geom_line()
![](https://img.haomeiwen.com/i22672581/c51ec68ba53ff8ed.png)
![](https://img.haomeiwen.com/i22672581/c29277a4c153edc3.png)
- python解法
plt.rcParams['font.sans-serif'] = ['SimHei'] # 解决中文乱码
plt.rcParams['axes.unicode_minus'] = False # 解决符号问题
df[['收盘价(元)','开盘价(元)']].plot()
- R解法
df %>%
ggplot() +
geom_line(aes(日期,`收盘价(元)`), size=1.2, color='steelblue') +
geom_line(aes(日期,`开盘价(元)`), size=1.2, color='orange') +
ylab(c('价格(元)'))
# 这种画出来没有图例,当然可以手动添加,但为了映射方便可以用另一种方法
library(tidyr)
df %>%
select(日期,`开盘价(元)`,`收盘价(元)`) %>%
pivot_longer(c(`开盘价(元)`,`收盘价(元)`),
names_to='type',values_to='price') %>%
ggplot(aes(日期,price,color=type)) +
geom_line(size=1.2) +
scale_color_manual(values=c('steelblue','orange')) +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.title = element_blank(),
legend.position = c(0.86, 0.9)
)
![](https://img.haomeiwen.com/i22672581/476b044321bab168.png)
![](https://img.haomeiwen.com/i22672581/64f5053db9979035.png)
- python解法
plt.hist(df['涨跌幅(%)'])
# 等价于
df['涨跌幅(%)'].hist()
- R解法
df %>%
ggplot(aes(`涨跌幅(%)`)) +
geom_histogram()
# 可以指定bins
![](https://img.haomeiwen.com/i22672581/cd1901e366291d4c.png)
![](https://img.haomeiwen.com/i22672581/42ec41db8e641047.png)
- python解法
df['涨跌幅(%)'].hist(bins = 30)
- R解法
df %>%
ggplot(aes(`涨跌幅(%)`)) +
geom_histogram(bins=30)
这是很有意义的一次结果
可以发现matplotlib和ggplot2做直方图采取的划分策略是不同的
![](https://img.haomeiwen.com/i22672581/32128a1839c4e063.png)
![](https://img.haomeiwen.com/i22672581/87ca6c2df6937401.png)
- python解法
temp = pd.DataFrame(columns = df.columns.to_list())
- R解法
temp <- as_tibble(names(df))
![](https://img.haomeiwen.com/i22672581/fe6e2c9687627cc2.png)
- python解法
for index,row in df.iterrows():
if type(row[13]) != float:
temp = temp.append(df.loc[index])
- R解法
换手率这一列属性为chr,需要先强转数值型
如果转换失败会变成NA,判断即可
df[is.na(as.numeric(df$`换手率(%)`)),]
![](https://img.haomeiwen.com/i22672581/62407e6822d75e02.png)
![](https://img.haomeiwen.com/i22672581/5d9b1be92acc86c4.png)
- python解法
df[df['换手率(%)'] == '--']
- R解法
df %>%
filter(`换手率(%)` == '--')
结果同上一题
![](https://img.haomeiwen.com/i22672581/f8353ffabebb738d.png)
- python解法
df = df.reset_index(drop=True)
- R解法
rownames(df) <- NULL
# 如果是tibble则索引始终是按顺序
![](https://img.haomeiwen.com/i22672581/6662d6d3e2f39e2c.png)
- python解法
lst = []
for index,row in df.iterrows():
if type(row[13]) != float:
lst.append(index)
df.drop(labels=lst,inplace=True)
- R解法
df[!is.na(as.numeric(df$`换手率(%)`)),]
# 或者根据前几题的经验,非数字就是'--'
df <- df %>%
filter(`换手率(%)` != '--')
![](https://img.haomeiwen.com/i22672581/cc9d2fa04af701d0.png)
![](https://img.haomeiwen.com/i22672581/2a63bd8b188793e0.png)
- python解法
df['换手率(%)'].plot(kind='kde',xlim=(0,0.6))
- R解法
df$`换手率(%)` <- as.double(df$`换手率(%)`)
ggplot(df) +
geom_density(aes(`换手率(%)`))
![](https://img.haomeiwen.com/i22672581/cf71e70d327a7c4a.png)
![](https://img.haomeiwen.com/i22672581/f819ea94d33e0586.png)
- python解法
df['收盘价(元)'].diff()
- R解法
df %>%
summarise(delta = `收盘价(元)` - lag(`收盘价(元)`))
![](https://img.haomeiwen.com/i22672581/624cc40b5c4445ab.png)
![](https://img.haomeiwen.com/i22672581/3fa488cee899d8ce.png)
- python解法
data['收盘价(元)'].pct_change()
- R解法
df %>%
summarise(pct_change = (`收盘价(元)` - lag(`收盘价(元)`))/lag(`收盘价(元)`))
![](https://img.haomeiwen.com/i22672581/8b9a64d2f10cdc5a.png)
![](https://img.haomeiwen.com/i22672581/59c58dd7ca11e496.png)
- python解法
df.set_index('日期')
- R解法
df %>%
column_to_rownames(var='日期')
![](https://img.haomeiwen.com/i22672581/18cc73ad1ff4c0d2.png)
![](https://img.haomeiwen.com/i22672581/a1278f6b7ed4475b.png)
- python解法
df['收盘价(元)'].rolling(5).mean()
- R解法
利用RcppRoll包开窗
library(RcppRoll)
df %>%
transmute(avg_5 = roll_mean(`收盘价(元)`,n = 5,align="right",fill = NA))
![](https://img.haomeiwen.com/i22672581/5f880bc9a31c8708.png)
![](https://img.haomeiwen.com/i22672581/1c314d5b6cc526f6.png)
- python解法
df['收盘价(元)'].rolling(5).sum()
- R解法
df %>%
transmute(sum_5 = roll_sum(`收盘价(元)`,n = 5,align="right",fill = NA))
![](https://img.haomeiwen.com/i22672581/e62794dd174ee6fe.png)
![](https://img.haomeiwen.com/i22672581/122822c027c3cbfd.png)
- python解法
df['收盘价(元)'].plot()
df['收盘价(元)'].rolling(5).mean().plot()
df['收盘价(元)'].rolling(20).mean().plot()
- R解法
df %>%
mutate(avg_5 = roll_mean(`收盘价(元)`,n = 5,align="right",fill = NA),
avg_20 = roll_mean(`收盘价(元)`,n = 20,align="right",fill = NA)) %>%
ggplot() +
geom_line(aes(日期,`收盘价(元)`),color = 'steelblue',size = 1.2) +
geom_line(aes(日期,avg_5),color = 'orange',size = 1.2) +
geom_line(aes(日期,avg_20),color = 'green',size = 1.2)
![](https://img.haomeiwen.com/i22672581/b48e37c388b077ce.png)
![](https://img.haomeiwen.com/i22672581/9ee046231b58064b.png)
- python解法
进行重抽样前需要保证索引是日期(时间)格式
df = df.set_index('日期')
df['收盘价(元)'].resample('W').max()
- R解法
我不是专业做量化交易的
时间序列的处理不是很熟悉
用R和python的结果存在出入待后续解决
library(plyr)
res <- dlply(df,.(cut(日期,"1 week")),"[")
res_max <- sapply(res,function(n)max(n$`收盘价(元)`),simplify=TRUE)
as.data.frame(res_max)
![](https://img.haomeiwen.com/i22672581/5143823439bcd51f.png)
![](https://img.haomeiwen.com/i22672581/ada99b6587e364bf.png)
- python解法
df['收盘价(元)'].plot()
df['收盘价(元)'].resample('7D').max().plot()
- R解法
res %>%
rownames_to_column('date')
res$date <- as.Date(res$date)
ggplot(df) +
geom_line(aes(日期,`收盘价(元)`),color = 'steelblue',size = 1.2) +
geom_line(data = res, aes(date,res_max),
color = 'orange',size = 1.2)
![](https://img.haomeiwen.com/i22672581/f4feacc14bfee1a3.png)
![](https://img.haomeiwen.com/i22672581/d9808c15c999e463.png)
- python解法
df.shift(5)
- R解法
lag(df,5)
![](https://img.haomeiwen.com/i22672581/1c375a4e846be109.png)
![](https://img.haomeiwen.com/i22672581/64b87b10eaa72da4.png)
- python解法
df.shift(-5)
- R解法
lead(df,5)
结果类似上一题,故不展示
![](https://img.haomeiwen.com/i22672581/f32dd9f8e9354f8d.png)
- python解法
df['开盘价(元)'].expanding(min_periods=1).mean()
- R解法
R中没有expanding完全一致的函数
考虑到expanding实际功能就是累积均值
可以用cummean
但cummean的功能和我预想的不同
可能是包之间相互干扰
最后采用cumsum/1:n的形式完成本题
res <- df %>%
transmute(cummean = cumsum(`开盘价(元)`)/1:dim(df)[1])
![](https://img.haomeiwen.com/i22672581/10082037f40b8a6b.png)
![](https://img.haomeiwen.com/i22672581/e529e52eaad03d79.png)
- python解法
df['expanding Open mean']=df['开盘价(元)'].expanding(min_periods=1).mean()
df[['开盘价(元)', 'expanding Open mean']].plot(figsize=(16, 6))
- R解法
library(tidyr)
df %>%
cbind(res) %>%
dplyr::rename(Opening_Price = `开盘价(元)`,
Expanding_Open_Mean = cummean) %>%
select(日期,Opening_Price,Expanding_Open_Mean) %>%
pivot_longer(c(Opening_Price,Expanding_Open_Mean),
names_to = 'type',
values_to ='price') %>%
ggplot(aes(日期,price,color = type)) +
geom_line(size=1.2) +
scale_color_manual(values=c('orange','steelblue')) +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.title = element_blank(),
legend.position = c(0.9, 0.9)
)
![](https://img.haomeiwen.com/i22672581/35a6308cd51df48f.png)
![](https://img.haomeiwen.com/i22672581/478af21bebad9c24.png)
布林线指标,即BOLL指标,其英文全称是“Bollinger Bands”,布林线(BOLL)由约翰·布林先生创造,其利用统计原理,求出股价的标准差及其信赖区间
![](https://img.haomeiwen.com/i22672581/92c7f65305b29b61.png)
- python解法
df['former 30 days rolling Close mean']=df['收盘价(元)'].rolling(20).mean()
df['upper bound']=df['former 30 days rolling Close mean']+2*df['收盘价(元)'].rolling(20).std()
df['lower bound']=df['former 30 days rolling Close mean']-2*df['收盘价(元)'].rolling(20).std()
- R解法
df <- df %>%
mutate(avg_20 = roll_mean(`收盘价(元)`,n = 20,align="right",fill = NA),
upper_bound = avg_20 + 2 * roll_sd(`收盘价(元)`,n = 20,align="right",fill = NA),
lower_bound = avg_20 - 2 * roll_sd(`收盘价(元)`,n = 20,align="right",fill = NA))
算的结果对不对最后一题见分晓
![](https://img.haomeiwen.com/i22672581/d62ef0973c53f6d5.png)
- python解法
df[['收盘价(元)', 'former 30 days rolling Close mean','upper bound','lower bound' ]].plot(figsize=(16, 6))
- R解法
df %>%
dplyr::rename(former_30_days_rolling_Close_mean = avg_20,
Closing_Price = `收盘价(元)`) %>%
select(日期,Closing_Price,
former_30_days_rolling_Close_mean,upper_bound,lower_bound) %>%
pivot_longer(c(Closing_Price,former_30_days_rolling_Close_mean,upper_bound,lower_bound),
names_to = 'type',
values_to ='price') %>%
ggplot(aes(日期,price,color = type)) +
geom_line(size=1.2) +
scale_color_manual(values=c('steelblue','orange','red','green')) +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.title = element_blank(),
legend.position = c(0.6, 0.2)
)
![](https://img.haomeiwen.com/i22672581/9e65f41238bd5de9.png)