大数据 爬虫Python AI SqlPython小哥哥

药品是真的贵!利用Python对药品销售进行数据分析!

2019-04-18  本文已影响0人  14e61d025165

我们可以用Python中的numpy、pandas、matplotlib等包对数据进行可视化分析。

一、数据分析的基本步骤

image

欢迎加入新手技术交流基地:1004391443 群里有大牛解答,有资源,有源码,学不学的会就看你了!

二、提出问题

一切的数据分析目的是为了解决问题,问题明确了才能为后续的分析过程确定了方向,帮助我们有效的选取数据,分析研究。

本案的分析目标是从销售数据中分析出以下业务指标:

1)月均消费次数

2)月均消费金额

3)客单价

4)消费趋势

三、理解问题

1、导入数据

需提前安装numpy 、pandas包

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

import numpy as np
import pandas as pd

</pre>

加载数据文件

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

x1= pd.ExcelFile(r'D:\数据分析\朝阳医院2018年销售数据.xlsx', dtype='object')

</pre>

读取数据表

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF =x1.parse('Sheet1',dtype='object')

</pre>

2、查看数据基本状况

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.head(10)

</pre>

image

查看数据的基本大小

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.shape

</pre>

(6578, 7)

查看数据数据表中的每项的基本类型

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.dtypes

</pre>

image

用描述函数查看各项的基本状况

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.describe()

</pre>

image

通过以上操作我们可以了解到,该项目的有七项基本项目(购药时间、社保卡号、商品编码、商品名称、销售数量、应收金额、实收金额),项目行数为6578。

四、数据清洗

获得数据后,不能马上进行数据分析。往往第一手获得数据并不符合我们的数据分析的要求,而且数据表格可能会含有缺失值、异常值等,这使得我们数据分析产生偏差。这就要求我们在数据分析前需要对数据进行清洗。而且在数据分析中,有大约60%的时间花在数据清洗过程。

数据清洗步骤:

1.选择子集

2.列名重命名

3.缺失数据处理

4.数据类型转换

5.数据排序

6.异常值处理

1、选择子集

有事遇到项目较多,分析中只需用到某几项,我们需要选择数据集的子集作为研究对象。

本案不需要选择子集,下面示范怎么选择子集(一般方法):

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF1=salesDF.loc[0:10,'社保卡号':'销售数量']
salesDF1

</pre>

image

2、列名重命名

如果数据列名符合习惯名称和数据分析,这就需要对列名进行重命名。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

#要用花括号,购药时间->销售时间
namechang={'购药时间':'销售时间'}
salesDF.rename(columns = namechang,inplace=True) 
#参数inplace=True表示覆盖元数据集

</pre>

改好后,再次查看数据

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.head()

</pre>

image

3、缺失数据处理

对于缺失数据,如果缺失数据较少时我们可以直接删除数据。如果缺失数据较多,可以通过建立模型插值来填充数据。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

#查看缺失值的数量
salesDF[salesDF[['销售时间','社保卡号']].isnull().values == True]

</pre>

image

序号6574因为销售时间和社保卡号都缺失,所以出现了两次,需要去掉重复数据。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

NaDF = salesDF[salesDF[['销售时间','社保卡号']].isnull().values ==True].drop_duplicates()
NaDF

</pre>

image

查看缺失值数量

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

NaDF.shape

</pre>

(3, 7)

查看原数据规模

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.shape

</pre>

(6578, 7)

删除缺失值

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF = salesDF.dropna(subset=['销售时间','社保卡号'],how = 'any')
#how='any' 在给定的任何一列中有缺失值就删除,
#how='all' 在给定的所有列中都有缺失值就删除。

</pre>

数据删除后并不能自动更新数据序号,如未能更新会导致后续数据合并等操作出错。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

#重新更新序号,使得序号修改为从0到N按顺序的索引值
salesDF=salesDF.reset_index(drop=True)

</pre>

4、数据类型转换

把数量、金额项目从字符串类型转换为浮点型类型。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF['销售数量'] = salesDF['销售数量'].astype('float')
salesDF['应收金额'] = salesDF['应收金额'].astype('float')
salesDF['实收金额'] = salesDF['实收金额'].astype('float')
print('转换后的数据类型:\n',salesDF.dtypes)

</pre>

image

把日期项目从字符串类型转换为日期类型

销售日期中含有日期和星期,这里只需用到日期,此需要对数据分割。

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

#日期转换
def dateChange(dateLaw):
 dateList = [] #建立空列表
 for i in dateLaw:
 #例如2018-01-01 星期五,分割后为:2018-01-01
 str = i.split(' ')[0] #按空格分割,取第一列
 dateList.append(str)
 dateChangeOut = pd.Series(dateList) #数组化
 return dateChangeOut
dateChangeOut = dateChange(salesDF['销售时间'])
dateChangeOut

</pre>

0 2018-01-01

1 2018-01-02

2 2018-01-06

3 2018-01-11

4 2018-01-15

5 2018-01-20

6 2018-01-31

7 2018-02-17

8 2018-02-22

9 2018-02-24

10 2018-03-05

11 2018-03-05

12 2018-03-05

13 2018-03-07

14 2018-03-09

15 2018-03-15

16 2018-03-15

17 2018-03-15

18 2018-03-20

19 2018-03-22

20 2018-03-23

21 2018-03-24

22 2018-03-24

23 2018-03-28

24 2018-03-29

25 2018-04-05

26 2018-04-07

27 2018-04-13

28 2018-04-22

29 2018-05-01

...

6545 2018-04-05

6546 2018-04-05

6547 2018-04-09

6548 2018-04-10

6549 2018-04-10

6550 2018-04-10

6551 2018-04-12

6552 2018-04-13

6553 2018-04-13

6554 2018-04-14

6555 2018-04-15

6556 2018-04-15

6557 2018-04-15

6558 2018-04-15

6559 2018-04-16

6560 2018-04-17

6561 2018-04-18

6562 2018-04-21

6563 2018-04-22

6564 2018-04-24

6565 2018-04-25

6566 2018-04-25

6567 2018-04-25

6568 2018-04-26

6569 2018-04-26

6570 2018-04-27

6571 2018-04-27

6572 2018-04-27

6573 2018-04-27

6574 2018-04-28

Length: 6575, dtype: object

替换原销售时间数据

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF['销售时间'] = dateChangeOut

</pre>

查看替换情况

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF.head()

</pre>

image

在转换后查询是否产生新的缺失值

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

salesDF['销售时间'].isnull().any()

</pre>

False

没有产生新的缺失值

把销售时间的数据类型转换为日期型

<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

dateOut=pd.to_datetime(salesDF['销售时间'], format = '%Y-%m-%d', errors='coerce')
#format 是原始数据中日期的格式
#errors='cperce' 如果原始数据不符合日期的格式,这输出值为NaT
dateOut

</pre>

上一篇 下一篇

猜你喜欢

热点阅读