Python Beginner Study Notes
2020-11-07
Jason数据分析生信教室
Introduction to Data Science in Python -- DataCamp
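There are many books and references for getting started with Python, and their angle depends heavily on the author's background. Beginners should pick one based on what they want to use Python for. An author with a software-engineering background, for example, will lead you into the world of writing small programs. My goal here is manipulating and analyzing data, so the approach will be fairly similar to R.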
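What is a module?
- A collection of closely related tools, equivalent to a package in R
- Common examples
-- matplotlib: plotting
-- pandas: building and working with tabular data
-- scikit-learn: machine learning
-- scipy: mathematical computing
-- nltk: natural language processing
Loading a module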
import numpy as np
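The name after as is simply a shorter alias for the module. A minimal sketch of the same import pattern, using the standard-library statistics and math modules (chosen here only so that it runs without installing anything); the values list is made up for illustration:

```python
# Import a module under a short alias, as with "import numpy as np"
import statistics as st
# Import a single function directly from a module
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up numbers

mean_value = st.mean(values)     # functions are reached through the alias
std_value = st.pstdev(values)    # population standard deviation
root = sqrt(16)                  # imported names are used without a prefix

print(mean_value, std_value, root)
```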
Working with DataFrames in pandas
Once the working directory is set, load the CSV file:
# Import pandas under the alias pd
import pandas as pd
# Load the CSV "credit_records.csv"
credit_records = pd.read_csv('credit_records.csv')
# Display the first five rows of credit_records using the .head() method
print(credit_records.head())
            suspect         location              date         item  price
0    Kirstine Smith   Groceries R Us   January 6, 2018     broccoli   1.25
1      Gertrude Cox  Petroleum Plaza   January 6, 2018  fizzy drink   1.90
2  Fred Frequentist   Groceries R Us   January 6, 2018     broccoli   1.25
3      Gertrude Cox   Groceries R Us  January 12, 2018     broccoli   1.25
4    Kirstine Smith    Clothing Club   January 9, 2018        shirt  14.25
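credit_records.head(): shows the first five rows
credit_records.info(): shows an overview of the whole DataFrame (column names, non-null counts, data types). This is somewhat like str() in R; the closer match to R's summary() is credit_records.describe()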
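To try these methods without the original file, here is a minimal sketch that reads the same five rows shown in the printed output from an in-memory string. The io.StringIO stand-in is an assumption made for this example; with the real file you would pass the filename to pd.read_csv as above.

```python
import io
import pandas as pd

# In-memory stand-in for credit_records.csv (the five rows printed above).
# Dates contain commas, so they are quoted.
csv_text = '''suspect,location,date,item,price
Kirstine Smith,Groceries R Us,"January 6, 2018",broccoli,1.25
Gertrude Cox,Petroleum Plaza,"January 6, 2018",fizzy drink,1.90
Fred Frequentist,Groceries R Us,"January 6, 2018",broccoli,1.25
Gertrude Cox,Groceries R Us,"January 12, 2018",broccoli,1.25
Kirstine Smith,Clothing Club,"January 9, 2018",shirt,14.25
'''

credit_records = pd.read_csv(io.StringIO(csv_text))

# First three rows instead of the default five
print(credit_records.head(3))

# Column names, non-null counts and dtypes
credit_records.info()

# Summary statistics for the numeric columns
print(credit_records.describe())
```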
Selecting columns
There are two ways. The first uses ['column_name'] to specify the column:
items = credit_records['item']
print(items)
The second is .column_name:
items = credit_records.item
print(items)
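Both give the same result. The difference is that the .column_name form does not work when the column name contains spaces, unusual punctuation, or similar characters.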
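A small sketch of the difference, using a made-up DataFrame named pets (hypothetical, not part of the course data):

```python
import pandas as pd

# Hypothetical data, made up for illustration
pets = pd.DataFrame({
    "Age": [1, 4, 7],
    "Dog Breed": ["Poodle", "Beagle", "Poodle"],
})

# Both forms return the same Series for a simple column name
by_bracket = pets["Age"]
by_dot = pets.Age
same = by_bracket.equals(by_dot)
print(same)

# A name with a space only works with brackets;
# pets.Dog Breed would be a syntax error
breeds = pets["Dog Breed"]
print(breeds.tolist())
```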
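Extracting rows with logical conditions
This works much like it does in R.
Here mpr is a DataFrame, and Age, Status, and Dog Breed are some of its columns. Rows can be extracted with comparison operators such as ==, !=, >, and >=.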
# Select the dogs where Age is greater than 2
greater_than_2 = mpr[mpr.Age > 2]
print(greater_than_2)
# Select the dogs whose Status is equal to Still Missing
still_missing = mpr[mpr.Status=='Still Missing']
print(still_missing)
# Select all dogs whose Dog Breed is not equal to Poodle
not_poodle = mpr[mpr['Dog Breed']!='Poodle']
print(not_poodle)
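It can help to look at the intermediate boolean Series that the comparison produces. The sketch below uses a made-up stand-in for mpr (all values are hypothetical) and also shows how two conditions can be combined with &, which goes slightly beyond the examples above:

```python
import pandas as pd

# Hypothetical stand-in for the mpr DataFrame
mpr = pd.DataFrame({
    "Dog Name": ["Rex", "Bella", "Max", "Luna"],
    "Age": [1, 3, 5, 2],
    "Dog Breed": ["Poodle", "Beagle", "Poodle", "Husky"],
    "Status": ["Found", "Still Missing", "Found", "Still Missing"],
})

# A comparison on a column returns a boolean Series (the mask)
mask = mpr.Age > 2
print(mask)

# Indexing with the mask keeps only the rows where it is True
print(mpr[mask])

# Conditions combine with & (and) and | (or);
# each condition needs its own parentheses
older_and_missing = mpr[(mpr.Age > 2) & (mpr.Status == "Still Missing")]
print(older_and_missing)
```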
Simple plotting
Roughly three steps:
- Import the plotting tool
from xxx import xxx as xxx
- Build the figure
xxx.plot(x,y)
- Display the result
xxx.show()
# From matplotlib, import pyplot under the alias plt
from matplotlib import pyplot as plt
# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)
# Display Deshaun's plot
plt.show()
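The three steps can be sketched end to end as follows. The deshaun DataFrame here is a made-up stand-in, and two things are assumptions for the sake of a runnable example: the Agg backend is selected so the script works without a display, and the figure is written to a hypothetical file name instead of being shown. With an interactive backend you would drop the matplotlib.use line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
# Step 1: import the plotting tool
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical stand-in for the deshaun DataFrame
deshaun = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "hours_worked": [8, 9, 7, 10, 6],
})

# Step 2: build the figure
lines = plt.plot(deshaun.day_of_week, deshaun.hours_worked)
plt.title("Hours worked by day")
plt.ylabel("Hours worked")

# Step 3: output the result (plt.show() with an interactive backend)
plt.savefig("deshaun_hours.png")
```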
- Add a title
plt.title()
- Add a y-axis label
plt.ylabel()
- Add a legend
plt.legend()
Line plots
# Lines
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')
# Add a title
plt.title("Officer Deshaun's plot")
# Add y-axis label
plt.ylabel("hours_worked")
# Legend
plt.legend()
# Display plot
plt.show()
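- Annotate a specific coordinate with text
plt.text(x,y,"Info")
- linestyle sets the line style
-- dotted: ':'
-- dashed: '--'
-- solid: '-' (an empty string '' draws no line)
- marker sets the point style
-- circle: 'o'
-- diamond: 'd' (thin diamond; 'D' is the wider one)
-- square: 's'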
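A short sketch that combines linestyle, marker, and plt.text on made-up data. The variable names, the values, and the output file name are hypothetical, and the Agg backend is an assumption so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
from matplotlib import pyplot as plt

# Made-up values for two officers
days = [1, 2, 3, 4, 5]
hours_a = [8, 9, 7, 10, 6]
hours_b = [6, 7, 9, 8, 7]

# Dotted line with circle markers
line_a, = plt.plot(days, hours_a, linestyle=":", marker="o", label="A")
# Dashed line with square markers
line_b, = plt.plot(days, hours_b, linestyle="--", marker="s", label="B")

# Place a note at the data coordinates (4, 10)
note = plt.text(4, 10, "Longest shift")

plt.legend()
plt.savefig("line_styles.png")
```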
More plotting
Scatter plots
# Explore the data
print(cellphone.head())
# Create a scatter plot of the data from the DataFrame cellphone
plt.scatter(cellphone.x, cellphone.y)
# Add labels
plt.ylabel('Latitude')
plt.xlabel('Longitude')
# Display the plot
plt.show()
Bar charts
- Basic bar plot
# Display the DataFrame hours using print
print(hours)
# Create a bar plot from the DataFrame hours
plt.bar(hours.officer, hours.avg_hours_worked,
        # Add error bars
        yerr=hours.std_hours_worked)
# Display the plot
plt.show()
- Stacked bar plot
Set the bottom argument so that the second set of bars starts on top of the first:
# Plot the number of hours spent on desk work
plt.bar(hours.officer, hours.desk_work, label='Desk Work')
# Plot the hours spent on field work on top of desk work
plt.bar(hours.officer, hours.field_work, bottom=hours.desk_work, label="Field Work")
# Add a legend
plt.legend()
# Display the plot
plt.show()
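To see what bottom does, the sketch below builds a made-up hours DataFrame (values are hypothetical) and checks where each field-work bar starts and ends. The Agg backend and the output file name are assumptions so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical stand-in for the hours DataFrame
hours = pd.DataFrame({
    "officer": ["Deshaun", "Mengfei", "Aditya"],
    "desk_work": [25, 33, 27],
    "field_work": [20, 9, 16],
})

# First layer: bars start at 0
desk = plt.bar(hours.officer, hours.desk_work, label="Desk Work")
# Second layer: each bar starts where the desk-work bar ends
field = plt.bar(hours.officer, hours.field_work,
                bottom=hours.desk_work, label="Field Work")

# Lower edge and top of each field-work bar
starts = [float(bar.get_y()) for bar in field]
tops = [float(bar.get_y() + bar.get_height()) for bar in field]
print(starts)
print(tops)

plt.legend()
plt.savefig("stacked_hours.png")
```

The top of each stacked bar is the officer's total hours, which is why the first layer is passed as bottom for the second.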
Histograms
# Change the range to start at 5 and end at 35
plt.hist(puppies.weight,
         range=(5, 35))
# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
# Display
plt.show()
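The effect of range is easier to see from the values plt.hist returns. In this sketch the weights are made up, the bins argument is an addition not used above, and the Agg backend and file name are assumptions so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
from matplotlib import pyplot as plt

# Made-up puppy weights in lbs; 3 and 40 fall outside the range below
weights = [3, 6, 8, 12, 14, 15, 21, 22, 28, 34, 40]

# plt.hist returns the bin counts, the bin edges, and the drawn patches.
# Values outside range=(5, 35) are ignored.
counts, edges, patches = plt.hist(weights, bins=3, range=(5, 35))
print(counts)  # number of values in each bin
print(edges)   # bin boundaries

plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
plt.savefig("puppy_weights.png")
```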
With this, the most basic data manipulation and plotting are covered, and you can move on to more advanced topics.