Python Beginner Study Notes

2020-11-07 Jason数据分析生信教室

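Introduction to Data Science in Python (DataCamp)
There are many books and references for learning Python, and their angle varies a lot with the author's background, so beginners should pick one that matches what they want to use Python for. An author with a software-engineering background, for example, will lead you into writing small programs. My goal here is data manipulation and data analysis, so the approach is close to how one would work in R.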

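What is a module

A module is a collection of closely related tools, comparable to a package in R.
Common examples
-- matplotlib: plotting
-- pandas: building and manipulating tabular data
-- scikit-learn: machine learning
-- scipy: scientific and mathematical computing
-- nltk: natural language processing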

Loading a module

import numpy as np
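
As a minimal sketch of what the alias is for: once a module is imported as np, everything in it is reached through that short prefix. The array values below are made up for illustration.

```python
import numpy as np  # np is only a conventional short alias

# Hypothetical values, used purely for illustration
hours = np.array([1.0, 2.0, 3.0])

# Functions from the module are called through the alias
average = np.mean(hours)
print(average)
```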

Working with DataFrames in pandas

Once the working directory is set so that the CSV file can be found:

# Import pandas under the alias pd
import pandas as pd

# Load the CSV "credit_records.csv"
credit_records = pd.read_csv('credit_records.csv')

# Display the first five rows of credit_records using the .head() method
print(credit_records.head())
            suspect         location              date         item  price
0    Kirstine Smith   Groceries R Us   January 6, 2018     broccoli   1.25
1      Gertrude Cox  Petroleum Plaza   January 6, 2018  fizzy drink   1.90
2  Fred Frequentist   Groceries R Us   January 6, 2018     broccoli   1.25
3      Gertrude Cox   Groceries R Us  January 12, 2018     broccoli   1.25
4    Kirstine Smith    Clothing Club   January 9, 2018        shirt  14.25

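credit_records.head(): shows the first five rows
credit_records.info(): gives an overview of the whole dataset (column names, non-null counts, and dtypes). This is roughly what str() gives you in R; the closer match to R's summary() is .describe()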
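
The following sketch shows these inspection methods side by side without needing the CSV file. The DataFrame is a hypothetical stand-in built from an in-memory string, and .describe() is added for comparison; it does not appear in the original notes.

```python
import io
import pandas as pd

# Hypothetical stand-in for credit_records.csv, so the example runs without the file
csv_text = """suspect,location,item,price
Kirstine Smith,Groceries R Us,broccoli,1.25
Gertrude Cox,Petroleum Plaza,fizzy drink,1.90
Fred Frequentist,Groceries R Us,broccoli,1.25
"""

# read_csv accepts a file-like object as well as a path
records = pd.read_csv(io.StringIO(csv_text))

print(records.head(2))     # first two rows instead of the default five
records.info()             # column names, non-null counts, dtypes (prints directly)
print(records.describe())  # summary statistics for the numeric columns
```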

Selecting columns

There are two ways. The first is to specify the column with ['column name']:

items = credit_records['item']
print(items)

The second is dot notation, .column_name:

items = credit_records.item
print(items)

The result is the same. The difference is that dot notation cannot be used when the column name contains spaces or unusual punctuation.
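
A small sketch of that difference, using a hypothetical DataFrame whose second column name contains a space:

```python
import pandas as pd

# Hypothetical data; "Dog Breed" deliberately contains a space
dogs = pd.DataFrame({
    "Age": [1, 4, 7],
    "Dog Breed": ["Poodle", "Beagle", "Pug"],
})

# Both notations return the same column when the name is a valid identifier
ages_bracket = dogs["Age"]
ages_dot = dogs.Age
print(ages_bracket.equals(ages_dot))

# Only brackets work for a name with a space; dogs.Dog Breed is a syntax error
breeds = dogs["Dog Breed"]
print(breeds)
```

Bracket notation is also the safer choice when a column name happens to match a DataFrame method such as count, because the dot form would then return the method rather than the column.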

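Filtering rows with logical conditions

This works much like R.
Here mpr is a DataFrame, and Age, Status, and Dog Breed are some of its columns. Rows can be extracted with comparison operators such as ==, !=, >, and >=.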

# Select the dogs where Age is greater than 2
greater_than_2 = mpr[mpr.Age > 2]
print(greater_than_2)

# Select the dogs whose Status is equal to Still Missing
still_missing = mpr[mpr.Status=='Still Missing']
print(still_missing)

# Select all dogs whose Dog Breed is not equal to Poodle
not_poodle = mpr[mpr['Dog Breed']!='Poodle']
print(not_poodle)
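
To make the mechanism visible, here is a self-contained sketch with a hypothetical stand-in for mpr. The comparison produces a True/False value per row, and indexing with it keeps the True rows. Combining conditions with & is an addition that the original notes do not cover.

```python
import pandas as pd

# Hypothetical stand-in for the mpr DataFrame used above
mpr = pd.DataFrame({
    "Age": [1, 3, 5, 2],
    "Status": ["Found", "Still Missing", "Still Missing", "Found"],
    "Dog Breed": ["Poodle", "Beagle", "Poodle", "Pug"],
})

# The comparison yields a Series of True/False values, one per row
mask = mpr.Age > 2
print(mask)

# Indexing with the mask keeps only the rows where it is True
print(mpr[mask])

# Conditions are combined with & (and) or | (or); each needs its own parentheses
old_and_missing = mpr[(mpr.Age > 2) & (mpr.Status == "Still Missing")]
print(old_and_missing)
```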

Simple plotting

Roughly three steps:

Import the plotting tool: from xxx import xxx as xxx
Build the figure: xxx.plot(x, y)
Show the result: xxx.show()

# From matplotlib, import pyplot under the alias plt
from matplotlib import pyplot as plt
# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)
# Display Deshaun's plot
plt.show()

Add a title: plt.title()
Add a y-axis label: plt.ylabel()
Add a legend: plt.legend()

Line plot

# Lines
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')

# Add a title
plt.title("Officer Deshaun's plot")

# Add y-axis label
plt.ylabel("Hours worked")

# Legend
plt.legend()
# Display plot
plt.show()

Annotate a point on the plot: plt.text(x, y, "Info")

linestyle sets the line style
dotted: ':'
dashed: '--'
solid: '-'
marker sets the point style
circle: 'o'
diamond: 'd' (thin diamond; 'D' gives a full-width one)
square: 's'
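
A short sketch that combines these options, using made-up data. The variable names and the annotation text are hypothetical.

```python
from matplotlib import pyplot as plt

# Hypothetical data for illustration
days = [1, 2, 3, 4, 5]
hours_a = [8, 7, 9, 6, 8]
hours_b = [6, 8, 7, 9, 5]

# Dashed line with circle markers
(line_a,) = plt.plot(days, hours_a, linestyle="--", marker="o", label="A")
# Dotted line with square markers
(line_b,) = plt.plot(days, hours_b, linestyle=":", marker="s", label="B")

# Place a note at data coordinates (3, 9)
note = plt.text(3, 9, "Peak for A")

plt.legend()
```

Calling plt.show() afterwards displays the figure, as in the other examples.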


More plotting

Scatter plot

# Explore the data
print(cellphone.head())

# Create a scatter plot of the data from the DataFrame cellphone
plt.scatter(cellphone.x, cellphone.y)

# Add labels
plt.ylabel('Latitude')
plt.xlabel('Longitude')

# Display the plot
plt.show()

Bar plot

# Display the DataFrame hours using print
print(hours)

# Create a bar plot from the DataFrame hours
plt.bar(hours.officer, hours.avg_hours_worked,
        # Add error bars
       yerr=hours.std_hours_worked)

# Display the plot
plt.show()

Stacked bar plot
Pass the bottom argument so that the second set of bars starts where the first set ends.

# Plot the number of hours spent on desk work
plt.bar(hours.officer, hours.desk_work, label='Desk Work')

# Plot the hours spent on field work on top of desk work
plt.bar(hours.officer,hours.field_work,bottom=hours.desk_work,label="Field Work")

# Add a legend
plt.legend()

# Display the plot
plt.show()

Histogram

# Change the range to start at 5 and end at 35
plt.hist(puppies.weight,
        range=(5, 35))

# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')

# Display
plt.show()

With this, the most basic data manipulation and plotting are covered, and you can move on to more advanced material.