Web Crawler Study Notes

2020-03-03  zfz_amzing

Web Crawling

Python Review

Data Types

File Operations

A file opened with mode "w" is saved when it is closed, but reopening it with "w" truncates the file, so new writes overwrite the previous contents.

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()

# File contents: 快活啊

Switching to mode "a" appends to the existing contents instead.

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "a", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()
# File contents: 来啊快活啊

Exception Handling
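This section is blank in the notes; as a minimal sketch, Python handles errors with try/except/finally (the safe_divide helper below is made up for illustration):

```python
def safe_divide(a, b):
    """Divide a by b, returning None instead of raising on division by zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        # runs only when b == 0
        result = None
    finally:
        # runs whether or not an exception occurred
        pass
    return result

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # None
```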

Object-Oriented Programming

A method in a class can be called on an instance when its first parameter is self (which receives the instance automatically).

class Father:
    def speak(self):
        print("I can speak")


class Mother:
    def write(self):
        print("I can write")


class Son(Father):
    def speak(self):
        print("I can speak well")


class Daughter(Father, Mother):
    pass


father = Father()
father.speak()
son = Son()
son.speak()
daughter = Daughter()
daughter.speak()
daughter.write()


# I can speak
# I can speak well
# I can speak
# I can write

Regular Expressions

Regular expressions can filter data and extract the information we care about.

Atoms

Regular expressions in Python are used through the re module.

An atom is the most basic unit of a regular expression, and every regular expression contains at least one atom. Common atom types include:

  1. Ordinary characters as atoms

    import re
    
    string = "xixihaha"
    # ordinary characters as atoms
    pat = "ih"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(3, 5), match='ih'>
    
    
  2. Non-printing characters as atoms

    # non-printing characters as atoms
    # \n matches a newline, \t a tab
    string = '''xixi
    haha
    '''
    pat = "\n"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(4, 5), match='\n'>
    
  3. General character atoms

    • \w matches letters, digits, and underscores; \W matches anything else
    • \d matches decimal digits; \D matches non-digits
    • \s matches whitespace characters; \S matches non-whitespace
    # general character atoms
    string = '''xixiha123haha'''
    pat = r"\d\d"
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(6, 8), match='12'>
    
  4. Atom tables: grouping different atoms into one class

    string = '''zhangfuzhi'''
    pat = "zhang[efg]u" # the class [efg] contains 'e', 'f', and 'g'; if any one of them matches at that position, the match succeeds
    result = re.search(pat, string)
    print(result)
    # <re.Match object; span=(0, 7), match='zhangfu'>
    # Counter-example:
    string = '''zhangfuzhi'''
    pat = "zhang[eg]u"
    result = re.search(pat, string)
    print(result)
    # None
    

Metacharacters

Metacharacters are characters with special meaning in a regular expression, such as repeating the preceding character N times.

Pattern Modifiers (Flags)

Flags change how a pattern is applied; for example, re.I makes matching case-insensitive:

string = '''Python'''
pat = "pyt"
result = re.search(pat, string, re.I)
print(result)
# <re.Match object; span=(0, 3), match='Pyt'>

Greedy vs. Lazy Matching

Greedy matching tries to match as much as possible; lazy matching tries to match as little as possible. Matching is greedy by default.

string = '''Pythony'''
pat = "p.*y"  # greedy mode
pat2 = "p.*?y"  # the trailing ? switches the quantifier to lazy mode
result = re.search(pat, string, re.I)
result2 = re.search(pat2, string, re.I)
print(result)
print(result2)

# <re.Match object; span=(0, 7), match='Pythony'>
# <re.Match object; span=(0, 2), match='Py'>

Regular Expression Functions
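This section is empty in the notes; a quick sketch of the common re functions, including the search and findall calls used in the examples below:

```python
import re

s = "cat bat cat"

# re.match anchors at the start of the string
print(re.match(r"bat", s))           # None
# re.search scans for the first match anywhere
print(re.search(r"bat", s).group())  # bat
# re.findall returns every non-overlapping match as a list
print(re.findall(r"[cb]at", s))      # ['cat', 'bat', 'cat']
# re.sub replaces every match
print(re.sub(r"cat", "dog", s))      # dog bat dog
```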

Regex Examples

Matching .com or .cn

string = "<a href='http://www.baidu.com'>百度</a><a href='http://www.jd.com'>百度</a>"
pat = r"[a-zA-Z]+://[^\s]*?(?:\.com|\.cn)"  # (?:\.com|\.cn) requires the match to end in a literal .com or .cn
result = re.compile(pat).findall(string)
print(result)
# ['http://www.baidu.com', 'http://www.jd.com']

Matching phone numbers

string = "dashd0534-5657888dasbd a0534-5325695asdshgiusae//.001-12345678"
pat = r"\d{4}-\d{7}|\d{3}-\d{8}"
result = re.compile(pat).findall(string)
print(result)
# ['0534-5657888', '0534-5325695', '001-12345678']

A Simple Exercise

Scrape publisher names and write them to a file

import re
import urllib.request

url = "https://read.douban.com/provider/all"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36 Edg/80.0.361.62'}
ret = urllib.request.Request(url, headers=header)
res = urllib.request.urlopen(ret)
data = res.read().decode('utf-8')

# <div class="name">重庆大学出版社</div>
pat = ">(.{1,10}?出版社)"
result = re.compile(pat).findall(data)
print(result)
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
for i in result:
    fh.write(i + '\n')
fh.close()

The Requests Library

The BeautifulSoup Library

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

XPath

XPath is a language for finding information in XML documents.

Nodes

Element, attribute, text, namespace, and document (root) nodes.

Node relationships

XPath Syntax

| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of the named node |
| / | Selects direct children of the current node |
| // | Selects descendants of the current node at any depth |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
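The expressions above can be tried without third-party libraries: Python's built-in xml.etree.ElementTree supports a limited XPath subset (the bookstore XML below is made-up sample data):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<bookstore>
  <book category="web"><title>XPath Basics</title></book>
  <book category="cooking"><title>Everyday Meals</title></book>
</bookstore>
""")

# nodename: all <book> children of the root
print(len(root.findall("book")))                          # 2
# //: <title> descendants at any depth
print(root.findall(".//title")[0].text)                   # XPath Basics
# @: filter elements by attribute value
print(root.find("book[@category='cooking']/title").text)  # Everyday Meals
```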