Extracting URLs from HTML with BeautifulSoup

2020-06-15 Mottil

Use the BeautifulSoup class from the bs4 package

from bs4 import BeautifulSoup

To extract links from a local HTML file, open the file first

file=open("xx.html",'r',encoding='utf-8')

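The mode argument of Python's open() controls read and write access: 'r' opens for reading, 'w' opens for writing and truncates the file, 'a' opens for appending, and adding '+' allows both reading and writing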

Commonly used methods of the file object include read(), readline(), readlines(), write(), and close()
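
As a quick illustration of the file modes and methods mentioned above, here is a minimal sketch that writes a scratch file with 'w' and 'a' and reads it back. The file name demo.txt and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w' creates the file, or truncates it if it already exists
with open(path, 'w', encoding='utf-8') as f:
    f.write("first\n")

# 'a' appends to the end and keeps the existing content
with open(path, 'a', encoding='utf-8') as f:
    f.write("second\n")

# 'r' opens for reading; readlines() returns a list of lines
with open(path, 'r', encoding='utf-8') as f:
    lines = f.readlines()

# Opening with 'w' again discards everything written so far
with open(path, 'w', encoding='utf-8') as f:
    f.write("third\n")

# read() returns the whole file as one string
with open(path, 'r', encoding='utf-8') as f:
    after_truncate = f.read()

print(lines)
print(after_truncate)
```

The with statement closes each file automatically, so no explicit close() call is needed in this sketch.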

Read the entire contents of the HTML file

html=file.read()

Create a BeautifulSoup object to parse the content, using Python's built-in html.parser

Be=BeautifulSoup(html,'html.parser')
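
To see what the parsed object offers, the following small sketch parses an inline HTML string instead of a file. The HTML fragment and the variable names are hypothetical and serve only as an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used here instead of the contents of xx.html
sample_html = """
<html><body>
  <a href="/s/page1">First</a>
  <a href="/s/page2">Second</a>
  <a name="anchor-only">No link</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all('a') returns every a tag in document order
links = soup.find_all('a')

# get('href') returns None when the attribute is absent
hrefs = [a.get('href') for a in links]
print(hrefs)
```

Note that the third tag has no href attribute, so get('href') yields None for it. Any code that concatenates strings with the result should check for that case.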

Loop over all a tags, read the link from each tag's "href" attribute, prepend the domain, and write the complete URL to a new file

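for k in Be.find_all('a'):
    ur=k.get('href')
    # Skip a tags that have no href attribute; get() returns None for them
    if ur is None:
        continue
    urls="xx.com"+ur
    # Mode "a" appends. The file is reopened on every iteration, so
    # opening it with "w" instead would overwrite what was written before
    urlfile=open("wechatURls.txt",'a',encoding='utf-8')
    urlfile.write(urls+'\n')
    urlfile.close()
file.close()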
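
The steps above could also be arranged as two small functions, shown here as a sketch rather than a definitive implementation. The function names extract_urls and save_urls are hypothetical, and so is the base URL "https://xx.com/" (the scheme is an assumption, since the original only gives "xx.com"). This version swaps plain string concatenation for urljoin from the standard library and uses with statements so the files are closed automatically.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_urls(html, base_url):
    """Return the absolute URL of every a tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for a in soup.find_all('a'):
        href = a.get('href')
        if href is None:  # a tag without an href attribute
            continue
        # urljoin resolves relative paths against base_url and
        # leaves already-absolute links unchanged
        urls.append(urljoin(base_url, href))
    return urls

def save_urls(html_path, out_path, base_url):
    """Read an HTML file and append its links to out_path, one per line."""
    with open(html_path, 'r', encoding='utf-8') as f:
        urls = extract_urls(f.read(), base_url)
    # The output file is opened once, in append mode
    with open(out_path, 'a', encoding='utf-8') as out:
        for url in urls:
            out.write(url + '\n')
    return urls
```

A call such as save_urls("xx.html", "wechatURls.txt", "https://xx.com/") would then reproduce the steps above. Because the output file is opened in append mode, running it twice adds the same links twice.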