Implementing a Simple Web Crawler in Python
2018-01-29 · 獨荹儛臨
If you play with Python but have never written a crawler, you can hardly claim to know Python.
![](https://img.haomeiwen.com/i2198224/f51044278d08a999.jpg)
- Without much explanation, this is aimed at readers with some programming background. Python is actually easy to learn, much simpler than iOS.
```python
import requests
from lxml import etree
from openpyxl import Workbook
import time

wb = Workbook()
ws = wb.active
ws.title = "豆瓣电影专栏"
# header row
ws.cell(row=1, column=1).value = '电影'  # movie
ws.cell(row=1, column=2).value = '评分'  # rating
ws.cell(row=1, column=3).value = '演员'  # cast
ws.cell(row=1, column=4).value = '路径'  # URL

index = 2
for a in range(20):
    # Douban's search API pages in steps of 20
    url_visit = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={}'.format(a * 20)
    files = requests.get(url_visit).json()
    time.sleep(2)  # be polite to the server
    for i in range(20):
        movie = files['data'][i]  # renamed from `dict` to avoid shadowing the builtin
        urlname = movie['url']
        titlename = movie['title']
        rate = movie['rate']
        cast = movie['casts']
        print('电影:%s 评分:%s 演员:%s 播放路径:%s\n' % (titlename, rate, cast, urlname))
        ws.cell(row=index, column=1).value = titlename
        ws.cell(row=index, column=2).value = rate
        ws.cell(row=index, column=3).value = ' '.join(cast)  # join cast names with a separator
        ws.cell(row=index, column=4).value = urlname
        index += 1

wb.save('/Users/tusm/Desktop/douban.xlsx')
```
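Douban sometimes rejects or rate-limits requests that lack a browser-like `User-Agent` header. A small sketch that centralizes URL building and headers (the helper name `build_search_url` is my own, not from the original post):

```python
# Hypothetical helper, not part of the original script: builds the paged
# search URL and keeps a minimal browser-like User-Agent in one place.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def build_search_url(page, page_size=20):
    """Return the JSON search URL for a zero-based page index."""
    return ('https://movie.douban.com/j/new_search_subjects'
            '?sort=T&range=0,10&tags=&start={}'.format(page * page_size))

# usage (network call left commented out):
# data = requests.get(build_search_url(0), headers=HEADERS, timeout=10).json()
```

Passing `timeout=` as well avoids the script hanging forever on a stalled connection.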
Scraping Douban Read: here XPath is used to parse the page, though regular expressions would also work. You can inspect the page structure with Chrome DevTools. It feels just like fetching someone else's data for an app, only simpler.
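As noted above, a regular expression can do the same extraction. A minimal sketch on a made-up HTML fragment (the markup is invented to mirror the column page's title element):

```python
import re

# Made-up fragment mirroring the title markup on the column page.
html = '<h4 class="title"><a href="/column/1">Book A</a></h4>'
titles = re.findall(r'<h4 class="title"><a[^>]*>([^<]+)</a>', html)
# titles == ['Book A']
```

Regexes get brittle as the markup grows more complex, which is why the script below sticks with XPath.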
![](https://img.haomeiwen.com/i2198224/20b48fc68684e05c.png)
Douban URLs:
https://read.douban.com/columns/category/all?sort=hot&start=0
https://read.douban.com/columns/category/all?sort=hot&start=10
https://read.douban.com/columns/category/all?sort=hot&start=20
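The `start` query parameter advances by 10 per page, so the listing URLs can be generated rather than typed out by hand:

```python
BASE = 'https://read.douban.com/columns/category/all?sort=hot&start=%d'

# First three pages, matching the URLs listed above.
urls = [BASE % start for start in range(0, 30, 10)]
```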
```python
if __name__ == "__main__":
    wb = Workbook()
    ws = wb.active
    ws.title = "豆瓣阅读全部专栏"
    # header row
    ws.cell(row=1, column=1).value = '书名'  # title
    ws.cell(row=1, column=2).value = '作者'  # author
    ws.cell(row=1, column=3).value = '类别'  # category
    ws.cell(row=1, column=4).value = '介绍'  # introduction
    index = 2
    for i in range(0, 100, 10):  # the listing pages in steps of 10
        url = "https://read.douban.com/columns/category/all?sort=hot&start=%d" % i
        print('requestUrl:', url)
        content = requests.get(url).content
        selector = etree.HTML(content)
        # extract the four fields with XPath
        intros = selector.xpath('//div[@class="intro"]/text()')
        categories = selector.xpath('//div[@class="category"]/text()')
        titles = selector.xpath('//h4[@class="title"]/a/text()')
        authors = selector.xpath('//div[@class="author"]/a/text()')
        for j in range(len(intros)):
            ws.cell(row=index, column=1).value = str(titles[j])
            ws.cell(row=index, column=2).value = str(authors[j])
            ws.cell(row=index, column=3).value = str(categories[j])
            ws.cell(row=index, column=4).value = str(intros[j])
            index += 1
        wb.save('/Users/tusm/Desktop/aaa.xlsx')
        time.sleep(2)
```
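Indexing four parallel lists by `j` assumes every card yields all four fields; if one selector misses an element on some card, the columns drift or an `IndexError` is raised. Zipping the lists truncates to the shortest one and keeps each row's fields together. A sketch on a made-up offline fragment (the class names match the crawler's targets; the content is invented):

```python
from lxml import etree

# Invented HTML mimicking one column card on the listing page.
html = '''
<div>
  <h4 class="title"><a>Book A</a></h4>
  <div class="author"><a>Author A</a></div>
  <div class="category">Fiction</div>
  <div class="intro">Intro A</div>
</div>
'''
selector = etree.HTML(html)
titles = selector.xpath('//h4[@class="title"]/a/text()')
authors = selector.xpath('//div[@class="author"]/a/text()')
categories = selector.xpath('//div[@class="category"]/text()')
intros = selector.xpath('//div[@class="intro"]/text()')

# one tuple per row, fields paired positionally
rows = list(zip(titles, authors, categories, intros))
```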