Web Scraping: Approach and Storage

2017-09-26 d1b0f55d8efb

Scraping steps:

Fetch the page source (if the response is JSON, parse it with json.loads)
Use XPath or BeautifulSoup
XPath:

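import requests
from lxml import etree

root_url = 'https://www.huxiu.com'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
}
html = requests.get(root_url, headers=headers).text
# Build an element tree from the HTML so it can be queried with XPath
select = etree.HTML(html)
# Select the <a> elements inside the header menu list with this class
zixun_infos = select.xpath('//ul[@class="header-column header-column1 header-column-zx menu-box"]/li/a')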

BeautifulSoup:

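import requests
from bs4 import BeautifulSoup

url = 'http://www.51hao.cc'
req = requests.get(url)
# Set the encoding to GB2312 before reading req.text
req.encoding = "gb2312"
soup = BeautifulSoup(req.text, 'lxml')
# Find every <div class="fkt"> element
fkts = soup.find_all("div", class_="fkt")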
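
When the source fetched in the first step is JSON rather than HTML, no HTML parser is needed: json.loads turns the text into ordinary Python dicts and lists. The sketch below uses a made-up sample string so it runs offline; the keys "data", "title" and "url" are assumptions for illustration, not the format of any real site. In practice the text would come from something like requests.get(api_url, headers=headers).text.

```python
import json

# Hypothetical JSON source; a real one would be the text of an HTTP response.
raw = '{"data": [{"title": "Example", "url": "/article/1"}]}'

# json.loads converts the JSON text into Python dicts and lists
payload = json.loads(raw)

# After that, plain indexing replaces XPath or BeautifulSoup lookups
titles = [item["title"] for item in payload["data"]]
urls = [item["url"] for item in payload["data"]]
print(titles, urls)
```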

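Parse the source
For multi-level scraping, first collect the URL of every listing page into a list, then loop over the list and request each one, going down level by level until you reach the content you actually want to scrape.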
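
The level-by-level idea above can be sketched roughly as follows. This is only one way to organize it, not a fixed recipe: the function names, the one-XPath-per-level design and the XPath strings in the usage note are all assumptions of mine. Each XPath is expected to select href attribute values (for example '//a/@href'). The fetch and extract steps are passed in as parameters so the loop itself can be tried without network access.

```python
from urllib.parse import urljoin


def fetch_html(url):
    """Download a page and return its text (uses requests, as in the examples above)."""
    import requests  # imported here so the crawl loop can be tried offline
    return requests.get(url, timeout=10).text


def extract_links(html, base_url, xpath_expr):
    """Return an absolute URL for every href value matched by xpath_expr."""
    from lxml import etree
    tree = etree.HTML(html)
    return [urljoin(base_url, href) for href in tree.xpath(xpath_expr)]


def crawl_levels(start_url, level_xpaths, fetch=fetch_html, extract=extract_links):
    """Go down one level per XPath and return the URLs found at the last level."""
    urls = [start_url]
    for xpath_expr in level_xpaths:
        next_urls = []
        for url in urls:
            html = fetch(url)
            # store this level's links in a list, then request them in the next pass
            next_urls.extend(extract(html, url, xpath_expr))
        urls = next_urls
    return urls
```

A call such as crawl_levels(root_url, ['//ul/li/a/@href', '//h2/a/@href']) would first collect the column links and then the article links under each column; both XPath strings here are placeholders that have to be adapted to the real page.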

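Storing the data:
For details, see http://www.cnblogs.com/moye13/p/5291156.html
I mainly use two methods:

Store in dictionaries, then write: (define a dictionary for each record, then append it to a list)

Store in a nested list: (a list of lists)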
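
Here is a minimal sketch of both methods. The post does not say which file format the data is written to, so writing CSV with the standard csv module is my assumption, and the field names "title" and "url" are made up for illustration.

```python
import csv
import io

FIELDS = ["title", "url"]  # hypothetical column names


def write_dict_rows(rows, fileobj):
    """Method 1: rows is a list of dicts, one dict per scraped record."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def write_list_rows(rows, fileobj):
    """Method 2: rows is a list of lists, one inner list per scraped record."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    writer.writerows(rows)


scraped = [("First", "/a/1"), ("Second", "/a/2")]  # stand-in for scraped values

# Method 1: define a dict for each record, then append it to a list
dict_rows = []
for title, url in scraped:
    dict_rows.append({"title": title, "url": url})

# Method 2: a list that stores lists
list_rows = [[title, url] for title, url in scraped]

# StringIO keeps the example self-contained; a real file works the same way
dict_buf, list_buf = io.StringIO(), io.StringIO()
write_dict_rows(dict_rows, dict_buf)
write_list_rows(list_rows, list_buf)
print(dict_buf.getvalue())
```

To write to disk instead, open the file with open("articles.csv", "w", newline="", encoding="utf-8") and pass it in place of the StringIO object. The dict version is easier to read when there are many fields; the nested list is shorter when the column order is fixed.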

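Storing in a database:

Create the table first
Connect to the database: conn = pymysql.connect() (pass your host, user, password and database)
Create a cursor: cur = conn.cursor()
Execute the SQL statement through the cursor: cur.execute(sql)
Commit the cursor's changes to the database: conn.commit()
Close the cursor: cur.close()
Close the database connection: conn.close()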
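
Put together, the steps above might look like the sketch below. The table name articles, its columns and every connection value are placeholders I made up, and the table is assumed to exist already. The insert logic is kept in its own function that takes an open connection, so it can be read and tried separately from the pymysql connection details.

```python
def save_articles(conn, articles):
    """Insert (title, url) pairs through an already open connection.

    The table `articles` and its columns are hypothetical.
    """
    sql = "INSERT INTO articles (title, url) VALUES (%s, %s)"
    cur = conn.cursor()  # create the cursor
    try:
        inserted = 0
        for row in articles:
            # pass values as parameters instead of formatting them into the SQL
            cur.execute(sql, row)
            inserted += 1
        conn.commit()  # commit the changes to the database
    finally:
        cur.close()  # close the cursor even if an insert fails
    return inserted


def store_with_pymysql(articles):
    """Open a pymysql connection, save the rows, then close the connection."""
    import pymysql  # third-party package: pip install pymysql

    # Every connection value here is a placeholder.
    conn = pymysql.connect(host="localhost", user="spider", password="secret",
                           database="spider_db", charset="utf8mb4")
    try:
        return save_articles(conn, articles)
    finally:
        conn.close()  # close the database connection
```

Calling store_with_pymysql([("Example title", "https://example.com/a")]) would insert one row, provided a MySQL server with that table is reachable.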
