A Search Crawler for JD.com Products
2019-03-24
周周周__
This post scrapes JD.com's product search page and extracts data from the response.
It mainly uses Python's requests library together with XPath.
Page analysis
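Open the JD.com home page, press F12 to open the developer tools, and use the Network tab to capture requests.
Search for a product name, for example: 小米手机 (Xiaomi phone)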
Analyze the capture by inspecting the Response of each request. The search results are served from this URL:
https://search.jd.com/Search?keyword=%E5%B0%8F%E7%B1%B3%E6%89%8B%E6%9C%BA&enc=utf-8&suggest=1.his.0.0&wq=&pvid=922913c28d854d979bb187fa68bfef4e
URL analysis: https://search.jd.com/Search?keyword=%E5%B0%8F%E7%B1%B3%E6%89%8B%E6%9C%BA&enc=utf-8& is the part that matters. The remaining parameters (suggest, wq, pvid) can be dropped.
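As a rough illustration of that trimming step, the sketch below keeps only the `keyword` and `enc` parameters of the captured URL using the standard library. The helper name `trim_search_url` is hypothetical and not part of the original crawler; it only assumes that the other parameters are optional, as the analysis above suggests.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def trim_search_url(url, keep=("keyword", "enc")):
    """Return the URL with only the query parameters named in `keep`."""
    parts = urlsplit(url)
    # parse_qsl decodes each value; keep_blank_values preserves empty ones such as wq=
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k in keep]
    # urlencode percent-encodes the values again as UTF-8
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


full = ("https://search.jd.com/Search?keyword=%E5%B0%8F%E7%B1%B3%E6%89%8B%E6%9C%BA"
        "&enc=utf-8&suggest=1.his.0.0&wq=&pvid=922913c28d854d979bb187fa68bfef4e")
short = trim_search_url(full)
print(short)
```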
Parameter analysis: keyword is the term we are searching for, percent-encoded as UTF-8.
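To see how the `keyword` value relates to the search term, here is a minimal sketch that percent-encodes 小米手机 with `urllib.parse.quote` and builds the search URL. The function name `build_search_url` is an assumption made for illustration.

```python
from urllib.parse import quote


def build_search_url(keyword):
    """Build a JD.com search URL for the given keyword (UTF-8 percent-encoded)."""
    return "https://search.jd.com/Search?keyword={}&enc=utf-8".format(quote(keyword))


encoded = quote("小米手机")
search_url = build_search_url("小米手机")
print(encoded)
print(search_url)
```

The encoded value matches the `keyword` parameter seen in the captured URL.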
Writing the crawler
import requests
from lxml import etree
import chardet
import pymysql
from urllib.parse import quote

# Connect to the database
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    database='shopping',
    charset='utf8'
)
cur = conn.cursor()

# Placeholder inputs (assumed values; set them for your own run)
n = '小米手机'  # search keyword
i = 1  # id of the next goods row, also used as the image file name
m = 1  # goodstype_id for this keyword

# Start URL (the keyword is percent-encoded explicitly)
url = 'https://search.jd.com/Search?keyword={}&enc=utf-8&'.format(quote(n))
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3724.8 Safari/537.36'
}
try:
    # Send the request
    res = requests.get(url, headers=headers)
    html = res.content
    # Detect the encoding and decode (fall back to UTF-8 if detection fails)
    encoding = chardet.detect(html).get('encoding') or 'utf-8'
    html = html.decode(encoding, 'ignore')
    # print(html)
    # Build the element tree and extract fields with XPath
    docs = etree.HTML(html)
    good_list = docs.xpath('//li[@class="gl-item"]/div')  # all product entries
    # print(good_list)
    print(len(good_list))
    # Loop over the entries and extract cleaner fields from each one
    for good in good_list:
        # print(good.xpath('.//text()'))
        title = good.xpath('./div/a/em//text()')[:3]  # title
        price = good.xpath('.//div/strong/i/text()')  # price
        cover = good.xpath('./div[@class="p-img"]/a/img/@source-data-lazy-img')  # image
        intro = good.xpath('./div[@class="p-img"]/a/@title')  # description
        # Convert to strings
        title = ''.join(title)
        price = ''.join(price)
        cover = "https:" + ''.join(cover)
        intro = ''.join(intro)
        # Skip entries without a price before downloading anything
        if price == '':
            continue
        price = float(price)
        # Save the image locally
        pic = requests.get(url=cover, headers=headers)
        path = "E:\\biyesheji\\mall1\\mall1\\static\\images\\goods\\{}.jpg".format(i)
        with open(path, 'wb') as f:
            f.write(pic.content)
        print(title)
        print(price)
        cover = 'static/images/goods/{}.jpg'.format(i)
        print(cover)
        print(intro)
        print("#" * 100)
        sql1 = "insert into goods_goods(id,name,price,stock, count, intro, goodstype_id, stores_id, creatTime) values(%s, %s, %s, %s, %s,%s, %s,%s,%s)"
        cur.execute(sql1, (i, title, price, 10, 0, intro, m, 1, '2019-03-24 01:45:32.227014'))
        sql2 = 'insert into goods_goodsimage(id, path, status, intro, goods_id)values(%s, %s, %s, %s, %s)'
        cur.execute(sql2, (i, cover, 0, intro, i))
        conn.commit()
        i += 1  # move on to the next id
except Exception as e:
    # Roll back any uncommitted rows and report the error
    conn.rollback()
    print(e)
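The XPath expressions can be tried offline before hitting the site. The sketch below runs the same expressions against a small hand-written HTML snippet. The snippet is hypothetical and heavily simplified (it is modelled on the expressions, not copied from JD.com), and the helper name `parse_goods` is also an assumption.

```python
from lxml import etree

# Hypothetical, simplified markup shaped to match the XPath expressions
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="gl-i-wrap">
      <div class="p-img">
        <a title="Sample intro" href="#"><img source-data-lazy-img="//img.example.com/a.jpg"/></a>
      </div>
      <div class="p-price"><strong><em>¥</em><i>1999.00</i></strong></div>
      <div class="p-name"><a href="#"><em>Sample <font>phone</font> 64GB</em></a></div>
    </div>
  </li>
  <li class="gl-item">
    <div class="gl-i-wrap">
      <div class="p-img">
        <a title="No price" href="#"><img source-data-lazy-img="//img.example.com/b.jpg"/></a>
      </div>
      <div class="p-price"><strong><i></i></strong></div>
      <div class="p-name"><a href="#"><em>Other item</em></a></div>
    </div>
  </li>
</ul>
"""


def parse_goods(html):
    """Extract title, price, cover URL and intro from each product entry."""
    docs = etree.HTML(html)
    items = []
    for good in docs.xpath('//li[@class="gl-item"]/div'):
        title = ''.join(good.xpath('./div/a/em//text()')[:3])
        price = ''.join(good.xpath('.//div/strong/i/text()'))
        cover = ''.join(good.xpath('./div[@class="p-img"]/a/img/@source-data-lazy-img'))
        intro = ''.join(good.xpath('./div[@class="p-img"]/a/@title'))
        if price == '':
            continue  # entries without a price are skipped, as in the crawler
        items.append({
            'title': title,
            'price': float(price),
            'cover': 'https:' + cover,
            'intro': intro,
        })
    return items


goods = parse_goods(SAMPLE_HTML)
for item in goods:
    print(item)
```

Keeping the parsing in a function like this makes it easy to re-check the expressions against a saved page whenever the site's markup changes.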
This is only a simple analysis of JD.com's search page.
Author's QQ group (非): 832785950 (when joining, note where you found it)