How to scrape multi-level JD.com (京东) product pages with Web Scraper, without writing code

2018-09-14  大王丽丽

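I have been planning an analysis of phone recommendations, using the phones currently on sale at JD.com as the sample. I have scraped JD before, but back then every selector on the second-level (product detail) pages was a plain Text selector. This time I want the store name, positive-review rate, and review tags from the second-level page, and that page only shows the full data after you scroll down, so an Element scroll down selector is needed on the second-level page. Link: 【手机手机手机】价格_图片_品牌_怎么样-京东商城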

I. Analyzing how the site behaves

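1. The start page displays all of its data without any extra interaction.

2. Pagination does not change the URL; you have to click to move to the next page.

3. After following a link from the start page into a second-level page, you have to scroll down before the full data is displayed.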

That settles the scraping method: Element click + Link + Element scroll down + Text.

II. Building the sitemap

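As the figure shows, I chained three selectors in series: list, link, and scroll down. The scroll down selector only scrolls the page so that the data gets displayed; all the other child selectors are of type Text and are the ones that actually extract data. Six fields are scraped: phone name, price, number of reviews, store name, positive-review rate, and review tags.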

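One thing to watch out for: the scroll down selector must have a delay set, and 2000 ms is recommended. At first I did not set a delay here, so the scraper jumped to the next page before the positive-review rate and review tags had been captured.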
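The missing-delay pitfall is easy to check for before importing a sitemap. Here is a minimal sketch in Python, assuming the sitemap uses the same JSON layout as the one in this post. The function name `selectors_missing_delay` and the sample fragment are made up for illustration and are not part of Web Scraper itself.

```python
import json


def selectors_missing_delay(sitemap, selector_type="SelectorElementScroll"):
    """Return the ids of selectors of the given type whose delay is empty or zero."""
    missing = []
    for sel in sitemap.get("selectors", []):
        if sel.get("type") != selector_type:
            continue
        # In the exported JSON the delay is stored as a string such as "2000" or "".
        delay = str(sel.get("delay", "")).strip()
        if not delay.isdigit() or int(delay) == 0:
            missing.append(sel["id"])
    return missing


# Hypothetical fragment: one scroll selector configured as in this post,
# and one with the delay left empty (the mistake described above).
sample = json.loads("""
{"_id": "demo", "selectors": [
  {"id": "scroll down", "type": "SelectorElementScroll",
   "parentSelectors": ["link"], "delay": "2000"},
  {"id": "scroll no delay", "type": "SelectorElementScroll",
   "parentSelectors": ["link"], "delay": ""}
]}
""")

print(selectors_missing_delay(sample))
```

Any id that gets printed is a scroll selector that would probably move on before the lazily loaded part of the page has appeared.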

The sitemap JSON is as follows:

{"startUrl":"https://www.jd.com/chanpin/127371.html","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"list","selector":"div.gl-i-wrap","delay":"2000","clickElementSelector":"a.pn-next em","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickMore","discardInitialElements":false},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"price","selector":"div.p-price","regex":"","delay":""},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"pingjianum","selector":"div.p-commit","regex":"","delay":""},{"parentSelectors":["list"],"type":"SelectorLink","multiple":false,"id":"link","selector":"div.p-name a","delay":""},{"parentSelectors":["link"],"type":"SelectorElementScroll","multiple":false,"id":"scroll down","selector":"div#J-global-toolbar","delay":"2000"},{"parentSelectors":["link"],"type":"SelectorText","multiple":false,"id":"store","selector":"div.popbox-inner div.mt","regex":"","delay":""},{"parentSelectors":["link"],"type":"SelectorText","multiple":false,"id":"percent","selector":"div.comment-percent","regex":"","delay":""},{"parentSelectors":["link"],"type":"SelectorText","multiple":false,"id":"label","selector":"div.tag-list","regex":"","delay":""}],"_id":"shouji2"}
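The one-line JSON is hard to read, so here is a small Python sketch that prints the parent/child structure of the selectors. It is only an illustration: the `SELECTORS` list is a condensed copy of the sitemap above that keeps just `id`, `type`, and `parentSelectors`, and the helper names `build_children` and `render_tree` are my own.

```python
from collections import defaultdict

# Condensed copy of the sitemap above: only id, type and parentSelectors are kept.
SELECTORS = [
    {"id": "list", "type": "SelectorElementClick", "parentSelectors": ["_root"]},
    {"id": "price", "type": "SelectorText", "parentSelectors": ["list"]},
    {"id": "pingjianum", "type": "SelectorText", "parentSelectors": ["list"]},
    {"id": "link", "type": "SelectorLink", "parentSelectors": ["list"]},
    {"id": "scroll down", "type": "SelectorElementScroll", "parentSelectors": ["link"]},
    {"id": "store", "type": "SelectorText", "parentSelectors": ["link"]},
    {"id": "percent", "type": "SelectorText", "parentSelectors": ["link"]},
    {"id": "label", "type": "SelectorText", "parentSelectors": ["link"]},
]


def build_children(selectors):
    """Map each parent id to the list of selectors declared under it."""
    children = defaultdict(list)
    for sel in selectors:
        for parent in sel["parentSelectors"]:
            children[parent].append(sel)
    return children


def render_tree(selectors, root="_root"):
    """Return the selector tree as indented lines, starting below the root."""
    children = build_children(selectors)
    lines = []

    def walk(parent, depth):
        for sel in children.get(parent, []):
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            walk(sel["id"], depth + 1)

    walk(root, 0)
    return lines


print("\n".join(render_tree(SELECTORS)))
```

The output shows list at the top, then price, pingjianum, and link under it, and under link the scroll down selector next to store, percent, and label. Note that the sitemap has no separate Text selector for the phone name; presumably it comes from the text of the link selector.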

III. Data preview

Once the parameters are set, you can sit back and wait for the results. A preview is shown below:

Stay tuned for the follow-up post on the phone analysis itself.
