Crawlers and Data Analysis

Scraping Dangdang Book Data

2018-02-26  whong736

Goal: practice scraping book listings for a specific keyword on dangdang.com with Scrapy, and store the scraped data in a MySQL database.

1. Create a new Scrapy project:

scrapy startproject dd
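
If the command succeeds, scrapy startproject generates a project skeleton roughly like this (the exact files vary slightly with the Scrapy version):

dd/
    scrapy.cfg
    dd/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py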

2. cd into the project directory

cd dd

3. Create the Dangdang spider from the basic spider template

scrapy genspider -t basic dd_spider dangdang.com


4. Open the dd project in PyCharm



5. Open Dangdang, search for books under the target keyword, and inspect the results page to decide which fields to scrape. Then define those fields in items.py:

# -*- coding: utf-8 -*-

import scrapy

class DdItem(scrapy.Item):
    # fields to scrape from each Dangdang search-results page
    title = scrapy.Field()
    link = scrapy.Field()
    now_price = scrapy.Field()
    comment_num = scrapy.Field()
    detail = scrapy.Field()

6. Open the spider file, import the item class you just defined, and update the start URL.

from dd.items import DdItem

Populate the item in parse():

        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

Then add the loop that queues the remaining result pages:

        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            # pass the method itself as the callback, not the result of calling it
            yield Request(url, callback=self.parse)
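
The XPath expressions above can be checked interactively with scrapy shell before running the full crawl (a quick sanity check, not part of the original post; the selectors only work as long as Dangdang keeps this page structure):

scrapy shell "http://search.dangdang.com/?key=python&act=input&page_index=1"
>>> response.xpath("//p[@class='name']/a/@title").extract()[:3]
>>> response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()[:3]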

The complete spider code:

# -*- coding: utf-8 -*-
import scrapy
from dd.items import DdItem
from scrapy.http import Request

class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']

    def parse(self, response):
        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

        for i in range(2,27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index="+str(i)
            yield Request(url, callback=self.parse)


7. In settings.py, uncomment the ITEM_PIPELINES block and set ROBOTSTXT_OBEY to False:

ITEM_PIPELINES = {
   'dd.pipelines.DdPipeline': 300,
}

ROBOTSTXT_OBEY = False
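
Optionally (my own addition, not in the original post), a small download delay and a browser-like User-Agent in settings.py make the crawl gentler and less likely to be blocked; both values below are assumptions to tune as needed:

DOWNLOAD_DELAY = 1
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'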


8. Open the pipelines file and, with a for loop, read the values of the scraped item and print them to check the results:

class DdPipeline(object):
    def process_item(self, item, spider):

        for i in range(0,len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            print(title)
            print(link)
            print(now_price)
            print(comment_num)
            print(detail)
        return item

9. Run the spider to check the output. In PyCharm's Terminal (or a macOS terminal), cd into the dd project directory and run:

scrapy crawl dd_spider --nolog
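
Alternatively (not in the original post), Scrapy can dump the items straight to a file, which is a quick way to inspect the results before wiring up MySQL:

scrapy crawl dd_spider -o books.json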


10. The crawl works, so the next step is to store the scraped data in a MySQL database using the third-party library PyMySQL. Install it ahead of time with pip install pymysql.
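
A minimal sketch to confirm PyMySQL is importable and can reach the local MySQL server (hypothetical credentials, the same ones used later in the pipeline; no database is selected yet because dd is only created in the next step):

import pymysql

# connect without selecting a database -- dd is created in step 11
conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321")
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())   # prints the MySQL server version
cursor.close()
conn.close()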

11. Open a terminal, connect to MySQL, create a dd database, and switch to it:

create database dd;

use dd;

Create a books table with the fields to store: an auto-increment id plus title, link, now_price, comment_num, and detail:

create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255));
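
Note that the scraped now_price is a raw string such as "¥62.20", which does not fit cleanly into an int(10) column. One alternative (my own variant, not what the post used) is to store the raw strings first and convert them later:

create table books(id int AUTO_INCREMENT PRIMARY KEY, title varchar(200), link varchar(200) unique, now_price varchar(20), comment_num varchar(100), detail varchar(255));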

12. Import pymysql in pipelines.py and write the scraped data to MySQL. First attempt, building the INSERT statement by string concatenation:

# -*- coding: utf-8 -*-

import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # open a connection to the dd database
        conn = pymysql.connect(host="127.0.0.1",user="root",passwd="654321",db="dd")
        for i in range(0,len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # build the INSERT statement by string concatenation
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('"+title+"','"+link+"','"+now_price+"','"+comment_num+"','"+detail+"')"
            conn.query(sql)
        # close the connection
        conn.close()
        return item

The data could not be written to the database correctly; the run reported ModuleNotFoundError: No module named 'pymysql', and at first I could not find a solution.
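
For what it is worth, ModuleNotFoundError usually means PyMySQL is not installed in the interpreter that Scrapy is actually running under (PyCharm, for example, may use a different virtualenv than the terminal). Running these in that same environment shows whether the module is visible:

python -c "import pymysql"
pip show pymysql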



Solution: change how the SQL statement is written, using a parameterized query:

        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')      # force utf8 for this session
        cursor.execute('set autocommit=1')    # enable autocommit

        # ...and inside the for loop, use placeholders instead of string concatenation:
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
            conn.commit()

The complete pipeline code:

# -*- coding: utf-8 -*-

import pymysql

class DdPipeline(object):
    def process_item(self, item, spider):
        # open the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')      # force utf8 for this session
        cursor.execute('set autocommit=1')    # enable autocommit
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # parameterized insert into the books table created in step 11
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
            conn.commit()
        cursor.close()
        # close the connection
        conn.close()
        return item
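
Opening and closing a connection for every item works, but a common refactor (a sketch using the same credentials, not from the original post) is to open the connection once per crawl with the pipeline's open_spider/close_spider hooks and to zip the parallel field lists instead of indexing them:

# -*- coding: utf-8 -*-
import pymysql

class DdPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321",
                                    db="dd", charset="utf8")
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
        # zip keeps the parallel lists aligned and stops at the shortest one
        for row in zip(item["title"], item["link"], item["now_price"],
                       item["comment_num"], item["detail"]):
            self.cursor.execute(sql, row)
        self.conn.commit()
        return item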


Takeaway: most of the problems were character-encoding issues. If the encoding of the table columns does not match the encoding of the data being inserted, the insert may fail or the stored text may end up garbled.

Optimizations:

1. The comment count and price scraped from Dangdang are strings; converting them to numbers makes sorting possible (see the helper below).
2. Wrap the database writes in try/except to make the code more robust (a sketch follows the helper).

import re

def getNumber(string):
    # pull the first numeric value out of a string such as "¥62.20" or "1234条评论"
    matches = re.findall(r"\d+\.?\d*", string)
    return float(matches[0]) if matches else 0
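
For the second optimization, a minimal sketch of guarding the insert inside the loop (it reuses the getNumber helper above for the numeric fields and the cursor/conn names from the pipeline code):

try:
    sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
    param = (title, link, getNumber(now_price), getNumber(comment_num), detail)
    cursor.execute(sql, param)
    conn.commit()
except Exception as e:
    # skip the bad row and keep processing the remaining ones
    print("insert failed:", e)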

Reference: http://blog.csdn.net/think_ma/article/details/78900218
