Information Gathering Tools

2020-03-01  诺之林

Tools

Requests

Requests: HTTP for Humans

import requests

res = requests.get(url='https://www.baidu.com/')
txt = res.text
print(txt)
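The minimal example above can be hardened a little. The sketch below (header value and timeout are illustrative assumptions, not from the original) reuses a session, fails loudly on HTTP errors, and fixes the common mojibake problem where Requests guesses the wrong encoding:

```python
import requests

# Reuse one Session so connections and headers are shared across requests.
# The User-Agent value here is an illustrative assumption.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

def fetch(url, timeout=5):
    """Fetch a page, raising on 4xx/5xx and correcting the encoding."""
    res = session.get(url, timeout=timeout)
    res.raise_for_status()                  # raise for HTTP error statuses
    res.encoding = res.apparent_encoding    # guess encoding from the body
    return res.text
```

`raise_for_status()` turns silent error pages into exceptions, and `apparent_encoding` helps with sites (like many Chinese pages) whose declared charset is wrong.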

Selenium

Selenium automates browsers

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
print(browser.page_source)
browser.close()
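For scraping you usually want the browser to run without opening a window. A headless variant of the snippet above might look like this (assumes Selenium 4+ and a ChromeDriver matching your Chrome version on PATH):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run headless (no visible window).
opts = Options()
opts.add_argument('--headless=new')   # headless mode for recent Chrome

def page_source(url):
    """Open url in headless Chrome and return the rendered HTML."""
    browser = webdriver.Chrome(options=opts)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()                # quit() closes all windows and the driver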

For Chrome, install ChromeDriver; for Firefox, install geckodriver.

Pyppeteer

Unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.baidu.com')
    await page.screenshot({'path': 'baidu.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Splash

Lightweight, scriptable browser as a service with an HTTP API

docker run --name py-splash -p 8050:8050 -d scrapinghub/splash
pipenv run scrapy startproject splash_demo

cd splash_demo

vim splash_demo/settings.py
ROBOTSTXT_OBEY = False

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

vim splash_demo/spiders/TaobaoSpider.py
import scrapy
from scrapy_splash import SplashRequest

class TaobaoSpider(scrapy.Spider):
    name = "taobao"
    allowed_domains = ["www.taobao.com"]
    start_urls = ['https://s.taobao.com/search?q=坚果&s=880&sort=sale-desc']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print(response.text)

pipenv run scrapy crawl taobao
