Information Gathering Tools
2020-03-01
诺之林
Tools
Requests
Requests: HTTP for Humans
import requests
res = requests.get(url='https://www.baidu.com/')
txt = res.text
print(txt)
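Beyond plain GET calls, Requests can also build the query string for you from a dict via its `params` argument. A minimal sketch (the search URL and parameters here are illustrative, borrowed from the Taobao example later in this article):

```python
import requests

# Prepare a GET request; `params` is URL-encoded and appended automatically.
req = requests.Request(
    'GET',
    'https://s.taobao.com/search',
    params={'q': '坚果', 'sort': 'sale-desc'},
).prepare()

print(req.url)  # full URL with the encoded query string
```

To actually send it, use `requests.Session().send(req)`, or more simply pass the same `params` dict to `requests.get()`.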
Selenium
Selenium automates browsers
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
print(browser.page_source)
browser.close()
For Chrome, install ChromeDriver; for Firefox, install geckodriver.
Pyppeteer
Unofficial Python port of puppeteer JavaScript (headless) chrome/chromium browser automation library
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.baidu.com')
    await page.screenshot({'path': 'baidu.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Splash
Lightweight, scriptable browser as a service with an HTTP API
docker run --name py-splash -p 8050:8050 -d scrapinghub/splash
pipenv run scrapy startproject splash_demo
cd splash_demo
vim splash_demo/settings.py
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
vim splash_demo/spiders/TaobaoSpider.py
import scrapy
from scrapy_splash import SplashRequest
class TaobaoSpider(scrapy.Spider):
    name = "taobao"
    allowed_domains = ["www.taobao.com"]
    start_urls = ['https://s.taobao.com/search?q=坚果&s=880&sort=sale-desc']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print(response.text)
pipenv run scrapy crawl taobao
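Splash can also be queried directly over its HTTP API without Scrapy: the `render.html` endpoint returns the page's HTML after JavaScript has executed. A sketch assuming the container started above is listening on localhost:8050 (only the request is prepared here; the send line is commented out since it needs the container running):

```python
import requests

# Build the render.html request; `wait` gives the page time to run JavaScript.
req = requests.Request(
    'GET',
    'http://localhost:8050/render.html',
    params={'url': 'https://www.baidu.com', 'wait': 0.5},
).prepare()

print(req.url)

# With the Splash container running:
# html = requests.Session().send(req).text
```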