Web scraping: crawling the Maoyan movie Top 100 board

2019-08-13 楚岸

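This is the code from my first hands-on web scraping project. It borrows from and references other people's work, and is mainly for learning purposes.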

import json

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one board page and return its HTML, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    try:
        # The request must sit inside the try block, otherwise RequestException is never caught
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:  # accept only HTTP 200
            return response.text
        return None
    except RequestException:
        print("Page request failed")
        return None


def parse_one_page(html):
    """Yield one dict per movie found on a board page."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.select('.board-wrapper dd')
    for i in body:
        yield {
            # Rank
            'index': i.select('.board-index')[0].get_text(),
            # Movie title
            'name': i.select('.name a')[0].get_text(),
            # Poster image URL
            'img': i.select('.board-img')[0].attrs['data-src'],
            # Cast; strip() removes surrounding whitespace
            'star': i.select('.star')[0].get_text().strip(),
            # Release date; [5:] drops the 5-character "上映时间" label and its colon
            'releasetime': i.select('.releasetime')[0].get_text()[5:],
        }


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # Nothing to parse if the request failed
        return
    with open('dianying.txt', 'a', encoding='utf8') as f:
        for item in parse_one_page(html):
            # ensure_ascii=False keeps Chinese text readable instead of \u escapes
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
            print(item)


if __name__ == '__main__':
    # Each page holds 10 movies, so offset advances by 10: 0, 10, ..., 90
    for i in range(10):
        main(i * 10)
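
The `[5:]` slice in parse_one_page depends on the label in front of the date always being exactly five characters long. A regular expression is less sensitive to that. The sketch below is not part of the original script: `split_release_time` is a hypothetical helper, and the input format (a date optionally followed by a region in parentheses) is an assumption about what the page returns.

```python
import re

# Assumed shape of the release-time text: a date such as "1993-01-01" or "1993",
# optionally followed by a region in half-width or full-width parentheses.
# The real page may use a different layout.
RELEASE_RE = re.compile(r'(\d{4}(?:-\d{2}){0,2})(?:[((]([^))]*)[))])?')


def split_release_time(text):
    """Return (date, region) found in text; either value may be None."""
    match = RELEASE_RE.search(text)
    if match is None:
        return None, None
    # group(2) is None when no parenthesised region follows the date
    return match.group(1), match.group(2)
```

Because the pattern is applied with `search`, it works on the text both before and after the label has been sliced off, so it could replace the `[5:]` step or run after it.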

The output looks like this:

Figure: Jupyter Notebook view

Figure: the raw, unprocessed txt file

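The txt file above was then imported into Excel as an external data source and cleaned up into the following table:

The data set is small, so MySQL was not used.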
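
The script writes one JSON object per line and opens dianying.txt in append mode, so running it twice leaves every movie in the file twice. As a possible alternative to cleaning this up by hand in Excel, here is a minimal sketch that loads the records, drops repeats, and sorts them by rank. `load_movies` is a hypothetical helper, and it assumes every record has an `index` field holding a number as text, as produced by parse_one_page.

```python
import json


def load_movies(lines):
    """Parse JSON Lines records, keep the first record per rank, sort by rank."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # setdefault keeps the first record for a rank and ignores later duplicates
        seen.setdefault(record['index'], record)
    # 'index' is stored as text, so convert it to int for a numeric sort
    return sorted(seen.values(), key=lambda r: int(r['index']))
```

The function accepts any iterable of lines, so an open file works directly: `with open('dianying.txt', encoding='utf8') as f: movies = load_movies(f)`.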

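Summary

1. Learned how to use the requests, json, and BeautifulSoup packages.
2. Learned the basic workflow of a web crawler and how it works.
3. The Excel import showed garbled characters at first; setting the language type to "None" fixed it.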
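
On the garbled characters in point 3: another approach that may avoid the problem is to write a CSV file with a UTF-8 byte order mark, which Excel generally uses to recognise the encoding. This is a sketch under that assumption and was not tested against the setup used in this post. `write_csv_for_excel` and `FIELDS` are hypothetical names; the field list mirrors the keys produced by parse_one_page.

```python
import csv

# Column order for the CSV; these are the keys yielded by parse_one_page
FIELDS = ['index', 'name', 'img', 'star', 'releasetime']


def write_csv_for_excel(rows, path):
    """Write dict rows to a CSV file encoded as UTF-8 with a BOM."""
    # 'utf-8-sig' prepends the BOM; newline='' is recommended by the csv module docs
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Each row must be a dict whose keys are a subset of `FIELDS`; missing keys are written as empty cells, while unexpected keys make `DictWriter` raise a `ValueError`.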
