
2019-11-14 Crawler: Famous Quotes


Target site: https://www.geyanw.com/
First, grab the URLs of the main categories:

import requests
from lxml import html

url = 'https://www.geyanw.com/'
response = requests.get(url).text
selector = html.fromstring(response)
# The hrefs of the top-level category links
myArr = selector.xpath('//*[@id="p_left"]/div/dl/dt/strong/a/@href')
print(len(myArr))
for detail in myArr:
    arrUrl = url + detail
    print(arrUrl)
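
If any href comes back as an absolute path (e.g. /lizhimingyan/), plain string concatenation can produce a doubled slash. A minimal alternative sketch using urllib.parse.urljoin from the standard library (the category path here is only an illustration):

from urllib.parse import urljoin

base = 'https://www.geyanw.com/'
# urljoin normalizes relative and absolute-path hrefs alike
print(urljoin(base, 'lizhimingyan/'))   # https://www.geyanw.com/lizhimingyan/
print(urljoin(base, '/lizhimingyan/'))  # https://www.geyanw.com/lizhimingyan/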

This yields 9 categories; each category is then handled separately:

def getDetailUrl(arrUrl):
    response = requests.get(arrUrl).text
    selector = html.fromstring(response)
    # The "page 2" link exposes the pagination filename pattern, e.g. list_2.html
    page_two = selector.xpath('//*[@id="p_left"]/div/ul[2]/li[3]/a/@href')[0]
    print(page_two)
    page = 1
    while 1:
        # Strip the trailing "N.html" and substitute the current page number
        detailUrl = arrUrl + page_two[:-6] + '%s.html' % page
        print(detailUrl)

Fetch each page URL inside the loop:

        response = requests.get(detailUrl).text
        selector = html.fromstring(response)
        # Links to the individual article pages on this listing page
        detailList = selector.xpath('//*[@id="p_left"]/div/ul[1]/li/h2/a/@href')
        print(len(detailList))
        if len(detailList) == 0:
            # An empty listing means we have run past the last page
            break
        page += 1
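The page_two[:-6] slice assumes the link always ends in a single-digit page number plus ".html". A slightly more defensive sketch (my own variant with a hypothetical helper, not from the original post) derives the pattern with a regular expression instead:

import re

def make_page_url(arrUrl, page_two, page):
    # Replace the trailing page number in e.g. "list_2.html" with a placeholder
    pattern = re.sub(r'\d+\.html$', '%d.html', page_two)
    return arrUrl + pattern % page

# make_page_url('https://www.geyanw.com/lizhimingyan/', 'list_2.html', 12)
# -> 'https://www.geyanw.com/lizhimingyan/list_12.html'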

Then pull the quotes out of each detail page:

        for articleUrl in detailList:
            print(articleUrl)
            # articleUrl is site-relative, so prepend the base url
            response = requests.get(url + articleUrl)
            # The site serves gb2312-encoded pages; set the encoding before reading .text
            response.encoding = 'gb2312'
            selector = html.fromstring(response.text)
            P_element = selector.xpath('//*[@id="p_left"]/div[1]/div[4]/p')
            print(len(P_element))
            for p in P_element:
                print(p.text)
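None of the snippets above actually invoke getDetailUrl, so to run the whole crawl the category loop from the first block just needs to call it. A minimal wiring sketch (the __main__ guard is my addition):

if __name__ == '__main__':
    for detail in myArr:
        # Crawl every page of every category in turn
        getDetailUrl(url + detail)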

Result: each scraped quote is printed on its own line.

Full code: https://github.com/Liangjianghao/everyDay_spider.git (mingyan_11-14)
