Scraping Tieba images with requests and bs4

2018-07-03 LiangJialin

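I want to scrape the images from every thread on the first page of the ps forum (ps吧) on Baidu Tieba. The first step is to find the links on that page that point to each thread.

Each thread link has the form /p/5775030343, and opening one shows that its actual address is http://tieba.baidu.com/p/5775030343. So after collecting the links from the first page, prefixing each one with http://tieba.baidu.com gives the full URL of every thread.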
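The join described above can also be done with the standard library's urllib.parse.urljoin instead of string concatenation. This is only a minimal sketch: the thread_url helper name is made up for this illustration, and the href is the example from the text.

```python
from urllib.parse import urljoin

BASE_URL = 'http://tieba.baidu.com'

def thread_url(href):
    """Turn a relative thread link such as '/p/5775030343' into a full URL."""
    # urljoin returns an already absolute href unchanged
    return urljoin(BASE_URL, href)

print(thread_url('/p/5775030343'))
```

Plain concatenation works just as well here as long as every href starts with a slash; urljoin simply tolerates a few more link shapes.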

Inside a thread, the images all sit in <img class="BDE_Image" ...> tags. Once the image URL is read from each tag's src attribute, the image can be downloaded directly.
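To see what that lookup does without touching the network, here is a small sketch that runs BeautifulSoup on a hand-written HTML fragment. The fragment, the example.com URLs and the image_urls helper are invented for illustration; only the BDE_Image class name comes from the page itself. It uses Python's built-in html.parser, so it should not need lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a thread page (not real Tieba markup)
SAMPLE_HTML = '''
<div>
  <img class="BDE_Image" src="https://example.com/a.jpg">
  <img class="emoticon" src="https://example.com/smile.png">
  <img class="BDE_Image" src="https://example.com/b.jpg">
</div>
'''

def image_urls(html):
    """Return the src of every <img> tag carrying the BDE_Image class."""
    soup = BeautifulSoup(html, 'html.parser')
    return [img['src'] for img in soup.find_all('img', {'class': 'BDE_Image'})]

print(image_urls(SAMPLE_HTML))
```

The img tag with a different class is skipped, which is why filtering on the class keeps unrelated images out of the download.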

The implementation is as follows:

Required libraries

pip install requests

pip install beautifulsoup4

pip install lxml

Import the required libraries

import requests
from bs4 import BeautifulSoup
import os

First, fetch the page that lists the threads: http://tieba.baidu.com/f?ie=utf-8&kw=ps

# Fetch the HTML
f = requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=ps').text

# Parse the HTML with BeautifulSoup
s = BeautifulSoup(f, 'lxml')

# Find the link to each thread
pages = s.find_all('a', {'class': 'j_th_tit'})

# Create the output directory if needed, then switch into it
if os.path.exists('c:/photo'):
    print("-----directory exists!-----")
else:
    os.mkdir('c:/photo')
os.chdir('c:/photo')
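The exists-check followed by mkdir can be collapsed into a single call, assuming Python 3.2 or later, where os.makedirs accepts exist_ok. The sketch below works in a temporary directory rather than c:/photo so it is safe to run anywhere; ensure_dir is a hypothetical helper name.

```python
import os
import tempfile

def ensure_dir(path):
    """Create path (and any missing parents) if needed, then return it."""
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(path, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as tmp:
    photo_dir = ensure_dir(os.path.join(tmp, 'photo'))
    ensure_dir(photo_dir)  # calling it again does not raise
    print(os.path.isdir(photo_dir))
```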

# Prepare to download
print("-----prepare to download!-----")
i = 1

# Holds the full URL of each thread
pages_url = []
for page in pages:
    page_url = 'http://tieba.baidu.com' + page['href']
    pages_url.append(page_url)

# Collect the image URLs inside each thread
for each_url in pages_url:
    con = requests.get(each_url).text
    cons = BeautifulSoup(con, 'lxml')
    imgs = cons.find_all('img', {'class': 'BDE_Image'})
    for src_url in imgs:
        real_url = src_url['src']
        img_content = requests.get(real_url).content
        file_name = str(i) + '.jpg'
        # Write the image to disk
        with open(file_name, 'wb') as wf:
            wf.write(img_content)
        print("downloading", i, "photo")
        i += 1

print("---download complete!---")
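The script saves every file with a .jpg suffix. If some images turn out to be in another format, one option is to take the extension from the image URL and fall back to .jpg. This is a minimal sketch: file_name_for and the example.com URLs are invented for illustration, and whether Tieba actually serves non-JPEG images is an assumption the original does not check.

```python
import os
from urllib.parse import urlparse

def file_name_for(index, url, default_ext='.jpg'):
    """Build a name like '7.png' from a counter and the image URL."""
    # Look only at the URL path so a query string such as '?size=big' is ignored
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return str(index) + (ext or default_ext)

print(file_name_for(1, 'https://example.com/pic/abc.png?size=big'))
print(file_name_for(2, 'https://example.com/pic/noext'))
```

In the download loop this would stand in for the str(i) + '.jpg' expression, called as file_name_for(i, real_url).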

The result looks like this:

Download started

Download finished

326 images were downloaded in total.
