Crawling All Tags from an Image Site, with Deduplication and Counting
2017-05-21
laotoutou
Pixabay hosts a large number of images, each carrying roughly 5 to 10 tags. The project needs to crawl every tag, which can be done in two ways: per category, or all at once.
Here we crawl the tags of all images; the relevant code follows:
# -*- coding: utf-8 -*-
"""
Created on Sun May 07 19:51:16 2017
@author: Black Mamba
"""
import requests
from bs4 import BeautifulSoup

max_span = 9419  # total number of result pages to crawl
f = open('0000.txt', 'a', encoding='utf-8')
href = 'https://pixabay.com/zh/photos/?orientation=&image_type=&cat=&colors=&q=&order=latest&pagi='
for page in range(1, max_span + 1):
    try:
        print(page)
        page_url = href + str(page)
        label_html = requests.get(page_url)
        label_soup = BeautifulSoup(label_html.text, 'lxml')
        label_content = label_soup.find('div', class_='flex_grid credits').find_all('img')
        for c in label_content:
            try:
                # each img's alt attribute holds the comma-separated tags
                f.write(c['alt'])
                f.write('\n')
            except (KeyError, IOError):
                print(page, 'write failed')
                continue
        # roll over to a new output file every 1000 pages
        if page % 1000 == 0:
            f.close()
            f = open(str(page) + '.txt', 'a', encoding='utf-8')
    except Exception:
        print(page, 'error')
        continue
f.close()
print('crawl done')
max_span sets the number of pages to crawl. To keep an uncaught exception during the crawl from terminating the program and losing everything collected so far, the tags are rolled over into a new file every 1000 pages.
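The rollover scheme above maps each page to a fixed output file. As a quick illustration, the file a given page's tags land in can be computed as follows (this helper is my own, not part of the original script):

```python
def tag_file_for_page(page):
    """Return the filename a page's tags end up in under the
    'new file every 1000 pages' rollover scheme above."""
    # pages 1-1000 go to 0000.txt, pages 1001-2000 to 1000.txt, ...
    bucket = ((page - 1) // 1000) * 1000
    return '%04d.txt' % bucket
```

With max_span = 9419 this yields exactly the 10 files (0000.txt through 9000.txt) that the dedup step below reads back in.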
Pixabay is comparatively easy to crawl: a plain GET request with the right URL parameters supports both per-category and site-wide crawling.
The cat parameter selects the category; there are 20 categories in total, with the following values:
['animals', 'backgrounds', 'buildings', 'business', 'computer', 'education', 'fashion', 'feelings', 'food', 'health', 'industry', 'music', 'nature', 'people', 'places', 'religion', 'science', 'sports', 'transportation', 'travel']
Since the site's Chinese tags are translated directly from the English ones and are of lower quality, you can change the zh in the URL to en to crawl the English tags instead, e.g. change
https://pixabay.com/zh/photos/?min_height=&image_type=&cat=&q=&min_width=&order=popular
to
https://pixabay.com/en/photos/?min_height=&image_type=&cat=&q=&min_width=&order=popular
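Putting the two ideas together, one way to generate a crawl URL per category is sketched below. This helper is my own illustration, not from the original post; the parameter layout copies the order=latest URL used earlier, with the page number appended via pagi:

```python
# The 20 cat values listed above
categories = ['animals', 'backgrounds', 'buildings', 'business', 'computer',
              'education', 'fashion', 'feelings', 'food', 'health', 'industry',
              'music', 'nature', 'people', 'places', 'religion', 'science',
              'sports', 'transportation', 'travel']

def category_url(cat, lang='en', page=1):
    """Build a per-category page URL; lang is 'zh' or 'en'."""
    return ('https://pixabay.com/%s/photos/?orientation=&image_type='
            '&cat=%s&colors=&q=&order=latest&pagi=%d' % (lang, cat, page))
```

Looping over categories with this helper gives a per-category crawl with the same page loop as before.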
The crawl leaves us with 10 files. Now we deduplicate and sort the tags across these 10 files; the relevant code follows:
# coding: utf-8
'''
Deduplicate tag entries and count their frequencies
'''
from collections import Counter

# total number of images (one line per image)
global_var = 0
# accumulated tags
result = []

def calculate(folder_name, raw_filename):
    '''Read one crawl output file and collect its tags.'''
    global global_var
    fp = open(folder_name + '/' + str(raw_filename) + '000.txt', 'r', encoding='utf-8')
    for line in fp:
        # one line per image
        global_var += 1
        labels = line.split(', ')
        for label in labels:
            result.append(label.strip())
    fp.close()
    return result

def main(folder_name, file_names, suffix):
    # merge the tags from every file, then deduplicate and count
    for filename in file_names:
        calculate(folder_name, filename)
    dic = Counter(result).items()
    # sort by occurrence count and write to file
    sorted_dic = sorted(dic, key=lambda item: item[1], reverse=True)
    fp = open(folder_name + '/labels' + suffix + '.txt', 'a', encoding='utf-8')
    for word, count in sorted_dic:
        fp.write(word + '\t' + str(count) + '\n')
    fp.close()
    # record the image count and the distinct tag count
    fp_readme = open(folder_name + '/labels' + suffix + '_readme.txt', 'a', encoding='utf-8')
    fp_readme.write('image count' + '\t' + str(global_var) + '\n')
    fp_readme.write('tag count' + '\t' + str(len(sorted_dic)) + '\n')
    fp_readme.close()

if __name__ == '__main__':
    folder_name = 'zh_labels'
    names = range(10)
    suffix = '_result'
    main(folder_name, names, suffix)
    print('done')
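The core of the dedup-and-count step can be seen on a toy input. The two lines below are made-up records (one image per line) used purely for illustration:

```python
from collections import Counter

# fake crawl output: each line is one image's comma-separated tags
lines = ['cat, animal, pet', 'dog, animal, pet']

# split every line into tags, then count duplicates across all images
tags = [t.strip() for line in lines for t in line.split(', ')]
counts = sorted(Counter(tags).items(), key=lambda kv: kv[1], reverse=True)
# counts now pairs each distinct tag with its frequency, most frequent first
```

This is exactly what calculate() plus the Counter/sorted step in main() does across the 10 crawl files.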