
Crawling every tag from an image site, then deduplicating and counting them

2017-05-21  laotoutou

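Pixabay hosts a large number of images, and each image carries roughly 5 to 10 tags. The project needs every tag on the site, which can be collected in two ways: crawling category by category, or crawling everything in one pass.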



Here we crawl the tags of all images. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Sun May 07 19:51:16 2017

@author: Black Mamba
"""
import requests
from bs4 import BeautifulSoup

max_span = 9419  # number of listing pages to crawl
f = open('0000.txt', 'a', encoding='utf-8')

href = 'https://pixabay.com/zh/photos/?orientation=&image_type=&cat=&colors=&q=&order=latest&pagi='

for page in range(1, max_span + 1):

    try:
        print(page)
        page_url = href + str(page)
        label_html = requests.get(page_url)
        label_soup = BeautifulSoup(label_html.text, 'lxml')
        # the alt attribute of each thumbnail holds that image's tags
        label_content = label_soup.find('div', class_='flex_grid credits').find_all('img')

        for c in label_content:
            try:
                # one line per image
                f.write(c['alt'] + '\n')
            except Exception:
                print(page, 'write failed')
                continue

        # start a new output file after every 1000 pages
        if page % 1000 == 0:
            f.close()
            name = str(page) + '.txt'
            f = open(name, 'a', encoding='utf-8')

    except Exception:
        # a failed request or a missing container skips this page only
        print(page, 'error')
        continue

f.close()
print('crawl done')

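max_span sets the number of pages to crawl. To limit the damage if an uncaught exception terminates the program partway through, the tags from every 1000 pages are stored in a separate file.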
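To make the rotation rule concrete, here is a minimal sketch of which file a given page ends up in under the loop above. The helper name `chunk_filename` is hypothetical and not part of the original script; it only restates the `page % 1000` logic.

```python
def chunk_filename(page, chunk=1000):
    """Return the output file that receives the tags of `page`.

    Hypothetical helper: the original script rotates files inline
    instead of calling a function like this.
    """
    # Pages 1..1000 go to 0000.txt, 1001..2000 to 1000.txt, and so on,
    # because the file is switched only after page 1000, 2000, ... is written.
    return '{:04d}.txt'.format((page - 1) // chunk * chunk)


for page in (1, 1000, 1001, 9419):
    print(page, chunk_filename(page))
```

With max_span = 9419 this gives ten files, 0000.txt through 9000.txt.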
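Crawling Pixabay is comparatively easy: the listing is driven by URL parameters, so a plain GET request with the right parameters is enough for both the per-category crawl and the full crawl.
The cat parameter selects the category. There are 20 categories, and the parameter takes the following values: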
['animals', 'backgrounds', 'buildings', 'business', 'computer', 'education', 'fashion', 'feelings', 'food', 'health', 'industry', 'music', 'nature', 'people', 'places', 'religion', 'science', 'sports', 'transportation', 'travel']
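As a rough sketch of the per-category approach, the snippet below builds one listing URL per `cat` value. The function name `category_url` is hypothetical, and the assumption that `cat`, `order`, and `pagi` can be combined this way comes only from the URLs shown in this post; the site may have changed since.

```python
from urllib.parse import urlencode

BASE = 'https://pixabay.com/zh/photos/'

CATEGORIES = ['animals', 'backgrounds', 'buildings', 'business', 'computer',
              'education', 'fashion', 'feelings', 'food', 'health',
              'industry', 'music', 'nature', 'people', 'places',
              'religion', 'science', 'sports', 'transportation', 'travel']


def category_url(cat, page=1, order='latest'):
    """Build the listing URL for one category page (hypothetical helper)."""
    query = urlencode({'cat': cat, 'order': order, 'pagi': page})
    return BASE + '?' + query


# First page of every category.
urls = [category_url(cat) for cat in CATEGORIES]
print(len(urls))
print(urls[0])
```

Each URL can then be fed to the same request-and-parse loop used for the full crawl, with one set of output files per category.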
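The site's Chinese tags are machine-style direct translations of the English ones, so their quality is lower. Replacing zh with en in the URL crawls the English tags instead, for example: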

https://pixabay.com/zh/photos/?min_height=&image_type=&cat=&q=&min_width=&order=popular

becomes

https://pixabay.com/en/photos/?min_height=&image_type=&cat=&q=&min_width=&order=popular
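The same substitution can be done programmatically. This is a minimal sketch, assuming the language code is always the first path segment as in the two URLs above; `switch_language` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_language(url, lang='en'):
    """Replace the first path segment (the language code) of a listing URL."""
    parts = urlsplit(url)
    segments = parts.path.split('/')   # e.g. ['', 'zh', 'photos', '']
    segments[1] = lang                 # assumption: segment 1 is the language
    return urlunsplit(parts._replace(path='/'.join(segments)))


zh_url = ('https://pixabay.com/zh/photos/'
          '?min_height=&image_type=&cat=&q=&min_width=&order=popular')
print(switch_language(zh_url))
```

Editing only the path segment avoids accidentally touching a "zh" that might appear elsewhere, such as in the query string.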

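The crawl leaves us with 10 files. The next step is to deduplicate the tags in those 10 files and sort them by frequency. The code is as follows: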

# coding: utf-8

'''
Deduplicate the tag entries and count how often each one occurs
'''

from collections import Counter


# number of images seen
global_var = 0
# every tag collected so far (duplicates included)
result = []

def calculate(folder_name, raw_filename):
    global global_var

    # the crawler's files are named 0000.txt, 1000.txt, ..., 9000.txt
    fp = open(folder_name + '/' + str(raw_filename) + '000.txt', 'r', encoding='utf-8')
    for line in fp:

        # count images (one line is one image)
        global_var += 1

        labels = line.split(', ')
        for label in labels:
            result.append(label.strip())
    fp.close()
    return result



def main(folder_name, file_names, suffix):

    for filename in file_names:
        # collect the tags from each file
        calculate(folder_name, filename)

    dic = Counter(result).items()
    # sort by occurrence count and write to a file
    sorted_dic = sorted(dic, key=lambda item: item[1], reverse=True)
    fp = open(folder_name + '/' + 'labels' + suffix + '.txt', 'a', encoding='utf-8')
    for word, count in sorted_dic:
        fp.write(word + '\t' + str(count) + '\n')
    fp.close()

    fp_readme = open(folder_name + '/' + 'labels' + suffix + '_readme.txt', 'a', encoding='utf-8')
    # write the number of images
    fp_readme.write('image count' + '\t' + str(global_var) + '\n')
    # write the number of distinct tags
    fp_readme.write('tag count' + '\t' + str(len(dic)) + '\n')
    fp_readme.close()



if __name__ == '__main__':

    folder_name = 'zh_labels'
    names = range(10)
    suffix = '_result'
    main(folder_name, names, suffix)

    print('done')
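
The counting step can be tried on a small in-memory sample before running it over the real files. This is only a sketch: `count_tags` is a hypothetical helper, the sample lines are made up, and unlike the script above it skips blank lines.

```python
from collections import Counter


def count_tags(lines):
    """Return (image_count, Counter of tags) for an iterable of tag lines."""
    image_count = 0
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are not images
        image_count += 1
        counts.update(tag.strip() for tag in line.split(', '))
    return image_count, counts


# Made-up sample: one line per image, tags separated by ', '.
sample = ['cat, animal, pet\n', 'dog, animal, pet\n', 'cat, kitten\n']
image_count, counts = count_tags(sample)

print('image count', image_count)
print('tag count', len(counts))
for tag, n in counts.most_common():
    print(tag + '\t' + str(n))
```

`Counter.most_common()` already returns the entries sorted by descending count, so it can stand in for the explicit `sorted(...)` call.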