Hands-On Web Scraping, Part 1: Getting a List of Actors

2018-09-30 lip136

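This is the first post in the web-scraping series. In data development you need data before you can do any algorithm research, and the data sets the upper bound on what the algorithms can achieve.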

Python modules

requests
lxml

Installation

pip install requests lxml

Scraping process

Requesting the page

What I want is the list of actors; only the names are needed.

# encoding:utf-8
import requests
from lxml import etree

url = 'http://g.manmankan.com/dy2013/mingxing/index_2.shtml'
response = requests.get(url)
html = response.text
print(html)  # the fetched HTML; look through it to find the content you need

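Extracting the needed information from a single page

The parsing method used here is XPath; see the tools post for how to use it. In the browser, right-click and choose Inspect, then select an actor's name to see where it sits in the HTML:

[image]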

selector = etree.HTML(html)
person = selector.xpath('//li/span/a/text()')
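
To see what this XPath expression selects without requesting the site, here is a minimal sketch that runs it against a small hand-written HTML fragment. The fragment and the names in it are made up for illustration; it only assumes the li > span > a nesting that the expression implies.

```python
from lxml import etree

# Hypothetical fragment; the real page is assumed to nest names as li > span > a.
sample_html = """
<ul>
  <li><span><a href="/a.shtml">Actor A</a></span></li>
  <li><span><a href="/b.shtml">Actor B</a></span></li>
  <li><a href="/other.shtml">Not matched</a></li>
</ul>
"""

selector = etree.HTML(sample_html)
# text() returns the text nodes of every <a> that sits inside <li><span>.
names = selector.xpath('//li/span/a/text()')
print(names)
```

The third link is skipped because it is not wrapped in a span, which shows how the path filters out unrelated links.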

Looping over multiple pages

This site is simple: only the number in index_2 changes from page to page, so a single loop is enough.
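
The loop can be sketched as follows. The helper name page_urls is made up for this example, and the page range (2 through 39) is the one used in the full code of this post; whether the site still serves those pages is not checked here.

```python
BASE_URL = 'http://g.manmankan.com/dy2013/mingxing/index_{}.shtml'


def page_urls(start, stop):
    """Return the listing URLs for pages start..stop-1 (hypothetical helper)."""
    return [BASE_URL.format(index) for index in range(start, stop)]


urls = page_urls(2, 40)
print(urls[0])
print(len(urls))
```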

Saving to a file

Writing to a txt file with the open function is all it takes.

with open('./actors.txt', 'w', encoding='utf-8') as f:
    for name in person:  # person is a list, so write one name per line
        f.write(name + '\n')
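
As a small sketch of the writing step, the snippet below writes a list of names one per line and reads them back to check the result. The helper save_names, the sample names, and the temporary path are made up for illustration; it assumes Python 3, where open accepts an encoding argument.

```python
import os
import tempfile


def save_names(names, path):
    """Write each name on its own line as UTF-8 (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        for name in names:
            f.write(name + '\n')


sample_names = ['Actor A', 'Actor B']  # placeholder data, not scraped
path = os.path.join(tempfile.mkdtemp(), 'actors.txt')
save_names(sample_names, path)

# Read the file back to confirm one name per line was written.
with open(path, encoding='utf-8') as f:
    saved = f.read().splitlines()
print(saved)
```

The with block closes the file on exit, so no explicit close call is needed.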

Full code

# encoding:utf-8
import requests
from lxml import etree


def get_person(index):
    """Fetch one listing page and return the actor names found on it."""
    url = 'http://g.manmankan.com/dy2013/mingxing/index_{}.shtml'.format(index)
    response = requests.get(url)
    html = response.text
    selector = etree.HTML(html)
    person = selector.xpath('//li/span/a/text()')
    return person


def write_txt(actors_china):
    """Write one actor name per line to actors.txt as UTF-8."""
    with open('./actors.txt', 'w', encoding='utf-8') as f:
        for i in actors_china:
            f.write(i + '\n')


def main():
    actors = []
    for index in range(2, 40):  # pages index_2 through index_39
        actor = get_person(index)
        actors.append(actor)
    actors_china = []
    for i in actors:
        for j in i:
            # Encoding fix: undo the ISO-8859-1 decoding and decode as GBK instead.
            # Try removing this step yourself to see the garbled output.
            a = j.encode('iso-8859-1').decode('gbk')
            actors_china.append(a)
    print(len(actors_china))
    write_txt(actors_china)


if __name__ == '__main__':
    main()
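
The encoding step in main() can be tried in isolation. This sketch simulates the problem rather than fetching the page: it assumes, as the fix in the code above suggests, that the bytes are GBK but were decoded as ISO-8859-1. The sample string is made up.

```python
original = '演员'  # made-up sample text ("actor")

# Simulate GBK bytes from the server being decoded as ISO-8859-1.
gbk_bytes = original.encode('gbk')
garbled = gbk_bytes.decode('iso-8859-1')

# The repair used in the full code: recover the raw bytes, then decode as GBK.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired == original)
```

ISO-8859-1 maps every byte to exactly one character, which is why the round trip loses nothing. With requests, setting response.encoding = 'gbk' before reading response.text should have the same effect, since that attribute controls how the text is decoded.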

Summary

The two main things to learn here:

Using XPath
Handling text encodings in Python