Encoding problems when scraping pages with Python lxml

2017-07-19 CMASTER

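Background

I recently started a full-stack project that combines a Python crawler, Flask, and React. The basic plan: Flask exposes API endpoints that return the scraped data as JSON, the frontend fetches it via AJAX, and cross-origin requests are handled on the backend (luckily Flask already has a ready-made extension for this, flask-cors). Some problems came up while passing the data along, so I am recording them here.
The requirement: scrape an article from a web page, have the crawler return its content to the backend as HTML, have the backend pass it to the frontend as JSON, and have the frontend render the whole article with React's dangerouslySetInnerHTML.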

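Problems

While implementing this I ran into the following problems:

The content fetched by the crawler could not be converted into a normal string
The data the backend returned to the frontend could not be rendered properly

Enough talk. Here is the simplified code; let's see where it goes wrong.

Crawler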

import requests
from lxml.html import fromstring,tostring

class J_Detail(object):
    def __init__(self, id):
        self.url = 'http://top.jobbole.com/' + id
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        self.records = ''

    def get_detail(self):
        r = requests.get(self.url, headers=self.headers)
        page_source = r.text
        root = fromstring(page_source)
        element = root.xpath('//div[@class="p-entry"]')[0]
        content = tostring(element)  ###log 1

        return content

Backend

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/api/jobbole/content/<id>', methods=['GET'])
def get_detail_content(id):
    detail = J_Detail(id)
    content = detail.get_detail()

    return jsonify(
        message = 'OK',
        data = content
    )

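It looks fine, but running it raises an error saying that content cannot be JSON-serialized, because content is not a string (on Python 3, tostring returns a bytes object by default).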
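
The failure can be reproduced without Flask or lxml. The sketch below is only an illustration: the standard-library json module stands in for jsonify, and the bytes literal and the helper try_serialize are made up for this example.

```python
import json

# Stand-in for what tostring() hands back by default on Python 3: bytes.
content = b'<div class="p-entry">hello</div>'


def try_serialize(value):
    """Return the JSON text, or the error message if serialization fails."""
    try:
        return json.dumps({"message": "OK", "data": value})
    except TypeError as exc:
        return "TypeError: {}".format(exc)


failed = try_serialize(content)                     # bytes are rejected
succeeded = try_serialize(content.decode("utf-8"))  # str is accepted

print(failed)
print(succeeded)
```

The first call reports a TypeError because JSON has no bytes type; the second succeeds once the value is a real text string.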

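Solution

Back at log 1 in the crawler: after selecting the target element, the code already calls lxml.html's tostring to turn it into a string. Since that was not enough, let's convert it once more with Python's built-in str and change the line at log 1 to: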

content = str(tostring(element))

That works in the sense that the error is gone, but the data the frontend receives looks odd...

{
  "data": "b'<div class=\"p-entry\"> ...\n ... </div>\n\n\n'",
  "message": "OK"
}
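
The b'...' wrapper appears because calling str() on a bytes object returns its printable representation rather than decoding it. A minimal sketch of the difference, using a made-up bytes value in place of the scraped HTML:

```python
# Hypothetical stand-in for the bytes returned by tostring(element).
raw = b'<div class="p-entry">\n<p>text</p>\n</div>\n'

as_repr = str(raw)             # the repr: b'...' with newlines escaped
as_text = raw.decode("utf-8")  # the actual text, with real newlines

print(as_repr)
print(as_text)
```

Here as_repr begins with b' and contains a literal backslash followed by n instead of line breaks, which is why it cannot be used as HTML, while as_text is the markup itself.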

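Data like this obviously cannot be rendered correctly by dangerouslySetInnerHTML.
At this point I thought about post-processing the crawler's content, for example slicing the string to strip the extra characters and using a regular expression to replace the \n sequences. But that adds logic to the backend and complicates things. Is there a simpler way?
There is, and it is simple; I had just been looking in the wrong direction.
Here is what the official documentation says about tostring in lxml.html: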

The result of tostring(encoding='unicode') can be treated like any other Python unicode string and then passed back into the parsers. However, if you want to save the result to a file or pass it over the network, you should use write() or tostring() with a byte encoding (typically UTF-8) to serialize the XML. The main reason is that unicode strings returned by tostring(encoding='unicode') are not byte streams and they never have an XML declaration to specify their encoding. These strings are most likely not parsable by other XML libraries.

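Aha, that's the cause: it is an encoding issue. By default tostring returns encoded bytes, while encoding='unicode' makes it return a text string. So the fix is simple; replace the line at log 1 with: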

content = tostring(element,pretty_print=True, encoding='unicode')
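
To see the two return types side by side, here is a small self-contained sketch. The inline HTML in page is made up for illustration and replaces the real HTTP request; the standard-library json module again stands in for jsonify.

```python
import json
from lxml.html import fromstring, tostring

# Made-up page source standing in for r.text from the crawler.
page = '<html><body><div class="p-entry"><p>你好</p></div></body></html>'

root = fromstring(page)
element = root.xpath('//div[@class="p-entry"]')[0]

as_bytes = tostring(element)                     # default: encoded bytes
as_text = tostring(element, encoding='unicode')  # text string (str)

# A str serializes cleanly, so no extra cleanup is needed.
payload = json.dumps({"message": "OK", "data": as_text}, ensure_ascii=False)

print(type(as_bytes).__name__, type(as_text).__name__)
print(payload)
```

With encoding='unicode' the value is already text, so it can be passed to the JSON layer directly.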

The JSON output becomes:

{
  "data": "<div class=\"p-entry\">... ... </div>",
  "message": "OK"
}

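Problem solved!

Summary

Well... when something breaks, don't rush into hacking at it blindly. Google more and read the docs more. (shrug)

Finally

This is the project: OnlyRead, an information aggregation site. In the future it will include content from many sites, including Jobbole (伯乐在线), V2EX, SegmentFault, 36Kr, and others such as ONE and Qiushibaike ~~basically a way for programmers to slack off and kill time at work~~.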
