
Python source code for scraping Jianshu article titles and content

2016-10-07  This article has reached 146 readers  Cocoa_Coder

A very simple scraping program, suitable for beginners.

The source code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql

html = urlopen("http://www.jianshu.com")
bsobj = BeautifulSoup(html, "html.parser")

# print(bsobj.findAll("h4", {"class": "title"}))  # inspect the matched tags

# open a MySQL connection (credentials are for the local test database)
SqlConnect = pymysql.connect(host='localhost', user='root', password='123456',
                             db='liusenTestURL', charset='utf8mb4')
cur = SqlConnect.cursor()  # get a cursor


# write one article to the database
def writeDataBase(title, content, textURL):
    # parameterized query: let the driver escape the values
    cur.execute("INSERT INTO jianshuTEXT (title, content, URL) VALUES (%s, %s, %s)",
                (title, content, textURL))
    SqlConnect.commit()


# fetch one article page and extract its title and body text
def gainContent(contentHtml):
    contenthtml = urlopen(contentHtml)
    contentBsObj = BeautifulSoup(contenthtml, "html.parser")

    textTitle = contentBsObj.find('title').get_text()
    print('title : ' + textTitle)
    print('----------------------')

    textContent = contentBsObj.find("div", {"class": "show-content"}).get_text()
    # print(textContent)

    writeDataBase(textTitle, textContent, contentHtml)


try:
    # walk every article title in the home-page list
    for title in bsobj.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
        # print(title.find("a"))
        if 'href' in title.find("a").attrs:
            contenthtml = 'http://www.jianshu.com' + title.find("a").attrs['href']
            print(contenthtml)
            gainContent(contenthtml)
finally:
    # always release the cursor and the connection
    cur.close()
    SqlConnect.close()
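For readers curious about what BeautifulSoup's `find`/`attrs` lookup is doing, the same href extraction can be sketched with the standard library's `html.parser` alone. The sample HTML below imitates the 2016 Jianshu list markup and is my own invention, not fetched from the site:

```python
from html.parser import HTMLParser

# collect the href of every <a> that appears inside an <h4 class="title">
class TitleLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h4" and attrs.get("class") == "title":
            self.in_title = True
        elif tag == "a" and self.in_title and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_title = False

# hypothetical markup shaped like the article list the script walks
sample = ('<ul class="article-list thumbnails">'
          '<li><h4 class="title"><a href="/p/abc123">First post</a></h4></li>'
          '<li><h4 class="title"><a href="/p/def456">Second post</a></h4></li>'
          '</ul>')

parser = TitleLinkParser()
parser.feed(sample)
print(parser.links)  # -> ['/p/abc123', '/p/def456']
```

BeautifulSoup does the same walk for you and is much more forgiving of messy real-world HTML, which is why the script above uses it.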

Feel free to discuss and learn together.

Sometimes the page encoding is not utf-8, which makes things trickier. If you are using the third-party requests library, the fetched data needs a conversion step: for a gb2312-encoded page, do the following, or the Chinese text will be garbled:

detailURL = "http://xxx.xxx.xxxxxx.com/"

html = requests.session().get(detailURL, headers=headers)

# recover the raw bytes, then decode them with the encoding declared
# inside the page itself (e.g. gb2312)
jieshouText = html.text.encode('ISO-8859-1', "ignore").decode(
    requests.utils.get_encodings_from_content(html.text)[0], "ignore")
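The encode/decode round-trip works because requests falls back to ISO-8859-1 when the server sends no charset header, so each byte of the gb2312 payload gets mapped to one Latin-1 character. Re-encoding as ISO-8859-1 recovers the original bytes, which can then be decoded with the right codec. A minimal sketch of that round-trip with no network involved (the sample string is mine):

```python
# simulate a gb2312 page body that requests mis-decoded as ISO-8859-1
original = "中文页面"                   # the real page text
raw_bytes = original.encode("gb2312")   # bytes actually sent on the wire

mojibake = raw_bytes.decode("ISO-8859-1")               # what response.text shows
fixed = mojibake.encode("ISO-8859-1").decode("gb2312")  # the round-trip fix

print(fixed)  # -> 中文页面
```

ISO-8859-1 is the one codec that maps every possible byte to a character and back losslessly, which is exactly why this trick is safe.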

Reference: "Garbled Chinese when scraping pages with Python's requests"
http://www.zhetenga.com/view/python的requests类抓取中文页面出现乱码-0abbaa140.html

The explanation there is very detailed.
