Python source code for scraping Jianshu article titles and content
2016-10-07
Cocoa_Coder
A very simple scraping program, suitable for beginners.
The source code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql

html = urlopen("http://www.jianshu.com")
bsobj = BeautifulSoup(html, "html.parser")
# print(bsobj.findAll("h4", {"class": "title"}))  # inspect the matched title tags

SqlConnect = pymysql.connect(host='localhost', user='root', password='123456',
                             db='liusenTestURL', charset='utf8mb4')
cur = SqlConnect.cursor()  # obtain a cursor

# write one record to the database
def writeDataBase(title, content, textURL):
    cur.execute("INSERT INTO jianshuTEXT (title,content,URL) VALUES (%s,%s,%s)",
                (title, content, textURL))
    cur.connection.commit()
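writeDataBase assumes a jianshuTEXT table already exists in the liusenTestURL database. The post does not show its schema; a plausible sketch (the column names come from the INSERT above, but the types and the id column are my assumption) would be:

```sql
CREATE TABLE IF NOT EXISTS jianshuTEXT (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255),
    content MEDIUMTEXT,
    URL     VARCHAR(255)
) CHARACTER SET utf8mb4;
```

utf8mb4 matches the charset passed to pymysql.connect, so full Unicode (including emoji in article bodies) round-trips cleanly.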
# fetch the title and body of one article page and store them
def gainContent(contentHtml):
    contenthtml = urlopen(contentHtml)
    contentBsObj = BeautifulSoup(contenthtml, "html.parser")
    textTitle = contentBsObj.find('title').get_text()
    print('title : ' + textTitle)
    print('----------------------')
    textContent = contentBsObj.find("div", {"class": "show-content"}).get_text()
    # print(textContent)
    writeDataBase(textTitle, textContent, contentHtml)

try:
    for title in bsobj.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
        # print(title.find("a"))
        if 'href' in title.find("a").attrs:
            contenthtml = 'http://www.jianshu.com' + title.find("a").attrs['href']
            print(contenthtml)
            gainContent(contenthtml)
finally:
    cur.close()
    SqlConnect.close()
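The list-parsing step above can be exercised offline against a small inline HTML fragment. The fragment below is my own mock-up of the markup the scraper expects; Jianshu's real front-page structure may well have changed since 2016:

```python
from bs4 import BeautifulSoup

# Mock of the article list the scraper expects on the Jianshu front page.
sample = """
<ul class="article-list thumbnails">
  <li><h4 class="title"><a href="/p/abc123">First post</a></h4></li>
  <li><h4 class="title"><a href="/p/def456">Second post</a></h4></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")
links = []
# Same traversal as the scraper: find the list, then every h4.title inside it,
# then build an absolute URL from each anchor's href.
for title in soup.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
    a = title.find("a")
    if "href" in a.attrs:
        links.append("http://www.jianshu.com" + a.attrs["href"])
print(links)
```

Testing the selectors this way, before wiring in the network and the database, makes it much easier to tell whether a broken run is a parsing problem or a connectivity problem.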
You are welcome to discuss and learn together.
Sometimes the page encoding is not utf-8, which makes things harder. If the third-party request library is requests, the fetched text needs a conversion step. For a gb2312-encoded page, handle it as follows, otherwise the Chinese text comes out garbled:
import requests

detailURL = "http://xxx.xxx.xxxxxx.com/"
html = requests.session().get(detailURL, headers=headers)  # headers defined elsewhere
# re-encode the mis-decoded latin-1 text back to raw bytes, then decode with
# the charset declared inside the page content itself
jieshouText = html.text.encode('ISO-8859-1', "ignore").decode(
    requests.utils.get_encodings_from_content(html.text)[0], "ignore")
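The round-trip in that one-liner can be demonstrated without any network access. Here the gb2312 bytes are produced locally to simulate what the server sends (the sample string is my own, purely for illustration):

```python
# A gb2312-encoded response that requests mis-decodes as ISO-8859-1 (latin-1)
# can be recovered: latin-1 maps every byte 1:1 to a code point, so re-encoding
# the garbled text restores the original bytes, which are then decoded with the
# page's real charset.
original = "简书中文内容"
raw_bytes = original.encode("gb2312")         # bytes as the server sent them
mojibake = raw_bytes.decode("ISO-8859-1")     # requests' wrong default guess
fixed = mojibake.encode("ISO-8859-1", "ignore").decode("gb2312", "ignore")
print(fixed)  # 简书中文内容
```

This lossless byte round-trip is exactly why ISO-8859-1 is used as the intermediate step: any other single-byte guess could drop or remap bytes before the correct decode runs.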
Reference: "garbled Chinese pages when scraping with Python's requests library"
http://www.zhetenga.com/view/python的requests类抓取中文页面出现乱码-0abbaa140.html
The explanation there is very detailed.