My first Python program: scraping an HTML table and saving it as a local xls file
2017-08-06 迟客
Goal: fetch every day's water-situation bulletin from the target site and save it into Excel for further processing.
Examining a bulletin URL such as http://www.sxmwr.gov.cn/gb-zxfw-news-3-list-79501 shows that the number after "list" is not sequential and follows no obvious rule, so the article URLs cannot simply be generated. The plan is therefore to first scrape every day's bulletin link from each page of the bulletin section, then visit each link and save the bulletin content as a local file.
This takes 4 steps:
1. Get each day's bulletin link (index.py)
import requests
from bs4 import BeautifulSoup

# Walk listing pages 20 to 29 of the bulletin column and print every article link.
for i in range(20, 30):
    url = 'http://www.sxmwr.gov.cn/List.action?classfyid=61&currentPage=' + str(i) + '&turnHref=a0909d3e2591f4bad185dcc317ed1305ecd1ce6e561ebba08b16ce55831d5ab8a287dfa4a19beda0c6bc579659449f995714cf0f1d6601099d111aa8b2a942c122565fccc10321a12fa3875b48a46949d5c36fb26f106d16e54a688e17199bd5c4e6b68a622d3b2792ba2c781a2d4e17fffe1f9e8c4d6cdf6348d9a80dbcf0bdaea67d6bcc745b348c230d59c63a6576131bcee30514c0527ad244d7662c1922'
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Every bulletin link is an <a> tag inside the .articlelist container.
    for link in soup.select('.articlelist a'):
        print(link['href'])
For a beginner, BeautifulSoup feels like the easiest way to scrape a page (a tiny standalone example follows the output below).
The output of index.py looks like this:
gb-zxfw-news-3-list-62331
gb-zxfw-news-3-list-62314
……
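For anyone new to BeautifulSoup, here is a minimal standalone sketch of the two calls used above; the HTML fragment is made up to mirror the shape of the listing page, while select() and the ['href'] lookup behave exactly as in index.py:
from bs4 import BeautifulSoup

# A made-up fragment shaped like the listing page.
html = '''
<div class="articlelist">
  <a href="gb-zxfw-news-3-list-62331">bulletin of one day</a>
  <a href="gb-zxfw-news-3-list-62314">bulletin of another day</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# CSS selector: every <a> inside an element with class "articlelist".
for a in soup.select('.articlelist a'):
    print(a['href'])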
2. Clean up the previous step's output in Excel
The next step needs a Python list of the form ['a','b'], so I pasted the output into Excel, transposed it into a single row by copy-paste, and in the row below entered a formula that wraps each value of the row above in quotes (assuming the first value is in A1):
="'"&A1&"'"
Then I saved the sheet as comma-separated CSV, which can be opened in Notepad and copied into the next Python program.
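Incidentally, the same quote-and-join step can be done without Excel at all; a minimal sketch, assuming the hrefs printed in step 1 were collected into a Python list:
# Hypothetical: 'hrefs' holds the links printed by index.py.
hrefs = ['gb-zxfw-news-3-list-62331', 'gb-zxfw-news-3-list-62314']
# Wrap each link in quotes and join with commas, ready to paste into content.py.
print('[' + ','.join("'" + h + "'" for h in hrefs) + ']')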
3. Save each day's bulletin as an xls file (content.py)
import requests
from bs4 import BeautifulSoup

def getNewsDetail(newsurl):
    # Fetch one bulletin page and return its title and body table.
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#mainNewsTitles')[0].text.strip()
    result['article'] = soup.select('#mainNewsContent')[0]
    return result
A=['gb-zxfw-news-3-list-62331','gb-zxfw-news-3-list-62314','gb-zxfw-news-3-list-62276','gb-zxfw-news-3-list-62247','gb-zxfw-news-3-list-62218','gb-zxfw-news-3-list-62187','gb-zxfw-news-3-list-62156','gb-zxfw-news-3-list-62155','gb-zxfw-news-3-list-62133','gb-zxfw-news-3-list-62100','gb-zxfw-news-3-list-62076','gb-zxfw-news-3-list-62042','gb-zxfw-news-3-list-61998','gb-zxfw-news-3-list-61995','gb-zxfw-news-3-list-61994','gb-zxfw-news-3-list-55477','gb-zxfw-news-3-list-55385','gb-zxfw-news-3-list-55344','gb-zxfw-news-3-list-55306','gb-zxfw-news-3-list-55289','gb-zxfw-news-3-list-55165','gb-zxfw-news-3-list-55098','gb-zxfw-news-3-list-55047','gb-zxfw-news-3-list-55002','gb-zxfw-news-3-list-55000','gb-zxfw-news-3-list-54997','gb-zxfw-news-3-list-54995','gb-zxfw-news-3-list-54994','gb-zxfw-news-3-list-54991','gb-zxfw-news-3-list-54989','gb-zxfw-news-3-list-54940','gb-zxfw-news-3-list-54900','gb-zxfw-news-3-list-54850','gb-zxfw-news-3-list-54810','gb-zxfw-news-3-list-54772','gb-zxfw-news-3-list-54730','gb-zxfw-news-3-list-54695','gb-zxfw-news-3-list-54643','gb-zxfw-news-3-list-54584','gb-zxfw-news-3-list-54558','gb-zxfw-news-3-list-54556','gb-zxfw-news-3-list-54515','gb-zxfw-news-3-list-54468','gb-zxfw-news-3-list-54387','gb-zxfw-news-3-list-54325','gb-zxfw-news-3-list-54237','gb-zxfw-news-3-list-54188','gb-zxfw-news-3-list-54178','gb-zxfw-news-3-list-54132','gb-zxfw-news-3-list-54052','gb-zxfw-news-3-list-53983','gb-zxfw-news-3-list-53935','gb-zxfw-news-3-list-53892','gb-zxfw-news-3-list-53889','gb-zxfw-news-3-list-53887','gb-zxfw-news-3-list-53858','gb-zxfw-news-3-list-53825','gb-zxfw-news-3-list-53779','gb-zxfw-news-3-list-53729','gb-zxfw-news-3-list-53640','gb-zxfw-news-3-list-53589','gb-zxfw-news-3-list-53587','gb-zxfw-news-3-list-53561','gb-zxfw-news-3-list-53513','gb-zxfw-news-3-list-53465','gb-zxfw-news-3-list-53405','gb-zxfw-news-3-list-53338','gb-zxfw-news-3-list-53286','gb-zxfw-news-3-list-53284','gb-zxfw-news-3-list-53238','gb-zxfw-news-3-list-53187','gb-zxfw-news-3-list-53128','gb-zxfw-news-3-list-53072','gb-zxfw-news-3-list-53012','gb-zxfw-news-3-list-52957','gb-zxfw-news-3-list-52951','gb-zxfw-news-3-list-52897','gb-zxfw-news-3-list-52845','gb-zxfw-news-3-list-52801','gb-zxfw-news-3-list-52740','gb-zxfw-news-3-list-52632','gb-zxfw-news-3-list-52571','gb-zxfw-news-3-list-52564','gb-zxfw-news-3-list-52499','gb-zxfw-news-3-list-52407','gb-zxfw-news-3-list-52295','gb-zxfw-news-3-list-52196','gb-zxfw-news-3-list-52057','gb-zxfw-news-3-list-52003','gb-zxfw-news-3-list-51992','gb-zxfw-news-3-list-51935','gb-zxfw-news-3-list-51848','gb-zxfw-news-3-list-51751','gb-zxfw-news-3-list-51667','gb-zxfw-news-3-list-51612','gb-zxfw-news-3-list-51560','gb-zxfw-news-3-list-51550','gb-zxfw-news-3-list-51504','gb-zxfw-news-3-list-51346','gb-zxfw-news-3-list-51431','gb-zxfw-news-3-list-51246','gb-zxfw-news-3-list-51168','gb-zxfw-news-3-list-51134','gb-zxfw-news-3-list-51131','gb-zxfw-news-3-list-51100','gb-zxfw-news-3-list-51036','gb-zxfw-news-3-list-50933','gb-zxfw-news-3-list-50825','gb-zxfw-news-3-list-50760','gb-zxfw-news-3-list-50717','gb-zxfw-news-3-list-50710','gb-zxfw-news-3-list-50652','gb-zxfw-news-3-list-50578','gb-zxfw-news-3-list-50458','gb-zxfw-news-3-list-50401','gb-zxfw-news-3-list-50336','gb-zxfw-news-3-list-50285','gb-zxfw-news-3-list-50284','gb-zxfw-news-3-list-50243','gb-zxfw-news-3-list-50166','gb-zxfw-news-3-list-50068','gb-zxfw-news-3-list-49957','gb-zxfw-news-3-list-49893','gb-zxfw-news-3-list-49841','gb-zxfw-news-3-list-49840','gb-zxfw-news-3-list-49839','gb-zxfw-news-3-list-49793','gb-zxfw-news-3-list-49727','gb-zxfw-news-3-list-49653','gb-zxfw-news-3-list-49573','gb-zxfw-news-3-list-49475','gb-zxfw-news-3-list-49424','gb-zxfw-news-3-list-49421','gb-zxfw-news-3-list-49380','gb-zxfw-news-3-list-49273','gb-zxfw-news-3-list-49191','gb-zxfw-news-3-list-49125','gb-zxfw-news-3-list-49059','gb-zxfw-news-3-list-49017','gb-zxfw-news-3-list-49016','gb-zxfw-news-3-list-48960','gb-zxfw-news-3-list-48888','gb-zxfw-news-3-list-48820','gb-zxfw-news-3-list-48710','gb-zxfw-news-3-list-48673','gb-zxfw-news-3-list-48748','gb-zxfw-news-3-list-48644','gb-zxfw-news-3-list-48542','gb-zxfw-news-3-list-48502','gb-zxfw-news-3-list-48604','gb-zxfw-news-3-list-48427','gb-zxfw-news-3-list-48362','gb-zxfw-news-3-list-48360','gb-zxfw-news-3-list-48358','gb-zxfw-news-3-list-48249','gb-zxfw-news-3-list-48173','gb-zxfw-news-3-list-48108','gb-zxfw-news-3-list-48045','gb-zxfw-news-3-list-47986','gb-zxfw-news-3-list-47984','gb-zxfw-news-3-list-47923','gb-zxfw-news-3-list-47844','gb-zxfw-news-3-list-47771','gb-zxfw-news-3-list-47692','gb-zxfw-news-3-list-47585','gb-zxfw-news-3-list-47535','gb-zxfw-news-3-list-47533','gb-zxfw-news-3-list-47491','gb-zxfw-news-3-list-47410','gb-zxfw-news-3-list-47343','gb-zxfw-news-3-list-47221','gb-zxfw-news-3-list-47151','gb-zxfw-news-3-list-47123','gb-zxfw-news-3-list-47120','gb-zxfw-news-3-list-47053','gb-zxfw-news-3-list-47010','gb-zxfw-news-3-list-47057','gb-zxfw-news-3-list-46719','gb-zxfw-news-3-list-46641','gb-zxfw-news-3-list-46590','gb-zxfw-news-3-list-46550','gb-zxfw-news-3-list-47056','gb-zxfw-news-3-list-46546','gb-zxfw-news-3-list-38623','gb-zxfw-news-3-list-38512','gb-zxfw-news-3-list-38482','gb-zxfw-news-3-list-38412','gb-zxfw-news-3-list-38357','gb-zxfw-news-3-list-38169','gb-zxfw-news-3-list-38105','gb-zxfw-news-3-list-38103','gb-zxfw-news-3-list-38221','gb-zxfw-news-3-list-38101','gb-zxfw-news-3-list-38098','gb-zxfw-news-3-list-38096','gb-zxfw-news-3-list-38088','gb-zxfw-news-3-list-38050','gb-zxfw-news-3-list-37982','gb-zxfw-news-3-list-37914','gb-zxfw-news-3-list-37861']
for a in A:
    zd = getNewsDetail('http://www.sxmwr.gov.cn/' + str(a))
    s = str(zd['article'])
    names = zd['title']
    # The bulletin body is an HTML table; Excel still opens it if we name the file .xls.
    with open(str(names) + '.xls', 'w', encoding='utf-8') as f:
        f.write(s)
This again uses BeautifulSoup to scrape the pages, only now wrapped in a function; list A in the code is exactly the result of the previous step. Running it yields an xls file for every page we need. (I don't know a module that converts an HTML table to Excel, but it turns out that if you save the HTML table as a local file and rename it to .xls, Excel will open it.)
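For the record, there is a library route for turning an HTML table into a real Excel file: pandas can parse tables straight out of HTML. A minimal sketch, assuming pandas plus the lxml and openpyxl packages are installed, and using a hypothetical saved page page.html:
import pandas as pd

# read_html() returns one DataFrame per <table> found in the document.
tables = pd.read_html('page.html')
# Write the first table out as a genuine Excel workbook.
tables[0].to_excel('bulletin.xlsx', index=False)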
4. Merge all the xls files into one multi-sheet workbook
I grabbed a VBA snippet off the web:
Sub CombineWorkbooks()
    Dim FilesToOpen
    Dim x As Integer
    Dim wk As Workbook
    Application.ScreenUpdating = False
    On Error GoTo errhandler
    FilesToOpen = Application.GetOpenFilename _
        (FileFilter:="Microsoft Excel files (*.xls), *.xls", _
        MultiSelect:=True, Title:="Files to merge")
    ' GetOpenFilename returns False (a Boolean) if the user cancels.
    If TypeName(FilesToOpen) = "Boolean" Then
        MsgBox "No files selected"
        GoTo errhandler
    End If
    x = 1
    While x <= UBound(FilesToOpen)
        Set wk = Workbooks.Open(Filename:=FilesToOpen(x))
        ' Move all sheets of the opened workbook to the end of this one.
        wk.Sheets.Move after:=ThisWorkbook.Sheets(ThisWorkbook.Sheets.Count)
        x = x + 1
    Wend
    MsgBox "Merge completed!"
errhandler:
    Application.ScreenUpdating = True
End Sub
Add it in Excel under Developer → Visual Basic and run it. Works quite well.
Summary
As a beginner who hasn't even finished an introductory Python tutorial, I needed two or three days for the work above and was close to a breakdown. The task itself may well be a made-up need; I treat it as learning by doing.
Steps 2 and 4 above are a beginner's workarounds, frankly a bit comical. Once I'm able, I'll see whether the whole thing can be folded into a single Python program (a rough sketch of a Python take on step 4 follows).
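Here is that rough sketch, doing step 4 in Python rather than VBA. Because the ".xls" files written above are really HTML tables, the sketch parses each with pandas.read_html and writes it as one sheet of a single workbook; pandas, lxml and openpyxl are assumed installed:
import glob
import pandas as pd

# Collect every saved ".xls" (actually an HTML table) and merge into one workbook.
with pd.ExcelWriter('combined.xlsx') as writer:
    for path in glob.glob('*.xls'):
        table = pd.read_html(path)[0]
        # Excel caps sheet names at 31 characters.
        table.to_excel(writer, sheet_name=path[:-4][:31], index=False)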
That's all.
Update (Oct 24)
I have merged the Python parts into a single script:
import time
import requests
from bs4 import BeautifulSoup

time_start = time.time()

# Step 1: collect every bulletin link from listing pages 20 to 29.
links = []
for j in range(20, 30):
    url = 'http://www.sxmwr.gov.cn/List.action?classfyid=61&currentPage=' + str(j) + '&turnHref=a0909d3e2591f4bad185dcc317ed1305ecd1ce6e561ebba08b16ce55831d5ab8a287dfa4a19beda0c6bc579659449f995714cf0f1d6601099d111aa8b2a942c122565fccc10321a12fa3875b48a46949d5c36fb26f106d16e54a688e17199bd5c4e6b68a622d3b2792ba2c781a2d4e17fffe1f9e8c4d6cdf6348d9a80dbcf0bdaea67d6bcc745b348c230d59c63a6576131bcee30514c0527ad244d7662c1922'
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for a in soup.select('.articlelist a'):
        links.append(a['href'])

def get_xls(url_1):
    # Fetch one bulletin page and return its title and body table.
    result = {}
    res = requests.get(url_1)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#mainNewsTitles')[0].text.strip()
    result['article'] = soup.select('#mainNewsContent')[0]
    return result

# Step 2: fetch each bulletin and save its HTML table as a local .xls file.
for k in links:
    dic = get_xls('http://www.sxmwr.gov.cn/' + str(k))
    with open(dic['title'] + '.xls', 'w', encoding='utf-8') as f:
        f.write(str(dic['article']))

time_end = time.time()
print('Done, total time: ' + str(time_end - time_start) + ' seconds')
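One caveat with the script above: the bulletin title goes straight into the filename, and Windows forbids characters such as \ / : * ? " < > | in file names, so a title containing any of them would make open() fail. A minimal, hypothetical helper (the name safe_filename is my own) that could wrap dic['title'] before the file is opened:
import re

def safe_filename(name):
    # Replace characters that are illegal in Windows file names with '_'.
    return re.sub(r'[\\/:*?"<>|]', '_', name)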