Scraping China Weather (中国天气网, weather.com.cn)
2017-06-04
Mrchw
Approach:
1. Find the entry point
2. Find the URL that serves the data
3. Fetch the source and extract the data
4. Save and output the data
1. Entry point
The entry point is the 40-day forecast page, from which weather data for all of 2016 and 2017 can be traced back.
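2. Constructing the URL
Inspecting the page shows that the 40-day weather data is loaded asynchronously via JavaScript, with one URL per month.
Each URL is built from the year and the month, so we can construct the URLs from this pattern and fetch the data in a loop: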
month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in month:
    url = 'http://d1.weather.com.cn/calendar_new/'+str(year)+'/101180101_'+str(year)+str(i)+'.html?_=1496558858156'
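The loop above can also be wrapped in a small helper. The following is a minimal Python 3 sketch, not the code used in this post: `build_url` and `month_urls` are hypothetical names, and the fixed `_=1496558858156` query value is simply copied from the URL above (it looks like a cache-busting timestamp, but that is an assumption).

```python
BASE = 'http://d1.weather.com.cn/calendar_new'


def build_url(year, month, city_code='101180101'):
    """Build one monthly data URL (hypothetical helper; pattern copied from the post)."""
    # month must be an int so that it can be zero-padded to two digits
    return '{base}/{year}/{city}_{year}{month:02d}.html?_=1496558858156'.format(
        base=BASE, year=year, city=city_code, month=month)


def month_urls(year, city_code='101180101'):
    """Return the twelve monthly URLs for one year."""
    return [build_url(year, m, city_code) for m in range(1, 13)]


if __name__ == '__main__':
    for url in month_urls(2017):
        print(url)
```

Zero-padding with `{month:02d}` replaces the hand-written list of month strings, and making the city code a parameter keeps the pattern reusable for other cities.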
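3. Extracting the data
The response is JSON; after light processing it is converted to a Python list, which makes it easy to extract the fields we need.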
html = requests.get(url,headers=headers).content
# print html
datas = json.loads(html[11:])
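The slice removes the leading var fc40 = (11 characters including the trailing space).
After the conversion, the required fields can be extracted with ordinary Python list operations.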
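A fixed slice of 11 characters breaks if the variable name ever changes. As a hedged alternative, the sketch below (Python 3) splits on the first `=` instead. `parse_payload` is a hypothetical name, the sample string is made up to resemble the response described above, and the function expects already-decoded text rather than raw bytes.

```python
import json


def parse_payload(text):
    """Strip a leading 'var <name> =' assignment and parse the rest as JSON."""
    # Assumption: the body is a JavaScript assignment whose right-hand side is valid JSON.
    _, sep, rest = text.partition('=')
    if not sep:
        raise ValueError('no "=" found in payload')
    # Drop surrounding whitespace and an optional trailing semicolon.
    return json.loads(rest.strip().rstrip(';'))


# Made-up sample; the values are illustrative only.
sample = 'var fc40 = [{"date": "20170501", "hmax": "28", "hmin": "15"}]'
records = parse_payload(sample)
print(records[0]['date'], records[0]['hmax'], records[0]['hmin'])
```

The result is a list of dictionaries, so each field is read by key, as the full code does with `data['date']`, `data['hmax']` and so on.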
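4. Saving the data
The data is saved in .csv format. This post uses Python 2.7, where writing Chinese text requires codecs for encoding handling (here, to write a UTF-8 BOM at the start of the file).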
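For comparison, Python 3 can get the same BOM-prefixed file without codecs by opening the file with the `utf-8-sig` encoding. This is a minimal sketch under that assumption, not the code used in this post; `write_rows`, the demo file name and the row values are all illustrative.

```python
import csv
import os
import tempfile


def write_rows(path, header, rows):
    """Write a CSV file that starts with a UTF-8 BOM (hypothetical helper)."""
    # 'utf-8-sig' prepends the BOM; newline='' leaves line endings to the csv module.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), 'weather_demo.csv')
write_rows(path, ['日期', '最高温度'], [['20170501', '28']])  # illustrative values

with open(path, 'rb') as f:
    raw = f.read()
print(raw[:3] == b'\xef\xbb\xbf')  # checks that the file starts with the UTF-8 BOM
```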
Full code
# -*- coding:utf-8 -*-
# Python 2.7
import requests
import json
import csv
import codecs
import sys

# Python 2 workaround: make implicit str/unicode conversions use UTF-8,
# so csv.writer can write the unicode values returned by json.loads
reload(sys)
sys.setdefaultencoding('utf8')

# Example URL:
# url = 'http://d1.weather.com.cn/calendar_new/2017/101010100_201705.html?_=1496558858156'


def getHtml(url):
    # headers is the module-level dict defined in the __main__ block below
    html = requests.get(url,headers=headers).content
    # print html
    # strip the leading 'var fc40 = ' (11 characters), then parse the JSON
    datas = json.loads(html[11:])
    # print datas
    return datas


def getData(url):
    datas = getHtml(url)
    for data in datas:
        date = data['date']
        hgl = data['hgl']
        hmax = data['hmax']
        hmin = data['hmin']
        date2 = data['nlyf'] + data['nl']
        alins = data['alins']
        als = data['als']
        print date,hgl,hmax,hmin,date2,alins,als
        # append one row per day; binary mode avoids blank lines in the output
        with open('weather_zz.csv','ab') as f:
            writer = csv.writer(f)
            writer.writerow([date,hgl,hmax,hmin,date2,alins,als])


if __name__ == '__main__':
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        'Referer':'http://www.weather.com.cn/weather40d/101010100.shtml'
    }
    # create the file, then write a UTF-8 BOM followed by the header row
    with open('weather_zz.csv', 'wb') as f:
        f.write(codecs.BOM_UTF8)
        writer = csv.writer(f)
        # columns: date, precipitation probability, high, low, lunar date, auspicious, inauspicious
        writer.writerow(['日期', '降水概率', '最高温度', '最低温度', '农历日期', '宜', '不宜'])
    # raw_input returns the typed text as-is; Python 2's input() would evaluate it
    year = raw_input('Enter year: ')
    month = ['01','02','03','04','05','06','07','08','09','10','11','12']
    for i in month:
        url = 'http://d1.weather.com.cn/calendar_new/'+str(year)+'/101180101_'+str(year)+str(i)+'.html?_=1496558858156'
        getData(url)
Result
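When writing with csv in Python 2, blank lines tend to appear between rows; opening the file in binary mode, such as wb or ab, avoids this:
with open('weather_zz.csv','ab') as f: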
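In Python 3 the csv module works on text-mode files, so `wb` and `ab` no longer apply; the documented counterpart is to pass `newline=''` when opening the file. The sketch below assumes Python 3, and `append_row` and the row values are hypothetical.

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Append one CSV row (hypothetical helper)."""
    # Text mode with newline='' so the csv module's own '\r\n'
    # line terminator is written to the file unchanged.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(row)


path = os.path.join(tempfile.mkdtemp(), 'append_demo.csv')
append_row(path, ['20170501', '28', '15'])  # illustrative values
append_row(path, ['20170502', '27', '14'])

# Read the file back to confirm there are no empty rows in between.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```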
Data up to 20180106 (2018-01-06) can be retrieved.