Scraping Shanghai's 2017 weather from China Weather (weather.com.cn) with a Python crawler

2017-05-30 chengcxy

1. Building the URL

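I had already finished the code for this assignment before the Dragon Boat Festival. At the time, I looked at the URL the page requests: http://d1.weather.com.cn/calendar_new/2017/101020100_201701.html?_=1496154608683. In this URL, 101020100 is the city's ID on the site, 201701 is the year and month to query, and 1496154608683 is a timestamp. My first version fixed the city ID, iterated over the numbers 1 to 12 for the month, and converted time.time() to a string for the timestamp, which gave a list of 12 monthly URLs to request and parse in turn. I later found that the city ID and the timestamp can both be hard-coded: only the year and month need to change for the data to be parsed.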
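The URL construction described above can be sketched as follows. This is only an illustration: the names `URL_TEMPLATE` and `build_month_urls` are made up for this example, and the template is simply the observed URL with the year, city ID, month and timestamp turned into placeholders.

```python
import time

# Assumed template, derived from the URL observed in the browser.
URL_TEMPLATE = (
    "http://d1.weather.com.cn/calendar_new/{year}/"
    "{city_id}_{year}{month:02d}.html?_={ts}"
)


def build_month_urls(year=2017, city_id="101020100", timestamp=None):
    """Return one URL per month (January to December) for a city and year."""
    if timestamp is None:
        # Millisecond timestamp, matching the 13-digit value in the observed URL.
        timestamp = int(time.time() * 1000)
    return [
        URL_TEMPLATE.format(year=year, city_id=city_id, month=month, ts=timestamp)
        for month in range(1, 13)
    ]


# Hard-coding the timestamp, as the final version of the script does.
urls = build_month_urls(timestamp=1496154608683)
print(urls[0])
```

Passing a fixed `timestamp` reproduces the hard-coded variant, while leaving it out falls back to the current time, as in the first version.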

2. Parsing the data

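Data returned by the page

The response looks like JSON, but it is not quite JSON: it starts with var fc40 =, followed by a list whose elements are dictionaries holding the weather data for each day of the month. I handled it by processing the string and converting each element into a dictionary for storage, as described below.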
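The string handling described above can be tried offline with a small sketch. The sample payload is invented for illustration (only the `var fc40 =` prefix and the key names come from this article), and `split_records` is a hypothetical helper name.

```python
import json


def split_records(payload):
    """Split a `var fc40 = [...]` payload into one dict per day using string handling."""
    # Keep only the text between the first "[" and the last "]".
    body = payload[payload.index("[") + 1:payload.rindex("]")]
    records = []
    for piece in body.split("},"):
        piece = piece.strip()
        if not piece:
            continue
        # The split removes the closing brace from every piece but the last one.
        if not piece.endswith("}"):
            piece += "}"
        records.append(json.loads(piece))
    return records


# Made-up sample shaped like the response described above.
sample = 'var fc40 = [{"date":"20170101","hmax":"10"},{"date":"20170102","hmax":"12"}]'
records = split_records(sample)
print(records)
```

One limitation of this approach: it would break if any value in the data itself contained the sequence `},`.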

3. Code

#coding:utf-8
# Python 2 only: force UTF-8 as the default string encoding
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import json
import requests
# Comment out the import below when testing without a database
from class_mysql import Mysql

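def get_data(url):
    try:
        # Decode the response bytes to text, assuming the server sends UTF-8
        html=requests.get(url,headers=headers).content.decode('utf-8')
        print(html)
        # Drop the leading "var fc40 = [" and the trailing "]" so that only the
        # comma-separated dictionaries remain. Slicing is used because
        # lstrip()/rstrip() strip character sets, not exact prefixes or suffixes.
        html_str=html[html.index('[')+1:html.rindex(']')]
        # Split on "}," and restore the closing brace that the split removed,
        # then parse each piece with json
        dict_str=html_str.split('},')
        for i in dict_str:
            item={}
            if not i.endswith('}'):
                json_str=i+'}'
            else:
                json_str=i
            print(json_str)
            # Convert to a dictionary
            weather_data=json.loads(json_str)
            # Map the keys to the database field order, then insert the row
            for k in dict_index2.keys():
                item[dict_index2[k]]=weather_data[k]
            print(item)
            project.insert(item)
    except Exception as e:
        print("get_data parsing error: %s" % e)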
if __name__ == '__main__':
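    # Meaning of each key in the returned data, kept for reference:
    # lunar month, lunar day, Gregorian date, time, weekday, high temperature,
    # low temperature, precipitation probability, fortune, marriage
    dict_index={'nlyf':'农历月份',"nl":'农历日期','date':'阳历','time':'时间点','wk':'周几','hmax':'最高温度','hmin':'最低温度','hgl':'降水概率','alins':'财运','als':'婚姻'}
    # Position of each key in the database row
    dict_index2={'nlyf':1,"nl":2,'date':3,'time':4,'wk':5,'hmax':6,'hmin':7,'hgl':8,'alins':9,'als':10}
    # Column names, in the same order as dict_index2
    field_list=['农历月份','农历日期','阳历','时间点','周几','最高温度','最低温度','降水概率','财运','婚姻']
    project=Mysql('shanghai',field_list,len(field_list))
    project.create_table()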
    sh_url='http://d1.weather.com.cn/calendar_new/2017/101020100_2017%s.html?_=1496154608683'
    headers={'Accept':'*/*',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host':'d1.weather.com.cn',
    'Referer':'http://www.weather.com.cn/weather40d/101230808.shtml',
    'Cookie':'vjuids=-35cf735ec.15c3a674e60.0.ea0c2b4b0db22; f_city=%E5%8C%97%E4%BA%AC%7C101010100%7C; BIGipServerd1src_pool=1874396221.20480.0000; Hm_lvt_080dabacb001ad3dc8b9b9049b36d43b=1495628468,1495628495,1495642584; Hm_lpvt_080dabacb001ad3dc8b9b9049b36d43b=1495642621; vjlast=1495628468.1495642583.13'
    }
    for month in range(1,13):
        # Zero-pad the month so that 1 becomes "01"
        url=sh_url%('%02d'%month)
        get_data(url)

4. Parsed data stored in the database

If the data contains duplicate rows, deduplicate them in SQL by grouping on the date.
![Data](http://upload-images.jianshu.io/upload_images/3888998-3e98c3925f8ba30d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
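A minimal sketch of that deduplication step follows. It uses an in-memory SQLite database from the standard library as a stand-in for the MySQL table, and the table name `weather`, the column names and the rows are all invented for illustration, so the real query would need to be adapted to the actual schema.

```python
import sqlite3


def dedupe_by_day(conn):
    """Return one row per day by grouping on the day column."""
    query = (
        "SELECT day, MAX(hmax), MAX(hmin) "
        "FROM weather GROUP BY day ORDER BY day"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (day TEXT, hmax TEXT, hmin TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("20170101", "10", "4"),
        ("20170101", "10", "4"),  # duplicate of the first row
        ("20170102", "12", "6"),
    ],
)
rows = dedupe_by_day(conn)
print(rows)
```

Because exact duplicates carry identical values, taking `MAX` of each column inside the group simply returns the shared value once per day.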

5. An alternative parsing approach

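Classmates in group 2 used JSON deserialization instead. For details, see this article:
Python作业 -- 天气预报爬虫 (Python assignment: weather forecast crawler)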
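The linked article's code is not reproduced here, so the following is only a guess at what deserializing the whole array in one call could look like. It assumes that everything after the first `=` is a valid JSON array; `load_calendar` and the sample payload are hypothetical.

```python
import json


def load_calendar(text):
    """Deserialize the array on the right-hand side of `var fc40 = [...]` in one call."""
    # Everything after the first "=" is assumed to be a valid JSON array,
    # optionally followed by a semicolon.
    _, _, array_text = text.partition("=")
    return json.loads(array_text.strip().rstrip(";"))


# Made-up sample shaped like the response described in section 2.
sample = 'var fc40 = [{"date":"20170101","hgl":"20%"},{"date":"20170102","hgl":"5%"}]'
days = load_calendar(sample)
print(days[0]["date"], days[0]["hgl"])
```

Compared with splitting on `},`, this leaves all the brace handling to the JSON parser, at the cost of failing outright if the payload is not strictly valid JSON.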