First Web-Scraping Practice
2020-11-11
1037号森林里一段干木头
-
Overview: this crawler scrapes the vehicle information published on the website of the Ministry of Industry and Information Technology (MIIT), following links that are generated dynamically by JavaScript.
The listing has 123 pages in total, but the address in the address bar does not change when you open another page, which means each new page is rendered by JavaScript.
Every detail page has the same layout, so once a single detail page is handled, the whole job reduces to collecting the URLs of all the detail pages.

-
Right-click on the home page -> Inspect, open the Network tab, then click page 2 in the listing and watch the requests that go out. At the spot marked 4 there is a payload paramJson{"pageNo":2,...}, where pageNo is the page number.
[screenshot: 3.PNG]
Scroll up: under General, the Request URL is the address the current page was requested from. Look closely and you will find paramJson and pageNo inside it, which means the page information is carried in the URL itself.
[screenshot: image.png]
Go back to the paramJson{"pageNo":2,...} entry
[screenshot: 4.PNG]
and click "view URL encoded"; it becomes
[screenshot: 5.PNG]
This string is exactly a fragment of the Request URL; in other words, the Request URL is built by URL-encoding these fields. Matching the characters up gives the encoding scheme:

{ ==> %7B
" ==> %22
letters ==> letters (unchanged)
: ==> %3A
, ==> %2C
2 ==> 2
To fetch every page, all that is needed is a loop that substitutes the other page numbers for the 2 in pageNo:2.
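The mapping above is ordinary percent-encoding, which Python's standard library can reproduce. A minimal sketch using a stripped-down paramJson (the real payload carries more fields, such as search):

```python
from urllib.parse import quote, unquote

# stripped-down paramJson; the real request carries more fields
param_json = '{"pageNo":2,"pageSize":20}'

encoded = quote(param_json, safe="")   # percent-encode every reserved character
print(encoded)                         # %7B%22pageNo%22%3A2%2C%22pageSize%22%3A20%7D
decoded = unquote(encoded)             # decoding round-trips back to the JSON text
```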
- 1. Use requests_html in a loop to collect the URL links from every listing page: on each page, r.html.links yields the links, and prefixing them with the site root gives the URL of every vehicle.
```python
from requests_html import HTMLSession

session = HTMLSession()
url_list = []   # all detail-page URLs
pageNo = 123    # total number of listing pages
for num in range(1, pageNo + 1):   # +1: range() excludes the stop value
    url = "https://www.miit.gov.cn/api-gateway/jpaas-publish-server/front/page/build/unit?webId=b3eba6883f9240e2b51025f690afbae8&pageId=02621da61f2548b2ab3ca69e348d72bc&parseType=bulidstatic&pageType=column&tagId=%E4%BF%A1%E6%81%AF%E5%88%97%E8%A1%A8&tplSetId=9a9a7b87a4444169bdef99ff1f84e1aa&unitUrl=%2Fapi-gateway%2Fjpaas-publish-server%2Ffront%2Fpage%2Fbuild%2Funit&paramJson=%7B%22pageNo%22%3A{}%2C%22pageSize%22%3A20%2C%22loadEnabled%22%3Atrue%2C%22search%22%3A%22%7B%5C%22title%5C%22%3A%5C%22%5C%22%2C%5C%22PICI%5C%22%3A%5C%22338%5C%22%2C%5C%22QYMC%5C%22%3A%5C%22%5C%22%2C%5C%22CPSB%5C%22%3A%5C%22%5C%22%2C%5C%22CPMC%5C%22%3A%5C%22%5C%22%2C%5C%22CPXH%5C%22%3A%5C%22%5C%22%7D%22%7D".format(num)
    r = session.get(url)
    all_links = r.html.links
    for item in all_links:
        # each extracted link is wrapped in literal \" characters, hence [2:-2]
        url_list.append("https://www.miit.gov.cn/" + item[2:-2])
```
```
[out]: all_links
{'\\"/datainfo/cpgg/xcpgs/art/2020/art_0665c69d3bc34a66ad7b30bfc98922cb.html\\"',
 '\\"/datainfo/cpgg/xcpgs/art/2020/art_f45286e89b4f425cbd1db8597aa43cad.html\\"',
 '\\"/datainfo/cpgg/xcpgs/art/2020/art_fcf6daa662134326a31294769adc94a1.html\\"'}

[out]: url_list[:3]
['https://www.miit.gov.cn//datainfo/cpgg/xcpgs/art/2020/art_882da80d6b1248548cb077c5f91d7237.html',
 'https://www.miit.gov.cn//datainfo/cpgg/xcpgs/art/2020/art_f38cd9d91e6a431bb8c07677c3f3db42.html',
 'https://www.miit.gov.cn//datainfo/cpgg/xcpgs/art/2020/art_f8aa6c4543564f7890c7506c7a55ac53.html']
```
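The slice item[2:-2] works because, as the output above shows, each link comes back wrapped in literal \" characters. A sketch of an equivalent, slightly more robust cleanup that strips those characters instead of slicing by position (the sample link is copied from the output above):

```python
# sample raw link as returned by r.html.links
raw = '\\"/datainfo/cpgg/xcpgs/art/2020/art_0665c69d3bc34a66ad7b30bfc98922cb.html\\"'
path = raw.strip('\\"')                   # drop the wrapping \" on both ends
full = "https://www.miit.gov.cn" + path   # no trailing slash: avoids the "//" seen above
print(full)
```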
- Now the detail-page URL of every vehicle is in url_list, and the next step is to extract the information from each page. Take 外形尺寸 (overall dimensions) as an example: right-click -> Inspect, select the element holding the dimensions, then right-click -> Copy -> Copy full XPath, and use etree to extract the content.
full XPath: /html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[1]/td
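With a full XPath in hand, lxml.etree can pull the text out of the fetched HTML. A minimal sketch against made-up markup that mimics one row of the dimensions table (the real pages nest much deeper, as the path above shows):

```python
from lxml import etree

# made-up markup standing in for one row of the dimensions table
html = etree.HTML(
    "<html><body><table><tr><td>9495x2420x3680</td></tr></table></body></html>"
)
# absolute "Copy full XPath"-style query against this sample markup;
# text() returns a list of matching text nodes (empty when nothing matches)
size = html.xpath("/html/body/table/tr[1]/td/text()")
print(size)
```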

The core workflow is complete at this point; what remains is routine Python processing. For a deeper look at lxml.etree, see its official documentation.
-
The saved table:
[screenshot: image.png]
-
Full code
```python
from requests_html import HTMLSession
import csv
from lxml import etree

# ---- collect the detail-page URL of every vehicle ----
session = HTMLSession()
url_list = []   # all detail-page URLs
pageNo = 123    # total number of listing pages
for num in range(1, pageNo + 1):   # +1: range() excludes the stop value
    url = "https://www.miit.gov.cn/api-gateway/jpaas-publish-server/front/page/build/unit?webId=b3eba6883f9240e2b51025f690afbae8&pageId=02621da61f2548b2ab3ca69e348d72bc&parseType=bulidstatic&pageType=column&tagId=%E4%BF%A1%E6%81%AF%E5%88%97%E8%A1%A8&tplSetId=9a9a7b87a4444169bdef99ff1f84e1aa&unitUrl=%2Fapi-gateway%2Fjpaas-publish-server%2Ffront%2Fpage%2Fbuild%2Funit&paramJson=%7B%22pageNo%22%3A{}%2C%22pageSize%22%3A20%2C%22loadEnabled%22%3Atrue%2C%22search%22%3A%22%7B%5C%22title%5C%22%3A%5C%22%5C%22%2C%5C%22PICI%5C%22%3A%5C%22338%5C%22%2C%5C%22QYMC%5C%22%3A%5C%22%5C%22%2C%5C%22CPSB%5C%22%3A%5C%22%5C%22%2C%5C%22CPMC%5C%22%3A%5C%22%5C%22%2C%5C%22CPXH%5C%22%3A%5C%22%5C%22%7D%22%7D".format(num)
    r = session.get(url)
    for item in r.html.links:
        # each extracted link is wrapped in literal \" characters, hence [2:-2]
        url_list.append("https://www.miit.gov.cn/" + item[2:-2])

# CSV header, one entry per extracted field
names = ["产品商标","产品型号","产品名称","企业名称","外形尺寸(mm)","货箱栏板内尺寸(mm)",
         "排放依据标准","燃料种类","最高车速(km/h)","整备质量(kg)","准拖挂车总质量(kg)",
         "轴距(mm)","其他","发动机型号","发动机企业","排量(ml)","功率(kw)","油耗(L/100km)",
         "底盘型号","总质量","额定载质量","轴数","轮胎数","轮胎规格","板簧片数",
         "半挂鞍座最大允许承载质量","驾驶室准乘人数","接近角/离去角","防抱死制动系统",
         "车辆识别代号(VIN)","前悬/后悬","底盘生产企业"]
# kept from the original listing; not used below
b = ["最高车速","总质量","额定载质量","轴数","轮胎数","轮胎规格","板簧片数","半挂鞍座最大允许承载质量",
    "驾驶室准乘人数","接近角/离去角","防抱死制动系统","车辆识别代号(VIN)","前悬/后悬","底盘生产企业"]

# full XPath of each field on a detail page, in the same order as names
item_list = [
    # 产品商标 -- the original listing had .../div/table[1]/tbody/tr[1]/td[0]/...;
    # XPath indexes from 1, so td[0] never matches, and the field always came
    # back blank. The path below follows the pattern of the other fields.
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[1]/tbody/tr[1]/td[1]/text()",
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[1]/tbody/tr[1]/td[2]/text()",  # 产品型号
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[1]/tbody/tr[1]/td[3]/text()",  # 产品名称
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]/text()",  # 企业名称
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[1]/td/text()",     # 外形尺寸(mm)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[2]/td/text()",     # 货箱栏板内尺寸(mm)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[3]/td[1]/text()",  # 排放依据标准
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[3]/td[2]/text()",  # 燃料种类
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[4]/td[1]/text()",  # 最高车速(km/h)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[6]/td[2]/text()",  # 整备质量(kg)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[7]/td[2]/text()",  # 准拖挂车总质量(kg)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[8]/td[1]/text()",  # 轴距(mm)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[16]/td/text()",    # 其他
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[4]/tbody/tr[2]/td[1]/text()",  # 发动机型号
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[4]/tbody/tr[2]/td[2]/text()",  # 发动机企业
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[4]/tbody/tr[2]/td[3]/text()",  # 排量(ml)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[4]/tbody/tr[2]/td[4]/text()",  # 功率(kw)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[4]/tbody/tr[2]/td[5]/text()",  # 油耗(L/100km)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[3]/tbody/tr[2]/td[3]/text()",  # 底盘型号
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[4]/td[2]/text()",  # 总质量
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[5]/td[2]/text()",  # 额定载质量
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[7]/td[1]/text()",  # 轴数
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[10]/td[1]/text()", # 轮胎数
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[8]/td[2]/text()",  # 轮胎规格
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[9]/td[1]/text()",  # 板簧片数
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[9]/td[2]/text()",  # 半挂鞍座最大允许承载质量
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[10]/td[2]/text()", # 驾驶室准乘人数
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[12]/td[2]/text()", # 接近角/离去角
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[14]/td[2]/text()", # 防抱死制动系统
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[15]/td[1]/text()", # 车辆识别代号(VIN)
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[2]/tbody/tr[15]/td[2]/text()", # 前悬/后悬
    "/html/body/div[1]/div[3]/div[2]/div/div[2]/table[3]/tbody/tr[2]/td[4]/text()"   # 底盘生产企业
]

# ---- scrape every detail page ----
result = []    # one row per vehicle
url_num = 1
for item in url_list:
    # if url_num > 100:   # uncomment to limit a test run
    #     break
    print("current url:", item)
    print("times: {}/{}".format(url_num, len(url_list)))
    response_new = session.get(item)
    html_new = etree.HTML(response_new.content.decode("utf-8"))
    rawdata = []   # field values of the current vehicle
    for xp in item_list:
        tag_str = html_new.xpath(xp)
        rawdata.append(tag_str[0] if tag_str else " ")  # blank cell when the field is missing
    result.append(rawdata)
    url_num += 1

# ---- save the data ----
# newline="" avoids blank lines on Windows; utf-8 keeps the Chinese headers intact
with open("./mydata.csv", "a+", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(names)      # write the header row first
    writer.writerows(result)    # writerows writes all data rows at once
```
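One caveat with the save step: since the file is opened in "a+" mode, every run of the scraper appends another header row. A sketch of a guard that writes the header only when the file is new or empty (the helper name append_rows and the sample rows are made up for illustration; the demo writes to a temporary file):

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    # write the header only when the file does not exist yet or is empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a+", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        if write_header:
            w.writerow(header)
        w.writerows(rows)

# two consecutive "runs" against the same temporary file
path = os.path.join(tempfile.mkdtemp(), "mydata.csv")
append_rows(path, ["产品商标", "产品型号"], [["ABC", "X-1"]])
append_rows(path, ["产品商标", "产品型号"], [["DEF", "Y-2"]])
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # the header appears once, followed by the two data rows
```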