A Python-Based Crawler Script
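I recently took over crawler development at work, and the language we use is Python. Before writing crawlers for specific sites, especially sites that require a login, I need to probe each site's request flow, so I decided to write a script that serves as a tool for exploring that flow.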
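Preparation:
Install the requests library (for example with pip install requests). The re module is part of the Python standard library and needs no installation. The code calls raw_input, so it targets Python 2; on Python 3, use input instead.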
Overview:
An HTTP request is built from a few key elements: method, headers, data, cookies, and allow_redirects:
1. method is either GET or POST;
2. headers are the HTTP request headers, for example:
X-Requested-With: XMLHttpRequest
Accept: application/json, text/javascript, */*; q=0.01
Referer: https://xxx.cn/html/login/login.html
Accept-Language: zh-CN
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Length: 0
Connection: Keep-Alive
3. data is the body sent with a POST, for example:
userName=34567890
4. cookies are the cookies sent with the request, for example:
Cookie: CaptchaCode=abcde; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4
5. allow_redirects controls whether requests automatically follows the redirect indicated by the HTTP response.
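To make the mapping concrete, here is a minimal sketch of the five elements written as the Python values that requests accepts as keyword arguments. The url is a placeholder I made up (the post only shows a Referer on xxx.cn), and nothing is actually sent: the final call is left as a comment.

```python
# The five elements as plain Python values. The URL is a hypothetical
# placeholder; the other values are the examples listed above.
request_kwargs = {
    "method": "POST",
    "url": "https://xxx.cn/login",  # assumption: not given in the post
    "headers": {
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": "https://xxx.cn/html/login/login.html",
        "Accept-Language": "zh-CN",
    },
    "data": {"userName": "34567890"},
    "cookies": {
        "CaptchaCode": "abcde",
        "rdmdmd5": "3CD2F62D7935C4BFB24495821462D153",
        "lgToken": "1e364d3d891846bd9cd65f2550cd62a4",
    },
    "allow_redirects": True,
}

# With requests installed, the dictionary could be passed straight through:
#   import requests
#   rsp = requests.request(**request_kwargs)
for name in sorted(request_kwargs):
    print("%s -> %r" % (name, request_kwargs[name]))
```

Headers, data, and cookies are each a plain dict, which is why the rest of the script spends most of its effort turning pasted text into dicts.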
With these basics covered, the script builds on them step by step:
1. First, a function that sets the method:
def get_command():
    command = ""
    while True:
        command = raw_input("Please input command[g/p/q], g as get, p as post, q as quit: ")
        if command not in ["g", "p", "q"]:
            print("command could be g, p or q")
            continue
        else:
            break
    return command
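A prompt loop like this is hard to test because it reads from the keyboard. One possible variation, sketched here under my own assumptions, passes the prompt function in as a parameter so that a fake one can be supplied. The names read_command and VALID_COMMANDS are hypothetical, and unlike the original this version also accepts uppercase input and surrounding whitespace.

```python
VALID_COMMANDS = ("g", "p", "q")


def read_command(ask):
    """Call ask(prompt) until it returns g, p or q, then return that command.

    ask is any callable that takes a prompt string and returns the user's
    answer, for example raw_input on Python 2 or input on Python 3.
    """
    prompt = "Please input command[g/p/q], g as get, p as post, q as quit: "
    while True:
        command = ask(prompt).strip().lower()
        if command in VALID_COMMANDS:
            return command
        print("command could be g, p or q")


# Simulated user: two invalid answers, then an uppercase "P".
answers = iter(["x", "", "P"])
print(read_command(lambda prompt: next(answers)))
```

With the simulated answers, the hint is printed twice before the function returns p.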
2. A function that sets the headers. The input format is Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5|Accept-Language: zh-CN|User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko, with "|" separating the key: value pairs:
import re

# Default headers
HEADERS = {
    "Accept": "text/html, application/xhtml+xml, */*",
    "Accept-Language": "zh-CN",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "Close"
}

def get_headers():
    # Start from a copy so the module-level defaults are never modified
    headers = dict(HEADERS)
    header_text = raw_input("Please input headers:")
    header_text = header_text.strip()
    if len(header_text) > 0:
        key_value_list = header_text.split('|')
        headers = {}
        for key_value in key_value_list:
            # Split on the first colon only, so values such as URLs stay intact
            items = re.split(r':+', key_value, maxsplit=1)
            if len(items) == 2:
                headers[items[0].strip()] = items[1].strip()
    headers["Connection"] = "close"
    return headers
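The parsing itself can be separated from the prompt so that it is easy to check in isolation. The following is a sketch under my own assumptions: parse_headers is a hypothetical name, and it uses str.partition in place of re.split, which also cuts at the first colon only.

```python
def parse_headers(header_text, defaults=None):
    """Parse 'Name: value|Name: value' into a dict.

    Returns a copy of defaults when the text is empty.
    """
    header_text = header_text.strip()
    if not header_text:
        return dict(defaults or {})
    headers = {}
    for pair in header_text.split("|"):
        # partition cuts at the first colon, so "https://..." survives
        name, sep, value = pair.partition(":")
        if sep and name.strip():
            headers[name.strip()] = value.strip()
    return headers


text = "Referer: https://xxx.cn/html/login/login.html|Accept-Language: zh-CN"
print(parse_headers(text))
print(parse_headers("", defaults={"Accept-Language": "zh-CN"}))
```

Pairs without a colon are skipped silently, which matches the len(items) == 2 check in get_headers.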
3. A function that sets the data. The input format is a=b&c=d, with "&" separating the key=value pairs:
def get_data():
    data = {}
    data_text = raw_input("Please input data:")
    data_text = data_text.strip()
    if len(data_text) > 0:
        key_value_list = data_text.split('&')
        for key_value in key_value_list:
            # Split on the first "=" only, so values that contain "=" are kept
            items = key_value.split('=', 1)
            if len(items) == 2:
                data[items[0]] = items[1]
    return data
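Form data copied from a browser capture is often percent-encoded, and the manual split keeps it encoded. As an alternative sketch, the standard library's parse_qsl decodes it while splitting; parse_form_data is a hypothetical name, and I am assuming the pasted text is a URL-encoded form body.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl  # Python 2


def parse_form_data(data_text):
    """Parse 'a=b&c=d' into a dict, decoding percent-escapes and '+'.

    keep_blank_values=True keeps fields such as 'remember=' with an
    empty value. If a key repeats, the last value wins.
    """
    return dict(parse_qsl(data_text.strip(), keep_blank_values=True))


print(parse_form_data("a=b&c=d"))
print(parse_form_data("userName=34567890&token=YWJj%3D%3D&remember="))
```

Because requests encodes a dict passed as data again when sending, decoding at this point avoids double-encoding values such as %3D.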
4. A function that sets the cookies. The input format is, for example, CaptchaCode=bacd; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4, with ";" separating the key=value pairs:
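def get_cookies():
    data = {}
    data_text = raw_input("Please input cookies:")
    data_text = data_text.strip()
    if len(data_text) > 0:
        key_value_list = data_text.split(';')
        for key_value in key_value_list:
            # Split on the first "=" only; cookie values may contain "="
            items = key_value.split('=', 1)
            if len(items) == 2:
                # Strip the space that follows each ";" so names match exactly
                data[items[0].strip()] = items[1].strip()
    return data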
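The standard library can also parse a cookie string. The sketch below uses SimpleCookie as an alternative to the manual split and additionally tolerates a pasted "Cookie:" prefix; parse_cookie_header is a hypothetical name. SimpleCookie is stricter than a plain split and may skip malformed entries, so treat this as one option, not a drop-in replacement.

```python
try:
    from http.cookies import SimpleCookie  # Python 3
except ImportError:
    from Cookie import SimpleCookie  # Python 2


def parse_cookie_header(cookie_text):
    """Turn 'a=1; b=2' (optionally prefixed with 'Cookie:') into a dict."""
    cookie_text = cookie_text.strip()
    if cookie_text.lower().startswith("cookie:"):
        cookie_text = cookie_text[len("cookie:"):].strip()
    jar = SimpleCookie()
    jar.load(cookie_text)
    # Each entry is a Morsel; .value holds the cookie value as a string
    return {name: morsel.value for name, morsel in jar.items()}


line = ("Cookie: CaptchaCode=abcde; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; "
        "lgToken=1e364d3d891846bd9cd65f2550cd62a4")
print(parse_cookie_header(line))
```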
5. Send the request and get the response:
def get_response(session, command, headers, data, cookies, allow_redirects):
    rsp = None
    commands = {"g": "GET", "p": "POST"}
    while True:
        url = raw_input("Please input url: ")
        print("request headers: %s" % (headers,))
        try:
            # Pass cookies only when the user supplied some
            if len(cookies) > 0:
                rsp = session.request(method=commands[command], url=url, data=data, headers=headers, cookies=cookies, allow_redirects=allow_redirects)
            else:
                rsp = session.request(method=commands[command], url=url, data=data, headers=headers, allow_redirects=allow_redirects)
            break
        except Exception as e:
            # e.message is missing on many exceptions, so print the exception itself
            print(e)
            next_step = raw_input("Retry [y/n]: ")
            if next_step in ["y", "Y"]:
                continue
            else:
                break
    return rsp
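The two session.request calls differ only in whether cookies is passed. One way to avoid the duplication, sketched here with hypothetical names (build_request_kwargs, METHODS), is to assemble the keyword arguments once and add cookies only when there are any.

```python
METHODS = {"g": "GET", "p": "POST"}


def build_request_kwargs(command, url, headers, data, cookies, allow_redirects=True):
    """Assemble the keyword arguments for session.request in one place."""
    kwargs = {
        "method": METHODS[command],
        "url": url,
        "headers": headers,
        "data": data,
        "allow_redirects": allow_redirects,
    }
    # Leave the key out entirely when no cookies were entered, so the
    # session's own cookie jar is the only source of cookies.
    if cookies:
        kwargs["cookies"] = cookies
    return kwargs


kwargs = build_request_kwargs("p", "https://xxx.cn/login", {}, {"userName": "34567890"}, {})
print(sorted(kwargs))
# Inside get_response this could then be used as:
#   rsp = session.request(**kwargs)
```

The URL in the example is a placeholder, as in the rest of the post.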
6. Run the script:
from requests import Session

def run():
    print("start crawler")
    session = Session()
    # keep_alive is not a documented Session attribute; the "Connection: close"
    # header set in get_headers is what asks the server to close the connection
    session.keep_alive = False
    refer = ""
    while True:
        command = get_command()
        if command == "q":
            break
        headers = get_headers()
        data = get_data()
        #allow_redirects = is_allow_redirects()
        cookies = get_cookies()
        rsp = get_response(session=session, command=command, headers=headers, data=data, cookies=cookies, allow_redirects=True)
        # If there is no response, ask whether to continue
        if rsp is None:
            is_continue = raw_input("response is none, continue[y/n]:")
            if is_continue in ["y", "Y"]:
                continue
            else:
                break
        print("--- request cookie: %s" % (session.cookies.__dict__,))
        # print_cookies is a helper that this post does not list
        print_cookies(session.cookies)
        print("--- response status code: %s" % (rsp.status_code,))
        print("--- response headers: %s" % (rsp.headers,))
        print("--- response cookie: %s" % (rsp.cookies.__dict__,))
        print_cookies(rsp.cookies)
        if (rsp.history is not None) and (len(rsp.history) >= 1):
            print("--- response history: ")
            for item in rsp.history:
                print("redirect location: %s" % (item.headers['Location'],))
                print("redirect cookie: %s" % (item.cookies.__dict__,))
                # Carry cookies set during redirects over to the session
                for cookie in item.cookies:
                    session.cookies.set(cookie.name, cookie.value)
        print("--- response html: %s" % (rsp.text,))
        # Save the body if the response is an image; ignore parameters such as charset
        content_type = rsp.headers.get('Content-Type', '').split(';')[0].strip()
        if content_type in ['image/jpeg', 'image/png']:
            # 'wb' overwrites the previous image; appending would corrupt the file
            with open('image.jpg', 'wb') as f:
                f.write(rsp.content)
    session.close()
    print("quit crawler")

if __name__ == "__main__":
    run()
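run() calls print_cookies, but its definition is not included in the post. The following is only a guess at what such a helper might look like: it assumes the argument is iterable and yields cookie objects with name, value, domain and path attributes, as the cookie jars used above do. format_cookies is a hypothetical name introduced so the formatting can be checked separately from the printing.

```python
def format_cookies(cookie_jar):
    """Return one 'name=value; domain=...; path=...' line per cookie."""
    lines = []
    for cookie in cookie_jar:
        lines.append("%s=%s; domain=%s; path=%s"
                     % (cookie.name, cookie.value, cookie.domain, cookie.path))
    return lines


def print_cookies(cookie_jar):
    """Print every cookie in the jar, one per line."""
    for line in format_cookies(cookie_jar):
        print(line)
```

This would need to be defined above run() in the script. Printing the domain and path next to each value helps when a login flow sets cookies with the same name on different paths.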