A Python-Based Crawler Script

2018-09-13  jerrylee529

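I recently took over crawler development at work, with Python as the development language. To probe the request flow of certain websites, especially ones that require login, before writing crawlers for them, I decided to write a script that serves as a tool for exploring the crawl flow.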

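Preparation:

Install the requests library, for example with pip install requests. The re module ships with the Python standard library and does not need to be installed.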

Overview:

Sending an HTTP request involves roughly five key elements: method, headers, data, cookies, and allow_redirects, as follows:

1. method is either get or post;

2. headers are the HTTP request headers, for example:

X-Requested-With: XMLHttpRequest
Accept: application/json, text/javascript, */*; q=0.01
Referer: https://xxx.cn/html/login/login.html
Accept-Language: zh-CN
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Length: 0
Connection: Keep-Alive

3. data is the body of a POST request, for example:

userName=34567890

4. cookies are the cookies sent with the request, for example:

Cookie: CaptchaCode=abcde; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4

5. allow_redirects controls whether requests automatically follows a redirect returned in the HTTP response.
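To make the mapping concrete, here is a minimal sketch of how the five elements could be gathered into keyword arguments. The helper name build_request_kwargs is hypothetical, and using the login page URL from the Referer example as the request URL is an assumption; the keys are the parameter names that requests.Session.request accepts.

```python
def build_request_kwargs(method, url, headers=None, data=None,
                         cookies=None, allow_redirects=True):
    """Collect the five request elements into keyword arguments.

    The keys match parameter names of requests.Session.request;
    empty headers, data and cookies are simply left out.
    """
    kwargs = {
        "method": method.upper(),
        "url": url,
        "allow_redirects": allow_redirects,
    }
    if headers:
        kwargs["headers"] = dict(headers)
    if data:
        kwargs["data"] = dict(data)
    if cookies:
        kwargs["cookies"] = dict(cookies)
    return kwargs


# Hypothetical login request built from the example values above.
kwargs = build_request_kwargs(
    "post",
    "https://xxx.cn/html/login/login.html",  # assumed target URL
    headers={"X-Requested-With": "XMLHttpRequest", "Accept-Language": "zh-CN"},
    data={"userName": "34567890"},
    cookies={"CaptchaCode": "abcde"},
    allow_redirects=False,
)
print(kwargs["method"], kwargs["url"])
# With requests installed, the request could then be sent as:
#     rsp = requests.Session().request(**kwargs)
```

Keeping the elements in one dict makes it easy to print exactly what is about to be sent before any network traffic happens.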

With the basic concepts covered, the script is built up from these elements one by one:

1. First, a function to choose the method:


def get_command():
    """Prompt until the user enters g (GET), p (POST) or q (quit)."""
    # Python 3: input() replaces Python 2's raw_input().
    while True:
        command = input("Please input command[g/p/q], g as get, p as post, q as quit: ").strip()
        if command in ("g", "p", "q"):
            return command
        print("command could be g, p or q")
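Because the prompt loop reads from the keyboard, it is awkward to try out in isolation. The sketch below is one possible variant, not the script's own function: the name read_command and the injected read_line callable are assumptions introduced so that the loop can be driven by canned answers.

```python
def read_command(read_line, valid=("g", "p", "q")):
    """Call read_line() until it returns one of the valid commands."""
    while True:
        command = read_line().strip()
        if command in valid:
            return command
        print("command could be g, p or q")


# Simulate a user who mistypes twice before entering "p".
answers = iter(["x", "get", " p "])
method = {"g": "GET", "p": "POST"}.get(read_command(lambda: next(answers)))
print(method)
```

In the interactive script, read_line would simply be a function that wraps the built-in prompt.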

2. A function to set the headers. The input format is Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5|Accept-Language: zh-CN|User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko, with "|" separating the individual key: value pairs:


import re

# Default headers, used when the user enters nothing
HEADERS = {
    "Accept": "text/html, application/xhtml+xml, */*",
    "Accept-Language": "zh-CN",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "Close"
}

def get_headers():
    # Copy the defaults so the module-level HEADERS dict is never modified.
    headers = dict(HEADERS)
    # Python 3: input() replaces Python 2's raw_input().
    header_text = input("Please input headers:").strip()
    if len(header_text) > 0:
        headers = {}
        for key_value in header_text.split('|'):
            # Split on the first run of colons only, so values such as URLs keep theirs.
            items = re.split(r':+', key_value, maxsplit=1)
            if len(items) == 2:
                headers[items[0].strip()] = items[1].strip()
    # Always ask the server to close the connection after the response.
    headers["Connection"] = "close"
    return headers
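The parsing step can also be separated from the prompt so that it is easy to check on its own. The following is a sketch under that assumption: parse_headers is a hypothetical helper, and it swaps the regular expression for str.partition, which likewise splits on the first colon only.

```python
def parse_headers(header_text):
    """Parse 'Key: value|Key: value' into a dict of header names to values."""
    headers = {}
    for key_value in header_text.strip().split("|"):
        # partition splits on the first colon only, so URLs keep theirs.
        name, sep, value = key_value.partition(":")
        if sep and name.strip():
            headers[name.strip()] = value.strip()
    return headers


parsed = parse_headers(
    "Referer: https://xxx.cn/html/login/login.html|Accept-Language: zh-CN"
)
print(parsed)
```

The Referer value shows why the split must stop at the first colon: a plain split on every ":" would cut the URL in two and the header would be dropped.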

3. A function to set the data. The input format is a=b&c=d, with "&" separating the individual key=value pairs:


def get_data():
    data = {}
    # Python 3: input() replaces Python 2's raw_input().
    data_text = input("Please input data:").strip()
    if len(data_text) > 0:
        for key_value in data_text.split('&'):
            # Split on the first "=" only, so values containing "=" are kept.
            items = key_value.split('=', 1)
            if len(items) == 2:
                data[items[0]] = items[1]
    return data
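As an alternative to splitting by hand, the standard library can parse the same a=b&c=d format. This is a sketch rather than the script's own approach; parse_form_data is a hypothetical name. Note that urllib.parse.parse_qsl also decodes percent-escapes and treats "+" as a space, which is appropriate when the pasted text was captured in its URL-encoded form.

```python
from urllib.parse import parse_qsl


def parse_form_data(data_text):
    """Parse 'a=b&c=d' into a dict, keeping blank values and decoding escapes."""
    return dict(parse_qsl(data_text.strip(), keep_blank_values=True))


form = parse_form_data("userName=34567890&token=abc%3D%3D&remember=")
print(form)
```

Passing the resulting dict as data lets requests encode the values again when the request is sent.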

4. A function to set the cookies. The input format is, for example, CaptchaCode=bacd; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4, with ";" separating the individual key=value pairs:


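def get_cookies():
    data = {}
    # Python 3: input() replaces Python 2's raw_input().
    data_text = input("Please input cookies:").strip()
    if len(data_text) > 0:
        for key_value in data_text.split(';'):
            # Split on the first "=" only, so values containing "=" are kept.
            items = key_value.split('=', 1)
            if len(items) == 2:
                # Strip the space that follows each ";" so names have no leading blank.
                data[items[0].strip()] = items[1].strip()
    return data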
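The same cookie string can also be parsed with the standard library's http.cookies module. The sketch below assumes a well-formed Cookie header value; parse_cookie_header is a hypothetical helper, and SimpleCookie may skip entries it cannot parse, so the hand-written split remains a reasonable choice for unusual values.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(cookie_text):
    """Parse 'k1=v1; k2=v2' into a plain dict of cookie names to values."""
    jar = SimpleCookie()
    jar.load(cookie_text)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = parse_cookie_header(
    "CaptchaCode=bacd; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; "
    "lgToken=1e364d3d891846bd9cd65f2550cd62a4"
)
print(sorted(cookies))
```

The resulting plain dict has the shape that the cookies argument of a requests call accepts.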

5. Send the request and get the response:


def get_response(session, command, headers, data, cookies, allow_redirects):
    commands = {"g": "GET", "p": "POST"}
    rsp = None
    while True:
        # Python 3: input() replaces Python 2's raw_input().
        url = input("Please input url: ").strip()
        print("request headers: ", headers)
        try:
            # An empty cookie dict is passed as None, which matches omitting the argument.
            rsp = session.request(method=commands[command], url=url, data=data, headers=headers,
                                  cookies=cookies or None, allow_redirects=allow_redirects)
            break
        except Exception as e:
            # Python 3 exceptions have no .message attribute; print the exception itself.
            print(e)
            next_step = input("Retry [y/n]: ")
            if next_step not in ["y", "Y"]:
                break
    return rsp
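The interactive retry prompt suits manual exploration. For unattended runs, a bounded automatic retry is a common substitute; the sketch below shows that different technique, not the script's behaviour. send_with_retry and flaky_send are hypothetical names, and the stand-in sender avoids any network access.

```python
def send_with_retry(send, max_attempts=3):
    """Call send() until it succeeds or max_attempts is reached.

    Returns the result of send(), or None if every attempt raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as e:
            print("attempt %d failed: %s" % (attempt, e))
    return None


# Stand-in sender that fails twice, then succeeds (no network involved).
calls = {"count": 0}


def flaky_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("connection reset")
    return "response #%d" % calls["count"]


result = send_with_retry(flaky_send)
print(result)
```

In the crawler, send could be a lambda that wraps the session.request call with the collected arguments.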

6. Run the script:


from requests import Session

def print_cookies(cookies):
    # Not defined in the original post; assumed here to list each cookie in a jar.
    for cookie in cookies:
        print("    %s=%s (domain=%s, path=%s)" % (cookie.name, cookie.value, cookie.domain, cookie.path))

def run():
    print("start crawler")
    session = Session()
    # Kept from the original; keep_alive is not a documented requests setting.
    # The "Connection: close" header set in get_headers() requests the same thing.
    session.keep_alive = False
    while True:
        command = get_command()
        if command == "q":
            break
        headers = get_headers()
        data = get_data()
        #allow_redirects = is_allow_redirects()
        cookies = get_cookies()
        rsp = get_response(session=session, command=command, headers=headers, data=data, cookies=cookies, allow_redirects=True)
        # If there is no response, ask whether to continue
        if rsp is None:
            # Python 3: input() replaces Python 2's raw_input().
            is_continue = input("response is none, continue[y/n]:")
            if is_continue in ["y", "Y"]:
                continue
            else:
                break
        print("--- request cookie: ", session.cookies.__dict__)
        print_cookies(session.cookies)
        print("--- response status code: ", rsp.status_code)
        print("--- response headers: ", rsp.headers)
        print("--- response cookie: ", rsp.cookies.__dict__)
        print_cookies(rsp.cookies)
        if rsp.history:
            print("--- response history: ")
            for item in rsp.history:
                print("redirect location: ", item.headers.get('Location'))
                print("redirect cookie: ", item.cookies.__dict__)
                # Carry cookies set by intermediate redirects into the session
                for cookie in item.cookies:
                    session.cookies.set(cookie.name, cookie.value)
        print("--- response html: ", rsp.text)
        # If the response is an image, save it
        content_type = rsp.headers.get('Content-Type', '')
        if content_type.split(';')[0].strip() in ['image/jpeg', 'image/png']:
            # 'wb' overwrites the file; the original 'ab' appended each image to the last.
            with open('image.jpg', 'wb') as f:
                f.write(rsp.content)
    session.close()
    print("quit crawler")

if __name__ == "__main__":
    run()
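The script saves every image as image.jpg, even when the server returns a PNG. One possible refinement, sketched here with hypothetical names (IMAGE_EXTENSIONS, image_filename), is to derive the file name from the Content-Type header and to ignore any parameters that follow the media type.

```python
IMAGE_EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png"}


def image_filename(content_type, stem="image"):
    """Return a file name for an image Content-Type, or None if not an image."""
    # Drop parameters such as "; charset=..." before looking up the type.
    media_type = content_type.split(";")[0].strip().lower()
    extension = IMAGE_EXTENSIONS.get(media_type)
    return stem + extension if extension else None


print(image_filename("image/png"))
print(image_filename("image/jpeg; charset=binary"))
print(image_filename("text/html; charset=utf-8"))
```

A caller would write the response body only when the returned name is not None.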

Result of a run:

[image]