Scraping Lagou Job Detail Pages with Python

2020-04-24  gooddaytoyou

Preparation

Sample page to analyze

https://www.lagou.com/jobs/7029660.html?source=position_rec&i=position_rec-1

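Goal

Fields to extract

Implementation approach

Fetch the page with requests, and remember to send headers so the request looks like it comes from a browser.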

import requests

headers_data = {
    'Host': 'www.lagou.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # Accept-Encoding selects the compression algorithm; it is left commented out
    # because a compressed body is hard to inspect and has to be decompressed first
    # 'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cookie': 'pgv_pvi=7044633600; tencentSig=6792114176; IESESSION=alive; pgv_si=s3489918976; CNZZDATA4617777=cnzz_eid%3D768417915-1468987955-%26ntime%3D1470191347; _qdda=3-1.1; _qddab=3-dyl6uh.ireawgo0; _qddamta_800068868=3-0'
}
response = requests.get('https://www.lagou.com/jobs/7029660.html?source=position_rec&i=position_rec-1', headers=headers_data)
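The request above sets no timeout and does not check the response status, so a blocked request could hand an error page to the parser without any warning. A minimal sketch of a more defensive fetch is given here; the name `fetch_html`, the 10-second timeout, and the optional `session` parameter are illustrative assumptions and not part of the original script.

```python
import requests


def fetch_html(url, headers, session=None, timeout=10):
    """Fetch a page and return its decoded HTML, raising on HTTP errors.

    `session` is any object with a requests-style `get` method
    (hypothetical parameter); a fresh requests.Session is used by default.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, `resHtml = fetch_html(url, headers_data)` could stand in for the plain `requests.get` call, and passing one shared session would let cookies carry over between requests.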

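From the response we take the page text and pass it to etree.HTML, which parses the raw markup into an element tree. The code is as follows:

from lxml import etree

resHtml = response.text
html = etree.HTML(resHtml)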
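Keep in mind that `xpath()` returns a list, so indexing with `[0]` raises an IndexError whenever an expression matches nothing, for example if the page layout changes. Here is a small sketch of a guard; `first_text` and its `default` parameter are hypothetical names introduced only for illustration.

```python
from lxml import etree


def first_text(node, path, default=''):
    """Return the stripped text of the first element matching `path`.

    Falls back to `default` when nothing matches or the element has no text.
    """
    matches = node.xpath(path)
    if not matches or matches[0].text is None:
        return default
    return matches[0].text.strip()


sample = etree.HTML('<div class="job-name"><h1 class="name"> python开发工程师 </h1></div>')
print(first_text(sample, '//h1[@class="name"]'))     # python开发工程师
print(first_text(sample, '//h1[@class="missing"]'))  # prints an empty line
```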

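With the tree in hand, we can parse the data with XPath. Consult an XPath syntax reference for the details of the expressions.

First, we inspect the page source to locate the elements we need.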

Header information
 <div class="position-content-l">
            <div class="job-name" title="python开发工程师">
                                <h4 class="company">云智慧开发部招聘</h4>
                                <h1 class="name">python开发工程师</h1>
                                <div class="marEdit">
                                    </div>
            </div>
            <dd class="job_request">
                <h3>
                    <span class="salary">15k-30k </span>
                    <span>/北京 /</span>
                    <span>经验3-5年 /</span>
                    <span>本科及以上 /</span>
                    <span>全职</span>
                </h3>
                <!-- 职位标签 -->
                <ul class="position-label clearfix">
                                        <li class="labels">企业服务</li>
                                        <li class="labels">大数据</li>
                                    </ul>
                <p class="publish_time">15:20&nbsp; 发布于拉勾网</p>
            </dd>
        </div>
        

Python source


position_content = html.xpath('//*[@class="position-content-l"]')

job_name = position_content[0].xpath('.//*[@class="job-name"]')

# Children of job-name: <h4 class="company">, <h1 class="name">, <div class="marEdit">
print(job_name[0][0].text)  # company / department
print(job_name[0][1].text)  # job title
# job_name[0][2] is the empty marEdit div, so it has nothing worth printing

# The <h3> under job_request holds five <span>s:
# salary, city, experience, education, work type
job_request = html.xpath('//dd[@class="job_request"]')[0]
for span in job_request[0]:
    # Drop the surrounding whitespace and the "/" separators
    print(span.text.strip(' /\n'))

# Each <li> under position-label is one job tag
job_tag = html.xpath('//*[@class="position-label clearfix"]')[0]
for tag in job_tag:
    print(tag.text)
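Positional indexing such as `[0][0][1]` depends on the exact order of child nodes. As an alternative sketch, the same header fields can be selected by tag and class and collected into a dict. The markup below is a trimmed copy of the sample above, and `parse_header` and its key names are assumptions made for illustration.

```python
from lxml import etree

SAMPLE = '''
<div class="position-content-l">
  <div class="job-name" title="python开发工程师">
    <h4 class="company">云智慧开发部招聘</h4>
    <h1 class="name">python开发工程师</h1>
  </div>
  <dd class="job_request">
    <h3>
      <span class="salary">15k-30k </span>
      <span>/北京 /</span>
      <span>经验3-5年 /</span>
      <span>本科及以上 /</span>
      <span>全职</span>
    </h3>
    <ul class="position-label clearfix">
      <li class="labels">企业服务</li>
      <li class="labels">大数据</li>
    </ul>
  </dd>
</div>
'''


def parse_header(html):
    """Collect the header fields into a dict (key names are illustrative)."""
    spans = html.xpath('//dd[@class="job_request"]/h3/span/text()')
    # Remove whitespace and the "/" separators around each value
    values = [text.strip(' /\n') for text in spans]
    keys = ['salary', 'city', 'experience', 'education', 'work_type']
    header = dict(zip(keys, values))
    header['company'] = html.xpath('string(//h4[@class="company"])').strip()
    header['job'] = html.xpath('string(//h1[@class="name"])').strip()
    header['tags'] = html.xpath('//li[@class="labels"]/text()')
    return header


print(parse_header(etree.HTML(SAMPLE)))
```

The five spans are still mapped by position, since only the salary span carries a class of its own.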


Job details

Page source

 <dd class="job-advantage">
        <span class="advantage">职位诱惑:</span>
        <p>六险一金、期权激励、免费健身房、生日会</p>
    </dd>
    <dd class="job_bt">
        <h3 class="description">职位描述:</h3>
        <div class="job-detail">
        <p>岗位职责:</p>
<p>1、参与团队内部相关产品的设计、研发和优化工作;</p>
<p>2、 处理业务数据的预处理、清洗和转换工作;</p>
<p><br></p>
<p>任职要求:</p>
<p>1. 计算机相关专业,学信网认可的本科及以上学历;扎实的计算机基础,5年及以上工作经验;</p>
<p>2. 精通Python开发,代码风格良好,有自动化运维相关开发经验;</p>
<p>3. 熟练掌握Django、Flask、Tornado等Web框架,研读过框架源代码者优先;</p>
<p>4. 熟练掌握Mysql/MongoDB/Redis或其它大型数据库开发以及相关工具;</p>
<p>5.了解异步框架、集群与负载均衡,消息中间件,容灾备份等技术;</p>
<p>6. 熟悉底层数据存储及服务分层架构设计等技能;</p>
<p>7. 熟悉Linux下开发及相关知识、熟悉Git版本管理;</p>
<p>8. 工作认真,细心,有条理;积极性高,对前沿技术有较强的追求和敏感性;具有较强的沟通能力及团队合作精神。</p>
        </div>
    </dd>
    
  <dd class="job-address clearfix">
                <h3 class="address">工作地址</h3>
        <div class="work_addr">
                                                <a rel="nofollow" href="https://www.lagou.com/jobs/list_?city=北京#filterBox">北京</a> -
                    <a rel="nofollow" href="https://www.lagou.com/jobs/list_?city=北京&district=朝阳区#filterBox">朝阳区</a>
                                            - 霞光里9号中电发展大厦A座16层
                                                            <a rel="nofollow" href="javascript:;" id="mapPreview">查看地图</a>
        </div>
        <div id="miniMap"></div>
                <input type="hidden" name="positionLng" value="116.46005" />
        <input type="hidden" name="positionLat" value="39.961172" />
        <input type="hidden" name="positionAddress" value="霞光里9号中电发展大厦A座16层" />
        <input type="hidden" name="workAddress" value="北京" />
        <div style="display: none;">
            <div id="mapPopup" class="popup">
                <div id="fullMap"></div>
            </div>
        </div>
    </dd>

Python source

# Job perks: the <p> inside job-advantage
job_advantage = html.xpath('//*[@class="job-advantage"]/p')[0].text
print(job_advantage)
print(html.xpath('//*[@class="job_bt"]/h3')[0].text)

# Each <p> in job-detail is one line of the description; skip empty
# paragraphs such as <p><br></p>, whose text would print as None
for sub in html.xpath('//*[@class="job_bt"]/div')[0]:
    text = ''.join(sub.itertext()).strip()
    if text:
        print(text)

# Join the text nodes inside work_addr (city, district, street address),
# leaving out the "查看地图" (view map) link
work_addr = html.xpath('//*[@class="work_addr"]')[0]
parts = work_addr.xpath('./text() | ./a[not(@id="mapPreview")]/text()')
job_address = ' '.join(''.join(parts).split())
print(job_address)
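The job-address block in the page source also carries hidden `<input>` fields with the coordinates and the plain street address, which are easier to read than the visible links. The sketch below extracts them from a trimmed copy of that markup; `parse_location` and its key names are hypothetical.

```python
from lxml import etree

SAMPLE = '''
<dd class="job-address clearfix">
  <input type="hidden" name="positionLng" value="116.46005" />
  <input type="hidden" name="positionLat" value="39.961172" />
  <input type="hidden" name="positionAddress" value="霞光里9号中电发展大厦A座16层" />
  <input type="hidden" name="workAddress" value="北京" />
</dd>
'''


def parse_location(html):
    """Read the hidden <input> fields under job-address into a dict."""
    inputs = html.xpath('//dd[contains(@class, "job-address")]/input[@type="hidden"]')
    fields = {node.get('name'): node.get('value') for node in inputs}

    def as_float(name):
        # Missing or empty values become None rather than raising
        value = fields.get(name)
        return float(value) if value else None

    return {
        'city': fields.get('workAddress', ''),
        'address': fields.get('positionAddress', ''),
        'lng': as_float('positionLng'),
        'lat': as_float('positionLat'),
    }


print(parse_location(etree.HTML(SAMPLE)))
```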

Company information

Page source

  <dt>
        <a href="https://www.lagou.com/gongsi/5232.html" target="_blank" data-lg-tj-track-code="jobs_logo">
        <img class="b2" src="//www.lgstatic.com/thumbnail_160x160/image1/M00/00/0C/Cgo8PFTUWB2AFrjaAABh76C2FX8105.jpg" width="96" height="96" alt="云智慧(北京)科技有限公司" />
        <div class="job_company_content">
            <h3 class="fl">
                <em class="fl-cn">
                                    云智慧
                                </em>
                                    <i class="icon-approve icon-glyph-valid"></i>
                    <span class="dn">拉勾认证企业</span>
                            </h3>
        </div>
        </a>
    </dt>
    <dd>
        <ul class="c_feature">
            <li>
                <i class="icon-glyph-fourSquare"></i> <h4 class="c_feature_name">数据服务</h4>
                <span class="hovertips">领域</span>
            </li>
            <li>
                <i class="icon-glyph-trend"></i> <h4 class="c_feature_name">D轮及以上</h4>
                <span class="hovertips">发展阶段</span>
            </li>
                        <li>
                <i class="finance icon-glyph-doller"></i><p class="financeOrg"><h4 class="c_feature_name">获得2500万美元D轮和D轮+融资 全栈智能化IT解决方案发布(D轮及以上),获得2600万美元C轮融资 入选第九期微软加速器北京(C轮),获得红杉领投1500万美元 B轮融资,端到端APM产品透视宝问世(B轮),建成大规模主动监测网络, 获得红杉领投1230万美元B+轮融资(B轮),获得戈壁A轮数百万 美元融资(A轮)</h4></p>
                <span class="hovertips">投资机构</span>
            </li>
                        <li>
                <i class="icon-glyph-figure"></i> <h4 class="c_feature_name">150-500人</h4>
                <span class="hovertips">规模</span>
            </li>
            <li>
                <i class="icon-glyph-home"></i>
                                <a href="http://www.cloudwise.com" target="_blank" title="http://www.cloudwise.com" rel="nofollow"><h4 class="c_feature_name">http://www.cloudwise.com</h4></a>
                <span class="hovertips">公司主页</span>
                            </li>
        </ul>
    </dd>

Python source

print(html.xpath('//*[@class="job_company_content"]/h3/em')[0].text.strip())
c_feature=html.xpath('//*[@class="c_feature"]')[0]

for fea in c_feature:
    print(fea.xpath('.//*[@class="c_feature_name"]')[0].text)
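The loop above prints the feature values without saying what each one means. Every `<li>` also contains a `hovertips` span naming the feature, so the two can be paired up. This sketch runs on a trimmed copy of the sample markup; `parse_company_features` is a hypothetical helper.

```python
from lxml import etree

SAMPLE = '''
<ul class="c_feature">
  <li><i class="icon-glyph-fourSquare"></i> <h4 class="c_feature_name">数据服务</h4>
      <span class="hovertips">领域</span></li>
  <li><i class="icon-glyph-trend"></i> <h4 class="c_feature_name">D轮及以上</h4>
      <span class="hovertips">发展阶段</span></li>
  <li><i class="icon-glyph-figure"></i> <h4 class="c_feature_name">150-500人</h4>
      <span class="hovertips">规模</span></li>
  <li><i class="icon-glyph-home"></i>
      <a href="http://www.cloudwise.com"><h4 class="c_feature_name">http://www.cloudwise.com</h4></a>
      <span class="hovertips">公司主页</span></li>
</ul>
'''


def parse_company_features(html):
    """Map each hovertips label to its c_feature_name value."""
    features = {}
    for li in html.xpath('//ul[@class="c_feature"]/li'):
        # string(...) yields '' when nothing matches, so no IndexError here
        label = li.xpath('string(.//*[@class="hovertips"])').strip()
        value = li.xpath('string(.//*[@class="c_feature_name"])').strip()
        if label:
            features[label] = value
    return features


print(parse_company_features(etree.HTML(SAMPLE)))
```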

Complete code

from lxml import etree
import requests
import re
import json

headers_data = {
    'Host': 'www.lagou.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # Accept-Encoding selects the compression algorithm; it is left commented out
    # because a compressed body is hard to inspect and has to be decompressed first
    # 'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cookie': 'pgv_pvi=7044633600; tencentSig=6792114176; IESESSION=alive; pgv_si=s3489918976; CNZZDATA4617777=cnzz_eid%3D768417915-1468987955-%26ntime%3D1470191347; _qdda=3-1.1; _qddab=3-dyl6uh.ireawgo0; _qddamta_800068868=3-0'
}
response = requests.get('https://www.lagou.com/jobs/7029660.html?source=position_rec&i=position_rec-1', headers=headers_data)
resHtml = response.text
html = etree.HTML(resHtml)

#company
#job
#salary
#address
#experience
#education
#worktype
#job_tag

position_content = html.xpath('//*[@class="position-content-l"]')

job_name = position_content[0].xpath('.//*[@class="job-name"]')

# Children of job-name: <h4 class="company">, <h1 class="name">, <div class="marEdit">
print(job_name[0][0].text)  # company / department
print(job_name[0][1].text)  # job title
# job_name[0][2] is the empty marEdit div, so it has nothing worth printing

# The <h3> under job_request holds five <span>s:
# salary, city, experience, education, work type
job_request = html.xpath('//dd[@class="job_request"]')[0]
for span in job_request[0]:
    # Drop the surrounding whitespace and the "/" separators
    print(span.text.strip(' /\n'))

# Each <li> under position-label is one job tag
job_tag = html.xpath('//*[@class="position-label clearfix"]')[0]
for tag in job_tag:
    print(tag.text)

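# Job details
## job_advantage
## job_description
job_advantage = html.xpath('//*[@class="job-advantage"]/p')[0].text
print(job_advantage)
print(html.xpath('//*[@class="job_bt"]/h3')[0].text)

# Each <p> in job-detail is one line of the description; skip empty
# paragraphs such as <p><br></p>, whose text would print as None
for sub in html.xpath('//*[@class="job_bt"]/div')[0]:
    text = ''.join(sub.itertext()).strip()
    if text:
        print(text)

# job_address
# Join the text nodes inside work_addr (city, district, street address),
# leaving out the "查看地图" (view map) link
work_addr = html.xpath('//*[@class="work_addr"]')[0]
parts = work_addr.xpath('./text() | ./a[not(@id="mapPreview")]/text()')
job_address = ' '.join(''.join(parts).split())
print(job_address)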

# Company information
##company_area 
##company_feature_name 
##company_scale 
##company_home_page

print(html.xpath('//*[@class="job_company_content"]/h3/em')[0].text.strip())
c_feature=html.xpath('//*[@class="c_feature"]')[0]

for fea in c_feature:
    print(fea.xpath('.//*[@class="c_feature_name"]')[0].text)
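The complete script prints each field rather than storing it. One possible next step, sketched here with only the standard library, is to gather the values into a record, turn the salary text into numbers, and serialize the record as JSON. `parse_salary`, `to_json`, and the `salary_range` key are hypothetical names, and the record values are copied from the sample page.

```python
import json
import re


def parse_salary(text):
    """Turn a range such as '15k-30k' into (15000, 30000); None if no match."""
    match = re.fullmatch(r'\s*(\d+)k\s*-\s*(\d+)k\s*', text, flags=re.IGNORECASE)
    if match is None:
        return None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high


def to_json(record):
    """Serialize one job record; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)


record = {
    'company': '云智慧开发部招聘',
    'job': 'python开发工程师',
    'salary': '15k-30k',
    'salary_range': parse_salary('15k-30k '),
    'job_tag': ['企业服务', '大数据'],
}
print(to_json(record))
```

The JSON format has no tuple type, so `salary_range` is written out as a two-element list.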

References

https://piaosanlang.gitbooks.io/spiders/03day/section3.5.html
