The robots protocol on shopping sites and web mini-game sites

2017-07-04, by 十三不好听

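The Robots protocol (also called the crawler protocol or robot protocol) is formally the "Robots Exclusion Protocol". A website uses it to tell search engines which pages may be crawled and which may not. (Baidu Baike)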

File syntax

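User-agent: *   Here * is a wildcard that stands for every crawler.
Disallow: /ABC/   Blocks crawling of everything under the /ABC/ directory.
Disallow: /ab/adc.html   Blocks crawling of the file adc.html in the /ab/ folder.
Allow: /cgi-bin/   Permits crawling of everything under the /cgi-bin/ directory.
Allow: /tmp   Permits crawling of any path that begins with /tmp, which includes the whole /tmp directory.
Sitemap: (site map URL)   Tells the crawler that this URL is the site map.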
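To see how a crawler might apply these directives, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "AnyBot" and the test paths are made up for illustration. Note that this module applies the first matching rule in file order, whereas some crawlers prefer the longest matching rule, so results can differ for overlapping rules.

```python
from urllib.robotparser import RobotFileParser

# The sample directives from the list above, as one robots.txt body.
RULES = """\
User-agent: *
Disallow: /ABC/
Disallow: /ab/adc.html
Allow: /cgi-bin/
Allow: /tmp
"""

def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(RULES)

# "AnyBot" is a hypothetical crawler name; it falls under "User-agent: *".
for path in ("/ABC/page.html", "/ab/adc.html", "/ab/other.html", "/tmp/file"):
    print(path, parser.can_fetch("AnyBot", path))
```

With these rules, the first two paths are refused and the last two are permitted: a path that matches no rule at all is allowed by default.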

Shopping sites

Amazon China

https://www.amazon.cn/robots.txt

User-agent: *
Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
Disallow: /dll
Disallow: /doc
Disallow: /dp/e-mail-friend/
Disallow: /dp/manual-submit/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /gp/aws/ssop
Disallow: /gp/cart
Disallow: /gp/css/homepage.html
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/flex
Disallow: /gp/gfix
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/music/clipserve
Disallow: /gp/music/wma-pop-up
Disallow: /gp/offer-listing
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /gp/twitter/
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/yourstore
Disallow: /inc
Disallow: /js
Disallow: /lib
Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
Disallow: /mn/orderAddrApp
Disallow: /mn/orderCfmApp
Disallow: /mn/orderDetailApp
Disallow: /mn/orderFailApp
Disallow: /mn/orderHistoryApp
Disallow: /mn/orderModifyApp
Disallow: /mn/orderSummaryApp
Disallow: /mn/paymentRedriveApp
Disallow: /mn/recommendReviewApp
Disallow: /mn/releaseReviewApp
Disallow: /mn/reviewVoteApplication
Disallow: /mn/selectPaymentMethodApp
Disallow: /mn/selectShippingOpptionApplication
Disallow: /mn/shipmentTraceApp
Disallow: /mn/shoppingCartApplication
Disallow: /mn/tellFriend
Disallow: /mn/thankYouApplication
Disallow: /mn/virtualAccountApp
Disallow: /mn/yourAccountApp
Disallow: /paper
Disallow: /xml
Disallow: /youraccount
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /gp/help/contact-us/general-questions.html*?type&email&skip=true
Disallow: /gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
Disallow: /gp/registry/search.html
Disallow: /gp/orc/rml/
Disallow: /gp/digital/fiona/manage
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
Disallow: /gp/customer-media/upload/
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Disallow: /registry
Disallow: /*/wishlist
Disallow: /gp/registry
Disallow: /gp/aag
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/aw/so.html
Disallow: /gp/pdp/profile/
Disallow: /gp/help/customer/display.html*nodeId=200843370
Disallow: /gp/help/customer/display.html*nodeId=200877580
Disallow: /gp/help/customer/display.html*nodeId=200877590
Disallow: /gp/help/customer/display.html*nodeId=200879080
Disallow: /gp/help/customer/display.html*nodeId=200879100
Disallow: /gp/help/customer/display.html*nodeId=200879120
Disallow: /gp/help/customer/display.html*nodeId=200879160
Disallow: /gp/help/customer/display.html*nodeId=200879140
Disallow: /gp/help/customer/display.html*nodeId=200877610
Disallow: /gp/help/customer/display.html*nodeId=200878960
Disallow: /gp/help/customer/display.html*nodeId=200878980
Disallow: /gp/help/customer/display.html*nodeId=200879000
Disallow: /gp/help/customer/display.html*nodeId=200879040
Disallow: /gp/help/customer/display.html*nodeId=200879020
Disallow: /gp/help/customer/display.html*nodeId=200877630
Disallow: /gp/help/customer/display.html*nodeId=200879200
Disallow: /gp/help/customer/display.html*nodeId=200879220
Disallow: /gp/help/customer/display.html*nodeId=200879240
Disallow: /gp/help/customer/display.html*nodeId=200879280
Disallow: /gp/help/customer/display.html*nodeId=200879260
Disallow: /gp/help/customer/display.html*nodeId=200877650
Disallow: /gp/help/customer/display.html*nodeId=200879320
Disallow: /gp/help/customer/display.html*nodeId=200879340
Disallow: /gp/help/customer/display.html*nodeId=200879360
Disallow: /gp/help/customer/display.html*nodeId=200879400
Disallow: /gp/help/customer/display.html*nodeId=200879380
Disallow: /gp/help/customer/display.html*nodeId=200877560
Disallow: /gp/help/customer/display.html*nodeId=200843460
Disallow: /gp/help/customer/display.html*nodeId=200843440
Disallow: /gp/help/customer/display.html*nodeId=200899270
Disallow: /gp/help/customer/display.html*nodeId=200879440
Disallow: /gp/help/customer/display.html*nodeId=200899330
Disallow: /gp/help/customer/display.html*nodeId=200899350
Disallow: /gp/help/customer/display.html*nodeId=200899390
Disallow: /gp/help/customer/display.html*nodeId=200899410
Disallow: /gp/help/customer/display.html*nodeId=200899430
Disallow: /gp/help/customer/display.html*nodeId=200899220
Disallow: /gp/help/customer/display.html*nodeId=200899450
Disallow: /gp/help/customer/display.html*nodeId=200899670
Disallow: /gp/help/customer/display.html*nodeId=200899530
Disallow: /gp/help/customer/display.html*nodeId=200899470
Disallow: /gp/help/customer/display.html*nodeId=200899550
Disallow: /gp/help/customer/display.html*nodeId=200899570
Disallow: /gp/help/customer/display.html*nodeId=200899510
Disallow: /gp/help/customer/display.html*nodeId=200899610
Disallow: /gp/help/customer/display.html*nodeId=200899630
Disallow: /gp/help/customer/display.html*nodeId=200899650
Disallow: /gp/help/customer/display.html*nodeId=200879180
Disallow: /gp/help/customer/display.html*nodeId=200879060
Disallow: /gp/help/customer/display.html*nodeId=200879300
Disallow: /gp/help/customer/display.html*nodeId=200879420
Disallow: /gp/help/customer/display.html*nodeId=200899290
Disallow: /gp/help/customer/display.html*nodeId=200899310
Disallow: /gp/help/customer/display.html*nodeId=200843380
Disallow: /gp/help/customer/display.html*nodeId=200843420
Disallow: /gp/help/customer/display.html*nodeId=200899230
Disallow: /gp/help/customer/display.html*nodeId=200899250
Disallow: /gp/help/customer/display.html*nodeId=200899370
Disallow: /reviews/iframe
Disallow: /gp/help/reports/infringement/jquery/handle-notice-submit.html
Disallow: /gp/help/customer/handler/handle-email-submit.html

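Among the disallowed paths, those that lead to pages a visitor can actually view include: the shopping cart, sign-in, category listings, personal account pages, purchase history, official information, the home page, wish lists, customer service contact, contact us, my e-books, and help.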

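Amazon mainly blocks crawling of commercial information and users' personal information. With data leaks increasingly common, protecting user privacy matters especially for an online shopping platform: it safeguards both users' property and their personal safety. That said, Amazon also leaves some content open to crawling, such as the Allow exceptions listed under /wishlist/ and /gp/wishlist/.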
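Several Amazon rules use the * wildcard, and the Allow lines carve exceptions out of broader Disallow lines. The sketch below shows one way a crawler could evaluate such rules, assuming the longest-match precedence described in RFC 9309 (the most specific pattern wins, and Allow wins a tie). The function names are hypothetical, only a handful of rules from the listing are included, and Amazon's own intent for any given crawler is not confirmed by the source.

```python
import re

# A few rules copied from the Amazon listing above.
AMAZON_RULES = [
    ("Disallow", "/wishlist/"),
    ("Allow", "/wishlist/universal*"),
    ("Disallow", "/*/wishlist"),
    ("Disallow", "/gp/help/customer/display.html*nodeId=200843370"),
]

def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex anchored at the path start.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Return True if path may be crawled under longest-match precedence."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed

print(is_allowed(AMAZON_RULES, "/wishlist/abc"))
print(is_allowed(AMAZON_RULES, "/wishlist/universal-button"))
```

Under these assumptions /wishlist/abc is blocked, while /wishlist/universal-button is permitted because the longer Allow pattern overrides the shorter Disallow.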

Taobao

https://www.taobao.com/robots.txt

User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /
User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /wenzhang
Allow: /oversea
Disallow: /
User-agent: Bingbot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /wenzhang
Allow: /oversea
Disallow: /
User-Agent: 360Spider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /
User-Agent: Yisouspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /
User-Agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /wenzhang
Disallow: /
User-Agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /wenzhang
Allow: /oversea
Disallow: /
User-Agent: *
Disallow: /

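Baiduspider: the Baidu spider, an automated program run by the Baidu search engine. It visits, collects, and organizes web pages, images, video, and other content on the internet, then builds a categorized index database so that users can find your site's pages, images, and video through Baidu search. (Baidu Baike)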

Googlebot: Google's web crawling robot. (Baidu Baike)

Bingbot: the name of the Bing search engine's crawler, which leaves its footprint on each site it crawls. (Baidu Tieba)
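Because Taobao's file gives each crawler its own group of rules, the same path can be open to one crawler and closed to another. The following sketch checks an abridged copy of the listing above with Python's standard urllib.robotparser. "UnknownBot" and the sample paths are made up for illustration, and real crawlers may resolve overlapping rules differently from this module, which uses the first matching rule in file order.

```python
from urllib.robotparser import RobotFileParser

# Abridged from the Taobao listing above: two named groups plus the catch-all.
TAOBAO_RULES = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /
User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /wenzhang
Allow: /oversea
Disallow: /
User-Agent: *
Disallow: /
"""

taobao = RobotFileParser()
taobao.parse(TAOBAO_RULES.splitlines())

# Compare how three crawlers are treated on the same hypothetical paths.
for agent in ("Baiduspider", "Googlebot", "UnknownBot"):
    for path in ("/product/1", "/article/1"):
        print(agent, path, taobao.can_fetch(agent, path))
```

In this abridged file Googlebot may fetch both paths, Baiduspider may fetch only /article/1, and a crawler without its own group is refused everywhere.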

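Today, a Baidu search for Taobao returns a result that reads: "Because this site's robots.txt file contains restrictive directives, the system cannot provide a description of this page's content." In fact, Baidu and Taobao are each trying to cultivate the user habit that best serves their own interests: getting users to make purchase decisions through their own search engine. Whoever controls the user's point of entry can sell many kinds of paid promotion against search rankings. If Taobao lifted its robots.txt restrictions on Baiduspider, Baidu, as China's largest search engine, would probably build an open platform around Taobao listings and eat into Taobao's paid-placement market. For strong brands, building an independent store that diverts traffic from their Taobao shop has two benefits. First, it avoids staking everything on Taobao and having to buy expensive home page ads through its bidding system (the same applies to Baidu). Second, it strengthens the brand and trains shoppers to search for the brand directly.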

Web mini-games

4399

http://www.4399.com/robots.txt

User-agent: *
Disallow: /upload_pic/
Disallow: /upload_swf/
Disallow: /360/
Disallow: /public/
Disallow: /yxbox/
Disallow: /360game/
Disallow: /loadimg/
Disallow: /index_pc.htm
Disallow: /flash/32979_pc.htm
Disallow: /flash/35538_pc.htm
Disallow: /flash/48399_pc.htm
Disallow: /flash/seer_pc.htm
Disallow: /flash/58739_pc.htm
Disallow: /flash/78072_pc.htm
Disallow: /flash/130396_pc.htm
Disallow: /flash/80727_pc.htm
Disallow: /flash/151038_pc.htm
Disallow: /flash/10379_pc.htm
Disallow: /index_old.htm

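Among the disallowed paths, those that lead to pages a visitor can actually view include: the game list, the list of newest fun mini-games, the home page, 洛克王国, 奥拉星, 赛尔号, 龙战士, 造梦西游3之大闹天庭篇, 爆枪英雄, 勇士的信仰(正式版), 造梦西游4洪荒大劫篇, 奥比岛, and the old version of the home page.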

7k7k

http://www.7k7k.com/robots.txt

User-agent: *
Disallow: /doyo/
Disallow: /doyoweb/
Disallow: /yy/
Disallow: /data/
Disallow: /widget/
Disallow: /api/
Disallow: /classic
Disallow: /classic/
Disallow: /classic/tag/
Disallow: /classic/swf/
Disallow: /classic/flash_fl/
Disallow: /classic/top/
Disallow: /classic/flash/
Disallow: /classic/index.htm
Disallow: /new/
Disallow: /m-iphone/art/
Disallow: /m-ipad/art/
Disallow: /m-android/art/

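Among the disallowed paths, those that lead to pages a visitor can actually view include: the daily list of newest Flash games, the game category list, the game list, the game category tag list, the game rankings, and the home page.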
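One detail in the 7k7k listing: Disallow values are matched as path prefixes, so "Disallow: /classic" already covers every "/classic/..." line that follows it. The sketch below flags such covered entries. It assumes plain prefix matching with no wildcards, and the function name is hypothetical.

```python
# Disallow paths taken from the 7k7k listing above (abridged).
DISALLOWED = [
    "/doyo/",
    "/classic",
    "/classic/",
    "/classic/tag/",
    "/classic/index.htm",
    "/new/",
]

def redundant_rules(paths):
    """Return the paths that are already covered by a shorter prefix in the list."""
    redundant = []
    for path in paths:
        # A rule is redundant if some other rule is a prefix of it.
        if any(other != path and path.startswith(other) for other in paths):
            redundant.append(path)
    return redundant

print(redundant_rules(DISALLOWED))
```

Here the three "/classic/..." entries are reported, since "/classic" alone blocks them. Listing them explicitly does no harm and can serve as documentation for people reading the file.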

2144

http://www.2144.cn/robots.txt

User-agent:Mediapartners-Google
Disallow:
User-agent: *
Allow: /girls/?
Disallow: /tuan
Disallow: /v3
Disallow: /hz/cntv
Disallow: /testdsadsa21321
Disallow: /xxx
Disallow: /api
Disallow: /game.htm
Disallow: /index_test.htm
Disallow: /webgame.htm
Disallow: /index1.htm
Disallow: /index_old.htm
Disallow: /index_2010.htm
Disallow: /index_2011.htm
Disallow: /index_2012.htm
Disallow: /game_test.php
Disallow: /listgame.php
Disallow: /cj.php
Disallow: /sdogame.php
Disallow: /archiver
Disallow: /YouXi
Disallow: /sdo
Disallow: /Archives
Disallow: /public
Disallow: /html/26/51653/
Disallow: /html/14/51654/
Disallow: /html/14/51655/
Disallow: /html/26/51857/
Disallow: /html/14/51863/
Disallow: /html/14/51862/
Disallow: /html/14/51861/
Disallow: /html/26/51858/
Disallow: /html/26/51859/
Disallow: /2345/
Disallow: /2144com/
Disallow: /xyx/
Disallow: /xiaoyouxi/
Disallow: /2015/
Disallow: /2016/

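Among the disallowed paths, those that lead to pages a visitor can actually view include: the girls' games list, the home page, the old version of the home page, 三国战纪, 战神盟, 三国志, 三国战, and the game list.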
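The first group in the 2144 listing deserves a note: a Disallow line with an empty value blocks nothing, so Mediapartners-Google is allowed everywhere while other crawlers fall under the long list. A minimal check with Python's standard urllib.robotparser, using an abridged copy of the listing and made-up paths ("OtherBot" is a hypothetical crawler name):

```python
from urllib.robotparser import RobotFileParser

# Abridged from the 2144 listing above.
RULES_2144 = """\
User-agent:Mediapartners-Google
Disallow:
User-agent: *
Allow: /girls/?
Disallow: /tuan
Disallow: /api
Disallow: /index_old.htm
"""

site_2144 = RobotFileParser()
site_2144.parse(RULES_2144.splitlines())

# The empty Disallow leaves Mediapartners-Google unrestricted.
print(site_2144.can_fetch("Mediapartners-Google", "/api/list"))
# Any other crawler is held to the catch-all group.
print(site_2144.can_fetch("OtherBot", "/api/list"))
print(site_2144.can_fetch("OtherBot", "/games/1"))
```

The first and third checks succeed and the second is refused.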

Most web mini-game sites disallow crawling of the home page, game lists, game category lists, and some individual game pages.

Summary

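Shopping sites mostly focus on protecting user information and on site traffic. Web mini-game sites care about traffic too, but they also put weight on protecting their teams' creative work.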

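Sites set up the Robots protocol out of security and privacy concerns, to keep search engines from crawling sensitive information. The protocol embodies a spirit of contract: it is advisory, so the privacy of sites and their users stays protected only when internet companies actually honor it. It is an important rule for privacy and security on the internet and, so far, arguably the most effective means available. It relies on self-discipline to keep the balance between websites and search engines, so that the interests of neither side tilt too far.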