An Analysis of Amazon's robots.txt
1. The robots protocol
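The Robots protocol (also called the crawler protocol or robot protocol), formally the Robots Exclusion Protocol, is how a website tells search engines which pages may be crawled and which may not. robots.txt is a plain text file that can be created and edited with any common text editor. It is a convention rather than a command: compliance is voluntary on the crawler's part. It is the first file a search engine crawler looks for when it visits a site, and it tells the spider which files on the server may be viewed.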
(Source: Baidu Baike, entry on the robots protocol)
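As a rough illustration of how a crawler consults these rules, the sketch below feeds a two-rule excerpt to Python's standard `urllib.robotparser`. The excerpt reuses two Disallow paths from the file analyzed in this post; the crawler name `ExampleBot` and the host `www.example.com` are placeholders chosen for the example, not anything taken from Amazon.

```python
from urllib.robotparser import RobotFileParser

# A short excerpt in robots.txt syntax; '#' starts a comment.
EXCERPT = """\
User-agent: *   # applies to every crawler
Disallow: /cart
Disallow: /checkout
"""


def build_parser(text):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(EXCERPT)

# 'ExampleBot' and 'www.example.com' are placeholders for illustration.
print(parser.can_fetch("ExampleBot", "https://www.example.com/cart"))           # False
print(parser.can_fetch("ExampleBot", "https://www.example.com/cartoon"))        # False
print(parser.can_fetch("ExampleBot", "https://www.example.com/dp/B000000000"))  # True
```

The second call shows that a plain rule is a prefix match: `Disallow: /cart` also covers `/cartoon`. A path that no rule matches is allowed by default.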
2. Amazon's robots file
User-agent: * # applies to all crawlers
Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
Disallow: /dll
Disallow: /doc
# Blocks crawling of the buycar, cart, checkout, class, com, common, css, dll, and doc directories
Disallow: /dp/e-mail-friend/
Disallow: /dp/manual-submit/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
# Blocks crawling of the e-mail-friend, manual-submit, product-availability, rate-this-item, shipping, and twister-update directories under dp (presumably pages for rating products, submitting forms, and the like)
Disallow: /gp/aws/ssop
Disallow: /gp/cart
Disallow: /gp/css/homepage.html
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/flex
Disallow: /gp/gfix
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/music/clipserve
Disallow: /gp/music/wma-pop-up
Disallow: /gp/offer-listing
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /gp/twitter/
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/yourstore
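# Blocks crawling of the listed paths under gp (customer reviews, browsing history, and the per-product rating, e-mail, and share-to-Twitter pages, among others)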
Disallow: /inc
Disallow: /js
Disallow: /lib
# Blocks crawling of the inc, js, and lib directories
Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
Disallow: /mn/orderAddrApp
Disallow: /mn/orderCfmApp
Disallow: /mn/orderDetailApp
Disallow: /mn/orderFailApp
Disallow: /mn/orderHistoryApp
Disallow: /mn/orderModifyApp
Disallow: /mn/orderSummaryApp
Disallow: /mn/paymentRedriveApp
Disallow: /mn/recommendReviewApp
Disallow: /mn/releaseReviewApp
Disallow: /mn/reviewVoteApplication
Disallow: /mn/selectPaymentMethodApp
Disallow: /mn/selectShippingOpptionApplication
Disallow: /mn/shipmentTraceApp
Disallow: /mn/shoppingCartApplication
Disallow: /mn/tellFriend
Disallow: /mn/thankYouApplication
Disallow: /mn/virtualAccountApp
Disallow: /mn/yourAccountApp
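# Blocks crawling of the listed paths under mn (account sign-in, account sign-out, payment method selection, order details, failed orders, order history, all orders, shipping option selection, shipment tracking, and so on)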
Disallow: /paper
Disallow: /xml
Disallow: /youraccount
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
# Blocks crawling of the user account, sign-in, and wish list directories, among others
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
# Allows access to the listed paths under wishlist
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
# Blocks everything under gp/wishlist except the three listed paths
Disallow: /registry/wishlist/
Disallow:/gp/help/contact-us/general-questions.html*?type&email&skip=true
Disallow:/gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
Disallow: /gp/registry/search.html
Disallow: /gp/orc/rml/
Disallow: /gp/digital/fiona/manage
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
Disallow: /gp/customer-media/upload/
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Disallow: /registry
Disallow: /*/wishlist
Disallow: /gp/registry
Disallow: /gp/aag
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/aw/so.html
Disallow: /gp/pdp/profile/
# Blocks access to the paths listed above
Disallow: /gp/help/customer/display.html*nodeId=200843370
Disallow: /gp/help/customer/display.html*nodeId=200877580
Disallow: /gp/help/customer/display.html*nodeId=200877590
Disallow: /gp/help/customer/display.html*nodeId=200879080
Disallow: /gp/help/customer/display.html*nodeId=200879100
Disallow: /gp/help/customer/display.html*nodeId=200879120
Disallow: /gp/help/customer/display.html*nodeId=200879160
Disallow: /gp/help/customer/display.html*nodeId=200879140
Disallow: /gp/help/customer/display.html*nodeId=200877610
Disallow: /gp/help/customer/display.html*nodeId=200878960
Disallow: /gp/help/customer/display.html*nodeId=200878980
Disallow: /gp/help/customer/display.html*nodeId=200879000
Disallow: /gp/help/customer/display.html*nodeId=200879040
Disallow: /gp/help/customer/display.html*nodeId=200879020
Disallow: /gp/help/customer/display.html*nodeId=200877630
Disallow: /gp/help/customer/display.html*nodeId=200879200
Disallow: /gp/help/customer/display.html*nodeId=200879220
Disallow: /gp/help/customer/display.html*nodeId=200879240
Disallow: /gp/help/customer/display.html*nodeId=200879280
Disallow: /gp/help/customer/display.html*nodeId=200879260
Disallow: /gp/help/customer/display.html*nodeId=200877650
Disallow: /gp/help/customer/display.html*nodeId=200879320
Disallow: /gp/help/customer/display.html*nodeId=200879340
Disallow: /gp/help/customer/display.html*nodeId=200879360
Disallow: /gp/help/customer/display.html*nodeId=200879400
Disallow: /gp/help/customer/display.html*nodeId=200879380
Disallow: /gp/help/customer/display.html*nodeId=200877560
Disallow: /gp/help/customer/display.html*nodeId=200843460
Disallow: /gp/help/customer/display.html*nodeId=200843440
Disallow: /gp/help/customer/display.html*nodeId=200899270
Disallow: /gp/help/customer/display.html*nodeId=200879440
Disallow: /gp/help/customer/display.html*nodeId=200899330
Disallow: /gp/help/customer/display.html*nodeId=200899350
Disallow: /gp/help/customer/display.html*nodeId=200899390
Disallow: /gp/help/customer/display.html*nodeId=200899410
Disallow: /gp/help/customer/display.html*nodeId=200899430
Disallow: /gp/help/customer/display.html*nodeId=200899220
Disallow: /gp/help/customer/display.html*nodeId=200899450
Disallow: /gp/help/customer/display.html*nodeId=200899670
Disallow: /gp/help/customer/display.html*nodeId=200899530
Disallow: /gp/help/customer/display.html*nodeId=200899470
Disallow: /gp/help/customer/display.html*nodeId=200899550
Disallow: /gp/help/customer/display.html*nodeId=200899570
Disallow: /gp/help/customer/display.html*nodeId=200899510
Disallow: /gp/help/customer/display.html*nodeId=200899610
Disallow: /gp/help/customer/display.html*nodeId=200899630
Disallow: /gp/help/customer/display.html*nodeId=200899650
Disallow: /gp/help/customer/display.html*nodeId=200879180
Disallow: /gp/help/customer/display.html*nodeId=200879060
Disallow: /gp/help/customer/display.html*nodeId=200879300
Disallow: /gp/help/customer/display.html*nodeId=200879420
Disallow: /gp/help/customer/display.html*nodeId=200899290
Disallow: /gp/help/customer/display.html*nodeId=200899310
Disallow: /gp/help/customer/display.html*nodeId=200843380
Disallow: /gp/help/customer/display.html*nodeId=200843420
Disallow: /gp/help/customer/display.html*nodeId=200899230
Disallow: /gp/help/customer/display.html*nodeId=200899250
Disallow: /gp/help//display.html*nodeId=200899370
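# Blocks crawling of the listed pages under gp/help (these look like canned replies to specific questions asked when contacting Amazon customer service)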
Disallow: /reviews/iframe
Disallow:/gp/help/reports/infringement/jquery/handle-notice-submit.html
Disallow: /gp/help/customer/handler/handle-email-submit.html
Disallow: /ss/customer-reviews/lighthouse/
Disallow: /gp/aw/ol/
# Blocks crawling of the paths listed above
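Several rules in the listing, such as the `display.html*nodeId=...` lines, carry a `*` in the middle of the path. Under RFC 9309, `*` matches any run of characters and a trailing `$` anchors the end of the URL. The sketch below is one way to express that matching, assuming RFC 9309 semantics (the file itself does not state which matcher it expects). `compile_rule` and `is_blocked` are hypothetical helper names introduced for this example. Python's `urllib.robotparser` does not interpret these wildcards, so a small hand-written matcher is used instead.

```python
import re


def compile_rule(pattern):
    """Turn a robots.txt path pattern into a compiled regular expression.

    Assumed semantics (RFC 9309): '*' matches any sequence of characters,
    a trailing '$' anchors the end, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


# One rule copied from the listing above.
RULE = "/gp/help/customer/display.html*nodeId=200843370"
MATCHER = compile_rule(RULE)


def is_blocked(path_and_query):
    """Return True when the rule matches from the start of the path."""
    return MATCHER.match(path_and_query) is not None


print(is_blocked("/gp/help/customer/display.html?nodeId=200843370"))          # True
print(is_blocked("/gp/help/customer/display.html?ie=UTF8&nodeId=200843370"))  # True
print(is_blocked("/gp/help/customer/display.html?nodeId=200000000"))          # False
print(is_blocked("/gp/help/customer/display.html"))                           # False
```

The `*` is what lets a single rule cover the same help page regardless of which other query parameters precede `nodeId`.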
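Amazon's robots.txt is quite detailed and blocks a large number of customer-related and product-related paths. The only explicit Allow rules in this file are for a handful of wishlist paths; anything not matched by a Disallow rule remains crawlable by default. My personal guess is that these explicitly allowed files are used, through the browser, to push related product information to users and steer them toward the site.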
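How the wishlist Allow rules carve exceptions out of `Disallow: /wishlist/` depends on the precedence rule the crawler applies. A minimal sketch follows, assuming the RFC 9309 rule that the longest matching pattern wins and that a tie goes to Allow. `is_allowed` is a hypothetical helper, and for brevity it only handles patterns whose `*` is trailing, which is true of the four rules used here.

```python
# The wishlist rules from the listing, as (directive, pattern) pairs.
WISHLIST_RULES = [
    ("Disallow", "/wishlist/"),
    ("Allow", "/wishlist/universal*"),
    ("Allow", "/wishlist/vendor-button*"),
    ("Allow", "/wishlist/get-button*"),
]


def is_allowed(path, rules):
    """Apply longest-match precedence (assumed, per RFC 9309) to a path."""
    best_len = -1
    allowed = True  # a path matched by no rule is allowed by default
    for directive, pattern in rules:
        # A trailing '*' adds nothing to a prefix match, so drop it.
        prefix = pattern.rstrip("*")
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer match wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


print(is_allowed("/wishlist/universal-info", WISHLIST_RULES))  # True
print(is_allowed("/wishlist/ls/ABC", WISHLIST_RULES))          # False
print(is_allowed("/dp/B000000000", WISHLIST_RULES))            # True
```

Parsers that apply rules in file order, as Python's `urllib.robotparser` does, would hit `Disallow: /wishlist/` first and block all three exceptions, so the outcome is crawler-dependent.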