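Understanding the robots protocol and the robots meta tag
- The robots.txt protocol
# The file name must be robots.txt, all lowercase, not Robots.TXT
# For all robots, allow access to every part of the site
# The original documentation puts it as: To allow all robots complete access: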
User-agent: *
Disallow:
# Alternatively (from the documentation): or just create an empty "/robots.txt" file, or don't use one at all
# For all robots, disallow access to every part of the site
User-agent: *
Disallow: /
# For all robots, disallow access to certain parts of the site
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
# For all robots, disallow access to one specific file
User-agent: *
Disallow: /directory/file.html
# For two robots, disallow access to a specific directory
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
# Allow a single robot to access everything and block all others
User-agent: Google
Disallow:
User-agent: *
Disallow: /
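The rule sets above can be tried out with Python's standard-library `urllib.robotparser`. This is only a sketch: the bot name `SomeOtherBot` and the `example.com` URLs are made up for illustration, and the results reflect how this one parser interprets the rules, not how every crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Every robot is kept out of three directories.
partial = build_parser("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
""")

# One robot may fetch everything; all other robots are blocked.
single = build_parser("""\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
""")

if __name__ == "__main__":
    # "SomeOtherBot" and example.com are hypothetical names.
    print(partial.can_fetch("SomeOtherBot", "http://example.com/tmp/a.html"))  # False
    print(partial.can_fetch("SomeOtherBot", "http://example.com/index.html"))  # True
    print(single.can_fetch("Google", "http://example.com/index.html"))         # True
    print(single.can_fetch("SomeOtherBot", "http://example.com/index.html"))   # False
```

An empty `Disallow:` line is treated as "allow everything", which is why the `Google` group above can fetch any path.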
# Rules for specific robots
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
Example demonstrating multiple user-agents:
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
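To make the multi-user-agent example concrete, here is a minimal sketch of how a crawler might choose which group applies to it. It assumes the crawler picks the most specific (longest) matching token and falls back to `*`, and it matches tokens by simple prefix; both are simplifying assumptions, and real crawlers may differ. The names `GROUPS`, `select_group`, and `is_allowed` are hypothetical.

```python
from typing import Dict, List

# Groups from the example above: user-agent token -> disallowed path prefixes.
GROUPS: Dict[str, List[str]] = {
    "googlebot": ["/private/"],
    "googlebot-news": ["/"],
    "*": ["/something/"],
}


def select_group(user_agent: str, groups: Dict[str, List[str]]) -> str:
    """Return the token of the most specific group matching user_agent."""
    ua = user_agent.lower()
    # Assumption: a token matches when the user-agent name starts with it.
    matches = [t for t in groups if t != "*" and ua.startswith(t)]
    if matches:
        return max(matches, key=len)  # the longest token is the most specific
    return "*"


def is_allowed(user_agent: str, path: str,
               groups: Dict[str, List[str]] = GROUPS) -> bool:
    """Check a path against the disallow prefixes of the selected group."""
    token = select_group(user_agent, groups)
    return not any(path.startswith(prefix) for prefix in groups.get(token, []))


if __name__ == "__main__":
    print(select_group("Googlebot-News", GROUPS))        # googlebot-news
    print(is_allowed("Googlebot-News", "/index.html"))   # False
    print(is_allowed("Googlebot", "/index.html"))        # True
    print(is_allowed("Googlebot", "/private/a.html"))    # False
    print(is_allowed("SomeOtherBot", "/something/x"))    # False
```

Note that Python's own `urllib.robotparser` uses the first group whose token is a substring of the user-agent name, so with the file above it would apply the `googlebot` group to `googlebot-news`.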
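Beyond this, there are also some nonstandard extensions:
Visit
and search the page (Ctrl+F) for "Nonstandard extensions"
(The following content is unverified){
# Robot version number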
Robot-version: Version 2.0
# Allow visits between 00:00 and 07:00 (local server time?)
Visit-time: 0000-0700
# Limit the request rate: between 02:00 and 06:00, at most 10 requests per minute
Request-rate: 10/1m 0200-0600
}
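Since these extensions are unverified, the following is only a sketch of how a polite crawler could interpret a `Visit-time` value. It assumes the value has the form `HHMM-HHMM`, that both ends are inclusive, and that the caller supplies the current time in whatever time zone the site intends. None of this is confirmed by the source, and the helper names are hypothetical.

```python
from datetime import time
from typing import Tuple


def parse_visit_time(value: str) -> Tuple[time, time]:
    """Parse a window such as '0000-0700' into (start, end) times."""
    start, end = value.strip().split("-")
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))


def in_window(now: time, window: Tuple[time, time]) -> bool:
    """Return True if `now` lies inside the window (ends assumed inclusive)."""
    start, end = window
    if start <= end:
        return start <= now <= end
    # The window wraps past midnight, e.g. 2200-0300.
    return now >= start or now <= end


if __name__ == "__main__":
    window = parse_visit_time("0000-0700")
    print(in_window(time(3, 30), window))   # True
    print(in_window(time(12, 0), window))   # False
```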
- The robots meta tag
Format (from the documentation):
...
……
Purpose (from the documentation): You can use a special HTML tag to tell robots not to index the content of a page, and/or not scan it for links to follow.
# The NAME attribute must be ROBOTS
# Valid values for CONTENT="" are: index, noindex, follow, nofollow
# index tells the crawler that it may index the page; follow lets the crawler continue along the links on the page
# CONTENT defaults to INDEX and FOLLOW. From the original: the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:
# The other three combinations are:
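# Combination 1: CONTENT="NOINDEX, FOLLOW"
# Combination 2: CONTENT="INDEX, NOFOLLOW"
# Combination 3: CONTENT="NOINDEX, NOFOLLOW"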
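As an illustration of how a crawler could read these directives, here is a small sketch using Python's standard-library `html.parser`. The class and function names are hypothetical and the sample HTML is made up; the defaults follow the "INDEX,FOLLOW" default quoted above.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.index = True   # default is INDEX
        self.follow = True  # default is FOLLOW

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").lower().split(","):
            directive = directive.strip()
            if directive == "noindex":
                self.index = False
            elif directive == "nofollow":
                self.follow = False


def robots_directives(html: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for the given HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
    print(robots_directives(page))  # (False, True)
```

Tag and attribute names are lowercased by `HTMLParser`, so upper-case markup such as `<META NAME="ROBOTS" ...>` is handled the same way as lower-case markup.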
There are two important considerations when using the robots tag:
- robots can ignore your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrives at your undesired page.
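(The following content is unverified){
# Google's search engine fully supports the meta tags above and adds an archive attribute that indicates whether a cached snapshot of the page may be kept
# For example, the following allows all search engines to keep cached snapshots of the site's pages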
}
The FAQ links to a page called "more about robots", but it now returns a 404.
It mentions:
“
There is a Web robots home page on:
”
Differences and relationship between robots.txt and the meta tag
The robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Also, the robots meta tag only works on HTML pages, not images, text files, PDF documents, etc. Finally, if the pages/resources have already been excluded by a robots.txt file, then they will not be crawled and the meta tags and headers will have no effect. This can have the counterintuitive effect that a web address is indexed by a search engine such as Google if it honors the site's robots.txt, stops crawling and never receives the advice not to index the site.
Source: ITPUB blog, link: http://blog.itpub.net/29773961/viewspace-1377335/. If you repost this article, please credit the source; otherwise legal action may be pursued.
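The ordering described above can be written as a tiny decision function. This is a simplified sketch for a crawler that honors both mechanisms; the function name and the outcome strings are made up for illustration.

```python
def crawl_outcome(blocked_by_robots_txt: bool, page_has_noindex: bool) -> str:
    """Describe what a well-behaved crawler does with one URL."""
    if blocked_by_robots_txt:
        # The page is never requested, so any noindex meta tag is never seen.
        # The bare URL can still end up indexed, e.g. via links from elsewhere.
        return "not crawled; URL may still be indexed"
    if page_has_noindex:
        # The page was fetched, the meta tag was read and is respected.
        return "crawled; not indexed"
    return "crawled; indexed"


if __name__ == "__main__":
    for blocked in (True, False):
        for noindex in (True, False):
            print(blocked, noindex, "->", crawl_outcome(blocked, noindex))
```

The first branch is the counterintuitive case: combining a robots.txt block with a noindex tag leaves the noindex advice unread.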