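Understanding the robots protocol and the robots meta tag
- The robots.txt protocol
# The file name must be robots.txt (all lowercase), not Robots.TXT
# Allow all robots to access every part of the site
# The original documentation describes this as: To allow all robots complete access: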
User-agent: *
Disallow:
# Alternatively (from the documentation): or just create an empty "/robots.txt" file, or don't use one at all
# Disallow all robots from accessing any part of the site
User-agent: *
Disallow: /
# Disallow all robots from accessing certain parts of the site
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
# Disallow all robots from accessing one specific file
User-agent: *
Disallow: /directory/file.html
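The partial-disallow record above can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the agent name `ExampleBot` and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The "disallow certain parts" record from above, held in a string
# so that no network access is needed.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""


def build_parser(text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)
# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
for path in ("/index.html", "/tmp/cache.txt", "/cgi-bin/search"):
    print(path, parser.can_fetch("ExampleBot", path))
# /index.html True
# /tmp/cache.txt False
# /cgi-bin/search False
```

Matching is by path prefix, so everything below a disallowed directory is excluded along with the directory itself.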
# Disallow two specific robots from accessing a particular directory
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
# Allow a single robot to access everything and shut out all others
User-agent: Google
Disallow:
User-agent: *
Disallow: /
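To see how an empty `Disallow:` differs from `Disallow: /`, the record pair above can be fed to Python's standard `urllib.robotparser`. This is only a sketch: `OtherBot` is a hypothetical crawler name, and the blank line between the two records is added here for readability.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent, path, rules=RULES):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow value excludes nothing, so the named robot is let in.
print(allowed("Google", "/page.html"))
# Every other robot falls through to the "*" record, which excludes "/".
print(allowed("OtherBot", "/page.html"))
```

The first call should report that access is allowed and the second that it is denied.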
# Rules for specific robots
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
Example demonstrating multiple user-agents:
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
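The grouping in the example above can be made explicit with a small hand-written parser. This is a simplified sketch, not a complete robots.txt implementation: it handles only `User-agent` and `Disallow` lines, strips `#` comments, and does not decide which group a given crawler should obey. The function name `parse_groups` is made up for illustration.

```python
EXAMPLE = """\
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
"""


def parse_groups(text):
    """Map each lower-cased user-agent string to its list of Disallow paths."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                # A User-agent line after rules starts a new group.
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow" and current_agents:
            seen_rule = True
            for agent in current_agents:
                groups[agent].append(value)
    return groups


for agent, paths in parse_groups(EXAMPLE).items():
    print(agent, paths)
```

Consecutive `User-agent` lines share the rules that follow them, which is how the earlier BadBot/Googlebot record applies one `Disallow` to two robots.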
In addition, there are some nonstandard extensions:
Visit
and use Ctrl+F to search for "Nonstandard extensions"
(The following has yet to be verified){
# Robot version number
Robot-version: Version 2.0
# Allow visits between 00:00 and 07:00 (local server time?)
Visit-time: 0000-0700
# Limit the rate at which URLs are fetched: between 02:00 and 06:00, at most 10 requests per minute.
Request-rate: 10/1m 0200-0600
}
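As a rough illustration of how a crawler might interpret the two unverified fields above, the sketch below parses a `HHMM-HHMM` window and a `count/period` rate. The formats and the unit suffixes (`s`, `m`, `h`) are assumptions read off the examples, the time zone is left open just as in the notes above, and the function names are hypothetical.

```python
from datetime import time

UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed unit suffixes


def parse_window(spec):
    """Parse 'HHMM-HHMM' into a (start, end) pair of datetime.time values."""
    start, end = spec.strip().split("-")
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))


def in_window(moment, window):
    """Return True if `moment` lies inside the window (inclusive)."""
    start, end = window
    if start <= end:
        return start <= moment <= end
    return moment >= start or moment <= end  # window wraps past midnight


def parse_rate(spec):
    """Parse '10/1m' into (requests, seconds); a bare number means seconds."""
    count, period = spec.strip().split("/")
    unit = period[-1]
    if unit.isdigit():
        return int(count), int(period)
    return int(count), int(period[:-1]) * UNITS[unit]


# "Request-rate: 10/1m 0200-0600" carries a rate followed by a window.
rate_spec, window_spec = "10/1m 0200-0600".split()
print(parse_rate(rate_spec), parse_window(window_spec))
print(in_window(time(3, 30), parse_window("0000-0700")))
```

A crawler honoring these fields would skip requests outside the window and space its requests so that the count per period is not exceeded.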
- The robots meta tag
Format (from the documentation):
...
……
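Purpose (from the documentation): You can use a special HTML tag to tell robots not to index the content of a page, and/or not scan it for links to follow.
# The NAME attribute must be ROBOTS
# The possible values for CONTENT="" are: index, noindex, follow, nofollow
# index tells the crawler that it may index the page; follow allows the crawler to continue along the links on the page
# CONTENT defaults to INDEX and FOLLOW. From the original: the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:
# The other three combinations are:
# Combination 1
# Combination 2
# Combination 3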
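With four values and a default of INDEX, FOLLOW, the three non-default pairings that follow are NOINDEX, FOLLOW; INDEX, NOFOLLOW; and NOINDEX, NOFOLLOW. The sketch below shows one way a crawler could read the tag using Python's standard `html.parser`. The class and function names are made up for illustration, and a real crawler may recognize more values than these four.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.index = True   # default: INDEX
        self.follow = True  # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.index, parser.follow


for content in ("NOINDEX, FOLLOW", "INDEX, NOFOLLOW", "NOINDEX, NOFOLLOW"):
    tag = '<META NAME="ROBOTS" CONTENT="%s">' % content
    print(content, robots_directives(tag))
```

A page without the tag yields `(True, True)`, which matches the documented default.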
There are two important considerations when using the robots tag:
- robots can ignore your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrives at your undesired page.
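(The following has yet to be verified){
# Google's search engine fully supports the meta tag values above and adds an archive value, which indicates whether a cached snapshot of the page may be kept
# For example, the following allows all search engines to keep cached snapshots of this site's pages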
}
The FAQ links to a "more about robots" page, but that page now returns 404.
The documentation mentions:
“
There is a Web robots home page on:
”
Differences and relationship between robots.txt and the robots meta tag
The robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Also, the robots meta tag only works on HTML pages, not images, text files, PDF documents, etc. Finally, if the pages/resources have already been excluded by a robots.txt file, then they will not be crawled and the meta tags and headers will have no effect. This can have the counterintuitive effect that a web address is indexed by a search engine such as Google: if the engine honors the site's robots.txt, it stops crawling and never receives the advice not to index the page.
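The ordering described above can be sketched as a small decision function: robots.txt is consulted before any request, and the meta tag is only seen if the page is actually fetched. This is an illustration under stated assumptions, not how any particular search engine works. `crawl_decision`, `fetch_page`, and the returned labels are hypothetical, and the regular expression is a deliberately crude noindex check.

```python
import re
from urllib.robotparser import RobotFileParser

# Crude check for <meta name="robots" content="...noindex...">.
META_NOINDEX = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def crawl_decision(robots_txt, agent, path, fetch_page):
    """Return 'blocked', 'noindex', or 'index' for one URL path.

    `fetch_page` is a caller-supplied function that returns the HTML for
    a path; it is only called when robots.txt permits the request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        # The page is never requested, so its meta tag is never read.
        return "blocked"
    html = fetch_page(path)
    if META_NOINDEX.search(html):
        return "noindex"
    return "index"


def demo_fetch(path):
    """Stand-in for a real HTTP request (hypothetical page content)."""
    return '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'


print(crawl_decision("User-agent: *\nDisallow: /private/\n", "ExampleBot", "/private/a.html", demo_fetch))
print(crawl_decision("User-agent: *\nDisallow:\n", "ExampleBot", "/a.html", demo_fetch))
```

In the "blocked" case the noindex advice inside the page is never received, which is the counterintuitive situation the paragraph describes.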
From the ITPUB blog, link: http://blog.itpub.net/29773961/viewspace-1377335/. If you repost this article, please credit the source; otherwise legal action will be pursued.