Parsing the robots protocol and the meta tag

Posted by 神諭丶 on 2014-12-23


  •   The robots.txt protocol


 

# The file name must be robots.txt, not Robots.TXT

# For all robots: allow access to everything

# The original text describes this as "To allow all robots complete access"

User-agent: *

Disallow:

# The docs also note: "or just create an empty "/robots.txt" file, or don't use one at all"

 

# For all robots: disallow access to everything

User-agent: *

Disallow: /

 

# For all robots: disallow access to certain parts

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

Disallow: /junk/

# For all robots: disallow access to one specific file

User-agent: *

Disallow: /directory/file.html
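These rules can be exercised with Python's standard urllib.robotparser module; the rules string and the test paths below are made-up examples matching the section above:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt combining the Disallow rules shown above
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /directory/file.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "/index.html"))           # → True (no rule matches)
print(rp.can_fetch("AnyBot", "/tmp/scratch.txt"))      # → False (under /tmp/)
print(rp.can_fetch("AnyBot", "/directory/file.html"))  # → False (single blocked file)
```

Note that `Disallow` matching is a simple path-prefix test, which is why `/tmp/scratch.txt` is caught by `Disallow: /tmp/`.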

 

# For two specific robots: disallow access to a particular directory

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

 

# Allow one specific robot to access everything (all other robots are blocked)

User-agent: Google

Disallow:

User-agent: *

Disallow: /
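A quick check of this allow-one, block-everyone-else pattern, again with urllib.robotparser (the bot name "SomeBot" is illustrative):

```python
from urllib.robotparser import RobotFileParser

# Allow one robot everything, block all others (the rules from the section above)
rules = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Google", "/any/page.html"))   # → True  (the allowed robot)
print(rp.can_fetch("SomeBot", "/any/page.html"))  # → False (everyone else)
```

An empty `Disallow:` means "nothing is disallowed", so the `Google` group permits every path.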

 

# Rules for specific robots

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

  •   Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services

Disallow: /private/          # disallow this directory

 

User-agent: googlebot-news   # only the news service

Disallow: /                  # disallow everything

 

User-agent: *                # any robot

Disallow: /something/        # disallow this directory

 

Beyond these, some nonstandard extensions also exist:

Visit the page

and use Ctrl+F to search for "Nonstandard extensions"

 

(The following is yet to be verified) {

# Robots protocol version number

Robot-version: Version 2.0

 

# Allow visits only between 00:00 and 07:00 (local server time?)

Visit-time: 0000-0700

 

# Limit the URL fetch rate: between 02:00 and 06:00, at most 10 requests per minute.

Request-rate: 10/1m 0200-0600

}
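Of these nonstandard fields, Python's urllib.robotparser (3.6+) recognizes Crawl-delay and Request-rate, but only in plain numeric form: unit suffixes such as 1m and time windows such as 0200-0600 are silently ignored, and Visit-time and Robot-version are not parsed at all. A sketch using the numeric forms:

```python
from urllib.robotparser import RobotFileParser

# Nonstandard fields in the bare numeric form that robotparser accepts;
# "Request-rate: 10/60" means at most 10 requests per 60 seconds
rules = """\
User-agent: *
Crawl-delay: 5
Request-rate: 10/60
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("*"))                # → 5 (seconds between requests)
rate = rp.request_rate("*")
print(rate.requests, rate.seconds)        # → 10 60
print(rp.can_fetch("AnyBot", "/private/x"))  # → False
```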

 

 

 

 


  •   The meta tag


Format (from the docs):

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

(the tag is placed in the <head> section of the page)

 

 

Purpose (from the docs): You can use a special HTML <meta> tag to tell robots not to index the content of a page, and/or not scan it for links to follow

 

# The NAME attribute must be ROBOTS

# The candidate values for CONTENT="" are: index, noindex, follow, nofollow

# index tells the crawler to index this page; follow allows the crawler to continue along the links on this page

 

# The default value of CONTENT is "INDEX,FOLLOW"; per the original text, "the default is 'INDEX,FOLLOW', so there's no need to spell that out. That leaves:"

# The remaining three combinations are:

# Combination 1: <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

# Combination 2: <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

 

# Combination 3: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
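A sketch of how a crawler might collect these directives, using the standard html.parser module (the HTML snippet and the class name are made up for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and a.get("name", "").lower() == "robots":
            for token in a.get("content", "").split(","):
                self.directives.add(token.strip().lower())

html = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head><body></body></html>'
p = RobotsMetaParser()
p.feed(html)
print(p.directives)  # a set containing 'noindex' and 'follow' (order arbitrary)
```

Lowercasing both the attribute value and the tokens makes the check case-insensitive, matching how the NAME and CONTENT values are treated by robots.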

 

There are two important considerations when using the robots tag:

  •   robots can ignore your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.

  •   the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrives at your undesired page.

 

(The following is yet to be verified) {

# Google's search engine fully supports the meta tags above, and additionally accepts an archive value that controls whether a cached snapshot of the page may be kept

# The following indicates that all search engines are allowed to keep a snapshot of this site's pages

}

 

 

The FAQ contains a "more about robots" page, but it now returns a 404.

It mentions:

There is a Web robots home page on:

 

On the difference and relationship between robots.txt and the meta tag

the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Also, the robots meta tag only works on HTML pages, not images, text files, PDF documents, etc. Finally, if the pages/resources have already been excluded by a robots.txt file, then they will not be crawled and the meta tags and headers will have no effect. This can have the counterintuitive effect that a web address is indexed by a search engine such as Google if it honors the site's robots.txt, stops crawling and never receives the advice not to index the site.
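That ordering can be sketched as a crawler decision flow; all function names, rules, and the page below are illustrative assumptions, not any real crawler's API:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

def extract_robots_meta(html):
    """Return the set of directives found in <meta name="robots"> tags."""
    class P(HTMLParser):
        directives = set()
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name", "").lower() == "robots":
                P.directives |= {t.strip().lower() for t in a.get("content", "").split(",")}
    p = P()
    p.feed(html)
    return p.directives

def should_index(robots_txt_lines, useragent, path, fetch_page):
    """robots.txt is checked BEFORE the request; the meta tag only AFTER the fetch."""
    rp = RobotFileParser()
    rp.parse(robots_txt_lines)
    if not rp.can_fetch(useragent, path):
        return "not crawled (robots.txt)"   # the page's meta tag is never seen
    html = fetch_page(path)                 # only now is the page requested
    if "noindex" in extract_robots_meta(html):
        return "crawled but not indexed (meta)"
    return "crawled and indexed"

# A page that asks not to be indexed, sitting behind a robots.txt that blocks it:
page = '<head><meta name="robots" content="noindex"></head>'
rules = ["User-agent: *", "Disallow: /secret/"]
print(should_index(rules, "AnyBot", "/secret/x.html", lambda p: page))
# → not crawled (robots.txt)
```

The noindex advice is never delivered, which is exactly the counterintuitive case described above: a blocked URL can still end up listed by a search engine that honors robots.txt.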

From "ITPUB Blog", link: http://blog.itpub.net/29773961/viewspace-1377335/. Please credit the source when republishing; otherwise legal liability may be pursued.
