Demystifying Regular Expression Syntax (repost)


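Regular expressions (REs) are often mistaken for an arcane language that only a few people understand. On the surface they do look chaotic, and if you do not know the syntax, an expression looks like a pile of random characters. In fact, regular expressions are quite simple and easy to understand. After reading this article you will be familiar with their general syntax.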

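Support across many platforms

Regular expressions were first proposed by the mathematician Stephen Kleene in 1956, building on his work on regular sets and finite automata. As a complete syntax they were used for pattern matching on characters, and they were later adopted in information technology. Since then, regular expressions have gone through several stages of development, and the current standard has been ratified by ISO (the International Organization for Standardization) and defined by The Open Group.

Regular expressions are not a dedicated language but a standard for finding and replacing text in a file or a string. The standard comes in two flavors: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the BRE functionality plus some additional concepts.

Many programs use regular expressions, including xsh, egrep, sed, vi, and other programs on the UNIX platform. They have also been adopted by many languages, such as HTML and XML, usually as only a subset of the full standard.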

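More common than you think

As regular expressions have been ported to cross-platform programming languages, their functionality has become more complete and their use more widespread. Search engines use them, e-mail programs use them, and even if you are not a UNIX programmer you can use regular expressions to simplify your programs and shorten your development time.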

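Regular expressions 101

Much of the syntax of regular expressions will look familiar even if you have never studied it, because the wildcards you already know are one structural type of RE: the repetition operators. Let us first look at the most common basic syntax types of the ERE standard. To provide examples with a specific purpose, I will use several different programs.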

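Character matching

The key to a regular expression is specifying what you want to search for and match. Without that concept, REs would be useless.

Every expression contains an instruction describing what to look for, as shown in Table A.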

Table A: Character-matching regular expressions

| Operator | Explanation | Example | Result |
| --- | --- | --- | --- |
| . | Match any one character | grep '.ord' sample.txt | Will match "ford", "lord", "2ord", etc. in the file sample.txt |
| [ ] | Match any one character listed between the brackets | grep '[cng]ord' sample.txt | Will match only "cord", "nord", and "gord" |
| [^ ] | Match any one character not listed between the brackets | grep '[^cn]ord' sample.txt | Will match "lord", "2ord", etc. but not "cord" or "nord" |
| [ ] with a range | Match any one character in the listed ranges | grep '[a-zA-Z]ord' sample.txt | Will match "aord", "bord", "Aord", "Bord", etc. |
| [^ ] with a range | Match any one character outside the listed range | grep '[^0-9]ord' sample.txt | Will match "Aord", "aord", etc. but not "2ord", etc. |
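The rows of Table A can be tried out without a sample file. The following is a minimal sketch in Python, assuming the standard `re` module as a stand-in for grep; the helper `matching` and the word list are hypothetical names chosen for illustration. Python's syntax is Perl-style rather than POSIX ERE, but these particular constructs behave the same way.

```python
import re

# Hypothetical candidates standing in for the lines of sample.txt.
words = ["ford", "lord", "2ord", "cord", "nord", "gord", "Aord", "ord"]

def matching(pattern, candidates):
    """Return the candidates that contain a match for the pattern."""
    return [w for w in candidates if re.search(pattern, w)]

any_char = matching(r".ord", words)         # any single character before "ord"
listed = matching(r"[cng]ord", words)       # only c, n, or g before "ord"
not_listed = matching(r"[^cn]ord", words)   # any character except c or n
letters = matching(r"[a-zA-Z]ord", words)   # any ASCII letter
non_digit = matching(r"[^0-9]ord", words)   # any character except a digit

for name, result in [("any_char", any_char), ("listed", listed),
                     ("not_listed", not_listed), ("letters", letters),
                     ("non_digit", non_digit)]:
    print(name, result)
```

Note that the bare word "ord" is never returned: every pattern here requires exactly one character in front of "ord".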

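Repetition operators

Repetition operators, or quantifiers, describe how many times a particular element should be looked for. They are usually combined with the character-matching syntax to find runs of several characters; see Table B.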

Table B: Regular expression repetition operators

| Operator | Explanation | Example | Result |
| --- | --- | --- | --- |
| ? | Match the preceding element zero or one time | egrep '.?erd' sample.txt | Will match "berd", "herd", etc. and "erd" |
| * | Match the preceding element zero or more times | egrep 'n.*rd' sample.txt | Will match "nerd", "nrd", "neard", etc. |
| + | Match the preceding element one or more times | egrep '[n]+erd' sample.txt | Will match "nerd", "nnerd", etc., but not "erd" |
| {n} | Match the preceding element exactly n times | egrep '[a-z]{2}erd' sample.txt | Will match "cherd", "blerd", etc. but not "nerd" or "erd"; in "buzzerd" only the trailing "zzerd" is matched |
| {n,} | Match the preceding element at least n times | egrep '.{2,}erd' sample.txt | Will match "cherd" and "buzzerd", but not "nerd" |
| {n,N} | Match the preceding element at least n times, but not more than N times | egrep 'n[e]{1,2}rd' sample.txt | Will match "nerd" and "neerd" |
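The quantifiers in Table B can be checked in the same way. This is a sketch only, assuming Python's `re` module; the helper `whole` is a hypothetical name. It uses `re.fullmatch` so that each word has to match from start to end, which makes the counting behavior of each operator easier to see than grep's "match anywhere in the line" behavior.

```python
import re

# Hypothetical candidates standing in for the words of sample.txt.
words = ["erd", "nrd", "nerd", "nnerd", "neerd", "neard", "cherd", "buzzerd"]

def whole(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

optional = whole(r".?erd", words)          # zero or one character, then "erd"
any_run = whole(r"n.*rd", words)           # "n", anything (or nothing), "rd"
one_or_more = whole(r"n+erd", words)       # at least one "n", then "erd"
exactly_two = whole(r"[a-z]{2}erd", words) # exactly two letters, then "erd"
two_or_more = whole(r".{2,}erd", words)    # at least two characters, then "erd"
one_to_two = whole(r"ne{1,2}rd", words)    # one or two "e" between "n" and "rd"

# Unanchored search still finds a match inside a longer word.
inside = re.search(r"[a-z]{2}erd", "buzzerd").group()

print(optional, any_run, one_or_more)
print(exactly_two, two_or_more, one_to_two, inside)
```

The last line shows why an unanchored `{2}` search still reports a hit for "buzzerd": the pattern is satisfied by the substring "zzerd".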

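Anchors

Anchors specify where in the text a pattern has to match, as shown in Table C. They make it easier to find common combinations of characters. For the examples I use the vi line-editor command :s, which stands for substitute. The basic syntax of this command is: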

s/pattern_to_match/pattern_to_substitute/

Table C: Regular expression anchors

| Operator | Explanation | Example | Result |
| --- | --- | --- | --- |
| ^ | Match at the beginning of a line | s/^/blah / | Inserts "blah " at the beginning of the line |
| $ | Match at the end of a line | s/$/ blah/ | Inserts " blah" at the end of the line |
| \< | Match at the beginning of a word | s/\</blah/ | Inserts "blah" at the beginning of the word |
| \< | Match at the beginning of a word | egrep '\<blah' sample.txt | Matches "blahfield", etc. |
| \> | Match at the end of a word | s/\>/blah/ | Inserts "blah" at the end of the word |
| \> | Match at the end of a word | egrep 'blah\>' sample.txt | Matches "soupblah", etc. |
| \b | Match at the beginning or end of a word | egrep '\bblah' sample.txt | Matches "blahcake"; use 'blah\b' to match "countblah" |
| \B | Match where there is no word boundary (inside a word) | egrep '\Bblah' sample.txt | Matches "sublahper", etc. |
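The anchors can be sketched in Python as well, assuming the standard `re` module: `re.sub` plays the role of the vi :s command, and the variable names are hypothetical. Python's `re` has no `\<` or `\>`, so `\b` is used for both the start and the end of a word.

```python
import re

line = "soup of the day"

# Anchors match positions, not characters, so substituting them inserts text.
at_start = re.sub(r"^", "blah ", line)
at_end = re.sub(r"$", " blah", line)

# Hypothetical words that contain "blah" in different positions.
words = ["blahcake", "countblah", "sublahper"]

starts_word = [w for w in words if re.search(r"\bblah", w)]    # boundary before
ends_word = [w for w in words if re.search(r"blah\b", w)]      # boundary after
inside_word = [w for w in words if re.search(r"\Bblah\B", w)]  # no boundary on either side

print(at_start)
print(at_end)
print(starts_word, ends_word, inside_word)
```

Placing `\b` before or after the literal text decides which end of the word is tested, which is why a single pattern does not cover both "blahcake" and "countblah".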

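Alternation

Another convenient feature of REs is alternation. In effect it works as an OR statement and is written with the | symbol. The following statement returns the lines in sample.txt that contain "nerd" or "merd":

egrep '(n|m)erd' sample.txt

Alternation is very powerful, especially when you are looking for different spellings of a word, but in this case you can get the same result with the following example:

egrep '[nm]erd' sample.txt

Alternation shows its real value when you combine it with the more advanced features of REs.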
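A short sketch of that difference, assuming Python's `re` module and a hypothetical word list: for single characters, alternation and a bracket expression give the same result, but only alternation can choose between whole words and then be combined with a quantifier.

```python
import re

# Hypothetical candidates for illustration.
words = ["nerd", "merd", "herd", "nerds", "geek", "geeks"]

# Single-character choice: alternation and a bracket expression agree.
with_alternation = [w for w in words if re.search(r"(n|m)erd", w)]
with_brackets = [w for w in words if re.search(r"[nm]erd", w)]

# Whole-word choice plus an optional plural "s": not expressible with brackets.
singular_or_plural = [w for w in words if re.fullmatch(r"(nerd|geek)s?", w)]

print(with_alternation)
print(with_brackets)
print(singular_or_plural)
```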

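Some reserved characters

The last important feature of REs is reserved characters (also called special characters). For example, suppose you want to find the strings "ne*rd" and "ni*rd". The pattern "n[ei]*rd" matches "neeeeerd" and "nieieierd", which are not the strings you are looking for. Because '*' (asterisk) is a reserved character, you must escape it by putting a backslash in front of it: "n[ei]\*rd". The other reserved characters include: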

^ (caret)
. (period)
[ (left bracket)
$ (dollar sign)
( (left parenthesis)
) (right parenthesis)
| (pipe)
* (asterisk)
+ (plus symbol)
? (question mark)
{ (left curly bracket, or left brace)
\ (backslash)
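The effect of escaping can be sketched as follows, assuming Python's `re` module; the candidate list is hypothetical. `re.escape` is a standard-library helper that adds the backslashes for you when the whole string is meant literally.

```python
import re

# Hypothetical candidates: two contain a literal asterisk, two do not.
candidates = ["ne*rd", "ni*rd", "neeeeerd", "nieieierd"]

# Unescaped, "*" is a quantifier applied to the bracket expression.
unescaped = [s for s in candidates if re.fullmatch(r"n[ei]*rd", s)]

# Escaped, "\*" stands for the asterisk character itself.
escaped = [s for s in candidates if re.fullmatch(r"n[ei]\*rd", s)]

# re.escape builds a pattern that matches its argument literally.
literal_pattern = re.escape("ne*rd")

print(unescaped)
print(escaped)
print(literal_pattern)
```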

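Once you include these characters in your searches, REs undoubtedly become very hard to read. The following eregi search code in PHP, for example, is quite hard to read.

eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$",$sendto)

As you can see, the intent of the code is hard to grasp. But if you overlook the reserved characters, you will often misread what the code means.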
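One way to make such a pattern easier to read is to spread it over several commented lines. The sketch below assumes Python's `re` module with the `re.VERBOSE` flag, and assumes that the dots inside the groups are meant as escaped literal dots; `ADDRESS` and `looks_like_address` are hypothetical names. `re.IGNORECASE` mirrors the case-insensitive behavior of eregi. This is an illustration of the syntax, not a complete e-mail validator.

```python
import re

ADDRESS = re.compile(
    r"""
    ^[_a-z0-9-]+         # local part: letters, digits, underscore, hyphen
    (\.[_a-z0-9-]+)*     # further local-part pieces, each after a literal dot
    @                    # the at sign
    [a-z0-9-]+           # first label of the host name
    (\.[a-z0-9-]+)*      # further labels, each after a literal dot
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

def looks_like_address(text):
    """Return True if the text matches the pattern above."""
    return ADDRESS.match(text) is not None

print(looks_like_address("John.Doe@example.com"))
print(looks_like_address("john..doe@example.com"))
print(looks_like_address("no-at-sign.example.com"))
```

With the dots left unescaped, each "." would match any character, so the pattern would accept far more strings than its author intended.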

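Summary

In this article we have demystified regular expressions and listed the general syntax of the ERE standard. If you want to read the complete description of the rules, see The Open Group's regular expression specification. You are welcome to post your questions or comments in the discussion area.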
