Let's Learn the Tidy Extension in PHP

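Many readers have probably never heard of this extension, and no, it has nothing to do with teddy bears. Tidy is an extension for working with HTML, used mainly to format and display content in HTML, XHTML, and XML.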

About the Tidy Library

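The Tidy extension ships with PHP. It can be installed by passing --with-tidy when compiling PHP, or built afterwards from the source in the tidy directory under ext/ in the PHP source package. The extension also depends on the tidy library, which has to be installed on the operating system. On CentOS, yum install libtidy-devel is enough.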
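After installation it helps to confirm that PHP actually sees the extension. The following is a minimal sketch; tidyAvailable() is a hypothetical helper name introduced here for illustration, while extension_loaded() and tidy_get_release() are standard PHP functions.

```php
<?php
// tidyAvailable() is a hypothetical helper, not part of the tidy extension.
// It reports whether the extension is loaded and the tidy class exists.
function tidyAvailable(): bool
{
    return extension_loaded('tidy') && class_exists('tidy');
}

if (tidyAvailable()) {
    // tidy_get_release() returns the release date of the underlying library.
    echo 'tidy extension loaded, library release: ', tidy_get_release(), PHP_EOL;
} else {
    echo 'tidy extension is not loaded; check the PHP build or php.ini', PHP_EOL;
}
```

If the second message appears, the extension was either not compiled in or not enabled in php.ini.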

Formatting with Tidy

First, let's see how to format a piece of HTML with the Tidy extension.

$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;

$tidy = new Tidy();
$config = [
        'indent'=>true,
        'output-xhtml'=>true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();

echo $tidy, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

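The HTML in $content is unformatted and malformed: the <p> is never closed and there is a stray </i>. After instantiating a Tidy object, calling parseString() and then cleanRepair(), we can print the $tidy object directly to get the formatted HTML. The result is well formed, from the xmlns attribute down to the indentation.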

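The example passes two arguments to parseString(). The first is the string to format. The second is the formatting configuration, given here as an array whose keys must be options defined by the Tidy library. The available options can be looked up via the second link at the end of this article. Only two are set here: indent controls whether block-level elements are indented, and output-xhtml controls whether the output is written as XHTML.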

cleanRepair() runs the clean-and-repair pass over the parsed content, which is the cleanup step of formatting.

Note that the test code prints the Tidy object directly, which means the class implements __toString(). Dumping the object shows what it actually contains.

var_dump($tidy);
// object(tidy)#1 (2) {
//     ["errorBuffer"]=>
//     string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
//   line 1 column 70 - Warning: discarding unexpected </i>"
//     ["value"]=>
//     string(195) "<html xmlns="http://www.w3.org/1999/xhtml">
//     <head>
//       <title>
//         test
//       </title>
//     </head>
//     <body>
//       <p>
//         error<br />
//         another line
//       </p>
//     </body>
//   </html>"
//   }

Retrieving Document Information

var_dump($tidy->isXml()); // bool(false)

var_dump($tidy->isXhtml()); // bool(false)

var_dump($tidy->getStatus()); // int(1)

var_dump($tidy->getRelease());  // string(10) "2017/11/25"

var_dump($tidy->getHtmlVer()); // int(500)

Methods on the Tidy object expose information about the document being processed, such as whether it is XML or XHTML.

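getStatus() returns the status of the Tidy object. The value 1 here means that warnings or accessibility errors were raised. The dump of the Tidy object above confirms this: its errorBuffer property contains warning messages.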
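For readability, the status code can be mapped to a label. This is a small sketch: statusLabel() is a hypothetical helper, and the meanings of 0 and 2 are taken from the PHP manual's description of getStatus() rather than from the output above.

```php
<?php
// statusLabel() is a hypothetical helper, not part of the tidy extension.
// It maps the integer returned by Tidy::getStatus() to a readable label.
function statusLabel(int $status): string
{
    switch ($status) {
        case 0:
            return 'no errors or warnings';
        case 1:
            return 'warnings or accessibility errors';
        case 2:
            return 'errors';
        default:
            return 'unknown status';
    }
}

echo statusLabel(1), PHP_EOL; // warnings or accessibility errors
```

With a real document this would be called as statusLabel($tidy->getStatus()).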

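getRelease() returns the release of the Tidy library in use, that is, the tidy library installed on the operating system. getHtmlVer() returns the detected HTML version. No documentation was found that explains the value 500 shown here, so its exact meaning is unclear.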

Beyond the information above, we can also read back the options set in $config, along with their documentation.

var_dump($tidy->getOpt('indent')); // int(1)

var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options. "

getOpt() takes one argument, the name of the option to look up. If the option was not set in $config, the default value is returned. getOptDoc() is a nice touch: it returns the documentation text for a given option.

Finally, here are some more practical methods that work directly with nodes.

echo $tidy->head(), PHP_EOL;
// <head>
//   <title>
//   test
// </title>
// </head>

$body = $tidy->body();

var_dump($body);
// object(tidyNode)#2 (9) {
//     ["value"]=>
//     string(60) "<body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>"
//     ["name"]=>
//     string(4) "body"
//     ["type"]=>
//     int(5)
//     ["line"]=>
//     int(1)
//     ["column"]=>
//     int(40)
//     ["proprietary"]=>
//     bool(false)
//     ["id"]=>
//     int(16)
//     ["attribute"]=>
//     NULL
//     ["child"]=>
//     array(1) {
//       [0]=>
//       object(tidyNode)#3 (9) {
//         ["value"]=>
//         string(37) "<p>
// ………………
// ………………

echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

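Little explanation is needed here: head() returns the <head> element and its content, body() and html() return their corresponding elements, and root() returns the root node, which can be treated as the whole document.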

Each of these methods actually returns a TidyNode object, which is covered in more detail later in this article.

Converting Directly to a String

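All the code above is based on parseString(), which returns only a boolean success flag rather than the formatted content. To get the formatted content we have to use the object as a string or call root(). There is, however, another method that returns the formatted string directly.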

$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);

echo $repair, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

repairString() takes the same parameters as parseString(). The only difference is that it returns the repaired string instead of keeping the result inside the Tidy object.
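The same operation is also available as the procedural function tidy_repair_string(). The sketch below wraps it in a hypothetical helper, repairHtml(), which returns null when the extension is missing or the repair fails. The 'utf8' encoding argument is an assumption about the input, not something the original example specifies.

```php
<?php
// repairHtml() is a hypothetical wrapper, not part of the tidy extension.
function repairHtml(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // tidy is not available in this PHP build
    }
    $config = ['indent' => true, 'output-xhtml' => true];
    // Third argument: the encoding used for input and output (assumed UTF-8).
    $result = tidy_repair_string($html, $config, 'utf8');
    return $result === false ? null : $result;
}

$fixed = repairHtml('<p>error<br>another line</i>');
echo $fixed ?? 'tidy extension not loaded', PHP_EOL;
```

Returning null rather than false lets callers fall back with the ?? operator, as in the last line.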

Conversion Error Messages

In the very first test, the var_dump() of the Tidy object already showed error messages in the errorBuffer property. This time, let's try an HTML fragment with more problems.

$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();

echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element

$tidy->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!

This test uses a new method, diagnose(). It runs diagnostic tests on the document and appends further information about it to the errorBuffer property.
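Since errorBuffer is a plain string, it may be convenient to split it into structured entries. The following is a minimal sketch; parseErrorBuffer() is a hypothetical helper, and it assumes the "line N column M - Level: message" layout seen in the output above. Lines without a position, such as the summary added by diagnose(), are skipped.

```php
<?php
// parseErrorBuffer() is a hypothetical helper, not part of the tidy extension.
// It assumes each positioned message looks like:
//   line 4 column 1 - Warning: inserting implicit <body>
function parseErrorBuffer(string $buffer): array
{
    $entries = [];
    foreach (preg_split('/\R/', $buffer) as $line) {
        // Lines without a "line N column M" prefix do not match and are skipped.
        if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($line), $m)) {
            $entries[] = [
                'line'    => (int) $m[1],
                'column'  => (int) $m[2],
                'level'   => $m[3],
                'message' => $m[4],
            ];
        }
    }
    return $entries;
}

// Sample text copied from the output shown earlier.
$sample = "line 4 column 1 - Warning: inserting implicit <body>\n"
        . "line 4 column 1 - Warning: inserting missing 'title' element\n"
        . "Info: Document content looks like XHTML 1.0 Strict";

$entries = parseErrorBuffer($sample);
echo count($entries), PHP_EOL;      // 2
echo $entries[0]['level'], PHP_EOL; // Warning
```

With a live object, the call would be parseErrorBuffer($tidy->errorBuffer).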

Working with TidyNode

As mentioned earlier, head(), html(), body(), and root() all return a TidyNode object. So what is special about this object?

$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
  /* JSTE code */
  alert('Hello World');
#>
</head>
<body>

<?php
  // PHP code
  echo 'hello world!';
?>

<%
  /* ASP code */
  response.write("Hello World!")
%>

<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;

$tidy = new Tidy();
$tidy->parseString($html);

$tidyNode = $tidy->html();

showNodes($tidyNode);

function showNodes($node){

    if($node->isComment()){
        echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isText()){
        echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isAsp()){
        echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isHtml()){
        echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isPhp()){
        echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isJste()){
        echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;
    }

    if($node->name){
        // getParent()
        if($node->getParent()){
            echo '&&&&&&&& ', $node->name ,' getParent is : ', $node->getParent()->name, PHP_EOL;
        }

        // hasSiblings
        echo '^^^^^^^^ ', $node->name, ' has siblings is : ';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }

    if($node->hasChildren()){
        foreach($node->child as $child){
            showNodes($child);
        }
    }
}

// ………………
// ………………
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
//   /* JSTE code */
//   alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ………………
// ………………
// ++++++++
// This is Asp Script :"<%
//   /* ASP code */
//   response.write("Hello World!")
// %>" 
// ………………
// ………………

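I won't walk through every step and function of this code one by one. As the code shows, a TidyNode object can tell us about the structure around a node, such as whether it has children or siblings. It can also tell us what kind of node it is: a comment, text, JSTE code, PHP code, ASP code, and so on. I don't know how it strikes you, but I find this quite interesting, especially the methods that detect PHP code.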
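The same recursive walk can be used to collect nodes of one kind instead of printing them. The sketch below gathers the text nodes of a document. collectTextNodes() is a hypothetical helper and the sample HTML is made up for illustration; it uses only the isText(), hasChildren(), value, and child members already seen in showNodes().

```php
<?php
// collectTextNodes() is a hypothetical helper, not part of the tidy extension.
// It walks a node tree depth-first and returns the trimmed value of every text node.
function collectTextNodes($node): array
{
    $texts = [];
    if ($node->isText()) {
        $texts[] = trim($node->value);
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $texts = array_merge($texts, collectTextNodes($child));
        }
    }
    return $texts;
}

// Only run the demo when the tidy extension is available.
if (extension_loaded('tidy')) {
    $tidy = new tidy();
    $tidy->parseString('<html><body><p>Hello</p><p>World</p></body></html>');
    $body = $tidy->body();
    if ($body !== null) {
        print_r(collectTextNodes($body));
    }
}
```

Swapping isText() for isComment() or isPhp() would collect those node types in the same way.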

Statistics Functions

Finally, let's look at some of the counting functions in the Tidy extension.

$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);

echo 'tidy access count: ', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count: ', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count: ', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count: ', tidy_warning_count($tidy), PHP_EOL;

// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6

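All of these functions return counts of problems found. tidy_access_count() is the number of accessibility warnings encountered and tidy_config_count() is the number of configuration errors. The other two, as their names suggest, return the number of errors and the number of warnings.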