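Convert HTML to XHTML and Clean Unnecessary Tags and Attributes
Introduction
This is a classic library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify exactly which tags and attributes are allowed in the output, and everything else is filtered out. You can use it to clean up the bloated HTML that Microsoft Word generates when a document is converted to HTML. You can also use it to clean HTML before posting it to a blog site; otherwise blog engines such as WordPress and b2evolution may reject it.
How it works
There are two classes: HtmlReader and HtmlWriter.
HtmlReader extends the well-known SgmlReader developed by Chris Lovett. While reading HTML, it skips every node that has a prefix. This filters out the hundreds of useless tags such as <o:p>, <o:Document>, and <st1:personname>, so the HTML you read contains only core HTML tags.
HtmlWriter extends XmlTextWriter, the standard XmlWriter implementation that produces XML. XHTML is essentially HTML in XML form. The familiar tags that are never closed in HTML, such as <img>, <br>, and <hr>, must be written as empty elements in XHTML: <img .. />, <br/>, and <hr/>. Because XHTML is ordinary XML, you can read an XHTML document with any XML parser, which also makes XPath queries possible.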
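To illustrate the XPath point, here is a minimal sketch that loads a small XHTML fragment with the standard XmlDocument class and selects the src attribute of every img element. The class name XhtmlQuery, the method ImageSources, and the sample markup are hypothetical and not part of the library. The sketch assumes the markup carries no default namespace.

```csharp
using System;
using System.Xml;

public static class XhtmlQuery
{
    // Hypothetical helper: returns the src attribute of every <img>
    // element in a well-formed XHTML fragment without a default namespace.
    public static string[] ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        XmlNodeList nodes = doc.SelectNodes("//img/@src");
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].Value;
        return result;
    }

    public static void Main()
    {
        string xhtml =
            "<div><p>Hello<br/></p><img src=\"a.png\" /><img src=\"b.png\" /></div>";

        foreach (string src in ImageSources(xhtml))
            Console.WriteLine(src); // prints a.png, then b.png
    }
}
```

If the document declares the XHTML namespace on its root element, the same query needs an XmlNamespaceManager and a prefixed path. That is one reason to prefer prefix-free, namespace-free output.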
HtmlReader
HtmlReader is very simple. Here is the complete class:

    using System.IO;
    using System.Xml;

    /// <summary>
    /// This class skips all nodes which have some
    /// kind of prefix. This trick does the job
    /// of cleaning up MS Word/Outlook HTML markup.
    /// </summary>
    public class HtmlReader : Sgml.SgmlReader
    {
        public HtmlReader( TextReader reader ) : base( )
        {
            base.InputStream = reader;
            base.DocType = "HTML";
        }

        public HtmlReader( string content ) : base( )
        {
            base.InputStream = new StringReader( content );
            base.DocType = "HTML";
        }

        public override bool Read()
        {
            bool status = base.Read();
            if( status )
            {
                if( base.NodeType == XmlNodeType.Element )
                {
                    // Got a node with a prefix. This must be one
                    // of those "<o:p>" tags or something similar.
                    // Skip this node entirely. We want prefix-less
                    // nodes so that the resulting XML requires
                    // no namespace.
                    if( base.Name.IndexOf(':') > 0 )
                        base.Skip();
                }
            }
            return status;
        }
    }
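HtmlWriter
This class is a bit trickier. It relies on the following techniques:
- Override WriteString and bypass the regular XML encoding. The text is encoded by hand to suit HTML.
- Override WriteStartElement to keep disallowed tags out of the output.
- Override WriteAttributes to drop unwanted attributes.
Let's go through the class part by part.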
Configurability
You can configure HtmlWriter by modifying the following section:

    public class HtmlWriter : XmlTextWriter
    {
        /// <summary>
        /// If set to true, the output is filtered using tag and
        /// attribute filtering, space reduction, etc.
        /// </summary>
        public bool FilterOutput = false;

        /// <summary>
        /// If true, consecutive spaces are reduced to one instance.
        /// </summary>
        public bool ReduceConsecutiveSpace = true;

        /// <summary>
        /// The tag names, in lower case, that are allowed in the output.
        /// </summary>
        public string [] AllowedTags = new string[] {
            "p", "b", "i", "u", "em", "big", "small", "div", "img",
            "span", "blockquote", "code", "pre", "br", "hr", "ul",
            "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt" };

        /// <summary>
        /// Any tag that is not allowed is replaced by this tag.
        /// Specify a tag which has the least impact on the output.
        /// </summary>
        public string ReplacementTag = "dd";

        /// <summary>
        /// New lines (\r\n) are replaced with a space, which saves
        /// space and makes the output compact.
        /// </summary>
        public bool RemoveNewlines = true;

        /// <summary>
        /// The attributes that are allowed.
        /// Any other attribute is discarded.
        /// </summary>
        public string [] AllowedAttributes = new string[] {
            "class", "href", "target", "border", "src",
            "align", "width", "height", "color", "size" };
    }
The WriteString method

    /// <summary>
    /// The reason why we are overriding this method is that we do
    /// not want the output to be XML-encoded for text inside
    /// attributes and inside elements. For example, every &nbsp;
    /// would be converted to &amp;nbsp; in the output. That does not
    /// suit HTML, where &nbsp; must be kept as it is.
    /// </summary>
    /// <param name="text">The text to write</param>
    public override void WriteString(string text)
    {
        // Change all non-breaking spaces to normal spaces.
        // Assumes the intent stated in this comment: U+00A0
        // becomes a plain space.
        text = text.Replace( "\u00A0", " " );

        // When you are reading an RSS feed and writing HTML,
        // these lines help remove the CDATA markers
        text = text.Replace( "<![CDATA[", "" );
        text = text.Replace( "]]>", "" );

        // Do some encoding of our own because we are going to
        // use WriteRaw, which won't do any of the necessary encoding
        text = text.Replace( "<", "&lt;" );
        text = text.Replace( ">", "&gt;" );
        text = text.Replace( "'", "&apos;" );
        text = text.Replace( "\"", "&quot;" );

        if( this.FilterOutput )
        {
            text = text.Trim();

            // Replace two consecutive spaces with one
            // in order to save horizontal width
            if( this.ReduceConsecutiveSpace )
                text = text.Replace( "  ", " " );
            if( this.RemoveNewlines )
                text = text.Replace( Environment.NewLine, " " );
        }

        base.WriteRaw( text );
    }
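The ReduceConsecutiveSpace and RemoveNewlines options can be made a little more robust. A single string.Replace pass over pairs of spaces only halves a long run, and Environment.NewLine matches only the current platform's line ending. The sketch below is a hypothetical stand-alone helper (TextCleaner.Clean is not part of HtmlWriter). It normalizes all three line-ending styles and collapses whitespace runs with a regular expression. It escapes only < and > to keep the example short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical helper, not part of the original HtmlWriter.
    public static string Clean(string text)
    {
        // Escape markup characters by hand, as one would before WriteRaw.
        text = text.Replace("<", "&lt;").Replace(">", "&gt;");

        // Treat every line-ending style as a space, independent of platform.
        text = text.Replace("\r\n", " ").Replace("\n", " ").Replace("\r", " ");

        // Collapse any run of spaces or tabs into a single space.
        text = Regex.Replace(text, "[ \t]+", " ");

        return text.Trim();
    }

    public static void Main()
    {
        // prints: a &lt;b&gt; c
        Console.WriteLine(Clean("  a  <b>\r\n   c  "));
    }
}
```

Whether collapsing is desirable depends on the content: text inside pre elements, for example, relies on its whitespace, so a real writer would probably skip this step there.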
WriteStartElement: applying tag filtering

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        if( this.FilterOutput )
        {
            bool canWrite = false;
            string tagLocalName = localName.ToLowerInvariant();
            foreach( string name in this.AllowedTags )
            {
                if( name == tagLocalName )
                {
                    canWrite = true;
                    break;
                }
            }

            // Disallowed tags are not dropped; they are written
            // as the configured replacement tag instead
            if( !canWrite )
                localName = this.ReplacementTag;
        }

        base.WriteStartElement(prefix, localName, ns);
    }
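The linear scan over AllowedTags is fine for a couple of dozen names. As an alternative sketch, the same decision can be made with a case-insensitive HashSet. The TagFilter class, its Resolve method, and the shortened tag list below are hypothetical and only illustrate the lookup; they are not part of the original class. Unlike the method above, this helper also lower-cases allowed names, since XHTML element names are lower case.

```csharp
using System;
using System.Collections.Generic;

public static class TagFilter
{
    // Shortened, illustrative tag list with case-insensitive lookup.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(
            new[] { "p", "b", "i", "a", "br", "img" },
            StringComparer.OrdinalIgnoreCase);

    // Returns the element name to write: the lower-cased original if it
    // is allowed, otherwise the replacement tag.
    public static string Resolve(string localName, string replacementTag)
    {
        return Allowed.Contains(localName)
            ? localName.ToLowerInvariant()
            : replacementTag;
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("IMG", "dd"));   // prints img
        Console.WriteLine(Resolve("table", "dd")); // prints dd
    }
}
```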
The WriteAttributes method: applying attribute filtering
The following fragment handles the attribute on which the reader is currently positioned:

    // Check whether the attribute is allowed
    bool canWrite = false;
    string attributeLocalName = reader.LocalName.ToLower();
    foreach( string name in this.AllowedAttributes )
    {
        if( name == attributeLocalName )
        {
            canWrite = true;
            break;
        }
    }

    // If allowed, write the attribute
    if( canWrite )
        this.WriteStartAttribute(reader.Prefix, attributeLocalName, reader.NamespaceURI);

    // Always read the attribute value so the reader advances,
    // but write it only when the attribute is allowed
    while (reader.ReadAttributeValue())
    {
        if (reader.NodeType == XmlNodeType.EntityReference)
        {
            if( canWrite ) this.WriteEntityRef(reader.Name);
            continue;
        }
        if( canWrite ) this.WriteString(reader.Value);
    }

    if( canWrite ) this.WriteEndAttribute();
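Conclusion
The sample application is a utility that you can use right away to clean HTML files. You can use these classes in tools, such as blogging tools, that need to post HTML to a web service.
Original article: http://www.codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a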