Convert HTML to XHTML and Clean Unnecessary Tags and Attributes

Posted by oschina on 2014-03-06

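　Introduction

　　This is a library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify which tags and attributes are allowed in the output, and everything else is filtered out. You can use it to clean the bloated HTML that Microsoft Word produces when a document is converted to HTML. You can also use it to tidy up HTML before posting to a blog site, so that blog engines such as WordPress or b2evolution do not reject it.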

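　How it works

　　There are two classes: HtmlReader and HtmlWriter.

　　HtmlReader extends the well-known SgmlReader written by Chris Lovett. While reading HTML, it skips every node that has a prefix. As a result, the hundreds of useless tags such as <o:p>, <o:Document>, and <st1:personname> are filtered out, and the HTML you read contains only core HTML tags.

　　HtmlWriter extends the regular XmlWriter (specifically XmlTextWriter), which produces XML. XHTML is essentially HTML in XML format. Familiar tags that have no closing tag, such as <img>, <br>, and <hr>, must be written as empty elements in XHTML: <img .. />, <br/>, and <hr/>. Because XHTML is well-formed XML, you can read an XHTML document with any XML parser, which in turn makes XPath queries possible.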
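
　　To illustrate the XPath point, here is a minimal sketch that is not part of the original library. It assumes the cleaned output is a well-formed XHTML fragment without namespaces; the names XhtmlXPathDemo and ImageSources are hypothetical and only System.Xml is used.

```csharp
using System.Xml;

public static class XhtmlXPathDemo
{
    // Returns the src attribute of every <img> element in a
    // well-formed, namespace-free XHTML fragment, in document order.
    public static string[] ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        // XPath query: the src attribute of all img elements, at any depth
        XmlNodeList nodes = doc.SelectNodes("//img/@src");

        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].Value;
        return result;
    }
}
```

　　Calling XhtmlXPathDemo.ImageSources on a fragment such as <div><img src="a.png" /><hr/><img src="b.png" /></div> yields the two values a.png and b.png. The same call on raw HTML with an unclosed <img> or <hr> would throw, which is why the conversion to XHTML comes first.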

　HtmlReader

　　HtmlReader is simple. Here is the complete class:

using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes which have some
/// kind of prefix. This trick does the job 
/// to clean up MS Word/Outlook HTML markups.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }
    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }
    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with prefix. This must be one
                // of those "<o:p>" or something else.
                // Skip this node entirely. We want prefix
                // less nodes so that the resultant XML 
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
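
　　The class above needs the SgmlReader assembly. To try the prefix-skipping idea without it, the following sketch applies the same Read override to the framework's XmlTextReader. This is an illustration under assumptions, not the library's code: the input must already be well-formed XML, namespace processing is switched off so that undeclared prefixes such as o: do not raise errors, and the names PrefixSkippingReader and PrefixSkipDemo are hypothetical. The while loop is a defensive addition of this sketch so that a prefixed element immediately following a skipped one is skipped too.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public class PrefixSkippingReader : XmlTextReader
{
    public PrefixSkippingReader(TextReader input) : base(input)
    {
        // Treat "o:p" as a plain name; otherwise the undeclared
        // prefix would cause an XmlException.
        Namespaces = false;
    }

    public override bool Read()
    {
        bool status = base.Read();
        // Skip every element whose name carries a prefix,
        // together with all of its children.
        while (status && NodeType == XmlNodeType.Element && Name.IndexOf(':') > 0)
        {
            base.Skip();      // moves to the node after the skipped element
            status = !EOF;
        }
        return status;
    }
}

public static class PrefixSkipDemo
{
    // Returns the names of all elements the reader exposes, separated by spaces.
    public static string ElementNames(string xml)
    {
        var names = new StringBuilder();
        using (var reader = new PrefixSkippingReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Append(reader.Name).Append(' ');
            }
        }
        return names.ToString().Trim();
    }
}
```

　　For the input <div><o:p>junk<b>x</b></o:p><p>text</p></div>, ElementNames returns "div p": the <o:p> element disappears along with the <b> nested inside it.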

　HtmlWriter

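　　This class is a bit trickier. It relies on the following tricks:

Override the WriteString method to bypass the regular XML encoding and encode the HTML text manually instead.
Override WriteStartElement so that disallowed tags are not written to the output.
Override WriteAttributes so that unwanted attributes are not written.

　　Let's look at the class part by part:

　　Configurability

　　You can configure HtmlWriter by modifying the following section: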

public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, it will filter the output
    /// by using tag and attribute filtering,
    /// space reduce etc
    /// </summary>
    public bool FilterOutput = false;
    /// <summary>
    /// If true, it will reduce consecutive &nbsp; with one instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;
    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string [] AllowedTags = 
        new string[] { "p", "b", "i", "u", "em", "big", "small", 
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr", 
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};
    /// <summary>
    /// If any tag found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has least impact on output
    /// </summary>
    public string ReplacementTag = "dd";
    /// <summary>
    /// New lines \r\n are replaced with space 
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;
    /// <summary>
    /// Specify which attributes are allowed. 
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[] 
    { 
        "class", "href", "target", "border", "src", 
        "align", "width", "height", "color", "size" 
    };
}

　　WriteString method

/// <summary>
/// The reason why we are overriding
/// this method is, we do not want the output to be
/// encoded for texts inside attribute
/// and inside node elements. For example, all the &nbsp;
/// gets converted to &amp;nbsp in output. But this does not 
/// apply to HTML. In HTML, we need to have &nbsp; as it is.
/// </summary>
/// <param name="text"></param>
public override void WriteString(string text)
{
    // Change all non-breaking spaces (U+00A0) to normal spaces
    text = text.Replace( "\u00A0", " " );
    // When you are reading an RSS feed and writing HTML,
    // these lines help remove the CDATA markers
    text = text.Replace( "<![CDATA[", "" );
    text = text.Replace( "]]>", "" );

    // Do some encoding of our own because
    // we are going to use WriteRaw which won't
    // do any of the necessary encoding.
    // '&' is left alone, so entity references already
    // present in the text pass through unchanged.
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

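        // Replace new lines first so that the spaces they
        // turn into are collapsed by the next step as well
        if( this.RemoveNewlines ) 
            text = text.Replace( Environment.NewLine, " " );
        // We want to replace consecutive spaces
        // with one space in order to save horizontal width.
        // Repeat until no run of two spaces is left.
        if( this.ReduceConsecutiveSpace )
        {
            while( text.Contains( "  " ) )
                text = text.Replace( "  ", " " );
        }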

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}

　　WriteStartElement: applying tag filtering

public override void WriteStartElement(string prefix, 
    string localName, string ns)
{
    if( this.FilterOutput ) 
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }
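        // A disallowed tag is replaced with the configured
        // replacement tag instead of a hard-coded "dd"
        if( !canWrite ) 
            localName = this.ReplacementTag;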
    }
    base.WriteStartElement(prefix, localName, ns);
}
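
　　The tag filter can be tried in isolation with a small stand-alone writer. The sketch below is an illustration, not the article's HtmlWriter: TagFilteringWriter and TagFilterDemo are hypothetical names, the allowed list is shortened, a HashSet replaces the linear array scan, and allowed tags are written in lower case (as XHTML requires) rather than in their original casing.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public class TagFilteringWriter : XmlTextWriter
{
    // Shortened allow-list, for illustration only
    private static readonly HashSet<string> Allowed =
        new HashSet<string> { "p", "b", "br", "div" };

    // Tag written in place of any element that is not allowed
    public string ReplacementTag = "dd";

    public TagFilteringWriter(TextWriter output) : base(output) { }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        string tag = localName.ToLowerInvariant();
        base.WriteStartElement(prefix, Allowed.Contains(tag) ? tag : ReplacementTag, ns);
    }
}

public static class TagFilterDemo
{
    // Writes a tiny document containing one disallowed tag
    // and returns the markup that was produced.
    public static string Run()
    {
        var output = new StringWriter();
        using (var writer = new TagFilteringWriter(output))
        {
            writer.WriteStartElement("div");
            writer.WriteStartElement("marquee"); // not in the allow-list
            writer.WriteString("hi");
            writer.WriteEndElement();
            writer.WriteStartElement("br");      // empty element
            writer.WriteEndElement();
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

　　TagFilterDemo.Run() returns <div><dd>hi</dd><br /></div>: the disallowed <marquee> becomes the replacement tag while its text is kept, and <br> is emitted in the empty-element form that XHTML requires.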

　　WriteAttributes method: applying attribute filtering

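// Excerpt from the overridden WriteAttributes method. It runs once for
// each attribute while "reader" is positioned on that attribute.

// Check whether the attribute is in the allowed list
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}
// If allowed, write the attribute
if( canWrite ) 
    this.WriteStartAttribute(reader.Prefix, 
        attributeLocalName, reader.NamespaceURI);
// The value is always read so that the reader advances,
// but it is only written when the attribute is allowed
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite ) this.WriteString(reader.Value);
}
if( canWrite ) this.WriteEndAttribute();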

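　Conclusion

　　The sample application is a utility you can use right away to clean HTML files. You can also use the class in tools, such as blogging clients, that need to post HTML to a web service.

　　Original article: http://www.codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a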
