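Parsing XML files in Python: the SAX approach

Posted by 微微微笑 on 2015-12-31

Environment

Python: 3.4.4

Prepare the XML file

First, create an XML file named countries.xml. Its content is taken from an example on the official Python website.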

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Prepare the Python file

Create a file named test_SAX.py to parse the XML file.

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.sax
import xml.sax.handler

class CountryHandler( xml.sax.ContentHandler ):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""   # name of the element currently being read
        self.rank = ""
        self.year = ""
        self.gdppc = ""
        self.nei_name = ""
        self.nei_dire = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "country":
            print ("*****Country*****")
            name = attributes["name"]
            print ("Name:", name)
        if tag == "neighbor":
            # Copy the values out now; the parser may reuse the attributes object
            self.nei_name = attributes["name"]
            self.nei_dire = attributes["direction"]

    def endElement(self, tag):
        # Print the text collected for the element that just closed, then reset it
        if tag == "rank":
            print ("Rank:", self.rank)
            self.rank = ""
        elif tag == "year":
            print ("Year:", self.year)
            self.year = ""
        elif tag == "gdppc":
            print ("Gdppc:", self.gdppc)
            self.gdppc = ""
        elif tag == "neighbor":
            print ("Neighbor:", self.nei_name,self.nei_dire)
        self.CurrentData = ""
        self.nei_name = ""
        self.nei_dire = ""

    def characters(self, content):
        # Text may arrive in several chunks, so append instead of overwriting
        if self.CurrentData == "rank":
            self.rank += content
        elif self.CurrentData == "year":
            self.year += content
        elif self.CurrentData == "gdppc":
            self.gdppc += content

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    # Turn off namespace mode
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    Handler = CountryHandler()
    parser.setContentHandler( Handler )
    parser.parse("countries.xml")

Output

>python test_SAX.py
*****Country*****
Name: Liechtenstein
Rank: 1
Year: 2008
Gdppc: 141100
Neighbor: Austria E
Neighbor: Switzerland W
*****Country*****
Name: Singapore
Rank: 4
Year: 2011
Gdppc: 59900
Neighbor: Malaysia N
*****Country*****
Name: Panama
Rank: 68
Year: 2011
Gdppc: 13600
Neighbor: Costa Rica W
Neighbor: Colombia E

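Notes

SAX is an event-driven API.

SAX mainly involves three kinds of objects: readers, handlers, and input sources (that is, parsers, event handlers, and input sources).

The parser reads the input source, such as an XML document, and sends events to the event handler, for example element-start and element-end events.

The event handler responds to those events and processes the data in the XML document.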
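To make the event flow concrete, here is a minimal sketch that records the events the parser reports for a tiny document. The class name EventLogger and the inline sample document are illustrative assumptions, not part of the original script.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text so the trace stays short.
        if content.strip():
            self.events.append(("text", content.strip()))


logger = EventLogger()
# parseString takes the document as bytes and sends events to the handler.
xml.sax.parseString(b"<data><rank>1</rank></data>", logger)
for event in logger.events:
    print(event)
```

The recorded events follow document order: each element's start event, then its text, then its end event.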

parser = xml.sax.make_parser()

Creates and returns a SAX XMLReader object.

See: https://docs.python.org/2/library/xml.sax.html

xml.sax.make_parser([parser_list])
Create and return a SAX XMLReader object. The first parser found will be used. If parser_list is provided, it must be a sequence of strings which name modules that have a function named create_parser(). Modules listed in parser_list will be used before modules in the default list of parsers.

parser.setFeature(xml.sax.handler.feature_namespaces, 0)

Sets xml.sax.handler.feature_namespaces to 0, which simply turns off namespace mode.

See: https://docs.python.org/2/library/xml.sax.reader.html

XMLReader.setFeature(featurename, value)
Set the featurename to value. If the feature is not recognized, SAXNotRecognizedException is raised. If the feature or its setting is not supported by the parser, SAXNotSupportedException is raised.
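For contrast, the sketch below turns namespace mode on. The parser then calls startElementNS() instead of startElement(), and the element name arrives as a (uri, localname) tuple. The handler name NSHandler and the namespace URI http://example.com/demo are made up for illustration.

```python
import io
import xml.sax
import xml.sax.handler


class NSHandler(xml.sax.ContentHandler):
    """Collects (namespace URI, local name) pairs in namespace mode."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (uri, localname) tuple.
        uri, localname = name
        self.seen.append((uri, localname))


doc = b'<d:data xmlns:d="http://example.com/demo"><d:rank>1</d:rank></d:data>'

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 1)  # namespace mode on
handler = NSHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(doc))  # parse() also accepts a file-like object

print(handler.seen)
```

With the feature set to 0, as in test_SAX.py, the same elements would instead be reported to startElement() under their raw names such as d:data.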

class CountryHandler( xml.sax.ContentHandler )

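The SAX API defines four kinds of handlers: content handlers, DTD handlers, error handlers, and entity resolvers.

An application only needs to implement the interfaces for the events it cares about. Here, for example, we implement only some of the methods of the ContentHandler interface.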
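As an illustration of one of the other handler types, this sketch installs a custom error handler and feeds it a deliberately malformed document. The class name RecordingErrorHandler and the broken sample input are assumptions made for this example.

```python
import xml.sax
import xml.sax.handler


class RecordingErrorHandler(xml.sax.handler.ErrorHandler):
    """Remembers fatal errors, then re-raises them to stop parsing."""

    def __init__(self):
        self.problems = []

    def fatalError(self, exception):
        # SAXParseException carries the position and a message.
        self.problems.append((exception.getLineNumber(), exception.getMessage()))
        raise exception


errors = RecordingErrorHandler()
bad_doc = b"<data><rank>1</data>"  # <rank> is never closed

try:
    xml.sax.parseString(bad_doc, xml.sax.ContentHandler(), errors)
except xml.sax.SAXParseException as exc:
    print("Parsing stopped:", exc.getMessage())

print(errors.problems)
```

If no error handler is supplied, the default one also raises on fatal errors, so a custom handler is mainly useful for logging or collecting diagnostics.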

class xml.sax.handler.ContentHandler
This is the main callback interface in SAX, and the one most important to applications. The order of events in this interface mirrors the order of the information in the document.

ContentHandler has many methods. For details, see: https://docs.python.org/2/library/xml.sax.handler.html#contenthandler-objects

Here we first create a CountryHandler class that inherits from xml.sax.ContentHandler, and then implement its startElement(), endElement(), and characters() methods.

def startElement(self, tag, attributes)

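Called when an XML start tag is encountered. tag is the name of the tag, and attributes is a dictionary-like object holding the tag's attributes.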

Signals the start of an element in non-namespace mode.

The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an object of the Attributes interface (see The Attributes Interface) containing the attributes of the element. The object passed as attrs may be re-used by the parser; holding on to a reference to it is not a reliable way to keep a copy of the attributes. To keep a copy of the attributes, use the copy() method of the attrs object.
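The advice about copy() can be sketched as follows: the handler stores a copy of each neighbor element's attributes rather than the object the parser passed in. The class name NeighborCollector and the inline sample document are illustrative assumptions.

```python
import xml.sax


class NeighborCollector(xml.sax.ContentHandler):
    """Keeps a private copy of the attributes of every <neighbor>."""

    def __init__(self):
        super().__init__()
        self.neighbors = []

    def startElement(self, name, attrs):
        if name == "neighbor":
            # copy() returns an independent attributes object, so it stays
            # valid even if the parser reuses attrs for the next element.
            self.neighbors.append(attrs.copy())


doc = (b'<country name="Panama">'
       b'<neighbor name="Costa Rica" direction="W"/>'
       b'<neighbor name="Colombia" direction="E"/>'
       b'</country>')

collector = NeighborCollector()
xml.sax.parseString(doc, collector)

for n in collector.neighbors:
    # The copy supports mapping-style access such as [] and get().
    print(n["name"], n.get("direction"))
```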

def endElement(self, tag)

Called when an XML end tag is encountered. tag is the name of the tag.

Signals the end of an element in non-namespace mode.
The name parameter contains the name of the element type, just as with the startElement() event.

def characters(self, content)

Called when character data inside an XML element is encountered. content is the text that was read.

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

content may be a Unicode string or a byte string; the expat reader module produces always Unicode strings.
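Because the text of one element may arrive in several chunks, a common pattern is to buffer the chunks and join them when the element ends. The sketch below feeds the document to the parser in two arbitrary pieces to imitate that situation. The class name TextCollector and the split point are assumptions for illustration, and how many characters() calls actually occur depends on the parser.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Joins all character chunks of an element before using the text."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.values = {}

    def startElement(self, name, attrs):
        self._chunks = []  # start a fresh buffer for this element

    def characters(self, content):
        self._chunks.append(content)  # may be called more than once

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:
            self.values[name] = text
        self._chunks = []


collector = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)

# Feed the input in two pieces, splitting the text "68" in the middle.
parser.feed(b"<country><rank>6")
parser.feed(b"8</rank><gdppc>13600</gdppc></country>")
parser.close()

print(collector.values)
```

A handler that overwrites its field on every characters() call could end up with only the last chunk, which is why appending is the safer choice.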

parser.setContentHandler( Handler )

Sets the current ContentHandler to an instance of our own handler. If no handler is set, content events are discarded.

See: https://docs.python.org/2/library/xml.sax.reader.html

XMLReader.setContentHandler(handler)
Set the current ContentHandler. If no ContentHandler is set, content events will be discarded.

parser.parse("countries.xml")

Starts parsing the XML file.

See: https://docs.python.org/2/library/xml.sax.reader.html

Process an input source, producing SAX events. The source object can be a system identifier (a string identifying the input source – typically a file name or an URL), a file-like object, or an InputSource object. When parse() returns, the input is completely processed, and the parser object can be discarded or reset. As a limitation, the current implementation only accepts byte streams; processing of character streams is for further study.
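Since parse() also accepts a file-like object, the same handler code can run against in-memory data. The following sketch wraps a bytes literal in io.BytesIO. The class name CountryNames and the shortened sample document are assumptions made for this example.

```python
import io
import xml.sax


class CountryNames(xml.sax.ContentHandler):
    """Collects the name attribute of every <country> element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == "country":
            self.names.append(attrs["name"])


xml_bytes = b"""<?xml version="1.0"?>
<data>
    <country name="Liechtenstein"/>
    <country name="Singapore"/>
</data>"""

handler = CountryNames()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
# A binary file-like object works in place of a file name.
parser.parse(io.BytesIO(xml_bytes))

print(handler.names)
```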
