Environment
Python: 3.4.4
Preparing the XML file
First, create an XML file named countries.xml. Its content is taken from the official Python documentation.
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
Preparing the Python file
Create a file named test_SAX.py that parses the XML file.
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.sax
import xml.sax.handler


class CountryHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.CurrentAttributes = ""
        self.rank = ""
        self.year = ""
        self.gdppc = ""
        self.nei_name = ""
        self.nei_dire = ""

    def startElement(self, tag, attributes):
        # Remember which element we are inside so characters() can route text.
        self.CurrentData = tag
        self.CurrentAttributes = attributes
        if tag == "country":
            print("*****Country*****")
            name = attributes["name"]
            print("Name:", name)
        if tag == "neighbor":
            self.nei_name = attributes["name"]
            self.nei_dire = attributes["direction"]

    def endElement(self, tag):
        if self.CurrentData == "rank":
            print("Rank:", self.rank)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "gdppc":
            print("Gdppc:", self.gdppc)
        elif self.CurrentData == "neighbor":
            print("Neighbor:", self.nei_name, self.nei_dire)
        # Reset per-element state so the next element starts clean.
        self.CurrentData = ""
        self.rank = ""
        self.year = ""
        self.gdppc = ""
        self.nei_name = ""
        self.nei_dire = ""

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate it.
        if self.CurrentData == "rank":
            self.rank += content
        elif self.CurrentData == "year":
            self.year += content
        elif self.CurrentData == "gdppc":
            self.gdppc += content


if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    Handler = CountryHandler()
    parser.setContentHandler(Handler)
    parser.parse("countries.xml")
Output
>python test_SAX.py
*****Country*****
Name: Liechtenstein
Rank: 1
Year: 2008
Gdppc: 141100
Neighbor: Austria E
Neighbor: Switzerland W
*****Country*****
Name: Singapore
Rank: 4
Year: 2011
Gdppc: 59900
Neighbor: Malaysia N
*****Country*****
Name: Panama
Rank: 68
Year: 2011
Gdppc: 13600
Neighbor: Costa Rica W
Neighbor: Colombia E
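Notes
SAX is an event-driven API.
SAX mainly involves three kinds of objects: readers, handlers, and input sources (that is, the parser, the event handlers, and the input source).
The parser reads the input source, such as an XML document, and sends events to the event handler, for example element-start and element-end events.
The event handler responds to those events and processes the data in the XML document.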
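The three roles can be seen side by side in a short sketch. The EventLogger class and the inline XML bytes are illustrative assumptions, not part of the original script; InputSource is taken from xml.sax.xmlreader.

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


# Input source: wraps the bytes to be parsed (illustrative data).
source = InputSource()
source.setByteStream(io.BytesIO(b"<data><country/></data>"))

# Reader: pulls data from the input source and emits events.
reader = xml.sax.make_parser()

# Handler: receives the events sent by the reader.
logger = EventLogger()
reader.setContentHandler(logger)
reader.parse(source)

print(logger.events)
```

The handler never touches the file or byte stream directly; it only sees the events that the reader sends.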
parser = xml.sax.make_parser()
Creates and returns a SAX XMLReader object.
See: https://docs.python.org/2/library/xml.sax.html
xml.sax.make_parser([parser_list]) Create and return a SAX XMLReader object. The first parser found will be used. If parser_list is provided, it must be a sequence of strings which name modules that have a function named create_parser(). Modules listed in parser_list will be used before modules in the default list of parsers.
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
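Sets xml.sax.handler.feature_namespaces to 0, which turns namespace mode off. The expat-based parser already has namespace mode off by default, so this call only makes the choice explicit.
See: https://docs.python.org/2/library/xml.sax.reader.html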
XMLReader.setFeature(featurename, value) Set the featurename to value. If the feature is not recognized, SAXNotRecognizedException is raised. If the feature or its setting is not supported by the parser, SAXNotSupportedException is raised.
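To see what namespace mode changes, the sketch below parses the same document with the feature off and on. The NameRecorder class, the record() helper, and the sample XML with its example.com namespace URI are assumptions made for illustration. With the feature off the parser calls startElement() with the raw tag name; with it on it calls startElementNS() with a (uri, localname) tuple instead.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

# Illustrative document using a namespace prefix.
XML = b'<d:data xmlns:d="http://example.com/d"><d:item/></d:data>'


class NameRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records which callback saw which name."""

    def __init__(self):
        super().__init__()
        self.plain = []      # names passed to startElement()
        self.qualified = []  # names passed to startElementNS()

    def startElement(self, name, attrs):
        self.plain.append(name)

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (uri, localname) tuple.
        self.qualified.append(name)


def record(namespaces):
    """Parse XML with namespace mode switched on or off."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces)
    recorder = NameRecorder()
    parser.setContentHandler(recorder)
    parser.parse(io.BytesIO(XML))
    return recorder


off = record(False)
on = record(True)
print(off.plain)
print(on.qualified)
```

Because countries.xml uses no namespaces, leaving the feature off keeps the handler code simple.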
class CountryHandler( xml.sax.ContentHandler )
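The SAX API defines four kinds of handlers: content handlers, DTD handlers, error handlers, and entity resolvers.
A program only needs to implement the interfaces for the events it cares about. Here, for example, we implement only some of the methods of the ContentHandler interface.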
class xml.sax.handler.ContentHandler This is the main callback interface in SAX, and the one most important to applications. The order of events in this interface mirrors the order of the information in the document.
ContentHandler has many methods. For details, see: https://docs.python.org/2/library/xml.sax.handler.html#contenthandler-objects
Here we first create a CountryHandler class that inherits from xml.sax.ContentHandler, and then implement its startElement(), endElement(), and characters() methods.
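As a minimal sketch of implementing only what you need, the hypothetical CountryCounter below overrides a single callback; every other event falls through to the do-nothing defaults inherited from ContentHandler. The class name and the inline XML are assumptions for illustration.

```python
import xml.sax


class CountryCounter(xml.sax.ContentHandler):
    """Hypothetical handler that overrides only startElement()."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "country":
            self.count += 1


counter = CountryCounter()
# parseString() builds a parser internally and feeds it the given bytes.
xml.sax.parseString(
    b"<data><country name='A'/><country name='B'/></data>", counter
)
print(counter.count)
```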
def startElement(self, tag, attributes)
Called when an XML start tag is encountered. tag is the tag name, and attributes is a dictionary-like object holding the tag's attributes.
Signals the start of an element in non-namespace mode. The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an object of the Attributes interface (see The Attributes Interface) containing the attributes of the element. The object passed as attrs may be re-used by the parser; holding on to a reference to it is not a reliable way to keep a copy of the attributes. To keep a copy of the attributes, use the copy() method of the attrs object.
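The following sketch shows the dictionary-style access the attributes object offers and the copy() call recommended in the excerpt above. NeighborCollector and the sample XML are assumptions for illustration; the second neighbor deliberately lacks a direction attribute to show get() with a default.

```python
import xml.sax


class NeighborCollector(xml.sax.ContentHandler):
    """Hypothetical handler that keeps a copy of each neighbor's attributes."""

    def __init__(self):
        super().__init__()
        self.saved = []

    def startElement(self, name, attrs):
        if name == "neighbor":
            # The parser may reuse attrs, so keep a copy, not the object itself.
            self.saved.append(attrs.copy())


collector = NeighborCollector()
xml.sax.parseString(
    b'<country><neighbor name="Austria" direction="E"/>'
    b'<neighbor name="Switzerland"/></country>',
    collector,
)

for attrs in collector.saved:
    # Index access raises KeyError for a missing attribute; get() does not.
    print(attrs["name"], attrs.get("direction", "unknown"))
```

Using attributes["direction"] directly, as the main script does, is fine as long as every neighbor element is known to carry that attribute.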
def endElement(self, tag)
Called when an XML end tag is encountered. tag is the tag name.
Signals the end of an element in non-namespace mode.
The name parameter contains the name of the element type, just as with the startElement() event.
def characters(self, content)
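Called when character data inside an XML element is encountered. content holds that text, which may be only part of the element's content.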
Receive notification of character data. The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information. content may be a Unicode string or a byte string; the expat reader module produces always Unicode strings.
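Because the text of one element may arrive in several chunks, a common pattern is to buffer the chunks and join them when the element ends. The sketch below is one way to do that; TextCollector and the sample XML are assumptions for illustration. The entity reference is there because it is a place where a parser may split the text, though the result does not depend on how many chunks actually arrive.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that joins character chunks per element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.texts = {}

    def startElement(self, name, attrs):
        # Start a fresh buffer for each element.
        self._chunks = []

    def characters(self, content):
        # May be called more than once for a single run of text.
        self._chunks.append(content)

    def endElement(self, name):
        # Only now is the element's text known to be complete.
        self.texts[name] = "".join(self._chunks)
        self._chunks = []


collector = TextCollector()
xml.sax.parseString(b"<name>Trinidad &amp; Tobago</name>", collector)
print(collector.texts["name"])
```

This simple buffer assumes elements hold either text or child elements, not both, which is true of countries.xml.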
parser.setContentHandler( Handler )
Sets the current ContentHandler to an instance of our own handler. If no handler is set, content events are discarded.
See: https://docs.python.org/2/library/xml.sax.reader.html
XMLReader.setContentHandler(handler) Set the current ContentHandler. If no ContentHandler is set, content events will be discarded.
parser.parse("countries.xml")
Starts parsing the XML file.
See: https://docs.python.org/2/library/xml.sax.reader.html
Process an input source, producing SAX events. The source object can be a system identifier (a string identifying the input source – typically a file name or an URL), a file-like object, or an InputSource object. When parse() returns, the input is completely processed, and the parser object can be discarded or reset. As a limitation, the current implementation only accepts byte streams; processing of character streams is for further study.
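As a sketch of the source types mentioned above, the code below passes parse() first a file-like object and then a file name. TagCounter, count_tags(), the sample XML, and the temporary file name sample.xml are assumptions for illustration. The streams are opened in binary mode, in line with the byte-stream limitation noted in the excerpt.

```python
import io
import os
import tempfile
import xml.sax

# Illustrative document.
XML = b"<data><country/><country/></data>"


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def startElement(self, name, attrs):
        self.tags += 1


def count_tags(source):
    """Parse source with a fresh parser and return the number of start tags."""
    parser = xml.sax.make_parser()
    counter = TagCounter()
    parser.setContentHandler(counter)
    parser.parse(source)
    return counter.tags


# 1. A file-like object that yields bytes.
from_stream = count_tags(io.BytesIO(XML))

# 2. A file name (system identifier), as in parser.parse("countries.xml").
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(XML)
    from_path = count_tags(path)

print(from_stream, from_path)
```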