Using sub-generators for lexical scanning in Python

Posted by jieforest on 2012-08-14
A few days ago I watched a very interesting talk by Rob Pike about writing a non-trivial lexer in Go. Rob discussed how the traditional switch-based state machine approach is cumbersome to write, because it’s not really compatible with the algorithm we want to express. The main problem is that when we return a new token, a traditional state-machine structure forces us to explicitly pack up the state of where we are and return to the caller. Especially in cases where we just want to stay in the same state, this makes code unnecessarily convoluted.
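
To make the pain concrete, here is a minimal sketch of my own (not code from the talk) of such a switch-based lexer in Python. It splits input into runs of digits and non-digits; notice how every completed token forces a return, so the state, the token start and the position all have to be stashed on the object before bailing out:

CODE:

class SwitchLexer:
    IN_TEXT, IN_NUMBER = 'IN_TEXT', 'IN_NUMBER'

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.start = 0
        self.state = self.IN_TEXT

    def next_token(self):
        # Each call re-enters the saved state; when a token is complete
        # we must pack up self.state/self.start/self.pos and return.
        while True:
            ch = self.text[self.pos] if self.pos < len(self.text) else ''
            if self.state == self.IN_TEXT:
                if ch and not ch.isdigit():
                    self.pos += 1          # stay in the same state...
                else:
                    tok = ('TEXT', self.text[self.start:self.pos])
                    self.start = self.pos
                    self.state = self.IN_NUMBER
                    if tok[1]:
                        return tok         # ...but emitting forces a return
                    if not ch:
                        return None
            else:  # self.state == self.IN_NUMBER
                if ch.isdigit():
                    self.pos += 1
                else:
                    tok = ('NUMBER', self.text[self.start:self.pos])
                    self.start = self.pos
                    self.state = self.IN_TEXT
                    if tok[1]:
                        return tok
                    if not ch:
                        return None

lexer = SwitchLexer('ab12cd')
tok = lexer.next_token()
while tok is not None:
    print(tok)                 # ('TEXT', 'ab'), ('NUMBER', '12'), ('TEXT', 'cd')
    tok = lexer.next_token()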

This struck a chord with me, because I’ve already written about simplifying state machine code in Python with coroutines. I couldn’t help but wonder what would be an elegant Pythonic way to implement Rob’s template lexer (watch the talk or take a look at his slides for the syntax).

What follows is my attempt, which uses the new yield from syntax from PEP 380, and hence requires Python 3.3 (which is currently in beta, but should be released soon). I’ll present the code in small chunks with explanations; the full source is available for download here. It’s heavily commented, so it should be easy to grok.
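
As a quick refresher, here is a toy example of mine (not the article's code) showing the PEP 380 mechanics we'll rely on: an outer generator delegates to a sub-generator with yield from, and the sub-generator's return value, which is exactly what makes chained state functions work, becomes the value of the yield-from expression:

CODE:

def lex_a():
    # A state function: yield tokens, then return the next state.
    yield 'token from state a'
    return lex_b

def lex_b():
    yield 'token from state b'
    return None   # no next state: lexing is done

def run(start_state):
    state = start_state
    while state is not None:
        # Delegate to the sub-generator; its return value (the next
        # state function) becomes the value of the yield-from expression.
        state = yield from state()

for tok in run(lex_a):
    print(tok)   # prints the two tokens in order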

First, some helper types and constants:

CODE:

from collections import namedtuple

TOK_TEXT        = 'TOK_TEXT'
TOK_LEFT_META   = 'TOK_LEFT_META'
TOK_RIGHT_META  = 'TOK_RIGHT_META'
TOK_PIPE        = 'TOK_PIPE'
TOK_NUMBER      = 'TOK_NUMBER'
TOK_ID          = 'TOK_ID'

# A token has
#   type: one of the TOK_* constants
#   value: string value, as taken from input
#
Token = namedtuple('Token', 'type value')
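
To preview where this is going, here is a rough sketch of how such state functions chain together with yield from, reusing the Token type and TOK_* constants just defined. This is my approximation; the article's full source handles the complete template syntax and differs in detail:

CODE:

LEFT_META = '{{'

def lex(text):
    # Top-level driver: run the current state function, which yields
    # tokens and returns (next_state, new_pos); stop when next_state
    # is None.
    state, pos = lex_text, 0
    while state is not None:
        state, pos = yield from state(text, pos)

def lex_text(text, pos):
    # Emit plain text up to the next '{{' (or end of input).
    meta = text.find(LEFT_META, pos)
    if meta == -1:
        if pos < len(text):
            yield Token(TOK_TEXT, text[pos:])
        return None, len(text)
    if meta > pos:
        yield Token(TOK_TEXT, text[pos:meta])
    return lex_left_meta, meta

def lex_left_meta(text, pos):
    yield Token(TOK_LEFT_META, LEFT_META)
    # The real lexer would hand off to a state that lexes pipes,
    # numbers and identifiers inside the action; this sketch just
    # drops back into plain text.
    return lex_text, pos + len(LEFT_META)

for tok in lex('hello {{name'):
    print(tok)
# Token(type='TOK_TEXT', value='hello ')
# Token(type='TOK_LEFT_META', value='{{')
# Token(type='TOK_TEXT', value='name')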
