Writing a simple Lexer in PHP/C++/Java

Posted by weixin_34162629 on 2015-09-02

Contents

0. Comparison of parser generators
1. Writing a simple lexer in PHP
2. phc
3. JLexPHP: A PHP Lexer(xx.lex.php) Created By Java By x.lex File Input
4. JFlex
5. JLex: A Lexical Analyzer Generator for Java
6. PhpParser

 

0. Comparison of parser generators

Relevant Link:

https://en.wikipedia.org/wiki/Comparison_of_parser_generators
http://www2.cs.tum.edu/projekte/cup/examples.php
http://pygments.org/docs/lexers/
http://www.phpdeveloper.org/news/5678

 

1. Writing a simple lexer in PHP

0x1: Introduction

Router::connect('/login', array('Sessions::add'));

The line map /login -> Sessions::add may be translated by the lexer into the following token stream:

<T_MAP, "map">
<T_URL, "/login">
<T_BLOCKSTART, "->">
<T_IDENTIFIER, "Sessions::add">

The parser can parse the following notation:

<whitespace>     := [\s]
<map>            := "map"
<url>            := [a-z/]
<blockstart>     := "->"
<identifier>     := [a-zA-Z0-9:]
<mapBlock>       := <map> <whitespace>* <url>+ <whitespace>* <blockstart> <whitespace>* <identifier>+

Every rule identified by a := is called a production rule. I leave the parsing part for a later post, because that would be too much for a single posting.
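Because none of the rules above is recursive, the whole <mapBlock> production can be checked with a single regular expression. The following is a minimal sketch in Java under that assumption; the class name MapBlockGrammar and the method isMapBlock are hypothetical names chosen for illustration and are not part of the original post.

```java
import java.util.regex.Pattern;

class MapBlockGrammar {
    // Terminals, copied from the notation above.
    static final String WHITESPACE = "\\s";
    static final String MAP = "map";
    static final String URL = "[a-z/]";
    static final String BLOCKSTART = "->";
    static final String IDENTIFIER = "[a-zA-Z0-9:]";

    // <mapBlock> := <map> <whitespace>* <url>+ <whitespace>* <blockstart> <whitespace>* <identifier>+
    static final Pattern MAP_BLOCK = Pattern.compile(
        MAP + WHITESPACE + "*" + URL + "+" + WHITESPACE + "*"
            + BLOCKSTART + WHITESPACE + "*" + IDENTIFIER + "+");

    // True when the whole line can be derived from <mapBlock>.
    static boolean isMapBlock(String line) {
        return MAP_BLOCK.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMapBlock("map /login -> Sessions::add")); // true
        System.out.println(isMapBlock("map /login Sessions::add"));    // false, "->" is missing
    }
}
```

This only answers yes or no for a line. A real parser would also build a structure from the tokens, which is the part the post defers.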

0x2: Implementation

The basic idea is that we match a list of regexes against the current line. If one of them matches, we store that token and advance our offset to the first character after the match. If no token is found and we are not yet at the end of the line, raise an exception (because something invalid is in front of our offset).
Let's stick with the example already mentioned above. We want to create tokens for an input file like map /login -> Sessions::add or root -> Pages::home.
The first thing we need is an array of terminal symbols that map to token identifiers.

<?php
class Lexer {
    protected static $_terminals = array(
        "/^(root)/" => "T_ROOT",
        "/^(map)/" => "T_MAP",
        "/^(\s+)/" => "T_WHITESPACE",
        "/^(\/[A-Za-z0-9\/:]+[^\s])/" => "T_URL",
        "/^(->)/" => "T_BLOCKSTART",
        "/^(::)/" => "T_DOUBLESEPARATOR",
        "/^(\w+)/" => "T_IDENTIFIER",
    );
}
?>

Now, let's implement a run method that accepts an array of source lines and returns an array of tokens. It calls the helper _match method that performs the actual matching and raises an exception if no token was found.

public static function run($source) {
    $tokens = array();
 
    foreach($source as $number => $line) {            
        $offset = 0;
        while($offset < strlen($line)) {
            $result = static::_match($line, $number, $offset);
            if($result === false) {
                throw new Exception("Unable to parse line " . ($number+1) . ".");
            }
            $tokens[] = $result;
            $offset += strlen($result['match']);
        }
    }
 
    return $tokens;
}

Note how we advance the offset on every iteration and work towards the end of the string. To get the full picture, here's the _match helper method.

protected static function _match($line, $number, $offset) {
    $string = substr($line, $offset);
 
    foreach(static::$_terminals as $pattern => $name) {
        if(preg_match($pattern, $string, $matches)) {
            return array(
                'match' => $matches[1],
                'token' => $name,
                'line' => $number+1
            );
        }
    }
 
    return false;
}

We use preg_match to check whether one of our patterns matches the current string. If you look closely, you can see that every regex is anchored to the start of the string (^) and wrapped in a capture group, so a match can only occur exactly at the current offset and the matched text can be read from the group. We also store the current line number, because we need it in our parser and to display helpful error messages.
Let's run our lexer with some example input:

$input = array('root -> Foo::bar');
$result = Lexer::run($input);
var_dump($result);
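For readers following the Java sections later in this post, here is a rough Java port of the same first-match lexer. It is a sketch, not the author's code: the names RouteLexer and Token are assumptions, and Matcher.region plus lookingAt stand in for the substr and ^ anchoring used in the PHP version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RouteLexer {
    // One token: matched text, token name, and 1-based line number.
    static final class Token {
        final String match;
        final String name;
        final int line;

        Token(String match, String name, int line) {
            this.match = match;
            this.name = name;
            this.line = line;
        }

        @Override
        public String toString() {
            return "<" + name + ", \"" + match + "\">";
        }
    }

    // LinkedHashMap keeps insertion order: patterns are tried from specific to general.
    static final Map<String, Pattern> TERMINALS = new LinkedHashMap<>();
    static {
        TERMINALS.put("T_ROOT", Pattern.compile("root"));
        TERMINALS.put("T_MAP", Pattern.compile("map"));
        TERMINALS.put("T_WHITESPACE", Pattern.compile("\\s+"));
        TERMINALS.put("T_URL", Pattern.compile("/[A-Za-z0-9/:]+[^\\s]"));
        TERMINALS.put("T_BLOCKSTART", Pattern.compile("->"));
        TERMINALS.put("T_DOUBLESEPARATOR", Pattern.compile("::"));
        TERMINALS.put("T_IDENTIFIER", Pattern.compile("\\w+"));
    }

    // Tokenizes every line, advancing the offset past each match.
    static List<Token> run(List<String> source) {
        List<Token> tokens = new ArrayList<>();
        for (int number = 0; number < source.size(); number++) {
            String line = source.get(number);
            int offset = 0;
            while (offset < line.length()) {
                Token token = match(line, number, offset);
                if (token == null) {
                    throw new IllegalArgumentException("Unable to parse line " + (number + 1) + ".");
                }
                tokens.add(token);
                offset += token.match.length();
            }
        }
        return tokens;
    }

    // Returns the first terminal that matches exactly at offset, or null.
    static Token match(String line, int number, int offset) {
        for (Map.Entry<String, Pattern> entry : TERMINALS.entrySet()) {
            Matcher m = entry.getValue().matcher(line).region(offset, line.length());
            if (m.lookingAt()) {
                return new Token(m.group(), entry.getKey(), number + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (Token token : run(Arrays.asList("root -> Foo::bar"))) {
            System.out.println(token);
        }
    }
}
```

For the input root -> Foo::bar this sketch yields seven tokens: T_ROOT, T_WHITESPACE, T_BLOCKSTART, T_WHITESPACE, T_IDENTIFIER ("Foo"), T_DOUBLESEPARATOR and T_IDENTIFIER ("bar").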

0x3: Wrapping Up

One thing to be aware of is that the token regexes must be listed in the correct order, namely from specific to general. If we put T_IDENTIFIER before T_ROOT, the root keyword would always be matched as an identifier. While I haven't tried it out yet
In a complete lexer, the matching state machine should instead use a "greedy" strategy, that is, match the longest possible lexical pattern.
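The longest-match strategy can be sketched as follows: try every pattern at the current offset and keep the longest match, falling back to rule order only on a tie. This is an illustrative Java sketch with a hypothetical subset of the terminal table; the class name LongestMatch and the method next are assumptions and do not come from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatch {
    // Token names and patterns in priority order (hypothetical subset of the table above).
    static final String[] NAMES = {"T_ROOT", "T_MAP", "T_WHITESPACE", "T_IDENTIFIER"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("root"),
        Pattern.compile("map"),
        Pattern.compile("\\s+"),
        Pattern.compile("\\w+")
    };

    // Returns {tokenName, matchedText} for the longest match at offset, or null if nothing matches.
    static String[] next(String input, int offset) {
        String[] best = null;
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(input).region(offset, input.length());
            // Only a strictly longer match replaces the current best,
            // so on a tie the earlier (more specific) rule is kept.
            if (m.lookingAt() && (best == null || m.group().length() > best[1].length())) {
                best = new String[] {NAMES[i], m.group()};
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", next("rootDir", 0))); // T_IDENTIFIER rootDir
        System.out.println(String.join(" ", next("root ->", 0))); // T_ROOT root
    }
}
```

With the first-match lexer from the post, rootDir would be split into T_ROOT followed by an identifier. With longest match it stays a single identifier, while the bare keyword root is still recognized as T_ROOT.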

Relevant Link:

http://nitschinger.at/Writing-a-simple-lexer-in-PHP
http://www.codediesel.com/php/building-a-simple-parser-and-lexer-in-php/

 

2. phc

phc is an open source compiler for PHP with support for plugins. In addition, it can be used to pretty-print or obfuscate PHP code, as a framework for developing applications that process PHP scripts, or to convert PHP into XML and back, enabling processing of PHP scripts using XML tools.

1. phc for PHP programmers  
    1) Compile PHP source into an (optimized) executable (supports entire PHP standard library).
    2) Compile a web application into an (optimized) extension (supports entire PHP standard library).
    3) Pretty-print PHP code.
    4) Obfuscate PHP code (--obfuscate flag - experimental).
    5) Combine many php scripts into a single file (--include flag - experimental).
    6) Optimize PHP code using classical compiler optimizations (in the dataflow branch - very experimental).

2. phc for tools developers 
    1) Analyse, modify or refactor PHP scripts using C++ plugins.
    2) Convert PHP into a well-defined XML format, process it with your own tools, and convert it back to PHP.
    3) Operate on ASTs, simplified ASTs, or 3-address code.
    4) Analyse or optimize PHP code using an SSA-based IR (in the dataflow branch - very experimental).

0x1: Installation Instructions

1. g++ version 3.4.0 or higher
2. make
3. Boost version 1.34 or higher
4. PHP5 embed SAPI (version 5.2.x recommended; refer to PHP embed SAPI installation instructions for more details). This is required to compile PHP code with phc.
5. Xerces-C++ if you want support for XML parsing (you don’t need Xerces for XML unparsing).
6. Boehm garbage collector is used in phc, but not in code compiled by phc. If unavailable, it can be disabled with --disable-gc, but phc will leak all memory it uses.
//The following dependencies are optional:
7. a DOT viewer such as graphviz if you want to be able to view the graphical output generated by phc (for example, syntax trees)
/*
Under Debian/Ubuntu, the following command will install nearly all dependencies:
apt-get install build-essential libboost-all-dev libxerces27-dev graphviz libgc-dev
*/

0x2: Running phc

<?php
   echo "Hello world!";
?>

phc -c helloworld.php -o helloworld
//This creates an executable helloworld, which can then be run
./helloworld

0x3: Traversing the Tree

<?php
   $x = 5;
   if($x == 5)
      echo "yes";
   else
      echo "no";
?>

phc parses this script into the following AST (abstract syntax tree):

[Figure: AST of the script above]
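To give a feel for what traversing such a tree involves, here is a small Java sketch that builds a tree for the script above by hand and walks it in pre-order. The node labels and the AstSketch and Node types are hypothetical and chosen only for illustration; they are not phc's actual node classes or output format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AstSketch {
    // Generic tree node: a label plus ordered children.
    // Hypothetical type for illustration, not one of phc's node classes.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Hand-built tree for: $x = 5; if ($x == 5) echo "yes"; else echo "no";
    static Node sample() {
        return new Node("script",
            new Node("assign", new Node("variable x"), new Node("int 5")),
            new Node("if",
                new Node("equals", new Node("variable x"), new Node("int 5")),
                new Node("echo", new Node("string yes")),
                new Node("echo", new Node("string no"))));
    }

    // Pre-order traversal: record a node, then visit its children from left to right.
    static void walk(Node node, int depth, List<String> out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        out.add(indent + node.label);
        for (Node child : node.children) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        walk(sample(), 0, lines);
        lines.forEach(System.out::println);
    }
}
```

The indentation of each printed label reflects the depth of the node, so the if node shows its condition and both branches as children.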

Relevant Link:

http://www.phpcompiler.org/downloads.html
http://www.phpcompiler.org/ 
http://www.phpcompiler.org/doc/latest/install.html 
http://www.phpcompiler.org/doc/latest/runningphc.html
http://www.phpcompiler.org/doc/latest/treetutorial1.html#treetutorial1
http://www.phpcompiler.org/doc/latest/manual.html

 

3. JLexPHP: A PHP Lexer(xx.lex.php) Created By Java By x.lex File Input

A lexer generator for PHP. It is based on JLex and requires Java to generate the lexer. Once generated, the lexer only requires PHP to run

0x1: x.lex (lexical rules file)

<?php # vim:ft=php
include 'jlex.php';

%%

%{
//<YYINITIAL> L? \" (\\.|[^\\\"])* \"    { $this->createToken(CParser::TK_STRING_LITERAL); }
    /* blah */
%}

%function nextToken
%line
%char
%state COMMENTS

ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+

%%

<YYINITIAL> {NUMBER} { 
      return $this->createToken();
}
<YYINITIAL> {WHITE_SPACE} { }

<YYINITIAL> "+" { 
      return $this->createToken();
} 
<YYINITIAL> "-" { 
      return $this->createToken();
} 
<YYINITIAL> "*" { 
      return $this->createToken();
} 
<YYINITIAL> "/" { 
      return $this->createToken();
} 
<YYINITIAL> ";" { 
      return $this->createToken();
} 
<YYINITIAL> "//" {
      $this->yybegin(self::COMMENTS);
}
<COMMENTS> [^\n] {
}
<COMMENTS> [\n] {
      $this->yybegin(self::YYINITIAL);
}
<YYINITIAL> . {
      throw new Exception("bah!");
}
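The two lexical states in this specification (YYINITIAL and COMMENTS) behave like a small state machine. The following hand-written Java sketch mimics the rules above to show what the generated lexer is expected to do. It is not the code JLexPHP emits, and the names StatefulScanner and scan are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class StatefulScanner {
    enum State { YYINITIAL, COMMENTS }

    // DIGIT=[0-9] from the specification.
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the text of every token produced; whitespace and comments yield none.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.COMMENTS) {
                // <COMMENTS> [^\n] is discarded; <COMMENTS> [\n] switches back to YYINITIAL.
                if (c == '\n') {
                    state = State.YYINITIAL;
                }
                i++;
            } else if (input.startsWith("//", i)) {
                // Longest match: "//" wins over the single "/" operator rule.
                state = State.COMMENTS;
                i += 2;
            } else if (isDigit(c)) {
                // {NUMBER}: consume the whole run of digits.
                int start = i;
                while (i < input.length() && isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else if (" \n\r\t\f".indexOf(c) >= 0) {
                i++; // {WHITE_SPACE}: skipped without producing a token
            } else if ("+-*/;".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalStateException("bah!"); // the catch-all "." rule
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("1 + 22 // note\n3;")); // [1, +, 22, 3, ;]
    }
}
```

Note how the comment text never reaches the token list: entering COMMENTS simply changes which rules are active until the next newline.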

0x2: Lexer Generator(By Java Language)

//create the jar file
javac -Xlint:unchecked JLexPHP/Main.java
jar cvf JLexPHP.jar JLexPHP/*.class
//Read the lexical rules file and generate the lexer
java -cp JLexPHP.jar JLexPHP.Main simple.lex 

This produces simple.lex.php, a PHP file that contains the code of the generated PHP lexer.

0x3: Calling simple.lex.php to tokenize a file

<?php
    $scanner = new Yylex(fopen("file", "r"));
    while ($scanner->yylex())
        ;

?>

Relevant Link:

https://github.com/wez/JLexPHP/blob/master/JLexPHP/Main.java
https://github.com/wez/JLexPHP
http://wezfurlong.org/blog/2006/nov/parser-and-lexer-generators-for-php/

 

4. JFlex

1. JFlex is a lexical analyzer generator (also known as scanner generator) for Java, written in Java.
2. A lexical analyzer generator takes as input a specification with a set of regular expressions and corresponding actions. It generates a program (a lexer) that reads input, matches the input against the regular expressions in the spec file, and runs the corresponding action if a regular expression matched. 
3. Lexers usually are the first front-end step in compilers, matching keywords, comments, operators, etc, and generating an input token stream for parsers. Lexers can also be used for many other purposes.
4. JFlex lexers are based on deterministic finite automata (DFAs). They are fast, without expensive backtracking.
5. JFlex is designed to work together with the LALR parser generator CUP by Scott Hudson, and the Java modification of Berkeley Yacc BYacc/J by Bob Jamison. It can also be used together with other parser generators like ANTLR or as a standalone tool.
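To illustrate what "based on deterministic finite automata" means in practice, here is a tiny hand-coded, table-driven DFA in Java that recognizes identifiers and numbers. It is only a sketch of the idea and not JFlex output; the class TinyDfa, its state names and its table layout are assumptions made for this example.

```java
class TinyDfa {
    static final int START = 0, IN_IDENT = 1, IN_NUMBER = 2, DEAD = -1;

    // TRANSITIONS[state][charClass]; classes: 0 = letter or '_', 1 = digit, 2 = anything else.
    static final int[][] TRANSITIONS = {
        /* START     */ {IN_IDENT, IN_NUMBER, DEAD},
        /* IN_IDENT  */ {IN_IDENT, IN_IDENT,  DEAD},
        /* IN_NUMBER */ {DEAD,     IN_NUMBER, DEAD},
    };

    static int charClass(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // Returns "IDENT:<text>" or "NUMBER:<text>" for the longest token at offset, or null.
    static String next(String input, int offset) {
        int state = START;
        int i = offset;
        // Each character is read once and costs one table lookup.
        while (i < input.length()) {
            int nextState = TRANSITIONS[state][charClass(input.charAt(i))];
            if (nextState == DEAD) break;
            state = nextState;
            i++;
        }
        // In this tiny automaton every non-start state is accepting,
        // so stopping at the first dead transition already gives the longest match.
        if (state == START) return null;
        return (state == IN_IDENT ? "IDENT:" : "NUMBER:") + input.substring(offset, i);
    }

    public static void main(String[] args) {
        System.out.println(next("foo42+1", 0)); // IDENT:foo42
        System.out.println(next("42abc", 0));   // NUMBER:42
    }
}
```

A generator such as JFlex derives a (much larger) table of this kind from the regular expressions in the specification, which is why the resulting lexer does not need to retry alternative patterns one by one.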

Relevant Link:

http://jflex.de/

 

5. JLex: A Lexical Analyzer Generator for Java

JLex is a lexical analyzer generator, written for Java, in Java

Relevant Link:

http://www.cs.princeton.edu/~appel/modern/java/JLex/
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html#SECTION1

 

6. PhpParser

PhpParser generates a pure Java parser for PHP programs. Invoking this parser yields an explicit parse tree suitable for further analysis. This package is based upon:

1. JFlex 1.4.1
2. Cup 0.10k
3. Grammar and lexer specifications of PHP 4.3.10. 

0x1: Project settings for IntelliJ IDEA

1. Project > Language Level: 1.7
2. Modules > Sources: only src/java_cup, src/project, src/jFlex

0x2: Building and cleaning the project with Ant from within Eclipse

1. Project > Properties > Builders
2. Deactivate the Java Builder.
3. New ...
4. Select "Ant builder"
5. Name it "Ant build" or "PhpParser build" (or any other suitable name).
6. In the Main tab, select the build.xml in the project directory as Buildfile and the project directory as Base directory.
7. In the Targets tab for "Manual build", select "build".
8. In the Targets tab for "During a clean", select "cleanall".
9. OK the changes for both dialogs.
10. Project > Build Project and Clean the project using Project > Clean ...
//Or you can build the project using the command line from within the project main directory
ant build

build.xml

<project name="PhpParser" basedir="." default="build">
    <!-- PROPERTIES *************************************************************-->

    <!-- java/javac properties -->
    <property name="src.dir" value="src"/>
    <property name="src.project.dir" value="${src.dir}/project"/>
    <property name="src.spec.dir" value="${src.dir}/spec"/>
    <property name="src.jflex.dir" value="${src.dir}/JFlex"/>
    <property name="src.cup.dir" value="${src.dir}/java_cup"/>

    <property name="build.dir" value="build"/>
    <property name="build.java.dir" value="${build.dir}/java"/>
    <property name="build.class.dir" value="${build.dir}/class"/>

    <property name="lexparse.package" value="at.ac.tuwien.infosys.www.phpparser"/>
    <property name="lexparse.dir" value="${build.java.dir}/at/ac/tuwien/infosys/www/phpparser"/>

    <property name="javadoc.dir" value="doc/html"/>
    <property name="javadoc.lexparse.dir" value="${javadoc.dir}/phpparser"/>

    <!-- lexer generator and generated lexer -->
    <property name="lexgen.main" value="JFlex.Main"/>
    <property name="lexgen.input" value="${src.spec.dir}/php.jflex"/>
    <!-- the lexer name is specified with the %class option in the input file -->
    <property name="lexer.name" value="PhpLexer"/>
    <property name="lexer.source" value="${lexer.name}.java"/>
    <property name="lexer.class" value="${lexer.name}.class"/>

    <!-- parser generator and generated parser -->
    <property name="parsegen.main" value="java_cup.Main"/>
    <property name="parsegen.input" value="${src.spec.dir}/php.cup"/>
    <!-- CAUTION: when changing this property, consult the parser generator's input file first -->
    <property name="parser.name" value="PhpParser"/>
    <property name="parser.source" value="${parser.name}.java"/>
    <property name="parser.sym.name" value="PhpSymbols"/>
    <property name="parser.sym.source" value="${parser.sym.name}.java"/>

    <!-- classpath -->
    <path id="classpath">
        <pathelement location="${build.class.dir}"/>
        <!-- -necessary because of JFlex Messages bundle -->
        <pathelement location="${src.jflex.dir}"/>
    </path>


    <!-- TARGETS ****************************************************************-->

    <target name="cup" description="Compiles the modified Cup.">
        <mkdir dir="${build.class.dir}"/>
        <javac srcdir="${src.cup.dir}" destdir="${build.class.dir}" debug="on"  includeantruntime="false">
            <compilerarg line="-encoding GBK"/>     
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="jflex" description="Compiles the modified JFlex.">
        <javac srcdir="${src.jflex.dir}" destdir="${build.class.dir}" debug="on" includeantruntime="false">
            <compilerarg line="-encoding GBK"/>
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="lexer.source" depends="cup,jflex"
        description="Uses the lexer generator to create a Java lexer from the input file.">
        <mkdir dir="${lexparse.dir}"/>
        <java classname="${lexgen.main}" fork="yes">
            <arg value="${lexgen.input}"/>
            <arg value="-d"/>
            <arg value="${lexparse.dir}"/>
            <classpath refid="classpath"/>
        </java>
    </target>

    <target name="parser.source" depends="cup"
        description="Uses the parser generator to create a Java parser from the input file.">
        <mkdir dir="${lexparse.dir}"/>
        <java classname="${parsegen.main}" fork="yes">
            <arg value="-parser"/>
            <arg value="${parser.name}"/>
            <arg value="-symbols"/>
            <arg value="${parser.sym.name}"/>
            <arg value="-nonterms"/>
            <arg value="-expect"/>
            <arg value="1"/>
            <arg value="${parsegen.input}"/>
            <classpath refid="classpath"/>
        </java>
        <move file="${basedir}/${parser.source}" todir="${lexparse.dir}"/>
        <move file="${basedir}/${parser.sym.source}" todir="${lexparse.dir}"/>
    </target>

    <target name="javac"
        description="Internal target for Java development. Doesn't try to generate lexer and parser.">
        <mkdir dir="${build.class.dir}"/>
        <javac destdir="${build.class.dir}" debug="on" includeantruntime="false">
            <compilerarg line="-encoding GBK"/>
            <src>
                <pathelement path="${src.project.dir}"/>
                <pathelement path="${build.java.dir}"/>
            </src>
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="javadoc" depends="javac" description="Generates JavaDoc.">
        <javadoc destdir="${javadoc.lexparse.dir}" packagenames="${lexparse.package}" Windowtitle="PhpParser 1.0">
            <sourcepath>
                <pathelement path="${src.project.dir}"/>
                <pathelement path="${build.java.dir}"/>
            </sourcepath>
            <classpath refid="classpath"/>
        </javadoc>
    </target>

    <target name="build" depends="lexer.source,parser.source,javac,javadoc"
        description="Builds the whole project together with the generated lexer and parser."/>

    <target name="clean" description="Cleans up.">
        <delete dir="${build.java.dir}"/>
        <delete dir="${build.class.dir}"/>
        <delete dir="${graphs.dir}"/>
        <delete file="${jar.file}"/>
    </target>

    <target name="cleanall" depends="clean" description="Cleans up JFlex, Cup and JavaDoc as well.">
        <delete dir="${lib.dir}/JFlex"/>
        <delete dir="${lib.dir}/java_cup"/>
        <delete dir="${javadoc.dir}"/>
    </target>

    <target name="dist">
        <mkdir dir="dist"/>
    </target>

    <target name="help">
        <echo message="You probably want to do 'ant build'. Otherwise, type 'ant -projecthelp' for help."/>
    </target>
</project>

The build produces a Java lexer and parser engine that we can instantiate directly in our own code and call.

0x3: Usage

Example.java

import at.ac.tuwien.infosys.www.phpparser.*;
import java.io.*;
import java.util.*;

class Example {

    public static void main(String[] args) {

        if (args.length == 0) {
            System.out.println("Please specify one or more PHP files to be parsed.");
            System.exit(1);
        }

        for (int i = 0; i < args.length; i++) {

            String fileName = args[i];
            
            ParseTree parseTree = null;
            try {
                PhpParser parser = new PhpParser(new PhpLexer(new FileReader(fileName)));
                ParseNode rootNode = (ParseNode) parser.parse().value;
                parseTree = new ParseTree(rootNode);
            } catch (FileNotFoundException e) {
                System.err.println("File not found: " + fileName);
                System.exit(1);
            } catch (Exception e) {
                System.err.println("Error parsing " + fileName);
                System.err.println(e.getMessage());
                e.printStackTrace();
                System.exit(1);
            }

            System.out.println("*** Printing tokens for file " + fileName + "...");
            for (Iterator iter = parseTree.leafIterator(); iter.hasNext(); ) {
                ParseNode leaf = (ParseNode) iter.next();
                System.out.println(leaf.getLexeme());
            }
        }
    }

}

Compile

pushd D:\eclipse-javaEE\workspace\phpparser\doc\example
javac -classpath ../../build/class Example.java

Run

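//On Windows (as with the pushd above) the classpath separator is ";"; on Linux/macOS use ":" instead
java -classpath ../../build/class;. Example test1.php test2.php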

0x4: Directory layout

build.xml
README
build
    class: generated java class files
    java: generated java source files (PHP Lexer and Parser)
doc
    various documentation files
src
    java_cup: modified version of the Cup parser generator
    jflex: modified version of the JFlex scanner generator
    project: parse tree data structures
    spec: specification (input) files for Cup and JFlex

Relevant Link:

https://github.com/oliverklee/phpparser/blob/master/src/spec/php.cup
https://github.com/oliverklee/phpparser/blob/master/src/spec/php.jflex
https://github.com/oliverklee/phpparser

 

Copyright (c) 2015 LittleHann All rights reserved

 
