Lexical Analyzer

Posted by bxtkezhan on 2021-05-08

Original URL: http://www.ituring.com.cn/article/519731

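From Wikipedia: in computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer (lexer).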

Given the following source code

add_result = 1 + 2

lexical analysis produces the following result (token type, literal text, row, column)

NAME   `add_result` 0,  0
SYMBOL `=`          0, 11
INT    `1`          0, 13
SYMBOL `+`          0, 15
INT    `2`          0, 17

A usable lexer is easy to implement in Go.

Implementing the lexer in Go

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
    "os"
)

var exprs = []string{"\\d+", "[\\p{L}\\d_]+", "[\\+\\-=]"}
var names = []string{"INT",  "NAME",         "SYMBOL"}

func main() {
    rules := []*regexp.Regexp{}
    for i, expr := range exprs {
        rule := regexp.MustCompile("^" + expr) // panics on an invalid pattern instead of ignoring the error
        rules = append(rules, rule)
        fmt.Println(names[i], rule)
    }

    fmt.Println("--------------------------------")
    for row, code := range os.Args[1:] {
        position := 0
        col := 0
        for {
            for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
                position += 1
                col += 1
            }
            if position >= len(code) {
                break
            }
            source := ""
            tokenType := -1
            for i, rule := range rules {
                source = rule.FindString(code[position:])
                if source != "" {
                    tokenType = i
                    break
                }
            }
            if tokenType >= 0 {
                fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
                position += len(source)
                col += utf8.RuneCountInString(source)
            } else {
                fmt.Printf("error in: %d, %d\n", row, col)
                break
            }
        }
    }

}

Run a test from the command line

➜ go run lexer.go "數值 = PI + 100"
INT        ^\d+
NAME    ^[\p{L}\d_]+
SYMBOL    ^[\+\-=]
--------------------------------
NAME    `數值`    0    0
SYMBOL    `=`        0    3
NAME    `PI`    0    5
SYMBOL    `+`        0    8
INT        `100`    0    10

Explanation of the Go code

Import the required packages:

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
    "os"
)

fmt: formatted printing
regexp: regular expressions
unicode/utf8: counting UTF-8 runes
os: reading the user's input (command-line arguments)

Define the regular expressions and the token type names:

var exprs = []string{"\\d+", "[\\p{L}\\d_]+", "[\\+\\-=]"}
var names = []string{"INT",  "NAME",         "SYMBOL"}

Two string slices hold the regular expressions and the corresponding token type names; entries at the same index belong together.

Initialize the token matching rules:

func main() {
    rules := []*regexp.Regexp{}
    for i, expr := range exprs {
        rule := regexp.MustCompile("^" + expr) // panics on an invalid pattern instead of ignoring the error
        rules = append(rules, rule)
        fmt.Println(names[i], rule)
    }

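Note that every regular expression must be prefixed with ^ so that a match has to begin at the leftmost character of the remaining input. This prevents "skip-ahead matching", where the regexp engine jumps over unmatched characters and finds a token further to the right.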
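The effect of the anchor can be checked in isolation. The following is a minimal sketch, not part of the lexer above: the helper name `firstMatch` is made up for this illustration, and the input `= 42` is just a sample. It compares the INT pattern with and without the leading `^`.

```go
package main

import (
    "fmt"
    "regexp"
)

// firstMatch is a hypothetical helper for this illustration: it returns
// the leftmost match of pattern in input, or "" if there is none.
func firstMatch(pattern, input string) string {
    return regexp.MustCompile(pattern).FindString(input)
}

func main() {
    input := "= 42"
    // Without the anchor the engine skips "= " and finds "42" further right.
    fmt.Printf("unanchored: %q\n", firstMatch(`\d+`, input))
    // With the anchor the match must start at the first character, so it fails.
    fmt.Printf("anchored:   %q\n", firstMatch(`^\d+`, input))
}
```

Without the anchor, the lexer would report `42` as an INT located at the position of `=` and then advance by only two bytes, so both the token stream and the column numbers would be wrong.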

Loop over the input and match tokens:

for row, code := range os.Args[1:] {
    position := 0
    col := 0
    for {
        for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
            position += 1
            col += 1
        }
        if position >= len(code) {
            break
        }
        source := ""
        tokenType := -1
        for i, rule := range rules {
            source = rule.FindString(code[position:])
            if source != "" {
                tokenType = i
                break
            }
        }
        if tokenType >= 0 {
            fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
            position += len(source)
            col += utf8.RuneCountInString(source)
        } else {
            fmt.Printf("error in: %d, %d\n", row, col)
            break
        }
    }
}

By iterating over os.Args[1:], each command-line argument is treated as one line of code and lexed separately.

Skip (ignore) whitespace:

for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
    position += 1
    col += 1
}

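Because our regular expressions must match starting at the leftmost character, whitespace, which usually carries no meaning, has to be skipped first.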
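The skipping step can also be written as a small function, which makes it easy to try on its own. This is only a sketch: `skipBlanks` is a name chosen here for illustration and does not appear in the original program.

```go
package main

import "fmt"

// skipBlanks is a hypothetical helper: starting at byte offset position,
// it returns the offset of the first byte that is neither a space nor a tab.
func skipBlanks(code string, position int) int {
    for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
        position++
    }
    return position
}

func main() {
    code := " \t x = 1"
    next := skipBlanks(code, 0)
    fmt.Println(next, code[next:]) // 3 x = 1
}
```

Comparing single bytes is safe here even for UTF-8 input, because space and tab are one-byte characters and never occur inside a multi-byte sequence. For the same reason the byte offset and the column can be advanced by the same amount while skipping.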

Check whether the loop should end:

if position >= len(code) {
    break
}

Try each matching rule in turn:

source := ""
tokenType := -1
for i, rule := range rules {
    source = rule.FindString(code[position:])
    if source != "" {
        tokenType = i
        break
    }
}

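The loop tries the configured rules in order. On the first successful match, tokenType is set to the index of that rule; if no rule matches, tokenType keeps its default value of -1.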
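Since the first matching rule wins, the order of the rules matters. The sketch below reuses the three patterns from the article in a standalone function to show this; the function name `matchToken` and the sample inputs are assumptions made for this illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

var names = []string{"INT", "NAME", "SYMBOL"}

var rules = []*regexp.Regexp{
    regexp.MustCompile(`^\d+`),
    regexp.MustCompile(`^[\p{L}\d_]+`),
    regexp.MustCompile(`^[\+\-=]`),
}

// matchToken is a hypothetical helper: it tries the rules in order and
// returns the index of the first rule that matches the start of rest,
// together with the matched text. It returns -1 and "" when nothing matches.
func matchToken(rest string) (int, string) {
    for i, rule := range rules {
        if source := rule.FindString(rest); source != "" {
            return i, source
        }
    }
    return -1, ""
}

func main() {
    for _, rest := range []string{"100", "x100", "100x", "*"} {
        tokenType, source := matchToken(rest)
        if tokenType < 0 {
            fmt.Printf("%q: no rule matches\n", rest)
            continue
        }
        fmt.Printf("%q: %s %q\n", rest, names[tokenType], source)
    }
}
```

With this order, `x100` is a single NAME, while `100x` starts with the INT `100` and leaves `x` for the next round. Because the NAME pattern also accepts digits, placing it before INT would make every number come out as a NAME.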

Decide what to do next based on the match result:

if tokenType >= 0 {
    fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
    position += len(source)
    col += utf8.RuneCountInString(source)
} else {
    fmt.Printf("error in: %d, %d\n", row, col)
    break
}

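If tokenType is not -1, the match succeeded: the program prints the token type name, the literal text, and the row and column, then advances position past the current token. Otherwise it prints the position of the error and stops lexing that line. Note that col, the starting column of the next token, must be advanced by the rune count from unicode/utf8 rather than by the byte length; otherwise the column would point to the wrong place whenever a token contains multi-byte UTF-8 characters.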
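The difference between the two counters is easy to see with the token `數值` from the test run. The following sketch isolates the bookkeeping in a helper; the name `advance` is hypothetical and only used here.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// advance is a hypothetical helper: given the current byte offset and
// column, it returns both values after consuming the token text source.
// The byte offset grows by the byte length, the column by the rune count.
func advance(position, col int, source string) (int, int) {
    return position + len(source), col + utf8.RuneCountInString(source)
}

func main() {
    source := "數值"
    position, col := advance(0, 0, source)
    fmt.Println(len(source), utf8.RuneCountInString(source)) // 6 2
    fmt.Println(position, col)                               // 6 2
}
```

The byte offset is what slicing `code[position:]` needs, while the rune count is what a reader expects as a column number, so the lexer keeps both.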

Python users can implement the same thing just as easily.

A lexer in Python

import re
import sys


exprs = ['\\d+', '\\w+', '[\\+\\-=]']
names = ['INT',  'NAME', 'SYMBOL']


def main():
    rules = []
    for i, expr in enumerate(exprs):
        rules.append(re.compile('^' + expr))
        print(names[i], rules[-1].pattern)

    print('-' * 32)
    for row, code in enumerate(sys.argv[1:]):
        position = 0
        while True:
            while position < len(code) and (code[position] == ' ' or code[position] == '\t'):
                position += 1
            if position >= len(code):
                break

            source = ''
            tokenType = -1
            for i, rule in enumerate(rules):
                result = rule.findall(code[position:])
                if len(result) > 0:
                    source = result[0]
                    tokenType = i
                    break
            if tokenType >= 0:
                print(f'{names[tokenType]}\t`{source}`\t{row}\t{position}')
                position += len(source)
            else:
                print(f'error in {row}, {position}')
                break


if __name__ == "__main__":
    main()

As a supplement, here is a C++ version as well.

Implementing the lexer in C++

#include <cstdio>   // printf
#include <locale>
#include <regex>
#include <string>
#include <vector>
#include <codecvt>


std::vector<std::wstring> exprs{L"\\d+", L"\\w+", L"[\\+\\-=]"};
std::vector<std::string> names{"INT",  "NAME", "SYMBOL"};


int main(int argc, char *argv[]) {
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wstring_convert<std::codecvt_utf8<wchar_t>> codecvt_utf8;

    std::vector<std::wregex> rules;
    for (size_t i = 0, count = exprs.size(); i < count; ++i) {
        rules.push_back(std::wregex(L"^" + exprs[i]));
        printf("%s ^%s\n", names[i].c_str(), codecvt_utf8.to_bytes(exprs[i]).c_str());
    }

    printf("--------------------------------\n");
    for (int row = 0; row < argc - 1; ++row) {
        std::wstring code = codecvt_utf8.from_bytes(argv[row + 1]);
        size_t position = 0;
        while (true) {
            while (position < code.size() && (code[position] == L' ' || code[position] == L'\t'))
                position += 1;
            if (position >= code.size())
                break;

            auto subcode = code.substr(position);
            std::wsmatch match;
            int tokenType = -1;
            for (size_t i = 0, count = rules.size(); i < count; ++i) {
                if (std::regex_search(subcode, match, rules[i])) {
                    tokenType = i;
                    break;
                }
            }

            if (tokenType >= 0) {
                auto source = match.str(0);
                // position is a size_t, so it is printed with %zu
                printf("%s\t`%s`\t%d\t%zu\n",
                    names[tokenType].c_str(), codecvt_utf8.to_bytes(source).c_str(), row, position);
                position += source.size();
            } else {
                printf("error in: %d, %zu\n", row, position);
                break;
            }
        }
    }

    std::locale::global(old);
    return 0;
}
