[Translation] Database basics: writing a SQL database from scratch in Go — part 1

Posted by suhanyujie on 2020-03-18

Database basics: writing a SQL database from scratch in Go —— part 1

In this series we'll write a rudimentary database from scratch in Go. Project source code is available on GitHub.

In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL.

We'll be able to support the following interaction:

$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 |  Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil |  1 |
| Kate |  2 |
ok

The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parentheses), identifiers (like a table name), and numeric or string literals.

Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.
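
Before diving in, here is a minimal sketch of the REPL loop we are aiming for, assuming the Parse entrypoint defined later in this post. It only parses each line and reports what it found, standing in for the real loop that will hand the AST to the in-memory backend.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// replSketch is illustrative only: it reads one statement per line, parses it,
// and reports the result instead of executing it against a backend.
func replSketch() {
    fmt.Println("Welcome to gosql.")
    scanner := bufio.NewScanner(os.Stdin)
    for {
        fmt.Print("# ")
        if !scanner.Scan() {
            break
        }

        ast, err := Parse(strings.NewReader(scanner.Text()))
        if err != nil {
            fmt.Println(err)
            continue
        }

        fmt.Printf("ok (%d statement(s))\n", len(ast.Statements))
    }
}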


This post assumes a basic understanding of parsing concepts. We won't skip any code, but we also won't go into great detail on why we structure things the way we do.

For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.


Lexing

The lexer is responsible for finding every distinct group of characters in source code: tokens. These will consist primarily of identifiers, numbers, strings, and symbols.

The gist of the logic will be to iterate over the source string and collect characters until we find a delimiting character such as a space or comma. In this first pass, we'll pretend users don't insert delimiting characters into strings. Once we've reached a delimiting character, we'll "finalize" the token and decide whether it is valid or not. For example, INSERT INTO users VALUES (1, 'Phil'); should finalize into the keywords insert, into, and values, the identifier users, the number 1, the string 'Phil', and the symbol tokens for the parentheses, comma, and semicolon.

First off, we'll define a few types and constants for use in lexer.go:

package main

import (
    "fmt"
    "io"
    "strings"
)

type location struct {
    line uint
    col  uint
}

type keyword string

const (
    selectKeyword keyword = "select"
    fromKeyword   keyword = "from"
    asKeyword     keyword = "as"
    tableKeyword  keyword = "table"
    createKeyword keyword = "create"
    insertKeyword keyword = "insert"
    intoKeyword   keyword = "into"
    valuesKeyword keyword = "values"
    intKeyword    keyword = "int"
    textKeyword   keyword = "text"
)

type symbol string

const (
    semicolonSymbol  symbol = ";"
    asteriskSymbol   symbol = "*"
    commaSymbol      symbol = ","
    leftparenSymbol  symbol = "("
    rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
    keywordKind tokenKind = iota
    symbolKind
    identifierKind
    stringKind
    numericKind
)

type token struct {
    value string
    kind  tokenKind
    loc   location
}

func (t *token) equals(other *token) bool {
    return t.value == other.value && t.kind == other.kind
}

func (t *token) finalize() bool {
    return true
}

Next we'll write out the main loop:

func lex(source io.Reader) ([]*token, error) {
    buf := make([]byte, 1)
    tokens := []*token{}
    current := token{}
    var line uint = 0
    var col uint = 0

    for {
        _, err := source.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }

        // Add semi-colon for EOF
        var c byte = ';'
        if err == nil {
            c = buf[0]
        }

        switch c {
        case '\n':
            line++
            col = 0
            continue
        case ' ':
            fallthrough
        case ',':
            fallthrough
        case '(':
            fallthrough
        case ')':
            fallthrough
        case ';':
            if !current.finalize() {
                return nil, fmt.Errorf("Unexpected token '%s' at %d:%d", current.value, current.loc.line, current.loc.col)
            }

            if current.value != "" {
                copy := current
                tokens = append(tokens, &copy)
            }

            if c == ';' || c == ',' || c == '(' || c == ')' {
                tokens = append(tokens, &token{
                    loc:   location{col: col, line: line},
                    value: string(c),
                    kind:  symbolKind,
                })
            }

            current = token{}
            current.loc.col = col
            current.loc.line = line
        default:
            current.value += string(c)
        }

        if err == io.EOF {
            break
        }
        col++
    }

    return tokens, nil
}
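
As a quick sanity check, a throwaway scratch file like the one below (not part of the project) dumps what this loop produces for a single statement; the kinds only become meaningful once the finalize helpers below are written. Note that the EOF handling above always appends a final semicolon token, so one shows up even if the input already ends with one.

package main

import (
    "fmt"
    "strings"
)

func main() {
    tokens, err := lex(strings.NewReader("CREATE TABLE users (id INT, name TEXT);"))
    if err != nil {
        panic(err)
    }

    // Prints location, kind (a tokenKind number), and value for every token.
    for _, t := range tokens {
        fmt.Printf("%d:%d kind=%d value=%q\n", t.loc.line, t.loc.col, t.kind, t.value)
    }
}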

Last, we'll write a finalizer helper for each kind of fundamental token and check each one in a reasonable order.

Validating numbers

Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.

func (t *token) finalizeNumeric() bool {
    if len(t.value) == 0 {
        return false
    }

    periodFound := false
    expMarkerFound := false

    i := 0
    for i < len(t.value) {
        c := t.value[i]

        isDigit := c >= '0' && c <= '9'
        isPeriod := c == '.'
        isExpMarker := c == 'e'

        // Must start with a digit or period
        if i == 0 {
            if !isDigit && !isPeriod {
                return false
            }

            periodFound = isPeriod
            i++
            continue
        }

        if isPeriod {
            if periodFound {
                return false
            }

            periodFound = true
            i++
            continue
        }

        if isExpMarker {
            if expMarkerFound {
                return false
            }

            // No periods allowed after expMarker
            periodFound = true
            expMarkerFound = true

            // expMarker must be followed by digits
            if i == len(t.value)-1 {
                return false
            }

            cNext := t.value[i+1]
            if cNext == '-' || cNext == '+' {
                i++
            }

            i++
            continue
        }

        if !isDigit {
            return false
        }

        i++
    }

    t.kind = numericKind
    return true
}
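
To convince ourselves this matches the grammar, a small table-driven sketch (not part of the project's tests) can exercise the interesting cases:

package main

import "fmt"

func main() {
    cases := map[string]bool{
        "105":    true,  // plain integer
        "105.7":  true,  // fraction
        ".5":     true,  // may start with a period
        "1e5":    true,  // exponent
        "1.2e-3": true,  // signed exponent
        "e5":     false, // must start with a digit or period
        "1e":     false, // expMarker must be followed by digits
        "1.2.3":  false, // only one period allowed
    }

    for value, want := range cases {
        t := token{value: value}
        fmt.Printf("%-8q valid=%v want=%v\n", value, t.finalizeNumeric(), want)
    }
}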

Validating strings

Strings must start and end with a single apostrophe. But once we identify a token as a string, we'll rewrite the value to drop the apostrophes so the rest of the project can use it more easily.

func (t *token) finalizeString() bool {
    if len(t.value) == 0 {
        return false
    }

    if t.value[0] == '\'' && t.value[len(t.value)-1] == '\'' {
        t.kind = stringKind
        t.value = t.value[1 : len(t.value)-1]
        return true
    }

    return false
}

Validating symbols and keywords

Symbols and keywords come from a fixed set of strings, so they're easy to compare against.

func (t *token) finalizeSymbol() bool {
    switch t.value {
    case "*":
        break
    case ";":
        break
    case "(":
        break
    case ")":
        break
    default:
        return false
    }

    t.kind = symbolKind
    return true
}

func (t *token) finalizeKeyword() bool {
    switch strings.ToLower(t.value) {
    case "select":
        break
    case "from":
        break
    case "as":
        break
    case "table":
        break
    case "create":
        break
    case "insert":
        break
    case "into":
        break
    case "values":
        break
    case "int":
        break
    case "text":
        break
    default:
        return false
    }

    t.value = strings.ToLower(t.value)
    t.kind = keywordKind
    return true
}

Validating identifiers

Now we can finish up the original finalize function and assume any token not matching one of the above is a valid identifier.

func (t *token) finalizeIdentifier() bool {
    t.kind = identifierKind
    return true
}

func (t *token) finalize() bool {
    if t.finalizeSymbol() {
        return true
    }

    if t.finalizeKeyword() {
        return true
    }

    if t.finalizeNumeric() {
        return true
    }

    if t.finalizeString() {
        return true
    }

    if t.finalizeIdentifier() {
        return true
    }

    return false
}
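
The order matters here: finalizeIdentifier accepts anything, so the symbol and keyword checks have to run before that fallback, otherwise select or * would come out as identifiers. A small sketch (again, not project code) of what finalize decides for a few raw values:

package main

import "fmt"

func main() {
    names := map[tokenKind]string{
        keywordKind:    "keyword",
        symbolKind:     "symbol",
        identifierKind: "identifier",
        stringKind:     "string",
        numericKind:    "numeric",
    }

    for _, value := range []string{"SELECT", "*", "users", "'Phil'", "2.5"} {
        t := token{value: value}
        t.finalize()
        // "SELECT" is lower-cased and tagged as a keyword, "users" falls through
        // to the identifier case, and "'Phil'" has its apostrophes stripped.
        fmt.Printf("%-8s -> %-10s %q\n", value, names[t.kind], t.value)
    }
}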

And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.

AST model

At the highest level, an AST is a collection of statements:

package main

type Ast struct {
    Statements []*Statement
}

A statement, for now, is one of INSERT, CREATE, or SELECT:

type AstKind uint

const (
    SelectKind AstKind = iota
    CreateTableKind
    InsertKind
)

type Statement struct {
    SelectStatement      *SelectStatement
    CreateTableStatement *CreateTableStatement
    InsertStatement      *InsertStatement
    Kind                 AstKind
}

Insert

An insert statement, for now, has a table name and a list of values to insert:

type InsertStatement struct {
    table  token
    values *[]*expression
}

An expression is a literal token or (in the future) a function call or inline operation:

type expressionKind uint

const (
    literalKind expressionKind = iota
)

type expression struct {
    literal *token
    kind    expressionKind
}

CREATE

A create statement, for now, has a table name and a list of column names and types:

type columnDefinition struct {
    name     token
    datatype token
}

type CreateTableStatement struct {
    name token
    cols *[]*columnDefinition
}

SELECT

A select statement, for now, has a table name and a list of column names:

type SelectStatement struct {
    item []*expression
    from token
}

And that's it for the AST.
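
To see how these pieces nest, here is a hand-built AST for INSERT INTO users VALUES (1, 'Phil');. This is purely illustrative; constructing these structures is the parser's job.

package main

// exampleInsertAst hand-builds the AST for: INSERT INTO users VALUES (1, 'Phil');
// It exists only to show how the structs nest.
func exampleInsertAst() *Ast {
    values := []*expression{
        {literal: &token{value: "1", kind: numericKind}, kind: literalKind},
        {literal: &token{value: "Phil", kind: stringKind}, kind: literalKind},
    }

    return &Ast{
        Statements: []*Statement{
            {
                Kind: InsertKind,
                InsertStatement: &InsertStatement{
                    table:  token{value: "users", kind: identifierKind},
                    values: &values,
                },
            },
        },
    }
}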

Parsing

The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semi-colon, until it reaches the last token.

In general, our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.
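
Every helper below follows that same calling convention, so it is worth spelling out once with a hypothetical helper (illustrative only; it mirrors the real parseToken further down): take the token list and a starting cursor, then return the parsed result, the cursor to resume from, and a success flag, handing the original cursor back untouched on failure so the caller can try something else.

package main

// parseAnyKeyword is illustrative only: it shows the cursor-in, cursor-out
// convention every parse helper in this post uses.
func parseAnyKeyword(tokens []*token, initialCursor uint) (*token, uint, bool) {
    cursor := initialCursor
    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    if tokens[cursor].kind == keywordKind {
        // Consumed one token: the caller should continue from cursor + 1.
        return tokens[cursor], cursor + 1, true
    }

    // No match: hand back the cursor we were given.
    return nil, initialCursor, false
}

With that convention in mind, here are the Parse entrypoint and its helpers: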

package main

import (
    "errors"
    "fmt"
    "io"
)

func tokenFromKeyword(k keyword) token {
    return token{
        kind:  keywordKind,
        value: string(k),
    }
}

func tokenFromSymbol(s symbol) token {
    return token{
        kind:  symbolKind,
        value: string(s),
    }
}

func expectToken(tokens []*token, cursor uint, t token) bool {
    if cursor >= uint(len(tokens)) {
        return false
    }

    return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor < uint(len(tokens)) {
        c = tokens[cursor]
    } else {
        c = tokens[cursor-1]
    }

    fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

func Parse(source io.Reader) (*Ast, error) {
    tokens, err := lex(source)
    if err != nil {
        return nil, err
    }

    a := Ast{}
    cursor := uint(0)
    for cursor < uint(len(tokens)) {
        stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected statement")
            return nil, errors.New("Failed to parse, expected statement")
        }
        cursor = newCursor

        a.Statements = append(a.Statements, stmt)

        atLeastOneSemicolon := false
        for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
            cursor++
            atLeastOneSemicolon = true
        }

        if !atLeastOneSemicolon {
            helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
            return nil, errors.New("Missing semi-colon between statements")
        }
    }

    return &a, nil
}

Parsing statements

Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper for each of these statement types and return true if one of them succeeds in parsing.

func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
    cursor := initialCursor

    // Look for a SELECT statement
    semicolonToken := tokenFromSymbol(semicolonSymbol)
    slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            SelectKind,
            SelectStatement: slct,
        }, newCursor, true
    }

    // Look for an INSERT statement
    inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            InsertKind,
            InsertStatement: inst,
        }, newCursor, true
    }

    // Look for a CREATE statement
    crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:                 CreateTableKind,
            CreateTableStatement: crtTbl,
        }, newCursor, true
    }

    return nil, initialCursor, false
}

Parsing select statements

Parsing SELECT statements is easy. We'll look for the following token pattern:

1. SELECT
2. $expression [, ...]
3. FROM
4. $table-name

Sketching that out, we get:

func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
    cursor := initialCursor
    if !expectToken(tokens, cursor, tokenFromKeyword(selectKeyword)) {
        return nil, initialCursor, false
    }
    cursor++

    slct := SelectStatement{}

    exps, newCursor, ok := parseExpressions(tokens, cursor, []token{tokenFromKeyword(fromKeyword), delimiter})
    if !ok {
        return nil, initialCursor, false
    }

    slct.item = *exps
    cursor = newCursor

    if expectToken(tokens, cursor, tokenFromKeyword(fromKeyword)) {
        cursor++

        from, newCursor, ok := parseToken(tokens, cursor, identifierKind)
        if !ok {
            helpMessage(tokens, cursor, "Expected FROM token")
            return nil, initialCursor, false
        }

        slct.from = *from
        cursor = newCursor
    }

    return &slct, cursor, true
}

The parseToken helper will look for a token of a particular token kind.

func parseToken(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
    cursor := initialCursor

    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    current := tokens[cursor]
    if current.kind == kind {
        return current, cursor + 1, true
    }

    return nil, initialCursor, false
}

The parseExpressions helper will look for tokens separated by a comma until a delimiter is found. It will use existing helpers plus parseExpression.

func parseExpressions(tokens []*token, initialCursor uint, delimiters []token) (*[]*expression, uint, bool) {
    cursor := initialCursor

    exps := []*expression{}
outer:
    for {
        if cursor >= uint(len(tokens)) {
            return nil, initialCursor, false
        }

        // Look for delimiter
        current := tokens[cursor]
        for _, delimiter := range delimiters {
            if delimiter.equals(current) {
                break outer
            }
        }

        // Look for comma
        if len(exps) > 0 {
            if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
                helpMessage(tokens, cursor, "Expected comma")
                return nil, initialCursor, false
            }

            cursor++
        }

        // Look for expression
        exp, newCursor, ok := parseExpression(tokens, cursor, tokenFromSymbol(commaSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected expression")
            return nil, initialCursor, false
        }
        cursor = newCursor

        exps = append(exps, exp)
    }

    return &exps, cursor, true
}

The parseExpression helper (for now) will look for a numeric, string, or identifier token.

func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
    cursor := initialCursor

    kinds := []tokenKind{identifierKind, numericKind, stringKind}
    for _, kind := range kinds {
        t, newCursor, ok := parseToken(tokens, cursor, kind)
        if ok {
            return &expression{
                literal: t,
                kind:    literalKind,
            }, newCursor, true
        }
    }

    return nil, initialCursor, false
}

And that's it for parsing a SELECT statement!
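
Putting the lexer and the select parser together, a scratch check like this (not part of the project) shows the two halves cooperating end to end; the INSERT and CREATE parsers follow the same pattern.

package main

import (
    "fmt"
    "strings"
)

func main() {
    tokens, err := lex(strings.NewReader("SELECT id, name FROM users;"))
    if err != nil {
        panic(err)
    }

    slct, _, ok := parseSelectStatement(tokens, 0, tokenFromSymbol(semicolonSymbol))
    if !ok {
        panic("failed to parse select statement")
    }

    for _, exp := range slct.item {
        fmt.Println("column:", exp.literal.value)
    }
    fmt.Println("table:", slct.from.value)
}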

--- To be continued
