Tokenizer: BPE, WordPiece, and SentencePiece

ForHHeart, published on 2024-05-15

1 Word-based Tokenizer

2 Character-based Tokenizer

3 Subword-based Tokenizer

3.1 Byte-Pair Encoding (BPE)

Byte-Level BPE

3.2 WordPiece

3.3 Unigram

3.4 SentencePiece