Table 8 Examples of different tokenization algorithms. A space is used as the delimiter between tokens. The number of tokens per sentence is also tabulated.

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

| Method | Example | Segment count |
| Word | | 3 |
| Morfessor | | 6 |
| BPE | | 6 |
| Unigram | | 5 |
| Syllable | | 9 |
| S-BPE | | 4 |