Table 9 Lexicon sizes of different tokenization algorithms

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

| Segmentation | Lexicon size |
|--------------|--------------|
| Word         | 79,947       |
| Morfessor    | 10,545       |
| BPE          | 9,986        |
| Unigram      | 19,564       |
| Syllable     | 6,279        |
| S-BPE        | 15,926       |