EURASIP Journal on Audio, Speech, and Music Processing

Table 9 Lexicon sizes of different tokenization algorithms

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

Segmentation	Lexicon size
Word	79,947
Morfessor	10,545
BPE	9,986
Unigram	19,564
Syllable	6,279
S-BPE	15,926
