EURASIP Journal on Audio, Speech, and Music Processing

Table 10 Sentence length statistics in terms of the number of tokens per sentence

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

Tokenization	Minimum	Maximum	Mean
Word	5	14	6.4
Morfessor	6	29	11.7
BPE	5	26	8.5
Unigram	5	29	10.1
Syllable	8	49	19.9
S-BPE	5	25	8.1
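
The statistics above are per-sentence token counts summarized as a minimum, maximum, and mean. The following is a minimal sketch of how such a summary could be computed. It assumes each sentence is already tokenized into a list of tokens. The function name `sentence_length_stats` and the toy input are hypothetical and are not taken from the article.

```python
def sentence_length_stats(tokenized_sentences):
    """Return (minimum, maximum, mean) token count per sentence.

    `tokenized_sentences` is assumed to be a non-empty list of token lists,
    one list per sentence, produced by any tokenizer (word, BPE, syllable, ...).
    The mean is rounded to one decimal place.
    """
    lengths = [len(tokens) for tokens in tokenized_sentences]
    if not lengths:
        raise ValueError("need at least one sentence")
    mean_length = sum(lengths) / len(lengths)
    return min(lengths), max(lengths), round(mean_length, 1)


# Hypothetical toy input (NOT data from the article): three sentences,
# split on whitespace, with 5, 7, and 6 tokens.
toy_sentences = [
    "a b c d e".split(),
    "a b c d e f g".split(),
    "a b c d e f".split(),
]

minimum, maximum, mean_length = sentence_length_stats(toy_sentences)
print(minimum, maximum, mean_length)
```

Running the same function over one corpus tokenized in different ways would give one row per tokenization, which is the shape of the table above.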
