Table 10 Sentence length statistics in terms of the number of tokens per sentence

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

Tokenization   Minimum   Maximum   Mean
Word           5         14        6.4
Morfessor      6         29        11.7
BPE            5         26        8.5
Unigram        5         29        10.1
Syllable       8         49        19.9
S-BPE          5         25        8.1
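
Statistics of this kind (minimum, maximum, and mean number of tokens per sentence) can be computed from any tokenized corpus. The following is a minimal sketch, not the authors' script: the function name `sentence_length_stats` and the toy corpus are hypothetical, and the mean is rounded to one decimal only on the assumption that this matches how the table reports it.

```python
def sentence_length_stats(tokenized_sentences):
    """Return (minimum, maximum, mean) token counts per sentence.

    `tokenized_sentences` is a list of sentences, each a list of tokens
    produced by whichever tokenization is being measured.
    """
    lengths = [len(tokens) for tokens in tokenized_sentences]
    if not lengths:
        raise ValueError("need at least one sentence")
    # Mean rounded to one decimal place (an assumption about the reporting format).
    return min(lengths), max(lengths), round(sum(lengths) / len(lengths), 1)


# Hypothetical toy corpus, whitespace-tokenized; not data from the paper.
toy_corpus = [
    "a b c d e".split(),
    "a b c d e f g".split(),
    "a b c d e f".split(),
]
stats = sentence_length_stats(toy_corpus)
print(stats)
```

Running the same function over one corpus tokenized in several different ways would give one row per tokenization, which is the layout the table uses.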