Table 12 Comparison of the WER and model size of each subword tokenization method at n-gram \(=\) 3. The relative reduction with respect to the baseline word model is also shown as a percentage

From: Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

| Segmentation | WER (%) | WER reduction | Model size (MB) | Size reduction |
| --- | --- | --- | --- | --- |
| Word (baseline) | 27.4 | – | 123 | – |
| Morfessor | 11.7 | \(\downarrow 57\%\) | 104 | \(\downarrow 15\%\) |
| BPE | 13.7 | \(\downarrow 50\%\) | 90 | \(\downarrow 26\%\) |
| Unigram | 12.6 | \(\downarrow 54\%\) | 108 | \(\downarrow 12\%\) |
| Syllable | 14.7 | \(\downarrow 46\%\) | 94 | \(\downarrow 23\%\) |
| S-BPE | 11.4 | \(\downarrow 58\%\) | 110 | \(\downarrow 11\%\) |
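As a quick sanity check, the relative reductions in the table follow the usual formula \((\text{baseline} - \text{value}) / \text{baseline} \times 100\). A minimal Python sketch (the variable names are illustrative, not from the paper; size percentages may differ by a point since the MB figures are themselves rounded):

```python
def relative_reduction(baseline, value):
    """Percent reduction of `value` with respect to `baseline`, rounded."""
    return round((baseline - value) / baseline * 100)

# WER values from Table 12; baseline is the word-level model.
baseline_wer = 27.4
wers = {"Morfessor": 11.7, "BPE": 13.7, "Unigram": 12.6,
        "Syllable": 14.7, "S-BPE": 11.4}

for name, wer in wers.items():
    print(f"{name}: \u2193{relative_reduction(baseline_wer, wer)}%")
```

Running this reproduces the WER reduction column (57%, 50%, 54%, 46%, 58%).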