Table 4 Text resources for language model training

From: Classification-based spoken text selection for LVCSR language modeling

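| Corpus        | Number of utterances | Number of tokens | Vocabulary size |
|---------------|----------------------|------------------|-----------------|
| HIT-BTEC      | 159,718              | 1,745,680        | 20,307          |
| BEST          | 410,648              | 7,818,410        | 110,334         |
| Web-blog      | 1,380,932            | 58,698,866       | 449,743         |
| LOTUS-BN (TR) | 50,187               | 929,810          | 35,327          |
| ALL           | 2,001,485            | 69,192,766       | 615,711         |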
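
As a quick sanity check on the figures in Table 4, the sketch below sums the four per-corpus rows and compares the result with the ALL row, then derives the average number of tokens per utterance for each corpus. The variable names are illustrative only, and the average is a derived quantity that the table itself does not report.

```python
# Figures copied from Table 4: (utterances, tokens, vocabulary size).
corpora = {
    "HIT-BTEC": (159_718, 1_745_680, 20_307),
    "BEST": (410_648, 7_818_410, 110_334),
    "Web-blog": (1_380_932, 58_698_866, 449_743),
    "LOTUS-BN (TR)": (50_187, 929_810, 35_327),
}
all_row = (2_001_485, 69_192_766, 615_711)

# Column-wise sums over the four individual corpora.
totals = tuple(sum(column) for column in zip(*corpora.values()))

# Derived statistic (not in the table): mean tokens per utterance.
avg_tokens = {
    name: tokens / utterances
    for name, (utterances, tokens, _vocab) in corpora.items()
}

if __name__ == "__main__":
    print("column sums match ALL row:", totals == all_row)
    for name, avg in avg_tokens.items():
        print(f"{name}: {avg:.2f} tokens per utterance")
```

All three columns of the ALL row, including the vocabulary size, equal the plain sums of the per-corpus values.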