From: Classification-based spoken text selection for LVCSR language modeling
Corpus | Number of utterances | Number of tokens | Vocabulary size |
---|---|---|---|
HIT-BTEC | 159,718 | 1,745,680 | 20,307 |
BEST | 410,648 | 7,818,410 | 110,334 |
Web-blog | 1,380,932 | 58,698,866 | 449,743 |
LOTUS-BN (TR) | 50,187 | 929,810 | 35,327 |
ALL | 2,001,485 | 69,192,766 | 615,711 |
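
The three columns above can be computed in one pass over a corpus. The following is a minimal sketch, assuming each utterance is a string that has already been word-segmented with whitespace between tokens. The function name `corpus_stats` and the whitespace convention are illustrative assumptions; the source does not describe the tokenization actually used.

```python
from typing import Dict, Iterable


def corpus_stats(utterances: Iterable[str]) -> Dict[str, int]:
    """Count utterances, tokens, and distinct token types.

    Assumption: tokens are separated by whitespace, i.e. any word
    segmentation has been applied beforehand.
    """
    n_utterances = 0
    n_tokens = 0
    vocabulary = set()
    for utterance in utterances:
        tokens = utterance.split()
        if not tokens:
            # Skip empty or whitespace-only lines.
            continue
        n_utterances += 1
        n_tokens += len(tokens)
        vocabulary.update(tokens)
    return {
        "utterances": n_utterances,
        "tokens": n_tokens,
        "vocabulary": len(vocabulary),
    }


if __name__ == "__main__":
    # Toy input, not drawn from any of the corpora in the table.
    print(corpus_stats(["a b c", "a b", ""]))
```

Under this definition the utterance and token counts add across corpora, while the vocabulary of a combined corpus is the size of the union of the individual vocabularies.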