From: Classification-based spoken text selection for LVCSR language modeling
Corpus | Text style | Number of utterances | Number of word tokens | Vocabulary size |
---|---|---|---|---|
LOTUS | Written | 4,887 | 90,336 | 5,112 |
LOTUS-CELL | Spoken | 55,457 | 284,498 | 9,595 |
LOTUS-SOC | Spoken/Written | 78,264 | 1,601,230 | 13,739 |
VoiceTra4U-M | Spoken | 9,899 | 30,876 | 2,141 |