Classification of heterogeneous text data for robust domain-specific language modeling

EURASIP Journal on Audio, Speech, and Music Processing

Table 2 Statistics on the text corpora

Text corpus	Tokens	Sentences	Documents
Slovak web corpus	748,854,697	50,694,708	2,803,412
Corpus of newspapers	554,593,113	36,326,920	2,022,483
Corpus of legal texts	565,140,401	18,524,094	1,503,271
Corpus of fiction texts	101,234,475	8,039,739	367,956
Corpus of contemporary blogs	55,711,674	4,071,165	211,533
Development data set	55,163,941	1,782,333	165,577
Speech annotations	4,434,217	485,800	5,520
Total	2,085,132,518	119,924,759	7,079,752
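
The Total row of Table 2 can be checked by summing each column over the seven corpora. The sketch below is only an illustration and is not part of the original work: the figures are copied from the table, while the names ROWS, REPORTED_TOTAL, column_sums, and tokens_per_sentence are hypothetical helpers introduced here. It also derives the average sentence length in tokens for each corpus, a quantity the table implies but does not list.

```python
# Rows copied from Table 2: (corpus, tokens, sentences, documents).
# All identifiers below are illustrative assumptions, not from the paper.
ROWS = [
    ("Slovak web corpus", 748_854_697, 50_694_708, 2_803_412),
    ("Corpus of newspapers", 554_593_113, 36_326_920, 2_022_483),
    ("Corpus of legal texts", 565_140_401, 18_524_094, 1_503_271),
    ("Corpus of fiction texts", 101_234_475, 8_039_739, 367_956),
    ("Corpus of contemporary blogs", 55_711_674, 4_071_165, 211_533),
    ("Development data set", 55_163_941, 1_782_333, 165_577),
    ("Speech annotations", 4_434_217, 485_800, 5_520),
]

# Total row as reported in Table 2: (tokens, sentences, documents).
REPORTED_TOTAL = (2_085_132_518, 119_924_759, 7_079_752)


def column_sums(rows):
    """Sum the tokens, sentences, and documents columns over all rows."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


def tokens_per_sentence(rows):
    """Average sentence length in tokens for each corpus."""
    return {name: tokens / sentences for name, tokens, sentences, _ in rows}


if __name__ == "__main__":
    sums = column_sums(ROWS)
    print("column sums match reported total:", sums == REPORTED_TOTAL)
    for name, avg in tokens_per_sentence(ROWS).items():
        print(f"{name}: {avg:.2f} tokens per sentence")
```

Keeping the rows as plain tuples makes it easy to compare the computed sums against the reported totals in one expression; the per-corpus averages are derived values and should be read as rough descriptive figures only.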