From: Classification of heterogeneous text data for robust domain-specific language modeling
Text corpus | Tokens | Sentences | Documents |
---|---:|---:|---:|
Slovak web corpus | 748,854,697 | 50,694,708 | 2,803,412 |
Corpus of newspapers | 554,593,113 | 36,326,920 | 2,022,483 |
Corpus of legal texts | 565,140,401 | 18,524,094 | 1,503,271 |
Corpus of fiction texts | 101,234,475 | 8,039,739 | 367,956 |
Corpus of contemporary blogs | 55,711,674 | 4,071,165 | 211,533 |
Development data set | 55,163,941 | 1,782,333 | 165,577 |
Speech annotations | 4,434,217 | 485,800 | 5,520 |
Total | 2,085,132,518 | 119,924,759 | 7,079,752 |