Table 2 Statistics on the text corpora

From: Classification of heterogeneous text data for robust domain-specific language modeling

Text corpus                          Tokens    Sentences  Documents
Slovak web corpus               748,854,697   50,694,708  2,803,412
Corpus of newspapers            554,593,113   36,326,920  2,022,483
Corpus of legal texts           565,140,401   18,524,094  1,503,271
Corpus of fiction texts         101,234,475    8,039,739    367,956
Corpus of contemporary blogs     55,711,674    4,071,165    211,533
Development data set             55,163,941    1,782,333    165,577
Speech annotations                4,434,217      485,800      5,520
Total                         2,085,132,518  119,924,759  7,079,752
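
The Total row is the column-wise sum of the seven corpora. A minimal Python sketch of that check follows; the counts are copied from the table, while the variable names and the derived tokens-per-sentence figure are illustrative assumptions and not part of the source.

```python
# Per-corpus counts copied from Table 2: (tokens, sentences, documents).
corpora = {
    "Slovak web corpus": (748_854_697, 50_694_708, 2_803_412),
    "Corpus of newspapers": (554_593_113, 36_326_920, 2_022_483),
    "Corpus of legal texts": (565_140_401, 18_524_094, 1_503_271),
    "Corpus of fiction texts": (101_234_475, 8_039_739, 367_956),
    "Corpus of contemporary blogs": (55_711_674, 4_071_165, 211_533),
    "Development data set": (55_163_941, 1_782_333, 165_577),
    "Speech annotations": (4_434_217, 485_800, 5_520),
}

# Totals as reported in the last row of the table.
reported_total = (2_085_132_518, 119_924_759, 7_079_752)

# Sum each column (tokens, sentences, documents) across all corpora.
computed_total = tuple(sum(column) for column in zip(*corpora.values()))

# Derived figure (not in the source): average sentence length in tokens.
tokens_per_sentence = {
    name: tokens / sentences
    for name, (tokens, sentences, _documents) in corpora.items()
}

print(computed_total == reported_total)  # prints True
for name, ratio in tokens_per_sentence.items():
    print(f"{name}: {ratio:.1f} tokens per sentence")
```

The same dictionary can be reused to compare corpora, for example by documents per corpus or sentences per document.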