Table 2 Statistics on the text corpora

From: Classification of heterogeneous text data for robust domain-specific language modeling

| Text corpus | Tokens | Sentences | Documents |
|---|---:|---:|---:|
| Slovak web corpus | 748,854,697 | 50,694,708 | 2,803,412 |
| Corpus of newspapers | 554,593,113 | 36,326,920 | 2,022,483 |
| Corpus of legal texts | 565,140,401 | 18,524,094 | 1,503,271 |
| Corpus of fiction texts | 101,234,475 | 8,039,739 | 367,956 |
| Corpus of contemporary blogs | 55,711,674 | 4,071,165 | 211,533 |
| Development data set | 55,163,941 | 1,782,333 | 165,577 |
| Speech annotations | 4,434,217 | 485,800 | 5,520 |
| Total | 2,085,132,518 | 119,924,759 | 7,079,752 |
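
As a sanity check on the Total row, the per-corpus counts can be summed column by column. The sketch below is illustrative only: the figures are copied from Table 2, while the names `ROWS`, `REPORTED_TOTAL`, `column_sums`, and `tokens_per_sentence` are hypothetical helpers introduced here and are not part of the original work.

```python
# Rows copied from Table 2: (corpus, tokens, sentences, documents).
ROWS = [
    ("Slovak web corpus", 748_854_697, 50_694_708, 2_803_412),
    ("Corpus of newspapers", 554_593_113, 36_326_920, 2_022_483),
    ("Corpus of legal texts", 565_140_401, 18_524_094, 1_503_271),
    ("Corpus of fiction texts", 101_234_475, 8_039_739, 367_956),
    ("Corpus of contemporary blogs", 55_711_674, 4_071_165, 211_533),
    ("Development data set", 55_163_941, 1_782_333, 165_577),
    ("Speech annotations", 4_434_217, 485_800, 5_520),
]

# Total row as reported in Table 2: (tokens, sentences, documents).
REPORTED_TOTAL = (2_085_132_518, 119_924_759, 7_079_752)


def column_sums(rows):
    """Sum the tokens, sentences, and documents columns over all rows."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


def tokens_per_sentence(rows):
    """Derived statistic (not in the table): mean sentence length per corpus."""
    return {name: tokens / sentences for name, tokens, sentences, _ in rows}


# Compare the recomputed sums with the reported Total row.
print("totals match:", column_sums(ROWS) == REPORTED_TOTAL)

# Print the derived mean sentence length for each corpus.
for name, ratio in tokens_per_sentence(ROWS).items():
    print(f"{name}: {ratio:.1f} tokens per sentence")
```

The tokens-per-sentence ratio is a derived quantity that the table itself does not report; it is included only to show how the three columns relate to one another.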