Table 2 Details of the orthographic language corpus content

From: Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

| No. | Component type | No. of unique components | No. of components in the corpus |
|---|---|---|---|
| 1 | single words | 1,943,462 | 230,301,313 |
| 2 | 2-word sequences | 75,395,184 | 246,110,034 |
| 3 | 3-word sequences | 170,180,746 | 246,066,692 |
| 4 | 4-word sequences | 217,586,930 | 246,023,356 |
| 5 | 5-word sequences | 232,439,967 | 245,980,021 |
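
The two count columns distinguish distinct sequences from all occurrences of sequences. The following minimal Python sketch shows one way such a pair of counts could be obtained for word sequences of a given length. It is an illustration only: the function name `ngram_counts`, the whitespace tokenization, and the toy sentence are assumptions, not the authors' actual processing pipeline, and the sketch ignores sentence boundaries and any text normalization the corpus may have used.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return (unique, total) counts of n-word sequences in a token list.

    Hypothetical helper for illustration; not the authors' tooling.
    """
    # Slide a window of length n over the tokens and tally each sequence.
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(grams), sum(grams.values())


# Toy input (assumed), split on whitespace; not taken from the corpus.
tokens = "ala ma kota a kot ma ale ala ma kota".split()

for n in range(1, 4):
    unique, total = ngram_counts(tokens, n)
    print(f"{n}-word sequences: unique={unique}, total={total}")
```

On real data the share of unique sequences grows with sequence length, which matches the trend in the table, where nearly every 5-word sequence occurs only once.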