Table 2 Details of the orthographic language corpus content

From: Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

No.  Component type    No. of unique components  No. of components in the corpus
1    single words                     1,943,462                      230,301,313
2    2-word sequences                75,395,184                      246,110,034
3    3-word sequences               170,180,746                      246,066,692
4    4-word sequences               217,586,930                      246,023,356
5    5-word sequences               232,439,967                      245,980,021
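
The two count columns can be illustrated with a minimal Python sketch that tallies distinct and total n-word sequences. It is not the authors' procedure: the whitespace tokenization, the rule that sequences do not cross sentence boundaries, the helper name `ngram_counts`, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Return (unique, total) counts of n-word sequences.

    Assumptions (not taken from the source): words are split on
    whitespace, and sequences never cross sentence boundaries.
    """
    counter = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Slide a window of length n over the sentence.
        for i in range(len(words) - n + 1):
            counter[tuple(words[i:i + n])] += 1
    return len(counter), sum(counter.values())


# Hypothetical toy corpus, used only to show the two kinds of counts.
toy_corpus = [
    "ala ma kota",
    "ala ma psa",
    "kot ma ale",
]

for n in range(1, 4):
    unique, total = ngram_counts(toy_corpus, n)
    print(f"{n}-word sequences: unique={unique}, total={total}")
```

Even in this toy corpus the share of unique sequences rises with sequence length, which matches the trend in the table: unique single words are under 1% of all single-word occurrences, while unique 5-word sequences make up about 94% of all 5-word sequences.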