Table 8 A sample section of the developed phonemic language corpus for Polish

From: Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

| i | Number of occurrences C(w_i) (of 230,301,313) | Orthographic word w_i | Phonemic transcription of word [SAMPA] |
|---|---|---|---|
| 1 | 7,692,997 | w | [f] |
| 2 | 5,333,210 | i | [i] |
| 3 | 4,235,003 | na | [n][a] |
| 4 | 4,158,902 | z | [s] |
| 5 | 3,981,525 | się | [s’][e] |
| 6 | 3,601,719 | nie | [n’][e] |
| 7 | 2,904,114 | do | [d][o] |
| 8 | 2,205,896 | że | [Z][e] |
| 9 | 2,171,877 | to | [t][o] |
| 10 | 1,731,304 | o | [o] |
| 11 | 1,728,527 | jest | [j][e][s][t] |
| 12 | 1,425,793 | a | [a] |
| 13 | 1,003,027 | jak | [j][a][k] |
| 14 | 983,395 | po | [p][o] |
| 15 | 912,660 | od | [o][t] |
| 16 | 877,522 | ale | [a][l][e] |
| 17 | 847,373 | za | [z][a] |
| 18 | 775,006 | przez | [p][S][e][s] |
| 19 | 754,024 | co | [ts][o] |
| 20 | 663,771 | dla | [d][l][a] |
| 21 | 645,573 | czy | [tS][I] |
| 22 | 610,035 | tym | [t][I][m] |
| 23 | 607,673 | już | [j][u][S] |
| 24 | 544,343 | tak | [t][a][k] |
| 25 | 534,509 | tylko | [t][I][l][k][o] |
| 26 | 500,801 | ma | [m][a] |
| 27 | 475,172 | może | [m][o][Z][e] |
| 28 | 451,225 | tego | [t][e][g][o] |
| 29 | 445,705 | ze | [z][e] |
| 30 | 426,201 | jego | [j][e][g][o] |
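
The counts in the table are given out of 230,301,313 word occurrences, so each row can be read as a relative frequency, and each bracketed transcription as a sequence of SAMPA phoneme symbols. The following is a minimal sketch of both readings, not the authors' tooling: the names `TOTAL_OCCURRENCES`, `ROWS`, `relative_frequency` and `split_phonemes` are hypothetical, and only the first three table rows are copied in.

```python
import re

# Total number of word occurrences in the corpus, taken from the table header.
TOTAL_OCCURRENCES = 230_301_313

# First three rows of the table: (rank i, count C(w_i), word w_i, SAMPA transcription).
ROWS = [
    (1, 7_692_997, "w", "[f]"),
    (2, 5_333_210, "i", "[i]"),
    (3, 4_235_003, "na", "[n][a]"),
]


def relative_frequency(count: int, total: int = TOTAL_OCCURRENCES) -> float:
    """Return count / total, the share of all word occurrences taken by one word."""
    return count / total


def split_phonemes(transcription: str) -> list[str]:
    """Split a bracketed SAMPA string such as "[n][a]" into ["n", "a"]."""
    return re.findall(r"\[([^\]]+)\]", transcription)


if __name__ == "__main__":
    for rank, count, word, sampa in ROWS:
        share = relative_frequency(count)
        print(f"{rank:>2} {word:<4} {share:.4%} {split_phonemes(sampa)}")
```

Under these assumptions the most frequent word, "w", accounts for roughly 3.3% of all word occurrences in the corpus.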