Table 8 A sample section of the developed phonemic language corpus for Polish

From: Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

 

i    Number of occurr.   Orthographic word   Phonemic transcription
     of 230,301,313                          of word
     C(w_i)              w_i                 [SAMPA]

1    7,692,997           w                   [f]
2    5,333,210           i                   [i]
3    4,235,003           na                  [n][a]
4    4,158,902           z                   [s]
5    3,981,525           się                 [s’][e]
6    3,601,719           nie                 [n’][e]
7    2,904,114           do                  [d][o]
8    2,205,896           że                  [Z][e]
9    2,171,877           to                  [t][o]
10   1,731,304           o                   [o]
11   1,728,527           jest                [j][e][s][t]
12   1,425,793           a                   [a]
13   1,003,027           jak                 [j][a][k]
14   983,395             po                  [p][o]
15   912,660             od                  [o][t]
16   877,522             ale                 [a][l][e]
17   847,373             za                  [z][a]
18   775,006             przez               [p][S][e][s]
19   754,024             co                  [ts][o]
20   663,771             dla                 [d][l][a]
21   645,573             czy                 [tS][I]
22   610,035             tym                 [t][I][m]
23   607,673             już                 [j][u][S]
24   544,343             tak                 [t][a][k]
25   534,509             tylko               [t][I][l][k][o]
26   500,801             ma                  [m][a]
27   475,172             może                [m][o][Z][e]
28   451,225             tego                [t][e][g][o]
29   445,705             ze                  [z][e]
30   426,201             jego                [j][e][g][o]
⋯    ⋯                   ⋯
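
The counts C(w_i) are given out of 230,301,313 words, so each one can be read as a relative frequency by dividing by that total. The Python sketch below is only an illustration of that arithmetic and is not taken from the article: the names TOTAL_WORDS, ROWS and relative_frequency are hypothetical, and only the first four rows of the table are copied in.

```python
# Illustrative sketch only: names below are hypothetical, not from the article.

# Total number of words in the corpus, as stated in the table header.
TOTAL_WORDS = 230_301_313

# First four rows of the table: (rank i, orthographic word w_i, count C(w_i)).
ROWS = [
    (1, "w", 7_692_997),
    (2, "i", 5_333_210),
    (3, "na", 4_235_003),
    (4, "z", 4_158_902),
]


def relative_frequency(count: int, total: int = TOTAL_WORDS) -> float:
    """Return the share of the corpus taken by a word with the given count."""
    return count / total


if __name__ == "__main__":
    for rank, word, count in ROWS:
        share = relative_frequency(count)
        # Print the share as a percentage with three decimal places.
        print(f"{rank:>2}  {word:<4} {count:>10,}  {share * 100:.3f} %")
```

Under this reading, the most frequent word, "w", accounts for roughly 3.3 % of all words in the corpus.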