Table 4 The number of documents after text classification

From: Classification of heterogeneous text data for robust domain-specific language modeling

Similarity/weighting           tf-idf        Okapi          Ltu

In-domain data set
Bhattacharyya coefficient   1,166,806      607,004      698,061
Jaccard correlation index   1,258,169      537,729      699,033
Jensen-Shannon divergence   2,305,230      956,243      698,062

Out-of-domain data set
Bhattacharyya coefficient   5,741,849    6,301,651    6,210,594
Jaccard correlation index   5,650,486    6,370,926    6,209,622
Jensen-Shannon divergence   4,603,425    5,952,412    6,210,593
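
The counts in Table 4 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the names IN_DOMAIN, OUT_OF_DOMAIN, WEIGHTINGS, cell_totals, and in_domain_share are hypothetical and do not come from the article. It adds the in-domain and out-of-domain counts for each similarity/weighting pair. With the values as transcribed here, every pair sums to 6,908,655, which is consistent with each configuration splitting one fixed document collection into two classes. Reading that sum as the collection size is an assumption, since the table does not state it.

```python
# Counts copied from Table 4, keyed by similarity measure.
# Each tuple is ordered as (tf-idf, Okapi, Ltu).
WEIGHTINGS = ("tf-idf", "Okapi", "Ltu")

IN_DOMAIN = {
    "Bhattacharyya coefficient": (1_166_806, 607_004, 698_061),
    "Jaccard correlation index": (1_258_169, 537_729, 699_033),
    "Jensen-Shannon divergence": (2_305_230, 956_243, 698_062),
}

OUT_OF_DOMAIN = {
    "Bhattacharyya coefficient": (5_741_849, 6_301_651, 6_210_594),
    "Jaccard correlation index": (5_650_486, 6_370_926, 6_209_622),
    "Jensen-Shannon divergence": (4_603_425, 5_952_412, 6_210_593),
}


def cell_totals(in_domain, out_of_domain):
    """Return in-domain + out-of-domain counts per (measure, weighting) pair."""
    return {
        (measure, weighting): in_domain[measure][i] + out_of_domain[measure][i]
        for measure in in_domain
        for i, weighting in enumerate(WEIGHTINGS)
    }


def in_domain_share(in_domain, out_of_domain, measure, weighting):
    """Return the fraction of documents classified as in-domain for one pair."""
    i = WEIGHTINGS.index(weighting)
    kept = in_domain[measure][i]
    return kept / (kept + out_of_domain[measure][i])
```

For example, in_domain_share(IN_DOMAIN, OUT_OF_DOMAIN, "Jensen-Shannon divergence", "tf-idf") gives the fraction of documents that this configuration labels in-domain, which makes the rows easier to compare than the raw counts.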