Table 4 The number of documents after text classification

From: Classification of heterogeneous text data for robust domain-specific language modeling

Similarity / weighting        tf-idf      Okapi        Ltu
In-domain data set
Bhattacharyya coefficient  1,166,806    607,004    698,061
Jaccard correlation index  1,258,169    537,729    699,033
Jensen-Shannon divergence  2,305,230    956,243    698,062
Out-of-domain data set
Bhattacharyya coefficient  5,741,849  6,301,651  6,210,594
Jaccard correlation index  5,650,486  6,370,926  6,209,622
Jensen-Shannon divergence  4,603,425  5,952,412  6,210,593
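
The in-domain and out-of-domain counts for each combination of similarity measure and weighting scheme appear to partition one fixed collection of documents. The following is a minimal Python sketch of that consistency check, using the counts copied from Table 4; the dictionary layout and the names `in_domain`, `out_of_domain`, `totals`, and `shares` are illustrative assumptions and do not come from the article.

```python
# Counts copied from Table 4; the data layout and names are illustrative.
WEIGHTINGS = ("tf-idf", "Okapi", "Ltu")

in_domain = {
    "Bhattacharyya coefficient": (1_166_806, 607_004, 698_061),
    "Jaccard correlation index": (1_258_169, 537_729, 699_033),
    "Jensen-Shannon divergence": (2_305_230, 956_243, 698_062),
}
out_of_domain = {
    "Bhattacharyya coefficient": (5_741_849, 6_301_651, 6_210_594),
    "Jaccard correlation index": (5_650_486, 6_370_926, 6_209_622),
    "Jensen-Shannon divergence": (4_603_425, 5_952_412, 6_210_593),
}

totals = set()   # distinct in-domain + out-of-domain sums
shares = {}      # fraction of documents classified as in-domain

for measure, in_counts in in_domain.items():
    out_counts = out_of_domain[measure]
    for weighting, n_in, n_out in zip(WEIGHTINGS, in_counts, out_counts):
        total = n_in + n_out
        totals.add(total)
        shares[(measure, weighting)] = n_in / total

# Every cell pair should account for the same number of documents.
assert len(totals) == 1, f"inconsistent totals: {sorted(totals)}"

for (measure, weighting), share in shares.items():
    print(f"{measure:<26} {weighting:<7} {share:6.1%}")
```

Every pair of counts sums to 6,908,655 documents, so the two halves of the table are complementary. The in-domain share ranges from roughly 8% (Jaccard correlation index with Okapi) to roughly 33% (Jensen-Shannon divergence with tf-idf).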