From: Classification of heterogeneous text data for robust domain-specific language modeling
Similarity measure / term weighting | tf-idf | Okapi | Ltu |
---|---|---|---|
In-domain data set | | | |
Bhattacharyya coefficient | 1,166,806 | 607,004 | 698,061 |
Jaccard correlation index | 1,258,169 | 537,729 | 699,033 |
Jensen-Shannon divergence | 2,305,230 | 956,243 | 698,062 |
Out-of-domain data set | | | |
Bhattacharyya coefficient | 5,741,849 | 6,301,651 | 6,210,594 |
Jaccard correlation index | 5,650,486 | 6,370,926 | 6,209,622 |
Jensen-Shannon divergence | 4,603,425 | 5,952,412 | 6,210,593 |
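
The counts above can be turned into the share of data that each configuration assigns to the in-domain set. The following is a minimal sketch that only re-uses the numbers from the table; the dictionary layout and the helper name `in_domain_share` are illustrative assumptions, and the unit of the counts is not stated in this excerpt, so the code treats them as plain item counts.

```python
# Counts copied from the table; keys are (similarity measure, weighting scheme).
# The unit of the counts is not given in this excerpt (assumed: plain item counts).
in_domain = {
    ("Bhattacharyya", "tf-idf"): 1_166_806,
    ("Bhattacharyya", "Okapi"): 607_004,
    ("Bhattacharyya", "Ltu"): 698_061,
    ("Jaccard", "tf-idf"): 1_258_169,
    ("Jaccard", "Okapi"): 537_729,
    ("Jaccard", "Ltu"): 699_033,
    ("Jensen-Shannon", "tf-idf"): 2_305_230,
    ("Jensen-Shannon", "Okapi"): 956_243,
    ("Jensen-Shannon", "Ltu"): 698_062,
}
out_of_domain = {
    ("Bhattacharyya", "tf-idf"): 5_741_849,
    ("Bhattacharyya", "Okapi"): 6_301_651,
    ("Bhattacharyya", "Ltu"): 6_210_594,
    ("Jaccard", "tf-idf"): 5_650_486,
    ("Jaccard", "Okapi"): 6_370_926,
    ("Jaccard", "Ltu"): 6_209_622,
    ("Jensen-Shannon", "tf-idf"): 4_603_425,
    ("Jensen-Shannon", "Okapi"): 5_952_412,
    ("Jensen-Shannon", "Ltu"): 6_210_593,
}


def in_domain_share(in_counts, out_counts):
    """Return the in-domain fraction for every (measure, weighting) pair."""
    return {
        key: n_in / (n_in + out_counts[key])
        for key, n_in in in_counts.items()
    }


# Combined size of both sets for each configuration.
totals = {key: in_domain[key] + out_of_domain[key] for key in in_domain}
shares = in_domain_share(in_domain, out_of_domain)

# Print configurations from the smallest to the largest in-domain share.
for (measure, weighting), share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{measure:<15} {weighting:<7} {share:6.1%}")
```

Every in-domain/out-of-domain pair in the table sums to the same total, 6,908,655, so each configuration partitions the same pool of data and the shares are directly comparable across rows and columns.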