From: Classification of heterogeneous text data for robust domain-specific language modeling
All Acc % and Corr % columns: APD1+APD2 +PAR 340 h.

PPL | Text classification: weighting | Text classification: similarity | sp. adapt.: female, eval. set: gender-bal., Acc % | sp. adapt.: female, eval. set: gender-bal., Corr % | sp. adapt.: male, eval. set: gender-bal., Acc % | sp. adapt.: male, eval. set: gender-bal., Corr % | sp. adapt.: female, eval. set: female sp., Acc % | sp. adapt.: female, eval. set: female sp., Corr % | sp. adapt.: male, eval. set: male sp., Acc % | sp. adapt.: male, eval. set: male sp., Corr % |
---|---|---|---|---|---|---|---|---|---|---|
40.4302 | Reference language model |  | 90.15 | 91.68 | 92.72 | 93.80 | 95.72 | 96.48 | 94.10 | 94.87 |
36.0428 | tf-idf | Bhattacharyya | 91.23 | 92.50 | 93.23 | 94.18 | 95.97 | 96.68 | 94.34 | 95.06 |
35.9444 | tf-idf | Jaccard index | 91.26 | 92.55 | 93.24 | 94.22 | 95.98 | 96.68 | 94.73 | 95.11 |
38.1756 | tf-idf | Jensen-Shannon | 90.71 | 92.10 | 92.92 | 93.94 | 95.81 | 96.54 | 94.23 | 94.94 |
38.1289 | Okapi | Bhattacharyya | 90.95 | 92.23 | 93.03 | 94.01 | 95.88 | 96.59 | 94.25 | 94.96 |
39.9782 | Okapi | Jaccard index | 90.59 | 91.99 | 92.82 | 93.84 | 95.81 | 96.53 | 94.17 | 94.90 |
39.2267 | Okapi | Jensen-Shannon | 90.93 | 92.27 | 93.00 | 93.97 | 95.94 | 96.65 | 94.17 | 94.89 |
40.1325 | Ltu | Bhattacharyya | 90.19 | 91.70 | 92.72 | 93.78 | 95.73 | 96.49 | 94.10 | 94.85 |
40.1439 | Ltu | Jaccard index | 90.18 | 91.70 | 92.73 | 93.78 | 95.76 | 96.51 | 94.11 | 94.86 |
40.1319 | Ltu | Jensen-Shannon | 90.18 | 91.70 | 92.72 | 93.78 | 95.73 | 96.49 | 94.10 | 94.85 |
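
One way to read the table above is to express each configuration relative to the reference language model. The following is a minimal sketch, not part of the original article: the helper names `relative_ppl_reduction` and `summarize` are hypothetical, and only the tf-idf rows are copied in, using the first Acc % column (sp. adapt.: female, eval. set: gender-bal.).

```python
# Illustrative only: values copied from the table above
# (PPL, Acc % for sp. adapt.: female, eval. set: gender-bal.).
rows = {
    "reference": (40.4302, 90.15),
    "tf-idf + Bhattacharyya": (36.0428, 91.23),
    "tf-idf + Jaccard index": (35.9444, 91.26),
    "tf-idf + Jensen-Shannon": (38.1756, 90.71),
}


def relative_ppl_reduction(ref_ppl: float, ppl: float) -> float:
    """Relative perplexity reduction in percent with respect to the reference."""
    return 100.0 * (ref_ppl - ppl) / ref_ppl


def summarize(table: dict, ref_key: str = "reference") -> dict:
    """Map each non-reference row to (relative PPL reduction %, absolute Acc gain)."""
    ref_ppl, ref_acc = table[ref_key]
    summary = {}
    for name, (ppl, acc) in table.items():
        if name == ref_key:
            continue
        summary[name] = (relative_ppl_reduction(ref_ppl, ppl), acc - ref_acc)
    return summary


if __name__ == "__main__":
    for name, (ppl_red, acc_gain) in summarize(rows).items():
        print(f"{name}: PPL -{ppl_red:.2f} % rel., Acc +{acc_gain:.2f} abs.")
```

Under these assumptions, the tf-idf weighting with the Jaccard index gives a relative perplexity reduction of roughly 11 % and an absolute accuracy gain of 1.11 percentage points over the reference model in this column.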