From: Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices
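| MIL model           | % speech detected at 1% FAR | Bag-level F1 score |
|---------------------|-----------------------------|--------------------|
| Max pooling         | 93.7                        | 0.76               |
| Average pooling     | 88.0                        | 0.74               |
| Attention (softmax) | 12.7                        | 0.78               |
| Attention (sigmoid) | 68.5                        |                    |
| Hybrid              | 90.1                        | 0.73               |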
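
For orientation, three of the pooling models listed above (max, average, and softmax attention) can be sketched in their generic textbook forms. This is a minimal illustration and not the paper's implementation: the function names, the use of per-instance probabilities as inputs, and the attention logits passed in as a separate list are all assumptions made for the example. The sigmoid-attention and hybrid variants are left out because their definitions are not given here.

```python
import math
from typing import Sequence


def max_pooling(instance_probs: Sequence[float]) -> float:
    """Bag score is the single highest instance score."""
    return max(instance_probs)


def average_pooling(instance_probs: Sequence[float]) -> float:
    """Bag score is the unweighted mean of the instance scores."""
    return sum(instance_probs) / len(instance_probs)


def softmax_attention_pooling(
    instance_probs: Sequence[float],
    attention_logits: Sequence[float],
) -> float:
    """Bag score is a weighted mean of instance scores.

    The weights are a softmax over hypothetical per-instance attention
    logits (in a real model these would come from a learned layer).
    """
    # Subtract the maximum logit for numerical stability.
    peak = max(attention_logits)
    exps = [math.exp(a - peak) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * p for w, p in zip(weights, instance_probs))


if __name__ == "__main__":
    # Made-up instance scores for one bag of four audio frames.
    probs = [0.1, 0.2, 0.9, 0.3]
    logits = [0.0, 0.0, 2.0, 0.0]
    print("max:      ", max_pooling(probs))
    print("average:  ", average_pooling(probs))
    print("attention:", softmax_attention_pooling(probs, logits))
```

With equal attention logits the softmax weights are uniform, so attention pooling reduces to average pooling. As one logit grows much larger than the rest, the result approaches that instance's score.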