Fig. 1 | EURASIP Journal on Audio, Speech, and Music Processing

Fig. 1

From: Segment boundary detection directed attention for online end-to-end speech recognition

Diagram of the online encoder-decoder model with segment boundary detection directed attention. The input and output sequences are (x1,x2,…,xT) and (y1,y2,…,yN), respectively. \(\tilde {b}_{t}\) denotes the random variable of the binary segment boundary decision, αi is the attention weight vector, ci is the attended context vector (a weighted sum of the input hidden states), and ht, \(S_{t}^{a}\), si, \(\tilde {s}_{i}\) are the hidden states of the recurrent networks. y0 is the StartOfSequence symbol and s0 is the initial decoder state. The encoder processes the input stream frame by frame, and a boundary decision is made for each input. A shaded \(\tilde {b}_{t}\) indicates that the current input is not a segment boundary and the decoder stays idle, while an unshaded \(\tilde {b}_{t}\) means that the current input is a segment boundary and the decoder should produce an output by attending to the corresponding segment with soft attention. For instance, the decisions for the first three inputs are \(\tilde {\mathbf {b}}_{1:3}=(0,0,1)\), which means input x3 is a detected segment boundary. The decoder then predicts y1 by attending to the hidden states of the segment (h1,h2,h3). After that, a new input x4 is received, and this procedure continues until the end of the input or output sequence is reached.
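
The control flow described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`softmax`, `attend`, `segment_contexts`), the dot-product scoring against a fixed `query` vector, and the choice to start a fresh segment after each boundary are all assumptions made for illustration. In the actual model the boundary decisions and attention scores come from trained recurrent networks, and the decoder state changes at every output step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(segment, query):
    """Soft attention over one segment of encoder hidden states.

    Dot-product scoring is an assumption; the caption does not
    specify the scoring function.
    """
    scores = [sum(q_k * h_k for q_k, h_k in zip(query, h_t)) for h_t in segment]
    alpha = softmax(scores)  # attention weight vector (alpha_i)
    dim = len(segment[0])
    # Context vector c_i: weighted sum of the segment's hidden states.
    context = [sum(a * h_t[k] for a, h_t in zip(alpha, segment)) for k in range(dim)]
    return alpha, context

def segment_contexts(hidden_states, boundaries, query):
    """Walk the input frame by frame; attend only at detected boundaries.

    boundaries[t] == 0: the decoder stays idle.
    boundaries[t] == 1: the decoder attends to the frames collected so far.
    Resetting the segment after each boundary is an assumption.
    """
    contexts = []
    segment = []
    for t, (h_t, b_t) in enumerate(zip(hidden_states, boundaries), start=1):
        segment.append(h_t)
        if b_t == 1:
            alpha, context = attend(segment, query)
            contexts.append((t, alpha, context))
            segment = []
    return contexts

# Toy hidden states h1..h4 with the caption's decisions b_{1:3} = (0, 0, 1).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
boundaries = [0, 0, 1, 0]
result = segment_contexts(hidden_states, boundaries, query=[0.0, 0.0])
```

With these toy values, a single context vector is produced at frame 3, computed over (h1,h2,h3). The fourth frame is buffered for the next segment and triggers no output. Because the hypothetical query is the zero vector, the three attention weights are uniform.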
