Fig. 6 | EURASIP Journal on Audio, Speech, and Music Processing

Fig. 6

From: Segment boundary detection directed attention for online end-to-end speech recognition

Speech spectrogram and segment boundary detection output (BPE) for utterance 4k0c0303 from the “dev93” set. Top: Speech spectrogram with word segments (forced alignment generated by an HMM-GMM model) marked by dashed lines, with the reference transcription shown in each segment. Bottom: Segment boundary detection output generated with the recognized hypothesis as decoder input. The blue line is the boundary probability of each input memory item, ranging from 0 to 1 (most of the probabilities lie between about 0.1 and 0.4). The green dashed line indicates the threshold for emitting output symbols, which is set to 0.3 on WSJ. The dark dashed lines are the boundaries of BPE outputs detected by the threshold decision. The red lines are the detected boundaries of the last piece of each word, which can be considered word boundaries. The frame rate of the boundary sequence is 1/4 that of the original input speech because the model downsamples its input by a factor of 4. The recognized hypothesis is also shown in each detected segment, with recognition errors marked in red.
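The threshold decision and the 1/4 frame rate described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rising-edge rule in `detect_boundaries`, the function names, the example probabilities, and the 10 ms input frame shift are all assumptions made for illustration. Only the threshold of 0.3 and the downsampling factor of 4 come from the caption.

```python
from typing import List


def detect_boundaries(probs: List[float], threshold: float = 0.3) -> List[int]:
    """Return indices where the boundary probability rises to or above the threshold.

    Assumption: a boundary is emitted once per crossing (rising edge), so a run
    of consecutive frames above the threshold yields a single boundary.
    """
    boundaries = []
    above = False
    for i, p in enumerate(probs):
        if p >= threshold and not above:
            boundaries.append(i)
        above = p >= threshold
    return boundaries


def frame_to_seconds(index: int, frame_shift_s: float = 0.01, downsample: int = 4) -> float:
    """Map an index in the downsampled boundary sequence to a time in seconds.

    Assumption: the input features use a 10 ms frame shift; the factor of 4
    reflects the 1/4 downsampling rate mentioned in the caption.
    """
    return index * frame_shift_s * downsample


# Hypothetical boundary probabilities for eight downsampled frames.
example_probs = [0.1, 0.2, 0.35, 0.4, 0.1, 0.15, 0.31, 0.2]
example_boundaries = detect_boundaries(example_probs)
example_times = [frame_to_seconds(i) for i in example_boundaries]
```

Under these assumptions, each index in `example_boundaries` corresponds to one dark dashed line in the bottom panel, and `example_times` places it on the time axis of the spectrogram in the top panel.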
