Fig. 7 | EURASIP Journal on Audio, Speech, and Music Processing

From: Segment boundary detection directed attention for online end-to-end speech recognition

Speech spectrogram and BPE-based/character-based segment boundary detection outputs for utterance 4k0c030j from the “dev93” set. Top: BPE-based segment boundary detection output. Middle: speech spectrogram with word segments (forced alignment generated by an HMM-GMM model) marked by dashed lines and the reference transcription denoted in each segment. Bottom: character-based segment boundary detection output. Both detection outputs are generated with the recognized hypothesis as decoder input. The blue lines are the boundary probabilities of each input memory item. The green dashed lines indicate the threshold for emitting output symbols, set to 0.3 for BPE and 0.5 for characters. The dark dashed lines are the boundaries of BPE/character outputs detected by the threshold decision, and the red lines are the detected boundaries of the last piece/character of a word, which can be regarded as word boundaries. The frame rate of the boundary sequence is 1/4 that of the original input speech because of the 1/4 downsampling rate in our model. Recognized hypotheses are also denoted in each detected segment, with recognition errors marked in red
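The threshold decision described in the caption amounts to emitting a symbol boundary wherever the per-item boundary probability exceeds the threshold (0.3 for BPE, 0.5 for characters) and mapping the detected index back to the original speech frame rate through the 1/4 downsampling factor. The snippet below is a minimal illustrative sketch of that rule; the function name, the NumPy implementation, and the toy probability values are assumptions for demonstration, not taken from the paper's code.

```python
import numpy as np

def detect_boundaries(boundary_probs, threshold, downsample_rate=4):
    """Threshold per-memory-item boundary probabilities (blue lines in Fig. 7)
    and map detected indices back to original speech-frame indices.

    boundary_probs  : sequence of boundary probabilities, one per memory item
    threshold       : emission threshold (0.3 for BPE, 0.5 for characters)
    downsample_rate : each memory item covers 4 input frames (1/4 downsampling)
    """
    probs = np.asarray(boundary_probs)
    # Indices where the probability crosses the threshold (dark dashed lines)
    item_indices = np.nonzero(probs > threshold)[0]
    # Convert memory-item indices back to the original input frame rate
    frame_indices = item_indices * downsample_rate
    return item_indices, frame_indices

# Hypothetical example with 10 memory items (values are illustrative only)
probs = [0.05, 0.1, 0.7, 0.2, 0.1, 0.9, 0.3, 0.1, 0.6, 0.2]
items, frames = detect_boundaries(probs, threshold=0.5)
print(items)   # [2 5 8]   -> detected boundary positions in the memory sequence
print(frames)  # [ 8 20 32] -> corresponding positions in the input speech frames
```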
