
What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures

Research output: Contribution to journal › Article › peer-review

Abstract

Highly performing speech recognition is important for more fluent human–machine interaction (e.g., dialogue systems). Modern ASR architectures achieve human-level recognition performance on read speech but still perform sub-par on conversational speech, which arguably is or, at least, will be instrumental for human–machine interaction. Understanding the factors behind this shortcoming of modern ASR systems may suggest directions for improving them. In this work, we compare the performances of HMM- vs. transformer-based ASR architectures on a corpus of Austrian German conversational speech. Specifically, we investigate how strongly utterance length, prosody, pronunciation, and utterance complexity as measured by perplexity affect different ASR architectures. Among other findings, we observe that single-word utterances – which are characteristic of conversational speech and constitute roughly 30% of the corpus – are recognized more accurately if their F0 contour is flat; for longer utterances, the effects of the F0 contour tend to be weaker. We further find that zero-shot systems require longer utterance lengths and are less robust to pronunciation variation, which indicates that pronunciation lexicons and fine-tuning on the respective corpus are essential ingredients for the successful recognition of conversational speech.
Original language: English
Article number: 101738
Journal: Computer Speech and Language
Volume: 90
Early online date: 22 Oct 2024
Publication status: Published - Mar 2025

Keywords

  • Automatic speech recognition
  • Conversational speech
  • Kaldi
  • Wav2vec2
  • Whisper
  • Robustness

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Human-Computer Interaction
