TY - THES
T1 - Cross-layer models for conversational speech
AU - Schuppler, Barbara
PY - 2024
Y1 - 2024
N2 - In the last decade, conversational speech has received a lot of attention among speech scientists. On the one hand, accurate automatic speech recognition (ASR) systems are essential for conversational dialogue systems, as these become more interactional and social rather than solely transactional. On the other hand, linguists study natural conversations because they reveal insights into how speech processing works that go beyond those gained from controlled experiments. Several studies have indicated that the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels/layers of formal linguistic analysis, in line with the theoretical model Polysp by Hawkins and Smith (2001). This model is reminiscent of the cross-layer optimization principle in wireless communications, which was introduced as an alternative to the Open Systems Interconnection (OSI) model, where one layer provides services only to its upper layer while exclusively receiving services from the layer below. I use the term cross-layer to refer to this view of how humans access meaning and to the system architecture of envisioned ASR systems. This thesis summarizes work that I conducted with my team in the last decade, aimed at increasing our understanding of pronunciation variation and prosodic variation in spontaneous conversations and at using the knowledge gained to improve models for conversational speech processing, specifically for ASR. We created speech resources of conversational Austrian German, suitable for investigations in speech science and technology, and investigated which aspects of variation in Austrian German are variety-specific and which are typical of conversational speech in general. In our experiments, we carefully balance data-driven and knowledge-based approaches, because in addition to improving ASR, we also aim at increasing our knowledge of human speech processing, and because we deal with a language variety with limited speech resources.
Our detailed analyses of ASR errors show that in conversational speech, transformer-based ASR systems outperform HMM-based systems on average, but not for the frequently occurring short utterances and utterance fragments, where a classical HMM-based system can profit from a pronunciation lexicon. We thus see great potential for ASR of speech produced in natural interaction to profit from a hybrid model constructed from a data-driven, transformer-based component and a theory-driven component that includes linguistic knowledge. On a wider scale, the methods presented in this thesis are of interest to speech scientists and technologists from various domains who deal with scenarios of 1) low resources, 2) a high degree of variation and 3) a language not resourced with off-the-shelf models. These methods are already applied to phonetic analyses of pathological voices (MedUni Vienna), prosody-informed syntactic models for dialogue (University of Münster), dementia prediction (PMU Salzburg) and ASR of Hungarian conversational speech (HUN-REN Budapest). Besides this directly observable impact, the work presented here introduces a new perspective: Whereas the high degree of variation in conversational speech has primarily been seen as a “nasty” problem by ASR engineers, I view it as an additional source of information, as an additional cue to meaning and communicative function. This change in perspective will continue to guide my research plans.
AB - In the last decade, conversational speech has received a lot of attention among speech scientists. On the one hand, accurate automatic speech recognition (ASR) systems are essential for conversational dialogue systems, as these become more interactional and social rather than solely transactional. On the other hand, linguists study natural conversations because they reveal insights into how speech processing works that go beyond those gained from controlled experiments. Several studies have indicated that the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels/layers of formal linguistic analysis, in line with the theoretical model Polysp by Hawkins and Smith (2001). This model is reminiscent of the cross-layer optimization principle in wireless communications, which was introduced as an alternative to the Open Systems Interconnection (OSI) model, where one layer provides services only to its upper layer while exclusively receiving services from the layer below. I use the term cross-layer to refer to this view of how humans access meaning and to the system architecture of envisioned ASR systems. This thesis summarizes work that I conducted with my team in the last decade, aimed at increasing our understanding of pronunciation variation and prosodic variation in spontaneous conversations and at using the knowledge gained to improve models for conversational speech processing, specifically for ASR. We created speech resources of conversational Austrian German, suitable for investigations in speech science and technology, and investigated which aspects of variation in Austrian German are variety-specific and which are typical of conversational speech in general. In our experiments, we carefully balance data-driven and knowledge-based approaches, because in addition to improving ASR, we also aim at increasing our knowledge of human speech processing, and because we deal with a language variety with limited speech resources.
Our detailed analyses of ASR errors show that in conversational speech, transformer-based ASR systems outperform HMM-based systems on average, but not for the frequently occurring short utterances and utterance fragments, where a classical HMM-based system can profit from a pronunciation lexicon. We thus see great potential for ASR of speech produced in natural interaction to profit from a hybrid model constructed from a data-driven, transformer-based component and a theory-driven component that includes linguistic knowledge. On a wider scale, the methods presented in this thesis are of interest to speech scientists and technologists from various domains who deal with scenarios of 1) low resources, 2) a high degree of variation and 3) a language not resourced with off-the-shelf models. These methods are already applied to phonetic analyses of pathological voices (MedUni Vienna), prosody-informed syntactic models for dialogue (University of Münster), dementia prediction (PMU Salzburg) and ASR of Hungarian conversational speech (HUN-REN Budapest). Besides this directly observable impact, the work presented here introduces a new perspective: Whereas the high degree of variation in conversational speech has primarily been seen as a “nasty” problem by ASR engineers, I view it as an additional source of information, as an additional cue to meaning and communicative function. This change in perspective will continue to guide my research plans.
U2 - 10.3217/ktsd2-w5919
DO - 10.3217/ktsd2-w5919
M3 - Habilitation
PB - TU Graz Repository, Library & Archives
ER -