
Language database improves automatic speech recognition of Austrian German


Second-language speakers who come to Austria with a good knowledge of German usually find the local dialects difficult to understand. Similarly, speech recognition systems often fail to decode regional word choices and pronunciation.

Barbara Schuppler from the Signal Processing and Speech Communication Laboratory at Graz University of Technology (TU Graz), together with researchers from the Know Center and the University of Graz, has investigated the complexity of conversational speech, built a database of conversations in Austrian German and gained new insights into how speech recognition can be improved.

The results were recently published in the paper “What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures” in the journal Computer Speech & Language.

Free-flowing conversations in the recording studio

One of the main aims of the project was to improve the accuracy of automatic speech recognition (ASR) systems in spontaneous conversations with speakers from Austria. The team focused on the challenges posed by spontaneity, short sentences, overlapping speakers and dialectal accent in everyday conversations.

To obtain suitable data, the researchers set up the GRASS database (Graz Corpus of Read and Spontaneous Speech). It contains recordings of 38 speakers in two speaking styles: read texts, and spontaneous conversations in which two people who knew each other well spoke freely for an hour in the recording studio without being given a topic.

Since the same speakers were recorded in both speaking styles, the research team was able to eliminate the influence of speaker identity and recording quality on ASR performance.

Based on the database, the team compared various ASR architectures, including the long-established hidden Markov models (HMMs) and the relatively new transformer-based models. The comparison showed that transformer-based models, such as the Whisper speech recognition system, work very well for longer sentences with a lot of context, but struggle with the short, fragmentary sentences that frequently occur in conversations.
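ASR accuracy is commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The article does not say which metric the team used, so the following is only a minimal sketch under that assumption. The function name and the example utterances are invented for illustration and do not come from GRASS.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented examples: one dropped word in a longer and in a very short utterance.
long_utterance = word_error_rate("ja genau das passt schon", "ja genau passt schon")
short_utterance = word_error_rate("ja eh", "ja")
print(long_utterance, short_utterance)
```

In this toy case a single dropped word costs 20% of a five-word utterance but 50% of a two-word one, which hints at why short, fragmentary turns weigh so heavily on measured accuracy.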

Traditional HMM-based systems that were explicitly trained with pronunciation variations proved to be more robust for short sentences and dialectal language. The researchers therefore want to pursue a hybrid system approach that combines the strengths of both architectures. They have already combined a transformer model with a knowledge-based lexicon and a statistical language model, thereby achieving significant improvements.
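The article does not describe how the transformer model, the lexicon and the statistical language model were combined. One common way to pair a neural recognizer with a statistical language model is n-best rescoring: the recognizer proposes several candidate transcripts with scores, and a language model trained on in-domain text re-ranks them. The sketch below illustrates only that general idea, not the team's method. All names, the toy training sentences and the scores are hypothetical, and the lexicon component is left out.

```python
import math
from collections import Counter


def train_bigram_lm(sentences, alpha=1.0):
    """Return a function that scores a sentence with an add-alpha bigram model."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])                   # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # word-pair counts
    vocab_size = len(vocab)

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev_word, word in zip(tokens[:-1], tokens[1:]):
            numerator = bigrams[(prev_word, word)] + alpha
            denominator = unigrams[prev_word] + alpha * vocab_size
            total += math.log(numerator / denominator)
        return total

    return log_prob


def rescore(nbest, lm_log_prob, lm_weight=0.5):
    """Pick the hypothesis with the best combined recognizer and LM score."""
    best = max(nbest, key=lambda item: item[1] + lm_weight * lm_log_prob(item[0]))
    return best[0]


# Toy in-domain text and a hypothetical 2-best list (transcript, log score).
lm = train_bigram_lm(["das passt schon", "das passt", "ja das passt schon"])
nbest = [("das past schon", -4.0), ("das passt schon", -4.2)]

print(rescore(nbest, lm, lm_weight=0.0))  # recognizer score alone
print(rescore(nbest, lm, lm_weight=0.5))  # recognizer score plus language model
```

With the language model switched off, the recognizer's slightly higher-scored but misspelled candidate wins. With it switched on, the candidate that matches the in-domain text is preferred. The weight `lm_weight` is a tuning parameter in this sketch.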

Possible use in medical diagnostics

The team also analyzed how characteristics such as speech rate, intonation and word choice influence the accuracy of speech recognition. These findings can contribute to the development of ASR systems that better understand human speech in all its nuances.

The team plans to continue research in these areas and incorporate the findings into the development of new, more robust speech recognition systems. However, the results of the project also have interesting potential applications beyond this, particularly in the fields of medical diagnostics and human-computer interaction.

In the future, ASR systems could be used to recognize dementia or epilepsy based on speech patterns in spontaneous conversations or to make interaction with social robots more natural.

“Spontaneous speech, especially in dialogue, has completely different characteristics compared to a recited or read speech,” says Schuppler. “By analyzing human-human communication in particular, we have gained important findings in our project that also help us technically and open up new areas of application.

“Together with partners from the PMU Salzburg, Med Uni Graz and Med Uni Vienna, we are already working on follow-up projects to create socially relevant applications based on the foundations we have created in the Austrian Science Fund project.”

More information:
Julian Linke et al, What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures, Computer Speech & Language (2024). DOI: 10.1016/j.csl.2024.101738

Provided by
Graz University of Technology


Citation:
Language database improves automatic speech recognition of Austrian German (2024, December 12)
retrieved 12 December 2024
from

