
Researcher develops ‘SpeechSSM,’ opening up possibilities for a 24-hour AI voice assistant

Windowing strategy for (a) tokenizing and (b) decoding long-form speech to enable extrapolation in decoding lengths. Credit: arXiv (2024). DOI: 10.48550/arxiv.2412.18603

Recently, spoken language models (SLMs) have been highlighted as a next-generation technology that surpasses the limitations of text-based language models by learning directly from human speech, without text, to understand and generate both linguistic and non-linguistic information.

However, existing models show significant limitations in generating long-duration content required for podcasts, audiobooks, and voice assistants.

Se Jin Park, a Ph.D. candidate in Professor Yong Man Ro’s research team at the Korea Advanced Institute of Science and Technology (KAIST) School of Electrical Engineering, has succeeded in overcoming these limitations by developing “SpeechSSM,” which enables consistent and natural speech generation without time constraints.

The work has been published on the arXiv preprint server and is set to be presented at ICML (International Conference on Machine Learning) 2025.

A major advantage of SLMs is that they process speech directly, without intermediate text conversion, which lets them leverage the unique acoustic characteristics of human speakers and rapidly generate high-quality speech even at large model scales.

However, existing models faced difficulties in maintaining semantic and speaker consistency for long-duration speech, because capturing very detailed information requires breaking speech down into fine fragments, which sharply increases the “speech token resolution” and memory consumption.

SpeechSSM employs a “hybrid structure” that alternately places “attention layers” focusing on recent information and “recurrent layers” that remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing coherence even when generating speech for a long time.
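As a rough illustration of such a hybrid stack, the PyTorch sketch below interleaves an attention layer with a recurrent layer. The layer sizes, the choice of a GRU as the recurrent block, and all names are illustrative assumptions, not the authors’ actual architecture.

import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One hybrid layer: attention for recent detail, recurrence for long-term context (illustrative)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        # Attention over the sequence (a real implementation would restrict it to a recent local window).
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A recurrent layer carries the overall narrative flow in a fixed-size state,
        # so memory does not grow with the length of the sequence.
        self.recurrent = nn.GRU(dim, dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim) embeddings of speech tokens
        h, _ = self.local_attn(x, x, x)
        x = self.norm1(x + h)
        h, _ = self.recurrent(x)
        return self.norm2(x + h)

model = nn.Sequential(*[HybridBlock(dim=256) for _ in range(4)])
tokens = torch.randn(1, 1024, 256)   # one long sequence of speech-token embeddings
print(model(tokens).shape)           # torch.Size([1, 1024, 256])

Because the recurrent state has a fixed size, the per-token cost stays roughly constant as the sequence grows, which is the property described next.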

Furthermore, memory usage and computational load do not increase sharply with input length, enabling stable and efficient learning and the generation of long-duration speech.

SpeechSSM effectively processes unbounded speech sequences by dividing speech data into short, fixed units (windows), processing each unit independently, and then combining them to create long speech.
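The following minimal sketch captures that windowing idea under simplified assumptions; generate_window is a hypothetical stand-in for the model’s per-window generation step, not a real API.

def generate_long_speech(prompt_tokens, num_windows, window_size=512):
    output = list(prompt_tokens)
    for _ in range(num_windows):
        # Only a bounded, fixed-size context is processed for each new window,
        # so the cost per step stays constant no matter how long the output grows.
        context = output[-window_size:]
        output.extend(generate_window(context, num_tokens=window_size))  # hypothetical model call
    return output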

Additionally, in the speech generation phase, it uses a “Non-Autoregressive” audio synthesis model (SoundStorm), which rapidly generates multiple parts at once instead of slowly creating one character or one word at a time, enabling the fast generation of high-quality speech.
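The sketch below shows the general flavor of such confidence-based parallel decoding (in the spirit of models like SoundStorm, not its actual implementation); predict_fn is a hypothetical model call that scores every position in a single forward pass.

import torch

def parallel_decode(predict_fn, length, num_steps=8, mask_id=-1):
    # Start with every position masked.
    tokens = torch.full((length,), mask_id)
    for step in range(num_steps):
        # One forward pass predicts all positions in parallel: (length, vocab) logits.
        conf, pred = predict_fn(tokens).softmax(-1).max(-1)
        # Positions that are already filled keep their tokens.
        conf[tokens != mask_id] = float("inf")
        # Commit a growing fraction of the most confident predictions each round.
        num_filled = int(length * (step + 1) / num_steps)
        keep = conf.topk(num_filled).indices
        tokens[keep] = torch.where(tokens[keep] == mask_id, pred[keep], tokens[keep])
    return tokens

A handful of such parallel passes replaces thousands of one-at-a-time steps, which is what makes the generation fast.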

While existing spoken language models were typically evaluated on short clips of about 10 seconds, Se Jin Park created new evaluation tasks for speech generation based on “LibriSpeech-Long,” a self-built benchmark dataset supporting the generation of up to 16 minutes of speech.

In place of PPL (perplexity), an existing speech-model evaluation metric that only indicates grammatical correctness, she proposed new evaluation metrics such as “SC-L (semantic coherence over time),” which assesses how coherent the content remains over time, and “N-MOS-T (naturalness mean opinion score over time),” which evaluates naturalness over time, enabling more effective and precise evaluation.
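One simple way to read the SC-L idea, shown as an assumption-laden sketch below, is to embed the transcript of each successive window of generated speech and measure its similarity to the prompt; embed stands in for any sentence-embedding model and is not part of the paper.

import numpy as np

def semantic_coherence_over_time(prompt_text, window_transcripts, embed):
    # Compare each time window's transcript against the prompt in embedding space.
    anchor = embed(prompt_text)
    scores = []
    for text in window_transcripts:
        vec = embed(text)
        cosine = np.dot(anchor, vec) / (np.linalg.norm(anchor) * np.linalg.norm(vec))
        scores.append(cosine)
    return scores   # one coherence score per successive window of generated speech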

Through these new evaluations, it was confirmed that speech generated by the SpeechSSM spoken language model consistently featured the specific individuals mentioned in the initial prompt, and that new characters and events unfolded naturally and in a contextually consistent way, despite the long generation time.

This contrasts sharply with existing models, which tended to easily lose their topic and exhibit repetition during long-duration generation.

Se Jin Park explained, “Existing spoken language models had limitations in long-duration generation, so our goal was to develop a spoken language model capable of generating long-duration speech for actual human use.”

She added, “This research achievement is expected to greatly contribute to various types of voice content creation and voice AI fields like voice assistants, by maintaining consistent content in long contexts and responding more efficiently and quickly in real time than existing methods.”

This research, with Se Jin Park as the first author, was conducted in collaboration with Google DeepMind.

More information:
Se Jin Park et al, Long-Form Speech Generation with Spoken Language Models, arXiv (2024). DOI: 10.48550/arxiv.2412.18603

Accompanying demo: SpeechSSM Publications.

Journal information:
arXiv


Provided by
The Korea Advanced Institute of Science and Technology (KAIST)



