Seed-ASR:

Understanding Diverse Speech and Contexts with LLM-based Speech Recognition



Seed Team

ByteDance

Abstract. Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information of various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in scenarios that match their training data, and they are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed on the framework of audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by feeding continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of the LLM's context-aware capabilities, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets covering multiple domains, accents/dialects, and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word error rate (or character error rate, for Chinese) on Chinese and English public test sets, further demonstrating its powerful performance.


System Overview

Figure 1. The model framework used in Seed-ASR. When contexts are provided, the instruction is "There are relevant contexts, transcribe the speech into text:". Otherwise, the instruction is "Transcribe the speech into text:".
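
The AcLLM framework is not released as code, but a minimal sketch can illustrate the input layout the caption describes: continuous speech representations are projected into the LLM embedding space and concatenated with the embedded instruction (and optional context) into one input sequence. Everything below (the dimensions, toy_tokenize, the projector, and the ordering of context, instruction, and speech) is a hypothetical stand-in, not Seed-ASR's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; Seed-ASR's real encoder and LLM dimensions are not public.
VOCAB_SIZE, LLM_DIM, AUDIO_DIM = 32000, 1024, 512

embed_tokens = nn.Embedding(VOCAB_SIZE, LLM_DIM)  # stand-in for the LLM's token embedding table
projector = nn.Linear(AUDIO_DIM, LLM_DIM)         # maps encoder features into the LLM embedding space

def toy_tokenize(text: str) -> torch.Tensor:
    """Placeholder tokenizer (illustration only): hashes characters to token ids."""
    return torch.tensor([[hash(ch) % VOCAB_SIZE for ch in text]])

def build_acllm_inputs(speech_features: torch.Tensor, context: str = "") -> torch.Tensor:
    """Assemble one continuous LLM input sequence from text and speech embeddings.

    Uses the two instructions quoted in the Figure 1 caption; the exact ordering
    of context, instruction, and speech tokens is an assumption.
    """
    if context:
        prompt = f"{context}\nThere are relevant contexts, transcribe the speech into text:"
    else:
        prompt = "Transcribe the speech into text:"
    prompt_embeds = embed_tokens(toy_tokenize(prompt))  # (1, T_text, LLM_DIM)
    speech_embeds = projector(speech_features)          # (1, T_audio, LLM_DIM)
    # The LLM consumes text embeddings and speech embeddings as a single sequence.
    return torch.cat([prompt_embeds, speech_embeds], dim=1)

# Example: 50 frames of speech-encoder output plus a name as context.
features = torch.randn(1, 50, AUDIO_DIM)
inputs = build_acllm_inputs(features, context="庞葱 (Pang Cong)")
print(inputs.shape)  # (1, T_text + 50, 1024)
```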

Model & Evaluation

Context-aware Ability

Context-aware type: The content of dialogue history
Explanation: In the first round of conversation, "庞葱 (Pang Cong)" is mistakenly recognized as the homophone "庞冲 (Pang Chong)" because no contextual knowledge is available. When the characters in "The Romance of the Three Kingdoms" are discussed further with Doubao, "庞葱 (Pang Cong)" is recognized correctly in the second attempt, because the conversation history mentioning this name is added to the recognition prompt as context.

Context-aware type: The name of the agent
Explanation: For the conversational agent "枫丹 (Feng Dan)", the name "枫丹 (Feng Dan)" is added to the recognition prompt as context to improve the accuracy of recognizing the agent's name. Without this background knowledge in the prompt, the name may be recognized as another semantically plausible homophone.

Context-aware type: The description of the agent
Explanation: When talking with the conversational agent "顾易 (Gu Yi)", the agent's description text is added to the recognition prompt as context to improve the accuracy of recognizing content related to the agent.

Context-aware type: The modification history record
Explanation: In the first video, a professional skiing term such as "立刃 (li ren)" may be recognized as the homophone "利刃 (li ren)". Users then correct the wrong recognition results in the subtitles, and these modifications, such as the change from "利刃" to "立刃", are used as recognition prompts when recognizing the second video, so the same errors are avoided there.

Context-aware type: The names of meeting attendees
Explanation: When attendees are invited to a Lark meeting, all attendees' names are used as context, so that when an attendee's name appears in the speech, the recognition result is corrected accordingly.
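
A rough sketch of how the five context sources above could be gathered into a single recognition-prompt context. The RecognitionContext class, its field names, and the flattened string format are all hypothetical; the paper does not specify how Seed-ASR serializes context. The resulting string would fill the context argument of the sketch after Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class RecognitionContext:
    """Hypothetical container for the five context sources in the table above."""
    dialogue_history: list[str] = field(default_factory=list)  # prior conversation turns
    agent_name: str = ""                                       # e.g. "枫丹 (Feng Dan)"
    agent_description: str = ""                                # the agent's profile text
    corrections: dict[str, str] = field(default_factory=dict)  # subtitle edits, e.g. {"利刃": "立刃"}
    attendee_names: list[str] = field(default_factory=list)    # Lark meeting attendees

    def to_prompt(self) -> str:
        """Flatten whichever fields are present into one context string."""
        parts = []
        if self.dialogue_history:
            parts.append("Dialogue history: " + " | ".join(self.dialogue_history))
        if self.agent_name:
            parts.append("Agent name: " + self.agent_name)
        if self.agent_description:
            parts.append("Agent description: " + self.agent_description)
        if self.corrections:
            parts.append("Corrections: " + ", ".join(f"{w} -> {r}" for w, r in self.corrections.items()))
        if self.attendee_names:
            parts.append("Attendees: " + ", ".join(self.attendee_names))
        return "\n".join(parts)

# Example: the skiing-subtitle case from the table.
ctx = RecognitionContext(corrections={"利刃": "立刃"})
print(ctx.to_prompt())  # Corrections: 利刃 -> 立刃
```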

Transcripts of Seed-ASR on Multi-domain Set

Audio Seed-ASR
Sample1
Sample2
Sample3
Sample4
Sample5
Sample6
Sample7*
Sample8

Transcripts of Seed-ASR on Multi-dialect Set

Dialect Audio Reference Seed-ASR
Wuu
Cantonese
Sichuan
Jlua
Zgyu
Xiang

Transcripts of Seed-ASR on Multi-accent Set

Accent Audio Seed-ASR
Yunnan Accent
Gansu Accent
Henan Accent
Jiangxi Accent
Anhui Accent
Hunan Accent

Transcripts of Seed-ASR on Hardcase Set

Audio Seed-ASR
Sample1
Sample2
Sample3*
Sample4

Transcripts of Seed-ASR on Speech with Background Noise

Audio Seed-ASR
Sample1
Sample2
Sample3
Sample4

*Samples marked with an asterisk are from Conneau, Alexis, et al. "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech." 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023.