Seed-ASR:

Understanding Diverse Speech and Contexts with LLM-based Speech Recognition



Seed Team

ByteDance

Abstract. Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information of various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in scenarios that match their training data, and they are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed on the framework of audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by feeding continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of the LLM's context-aware capabilities, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets covering multiple domains, accents/dialects, and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word error rate (or character error rate, for Chinese) on Chinese and English public test sets, further demonstrating its powerful performance.


System Overview

Figure 1. The model framework used in Seed-ASR. When contexts are provided, the instruction is "There are relevant contexts, transcribe the speech into text:". Otherwise, the instruction is "Transcribe the speech into text:".
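
The AcLLM framework is not released as code, but a minimal sketch can illustrate the input layout the caption describes: continuous speech representations are projected into the LLM embedding space and concatenated with the embedded instruction (and optional context) into one input sequence. Everything below (the dimensions, toy_tokenize, the projector, and the ordering of context, instruction, and speech) is a hypothetical stand-in, not Seed-ASR's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; Seed-ASR's real encoder and LLM dimensions are not public.
VOCAB_SIZE, LLM_DIM, AUDIO_DIM = 32000, 1024, 512

embed_tokens = nn.Embedding(VOCAB_SIZE, LLM_DIM)  # stand-in for the LLM's token embedding table
projector = nn.Linear(AUDIO_DIM, LLM_DIM)         # maps encoder features into the LLM embedding space

def toy_tokenize(text: str) -> torch.Tensor:
    """Placeholder tokenizer (illustration only): hashes characters to token ids."""
    return torch.tensor([[hash(ch) % VOCAB_SIZE for ch in text]])

def build_acllm_inputs(speech_features: torch.Tensor, context: str = "") -> torch.Tensor:
    """Assemble one continuous LLM input sequence from text and speech embeddings.

    Uses the two instructions quoted in the Figure 1 caption; the exact ordering
    of context, instruction, and speech tokens is an assumption.
    """
    if context:
        prompt = f"{context}\nThere are relevant contexts, transcribe the speech into text:"
    else:
        prompt = "Transcribe the speech into text:"
    prompt_embeds = embed_tokens(toy_tokenize(prompt))  # (1, T_text, LLM_DIM)
    speech_embeds = projector(speech_features)          # (1, T_audio, LLM_DIM)
    # The LLM consumes text embeddings and speech embeddings as a single sequence.
    return torch.cat([prompt_embeds, speech_embeds], dim=1)

# Example: 50 frames of speech-encoder output plus a name as context.
features = torch.randn(1, 50, AUDIO_DIM)
inputs = build_acllm_inputs(features, context="庞葱 (Pang Cong)")
print(inputs.shape)  # (1, T_text + 50, 1024)
```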

Model & Evaluation

Context-aware Ability

Context-aware type: The content of dialogue history
Explanation: In the first round of conversation, "庞葱 (Pang Cong)" is mistakenly recognized as the homophone "庞冲 (Pang Chong)" because no contextual knowledge is available. When the characters in "The Romance of the Three Kingdoms" are discussed further with Doubao, "庞葱 (Pang Cong)" is recognized correctly in the second attempt, because the conversation history mentioning this name is added to the recognition prompt as context.

Context-aware type: The name of the agent
Explanation: For the conversational agent "枫丹 (Feng Dan)", the name "枫丹 (Feng Dan)" is added to the recognition prompt as context to improve the accuracy of recognizing the agent's name. Without this background knowledge in the prompt, the name may be recognized as another semantically plausible homophone.

Context-aware type: The description of the agent
Explanation: When talking with the conversational agent "顾易 (Gu Yi)", the agent's description text is added to the recognition prompt as context to improve the accuracy of recognizing content related to the agent.

Context-aware type: The modification history record
Explanation: In the first video, a professional skiing term such as "立刃 (li ren)" may be recognized as the homophone "利刃 (li ren)". Users then correct the wrong recognition results in the subtitles, and these modifications, such as the change from "利刃" to "立刃", are used as recognition prompts when recognizing the second video, so the same errors are avoided there.

Context-aware type: The names of meeting attendees
Explanation: When attendees are invited to a Lark meeting, all attendees' names are used as context, so that when an attendee's name appears in the speech, the recognition result is corrected accordingly.
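
A rough sketch of how the five context sources above could be gathered into a single recognition-prompt context. The RecognitionContext class, its field names, and the flattened string format are all hypothetical; the paper does not specify how Seed-ASR serializes context. The resulting string would fill the context argument of the sketch after Figure 1.

```python
from dataclasses import dataclass, field

@dataclass
class RecognitionContext:
    """Hypothetical container for the five context sources in the table above."""
    dialogue_history: list[str] = field(default_factory=list)  # prior conversation turns
    agent_name: str = ""                                       # e.g. "枫丹 (Feng Dan)"
    agent_description: str = ""                                # the agent's profile text
    corrections: dict[str, str] = field(default_factory=dict)  # subtitle edits, e.g. {"利刃": "立刃"}
    attendee_names: list[str] = field(default_factory=list)    # Lark meeting attendees

    def to_prompt(self) -> str:
        """Flatten whichever fields are present into one context string."""
        parts = []
        if self.dialogue_history:
            parts.append("Dialogue history: " + " | ".join(self.dialogue_history))
        if self.agent_name:
            parts.append("Agent name: " + self.agent_name)
        if self.agent_description:
            parts.append("Agent description: " + self.agent_description)
        if self.corrections:
            parts.append("Corrections: " + ", ".join(f"{w} -> {r}" for w, r in self.corrections.items()))
        if self.attendee_names:
            parts.append("Attendees: " + ", ".join(self.attendee_names))
        return "\n".join(parts)

# Example: the skiing-subtitle case from the table.
ctx = RecognitionContext(corrections={"利刃": "立刃"})
print(ctx.to_prompt())  # Corrections: 利刃 -> 立刃
```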

Transcripts of Seed-ASR on Multi-domain Set

Audio Seed-ASR
Sample1
Sample2
Sample3
Sample4
Sample5
Sample6
Sample7*
Sample8

Transcripts of Seed-ASR on Multi-dialect Set

Dialect Audio Reference Seed-ASR
Wuu
Cantonese
Sichuan
Jlua
Zgyu
Xiang

Transcripts of Seed-ASR on Multi-accent Set

Accent Audio Seed-ASR
Yunnan Accent
Gansu Accent
Henan Accent
Jiangxi Accent
Anhui Accent
Hunan Accent

Transcripts of Seed-ASR on Hardcase Set

Audio Seed-ASR
Sample1
Sample2
Sample3*
Sample4

Transcripts of Seed-ASR on Speech with Background Noise

Audio Seed-ASR
Sample1
Sample2
Sample3
Sample4

*Samples marked with an asterisk are from Conneau, Alexis, et al. "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech." 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023.