Impressive! Speech Can Be Recognized Just by Watching the Lips

Working out what someone is saying purely from the shape of their mouth is known as lip reading, or lip recognition.

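Lip reading is not a new technology. As early as 2003, Intel developed the lip-reading software AVSR. By 2016, Google DeepMind's lip-reading technology already supported a vocabulary of 17,500 words and achieved recognition accuracy above 50% on a news test set.

Lip reading means letting an AI "know what you are saying just by watching your mouth."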

Lip reading works by first using machine vision to detect faces in the image, determine which person is speaking, and extract the continuous changes in that person's mouth shape as features.

These time-varying mouth-shape features are then fed into a lip-reading model, which identifies the corresponding pronunciations. Finally, the most likely natural-language sentence is computed from the recognized pronunciations.
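The last two steps, turning per-frame model outputs into a pronunciation sequence and then choosing a sentence, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the pipeline of any particular system: the greedy collapse rule is borrowed from CTC-style decoding, and the blank label, the per-frame scores, and the tiny lexicon with prior probabilities are all made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical label meaning "no pronunciation in this frame"


def greedy_decode(frame_scores):
    """Collapse per-frame label scores into a pronunciation sequence.

    frame_scores: list of dicts mapping label -> score, one dict per frame.
    """
    # Pick the highest-scoring label in each frame.
    best = [max(scores, key=scores.get) for scores in frame_scores]
    # Merge consecutive repeats, then drop blanks (CTC-style collapse).
    collapsed = [label for label, _ in groupby(best)]
    return [label for label in collapsed if label != BLANK]


def most_likely_sentence(units, lexicon):
    """Return the candidate sentence with the highest prior, or None."""
    candidates = lexicon.get(tuple(units), [])
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


# Hypothetical model output for five video frames.
frames = [
    {"ni": 0.8, "hao": 0.1, BLANK: 0.1},
    {"ni": 0.7, "hao": 0.1, BLANK: 0.2},
    {"ni": 0.1, "hao": 0.1, BLANK: 0.8},
    {"ni": 0.1, "hao": 0.8, BLANK: 0.1},
    {"ni": 0.1, "hao": 0.7, BLANK: 0.2},
]
# Hypothetical lexicon: one pronunciation can map to several sentences.
lexicon = {("ni", "hao"): [("你好", 0.9), ("拟好", 0.1)]}

units = greedy_decode(frames)
sentence = most_likely_sentence(units, lexicon)
print(units, sentence)
```

The homophone entry in the lexicon shows why the final step is needed: the same pronunciation sequence can correspond to more than one sentence, so some notion of which sentence is more probable has to break the tie. A practical system would typically use a language model rather than a lookup table.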

Last year, Sogou, a well-known Chinese AI company, worked with Tsinghua's Tiangong research institute and achieved significant results in multimodal recognition that combines speech and lip reading. The resulting paper, "Modality Attention for End-to-End Audio-Visual Speech Recognition," was published at last year's ICASSP conference.

The paper points out a weakness of recognition methods that rely on audio alone: they cannot maintain high accuracy in noisy environments.

Visual recognition, by contrast, is unaffected by ambient sound. When people cannot hear each other clearly, they naturally watch the speaker's mouth, and people with hearing impairments rely on lip reading to communicate.

The Sogou researchers reasoned that if an AI could combine the two approaches in the same way, so-called "multimodal" recognition, speech recognition accuracy could be improved.
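The idea of weighting each modality by how much it can be trusted can be sketched in a few lines. The snippet below is a minimal illustration, not the method from the paper: the function names `softmax` and `fuse`, the feature vectors, and the reliability scores are all hypothetical, and in a trained system the scores would be produced by the network rather than set by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_feat, visual_feat, audio_score, visual_score):
    """Blend two equal-length feature vectors using attention weights.

    A higher score means the modality is considered more reliable.
    """
    w_audio, w_visual = softmax([audio_score, visual_score])
    fused = [w_audio * a + w_visual * v
             for a, v in zip(audio_feat, visual_feat)]
    return fused, (w_audio, w_visual)


# Hypothetical noisy scene: the audio is scored as less reliable than the video.
audio = [1.0, 0.0]
visual = [0.0, 1.0]
fused, weights = fuse(audio, visual, audio_score=-1.0, visual_score=1.0)
print(weights, fused)
```

With these made-up scores the visual features dominate the fused vector, which mirrors the intuition of watching a speaker's mouth when the sound is unclear. With equal scores the two modalities contribute equally.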

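On an open, speaker-independent spoken-language test set, Sogou's lip-reading system has reached accuracy above 60%, exceeding the 50%-plus accuracy of the English lip-reading system published by Google. In vertical scenarios such as in-car and smart-home use, it has even reached 90% accuracy.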

Sogou's lip-reading system, unveiled at the 4th World Internet Conference

As a form of human-computer interaction, lip reading can in future complement voice interaction and image recognition, and can be applied widely in everyday life, security, public welfare, and other areas.

At the 2017 World Internet Conference, a Sogou representative stated explicitly that the company hopes lip reading can help hearing-impaired people "translate" the speech of hearing people, using lip reading to convert speech into text and helping them better understand the world.

In vehicles, excessive ambient noise interferes with voice commands. Lip reading can avoid this interference and keep human-vehicle interaction accurate and stable.

In security, most surveillance systems have cameras but no microphones, which creates many problems for case analysis. Lip reading can help police recover important spoken information and provide effective support for public safety.

With lip reading added, police can be expected to pin down what suspects say in video footage through such a platform, which would greatly aid criminal investigation.

In noisy settings such as streets, meeting rooms, and railway stations, lip reading helps keep audio noise from affecting how well users can make out speech, keeping video or voice communication smooth.

Although lip reading already has many applications, developing it remains difficult.

Because lip reading builds on both machine vision and natural language processing, it is much harder than speech recognition.

In general, lip-reading systems use complex end-to-end deep neural networks to model lip-movement sequences and are trained on thousands of hours of real lip-reading data.

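Datatang (数据堂) has worked in the AI data field for nearly ten years and is committed to providing professional data services to AI companies worldwide. Its "156 Hours of Lip-Synced Multimodal Video Data" and "1,998 People Lip-Reading Video Data" datasets, built to high industry standards, have been well received and can help bring lip reading to more application scenarios.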

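156 Hours of Lip-Synced Multimodal Video Data

This dataset consists of speech recorded by 250 people together with matching lip-reading video. Staff recorded with multiple devices in sync. The content covers short Mandarin commands and spoken sentences, precisely aligned using a pulse signal, with sentence accuracy of no less than 95%. The data can be used for research on lip reading and on multimodal learning algorithms that combine speech and images.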

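1,998 People Lip-Reading Video Data

The data was recorded by 1,998 people and contains 41,866 videos with a total duration of 86 hours, 56 minutes, and 1.52 seconds. It covers multiple scenes, age groups, and times of day.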
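As a quick sanity check on these figures, the average clip length follows from dividing the total duration by the number of videos. The variable names below are only for illustration.

```python
# Total duration: 86 hours, 56 minutes, 1.52 seconds, expressed in seconds.
total_seconds = 86 * 3600 + 56 * 60 + 1.52
num_videos = 41_866

average_clip_seconds = total_seconds / num_videos
print(f"{average_clip_seconds:.2f} seconds per video on average")
```

That works out to roughly 7.5 seconds per clip.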

In each video, the participant reads out eight Arabic digits. Annotators label each video's recording time and the content read, with accuracy of no less than 95%. The data can be used for lip-reading tasks.

Industry insiders expect that, given its competitiveness in public security, identity verification, education for people with disabilities, the military, and other fields, lip reading could open up a trillion-scale big data market.

However, given the complexity of real language environments, it will take some time before lip reading is ready for real-world deployment, and integrated research across big data, visual analysis, and artificial intelligence still needs to be strengthened.

Original: https://blog.csdn.net/weixin_44532659/article/details/119674318
Author: 数据堂官方账号
Title: 厉害了!看嘴型竟然就能识别发音

