ASR Project in Practice: Post-processing

This stage handles word segmentation, sentence segmentation, punctuation, capitalization, and number format normalization.

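Word segmentation here means something different from word segmentation in NLP or search. In alphabetic languages such as English and French, a sentence is made up of multiple words, and spaces must be added or removed between the words of the recognizer's output as needed. In Chinese, characters and words are not delimited by spaces, so word segmentation matters little.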
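The spacing behavior described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `join_tokens`, the punctuation set, and the use of the CJK Unified Ideographs range to detect Chinese characters are all assumptions made for the example.

```python
# Hypothetical helper: join recognizer tokens into display text.
# Assumption: tokens arrive as a list of strings, one word or mark each.

NO_SPACE_BEFORE = set(",.!?;:")  # marks that attach to the previous token


def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def join_tokens(tokens):
    """Insert a space between tokens unless a mark or CJK text forbids it."""
    out = ""
    for tok in tokens:
        if not tok:
            continue
        needs_space = (
            out != ""
            and tok[0] not in NO_SPACE_BEFORE
            and not (is_cjk(out[-1]) and is_cjk(tok[0]))
        )
        if needs_space:
            out += " "
        out += tok
    return out
```

For example, `join_tokens(["hello", "world", ",", "how", "are", "you", "?"])` yields `hello world, how are you?`, while `join_tokens(["语音", "识别"])` yields `语音识别` with no space inserted.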

In the narrow sense, speech recognition only converts speech to text. Speech data carries no explicit cues for sentence boundaries, so the recognition result lacks sentence-break information. Without sentence breaks, the text is harder to read and to understand, which limits how the recognition result can be used. To address this, sentence-break information such as punctuation must be added to the output so that it is divided into sentences and paragraphs.

Punctuation

Appropriate punctuation helps users understand the recognition result correctly and improves the experience of a speech recognition product.

Punctuation can be added to recognized text with either a rule-based method or a machine learning model.

People naturally pause where punctuation would fall when they speak, so pauses can be used to insert punctuation into the recognized text. This is the assumption behind the rule-based method.

Problems with this method:

  • It can only produce simple punctuation marks such as commas and periods.
  • It requires the speaker to talk at a relatively steady pace. If the speaker speeds up or slows down within a passage, the method fails.

When implementing the rule-based method:

  • The time offsets (start time and end time) of every element output by the speech recognition stage are required. Elements include both text and silence segments.
  • Silence-duration thresholds are configured to distinguish commas from periods.

    If the thresholds are implemented as system-level parameters, all speech data is assumed to share the same characteristics, which is clearly not flexible enough. Finding one set of parameters that suits most application scenarios is also a complex task.

    If the thresholds are implemented as session-level parameters, the caller supplies them to the post-processing system, which gives some flexibility to adapt to different speaking rates. The caller must then know the speaker's characteristics in advance; otherwise punctuation accuracy suffers.
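The pause-based rule above can be sketched roughly as follows. The `Element` and `PauseConfig` types, the field names, and the 300 ms and 700 ms thresholds are hypothetical values chosen for illustration; the source does not specify a data format or concrete thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One recognizer output element; text=None marks a silence segment."""
    text: Optional[str]
    start_ms: int
    end_ms: int


@dataclass
class PauseConfig:
    """Hypothetical thresholds; real values need tuning per speaker or domain."""
    comma_min_ms: int = 300
    period_min_ms: int = 700


def punctuate(elements: List[Element], config: Optional[PauseConfig] = None) -> str:
    """Turn silences into commas or periods based on their duration."""
    config = config or PauseConfig()  # falls back to system-level defaults
    pieces: List[str] = []
    for el in elements:
        if el.text is not None:
            pieces.append(el.text)
            continue
        # Ignore silence before the first word or right after a mark.
        if not pieces or pieces[-1] in (",", "."):
            continue
        duration = el.end_ms - el.start_ms
        if duration >= config.period_min_ms:
            pieces.append(".")
        elif duration >= config.comma_min_ms:
            pieces.append(",")
    text = ""
    for piece in pieces:
        if text and piece not in (",", "."):
            text += " "
        text += piece
    return text


sample = [
    Element("hello", 0, 400),
    Element(None, 400, 800),      # 400 ms pause
    Element("how", 800, 1000),
    Element("are", 1000, 1200),
    Element("you", 1200, 1500),
    Element(None, 1500, 2500),    # 1000 ms pause
]
```

With the defaults, `punctuate(sample)` gives `hello, how are you.`. Passing a session-level `PauseConfig(comma_min_ms=500, period_min_ms=900)` for a slower speaker gives `hello how are you.`, which shows how strongly the result depends on the chosen thresholds.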

The machine learning approach models text and punctuation jointly: a dedicated language model is trained and plugged into the post-processing pipeline. The text, sentence breaks, time offsets, and other information output by the speech recognition stage serve as input to punctuation inference and together support the punctuation module.

Compared with the rule-based method, this approach appears more adaptable, but it is harder to implement. For example:

  • The usual machine learning difficulties: model design, training, and the collection and labeling of training data.
  • If the recognized text is not accurate enough, that is, the word error rate is too high, the punctuation model receives inaccurate input and its output quality suffers.
  • Sentence breaks, speaker characteristics, and colloquial speech, such as repeated words and unstable silences between words, cause errors in sentence-break decisions and reduce the accuracy of the punctuation model.
  • Integrating time offsets with the punctuation module. Time offsets are both input to and output from the punctuation module, which adds punctuation to the time-offset information it receives. The integration should make good use of the existing time offsets, and the output must not mistakenly alter the time offsets of the text, which would introduce obvious errors.
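One way to respect the last point is to keep punctuation in its own field so the time offsets are never rewritten. The sketch below assumes a hypothetical `Word` record and a list of per-word punctuation labels standing in for the output of some punctuation model; neither comes from the source.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Word:
    """Hypothetical recognizer word; frozen so offsets cannot be mutated."""
    text: str
    start_ms: int
    end_ms: int
    punct: str = ""  # mark that follows this word, if any


def attach_punctuation(words: List[Word], labels: List[str]) -> List[Word]:
    """Return copies of the words with punctuation added, offsets untouched."""
    if len(words) != len(labels):
        raise ValueError("expected one punctuation label per word")
    return [replace(word, punct=label) for word, label in zip(words, labels)]


def render(words: List[Word]) -> str:
    """Build display text from words and their trailing marks."""
    return " ".join(word.text + word.punct for word in words)


words = [Word("hello", 0, 400), Word("world", 800, 1200)]
labels = [",", "."]  # assumed to come from a punctuation model
punctuated = attach_punctuation(words, labels)
```

Here `render(punctuated)` gives `hello, world.`, and every `start_ms` and `end_ms` in `punctuated` is identical to the recognizer's original value.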

Capitalization

Chinese has no letter case, so capitalization does not need to be considered. In alphabetic languages such as English and French, however, the same word can carry noticeably different meanings under different capitalization. This is hard for native Chinese speakers to appreciate.

A speech recognition product that handles languages with letter case, such as English and French, needs a solution tailored to the characteristics of each language in order to improve readability.

As with punctuation, this problem can be addressed with rule-based methods or with machine learning models.

English, French, and similar languages have some simple rules to follow. For example:

  • The first word of a sentence starts with a capital letter, and words inside the sentence are generally lowercase.
  • By convention, certain words are always capitalized. The English word I, for example, is capitalized wherever it appears in a sentence.
  • In personal and place names, the first letter of each component is capitalized.
  • Abbreviations such as NBA, USA, LGTM, ASAP, and KIA are written in all capitals.
  • Specific conventions of expression, such as President versus president.
  • And so on.
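A few of the rules listed above can be sketched as follows. The word lists are tiny hypothetical samples, the input is assumed to be lowercase English text that already contains punctuation, and the function name `capitalize_text` is invented for this example.

```python
import re

# Hypothetical lookup tables; real systems need far larger, domain-specific lists.
ALWAYS_UPPER = {"nba", "usa", "lgtm", "asap", "kia"}
PROPER_NOUNS = {"alice": "Alice", "london": "London"}
SENTENCE_END = (".", "!", "?")
TOKEN = re.compile(r"^([A-Za-z']+)([.,!?;:]*)$")


def capitalize_text(text: str) -> str:
    """Apply simple capitalization rules to whitespace-separated tokens."""
    out = []
    sentence_start = True
    for tok in text.split():
        match = TOKEN.match(tok)
        if not match:
            # Leave digits and other unexpected tokens unchanged.
            out.append(tok)
            sentence_start = tok.endswith(SENTENCE_END)
            continue
        word, trail = match.group(1).lower(), match.group(2)
        if word in ALWAYS_UPPER:
            word = word.upper()              # abbreviations
        elif word == "i":
            word = "I"                       # pronoun I
        elif word in PROPER_NOUNS:
            word = PROPER_NOUNS[word]        # names of people and places
        elif sentence_start:
            word = word.capitalize()         # first word of a sentence
        out.append(word + trail)
        sentence_start = trail.endswith(SENTENCE_END)
    return " ".join(out)
```

For the input `alice and i watched the nba in london. it was great`, this returns `Alice and I watched the NBA in London. It was great`. Every rule depends on a lookup table or on punctuation already being present, so the sketch cannot tell President from president.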

However, human language is not fixed. The rules are hard to enumerate, special cases abound, and it is hard to tell which rules apply to a given sentence and which do not. For example:

  • Few rules govern personal and place names, and the number of such names is very large.
  • Abbreviations are strongly tied to domain and user group, and there are a great many of them.
  • Conventions of expression, such as President and president: the two forms share the same spelling and pronunciation but carry different meanings in a sentence.

Given these characteristics, the rule-based method suits only a relatively limited set of scenarios.

With the machine learning approach, a model predicts capitalization for the recognized text and produces a preliminary result. That result is then combined with the output of the punctuation step to produce the final result.

For the machine learning part, the case of each word is included as a feature when collecting data, labeling data, and designing the model. The trained model is then used to predict the case of each word in the text.

Implementing this feature involves difficulties such as:

  • The usual machine learning difficulties: model design, training, and the collection and labeling of training data.
  • How to feed time offsets and punctuation results into case prediction so that words are cased correctly both at sentence boundaries and in mid-sentence.
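Combining the two outputs might look like the sketch below. It assumes a hypothetical case model that emits one label per word (`lower`, `title`, or `upper`) and a punctuation step that emits the mark following each word; the label set and the function name `apply_case` are illustrative, not taken from the source.

```python
from typing import List

SENTENCE_END = {".", "?", "!"}


def apply_case(words: List[str], case_labels: List[str], punct_labels: List[str]) -> str:
    """Merge per-word case labels with punctuation-derived sentence boundaries."""
    if not (len(words) == len(case_labels) == len(punct_labels)):
        raise ValueError("words and both label lists must have equal length")
    out = []
    sentence_start = True
    for word, case, punct in zip(words, case_labels, punct_labels):
        if case == "upper":
            cased = word.upper()
        elif case == "title" or sentence_start:
            # Sentence boundaries from punctuation override a "lower" label.
            cased = word[:1].upper() + word[1:]
        else:
            cased = word
        out.append(cased + punct)
        sentence_start = punct in SENTENCE_END
    return " ".join(out)


words = ["the", "president", "visited", "nasa", "he", "left"]
case_labels = ["lower", "title", "lower", "upper", "lower", "lower"]  # assumed model output
punct_labels = ["", "", "", ".", "", "."]                             # assumed model output
```

With these inputs, `apply_case(words, case_labels, punct_labels)` returns `The President visited NASA. He left.`. The model decides mid-sentence case, while the punctuation result ensures the first word of each sentence is capitalized.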

Number format normalization

The prevailing approach in industry is rule-based text processing: regular expressions combined with code process numbers, physical units, and similar information in the recognition result and convert them into their written form.
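A minimal sketch of this regex-plus-code approach for English is shown below. It assumes lowercase input, handles only numbers below one million, and ignores forms such as "one hundred and five"; the unit table and the function names are hypothetical. It also rewrites every number word, so a phrase like "no one knows" would be damaged, which is the kind of special case a production rule set has to handle.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALE_WORDS = ("hundred", "thousand")
UNIT_ABBREVIATIONS = {"kilometers": "km", "kilograms": "kg"}  # sample table

# Longest words first so "seventeen" is tried before "seven".
_words = sorted(list(UNITS) + list(TENS) + list(SCALE_WORDS), key=len, reverse=True)
_WORD = "(?:" + "|".join(_words) + ")"
NUMBER_RUN = re.compile(rf"\b{_WORD}(?:[ -]{_WORD})*\b")
UNIT_RUN = re.compile(r"(\d+) (" + "|".join(UNIT_ABBREVIATIONS) + r")\b")


def words_to_int(words):
    """Convert a run of English number words to an integer."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def normalize(text: str) -> str:
    """Rewrite spoken numbers as digits and abbreviate known units."""
    text = NUMBER_RUN.sub(
        lambda m: str(words_to_int(re.split(r"[ -]", m.group(0)))), text
    )
    return UNIT_RUN.sub(lambda m: f"{m.group(1)} {UNIT_ABBREVIATIONS[m.group(2)]}", text)
```

For example, `normalize("the trip is two hundred thirty five kilometers and takes three hours")` returns `the trip is 235 km and takes 3 hours`.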

Machine learning methods can also be considered for this problem, but in practice their results have been poor and unstable and do not meet commercial requirements.

Take the expression of numbers as an example: each language has its own conventions, the same language differs from region to region, and there is no unified rule.

In written text, numbers are generally written as Arabic numerals, and the way each language reads numbers aloud is fairly stable. The rule-based method is therefore simple and effective, and from a practical standpoint it can cover 80% of common scenarios.

Implementing this feature involves difficulties such as:

  • Speakers of different languages read numbers differently, so a processing module must be customized for each language.
  • Within a given language, the way numbers are read depends strongly on region and population, so the relevant rules must be collected and given targeted implementations.
  • The input is the text output by the speech recognition stage, so the accuracy of that output affects the accuracy of number format normalization. If the recognition result contains errors, some adaptation and tolerance may be needed to improve the accuracy of the final result.

Original: https://blog.csdn.net/babyblue_963/article/details/113732406
Author: 小南家的青蛙
Title: ASR项目实战-后处理
