
1. Overview of Text Preprocessing



As with other machine learning tasks, the first step of a natural language processing task is text (data) preparation, also called text (data) preprocessing. The text preprocessing pipeline is shown in the following figure:


The word segmentation step divides text preprocessing into two stages. The steps before it, text standardization and text cleaning, operate at corpus (document) level. The steps after it, word cleaning, word standardization, and text representation, operate at word level.

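Corpus-level text processing operates on each document in the dataset. It is more efficient than word-level processing, and it can remove obstacles to word segmentation ahead of time. For example, English is segmented on spaces, so punctuation such as a comma directly adjacent to a word produces a non-standard token ('word,' where the standard form is 'word').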
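The punctuation effect described above can be seen in a minimal sketch. The sample sentence and the helper name `strip_punctuation` are illustrative assumptions, not part of the original article.

```python
import string

def strip_punctuation(text: str) -> str:
    """Corpus-level cleaning: delete ASCII punctuation before segmentation."""
    return text.translate(str.maketrans("", "", string.punctuation))

sentence = "Preprocess every word, then split."

# Splitting on whitespace first leaves punctuation glued to the tokens.
raw_tokens = sentence.split()
# Cleaning the whole text first yields standard tokens.
clean_tokens = strip_punctuation(sentence).split()

print(raw_tokens)    # ['Preprocess', 'every', 'word,', 'then', 'split.']
print(clean_tokens)  # ['Preprocess', 'every', 'word', 'then', 'split']
```

Deleting punctuation from the whole text in one pass is also cheaper than repairing each token after the split, which is the efficiency argument made above.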



Word-level text processing is performed after the corpus has been segmented, and it operates on every word in each document. It consists of four main steps: word filtering, standardization of word forms (for example, unifying Chinese capital numerals with Arabic numerals, and unifying the different tenses and voices of English words), spelling correction, and text representation.
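As a rough illustration of the first two word-level steps, the sketch below filters tokens and then standardizes their written form. The stopword set, the digit mapping, and the function names are hypothetical and chosen only for this example. The mapping handles individual capital digits only, not positional numerals such as 壹佰.

```python
# Hypothetical resources for illustration only.
STOPWORDS = {"the", "a", "of"}
CAPITAL_DIGITS = str.maketrans("零壹贰叁肆伍陆柒捌玖", "0123456789")

def filter_words(tokens):
    """Word filtering: drop tokens found in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def standardize_word(token):
    """Standardization: map Chinese capital digits to Arabic digits."""
    return token.translate(CAPITAL_DIGITS)

tokens = ["the", "price", "of", "壹贰叁", "apples"]
processed = [standardize_word(t) for t in filter_words(tokens)]
print(processed)  # ['price', '123', 'apples']
```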

2. Text Standardization

2.1 Character Encoding Standardization (Full-Width to Half-Width English Characters)

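In computing, all Chinese characters are full-width, whereas English letters, Arabic numerals, and symbols each have both a full-width and a half-width Unicode encoding. The full-width forms occupy code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E), and the half-width forms occupy 33 to 126 (0x21 to 0x7E). The space character is a special case: its full-width code point is 12288 (0x3000) and its half-width code point is 32 (0x20).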

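Thus, apart from the space character, the code point of each full-width character equals the code point of its half-width counterpart plus 65248. Character encoding standardization can therefore be implemented as follows: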

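def full_to_half(text: str) -> str:
    """Convert full-width letters, digits, symbols, and spaces to half-width."""
    chars = []
    for char in text:
        inside_code = ord(char)
        if inside_code == 12288:  # full-width space
            inside_code = 32
        elif 65281 <= inside_code <= 65374:  # other full-width characters
            inside_code -= 65248
        chars.append(chr(inside_code))
    return "".join(chars)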
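The same offset rule can also be expressed as a translation table for `str.translate`. This is an alternative sketch rather than the article's own method, and the names `OFFSET`, `FULL_TO_HALF`, and `to_half_width` are assumptions introduced for this example.

```python
# Offset between a full-width character and its half-width counterpart.
OFFSET = 0xFF01 - 0x21  # 65248

# Map every full-width code point in 0xFF01..0xFF5E to its half-width form,
# and handle the full-width space (0x3000) as a special case.
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20

def to_half_width(text: str) -> str:
    """Apply the precomputed table; unmapped characters pass through."""
    return text.translate(FULL_TO_HALF)

sample = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13\uff01"  # full-width "ABC 123!"
print(to_half_width(sample))  # ABC 123!
```

Characters outside the mapped ranges, including Chinese characters, are left unchanged.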

2.2 Unifying English Letter Case


def upper2lower(text:str):
    return text.lower()

2.3 Unifying Traditional and Simplified Chinese





Traditional and Simplified Chinese can be unified with the following code:

from opencc import OpenCC

def chinese_standard(text:str, conversion='t2s'):
    cc = OpenCC(conversion)
    return cc.convert(text)

3. Text Cleaning

Text cleaning often removes non-text content by filtering on Unicode code points. In the Unicode table, the CJK Unified Ideographs occupy the range 0x4E00 to 0x9FA5, and half-width English letters, Arabic numerals, and symbols occupy 0x21 to 0x7E, so the standard text character range is [0x4E00, 0x9FA5] ∪ [0x21, 0x7E].
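The range test above can be written directly as a code point comparison. This is a minimal sketch; the function names `is_standard_char` and `keep_standard_chars` and the sample string are assumptions made for illustration.

```python
def is_standard_char(char: str) -> bool:
    """Return True if char lies in [0x4E00, 0x9FA5] or [0x21, 0x7E]."""
    code = ord(char)
    return 0x4E00 <= code <= 0x9FA5 or 0x21 <= code <= 0x7E

def keep_standard_chars(text: str) -> str:
    """Drop every character outside the standard text ranges."""
    return "".join(c for c in text if is_standard_char(c))

# U+2605 (star) and U+3010/U+3011 (brackets) fall outside both ranges.
print(keep_standard_chars("NLP\u2605文本\u3010ok\u3011"))  # NLP文本ok
```

Because the half-width range starts at 0x21, this filter also removes the space character (0x20). If word boundaries in English text must survive cleaning, the space would need to be allowed explicitly.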



Non-text content filtering and punctuation filtering are implemented with regular expressions, as follows:

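import re

def clear_character(text):
    # Assumed patterns, derived from the character ranges given above.
    pattern = [
        r'[^\u4e00-\u9fa5\u0021-\u007e]',  # outside the standard text ranges
        r'[!-/:-@\[-`{-~]',                # half-width ASCII punctuation
    ]
    return re.sub('|'.join(pattern), '', text)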

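4. Word Segmentation

For details, see the author's article: Text Representation: Word Segmentation.

5. Word Cleaning

For details, see the author's article: Text Representation: Word Cleaning.

6. Word Standardization

For details, see the author's article: Text Representation: Word Standardization.

7. Spelling Correction

For details, see the author's article: Text Preprocessing: Spelling Correction.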

