(8) PositionRank Code Walkthrough (Part 3)

2021SC@SDUSC

Introduction

This post analyzes the process_data data-processing module.

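The read_input_file method

This method reads a file. Besides checking whether the path exists, note the second argument of decode, 'ignore': it tells the decoder to skip bytes that cannot be decoded as UTF-8. Without it, decoding raises a UnicodeDecodeError on the first invalid byte sequence.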

def read_input_file(this_file):
    if os.path.exists(this_file):
        with codecs.open(this_file, "rb") as f:
            b = f.read()
            text = b.decode('utf-8','ignore')
    else:
        text = None

    return text
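
The effect of the second argument can be checked on a small byte string. The sketch below uses a made-up sample (raw) containing one invalid byte; it is an illustration only and not part of PositionRank.

```python
# Hypothetical sample: valid UTF-8 text with one invalid byte (0xff) in the middle.
raw = "keyphrase".encode("utf-8") + b"\xff" + b" extraction"

# With 'ignore', the undecodable byte is silently dropped.
lenient = raw.decode("utf-8", "ignore")

# With the default error handler ('strict'), the same input raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(lenient)
print(strict_failed)
```

The trade-off is that 'ignore' never fails but can lose characters without warning, so it suits noisy corpora better than data that must round-trip exactly.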

The read_gold_file method

This method reads the keyword annotation (gold) file and collects the keywords into the list gold_list.

def read_gold_file(this_gold):
    if os.path.exists(this_gold):
        with codecs.open(this_gold, "rb") as f:
            b_list = f.readlines()
            gold_list = []
            for b in b_list:
                # Decode each line, skipping bytes that are not valid UTF-8.
                s = b.decode('utf-8', 'ignore')
                gold_list.append(s)
        # No explicit f.close() is needed: the with block closes the file.
    else:
        gold_list = None

    return gold_list
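
One detail worth knowing when using the result: readlines() keeps the line terminator, so every entry in gold_list still ends with a newline. The sketch below reproduces the same reading steps on a temporary file with made-up contents and then strips the entries. Whether the original project strips them at a later stage is not shown in this section, so treat the stripping step as an assumption.

```python
import codecs
import os
import tempfile

# Hypothetical gold file with one keyphrase per line.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"keyphrase extraction\ngraph ranking\n")
    gold_path = tmp.name

# Same reading steps as read_gold_file: binary read, then lenient decode.
with codecs.open(gold_path, "rb") as f:
    raw_lines = [b.decode("utf-8", "ignore") for b in f.readlines()]

# readlines() keeps the trailing newline, so strip before comparing phrases.
gold_list = [line.strip() for line in raw_lines if line.strip()]

os.remove(gold_path)
print(raw_lines)
print(gold_list)
```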

The word_tokenize method

This method splits text into word tokens. It is based on the NLTK tokenizer and lets you choose the language.

About NLTK:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

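In short, NLTK is a platform for building natural language processing programs in Python. It provides classification, tokenization, stemming, tagging, parsing, semantic reasoning, and more.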

def word_tokenize(text, language="english", preserve_line=False):
    """
    text may be a single sentence or a whole passage.
    :param text: source text
    :type text: str
    :param language: name of the Punkt model to use
    :type language: str
    :param preserve_line: if True, skip sentence splitting and tokenize the text as a single unit
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
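
The control flow of word_tokenize (optionally split into sentences, tokenize each sentence, flatten the result) can be sketched without NLTK. In the example below, simple_sent_split and simple_word_split are hypothetical regex stand-ins for sent_tokenize and the Treebank tokenizer; they are much cruder than the real ones and are used only so the sketch runs without downloading any models.

```python
import re

def simple_sent_split(text):
    # Stand-in for NLTK's sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_split(sentence):
    # Stand-in for the Treebank tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tokenize_sketch(text, preserve_line=False):
    # Same shape as word_tokenize: one sentence list, then a flattened token list.
    sentences = [text] if preserve_line else simple_sent_split(text)
    return [tok for sent in sentences for tok in simple_word_split(sent)]

tokens = tokenize_sketch("PositionRank ranks words. It uses positions.")
print(tokens)
```

The nested comprehension is what flattens the per-sentence token lists into a single list, which is the form the later filtering step expects.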

The filter_candidates method

This method filters candidate words against several criteria.

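def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-')

Parameters

tokens: the list of tokens to be filtered
stopwords_file: path to the stop-word file; if None, NLTK's English stop-word list is used
min_word_length: candidates shorter than this value are removed
valid_punctuation: punctuation allowed inside a word (by default the hyphen "-"); words containing any other non-alphanumeric character are removed
encoding: the stop-word file is opened with encoding='utf-8'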

Detailed analysis

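import codecs
import re
from string import punctuation  # assumed source of the `punctuation` name used below


def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):

    if stopwords_file is None:
        from nltk.corpus import stopwords
        stopwords_list = set(stopwords.words('english'))
    else:
        # Read the stop words while the file is still open and strip the
        # trailing newline from each line so that membership tests can match.
        with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f:
            stopwords_list = set(line.strip() for line in f)

    # Indices of the tokens to remove.
    indices = []

    for i, c in enumerate(tokens):

        # 1. Stop words.
        if c in stopwords_list:
            indices.append(i)

        # 2. Tokens that are too short.
        elif len(c) < min_word_length:
            indices.append(i)

        # 3. Bracket placeholders produced by the tokenizer.
        elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']:
            indices.append(i)

        else:

            letters_set = set([u for u in c])

            # 4. Tokens made up of punctuation only.
            if letters_set.issubset(punctuation):
                indices.append(i)

            # 5. Keep tokens built from letters, digits and valid punctuation;
            #    remove everything else.
            elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c):
                pass
            else:
                indices.append(i)

    # Delete in place; each earlier deletion shifts later indices left by one.
    dels = 0

    for index in indices:
        offset = index - dels
        del tokens[offset]
        dels += 1

    return tokens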
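
The five checks above can also be written as a single per-token predicate, which makes them easier to try out. The sketch below is a hedged rewrite and not the project's code: keep_token is a hypothetical helper, the stop-word set is made up so that NLTK is not required, and valid_punctuation is passed through re.escape as a precaution.

```python
import re
from string import punctuation

def keep_token(token, stopwords, min_word_length=2, valid_punctuation="-"):
    # Mirrors the checks in filter_candidates, applied to one token at a time.
    if token in stopwords or len(token) < min_word_length:
        return False
    if token in {"-lrb-", "-rrb-", "-lcb-", "-rcb-", "-lsb-", "-rsb-"}:
        return False  # bracket placeholders
    if set(token).issubset(punctuation):
        return False  # punctuation only
    # Keep only letters, digits, and the allowed punctuation.
    pattern = r"^[a-zA-Z0-9%s]*$" % re.escape(valid_punctuation)
    return re.match(pattern, token) is not None

stopwords = {"the", "of", "a"}  # hypothetical stop-word set
tokens = ["the", "state-of-the-art", "a", "x", "-lrb-", "--", "graph", "caf\u00e9", "rank2"]
kept = [t for t in tokens if keep_token(t, stopwords)]
print(kept)
```

Unlike filter_candidates, this version builds a new list instead of deleting from tokens in place, so the caller's list is left unchanged.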

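The MyCorpus class

Introduction

Parses the document collection found at the given path. Each iteration yields one document as a bag-of-words vector, built from that document's token list with dictionary.doc2bow.

Parameters

path_to_data: path to the document collection
dictionary: mapping between words and ids
length: number of documents to read (None means all of them)
encoding: encoding of the document files, 'utf-8' by default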

class MyCorpus(object):

    def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'):
        """初始化参数"""
        self.path_to_data = path_to_data
        self.dictionary = dictionary
        self.length = length
        self.encoding = encoding
        self.index_filename = {}

    def __iter__(self):

        index = 0

        for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length):
            self.index_filename[index] = filename
            index += 1
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        if self.length is None:
            self.length = sum(1 for doc in self)
        return self.length
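
The streaming pattern in __iter__ can be tried without the real data. In the sketch below, TinyDictionary and iter_docs are hypothetical stand-ins for the dictionary object and for iter_data; the only assumption about the real dictionary is that it exposes a doc2bow method returning (id, count) pairs, as the code above implies.

```python
import itertools
from collections import Counter

class TinyDictionary:
    # Hypothetical stand-in for the dictionary: maps tokens to ids and
    # turns a token list into sorted (id, count) pairs.
    def __init__(self, token2id):
        self.token2id = token2id

    def doc2bow(self, tokens):
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())

def iter_docs():
    # Hypothetical stand-in for iter_data: yields (filename, text, tokens).
    yield "a.txt", "graph rank graph", ["graph", "rank", "graph"]
    yield "b.txt", "rank word", ["rank", "word"]
    yield "c.txt", "word", ["word"]

dictionary = TinyDictionary({"graph": 0, "rank": 1, "word": 2})

# islice(..., 2) stops after two documents; islice(..., None) reads them all.
limited = [dictionary.doc2bow(tokens) for _, _, tokens in itertools.islice(iter_docs(), 2)]
everything = [dictionary.doc2bow(tokens) for _, _, tokens in itertools.islice(iter_docs(), None)]
print(limited)
print(len(everything))
```

This is why length=None works in MyCorpus: itertools.islice treats a stop value of None as "no limit", and __len__ then counts the documents by iterating once over the whole corpus.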

Original: https://blog.csdn.net/Simonsdu/article/details/121317526
Author: Simonsdu
Title: (8) PositionRank Code Walkthrough (Part 3)
