(8) PositionRank Code Walkthrough (Part 3)

May 30, 2023, 7:19 PM • Artificial Intelligence • 96 reads

2021SC@SDUSC

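Introduction

This post analyzes the process_data data-processing module.

The read_input_file function

This function reads a file. Besides checking that the path exists, note the second argument passed to decode, "ignore": it tells the decoder to skip bytes that cannot be decoded as UTF-8. Without it, decoding raises a UnicodeDecodeError as soon as an invalid byte sequence is encountered.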

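import codecs
import os


def read_input_file(this_file):
    # Return the decoded text, or None if the file does not exist.
    if os.path.exists(this_file):
        with codecs.open(this_file, "rb") as f:
            b = f.read()  # raw bytes
            # 'ignore' drops bytes that are not valid UTF-8 instead of raising
            text = b.decode('utf-8', 'ignore')
    else:
        text = None

    return text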
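The effect of the "ignore" argument can be seen on a small byte string. The following is a minimal sketch: the helper name decode_bytes and the sample bytes are illustrative and do not come from the PositionRank source.

```python
def decode_bytes(raw, errors="strict"):
    """Decode raw bytes as UTF-8 with the given error-handling mode."""
    return raw.decode("utf-8", errors)


# 0xff can never appear in valid UTF-8, so it stands in for a corrupt byte.
sample = b"key\xffword"

# With "ignore" the bad byte is silently dropped.
lenient = decode_bytes(sample, "ignore")

# With the default strict mode the same input raises UnicodeDecodeError.
try:
    decode_bytes(sample)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lenient), strict_failed)
```

Dropping bytes silently keeps the pipeline running, at the cost of possibly losing characters from the input.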

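The read_gold_file function

This function reads the keyword annotation (gold) file and collects its lines, one keyword entry per line, into the list gold_list.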

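def read_gold_file(this_gold):
    # Return the gold keyword lines, or None if the file does not exist.
    if os.path.exists(this_gold):
        with codecs.open(this_gold, "rb") as f:
            b_list = f.readlines()  # one bytes object per line
            gold_list = []
            for b in b_list:
                s = b.decode('utf-8', 'ignore')  # the trailing newline is kept
                gold_list.append(s)
        # no explicit f.close() is needed: the with block closes the file
    else:
        gold_list = None

    return gold_list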
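Because each decoded line keeps its trailing newline, a caller that compares gold keywords with extracted ones will usually want to strip them first. The variant below is a sketch under that assumption and is not the author's code: the name read_gold_keyphrases and the file name doc1.key are hypothetical.

```python
import os
import tempfile


def read_gold_keyphrases(path, encoding="utf-8"):
    """Return the non-empty, stripped lines of a gold file, or None if it is missing."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        # Decode leniently, as read_gold_file does, then strip whitespace and newlines.
        lines = [raw.decode(encoding, "ignore").strip() for raw in f]
    return [line for line in lines if line]


# Write a small gold file into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    gold_path = os.path.join(tmp, "doc1.key")  # hypothetical file name
    with open(gold_path, "wb") as f:
        f.write(b"keyphrase extraction\nposition rank\n\n")
    gold = read_gold_keyphrases(gold_path)
    missing = read_gold_keyphrases(os.path.join(tmp, "absent.key"))

print(gold, missing)
```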

The word_tokenize function

This function tokenizes text. It is built on the NLTK tokenizers, and the language can be chosen.

About NLTK:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

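In short, NLTK is a platform for building natural language processing programs in Python; it provides classification, tokenization, stemming, tagging, parsing, semantic reasoning, and more.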

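def word_tokenize(text, language="english", preserve_line=False):
    """
    text may be a single sentence or a whole passage.
    :param text: the source text
    :type text: str
    :param language: name of the Punkt model used for sentence splitting
    :type language: str
    :param preserve_line: if True, skip sentence splitting and tokenize the text as is
    :type preserve_line: bool
    """
    # sent_tokenize and _treebank_word_tokenizer come from NLTK's tokenize package
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]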
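The two-stage structure (split into sentences first, then into words) and the role of preserve_line can be illustrated without NLTK. The sketch below uses deliberately simplified regex stand-ins, naive_sent_tokenize and naive_treebank_tokenize, which are hypothetical names; they only imitate the idea that a period is separated at the end of a string, and they are not the NLTK implementations.

```python
import re


def naive_sent_tokenize(text):
    """Stand-in sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def naive_treebank_tokenize(sentence):
    """Stand-in word tokenizer: detach a period only at the very end of the string."""
    sentence = re.sub(r'([^.])\.\s*$', r'\1 .', sentence)
    return sentence.split()


def naive_word_tokenize(text, preserve_line=False):
    # Same control flow as word_tokenize: one "sentence" if preserve_line is set.
    sentences = [text] if preserve_line else naive_sent_tokenize(text)
    return [tok for sent in sentences for tok in naive_treebank_tokenize(sent)]


text = "Hello world. Bye now."
print(naive_word_tokenize(text))
print(naive_word_tokenize(text, preserve_line=True))
```

With sentence splitting, every sentence-final period becomes its own token; with preserve_line=True the whole text is treated as one string, so only the last period is detached in this simplified model.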

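The filter_candidates function

This function filters the candidate words against several criteria.

def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-')

Parameters

tokens: the list of tokens to filter
stopwords_file: path to a stopwords file; if None, NLTK's English stopword list is used
min_word_length: tokens shorter than this are discarded
valid_punctuation: punctuation allowed inside a token; tokens containing any other non-alphanumeric character are discarded. By default only the hyphen "-" is allowed.
encoding='utf-8': the encoding used to read the stopwords file; it is fixed in the function body and is not a parameter

Detailed analysis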

import codecs
import re
from string import punctuation


def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):

    if stopwords_file is None:
        from nltk.corpus import stopwords
        stopwords_list = set(stopwords.words('english'))
    else:
        # Read every line while the file is open and strip the trailing
        # newline so that each entry can match a token.
        with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f:
            stopwords_list = set(line.strip() for line in f)

    indices = []  # positions of the tokens to delete

    for i, c in enumerate(tokens):

        if c in stopwords_list:  # stopword
            indices.append(i)

        elif len(c) < min_word_length:  # too short
            indices.append(i)

        # Penn Treebank escapes for brackets
        elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']:
            indices.append(i)

        else:

            letters_set = set([u for u in c])

            if letters_set.issubset(punctuation):  # punctuation only
                indices.append(i)

            # keep tokens made only of letters, digits and valid punctuation
            elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c):
                pass
            else:
                indices.append(i)

    dels = 0

    # Delete in place; each earlier deletion shifts later positions left by one.
    for index in indices:
        offset = index - dels
        del tokens[offset]
        dels += 1

    return tokens
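The same criteria can be read as a single predicate applied to each token. The sketch below is not the author's code: keep_token and filter_tokens are hypothetical names, the stopword set is passed in explicitly so that no NLTK data is needed, and a new list is returned instead of deleting from the input in place.

```python
import re
from string import punctuation

# Penn Treebank escapes for (, ), {, }, [ and ]
BRACKET_ESCAPES = {'-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-'}


def keep_token(token, stopwords, min_word_length=2, valid_punctuation='-'):
    """Return True if the token survives every filtering criterion."""
    if token in stopwords:
        return False
    if len(token) < min_word_length:
        return False
    if token in BRACKET_ESCAPES:
        return False
    if set(token).issubset(punctuation):  # made of punctuation only
        return False
    # Only ASCII letters, digits and the allowed punctuation may appear.
    pattern = r'^[a-zA-Z0-9%s]*$' % re.escape(valid_punctuation)
    return re.match(pattern, token) is not None


def filter_tokens(tokens, stopwords, **kwargs):
    """Return a new list containing only the tokens that pass keep_token."""
    return [t for t in tokens if keep_token(t, stopwords, **kwargs)]


tokens = ['the', 'graph-based', 'ranking', '-lrb-', 'x', '!!', 'naïve', 'model']
print(filter_tokens(tokens, stopwords={'the'}))
```

Note that the character class accepts only ASCII letters and digits, so a token such as 'naïve' is discarded along with genuinely noisy tokens.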

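The MyCorpus class

Introduction

Parses the collection of documents found under the given path. Iterating over the object yields one document at a time as a bag-of-words vector, built from the document's token list with dictionary.doc2bow.

Parameters

path_to_data: path to the document collection
dictionary: mapping between words and their ids
length: number of documents
encoding: encoding of the document files, 'utf-8' by default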

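import itertools


class MyCorpus(object):

    def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'):
        """Initialize the parameters."""
        self.path_to_data = path_to_data
        self.dictionary = dictionary
        self.length = length
        self.encoding = encoding
        self.index_filename = {}  # document index -> filename, filled during iteration

    def __iter__(self):

        index = 0

        # iter_data is used here but not shown in this post; it yields (filename, text, tokens).
        # islice stops after self.length documents (all of them when length is None).
        for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length):
            self.index_filename[index] = filename
            index += 1
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        # If no length was given, count the documents with one full pass.
        if self.length is None:
            self.length = sum(1 for doc in self)
        return self.length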
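The streaming pattern used by MyCorpus (iterate lazily, cap the number of documents with islice, count lazily in __len__) can be tried without gensim or files on disk. In the sketch below everything is a stand-in: TinyDictionary is a hypothetical minimal replacement for the dictionary object, and StreamedCorpus reads from an in-memory list of (filename, tokens) pairs instead of calling iter_data.

```python
import itertools
from collections import Counter


class TinyDictionary:
    """Hypothetical stand-in for a dictionary object mapping word -> integer id."""

    def __init__(self, documents):
        self.token2id = {}
        for tokens in documents:
            for tok in tokens:
                # Assign the next free id the first time a token is seen.
                self.token2id.setdefault(tok, len(self.token2id))

    def doc2bow(self, tokens):
        # Bag of words: sorted (token_id, count) pairs for known tokens.
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())


class StreamedCorpus:
    """Same iteration logic as MyCorpus, over an in-memory document list."""

    def __init__(self, docs, dictionary, length=None):
        self.docs = docs
        self.dictionary = dictionary
        self.length = length
        self.index_filename = {}

    def __iter__(self):
        limited = itertools.islice(self.docs, self.length)
        for index, (filename, tokens) in enumerate(limited):
            self.index_filename[index] = filename
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        if self.length is None:
            self.length = sum(1 for _ in self)
        return self.length


docs = [("a.txt", ["graph", "rank", "graph"]), ("b.txt", ["rank", "word"])]
dictionary = TinyDictionary(tokens for _, tokens in docs)
corpus = StreamedCorpus(docs, dictionary)

bows = list(corpus)
print(bows, len(corpus), corpus.index_filename)
```

Yielding one vector at a time means the whole collection never has to be held in memory, which is the usual reason for writing a corpus as an iterable class rather than a list.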

Original: https://blog.csdn.net/Simonsdu/article/details/121317526
Author: Simonsdu
Title: (8) PositionRank Code Walkthrough (Part 3)




