Environment Setup | NLP Libraries: Installation, Usage Examples, How They Work, and Error Analysis

June 27, 2023, 7:42 AM • Artificial Intelligence

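1. Learning the spaCy library

1.1. Introduction

spaCy is a library for advanced natural language processing, written in Python and Cython and often used for text preprocessing. It builds on recent research and was designed from the start for use in real products. spaCy ships with pretrained statistical models and word vectors, and currently supports tokenization for more than 20 languages. It offers a syntactic parser that the project has described as the fastest in the world, convolutional neural network models for tagging, parsing, and named entity recognition, and integration with deep learning frameworks. It is commercial open-source software released under the MIT license.【1】

1.2. Installation

Environment: Windows 10, PyCharm, and an Anaconda virtual environment (take care not to install the same package with both pip and conda).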

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
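After installing, it can help to confirm from Python which copy of a package the interpreter actually sees, particularly when both pip and conda have been used in the same environment. The following is a minimal sketch using only the standard library; the helper name describe_package is hypothetical and is not part of spaCy.

```python
import importlib.metadata
import importlib.util


def describe_package(name):
    """Return (version, location) for an importable package, or None if absent.

    `describe_package` is a hypothetical helper for illustration only.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # the interpreter cannot import this name at all
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        version = "unknown"  # importable, but no distribution metadata found
    return version, spec.origin


# Prints None if spaCy is missing, otherwise its version and file location.
print(describe_package("spacy"))
```

If the printed location is not inside the virtual environment you expect, the package was probably installed into a different interpreter.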

1.3. Usage examples

Different languages require additional dependency libraries.

1.3.1. English tokenization

import spacy  # import the package

# English tokenization
# Load the English model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Run the model by passing in a sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Get the tokenization result
print([token.text for token in doc])

Result
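Tokenization itself does not depend on a trained model. As a minimal sketch, assuming spaCy is installed, spacy.blank("en") creates an English pipeline that contains only the tokenizer, which is handy when en_core_web_sm has not been downloaded yet. The function name tokenize_english is hypothetical, and the demo is skipped when spaCy cannot be imported.

```python
import importlib.util


def tokenize_english(text):
    """Tokenize text with spaCy's rule-based English tokenizer (no trained model)."""
    import spacy  # imported lazily so this file still loads without spaCy

    nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no tagger/parser/ner
    return [token.text for token in nlp(text)]


# Only run the demo when spaCy is importable in the current environment.
if importlib.util.find_spec("spacy") is not None:
    print(tokenize_english("Apple is looking at buying U.K. startup for $1 billion"))
```

A blank pipeline produces tokens only; part-of-speech tags, dependencies, and entities still require a trained pipeline loaded with spacy.load().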

1.3.2. Chinese tokenization and word encoding

# Chinese tokenization and word embeddings
import spacy  # import the package

# Load the model, excluding the components we do not need
nlp1 = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
# Process the sentences
doc = nlp1("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
# Loop over the tokens and get the vector for each one
for token in doc:
    # Only the first 5 values are shown for readability; the model actually
    # encodes each Chinese word as a 96-dimensional vector
    print(token.text, token.tensor[:5])

Result

1.3.3. Korean tokenization and dependency parsing

# Korean dependency parsing
# Command to download the Korean model (run inside the virtual environment):
#   python -m spacy download ko_core_news_sm

import spacy  # import the package
from spacy.lang.ko.examples import sentences

nlp2 = spacy.load("ko_core_news_sm")
doc = nlp2(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Result

See 【2】 for reference.

1.3.4. Detecting named entities and their types in English text

import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.

"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

This prints each named entity detected in the document together with its entity type.

1.3.5. Word and text similarity

import spacy

# Install the model first: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Semantic similarity (relatedness) between words
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']

print(dog.similarity(animal), dog.similarity(fruit))  # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

# Semantic similarity (relatedness) between texts
target = nlp("Cats are beautiful animals.")

doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")

# Compare the target sentence with each candidate sentence
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))
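By default, spaCy's similarity() compares vectors with cosine similarity, and a Doc's vector defaults to the average of its token vectors. The sketch below reproduces that arithmetic in plain Python on made-up 3-dimensional vectors; the numbers are illustrative assumptions and are not taken from en_core_web_lg.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare against
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical toy word vectors (real models use hundreds of dimensions).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "animal": [0.8, 0.3, 0.1],
    "fruit": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["animal"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["fruit"]))

# A "document" vector as the average of its word vectors.
doc_vector = average_vector([toy_vectors["dog"], toy_vectors["animal"]])
print(cosine_similarity(doc_vector, toy_vectors["animal"]))
```

This is also why the example loads en_core_web_lg: the small pipelines ship without static word vectors, so their similarity scores are less meaningful.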

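1.4. How it works

Components: tok2vec, tagger, morphologizer, parser, lemmatizer (trainable_lemmatizer), senter, and ner.

spaCy's processing pipeline

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several steps, which together are called the processing pipeline. The pipelines used by the trained models typically include a tagger, a lemmatizer, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.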
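The pipeline idea described above can be sketched in plain Python. This is a toy model under stated assumptions, not spaCy's implementation: ToyDoc, the two components, and run_pipeline are all hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for spaCy's Doc: tokens plus per-step annotations."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def tokenize(text):
    """Step 0: turn raw text into a ToyDoc (whitespace split, for illustration)."""
    return ToyDoc(text=text, tokens=text.split())


def lowercase_component(doc):
    """Toy component: record a lowercased form for each token."""
    doc.annotations["lower"] = [token.lower() for token in doc.tokens]
    return doc  # every component returns the doc for the next one


def length_component(doc):
    """Toy component: record the character length of each token."""
    doc.annotations["length"] = [len(token) for token in doc.tokens]
    return doc


def run_pipeline(text, components):
    """Tokenize first, then pass the same doc through each component in order."""
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("London is the capital", [lowercase_component, length_component])
print(doc.tokens)
print(doc.annotations)
```

In spaCy itself, nlp.pipe_names lists the component names of a loaded pipeline in the order they run.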

tok2vec:

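1.5. Error fixes

Error 1

After running pip install spacy, the script fails with an error saying that spacy.load() does not exist.

Uninstall spaCy:

pip uninstall spacy

Then reinstall it:

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple

Root cause: if the error persists after reinstalling, check the name of your own script. The error comes from naming your file "spacy", which creates a naming conflict: Python imports the local file instead of the installed library.

Solution: rename the file spacy.py so that it does not share a name with the spaCy library.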
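The naming conflict can be checked mechanically. The sketch below, using only the standard library, looks in a project folder for a file or folder that would shadow an installed module; find_shadowing_files is a hypothetical helper, and the temporary directory stands in for a real project.

```python
import tempfile
from pathlib import Path


def find_shadowing_files(module_name, directory):
    """Return local files or folders in `directory` that would shadow `module_name`.

    A script named spacy.py (or a folder named spacy) next to your code is
    found before the installed library when Python resolves `import spacy`.
    """
    directory = Path(directory)
    candidates = [directory / f"{module_name}.py", directory / module_name]
    return [path for path in candidates if path.exists()]


# Demonstration in a throwaway directory (hypothetical project layout).
with tempfile.TemporaryDirectory() as project_dir:
    before = [p.name for p in find_shadowing_files("spacy", project_dir)]
    (Path(project_dir) / "spacy.py").write_text("# my own script\n")
    after = [p.name for p in find_shadowing_files("spacy", project_dir)]

print(before)
print(after)
```

Calling find_shadowing_files("spacy", ".") from the folder that holds your script points at the file to rename. Printing spacy.__file__ after import spacy is another quick way to see which file was actually imported.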

Error 2

Running python -m spacy download en_core_web_sm produces the following error:

E:\Anaconda3\envs\tf24\lib\site-packages\h5py\__init__.py:39: UserWarning: h5py is running against HDF5 1.10.5 when it was built against 1.10.6, this may cause problems
'{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.

Data corruption or segmentation faults may occur if the application continues.

This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.

You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.

You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.

Setting it to 2 or higher will suppress the warning messages totally.

Headers are 1.10.6, library is 1.10.5

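Root cause: PyCharm can upgrade installed libraries to newer versions, which leads to a version mismatch. Here h5py was built against HDF5 1.10.6 but is running against HDF5 1.10.5.

Solution (my versions: h5py 2.10.0, tensorflow 2.4.0, Python 3.7):

Uninstall: pip uninstall h5py

Install: pip install h5py==2.10.0

After this change, the download succeeded.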
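The last line of the warning states the mismatch directly. As a small sketch (the helper hdf5_versions is hypothetical and not part of h5py), the two version numbers can be pulled out of the message and compared:

```python
import re


def hdf5_versions(message):
    """Extract (headers, library) version tuples from an HDF5 mismatch message.

    Returns None when the message does not contain the summary line.
    """
    match = re.search(
        r"Headers are (\d+(?:\.\d+)*), library is (\d+(?:\.\d+)*)", message
    )
    if match is None:
        return None
    headers, library = (
        tuple(int(part) for part in version.split("."))
        for version in match.groups()
    )
    return headers, library


log_line = "Headers are 1.10.6, library is 1.10.5"  # copied from the log above
headers, library = hdf5_versions(log_line)
print(headers, library, "mismatch" if headers != library else "ok")
```

Comparing tuples of integers rather than raw strings avoids ordering surprises such as "1.10" sorting before "1.9".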

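2. Learning textacy

textacy is a Python library for a variety of natural language processing tasks. It is built on the high-performance spaCy library and implements several common data extraction algorithms on top of it.

Example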

import spacy
import textacy.extract

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.

Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.

"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements.
# The positional call is kept from the original; newer textacy releases may
# expect keyword arguments instead (for example entity="London", cue="be").
statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    # Print inside the loop so that every fact is shown, not only the last one
    print(f" - {fact}")

Error 1

Traceback (most recent call last):
File "G:/NLP/bert-master/bert-master/nlpbase/textacypre.py", line 18, in

References

【1】Trained Models & Pipelines · spaCy Models Documentation

【2】eunjeon / mecab-ko / README.md, Bitbucket (bitbucket.org)

【3】spaCy, a toolkit for English text processing, Jianshu (jianshu.com)

Original: https://blog.csdn.net/weixin_44649780/article/details/127808263
Author: 夏天｜여름이다
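Title: Environment Setup | NLP Libraries: Installation, Usage Examples, How They Work, and Error Analysis

Original articles are protected by copyright. Please credit the source when reposting: https://www.johngo689.com/654547/

Reposted articles are protected by the original author's copyright. Please credit the original author when reposting!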
