Installing and Using the spaCy Library: A Summary of Python spaCy Usage [work in progress]

June 4, 2023, 3:39 PM • Artificial Intelligence

spaCy usage guide

1. Installation

2. Usage

2.1 Word tokenization (doc: token)

2.2 English sentence segmentation (doc.sents: sent)

2.3 Lemmatization (doc: token, token.lemma_, token.lemma)

2.4 Part-of-speech tagging (doc: token, token.pos_, token.pos)

2.5 Named entity recognition (doc.ents: ent, ent.label_, ent.label)

2.6 Noun phrase extraction (doc.noun_chunks)

2.7 Word-vector similarity between two words (doc[index_i].similarity(doc[index_j]))

1. Installation

See the summary at the end of my other post on Python spaCy installation problems.

2. Usage

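spaCy is a Python natural language processing toolkit first released in mid-2014. It describes itself as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it practical for production use.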

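import spacy

# The model name must be a quoted string; install it first with
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")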

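For the official documentation, see spaCy (https://spacy.io/usage/linguistic-features).

When this was written, spaCy mainly supported English and German; current releases ship trained pipelines for many more languages.

Features include word tokenization, English sentence segmentation, lemmatization, part-of-speech tagging, named entity recognition, noun phrase extraction, similarity computation, and more.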
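Which of these features are available depends on the pipeline that is loaded. The following is a minimal sketch, assuming spaCy v3; the helper name load_english is made up for illustration, and the fallback to a blank pipeline is a choice made for this example, not something the original post does.

```python
import spacy


def load_english(name="en_core_web_sm"):
    """Load a trained English pipeline, or fall back to a blank one.

    `load_english` is a hypothetical helper for this example only.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        # A blank pipeline only tokenizes: no tagger, parser, or NER.
        return spacy.blank("en")


nlp = load_english()
print(nlp.lang)        # language code of the pipeline
print(nlp.pipe_names)  # components that run on each text, in order
```

With a trained pipeline, pipe_names lists components such as the tagger, parser, and entity recognizer; with the blank fallback the list is empty and only tokenization works.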

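2.1 Word tokenization (doc: token)

Splits English words and punctuation marks into separate tokens. Chinese text, if present, is split only where spaces separate the characters.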

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)
it's word tokenize test for spacy

In [5]: for token in test_doc:
   ...:     print(token)
   ...:
it
's
word
tokenize
test
for
spacy

test_doc is a spacy.tokens.doc.Doc object.
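Tokenization itself does not need a trained model. As a small sketch, assuming spaCy v3, a blank English pipeline is enough to reproduce the split above and to inspect a few per-token attributes:

```python
import spacy

# A blank pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")
doc = nlp("it's word tokenize test for spacy")

tokens = [token.text for token in doc]
print(tokens)

# Each token records its position in the Doc (token.i) and its
# character offset in the original string (token.idx).
for token in doc:
    print(token.i, token.idx, repr(token.text), token.is_alpha)
```

Note that the contraction "it's" is split into two tokens, "it" and "'s".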

2.2 English sentence segmentation (doc.sents: sent)

In [6]: test_doc = nlp(u"Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.")

In [7]: for sent in test_doc.sents:
   ...:     print(sent)
   ...:

Natural language processing (NLP) deals with the application of computational models to text or speech data.

Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.

NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.

From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.
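Trained pipelines normally derive sentence boundaries from the dependency parse. When no model is installed, a rule-based alternative is the built-in sentencizer component. A minimal sketch, assuming spaCy v3 (where components are added by string name) and a short made-up text:

```python
import spacy

nlp = spacy.blank("en")
# "sentencizer" is spaCy's built-in punctuation-based sentence splitter.
nlp.add_pipe("sentencizer")

text = "NLP deals with text and speech data. It has many application areas."
doc = nlp(text)

# doc.sents yields Span objects; .text gives the sentence string.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The sentencizer splits only on punctuation, so it is faster than the parser but may handle abbreviations and unusual punctuation less well.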

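2.3 Lemmatization (doc: token, token.lemma_, token.lemma)

Strictly speaking this step is lemmatization rather than stemming: it reduces plurals to the singular, past tense to the base form, am/is/are to be, and so on. token.lemma_ is the lemma as a string and token.lemma is its integer ID. The session here was recorded under Python 2 with an early spaCy release, so print shows tuples with u'' strings; the integer IDs differ between spaCy versions.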

In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
   ...:     print(token, token.lemma_, token.lemma)
   ...:
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)
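The relationship between the string attribute (lemma_) and the integer attribute (lemma) can be shown without a trained model. In this sketch, assuming spaCy v3, the lemmas are written by hand and passed to the Doc constructor; they stand in for what a trained lemmatizer would produce.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written lemmas (an assumption for this example, not model output).
words = ["these", "books", "are", "best"]
lemmas = ["these", "book", "be", "good"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

for token in doc:
    # token.lemma is an integer ID; the vocabulary's StringStore
    # maps it back to the same text that token.lemma_ returns.
    looked_up = nlp.vocab.strings[token.lemma]
    print(token.text, token.lemma_, token.lemma, looked_up)

pairs = [(token.text, token.lemma_) for token in doc]
```

The same convention holds for the other attribute pairs in this post, such as pos_ / pos and label_ / label: the name with the trailing underscore is the readable string.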

2.4 Part-of-speech tagging (doc: token, token.pos_, token.pos)

In [10]: for token in test_doc:
   ....:     print(token, token.pos_, token.pos)
   ....:
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)
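Once tokens carry coarse part-of-speech tags, they are easy to count or filter. The sketch below assumes spaCy v3 and assigns the tags by hand through the Doc constructor, so it runs without a trained tagger; with a real pipeline the tags would come from nlp(text) instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-assigned tags (an assumption for this example, not model output).
words = ["I", "love", "these", "books", "."]
pos = ["PRON", "VERB", "DET", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Count how often each coarse tag occurs.
pos_counts = Counter(token.pos_ for token in doc)
print(pos_counts)

# Keep only the nouns.
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)

# spacy.explain returns a short description of a tag or label.
print(spacy.explain("ADP"))
```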

2.5 Named entity recognition (doc.ents: ent, ent.label_, ent.label)

In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
   ....:     print(ent, ent.label_, ent.label)
   ....:
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)
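Entities in doc.ents are Span objects, so they expose token and character offsets as well as a label. The following sketch groups entities by label; to stay runnable without a trained model it assigns the three spans from the session above by hand, which is an assumption of this example (a trained pipeline would fill doc.ents itself).

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Rami Eid is studying at Stony Brook University in New York")

# Hand-assigned spans standing in for a trained NER component.
# Span(doc, start, end, label) uses token indices, end exclusive.
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),
    Span(doc, 5, 8, label="ORG"),
    Span(doc, 9, 11, label="GPE"),
]

by_label = defaultdict(list)
for ent in doc.ents:
    by_label[ent.label_].append(ent.text)
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```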

2.6 Noun phrase extraction (doc.noun_chunks)

In [13]: test_doc = nlp(u"Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.")

In [14]: for chunk in test_doc.noun_chunks:
   ....:     print(chunk)
   ....:

Natural language processing

Natural language processing (NLP) deals

the application

computational models

text

speech

data

Application areas

NLP

automatic (machine) translation

languages

……
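Noun chunks are derived from the dependency parse and part-of-speech tags, so doc.noun_chunks needs both. The sketch below, assuming spaCy v3, supplies a tiny hand-written parse through the Doc constructor instead of running a trained parser; the sentence and its annotations are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations (an assumption, not model output).
# heads holds the absolute index of each token's syntactic head;
# the root token points to itself.
words = ["She", "reads", "long", "books"]
pos = ["PRON", "VERB", "ADJ", "NOUN"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Each chunk is a Span; chunk.root is the noun that heads the phrase.
chunks = [(chunk.text, chunk.root.text, chunk.root.dep_)
          for chunk in doc.noun_chunks]
for chunk_text, root_text, root_dep in chunks:
    print(chunk_text, "| root:", root_text, "| dep:", root_dep)
```

Because the chunks follow the parse, a parsing mistake can produce an odd chunk, which may explain entries such as "Natural language processing (NLP) deals" in the recorded output.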

2.7 Word-vector similarity between two words (doc[index_i].similarity(doc[index_j]))

In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)

Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)

oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)

Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)

hippos

In [24]: apples.similarity(oranges)

Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)

Out[25]: 0.038474555379008429
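By default, similarity is the cosine similarity of the two tokens' word vectors, so meaningful scores require a pipeline that ships word vectors (for example en_core_web_md rather than en_core_web_sm). The computation itself can be sketched in plain Python; the three-dimensional vectors below are toy values invented for this example, not real spaCy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only.
toy_vectors = {
    "apples": [0.9, 0.8, 0.1],
    "oranges": [0.8, 0.9, 0.2],
    "boots": [0.1, 0.2, 0.9],
}

fruit_score = cosine_similarity(toy_vectors["apples"], toy_vectors["oranges"])
other_score = cosine_similarity(toy_vectors["apples"], toy_vectors["boots"])
print(round(fruit_score, 3), round(other_score, 3))
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0, which matches the pattern in the session above.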

Adapted from http://www.52nlp.cn/tag/python-spacy

Original: https://blog.csdn.net/weixin_42509766/article/details/111907306
Author: 王润壮
Title: Installing and Using the spaCy Library: A Summary of Python spaCy Usage [work in progress]

