Python Natural Language Processing: Document Similarity Computation (gensim.models)

This article explores the document similarity computation methods in the third-party Python library gensim.

The official documentation is at:

https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/models


import jieba                      # Chinese word segmentation
import jieba.posseg as pseg
import os
import math
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from scipy import sparse
from gensim import corpora, models, similarities

# Plotting configuration: use a Chinese font (SimSun) so labels render correctly
sns.set(font='SimSun', font_scale=1.5, palette="muted", color_codes=True, style='white')
plt.rcParams['font.sans-serif'] = ['SimSun']
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly with a CJK font
plt.rcParams['mathtext.fontset'] = 'cm'

The full workflow:


df = pd.read_csv('noun_index.csv')
text = df['text_need'].to_list()    # each row stores a tokenized document as a string
texts = [eval(i) for i in text]     # parse each string back into a list of tokens

# Build the token -> id mapping and record the vocabulary size
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)

# Convert each document into a sparse (token_id, count) bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]

# Train a TF-IDF model on the corpus and transform the whole corpus with it
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]

# Build a sparse similarity index over the TF-IDF vectors
index = similarities.SparseMatrixSimilarity(corpus_tfidf, num_features=feature_cnt)

① Compute the similarity between one document in the corpus and every document in the corpus:

i = 0
doc_text_vec = corpus[i]                  # bag-of-words vector of document i
sim = index[tfidf_model[doc_text_vec]]    # similarity of document i to every document

The computed result sim is:

array([9.9999988e-01, 2.3108754e-08, 1.1747384e-02, ..., 1.2266420e-01,
       1.4046666e-02, 9.9481754e-02], dtype=float32)
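
As a quick usage sketch (not in the original post), the scores can be ranked with numpy to retrieve the most similar documents; topn here is an illustrative parameter:

topn = 5
best = np.argsort(sim)[::-1][:topn]   # indices of the topn highest-scoring documents
for doc_id in best:
    print(doc_id, sim[doc_id])

Document i itself comes first with a score of about 1.0, as the output above shows.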

② Compute the similarity between an arbitrary string and every document in the corpus:


test_string = '少年进步则国进步'
test_doc_list = [word for word in jieba.cut(test_string)]   # segment the query with jieba
test_doc_vec = dictionary.doc2bow(test_doc_list)            # map it into the same vector space
sim = index[tfidf_model[test_doc_vec]]

The returned result sim is:

array([0.        , 0.        , 0.        , ..., 0.        , 0.01903304,
       0.        ], dtype=float32)
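
Most entries are exactly zero because doc2bow silently drops tokens that are absent from the dictionary, so only query words seen when the dictionary was built can contribute to the similarity. A minimal illustration (the query token here is made up):

print(dictionary.doc2bow(['一个完全未登录的词']))   # -> [] : unseen tokens map to an empty vector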

The first half of the processing pipeline is the same as for the tf-idf model, but the frequency vectors are not further transformed into tf-idf vectors.


df = pd.read_csv('noun_index.csv')
text = df['text_need'].to_list()    # each row stores a tokenized document as a string
texts = [eval(i) for i in text]     # parse each string back into a list of tokens

dictionary = corpora.Dictionary(texts)

feature_cnt = len(dictionary.token2id.keys())

corpus = [dictionary.doc2bow(text) for text in texts]

from scipy import sparse

# Fill a 1 x V sparse row vector with the raw term counts of document 0
vector = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id, count in corpus[0]:
    vector[0, token_id] = count

# ... and likewise for document 1
vector1 = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id, count in corpus[1]:
    vector1[0, token_id] = count

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(vector,vector1)

The returned result sim is:

array([[0.32762548]], dtype=float32)
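
The manual dok_matrix loops can also be replaced by gensim's own converter; a sketch using gensim.matutils that should reproduce the same value:

from gensim import matutils

# corpus2csc builds a terms x documents sparse matrix; transpose it so each row is one document
doc_term = matutils.corpus2csc(corpus, num_terms=len(dictionary)).T
sim = cosine_similarity(doc_term[0], doc_term[1])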

Similar to the first half of the tf-idf pipeline, but instead of frequency vectors it uses 0-1 vectors (1 if the word occurs in the document, 0 otherwise), and these binary vectors are likewise not transformed into tf-idf vectors.


df = pd.read_csv('noun_index.csv')
text = df['text_need'].to_list()    # each row stores a tokenized document as a string
texts = [eval(i) for i in text]     # parse each string back into a list of tokens

dictionary = corpora.Dictionary(texts)

feature_cnt = len(dictionary.token2id.keys())

corpus = [dictionary.doc2idx(text) for text in texts]   # list of token ids per document

from scipy import sparse

# Binary indicator vector for document 0: 1 wherever the token occurs
vector = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id in corpus[0]:
    vector[0, token_id] = 1

vector1 = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id in corpus[1]:
    vector1[0, token_id] = 1

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(vector,vector1)

The returned result sim is:

array([[0.5463583]], dtype=float32)
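
Because the vectors are binary, the same cosine can be computed directly from the sets of token ids, since cosine = |A ∩ B| / sqrt(|A| · |B|) for 0-1 vectors. A quick cross-check:

a = set(corpus[0])
b = set(corpus[1])
sim = len(a & b) / math.sqrt(len(a) * len(b))   # should reproduce the value above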

Given a collection of documents, Word2Vec trains a neural network that maps each word to a vector representation (the vector length is user-specified).

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['computer']
vector1 = model.wv['system']
# cosine similarity computed by hand from the two word vectors
np.dot(vector, vector1) / (np.linalg.norm(vector) * np.linalg.norm(vector1))
>>> 0.21617143
sim = model.wv.most_similar('computer', topn=10)

The returned result sim is:

[('system', 0.21617142856121063),
 ('survey', 0.044689204543828964),
 ('interface', 0.015203374437987804),
 ('time', 0.0019510634010657668),
 ('trees', -0.03284314647316933),
 ('human', -0.0742427185177803),
 ('response', -0.09317589551210403),
 ('graph', -0.09575346112251282),
 ('eps', -0.10513807088136673),
 ('user', -0.16911624372005463)]
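
Word2Vec only yields word-level vectors. One common baseline for comparing whole documents (illustrative, not part of the original post) is to average the word vectors of each document:

# represent a document as the mean of its word vectors
def doc_vector(tokens, wv):
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0)

v0 = doc_vector(common_texts[0], model.wv)
v1 = doc_vector(common_texts[1], model.wv)
sim = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))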

Doc2Vec, also known as Paragraph Vector or Sentence Embeddings, can produce vector representations of words, sentences, paragraphs, and documents; it is an extension of Word2Vec.

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
vector = model.infer_vector(["system", "response"])              # embed an unseen token list
vector1 = model.infer_vector(['human', 'interface', 'computer'])
from scipy import spatial
sim = 1 - spatial.distance.cosine(vector, vector1)

The returned result sim is:

0.44926005601882935
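
Assuming gensim 4.x, the trained document vectors live in model.dv, so an inferred vector can also be ranked against all training documents at once:

ranked = model.dv.most_similar([vector], topn=5)   # list of (tag, cosine similarity) pairs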

Compared with the bag-of-words model, the N-gram model takes the relationship between a word and its neighbours into account. The gensim.models.phrases module can build bigrams, trigrams, quadgrams, and so on, extracting 2-, 3-, and 4-word combinations that frequently occur together in the documents.


df = pd.read_csv('noun_index.csv')
text = df['text_need'].to_list()    # each row stores a tokenized document as a string
texts = [eval(i) for i in text]     # parse each string back into a list of tokens

bigram = models.Phrases(texts)              # detect frequent two-word collocations
texts = [bigram[line] for line in texts]    # merge each collocation into a single token

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

from scipy import sparse

# Frequency vectors as before, now over the bigram-augmented vocabulary
vector = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id, count in corpus[0]:
    vector[0, token_id] = count

vector1 = sparse.dok_matrix((1, len(dictionary)), dtype=np.float32)
for token_id, count in corpus[1]:
    vector1[0, token_id] = count

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(vector,vector1)

The returned result sim is:

array([[0.3840464]], dtype=float32)
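
To go beyond bigrams, Phrases can simply be applied a second time to the already-merged texts, promoting frequent bigram-plus-word pairs into trigrams (a standard gensim pattern, sketched here):

trigram = models.Phrases(texts)                 # texts already contain merged bigrams here
texts_tri = [trigram[line] for line in texts]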

Original: https://blog.csdn.net/sinat_36115361/article/details/124062551
Author: sinat_36115361
Title: Python Natural Language Processing: Document Similarity Computation (gensim.models)
