1 Required software and packages
1.1 Software/programs
- Anaconda (from the official website)
- Spyder (bundled with Anaconda)
- Anaconda Prompt (bundled), or Win+R 👉 cmd
1.2 Packages
- spaCy: an NLP library covering many languages; see the official site for install options. I chose conda install – German – efficiency. The commands given on the site (installation went smoothly behind a VPN):

```shell
conda install -c conda-forge spacy
python -m spacy download de_core_news_sm
```

- csv (bundled with Python)
- re (bundled with Python)

The larger-vocabulary spaCy model for German text, de_dep_news_trf, still had not finished installing after almost an hour, so I don't recommend it.
2 Complete code
```python
import spacy
import csv
import re

nlp = spacy.load('de_core_news_sm')

# Read the source text (replace the path with your own file);
# a with-block closes the file automatically
with open(r'your_file_path.txt', mode='r', encoding='utf-8', errors='ignore') as file:
    fileContent = file.read()

# Replace punctuation and stray symbols with spaces before tagging
fileContent = re.sub('[=#*%]', ' ', fileContent)
fileContent = re.sub('[、?!:""{}【】,。;()《》•]', ' ', fileContent)
fileContent = re.sub('[‚²´„>><<€©]', ' ', fileContent)
fileContent = re.sub('[łŁâóôźëśîčšŰ]', 'xx', fileContent)
fileContent = re.sub('[ʃɛəɔↄæχçɪʝ]', ' ', fileContent, flags=re.I)
fileContent = re.sub(' − ', ' ', fileContent)
fileContent = re.sub('−', ' ', fileContent)
# Remove CJK characters (Chinese, Hiragana, Katakana)
fileContent = re.sub('[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+', ' ', fileContent)

lemma = nlp(fileContent)

lemmas = []  # renamed from `list` to avoid shadowing the built-in
for token in lemma:
    newLemma = token.lemma_.rstrip()
    if newLemma:  # skip empty lemmas
        lemmas.append([newLemma])

# newline='' keeps csv from inserting extra blank rows on Windows
with open('your_filename.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(lemmas)

# Text mode already translates '\n' to the platform line ending,
# so writing '\r\n' would double the carriage return on Windows
with open('your_filename.txt', mode='w', encoding='utf-8') as s:
    for token in lemma:
        s.write(token.lemma_ + '\n')
```
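Because each lemma is appended as a single-element list, the CSV comes out with one lemma per row. A small self-contained sketch of that round trip (file name and sample lemmas are placeholders of mine):

```python
import csv

# Write a few sample rows the same way the script above does
rows = [['Katze'], ['laufen'], ['Straße']]
with open('lemmas_demo.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the lemmas back, one per row
with open('lemmas_demo.csv', encoding='utf-8', newline='') as f:
    lemmas = [row[0] for row in csv.reader(f) if row]

print(lemmas)  # → ['Katze', 'laufen', 'Straße']
```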
Thanks to my programmer boyfriend.
That's all.
Original: https://blog.csdn.net/ICHhassPROGRAMM/article/details/121439963
Author: ICHhassPROGRAMM
Title: 2021.11.21 The full story of how I, who thought I was done using Python for corpus analysis, got back into it: lemmatizing a German txt file with spaCy and writing the results to csv and txt (Part 2)