2021.11.21 The full story of how I, who thought I was done analyzing corpora with Python, went and asked for trouble again: lemmatizing a German txt file with spaCy and writing the results to CSV and txt (Part 2)

1 Required software and packages

1.1 Software/programs

  1. Anaconda (official website)
  2. Spyder (bundled with Anaconda)
  3. Anaconda Prompt (also bundled), or Win+R 👉 cmd

1.2 packages

  1. spaCy: provides NLP pipelines for many languages; check the official website for installation options. I went with the conda install – German – efficiency option, and the commands given on the official site are the ones below (with a VPN the install went smoothly); a quick check that the model actually loads is sketched at the end of this subsection:

conda install -c conda-forge spacy
python -m spacy download de_core_news_sm

  2. csv (ships with Python)
  3. re (ships with Python)

The spaCy model for German with the larger vocabulary wouldn't finish installing even after nearly an hour, so I can't recommend it:

de_dep_news_trf
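
To confirm that the small model installed correctly, you can load it and lemmatize a short sentence. This is just a minimal sketch; the sample sentence is my own illustration, not part of the original workflow:

import spacy

# Load the small German pipeline installed above
nlp = spacy.load('de_core_news_sm')

# Any short German sentence works as a smoke test (this one is just an example)
doc = nlp('Die Kinder spielten gestern im Garten.')
for token in doc:
    print(token.text, '->', token.lemma_)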

2 完整代码

import spacy
import csv
import re

nlp = spacy.load('de_core_news_sm')

# Read the whole German txt file (fill in your own path)
with open(r'path_to_your_file.txt', mode='r', encoding='utf-8', errors='ignore') as file:
    fileContent = file.read()

# Strip noise: formatting symbols, CJK punctuation, stray typography
fileContent = re.sub('[=#*%]', ' ', fileContent)
fileContent = re.sub('[、?!:""{}【】,。;()《》•]', ' ', fileContent)
fileContent = re.sub('[‚²´­„>><<€©]', ' ', fileContent)
# Replace non-German letters with a placeholder so the tokens stay visible
fileContent = re.sub('[łŁâóôźëśîčšŰ]', 'xx', fileContent)
# Drop IPA symbols
fileContent = re.sub('[ʃɛəɔↄæχçɪʝ]', ' ', fileContent, flags=re.I)
# Drop minus signs
fileContent = re.sub('−', ' ', fileContent)
# Drop Chinese and Japanese characters
fileContent = re.sub("[\u4e00-\u9fa5\u3040-\u309f\u30a0-\u30ff]+", ' ', fileContent)

# Run the spaCy pipeline over the cleaned text
doc = nlp(fileContent)

# Collect non-empty lemmas, one per row, for the CSV
rows = []
for token in doc:
    newLemma = token.lemma_.rstrip()
    if newLemma:
        rows.append([newLemma])

# newline='' avoids blank lines between rows on Windows
with open('your_output_name.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Also write one lemma per line to a plain txt file
with open('your_output_name.txt', mode='w', encoding='utf-8') as s:
    for token in doc:
        s.write(token.lemma_ + '\n')
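
If you want to double-check the result, a small sketch like the one below (using the same placeholder filename as above) reads the CSV back in and prints the first ten lemmas:

import csv

# Spot-check: print the first ten rows of the CSV written above
with open('your_output_name.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        if i >= 10:
            break
        print(row[0])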

Thanks to my programmer boyfriend.
That's all.

Original: https://blog.csdn.net/ICHhassPROGRAMM/article/details/121439963
Author: ICHhassPROGRAMM
Title: 2021.11.21 The full story of how I, who thought I was done analyzing corpora with Python, went and asked for trouble again: lemmatizing a German txt file with spaCy and writing the results to CSV and txt (Part 2)

