Word processing in NLP with TensorFlow

Tokenization and serialization turn raw text into numeric input for training a neural network; I use a few TensorFlow samples to walk through the process.

Preprocessing

Tokenizer

Source code: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L490-L519

Some important functions and variables:

  • __init__
  • def fit_on_texts(self, texts) # texts can be a string, a list of strings, or a list of lists of strings
  • self.word_index # a dictionary that maps each word to a unique index
  • self.index_word # the reverse of word_index: it maps each index back to its word (see the sketch after the sample below)

Sample

import tensorflow as tf
from tensorflow import keras
# the Tokenizer class used to tokenize text
from tensorflow.keras.preprocessing.text import Tokenizer
'''
transform each word into a number
'''
sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
# result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
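
As mentioned in the list above, index_word is simply the reverse mapping of word_index. A minimal check, reusing the tokenizer fitted above (the expected output is just the word_index shown there with keys and values swapped):

index_word = tokenizer.index_word
print(index_word)
# expected: {1: 'love', 2: 'my', 3: 'i', 4: 'dog', 5: 'cat', 6: 'you'}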

Serialization

Sample

sentences = ['i love my dog', 'i love my cat', 'you love my dog!', 'do you think my dog is amazing']
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
'''
result: [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]]
words such as 'do', 'think', 'is' and 'amazing' get no encoding at all,
because they did not appear in the texts the tokenizer was fitted on
'''

To solve this problem, we can set an oov_token in the Tokenizer so that words never seen during fitting are still encoded with a shared placeholder index.

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
'''
after refitting on the original three sentences and rerunning texts_to_sequences, we get
[[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
every unseen word is now mapped to index 1, the index of the OOV token
'''
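
A quick sanity check of why every index shifted by one compared with the first example: the oov_token itself is added to the vocabulary at index 1. Assuming the tokenizer was refit on the original three sentences with the "<OOV>" token above:

print(tokenizer.word_index)
# expected: {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}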

But the sequences now have different lengths, which makes them hard to feed to a neural network, so we need to make all sequences the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences,
                                 padding = 'post',     # add the padding zeros at the end
                                 maxlen = 5,           # maximum length of each sequence
                                 truncating = 'post')  # cut longer sequences at the end
print(padded_sequences)
'''
then we get the result (here the tokenizer was refit on all four sentences,
so the indices differ from the three-sentence example above)
array([[5, 3, 2, 4, 0],
       [5, 3, 2, 7, 0],
       [6, 3, 2, 4, 0],
       [8, 6, 9, 2, 4]])
'''
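
For comparison, a minimal sketch of the default behavior: when padding and truncating are not specified, pad_sequences pads and truncates at the beginning (padding='pre', truncating='pre'). Assuming the same four sequences as above (tokenizer fitted on all four sentences):

default_padded = pad_sequences(sequences, maxlen = 5)
print(default_padded)
'''
expected result: zeros are prepended and the long sequence loses its first elements
array([[ 0,  5,  3,  2,  4],
       [ 0,  5,  3,  2,  7],
       [ 0,  6,  3,  2,  4],
       [ 9,  2,  4, 10, 11]])
'''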

Original: https://www.cnblogs.com/linkcxt/p/14986517.html
Author: linkcxt
Title: word processing in nlp with tensorflow
