NLP Token Embedding and Positional Embedding

Token embedding is a common way of converting text characters (tokens) into vectors.

import tensorflow as tf

# Load the pretrained Portuguese/English subword tokenizers (a TensorFlow SavedModel)
model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)

sentence = "este é um problema que temos que resolver."
sentence = tf.constant(sentence)
sentence = sentence[tf.newaxis]  # add a batch dimension
sentence = tokenizers.pt.tokenize(sentence).to_tensor()  # RaggedTensor of IDs -> dense tensor
print(sentence.shape)
print(sentence)

(1, 11)
tf.Tensor([[ 2 125 44 85 231 84 130 84 742 16 3]], shape=(1, 11), dtype=int64)
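
The tokenizer also exposes the inverse operation, so the IDs above can be mapped back to text. A small sanity-check sketch, reusing the tokenizers and sentence objects already defined:

# detokenize is the inverse of tokenize: token IDs -> sentence string
round_trip = tokenizers.pt.detokenize(sentence)
print(round_trip.numpy()[0].decode('utf-8'))
# expected to print a normalized form of the original Portuguese sentence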

# Tokenizing the empty string yields just the [START] and [END] token IDs
start_end = tokenizers.en.tokenize([''])[0]
print(start_end)
start = start_end[0][tf.newaxis]
print(start)
end = start_end[1][tf.newaxis]
print(end)

tf.Tensor([2 3], shape=(2,), dtype=int64)
tf.Tensor([2], shape=(1,), dtype=int64)
tf.Tensor([3], shape=(1,), dtype=int64)
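
These two IDs mark sentence boundaries: [START] = 2 and [END] = 3. A minimal sketch of wrapping a raw ID sequence with them (the ids tensor below is a made-up example, not taken from the tutorial):

import tensorflow as tf

# the boundary IDs printed above: [START] = 2, [END] = 3
start = tf.constant([2], dtype=tf.int64)
end = tf.constant([3], dtype=tf.int64)

# hypothetical English token IDs without boundary markers
ids = tf.constant([87, 90, 107, 76, 129, 1852, 30], dtype=tf.int64)

wrapped = tf.concat([start, ids, end], axis=0)
print(wrapped)  # [2 87 90 107 76 129 1852 30 3], shape=(9,), dtype=int64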

The word "token" carries a sense of occupancy, i.e. the vector slot is occupied by that word.

Like Example 1, this is a Portuguese-to-English translation example.

import logging
import tensorflow_datasets as tfds
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings
import tensorflow as tf

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# Print the first batch of three Portuguese sources and their English targets
for pt_examples, en_examples in train_examples.batch(3).take(1):
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  for en in en_examples.numpy():
    print(en.decode('utf-8'))

model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)
encoded = tokenizers.en.tokenize(en_examples)  # RaggedTensor of subword token IDs

for row in encoded.to_list():
  print(row)

round_trip = tokenizers.en.detokenize(encoded)  # map the IDs back to text
for line in round_trip.numpy():
  print(line.decode('utf-8'))

e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .

tokens = tokenizers.en.lookup(encoded)  # map each ID back to its subword text token
print(tokens)

Embedding can be understood as embedding lower-dimensional information into a higher-dimensional space.

import tensorflow as tf

model_name = "ted_hrlr_translate_pt_en_converter"
tokenizers = tf.saved_model.load(model_name)

d_model = 128
input_vocab_size = tokenizers.pt.get_vocab_size().numpy()

# A trainable lookup table mapping each token ID to a d_model-dimensional vector
embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)

x = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])
x = embedding(x)

print(input_vocab_size)
print(x.shape)
print(x)

7765
(1, 12, 128)
tf.Tensor(
[[[-0.02317628  0.04599813 -0.0104699  ... -0.03233253 -0.02013252
    0.00171118]
  [-0.02195768  0.0341222   0.00689759 ... -0.00260416  0.02308804
    0.03915772]
  [-0.00282265  0.03714179 -0.03591241 ... -0.03974506 -0.04376533
    0.03113948]
  ...
  [-0.0277048  -0.03750116 -0.03355522 ... -0.00703954 -0.02855991
    0.00357056]
  [-0.0277048  -0.03750116 -0.03355522 ... -0.00703954 -0.02855991
    0.00357056]
  [ 0.04611469  0.04663144  0.02595479 ... -0.03400488 -0.00206001
   -0.03282105]]], shape=(1, 12, 128), dtype=float32)

In this example, a sequence of 12 token IDs is embedded into a 12×128 tensor: each ID becomes a 128-dimensional vector.
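
Under the hood, tf.keras.layers.Embedding is just a trainable lookup table of shape (vocab_size, d_model), and calling it is equivalent to gathering rows of its weight matrix by token ID. A small check of that claim, using a fresh, randomly initialized layer (so the numbers differ from the printout above):

import tensorflow as tf

embedding = tf.keras.layers.Embedding(7765, 128)
x_ids = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])

looked_up = embedding(x_ids)                       # (1, 12, 128), also builds the layer
gathered = tf.gather(embedding.embeddings, x_ids)  # same rows, gathered by hand
print(tf.reduce_all(looked_up == gathered))        # tf.Tensor(True, shape=(), dtype=bool)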

For the Transformer's positional embedding, the usual implementation precomputes the encodings for a fixed number of positions (1000 here) from the depth d_model, and at run time slices out the first seq_len rows to match the actual input length.
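
Concretely, the code below implements the sine/cosine encoding from "Attention Is All You Need":

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Even depth indices use sine and odd ones use cosine, with lower indices varying faster along the position axis.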

import numpy as np
import tensorflow as tf

d_model = 128
position = 1000

def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)
  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
  pos_encoding = angle_rads[np.newaxis, ...]
  return tf.cast(pos_encoding, dtype=tf.float32)

x = tf.constant([[2, 87, 90, 107, 76, 129, 1852, 30, 0, 0, 0, 3]])
seq_len = tf.shape(x)[1]
print(seq_len)
pos_encoding = positional_encoding(position, d_model)
print(pos_encoding.shape)
pe = pos_encoding[:, :seq_len, :]  # slice to the actual sequence length
print(pe.shape)

tf.Tensor(12, shape=(), dtype=int32)
(1, 1000, 128)
(1, 12, 128)
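
In the Transformer, these two pieces are combined at the encoder (and decoder) input: the token embedding is scaled by sqrt(d_model) and the positional slice is added to it. A minimal sketch under that convention, reusing x, d_model, seq_len and pos_encoding from above plus a fresh (untrained) embedding layer with the vocabulary size of 7765 seen earlier:

# token embedding + positional encoding: the standard Transformer input
embedding = tf.keras.layers.Embedding(7765, d_model)

emb = embedding(x)                                  # (1, 12, 128) token embeddings
emb *= tf.math.sqrt(tf.cast(d_model, tf.float32))   # scale, as in the original paper
encoder_input = emb + pos_encoding[:, :seq_len, :]  # add positional information
print(encoder_input.shape)                          # (1, 12, 128)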

Original: https://blog.csdn.net/Arctic_Beacon/article/details/122415392
Author: 飞行codes
Title: NLP的Token embedding和位置embedding
