TensorFlow 2.0 Study Notes: Loading and Preprocessing Data (Part 7)

import tensorflow as tf
AUTOTUNE = tf.data.experimental.AUTOTUNE

Data preparation

import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
for item in data_root.iterdir():
    print(item)

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
print(image_count)

The original tutorial uses the IPython.display package, which only renders inside Jupyter, so here I switch to matplotlib.

import matplotlib.pyplot as plt
from PIL import Image

def display(path):
    img = Image.open(path)
    plt.imshow(img)
    plt.show()

def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel).replace('\\', '/')].split(' - ')[:-1])

for n in range(3):
    image_path = random.choice(all_image_paths)
    display(image_path)
    print(caption_image(image_path))
    print()

The flower photos themselves are quite nice.

Print out the labels that correspond to the images, then assign each label a number.

label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
print(label_names)

label_to_index = dict((name, index) for index, name in enumerate(label_names))
print(label_to_index)

Then look up the label value for each image:

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

print("First 10 labels indices: ", all_image_labels[:10])

Now we have both the label array all_image_labels and the image path array all_image_paths.

Next, define functions that decode each image, resize it to a standard size, and normalize it.

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [192, 192])
  image /= 255.0

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)

import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(image_path))
plt.grid(False)
plt.xlabel(caption_image(image_path))
plt.title(label_names[label].title())
plt.show()

tf.data.Dataset

Let's take a look at the official loading utilities.

First, pack all the image paths into a TensorSliceDataset, then use map to load and preprocess the images on the fly.

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
for n, image in enumerate(image_ds.take(4)):
  plt.subplot(2,2,n+1)
  plt.imshow(image)
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  plt.xlabel(caption_image(all_image_paths[n]))
plt.show()

Since this works for images, it works for labels too.

label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))
for label in label_ds.take(10):
  print(label_names[label.numpy()])

Then zip them together, so that image_label_ds yields both the image and its label.

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

Note: when you already have arrays like all_image_labels and all_image_paths, an alternative to tf.data.Dataset.zip is to slice the pair of arrays directly.

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds

Run it

BATCH_SIZE = 32

ds = image_label_ds.shuffle(buffer_size=image_count)
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)

ds = ds.prefetch(buffer_size=AUTOTUNE)

A shuffled dataset does not report the end of the dataset until its shuffle buffer is completely drained. The Dataset is then restarted by .repeat, which means waiting for the shuffle buffer to fill up all over again.
This can be avoided by using the tf.data.Dataset.apply method with the fused tf.data.experimental.shuffle_and_repeat function.

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

Grab a ready-made copy of MobileNetV2 straight from tf.keras.applications and freeze it so it is not trainable.

mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False

Check what the network expects from its input data:

help(tf.keras.applications.mobilenet_v2.preprocess_input)

The inputs need to be in the range (-1, 1). Since the images are already in [0, 1], map them with 2*image - 1:

def change_range(image,label):
  return 2*image-1, label

keras_ds = ds.map(change_range)

Grab one batch and take a look first:


image_batch, label_batch = next(iter(keras_ds))
feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)
(32, 6, 6, 1280)

Then take that feature output, build your own model on top, and produce the final predictions.

model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names), activation = 'softmax')])

logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)
min logit: 0.004120019
max logit: 0.6654783
Shape: (32, 5)

Configure the model and run it for a few steps:

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])
model.fit(ds, epochs=1, steps_per_epoch=3)
3/3 [==============================] - 0s 165ms/step - loss: 2.0662 - accuracy: 0.1667
Out[37]: <tensorflow.python.keras.callbacks.History at 0x16f3f3d2ac0>

You can add a cache to improve training efficiency, so the GPU does not have to wait for the CPU to finish filling the data pipeline before it runs.

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
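If the decoded images do not fit in memory, the cache can also be written to a file instead of RAM (a small variant sketch; the cache file name here is made up):

# Cache preprocessed images to a local file so later epochs skip decoding;
# the file name is arbitrary.
ds = image_label_ds.cache(filename='./cache.tf-data')
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)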

To recap: we have images and labels, with each image represented by the path where it is stored. Each path is paired with one label, and at use time the paths are mapped directly into decoded images.


path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
import functools

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

Data preparation

Download the data; this dataset is the Titanic passenger survival list.

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

np.set_printoptions(precision=3, suppress=True)

With the data in hand, feed it into the dataset constructor tf.data.experimental.make_csv_dataset.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Print a batch and take a look: it is one big dictionary whose key-value pairs map column names to arrays.

examples, labels = next(iter(raw_train_data))
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
EXAMPLES:
 OrderedDict([
('sex', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'female', b'male', b'male', b'male', b'male',
       b'female', b'male', b'female', b'male', b'female'], dtype=object)>),
('age', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([25., 23., 28., 35., 28., 47., 35., 45., 19., 31., 29., 32.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0])>),
('parch', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])>),
('fare', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([ 13.   ,  63.358,   7.879, 512.329,   7.896,   9.   ,   7.125,
       164.867,  10.171, 113.275,  27.721,  13.   ], dtype=float32)>),
('class', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Second', b'First', b'Third', b'First', b'Third', b'Third',
       b'Third', b'First', b'Third', b'First', b'Second', b'Second'],
      dtype=object)>),
('deck', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'unknown', b'D', b'unknown', b'B', b'unknown', b'unknown',
       b'unknown', b'unknown', b'unknown', b'D', b'unknown', b'unknown'],
      dtype=object)>),
('embark_town', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Queenstown', b'Cherbourg',
       b'Southampton', b'Southampton', b'Southampton', b'Southampton',
       b'Southampton', b'Cherbourg', b'Cherbourg', b'Southampton'],
      dtype=object)>),
('alone', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'y', b'n', b'y', b'y', b'y', b'y', b'y', b'n', b'y', b'n', b'n',
       b'y'], dtype=object)>)])

LABELS:
 tf.Tensor([0 1 1 1 0 0 0 1 0 1 0 1], shape=(12,), dtype=int32)

Data preprocessing

My take on tf.feature_column.indicator_column is that it turns a categorical column into a one-hot encoding: for example, the sex column contains male and female, so male becomes (1, 0) and female becomes (0, 1).

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

The floating-point data needs to be standardized (normalized); otherwise the value ranges of different columns differ too much.

Now create a set of numeric columns. The tf.feature_column.numeric_column API takes a normalizer_fn argument; pass it with functools.partial, where the partial binds each column's mean into the normalization function.

def process_continuous_data(mean, data):

  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)
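To see what these columns actually produce, you can apply a DenseFeatures layer to the example batch printed earlier (a quick sanity-check sketch, not in the original post; preview_layer is just an illustrative name):

# The categorical columns come out as one-hot blocks and the numeric columns
# come out scaled by their precomputed means.
preview_layer = tf.keras.layers.DenseFeatures(categorical_columns + numerical_columns)
print(preview_layer(examples).numpy()[0])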

Model preparation

The preprocessing layer: tf.keras.layers.DenseFeatures turns the column data into a single Tensor.

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Run it

train_data = raw_train_data.shuffle(500)
test_data = raw_test_data

model.fit(train_data, epochs=20)

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))

predictions = model.predict(test_data)

for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))
import numpy as np
import tensorflow as tf

Data preparation

DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
  train_examples = data['x_train']
  train_labels = data['y_train']
  test_examples = data['x_test']
  test_labels = data['y_test']

The resulting arrays have shape (60000, 28, 28); handle them with tf.data.Dataset.from_tensor_slices as well.

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])

Run it

model.fit(train_dataset, epochs=10)

model.evaluate(test_dataset)

If the data comes in a reasonably regular format, this is all you need.

import pandas as pd
import tensorflow as tf

Data preparation

csv_file = tf.keras.utils.get_file('titanic_train.csv', 'https://storage.googleapis.com/tf-datasets/titanic/train.csv')
df = pd.read_csv(csv_file)
print(df.head())
print(df.dtypes)

The printed output:

   survived     sex   age  ...     deck  embark_town  alone
0         0    male  22.0  ...  unknown  Southampton      n
1         1  female  38.0  ...        C    Cherbourg      n
2         1  female  26.0  ...  unknown  Southampton      y
3         1  female  35.0  ...        C  Southampton      n
4         0    male  28.0  ...  unknown   Queenstown      y

survived                int64
sex                    object
age                   float64
n_siblings_spouses      int64
parch                   int64
fare                  float64
class                  object
deck                   object
embark_town            object
alone                  object

Some of the columns hold string data; convert them into discrete numeric codes.

df['sex'] = pd.Categorical(df['sex'])
df['sex'] = df.sex.cat.codes

Print it again:

   survived  sex   age  n_siblings_spouses  ...  class     deck  embark_town alone
0         0    1  22.0                   1  ...  Third  unknown  Southampton     n
1         1    0  38.0                   1  ...  First        C    Cherbourg     n
2         1    0  26.0                   0  ...  Third  unknown  Southampton     y
3         1    0  35.0                   1  ...  First        C  Southampton     n
4         0    1  28.0                   0  ...  Third  unknown   Queenstown     y

Sure enough, the sex column is now numeric. Convert the other string columns the same way, and drop the class column, since class is a Python keyword and the name is awkward to work with.

df['deck'] = pd.Categorical(df['deck'])
df['deck'] = df.deck.cat.codes

df['embark_town'] = pd.Categorical(df['embark_town'])
df['embark_town'] = df.embark_town.cat.codes

df['alone'] = pd.Categorical(df['alone'])
df['alone'] = df.alone.cat.codes
df.drop('class', axis=1, inplace=True)

print(df.head())

Use df.pop() to pull out the column you want to predict; whichever column you pop becomes the label, and you can only pop it once. Here the label is the survived column. Feed the result into tf.data.Dataset.from_tensor_slices, and use take when printing a few examples.

target = df.pop('survived')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
    print ('Features: {}, Target: {}'.format(feat, targ))

Shuffle the data and set the batch size. Note that no repeat or cache is used this time.
The batch size is one, and the shuffle buffer size is the full length of the data.

train_dataset = dataset.shuffle(len(df)).batch(1)

Prepare the model and run it

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)

An alternative to feature columns

You can also feed the data to the model as a dictionary.

The value for each key is created with tf.keras.layers.Input(shape=(), name=key).

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The effect is the same.
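The batch printed below comes from such a dictionary dataset; following the official pandas tutorial, it could be built roughly like this (a sketch reusing df and the popped target series from above; the printed batch still contains the survived column, so it was evidently captured before the label was popped):

# Slice the DataFrame as a dict of columns so each element matches the
# dictionary-style inputs defined above, then batch and train.
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)

for features_batch, labels_batch in dict_slices.take(1):
  print(features_batch, labels_batch)

model_func.fit(dict_slices, epochs=15)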

({'survived': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])>,
'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0])>,
'age': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([22., 38., 26., 35., 28.,  2., 27., 14.,  4., 20., 39., 14.,  2., 28., 31., 28.], dtype=float32)>,
'n_siblings_spouses': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 0, 1, 0, 3, 0, 1, 1, 0, 1, 0, 4, 0, 1, 0])>,
'parch': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 5, 0, 1, 0, 0, 0])>,
'fare': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([ 7.25  , 71.2833,  7.925 , 53.1   ,  8.4583, 21.075 , 11.1333, 30.0708, 16.7   ,  8.05  , 31.275 ,  7.8542, 29.125 , 13.  , 18.  ,  7.225 ], dtype=float32)>,
'deck': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([7, 2, 7, 2, 7, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, 7])>,
'embark_town': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 0])>}, <tf.Tensor: shape=(16,), dtype=int8, numpy=array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=int8)>)

To read data efficiently, it helps to serialize the data and store it in a set of files (100-200 MB each) that can be read linearly. This is especially useful for data streamed over a network, and also for buffering any data preprocessing.
The TFRecord format is a simple format for storing a sequence of binary records.
Protocol buffers are a cross-platform, cross-language library for efficiently serializing structured data.
Protocol messages are defined by .proto files, which are usually the easiest way to understand a message type.
The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for TensorFlow and is used in higher-level APIs such as TFX.
The original notebook demonstrates how to create, parse, and use tf.Example messages, and how to serialize, write, and read tf.Example messages to and from .tfrecord files.
Note: these structures are useful but not mandatory. You do not need to convert existing code to use TFRecords unless you are using tf.data and reading data is still the training bottleneck. See the data input pipeline performance guide for dataset performance tips.

My understanding is that it is like downloading a file from the Internet: we stream it down bit by bit, so if the transfer breaks we can carry on rather than start over from scratch. The data works the same way.

Just get a general sense of this for now.
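For concreteness, here is a minimal sketch (not from the original post; the file name demo.tfrecord is made up) of building one tf.Example message, writing it to a .tfrecord file, and parsing it back with tf.data:

# Pack two features into a tf.Example protocol message.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    'text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'hello tfrecord'])),
}))

# Serialize the message and write it as one binary record.
with tf.io.TFRecordWriter('demo.tfrecord') as writer:
  writer.write(example.SerializeToString())

# Describe the expected features, then read and parse the records back.
feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'text': tf.io.FixedLenFeature([], tf.string),
}
raw_ds = tf.data.TFRecordDataset('demo.tfrecord')
parsed_ds = raw_ds.map(lambda record: tf.io.parse_single_example(record, feature_description))
for parsed in parsed_ds:
  print(parsed['label'].numpy(), parsed['text'].numpy())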

One line in the source file is one sample, which suits most line-based text data.

import tensorflow as tf

import tensorflow_datasets as tfds
import os

Data preparation

Download the already tidy data.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)

Read each file with tf.data.TextLineDataset and attach a label to every line with tf.data.Dataset.map:

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

Let's take a look at what the data looks like.

for ex in all_labeled_data.take(5):
    print(ex)

The numpy field is just each Tensor's value.

(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

With the text in hand, the next step is to turn it into numbers the computer can work with; my blind guess is one-hot or embedding vectors.
The official example uses tfds.features.text.Tokenizer(), which now has to be changed to tfds.deprecated.text.Tokenizer(); this library updates so fast that even the official examples can't keep up.


tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
print(vocab_size)

The official example's tfds.features.text.TokenTextEncoder also needs to be changed to tfds.deprecated.text.TokenTextEncoder; use it to encode the samples. PS: there must be a better function by now, otherwise this one wouldn't have been deprecated.

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)
encoded_example = encoder.encode(example_text)
print(encoded_example)
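As a possible replacement for the deprecated tokenizer and encoder (not used in this post), newer TensorFlow versions ship tf.keras.layers.TextVectorization, which builds the vocabulary and does the integer encoding in a single layer. A sketch, assuming TF 2.6 or later:

# Learn a vocabulary from the raw text, then map strings to integer sequences.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(all_labeled_data.map(lambda text, label: text))
print(vectorize_layer([example_text]))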

TensorFlow's philosophy is to defer processing until the data is actually consumed. When I write PyTorch I always prefer to get the data fully processed and settled up front.

def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  # tf.py_function wraps the eager encode() so it can run inside Dataset.map
  encoded_text, label = tf.py_function(encode,
                                       inp=[text, label],
                                       Tout=(tf.int64, tf.int64))

  # py_function drops shape information, so set the shapes back manually
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

tf.data.Dataset.take and tf.data.Dataset.skip can be thought of as taking some number of elements from a tf.data.Dataset versus skipping that many and keeping all the rest. tf.data.Dataset.padded_batch is still the handy one: it is fully automatic, so there is no need to write padding code yourself. Unlike images, text samples vary in length and cannot be stretched, so the only option is to zero-pad them up to a common size.

train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)

Take a look at the data:


sample_text, sample_labels = next(iter(test_data))

print(sample_text[0], sample_labels[0])

In the result we can see that the zeros have all been padded onto the end.

(<tf.Tensor: shape=(15,), dtype=int64, numpy=
 array([ 8132, 15145,  4866, 10461,  7732,   465, 17108, 13725,     0,
            0,     0,     0,     0,     0,     0], dtype=int64)>,
 <tf.Tensor: shape=(), dtype=int64, numpy=1>)

Zero is now also a token, so the vocabulary size has to be increased by one:

vocab_size += 1

Model preparation

Sure enough, it goes with embedding vectors, and next comes the LSTM:

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 64))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Run it

model.fit(train_data, epochs=3, validation_data=test_data)

eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
Eval loss: 0.38248714804649353, Eval accuracy: 0.8285999894142151

TensorFlow really is powerful, but only once you are familiar with its APIs.

Original: https://blog.csdn.net/u010095372/article/details/124519459
Author: 赫凯
Title: Tensorflow2.0学习-加载和预处理数据 (七)
