TensorFlow 2.0 Study Notes: Loading and Preprocessing Data (Part 7)

import tensorflow as tf
AUTOTUNE = tf.data.experimental.AUTOTUNE

Data preparation

import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
for item in data_root.iterdir():
    print(item)

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
print(image_count)

The original tutorial uses the IPython.display package, which only renders inside Jupyter, so here I switch to matplotlib.

import matplotlib.pyplot as plt
from PIL import Image

def display(path):
    img = Image.open(path)
    plt.imshow(img)
    plt.show()

def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel).replace('\\', '/')].split(' - ')[:-1])

for n in range(3):
    image_path = random.choice(all_image_paths)
    display(image_path)
    print(caption_image(image_path))
    print()

The flower photos themselves are quite nice.

Print out the labels that correspond to the images, then assign each label a number.

label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
print(label_names)

label_to_index = dict((name, index) for index, name in enumerate(label_names))
print(label_to_index)

Then look up the label value for each image:

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

print("First 10 labels indices: ", all_image_labels[:10])

Now we have both the label array all_image_labels and the image path array all_image_paths.

Next, define functions that decode each image, resize it to a standard size, and normalize it.

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [192, 192])
  image /= 255.0

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)

import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(image_path))
plt.grid(False)
plt.xlabel(caption_image(image_path))
plt.title(label_names[label].title())
plt.show()

tf.data.Dataset

Let's take a look at the official loading utilities.

First, pack all the image paths into a TensorSliceDataset, then use map to load and preprocess the images on the fly.

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
for n, image in enumerate(image_ds.take(4)):
  plt.subplot(2,2,n+1)
  plt.imshow(image)
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  plt.xlabel(caption_image(all_image_paths[n]))
plt.show()

Since this works for images, it works for labels too.

label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))
for label in label_ds.take(10):
  print(label_names[label.numpy()])

Then zip them together, so that image_label_ds yields both the image and its label.

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

Note: when you already have arrays like all_image_labels and all_image_paths, an alternative to tf.data.Dataset.zip is to slice the pair of arrays directly.

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds

Run it

BATCH_SIZE = 32

ds = image_label_ds.shuffle(buffer_size=image_count)
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)

ds = ds.prefetch(buffer_size=AUTOTUNE)

A shuffled dataset does not report the end of the dataset until its shuffle buffer is completely drained. The Dataset is then restarted by .repeat, which means waiting for the shuffle buffer to fill up all over again.
This can be avoided by using the tf.data.Dataset.apply method with the fused tf.data.experimental.shuffle_and_repeat function.

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

Grab a ready-made copy of MobileNetV2 straight from tf.keras.applications and freeze it so it is not trainable.

mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False

Check what the network expects from its input data:

help(tf.keras.applications.mobilenet_v2.preprocess_input)

The inputs need to be in the range (-1, 1). Since the images are already in [0, 1], map them with 2*image - 1:

def change_range(image,label):
  return 2*image-1, label

keras_ds = ds.map(change_range)

Grab one batch and take a look first:


image_batch, label_batch = next(iter(keras_ds))
feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)
(32, 6, 6, 1280)

Then take that feature output, build your own model on top, and produce the final predictions.

model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names), activation = 'softmax')])

logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)
min logit: 0.004120019
max logit: 0.6654783
Shape: (32, 5)

Configure the model and run it for a few steps:

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])
model.fit(ds, epochs=1, steps_per_epoch=3)
3/3 [==============================] - 0s 165ms/step - loss: 2.0662 - accuracy: 0.1667
Out[37]: <tensorflow.python.keras.callbacks.History at 0x16f3f3d2ac0>

You can add a cache to improve training efficiency, so the GPU does not have to wait for the CPU to finish filling the data pipeline before it runs.

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
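If the decoded images do not fit in memory, the cache can also be written to a file instead of RAM (a small variant sketch; the cache file name here is made up):

# Cache preprocessed images to a local file so later epochs skip decoding;
# the file name is arbitrary.
ds = image_label_ds.cache(filename='./cache.tf-data')
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)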

To recap: we have images and labels, with each image represented by the path where it is stored. Each path is paired with one label, and at use time the paths are mapped directly into decoded images.


path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
import functools

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

Data preparation

Download the data; this dataset is the Titanic passenger survival list.

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

np.set_printoptions(precision=3, suppress=True)

With the data in hand, feed it into the dataset constructor tf.data.experimental.make_csv_dataset.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Print a batch and take a look: it is one big dictionary whose key-value pairs map column names to arrays.

examples, labels = next(iter(raw_train_data))
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
EXAMPLES:
 OrderedDict([
('sex', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'female', b'male', b'male', b'male', b'male',
       b'female', b'male', b'female', b'male', b'female'], dtype=object)>),
('age', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([25., 23., 28., 35., 28., 47., 35., 45., 19., 31., 29., 32.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0])>),
('parch', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])>),
('fare', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([ 13.   ,  63.358,   7.879, 512.329,   7.896,   9.   ,   7.125,
       164.867,  10.171, 113.275,  27.721,  13.   ], dtype=float32)>),
('class', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Second', b'First', b'Third', b'First', b'Third', b'Third',
       b'Third', b'First', b'Third', b'First', b'Second', b'Second'],
      dtype=object)>),
('deck', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'unknown', b'D', b'unknown', b'B', b'unknown', b'unknown',
       b'unknown', b'unknown', b'unknown', b'D', b'unknown', b'unknown'],
      dtype=object)>),
('embark_town', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Queenstown', b'Cherbourg',
       b'Southampton', b'Southampton', b'Southampton', b'Southampton',
       b'Southampton', b'Cherbourg', b'Cherbourg', b'Southampton'],
      dtype=object)>),
('alone', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'y', b'n', b'y', b'y', b'y', b'y', b'y', b'n', b'y', b'n', b'n',
       b'y'], dtype=object)>)])

LABELS:
 tf.Tensor([0 1 1 1 0 0 0 1 0 1 0 1], shape=(12,), dtype=int32)

Data preprocessing

My take on tf.feature_column.indicator_column is that it turns a categorical column into a one-hot encoding: for example, the sex column contains male and female, so male becomes (1, 0) and female becomes (0, 1).

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

The floating-point data needs to be standardized (normalized); otherwise the value ranges of different columns differ too much.

Now create a set of numeric columns. The tf.feature_column.numeric_column API takes a normalizer_fn argument; pass it with functools.partial, where the partial binds each column's mean into the normalization function.

def process_continuous_data(mean, data):

  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)
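To see what these columns actually produce, you can apply a DenseFeatures layer to the example batch printed earlier (a quick sanity-check sketch, not in the original post; preview_layer is just an illustrative name):

# The categorical columns come out as one-hot blocks and the numeric columns
# come out scaled by their precomputed means.
preview_layer = tf.keras.layers.DenseFeatures(categorical_columns + numerical_columns)
print(preview_layer(examples).numpy()[0])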

Model preparation

The preprocessing layer: tf.keras.layers.DenseFeatures turns the column data into a single Tensor.

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Run it

train_data = raw_train_data.shuffle(500)
test_data = raw_test_data

model.fit(train_data, epochs=20)

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))

predictions = model.predict(test_data)

for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))
import numpy as np
import tensorflow as tf

Data preparation

DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
  train_examples = data['x_train']
  train_labels = data['y_train']
  test_examples = data['x_test']
  test_labels = data['y_test']

The resulting arrays have shape (60000, 28, 28); handle them with tf.data.Dataset.from_tensor_slices as well.

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])

Run it

model.fit(train_dataset, epochs=10)

model.evaluate(test_dataset)

If the data comes in a reasonably regular format, this is all you need.

import pandas as pd
import tensorflow as tf

Data preparation

csv_file = tf.keras.utils.get_file('titanic_train.csv', 'https://storage.googleapis.com/tf-datasets/titanic/train.csv')
df = pd.read_csv(csv_file)
print(df.head())
print(df.dtypes)

The printed output:

   survived     sex   age  ...     deck  embark_town  alone
0         0    male  22.0  ...  unknown  Southampton      n
1         1  female  38.0  ...        C    Cherbourg      n
2         1  female  26.0  ...  unknown  Southampton      y
3         1  female  35.0  ...        C  Southampton      n
4         0    male  28.0  ...  unknown   Queenstown      y

survived                int64
sex                    object
age                   float64
n_siblings_spouses      int64
parch                   int64
fare                  float64
class                  object
deck                   object
embark_town            object
alone                  object

Some of the columns hold string data; convert them into discrete numeric codes.

df['sex'] = pd.Categorical(df['sex'])
df['sex'] = df.sex.cat.codes

Print it again:

   survived  sex   age  n_siblings_spouses  ...  class     deck  embark_town alone
0         0    1  22.0                   1  ...  Third  unknown  Southampton     n
1         1    0  38.0                   1  ...  First        C    Cherbourg     n
2         1    0  26.0                   0  ...  Third  unknown  Southampton     y
3         1    0  35.0                   1  ...  First        C  Southampton     n
4         0    1  28.0                   0  ...  Third  unknown   Queenstown     y

Sure enough, the sex column is now numeric. Convert the other string columns the same way, and drop the class column, since class is a Python keyword and the name is awkward to work with.

df['deck'] = pd.Categorical(df['deck'])
df['deck'] = df.deck.cat.codes

df['embark_town'] = pd.Categorical(df['embark_town'])
df['embark_town'] = df.embark_town.cat.codes

df['alone'] = pd.Categorical(df['alone'])
df['alone'] = df.alone.cat.codes
df.drop('class', axis=1, inplace=True)

print(df.head())

Use df.pop() to pull out the column you want to predict; whichever column you pop becomes the label, and you can only pop it once. Here the label is the survived column. Feed the result into tf.data.Dataset.from_tensor_slices, and use take when printing a few examples.

target = df.pop('survived')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
    print ('Features: {}, Target: {}'.format(feat, targ))

Shuffle the data and set the batch size. Note that no repeat or cache is used this time.
The batch size is one, and the shuffle buffer size is the full length of the data.

train_dataset = dataset.shuffle(len(df)).batch(1)

Prepare the model and run it

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)

An alternative to feature columns

You can also feed the data to the model as a dictionary.

The value for each key is created with tf.keras.layers.Input(shape=(), name=key).

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The effect is the same.
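The batch printed below comes from such a dictionary dataset; following the official pandas tutorial, it could be built roughly like this (a sketch reusing df and the popped target series from above; the printed batch still contains the survived column, so it was evidently captured before the label was popped):

# Slice the DataFrame as a dict of columns so each element matches the
# dictionary-style inputs defined above, then batch and train.
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)

for features_batch, labels_batch in dict_slices.take(1):
  print(features_batch, labels_batch)

model_func.fit(dict_slices, epochs=15)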

({'survived': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])>,
'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0])>,
'age': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([22., 38., 26., 35., 28.,  2., 27., 14.,  4., 20., 39., 14.,  2., 28., 31., 28.], dtype=float32)>,
'n_siblings_spouses': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 0, 1, 0, 3, 0, 1, 1, 0, 1, 0, 4, 0, 1, 0])>,
'parch': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 5, 0, 1, 0, 0, 0])>,
'fare': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([ 7.25  , 71.2833,  7.925 , 53.1   ,  8.4583, 21.075 , 11.1333, 30.0708, 16.7   ,  8.05  , 31.275 ,  7.8542, 29.125 , 13.  , 18.  ,  7.225 ], dtype=float32)>,
'deck': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([7, 2, 7, 2, 7, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, 7])>,
'embark_town': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 0])>}, <tf.Tensor: shape=(16,), dtype=int8, numpy=array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=int8)>)

To read data efficiently, it helps to serialize the data and store it in a set of files (100-200 MB each) that can be read linearly. This is especially useful for data streamed over a network, and also for buffering any data preprocessing.
The TFRecord format is a simple format for storing a sequence of binary records.
Protocol buffers are a cross-platform, cross-language library for efficiently serializing structured data.
Protocol messages are defined by .proto files, which are usually the easiest way to understand a message type.
The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for TensorFlow and is used in higher-level APIs such as TFX.
The original notebook demonstrates how to create, parse, and use tf.Example messages, and how to serialize, write, and read tf.Example messages to and from .tfrecord files.
Note: these structures are useful but not mandatory. You do not need to convert existing code to use TFRecords unless you are using tf.data and reading data is still the training bottleneck. See the data input pipeline performance guide for dataset performance tips.

My understanding is that it is like downloading a file from the Internet: we stream it down bit by bit, so if the transfer breaks we can carry on rather than start over from scratch. The data works the same way.

Just get a general sense of this for now.
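For concreteness, here is a minimal sketch (not from the original post; the file name demo.tfrecord is made up) of building one tf.Example message, writing it to a .tfrecord file, and parsing it back with tf.data:

# Pack two features into a tf.Example protocol message.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    'text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'hello tfrecord'])),
}))

# Serialize the message and write it as one binary record.
with tf.io.TFRecordWriter('demo.tfrecord') as writer:
  writer.write(example.SerializeToString())

# Describe the expected features, then read and parse the records back.
feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'text': tf.io.FixedLenFeature([], tf.string),
}
raw_ds = tf.data.TFRecordDataset('demo.tfrecord')
parsed_ds = raw_ds.map(lambda record: tf.io.parse_single_example(record, feature_description))
for parsed in parsed_ds:
  print(parsed['label'].numpy(), parsed['text'].numpy())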

One line in the source file is one sample, which suits most line-based text data.

import tensorflow as tf

import tensorflow_datasets as tfds
import os

Data preparation

Download the already tidy data.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)

Read each file with tf.data.TextLineDataset and attach a label to every line with tf.data.Dataset.map:

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

Let's take a look at what the data looks like.

for ex in all_labeled_data.take(5):
    print(ex)

The numpy field is just each Tensor's value.

(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

With the text in hand, the next step is to turn it into numbers the computer can work with; my blind guess is one-hot or embedding vectors.
The official example uses tfds.features.text.Tokenizer(), which now has to be changed to tfds.deprecated.text.Tokenizer(); this library updates so fast that even the official examples can't keep up.


tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
print(vocab_size)

The official example's tfds.features.text.TokenTextEncoder also needs to be changed to tfds.deprecated.text.TokenTextEncoder; use it to encode the samples. PS: there must be a better function by now, otherwise this one wouldn't have been deprecated.

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)
encoded_example = encoder.encode(example_text)
print(encoded_example)
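As a possible replacement for the deprecated tokenizer and encoder (not used in this post), newer TensorFlow versions ship tf.keras.layers.TextVectorization, which builds the vocabulary and does the integer encoding in a single layer. A sketch, assuming TF 2.6 or later:

# Learn a vocabulary from the raw text, then map strings to integer sequences.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(all_labeled_data.map(lambda text, label: text))
print(vectorize_layer([example_text]))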

TensorFlow's philosophy is to defer processing until the data is actually consumed. When I write PyTorch I always prefer to get the data fully processed and settled up front.

def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  # tf.py_function wraps the eager encode() so it can run inside Dataset.map
  encoded_text, label = tf.py_function(encode,
                                       inp=[text, label],
                                       Tout=(tf.int64, tf.int64))

  # py_function drops shape information, so set the shapes back manually
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

tf.data.Dataset.take and tf.data.Dataset.skip can be thought of as taking some number of elements from a tf.data.Dataset versus skipping that many and keeping all the rest. tf.data.Dataset.padded_batch is still the handy one: it is fully automatic, so there is no need to write padding code yourself. Unlike images, text samples vary in length and cannot be stretched, so the only option is to zero-pad them up to a common size.

train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)

Take a look at the data:


sample_text, sample_labels = next(iter(test_data))

print(sample_text[0], sample_labels[0])

In the result we can see that the zeros have all been padded onto the end.

(<tf.Tensor: shape=(15,), dtype=int64, numpy=
 array([ 8132, 15145,  4866, 10461,  7732,   465, 17108, 13725,     0,
            0,     0,     0,     0,     0,     0], dtype=int64)>,
 <tf.Tensor: shape=(), dtype=int64, numpy=1>)

Zero is now also a token, so the vocabulary size has to be increased by one:

vocab_size += 1

Model preparation

Sure enough, it goes with embedding vectors, and next comes the LSTM:

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 64))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Run it

model.fit(train_data, epochs=3, validation_data=test_data)

eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
Eval loss: 0.38248714804649353, Eval accuracy: 0.8285999894142151

TensorFlow really is powerful, but only once you are familiar with its APIs.

Original: https://blog.csdn.net/u010095372/article/details/124519459
Author: 赫凯
Title: Tensorflow2.0学习-加载和预处理数据 (七)
