Image Captioning with TensorFlow

Loosely translated from: Attention Mechanism For Image Caption Generation in Python

Adapted from: Python中图像标题生成的注意机制实战教程 (Together_CZ's blog, CSDN)

This post walks through image captioning, starting from the most basic baseline (the NIC model), introducing attention, and comparing performance against a Transformer model, extending and updating the source material. It is written as the explanation of a term project.

Task definition:

Image captioning is a technology that combines two research fields: computer vision and natural language processing. Our task is to take an image as input, analyze it to obtain its feature vectors, and use those vectors to generate a fluent sentence that correctly describes the image.

The two most popular deep learning frameworks today are TensorFlow and PyTorch; here we code against TensorFlow.

Dataset:

The experiment is based on the Flickr8k dataset, the first publicly released large-scale corpus of images paired with descriptions (an extended version, Flickr30k, also exists). Each image comes with five different captions that describe the entities and events in it. We mainly work with two files: Flicker8k_Dataset, which holds the images, and Flickr8k.token.txt, which holds the caption text for each image. Inspecting the data, the image folder contains 16182 files; because of missing entries, only 40455 image-caption pairs are generated, the longest caption being 33 words and the shortest 2 words. In the end we cap the dataset at 40000 pairs for analysis.

The dataset can be downloaded by searching for the Flickr8k dataset.

The concrete steps are as follows:

1. Import the required libraries

This imports everything needed by the models used later, including VGG16, LSTM, and so on.

import string
import numpy as np
import pandas as pd
from numpy import array
from PIL import Image
import pickle
import h5py
import matplotlib.pyplot as plt
import sys, time, os, warnings

warnings.filterwarnings("ignore")
import re

import keras
import tensorflow as tf
import jieba
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu

from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import  to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense, BatchNormalization
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.applications.vgg16 import VGG16, preprocess_input

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

2. Data loading and preprocessing

Flicker8k_Dataset contains the images.

Flickr8k.token.txt contains the image captions.

image_path = "/home/lxx/data/Flicker8k_Dataset"
dir_Flickr_text = "/home/lxx/data/F_t/Flickr8k.token.txt"
jpgs = os.listdir(image_path)

print("Total Images in Dataset = {}".format(len(jpgs)))//查看图像数量

Pair each image with its corresponding captions:

file = open(dir_Flickr_text, 'r')
text = file.read()
file.close()
# pair each image number with its caption text
datatxt = []
for line in text.split('\n'):
    col = line.split('\t')
    if len(col) == 1:
        continue
    w = col[0].split("#")
    datatxt.append(w + [col[1].lower()])
print(len(datatxt))
data = pd.DataFrame(datatxt, columns=["filename", "index", "caption"])
data = data.reindex(columns=['index', 'filename', 'caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)

data.head()

Next we need to process the vocabulary.

From the captions we build the dictionary we need, which makes it easy to convert captions later on. The text itself needs cleaning, e.g. removing punctuation, single characters and numeric values, and the cleaned text is then split to count the total vocabulary. For the dictionary we keep the 5000 most frequently used words to build the tokenizer; words used less frequently are mapped to <unk>.


# vocabulary
vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

def remove_punctuation(text_original):  # remove punctuation
    # str.translate needs a translation table; build one that deletes all punctuation characters
    text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))
    return (text_no_punctuation)

def remove_single_character(text):  # remove single-character words
    text_len_more_than1 = ""
    for word in text.split():
        if len(word) > 1:
            text_len_more_than1 += " " + word
    return (text_len_more_than1)

def remove_numeric(text):  # remove numeric tokens
    text_no_numeric = ""
    for word in text.split():
        isalpha = word.isalpha()  # True only if the word consists of letters only
        if isalpha:
            text_no_numeric += " " + word
    return (text_no_numeric)

def text_clean(text_original):
    text = remove_punctuation(text_original)
    text = remove_single_character(text)
    text = remove_numeric(text)
    return (text)

for i, caption in enumerate(data.caption.values):
    newcaption = text_clean(caption)
    data["caption"].iloc[i] = newcaption

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))
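A quick check of the cleaning pipeline on a made-up caption (a hypothetical input, shown only to illustrate the three steps):

# hypothetical example: punctuation, the single character "a" and the number are all dropped
print(text_clean("A dog , 2 children and a ball !"))
# expected output: " dog children and ball"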

For each caption, we need to add the <start> and <end> tokens.


# add <start> and <end> tokens
PATH = "/home/lxx/data/Flicker8k_Dataset"
all_captions = []
for caption in data["caption"].astype(str):
    caption = '<start> ' + caption + ' <end>'
    all_captions.append(caption)

print(all_captions[:10])


After that, we take the first 40000 image-caption pairs for processing (with a batch size of 64, that is 625 batches).

# the full image path corresponding to each caption
all_img_name_vector = []
for annot in data["filename"]:
    full_image_path = PATH +"/"+ annot
    all_img_name_vector.append(full_image_path)

print(all_img_name_vector[:10])

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

# keep only the first 40000 pairs
def data_limiter(num, total_captions, all_img_name_vector):
    train_captions, img_name_vector = shuffle(total_captions, all_img_name_vector, random_state=1)  # random shuffle
    train_captions = train_captions[:num]
    img_name_vector = img_name_vector[:num]
    return train_captions, img_name_vector

train_captions, img_name_vector = data_limiter(40000, all_captions, all_img_name_vector)

3. Model construction

Here we use two architectures to define the image feature extraction model: VGG16 and Inception_v3.

Both models were originally built for image classification; for image captioning we only need the final feature maps, so the classification (softmax) head is removed from each model.

What we need to note here is the difference between the two:

1) In load_image, VGG16 resizes inputs to (224, 224), while Inception_V3 uses (299, 299).

2) Inception_V3 outputs feature maps of shape (8, 8, 2048), i.e. (64, 2048);

VGG16 outputs feature maps of shape (7, 7, 512), i.e. (49, 512).

The reason these shapes matter is explained later; the example below uses Inception_v3.


# define the image loader using InceptionV3 preprocessing
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

image_features_extract_model.summary()

Model structure: Inception_v3

[Figure: InceptionV3 model summary]

Model structure: VGG16

[Figure: VGG16 model summary]
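For completeness, here is a minimal sketch of the VGG16 variant described above (224x224 input, (7, 7, 512) feature maps, i.e. 49 positions of 512 channels). The function and variable names below mirror the InceptionV3 code and are only illustrative:

# VGG16 version of the loader and feature extractor (sketch, not used in the rest of this post)
def load_image_vgg16(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224))                      # VGG16 expects 224x224 inputs
    img = tf.keras.applications.vgg16.preprocess_input(img)
    return img, image_path

vgg16_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
vgg16_features_extract_model = tf.keras.Model(vgg16_model.input, vgg16_model.layers[-1].output)
# output shape per image: (7, 7, 512), reshaped later to (49, 512)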

Next, let's map each image file name through the image-loading function.

# map the image paths through load_image
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

We store the extracted features in individual .npy files and later pass them through the encoder.

# extract the features and store them in per-image .npy files
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))

    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())

After that, we build a dictionary: the vocabulary is limited to the 5000 most frequent words, and words that fall outside it are mapped to <unk>.

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')#&#x5206;&#x8BCD;&#x5668;

tokenizer.fit_on_texts(train_captions)  # build the token dictionary from the caption corpus
train_seqs = tokenizer.texts_to_sequences(train_captions)  # convert each caption to a sequence of word indices
tokenizer.word_index['<pad>'] = 0  # word_index maps every word to an id starting from 1; 0 is reserved for <pad>
tokenizer.index_word[0] = '<pad>'

train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

Compute the length of every caption so that they can later be padded to a uniform length.


def calc_max_length(tensor):
    return max(len(t) for t in tensor)

max_length = calc_max_length(train_seqs)

def calc_min_length(tensor):
    return min(len(t) for t in tensor)

min_length = calc_min_length(train_seqs)

print('Max Length of any caption : Min Length of any caption = ' + str(max_length) + " : " + str(min_length))

Split the data into a training set and a validation set with an 80/20 ratio.

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector, cap_vector, test_size=0.2, random_state=0)  # 80-20 train/validation split

Training parameter definitions

The code below still follows Inception_V3; to switch to VGG16, simply change the feature shapes as described above.

# define the training parameters
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
# the InceptionV3 feature maps have shape (8, 8, 2048), i.e. 64 positions x 2048 channels,
# which gives the attention_features_shape and features_shape below
features_shape = 2048
attention_features_shape = 64
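If the VGG16 extractor is used instead, the corresponding values would be as follows (a sketch consistent with the (7, 7, 512) output mentioned above):

# VGG16 equivalents (sketch): 7 x 7 = 49 spatial positions, 512 channels
features_shape = 512
attention_features_shape = 49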

Load the .npy features into memory


def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
# use dataset.map to call map_func in parallel and load the data into memory
dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
# shuffle the dataset and group it into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
# depending on the available hardware, prefetch data while the model is training to speed things up
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Encoder structure (for the Inception_V3 features)

class InceptionV3_Encoder(tf.keras.Model):
    # This encoder passes the features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(InceptionV3_Encoder, self).__init__()
        # shape after fc == (batch_size, 49, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)  # fully connected projection layer
        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)  # dropout against overfitting

    def call(self, x):
        # x= self.dropout(x)
        x = self.fc(x)
        x = tf.nn.relu(x)  # ReLU: values below 0 become 0, values above 0 are unchanged
        return x

Baseline system: NIC (no attention)

The baseline of this experiment is the NIC model proposed by Google, i.e. a CNN-RNN model: a simple CNN is used as the encoder, with three convolutional layers (all with kernel size 3 and stride 1) each followed by max pooling to extract basic image features, and an RNN is used as the decoder. Concretely, the image features produced by the CNN are fed to the RNN as its first input, and the words of the caption are then fed into the RNN one by one for training, again with teacher forcing, as shown in Figure 5. The RNN in Figure 5 is implemented with an LSTM, which, like a GRU, largely alleviates the gradient problems of a plain RNN.

[Figure 5: NIC (CNN-RNN) model structure]

The code is as follows:


class Rnn_Local_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Rnn_Local_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU( self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        self.fc1 = tf.keras.layers.Dense(self.units)

        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True,
                                                                     scale=True, beta_initializer='zeros',
                                                                     gamma_initializer='ones',
                                                                     moving_mean_initializer='zeros',
                                                                     moving_variance_initializer='ones',
                                                                     beta_regularizer=None, gamma_regularizer=None,
                                                                     beta_constraint=None, gamma_constraint=None)

        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.fc3 = tf.keras.layers.Dense(embedding_dim)

    def call(self, x, features, hidden,i):

        # added by the author: flatten the image features, project them to embedding_dim,
        # and expand features / hidden with a time axis
        features=tf.keras.layers.Flatten()(features)
        features=self.dropout(features)
        features=self.fc3(features)
        # features=tf.nn.softmax(features)
        features=tf.expand_dims(features, 1)
        # print(hidden.shape)
        # hidden=self.fc3(hidden)
        hidden=tf.expand_dims(hidden, 1)

        # the word input passes through the embedding layer; output shape: (batch_size, 1, embedding_dim) == (64, 1, 256)
        x = self.embedding(x)

        # x shape after concatenation == (64, 1,  512)
        # concatenate x with the attention result to get a new x of shape (batch_size, 1, embedding_dim + hidden_size)
        # x = tf.concat([features, x], axis=-1)    # x shape after concatenation == (64, 1,  512)
        # passing the concatenated vector to the GRU
        # print(x)

        if i == 1:
            x = tf.concat([features, hidden], axis=-1)
            output, state = self.gru(x)
        else:
            x = tf.concat([x, hidden], axis=-1)
            output, state = self.gru(x)
        # shape == (batch_size, max_length, hidden_size)

        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)

        x = tf.reshape(x, (-1, x.shape[2]))

        # Adding Dropout and BatchNorm Layers
        x = self.dropout(x)
        x = self.batchnormalization(x)

        # output shape == (64 * 512)
        x = self.fc2(x)
        # shape : (64 * 8329(vocab))
        return x, state

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
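The baseline paragraph above mentions a small CNN encoder (three convolutional layers, kernel size 3, stride 1, each followed by max pooling) that is not shown in the published code. A minimal sketch of such an encoder under those stated assumptions (the layer widths are illustrative, not the author's exact choice):

# sketch of the simple CNN encoder used by the NIC baseline (widths chosen for illustration)
class Simple_CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(Simple_CNN_Encoder, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(64, 3, strides=1, padding='same', activation='relu')
        self.pool1 = tf.keras.layers.MaxPooling2D()
        self.conv2 = tf.keras.layers.Conv2D(128, 3, strides=1, padding='same', activation='relu')
        self.pool2 = tf.keras.layers.MaxPooling2D()
        self.conv3 = tf.keras.layers.Conv2D(256, 3, strides=1, padding='same', activation='relu')
        self.pool3 = tf.keras.layers.MaxPooling2D()
        self.flatten = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(embedding_dim)  # project to the embedding size fed to the RNN

    def call(self, x):
        # x: a batch of raw images, e.g. (batch, 224, 224, 3)
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        x = self.pool3(self.conv3(x))
        x = self.flatten(x)
        return self.fc(x)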

Using the attention mechanism:


# define the RNN decoder with Bahdanau attention:

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        """&#x521D;&#x59CB;&#x5316;&#x4E09;&#x4E2A;&#x5FC5;&#x8981;&#x7684;&#x5168;&#x8FDE;&#x63A5;&#x5C42;"""
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
"""
        description: &#x5177;&#x4F53;&#x8BA1;&#x7B97;&#x51FD;&#x6570;
        :param features: &#x7F16;&#x7801;&#x5668;&#x7684;&#x8F93;&#x51FA;
        :param hidden: &#x89E3;&#x7801;&#x5668;&#x7684;&#x9690;&#x5C42;&#x8F93;&#x51FA;
        return: &#x901A;&#x8FC7;&#x6CE8;&#x610F;&#x529B;&#x673A;&#x5236;&#x5904;&#x7406;&#x540E;&#x7684;&#x7ED3;&#x679C;context_vector&#x548C;&#x6CE8;&#x610F;&#x529B;&#x6743;&#x91CD;attention_weights
"""
        # &#x4E3A;hidden&#x6269;&#x5C55;&#x4E00;&#x4E2A;&#x7EF4;&#x5EA6;(batch_size, hidden_size) --> (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # &#x6839;&#x636E;&#x516C;&#x5F0F;&#x8BA1;&#x7B97;&#x6CE8;&#x610F;&#x529B;&#x5F97;&#x5206;, &#x8F93;&#x51FA;score&#x7684;&#x5F62;&#x72B6;&#x4E3A;: (batch_size, 64, hidden_size)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # &#x6839;&#x636E;&#x516C;&#x5F0F;&#x8BA1;&#x7B97;&#x6CE8;&#x610F;&#x529B;&#x6743;&#x91CD;, &#x8F93;&#x51FA;attention_weights&#x5F62;&#x72B6;&#x4E3A;: (batch_size, 64, 1)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # &#x6700;&#x540E;&#x6839;&#x636E;&#x516C;&#x5F0F;&#x83B7;&#x5F97;&#x6CE8;&#x610F;&#x529B;&#x673A;&#x5236;&#x5904;&#x7406;&#x540E;&#x7684;&#x7ED3;&#x679C;context_vector
        # context_vector&#x7684;&#x5F62;&#x72B6;&#x4E3A;: (batch_size, hidden_size)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
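A quick shape check of the attention module (a toy sketch with random tensors; the shapes match the InceptionV3 setup of 64 feature positions, embedding_dim = 256 and units = 512):

# toy shape check (hypothetical values, not part of the training pipeline)
attn = BahdanauAttention(units=512)
dummy_features = tf.random.uniform((64, 64, 256))    # (batch, positions, embedding_dim)
dummy_hidden = tf.random.uniform((64, 512))          # (batch, units)
ctx, w = attn(dummy_features, dummy_hidden)
print(ctx.shape)   # (64, 256)
print(w.shape)     # (64, 64, 1)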

class Rnn_Local_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Rnn_Local_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU( self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        self.fc1 = tf.keras.layers.Dense(self.units)

        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True,
                                                                     scale=True, beta_initializer='zeros',
                                                                     gamma_initializer='ones',
                                                                     moving_mean_initializer='zeros',
                                                                     moving_variance_initializer='ones',
                                                                     beta_regularizer=None, gamma_regularizer=None,
                                                                     beta_constraint=None, gamma_constraint=None)

        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.fc3 = tf.keras.layers.Dense(embedding_dim)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden,i):
        # features shape ==> (64,49,256) ==> Output from ENCODER
        # hidden shape == (batch_size, hidden_size) ==>(64,512)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size) ==> (64,1,512)

        # hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (64, 49, 1)
        # Attention Function
        '''e(ij) = f(s(t-1),h(j))'''
        ''' e(ij) = Vattn(T)*tanh(Uattn * h(j) + Wattn * s(t))'''

        # score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))

        # self.Uattn(features) : (64,49,512)
        # self.Wattn(hidden_with_time_axis) : (64,1,512)
        # tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)) : (64,49,512)
        # self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis))) : (64,49,1) ==> score

        # you get 1 at the last axis because you are applying score to self.Vattn
        # Then find Probability using Softmax
        '''attention_weights(alpha(ij)) = softmax(e(ij))'''

        # attention_weights = tf.nn.softmax(score, axis=1)

        # attention_weights shape == (64, 49, 1)
        # Give weights to the different pixels in the image
        ''' C(t) = Summation(j=1 to T) (attention_weights * VGG-16 features) '''

        # context_vector = attention_weights * features
        # context_vector = tf.reduce_sum(context_vector, axis=1)
        # features=tf.reduce_sum(features, axis=1)  # changed by the author
        # Context Vector(64,256) = AttentionWeights(64,49,1) * features(64,49,256)
        # context_vector shape after sum == (64, 256)
        context_vector, attention_weights = self.attention(features, hidden)

        # added by the author: flatten and project the image features
        features=tf.keras.layers.Flatten()(features)
        features=self.fc3(features)
        features=tf.expand_dims(features, 1)
        # the word input passes through the embedding layer; output shape: (batch_size, 1, embedding_dim) == (64, 1, 256)
        x = self.embedding(x)
        # x shape after concatenation == (64, 1,  512)
        # concatenate x with the attention result to get a new x of shape (batch_size, 1, embedding_dim + hidden_size)
        # x = tf.concat([tf.expand_dims(features, 1), x], axis=-1)    # x shape after concatenation == (64, 1,  512)
        # passing the concatenated vector to the GRU
        # print(x)
        if (i == 1):
            output, state = self.gru(features)
        else:
            output, state = self.gru(x)

        # shape == (batch_size, max_length, hidden_size)

        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)

        x = tf.reshape(x, (-1, x.shape[2]))

        # Adding Dropout and BatchNorm Layers
        x = self.dropout(x)
        x = self.batchnormalization(x)

        # output shape == (64 * 512)
        x = self.fc2(x)

        # shape : (64 * 8329(vocab))
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

Encoder and decoder:

Here we instantiate the encoder and decoder. The decoder combines a GRU with Bahdanau attention: the image features and the current hidden state are passed to the attention module to obtain a context vector and attention weights, the input word embedding is combined with this attention output into a new feature vector, which is fed to the GRU, and several fully connected layers then map the GRU output to the required size.

encoder = InceptionV3_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

Loss function and optimizer:


# define the loss function and optimizer
optimizer = tf.keras.optimizers.Adam()  # Adam: roughly, start with a larger learning rate and shrink it as training proceeds, balancing speed and quality
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(  # sparse categorical cross-entropy loss
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))  # compare with 0: True for real tokens, False for <pad>
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(  loss_)
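A toy sanity check of the masked loss (hypothetical tensors; the padded position with id 0 contributes nothing to the loss):

# hypothetical example: a batch of 3 target words, the last one is <pad> (id 0)
real_example = tf.constant([4, 7, 0])
pred_example = tf.random.uniform((3, vocab_size))   # unnormalized logits
print(loss_function(real_example, pred_example))    # the <pad> position is masked out before averaging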

4. Model training

Here we use Adam as the optimizer and sparse categorical cross-entropy as the loss, computed between the generated sentence and the ground-truth sentence. We train with teacher forcing, i.e. the next word of the ground-truth sentence is forced to be the model's next input, which greatly speeds up convergence and makes training more stable. Because the models are custom subclassed models, they can only be saved via save_weights. Finally, greedy search is used to pick the output words and assemble the sentence.


loss_plot = []

@tf.function
def train_step(img_tensor, target):
    loss = 0
    # initialize the hidden state for every batch,
    # because the captions of different images are unrelated

    # initialize the decoder hidden-state tensor
    hidden = decoder.reset_state(batch_size=target.shape[0])
    # define the first decoder input (the tensor for the start token <start>)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

    # open a context manager that records gradients
    with tf.GradientTape() as tape:
        # run the input image tensor through the encoder
        features = encoder(img_tensor)
        # decode step by step; the decoding length is target.shape[1], the maximum caption length
        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            # the decoder also expects the step index i (see its call signature above)
            predictions, hidden, _ = decoder(dec_input, features, hidden, i)
            # accumulate the loss of this decoding step
            loss += loss_function(target[:, i], predictions)

            # using teacher forcing
            # teacher forcing: the ground-truth word becomes the next decoder input
            dec_input = tf.expand_dims(target[:, i], 1)

    # after the whole sequence is decoded, compute the average per-sentence loss
    total_loss = (loss / int(target.shape[1]))
    # collect all trainable variables of the model
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    # compute the gradients with respect to these variables
    gradients = tape.gradient(loss, trainable_variables)
    # update the parameters with the gradients
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    # return the batch loss and the per-sentence average loss
    return loss, total_loss

Start training and plot the loss curve


# set the number of training epochs
EPOCHS = 101

# loop over the epochs
for epoch in range(0, EPOCHS):
    # record the start time of this epoch
    start = time.time()
    # initialize the total loss of this epoch to 0
    total_loss = 0
    # loop over every batch in the dataset
    for (batch, (img_tensor, target)) in enumerate(dataset):
        # call train_step to get the batch loss and the per-sentence average loss
        batch_loss, t_loss = train_step(img_tensor, target)
        # add the per-sentence average loss to the epoch total
        total_loss += t_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))

    # record the average epoch loss for plotting
    loss_plot.append(total_loss / num_steps)
    # print the epoch and its average loss
    print('Epoch {} Loss {:.6f}'.format(epoch + 1, total_loss / num_steps))
    # print how long the epoch took
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
    if epoch % 20 == 0:
        encoder.save_weights("/home/lxx/data/encoder_ImcepV3_att_%s.h5" % epoch)
        # save the decoder to its own file (the original code overwrote the encoder weights here)
        decoder.save_weights("/home/lxx/data/decoder_ImcepV3_att_%s.h5" % epoch)

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.savefig(fname= "VGG16_ImcepV3_100" + ".png")

[Figure: training loss curve]

5. Greedy search and evaluation metrics

def evaluate(image):
    # initialize the attention tensor used for plotting as all zeros
    attention_plot = np.zeros((max_length, attention_features_shape))
    # initialize the hidden-state tensor
    hidden = decoder.reset_state(batch_size=1)
    # preprocess the image with load_image and add a batch dimension
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    # extract the image features and reshape them to what the encoder expects
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    # encode the image
    features = encoder(img_tensor_val)
    # initialize the decoder input tensor
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    # initialize the list that will hold the caption words
    result = []
    # generate the caption from the decoder output step by step
    for i in range(max_length):
        # get the decoder output for this step
        # (i + 1 so that the first step matches i == 1 in the training loop)
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden, i + 1)
        # fill the attention tensor used for plotting with this step's attention weights
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()
        # greedy search: take the index with the highest predicted probability as predicted_id
        predicted_id = tf.argmax(predictions[0]).numpy()

        # map predicted_id back to its word and append it to the result list
        result.append(tokenizer.index_word[predicted_id])

        # stop if the predicted word is the end token <end>
        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot
        # otherwise expand this step's prediction and use it as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)

    # slice attention_plot to the real length of the prediction, dropping the unused zero rows
    attention_plot = attention_plot[:len(result), :]
    # return the result list and the sliced attention tensor
    return result, attention_plot

Define a function that visualizes the attention.


def plot_attention(image, result, attention_plot):
    """Attention visualization function."""
    # get the image as a numpy array

    temp_image = np.array(Image.open(image))

    # create a 10x10 figure
    fig = plt.figure(figsize=(10, 10))
    # length of the generated caption
    len_result = len(result)
    # loop over the words of the caption
    for l in range(len_result):
        # reshape the attention tensor of this word into an 8x8 map
        temp_att = np.resize(attention_plot[l], (8, 8))
        # create a subplot grid sized from the caption length
        ax = fig.add_subplot(len_result // 2, len_result // 2, l + 1)
        # set the subplot title to the word
        ax.set_title(result[l])
        # show the original image in the subplot
        img = ax.imshow(temp_image)
        # overlay the attention map as a grayscale layer
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())
    # adjust the subplots to fill the figure
    plt.tight_layout()
    # display the figure
    plt.show()

Sample output:

[Figure: generated caption with attention overlays]

Test on a random validation image:

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
start = time.time()
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]

remove "<unk>" in result
for i in result:
    if i == "<unk>":
        result.remove(i)

# remove <end> from result
result_join = ' '.join(result)
result_final = result_join.rsplit(' ', 1)[0]

real_appn = []
real_appn.append(real_caption.split())
reference = real_appn
candidate = result_final

print('Real Caption:', real_caption)
print('Prediction Caption:', result_final)

plot_attention(image, result, attention_plot)
print(f"time took to Predict: {round(time.time() - start)} sec")

Image.open(img_name_val[rid])

ROUGE-1 and BLEU-1 test scores

We evaluate on 100 images and average the scores (the code partly repeats the snippet above).


def Rouge_1(target, reference):  # reference is the reference text, target is the candidate text  *** one-gram / unigram model ***
    terms_reference = jieba.cut(reference)  # default (precise) segmentation mode
    terms_target= jieba.cut(target)
    grams_reference = list(terms_reference)
    grams_model = list(terms_target)
    temp = 0
    ngram_all = len(grams_reference)
    for x in grams_reference:
        if x in grams_model: temp=temp+1
    rouge_1=temp/ngram_all
    return rouge_1

# accumulators must be initialized before the loop, otherwise they are reset for every image
total_score = 0
total_roughscore = 0

for z in range(102):
    rid = np.random.randint(0, len(img_name_val))
    print(rid)
    image = img_name_val[rid]
    start = time.time()
    real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
    result, attention_plot = evaluate(image)

    first = real_caption.split(' ', 1)[1]
    real_caption = first.rsplit(' ', 1)[0]

    # remove "<unk>" in result
    for i in result:
        if i == "<unk>":
            result.remove(i)
        if i == "<start>":
            result.remove(i)

    # remove <end> from result
    result_join = ' '.join(result)
    # print(result_join)
    # print(real_caption)
    result_final = result_join.rsplit(' ', 1)[0]  # prediction

    real_appn = []
    real_appn.append(real_caption.split())  # ground truth

    print('Real Caption:', real_caption)
    print('Prediction Caption:', result_final)

    # plot_attention(image, result, attention_plot)
    # print(f"time took to Predict: {round(time.time() - start)} sec")

    # sentence_bleu expects a list of tokenized references and a tokenized hypothesis
    score = sentence_bleu(real_appn, result_final.split(), weights=(1, 0, 0, 0))
    total_score += score
    total_roughscore += Rouge_1(result_final, real_caption)

    # print(f"BLEU-1 score: {score * 100}")
    # print(f"ROUGE_1 score: {Rouge_1(result_final, real_caption) * 100}")

    if z == 100:
        print(f"average BLEU-1 score: {total_score / (z + 1)}")
        print(f"average ROUGE_1 score: {total_roughscore / (z + 1)}")
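As a quick illustration of how the two metrics are called, here is a hypothetical toy caption pair (not from the dataset):

# hypothetical example of BLEU-1 / ROUGE-1 on a single caption pair
toy_reference = 'a dog runs on the grass'
toy_candidate = 'a dog is running on grass'
print(sentence_bleu([toy_reference.split()], toy_candidate.split(), weights=(1, 0, 0, 0)))
print(Rouge_1(toy_candidate, toy_reference))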

Due to time constraints, we compare all four models trained with EPOCH = 60 and run predictions on the same 100 images. Ranked from lowest to highest on both the BLEU and ROUGE metrics, the models are NIC, VGG16+Att, Inception V3+Att and Transformer. Inception V3 clearly outperforms VGG16, which is consistent with its much deeper architecture seen in the model summaries, while the Transformer, as the currently popular model, beats the others in both training speed and evaluation scores.

[Figures: BLEU-1 and ROUGE-1 comparison of the four models at epoch 60]

An unresolved issue: because the model is a custom subclass, its variables must be given initial values (built) before load_weights can restore it. I tried many ways to perform this initialization step without success, and ended up training for one epoch and breaking out of the loop just to build the variables so the weights could be loaded.
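A common workaround (a sketch, not the author's verified solution; the checkpoint paths and tensor shapes below are assumptions based on the InceptionV3 setup above) is to build the subclassed models with one dummy forward pass before calling load_weights:

# build the variables of the subclassed models with a dummy forward pass, then load the weights
dummy_img = tf.random.uniform((BATCH_SIZE, 64, 2048))               # fake InceptionV3 features
dummy_hidden = decoder.reset_state(batch_size=BATCH_SIZE)
dummy_word = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

dummy_features = encoder(dummy_img)                                  # builds the encoder variables
_ = decoder(dummy_word, dummy_features, dummy_hidden, 1)             # builds the decoder variables

encoder.load_weights("/home/lxx/data/encoder_ImcepV3_att_100.h5")    # hypothetical checkpoint paths
decoder.load_weights("/home/lxx/data/decoder_ImcepV3_att_100.h5")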

That concludes this image captioning walkthrough!

Original: https://blog.csdn.net/weixin_43631804/article/details/123069630
Author: 林仔
Title: 基于tensorflow实现图像描述
