Image Captioning with TensorFlow

Loosely translated from: Attention Mechanism For Image Caption Generation in Python

Adapted from: Python中图像标题生成的注意机制实战教程 (Together_CZ's blog, CSDN)

This post walks through image captioning, starting from the most basic baseline (the NIC model), introducing attention, and comparing performance against a Transformer model, extending and updating the source material. It is written as the explanation of a term project.

Task definition:

Image captioning is a technology that combines two research fields: computer vision and natural language processing. Our task is to take an image as input, analyze it to obtain its feature vectors, and use those vectors to generate a fluent sentence that correctly describes the image.

The two most popular deep learning frameworks today are TensorFlow and PyTorch; here we code against TensorFlow.

Dataset:

The experiment is based on the Flickr8k dataset, the first publicly released large-scale corpus of images paired with descriptions (an extended version, Flickr30k, also exists). Each image comes with five different captions that describe the entities and events in it. We mainly work with two files: Flicker8k_Dataset, which holds the images, and Flickr8k.token.txt, which holds the caption text for each image. Inspecting the data, the image folder contains 16182 files; because of missing entries, only 40455 image-caption pairs are generated, the longest caption being 33 words and the shortest 2 words. In the end we cap the dataset at 40000 pairs for analysis.

The dataset can be downloaded by searching for the Flickr8k dataset.

The concrete steps are as follows:

1. Import the required libraries

This imports everything needed by the models used later, including VGG16, LSTM, and so on.

import string
import numpy as np
import pandas as pd
from numpy import array
from PIL import Image
import pickle
import h5py
import matplotlib.pyplot as plt
import sys, time, os, warnings

warnings.filterwarnings("ignore")
import re

import keras
import tensorflow as tf
import jieba
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu

from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import  to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense, BatchNormalization
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.applications.vgg16 import VGG16, preprocess_input

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

2. Data loading and preprocessing

Flicker8k_Dataset contains the images.

Flickr8k.token.txt contains the image captions.

image_path = "/home/lxx/data/Flicker8k_Dataset"
dir_Flickr_text = "/home/lxx/data/F_t/Flickr8k.token.txt"
jpgs = os.listdir(image_path)

print("Total Images in Dataset = {}".format(len(jpgs)))//查看图像数量

Pair each image with its corresponding captions:

file = open(dir_Flickr_text, 'r')
text = file.read()
file.close()
# pair each image number with its caption text
datatxt = []
for line in text.split('\n'):
    col = line.split('\t')
    if len(col) == 1:
        continue
    w = col[0].split("#")
    datatxt.append(w + [col[1].lower()])
print(len(datatxt))
data = pd.DataFrame(datatxt, columns=["filename", "index", "caption"])
data = data.reindex(columns=['index', 'filename', 'caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)

data.head()

Next we need to process the vocabulary.

From the captions we build the dictionary we need, which makes it easy to convert captions later on. The text itself needs cleaning, e.g. removing punctuation, single characters and numeric values, and the cleaned text is then split to count the total vocabulary. For the dictionary we keep the 5000 most frequently used words to build the tokenizer; words used less frequently are mapped to <unk>.


# vocabulary
vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

def remove_punctuation(text_original):  # remove punctuation
    # str.translate needs a translation table; build one that deletes all punctuation characters
    text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))
    return (text_no_punctuation)

def remove_single_character(text):  # remove single-character words
    text_len_more_than1 = ""
    for word in text.split():
        if len(word) > 1:
            text_len_more_than1 += " " + word
    return (text_len_more_than1)

def remove_numeric(text):  # remove numeric tokens
    text_no_numeric = ""
    for word in text.split():
        isalpha = word.isalpha()  # True only if the word consists of letters only
        if isalpha:
            text_no_numeric += " " + word
    return (text_no_numeric)

def text_clean(text_original):
    text = remove_punctuation(text_original)
    text = remove_single_character(text)
    text = remove_numeric(text)
    return (text)

for i, caption in enumerate(data.caption.values):
    newcaption = text_clean(caption)
    data["caption"].iloc[i] = newcaption

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))
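A quick check of the cleaning pipeline on a made-up caption (a hypothetical input, shown only to illustrate the three steps):

# hypothetical example: punctuation, the single character "a" and the number are all dropped
print(text_clean("A dog , 2 children and a ball !"))
# expected output: " dog children and ball"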

For each caption, we need to add the <start> and <end> tokens.


# add <start> and <end> tokens
PATH = "/home/lxx/data/Flicker8k_Dataset"
all_captions = []
for caption in data["caption"].astype(str):
    caption = '<start> ' + caption + ' <end>'
    all_captions.append(caption)

print(all_captions[:10])


After that, we take the first 40000 image-caption pairs for processing (with a batch size of 64, that is 625 batches).

# the full image path corresponding to each caption
all_img_name_vector = []
for annot in data["filename"]:
    full_image_path = PATH +"/"+ annot
    all_img_name_vector.append(full_image_path)

print(all_img_name_vector[:10])

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

# keep only the first 40000 pairs
def data_limiter(num, total_captions, all_img_name_vector):
    train_captions, img_name_vector = shuffle(total_captions, all_img_name_vector, random_state=1)  # random shuffle
    train_captions = train_captions[:num]
    img_name_vector = img_name_vector[:num]
    return train_captions, img_name_vector

train_captions, img_name_vector = data_limiter(40000, all_captions, all_img_name_vector)

3. Model construction

Here we use two architectures to define the image feature extraction model: VGG16 and Inception_v3.

Both models were originally built for image classification; for image captioning we only need the final feature maps, so the classification (softmax) head is removed from each model.

What we need to note here is the difference between the two:

1) In load_image, VGG16 resizes inputs to (224, 224), while Inception_V3 uses (299, 299).

2) Inception_V3 outputs feature maps of shape (8, 8, 2048), i.e. (64, 2048);

VGG16 outputs feature maps of shape (7, 7, 512), i.e. (49, 512).

The reason these shapes matter is explained later; the example below uses Inception_v3.


# define the image loader using InceptionV3 preprocessing
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

image_features_extract_model.summary()

Model structure: Inception_v3

[Figure: InceptionV3 model summary]

Model structure: VGG16

[Figure: VGG16 model summary]
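For completeness, here is a minimal sketch of the VGG16 variant described above (224x224 input, (7, 7, 512) feature maps, i.e. 49 positions of 512 channels). The function and variable names below mirror the InceptionV3 code and are only illustrative:

# VGG16 version of the loader and feature extractor (sketch, not used in the rest of this post)
def load_image_vgg16(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224))                      # VGG16 expects 224x224 inputs
    img = tf.keras.applications.vgg16.preprocess_input(img)
    return img, image_path

vgg16_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
vgg16_features_extract_model = tf.keras.Model(vgg16_model.input, vgg16_model.layers[-1].output)
# output shape per image: (7, 7, 512), reshaped later to (49, 512)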

Next, let's map each image file name through the image-loading function.

# map the image paths through load_image
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

We store the extracted features in individual .npy files and later pass them through the encoder.

# extract the features and store them in per-image .npy files
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))

    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())

After that, we build a dictionary: the vocabulary is limited to the 5000 most frequent words, and words that fall outside it are mapped to <unk>.

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')#&#x5206;&#x8BCD;&#x5668;

tokenizer.fit_on_texts(train_captions)  # build the token dictionary from the caption corpus
train_seqs = tokenizer.texts_to_sequences(train_captions)  # convert each caption to a sequence of word indices
tokenizer.word_index['<pad>'] = 0  # word_index maps every word to an id starting from 1; 0 is reserved for <pad>
tokenizer.index_word[0] = '<pad>'

train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

Compute the length of every caption so that they can later be padded to a uniform length.


def calc_max_length(tensor):
    return max(len(t) for t in tensor)

max_length = calc_max_length(train_seqs)

def calc_min_length(tensor):
    return min(len(t) for t in tensor)

min_length = calc_min_length(train_seqs)

print('Max Length of any caption : Min Length of any caption = ' + str(max_length) + " : " + str(min_length))

Split the data into a training set and a validation set with an 80/20 ratio.

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector, cap_vector, test_size=0.2, random_state=0)  # 80-20 train/validation split

Training parameter definitions

The code below still follows Inception_V3; to switch to VGG16, simply change the feature shapes as described above.

# define the training parameters
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
# the InceptionV3 feature maps have shape (8, 8, 2048), i.e. 64 positions x 2048 channels,
# which gives the attention_features_shape and features_shape below
features_shape = 2048
attention_features_shape = 64
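If the VGG16 extractor is used instead, the corresponding values would be as follows (a sketch consistent with the (7, 7, 512) output mentioned above):

# VGG16 equivalents (sketch): 7 x 7 = 49 spatial positions, 512 channels
features_shape = 512
attention_features_shape = 49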

Load the .npy features into memory


def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
# use dataset.map to call map_func in parallel and load the data into memory
dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
# shuffle the dataset and group it into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
# depending on the available hardware, prefetch data while the model is training to speed things up
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Encoder structure (for the Inception_V3 features)

class InceptionV3_Encoder(tf.keras.Model):
    # This encoder passes the features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(InceptionV3_Encoder, self).__init__()
        # shape after fc == (batch_size, 49, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)  # fully connected projection layer
        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)  # dropout against overfitting

    def call(self, x):
        # x= self.dropout(x)
        x = self.fc(x)
        x = tf.nn.relu(x)  # ReLU: values below 0 become 0, values above 0 are unchanged
        return x

Baseline system: NIC (no attention)

The baseline of this experiment is the NIC model proposed by Google, i.e. a CNN-RNN model: a simple CNN is used as the encoder, with three convolutional layers (all with kernel size 3 and stride 1) each followed by max pooling to extract basic image features, and an RNN is used as the decoder. Concretely, the image features produced by the CNN are fed to the RNN as its first input, and the words of the caption are then fed into the RNN one by one for training, again with teacher forcing, as shown in Figure 5. The RNN in Figure 5 is implemented with an LSTM, which, like a GRU, largely alleviates the gradient problems of a plain RNN.

[Figure 5: NIC (CNN-RNN) model structure]

The code is as follows:


class Rnn_Local_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Rnn_Local_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU( self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        self.fc1 = tf.keras.layers.Dense(self.units)

        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True,
                                                                     scale=True, beta_initializer='zeros',
                                                                     gamma_initializer='ones',
                                                                     moving_mean_initializer='zeros',
                                                                     moving_variance_initializer='ones',
                                                                     beta_regularizer=None, gamma_regularizer=None,
                                                                     beta_constraint=None, gamma_constraint=None)

        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.fc3 = tf.keras.layers.Dense(embedding_dim)

    def call(self, x, features, hidden,i):

        # added by the author: flatten the image features, project them to embedding_dim,
        # and expand features / hidden with a time axis
        features=tf.keras.layers.Flatten()(features)
        features=self.dropout(features)
        features=self.fc3(features)
        # features=tf.nn.softmax(features)
        features=tf.expand_dims(features, 1)
        # print(hidden.shape)
        # hidden=self.fc3(hidden)
        hidden=tf.expand_dims(hidden, 1)

        # the word input passes through the embedding layer; output shape: (batch_size, 1, embedding_dim) == (64, 1, 256)
        x = self.embedding(x)

        # x shape after concatenation == (64, 1,  512)
        # concatenate x with the attention result to get a new x of shape (batch_size, 1, embedding_dim + hidden_size)
        # x = tf.concat([features, x], axis=-1)    # x shape after concatenation == (64, 1,  512)
        # passing the concatenated vector to the GRU
        # print(x)

        if i == 1:
            x = tf.concat([features, hidden], axis=-1)
            output, state = self.gru(x)
        else:
            x = tf.concat([x, hidden], axis=-1)
            output, state = self.gru(x)
        # shape == (batch_size, max_length, hidden_size)

        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)

        x = tf.reshape(x, (-1, x.shape[2]))

        # Adding Dropout and BatchNorm Layers
        x = self.dropout(x)
        x = self.batchnormalization(x)

        # output shape == (64 * 512)
        x = self.fc2(x)
        # shape : (64 * 8329(vocab))
        return x, state

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
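The baseline paragraph above mentions a small CNN encoder (three convolutional layers, kernel size 3, stride 1, each followed by max pooling) that is not shown in the published code. A minimal sketch of such an encoder under those stated assumptions (the layer widths are illustrative, not the author's exact choice):

# sketch of the simple CNN encoder used by the NIC baseline (widths chosen for illustration)
class Simple_CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(Simple_CNN_Encoder, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(64, 3, strides=1, padding='same', activation='relu')
        self.pool1 = tf.keras.layers.MaxPooling2D()
        self.conv2 = tf.keras.layers.Conv2D(128, 3, strides=1, padding='same', activation='relu')
        self.pool2 = tf.keras.layers.MaxPooling2D()
        self.conv3 = tf.keras.layers.Conv2D(256, 3, strides=1, padding='same', activation='relu')
        self.pool3 = tf.keras.layers.MaxPooling2D()
        self.flatten = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(embedding_dim)  # project to the embedding size fed to the RNN

    def call(self, x):
        # x: a batch of raw images, e.g. (batch, 224, 224, 3)
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        x = self.pool3(self.conv3(x))
        x = self.flatten(x)
        return self.fc(x)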

Using the attention mechanism:


# define the RNN decoder with Bahdanau attention:

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        """&#x521D;&#x59CB;&#x5316;&#x4E09;&#x4E2A;&#x5FC5;&#x8981;&#x7684;&#x5168;&#x8FDE;&#x63A5;&#x5C42;"""
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
"""
        description: &#x5177;&#x4F53;&#x8BA1;&#x7B97;&#x51FD;&#x6570;
        :param features: &#x7F16;&#x7801;&#x5668;&#x7684;&#x8F93;&#x51FA;
        :param hidden: &#x89E3;&#x7801;&#x5668;&#x7684;&#x9690;&#x5C42;&#x8F93;&#x51FA;
        return: &#x901A;&#x8FC7;&#x6CE8;&#x610F;&#x529B;&#x673A;&#x5236;&#x5904;&#x7406;&#x540E;&#x7684;&#x7ED3;&#x679C;context_vector&#x548C;&#x6CE8;&#x610F;&#x529B;&#x6743;&#x91CD;attention_weights
"""
        # &#x4E3A;hidden&#x6269;&#x5C55;&#x4E00;&#x4E2A;&#x7EF4;&#x5EA6;(batch_size, hidden_size) --> (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # &#x6839;&#x636E;&#x516C;&#x5F0F;&#x8BA1;&#x7B97;&#x6CE8;&#x610F;&#x529B;&#x5F97;&#x5206;, &#x8F93;&#x51FA;score&#x7684;&#x5F62;&#x72B6;&#x4E3A;: (batch_size, 64, hidden_size)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # &#x6839;&#x636E;&#x516C;&#x5F0F;&#x8BA1;&#x7B97;&#x6CE8;&#x610F;&#x529B;&#x6743;&#x91CD;, &#x8F93;&#x51FA;attention_weights&#x5F62;&#x72B6;&#x4E3A;: (batch_size, 64, 1)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # &#x6700;&#x540E;&#x6839;&#x636E;&#x516C;&#x5F0F;&#x83B7;&#x5F97;&#x6CE8;&#x610F;&#x529B;&#x673A;&#x5236;&#x5904;&#x7406;&#x540E;&#x7684;&#x7ED3;&#x679C;context_vector
        # context_vector&#x7684;&#x5F62;&#x72B6;&#x4E3A;: (batch_size, hidden_size)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
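A quick shape check of the attention module (a toy sketch with random tensors; the shapes match the InceptionV3 setup of 64 feature positions, embedding_dim = 256 and units = 512):

# toy shape check (hypothetical values, not part of the training pipeline)
attn = BahdanauAttention(units=512)
dummy_features = tf.random.uniform((64, 64, 256))    # (batch, positions, embedding_dim)
dummy_hidden = tf.random.uniform((64, 512))          # (batch, units)
ctx, w = attn(dummy_features, dummy_hidden)
print(ctx.shape)   # (64, 256)
print(w.shape)     # (64, 64, 1)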

class Rnn_Local_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Rnn_Local_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU( self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        self.fc1 = tf.keras.layers.Dense(self.units)

        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True,
                                                                     scale=True, beta_initializer='zeros',
                                                                     gamma_initializer='ones',
                                                                     moving_mean_initializer='zeros',
                                                                     moving_variance_initializer='ones',
                                                                     beta_regularizer=None, gamma_regularizer=None,
                                                                     beta_constraint=None, gamma_constraint=None)

        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.fc3 = tf.keras.layers.Dense(embedding_dim)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden,i):
        # features shape ==> (64,49,256) ==> Output from ENCODER
        # hidden shape == (batch_size, hidden_size) ==>(64,512)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size) ==> (64,1,512)

        # hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (64, 49, 1)
        # Attention Function
        '''e(ij) = f(s(t-1),h(j))'''
        ''' e(ij) = Vattn(T)*tanh(Uattn * h(j) + Wattn * s(t))'''

        # score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))

        # self.Uattn(features) : (64,49,512)
        # self.Wattn(hidden_with_time_axis) : (64,1,512)
        # tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)) : (64,49,512)
        # self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis))) : (64,49,1) ==> score

        # you get 1 at the last axis because you are applying score to self.Vattn
        # Then find Probability using Softmax
        '''attention_weights(alpha(ij)) = softmax(e(ij))'''

        # attention_weights = tf.nn.softmax(score, axis=1)

        # attention_weights shape == (64, 49, 1)
        # Give weights to the different pixels in the image
        ''' C(t) = Summation(j=1 to T) (attention_weights * VGG-16 features) '''

        # context_vector = attention_weights * features
        # context_vector = tf.reduce_sum(context_vector, axis=1)
        # features=tf.reduce_sum(features, axis=1)  # changed by the author
        # Context Vector(64,256) = AttentionWeights(64,49,1) * features(64,49,256)
        # context_vector shape after sum == (64, 256)
        context_vector, attention_weights = self.attention(features, hidden)

        # added by the author: flatten and project the image features
        features=tf.keras.layers.Flatten()(features)
        features=self.fc3(features)
        features=tf.expand_dims(features, 1)
        # the word input passes through the embedding layer; output shape: (batch_size, 1, embedding_dim) == (64, 1, 256)
        x = self.embedding(x)
        # x shape after concatenation == (64, 1,  512)
        # concatenate x with the attention result to get a new x of shape (batch_size, 1, embedding_dim + hidden_size)
        # x = tf.concat([tf.expand_dims(features, 1), x], axis=-1)    # x shape after concatenation == (64, 1,  512)
        # passing the concatenated vector to the GRU
        # print(x)
        if (i == 1):
            output, state = self.gru(features)
        else:
            output, state = self.gru(x)

        # shape == (batch_size, max_length, hidden_size)

        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)

        x = tf.reshape(x, (-1, x.shape[2]))

        # Adding Dropout and BatchNorm Layers
        x = self.dropout(x)
        x = self.batchnormalization(x)

        # output shape == (64 * 512)
        x = self.fc2(x)

        # shape : (64 * 8329(vocab))
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

Encoder and decoder:

Here we instantiate the encoder and decoder. The decoder combines a GRU with Bahdanau attention: the image features and the current hidden state are passed to the attention module to obtain a context vector and attention weights, the input word embedding is combined with this attention output into a new feature vector, which is fed to the GRU, and several fully connected layers then map the GRU output to the required size.

encoder = InceptionV3_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

Loss function and optimizer:


# define the loss function and optimizer
optimizer = tf.keras.optimizers.Adam()  # Adam: roughly, start with a larger learning rate and shrink it as training proceeds, balancing speed and quality
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(  # sparse categorical cross-entropy loss
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))  # compare with 0: True for real tokens, False for <pad>
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(  loss_)
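A toy sanity check of the masked loss (hypothetical tensors; the padded position with id 0 contributes nothing to the loss):

# hypothetical example: a batch of 3 target words, the last one is <pad> (id 0)
real_example = tf.constant([4, 7, 0])
pred_example = tf.random.uniform((3, vocab_size))   # unnormalized logits
print(loss_function(real_example, pred_example))    # the <pad> position is masked out before averaging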

4. Model training

Here we use Adam as the optimizer and sparse categorical cross-entropy as the loss, computed between the generated sentence and the ground-truth sentence. We train with teacher forcing, i.e. the next word of the ground-truth sentence is forced to be the model's next input, which greatly speeds up convergence and makes training more stable. Because the models are custom subclassed models, they can only be saved via save_weights. Finally, greedy search is used to pick the output words and assemble the sentence.


loss_plot = []

@tf.function
def train_step(img_tensor, target):
    loss = 0
    # initialize the hidden state for every batch,
    # because the captions of different images are unrelated

    # initialize the decoder hidden-state tensor
    hidden = decoder.reset_state(batch_size=target.shape[0])
    # define the first decoder input (the tensor for the start token <start>)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

    # open a context manager that records gradients
    with tf.GradientTape() as tape:
        # run the input image tensor through the encoder
        features = encoder(img_tensor)
        # decode step by step; the decoding length is target.shape[1], the maximum caption length
        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            # the decoder also expects the step index i (see its call signature above)
            predictions, hidden, _ = decoder(dec_input, features, hidden, i)
            # accumulate the loss of this decoding step
            loss += loss_function(target[:, i], predictions)

            # using teacher forcing
            # teacher forcing: the ground-truth word becomes the next decoder input
            dec_input = tf.expand_dims(target[:, i], 1)

    # after the whole sequence is decoded, compute the average per-sentence loss
    total_loss = (loss / int(target.shape[1]))
    # collect all trainable variables of the model
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    # compute the gradients with respect to these variables
    gradients = tape.gradient(loss, trainable_variables)
    # update the parameters with the gradients
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    # return the batch loss and the per-sentence average loss
    return loss, total_loss

Start training and plot the loss curve


# set the number of training epochs
EPOCHS = 101

# loop over the epochs
for epoch in range(0, EPOCHS):
    # record the start time of this epoch
    start = time.time()
    # initialize the total loss of this epoch to 0
    total_loss = 0
    # loop over every batch in the dataset
    for (batch, (img_tensor, target)) in enumerate(dataset):
        # call train_step to get the batch loss and the per-sentence average loss
        batch_loss, t_loss = train_step(img_tensor, target)
        # add the per-sentence average loss to the epoch total
        total_loss += t_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))

    # record the average epoch loss for plotting
    loss_plot.append(total_loss / num_steps)
    # print the epoch and its average loss
    print('Epoch {} Loss {:.6f}'.format(epoch + 1, total_loss / num_steps))
    # print how long the epoch took
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
    if epoch % 20 == 0:
        encoder.save_weights("/home/lxx/data/encoder_ImcepV3_att_%s.h5" % epoch)
        # save the decoder to its own file (the original code overwrote the encoder weights here)
        decoder.save_weights("/home/lxx/data/decoder_ImcepV3_att_%s.h5" % epoch)

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.savefig(fname= "VGG16_ImcepV3_100" + ".png")

[Figure: training loss curve]

5. Greedy search and evaluation metrics

def evaluate(image):
    # initialize the attention tensor used for plotting as all zeros
    attention_plot = np.zeros((max_length, attention_features_shape))
    # initialize the hidden-state tensor
    hidden = decoder.reset_state(batch_size=1)
    # preprocess the image with load_image and add a batch dimension
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    # extract the image features and reshape them to what the encoder expects
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    # encode the image
    features = encoder(img_tensor_val)
    # initialize the decoder input tensor
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    # initialize the list that will hold the caption words
    result = []
    # generate the caption from the decoder output step by step
    for i in range(max_length):
        # get the decoder output for this step
        # (i + 1 so that the first step matches i == 1 in the training loop)
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden, i + 1)
        # fill the attention tensor used for plotting with this step's attention weights
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()
        # greedy search: take the index with the highest predicted probability as predicted_id
        predicted_id = tf.argmax(predictions[0]).numpy()

        # map predicted_id back to its word and append it to the result list
        result.append(tokenizer.index_word[predicted_id])

        # stop if the predicted word is the end token <end>
        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot
        # otherwise expand this step's prediction and use it as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)

    # slice attention_plot to the real length of the prediction, dropping the unused zero rows
    attention_plot = attention_plot[:len(result), :]
    # return the result list and the sliced attention tensor
    return result, attention_plot

Define a function that visualizes the attention.


def plot_attention(image, result, attention_plot):
    """Attention visualization function."""
    # get the image as a numpy array

    temp_image = np.array(Image.open(image))

    # create a 10x10 figure
    fig = plt.figure(figsize=(10, 10))
    # length of the generated caption
    len_result = len(result)
    # loop over the words of the caption
    for l in range(len_result):
        # reshape the attention tensor of this word into an 8x8 map
        temp_att = np.resize(attention_plot[l], (8, 8))
        # create a subplot grid sized from the caption length
        ax = fig.add_subplot(len_result // 2, len_result // 2, l + 1)
        # set the subplot title to the word
        ax.set_title(result[l])
        # show the original image in the subplot
        img = ax.imshow(temp_image)
        # overlay the attention map as a grayscale layer
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())
    # adjust the subplots to fill the figure
    plt.tight_layout()
    # display the figure
    plt.show()

Sample output:

[Figure: generated caption with attention overlays]

Test on a random validation image:

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
start = time.time()
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]

remove "<unk>" in result
for i in result:
    if i == "<unk>":
        result.remove(i)

# remove <end> from result
result_join = ' '.join(result)
result_final = result_join.rsplit(' ', 1)[0]

real_appn = []
real_appn.append(real_caption.split())
reference = real_appn
candidate = result_final

print('Real Caption:', real_caption)
print('Prediction Caption:', result_final)

plot_attention(image, result, attention_plot)
print(f"time took to Predict: {round(time.time() - start)} sec")

Image.open(img_name_val[rid])

ROUGE-1 and BLEU-1 test scores

We evaluate on 100 images and average the scores (the code partly repeats the snippet above).


def Rouge_1(target, reference):  # reference is the reference text, target is the candidate text  *** one-gram / unigram model ***
    terms_reference = jieba.cut(reference)  # default (precise) segmentation mode
    terms_target= jieba.cut(target)
    grams_reference = list(terms_reference)
    grams_model = list(terms_target)
    temp = 0
    ngram_all = len(grams_reference)
    for x in grams_reference:
        if x in grams_model: temp=temp+1
    rouge_1=temp/ngram_all
    return rouge_1

# accumulators must be initialized before the loop, otherwise they are reset for every image
total_score = 0
total_roughscore = 0

for z in range(102):
    rid = np.random.randint(0, len(img_name_val))
    print(rid)
    image = img_name_val[rid]
    start = time.time()
    real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
    result, attention_plot = evaluate(image)

    first = real_caption.split(' ', 1)[1]
    real_caption = first.rsplit(' ', 1)[0]

    # remove "<unk>" in result
    for i in result:
        if i == "<unk>":
            result.remove(i)
        if i == "<start>":
            result.remove(i)

    # remove <end> from result
    result_join = ' '.join(result)
    # print(result_join)
    # print(real_caption)
    result_final = result_join.rsplit(' ', 1)[0]  # prediction

    real_appn = []
    real_appn.append(real_caption.split())  # ground truth

    print('Real Caption:', real_caption)
    print('Prediction Caption:', result_final)

    # plot_attention(image, result, attention_plot)
    # print(f"time took to Predict: {round(time.time() - start)} sec")

    # sentence_bleu expects a list of tokenized references and a tokenized hypothesis
    score = sentence_bleu(real_appn, result_final.split(), weights=(1, 0, 0, 0))
    total_score += score
    total_roughscore += Rouge_1(result_final, real_caption)

    # print(f"BLEU-1 score: {score * 100}")
    # print(f"ROUGE_1 score: {Rouge_1(result_final, real_caption) * 100}")

    if z == 100:
        print(f"average BLEU-1 score: {total_score / (z + 1)}")
        print(f"average ROUGE_1 score: {total_roughscore / (z + 1)}")
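As a quick illustration of how the two metrics are called, here is a hypothetical toy caption pair (not from the dataset):

# hypothetical example of BLEU-1 / ROUGE-1 on a single caption pair
toy_reference = 'a dog runs on the grass'
toy_candidate = 'a dog is running on grass'
print(sentence_bleu([toy_reference.split()], toy_candidate.split(), weights=(1, 0, 0, 0)))
print(Rouge_1(toy_candidate, toy_reference))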

Due to time constraints, we compare all four models trained with EPOCH = 60 and run predictions on the same 100 images. Ranked from lowest to highest on both the BLEU and ROUGE metrics, the models are NIC, VGG16+Att, Inception V3+Att and Transformer. Inception V3 clearly outperforms VGG16, which is consistent with its much deeper architecture seen in the model summaries, while the Transformer, as the currently popular model, beats the others in both training speed and evaluation scores.

[Figures: BLEU-1 and ROUGE-1 comparison of the four models at epoch 60]

An unresolved issue: because the model is a custom subclass, its variables must be given initial values (built) before load_weights can restore it. I tried many ways to perform this initialization step without success, and ended up training for one epoch and breaking out of the loop just to build the variables so the weights could be loaded.
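A common workaround (a sketch, not the author's verified solution; the checkpoint paths and tensor shapes below are assumptions based on the InceptionV3 setup above) is to build the subclassed models with one dummy forward pass before calling load_weights:

# build the variables of the subclassed models with a dummy forward pass, then load the weights
dummy_img = tf.random.uniform((BATCH_SIZE, 64, 2048))               # fake InceptionV3 features
dummy_hidden = decoder.reset_state(batch_size=BATCH_SIZE)
dummy_word = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

dummy_features = encoder(dummy_img)                                  # builds the encoder variables
_ = decoder(dummy_word, dummy_features, dummy_hidden, 1)             # builds the decoder variables

encoder.load_weights("/home/lxx/data/encoder_ImcepV3_att_100.h5")    # hypothetical checkpoint paths
decoder.load_weights("/home/lxx/data/decoder_ImcepV3_att_100.h5")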

That concludes this image captioning walkthrough!

Original: https://blog.csdn.net/weixin_43631804/article/details/123069630
Author: 林仔
Title: 基于tensorflow实现图像描述
