[Problem Analysis] Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model

Contents

Background

Sample

Problem analysis

Saver.restore API analysis

Background

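Searching Baidu or Google for the error "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model" turns up many write-ups of similar problems. Almost all of them propose fixes based on the path of the pretrained checkpoint file being loaded, and overlook the last word of the message, model: "Unsuccessful … Failed to find any matching files for model".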

This article uses a small sample program that reproduces the problem and walks through the TensorFlow r1.15 source code to analyze it in depth.

Sample

import tensorflow as tf

# Checkpoint file of the pretrained vgg16 model
pretrained_model = '/your/path/to/vgg_16.ckpt'

# Build the weight variable for the fc6 conv layer
fc6_conv = tf.get_variable("fc6_conv", [7, 7, 512, 4096], trainable=False)
# Build the Saver ops that restore the weights from the checkpoint and assign them to fc6_conv
restorer_fc = tf.compat.v1.train.Saver({"vgg_16/fc6/weights": fc6_conv})

with tf.Session() as sess:
  graph = tf.get_default_graph()
  fetch_list = []
  # Collect the outputs of every op whose name contains "save/Assign" into the fetch list
  for op in graph.get_operations():
    if op.name.find("save/Assign") >= 0:
      for tensor_o in op.outputs:
        fetch_list.append(tensor_o)

  # Run the Saver ops to restore the weights from the checkpoint and assign them to fc6_conv - works fine
  restorer_fc.restore(sess, pretrained_model)

  # Run the "save/Assign" ops in fetch_list - fails with "... Failed to find any matching files for model"
  sess.run(fetch_list)

The sample program and its comments are shown above. Running sess.run(fetch_list) raises an error like the one below, and the program terminates abnormally:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.

  (0) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
     [[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[save/RestoreV2/_7]]
  (1) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
     [[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.

0 derived errors ignored.

Problem analysis

Figure: computation graph of the sample program

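The computation graph of the sample program is shown above. At first glance it does not match the graph that the sample code logically describes: during graph construction, partitioning, and optimization TF adds many op nodes on its own, which is a major reason why TF static graphs are hard to debug. For example, the op that raises the error, save/Assign (red box in the figure), is an op of type Assign, yet no API call in the sample program creates it explicitly; it is one of the ops added by the Saver constructor.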

Figure: input tensors of the save/Assign op

The inputs of the save/Assign op are the outputs of the save/RestoreV2 op and the fc6_conv op (blue boxes in the figure above). The former is an op of type RestoreV2, the latter an op of type VariableV2.

//tensorflow/core/kernels/assign_op.h

  void Compute(OpKernelContext* context) override {
    const Tensor& rhs = context->input(1);

    // We always return the input ref.

    context->forward_ref_input_to_ref_output(0, 0);
...

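From the Assign kernel source above, the kernel copies the tensor at input 1 into the variable referenced by input 0 and forwards that reference as its output. According to the computation graph, input 1 of save/Assign is the output tensor of save/RestoreV2 and input 0 is the output of fc6_conv. In other words, the value that save/RestoreV2 reads from the pretrained checkpoint file is written into the fc6_conv variable, which is exactly what restoring weights means. So far so good: nothing here explains the error.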
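As a rough mental model of that kernel, the Assign semantics can be sketched in plain Python. This is not TensorFlow code: the `Variable` class and the `assign` function are hypothetical stand-ins introduced only for illustration.

```python
class Variable:
    """Hypothetical stand-in for a ref-typed TensorFlow variable."""

    def __init__(self, name, value=None):
        self.name = name
        self.value = value


def assign(ref, rhs):
    """Mimic Assign: copy rhs (input 1) into the variable (input 0).

    Like the kernel, the variable reference itself is returned as the
    output, so whoever fetches the output sees the new value.
    """
    ref.value = list(rhs)  # copy, so later changes to rhs do not leak in
    return ref


fc6_conv = Variable("fc6_conv")
restored = [0.1, 0.2, 0.3]  # pretend this came from save/RestoreV2
out = assign(fc6_conv, restored)
```

The point for the analysis is that the output of Assign cannot be produced without first producing input 1, so fetching save/Assign always forces save/RestoreV2 to run.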

//tensorflow/core/kernels/save_restore_v2_ops.cc

class RestoreV2 : public OpKernel {
...

  void Compute(OpKernelContext* context) override {
    // Path of the pretrained checkpoint file (may be a pattern with wildcards); a wrong value here triggers the error discussed in this article
    const Tensor& prefix = context->input(0);
...

        // Read the tensor values from the checkpoint file
        RestoreTensor(context, &checkpoint::OpenTableTensorSliceReader,
                      /* preferred_shard */ -1, /* restore_slice */ true,
                      /* restore_index */ i);

//tensorflow/core/kernels/save_restore_tensor.cc

void RestoreTensor(OpKernelContext* context,
                   checkpoint::TensorSliceReader::OpenTableFunction open_func,
                   int preferred_shard, bool restore_slice, int restore_index) {
  // Path of the pretrained checkpoint file (may be a pattern with wildcards)
  const string& file_pattern = file_pattern_t.flat()(0);
...

  if (!reader) {
    // Build allocated_reader, which reads the checkpoint file
    allocated_reader.reset(new checkpoint::TensorSliceReader(
        file_pattern, open_func, preferred_shard));
...

//tensorflow/core/util/tensor_slice_reader.cc

TensorSliceReader::TensorSliceReader(const string& filepattern,
 ...

  Status s = Env::Default()->GetMatchingPaths(filepattern, &fnames_);
 ...

  // Resolve the checkpoint path pattern into the matching checkpoint file paths; if no file matches, the error discussed in this article is raised
  if (fnames_.empty()) {
    status_ = errors::NotFound(
        "Unsuccessful TensorSliceReader constructor: "
        "Failed to find any matching files for ",
        filepattern);
    return;
  }

The key parts of the save/RestoreV2 kernel implementation are shown above, with comments added. If the path pattern in input 0 of save/RestoreV2 matches no checkpoint file, constructing allocated_reader fails with the "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for " error analyzed here. Note one detail: the last word of the message, model, in "Unsuccessful … Failed to find any matching files for model" comes from the variable filepattern. In most articles online, filepattern holds a checkpoint path when this error appears, not "model", so the real question is why the value model shows up in the sample.
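The file-matching step can be imitated in plain Python to see how the pattern ends up at the tail of the message. This is only an approximation under stated assumptions: `glob.glob` stands in for `Env::Default()->GetMatchingPaths`, and `NotFoundError` and `find_checkpoint_files` are hypothetical names, not TensorFlow APIs.

```python
import glob
import os
import tempfile


class NotFoundError(Exception):
    """Hypothetical stand-in for TensorFlow's errors::NotFound."""


def find_checkpoint_files(filepattern):
    """Approximate the file-matching step of the TensorSliceReader constructor."""
    fnames = sorted(glob.glob(filepattern))
    if not fnames:
        # Same message layout as the C++ source: fixed text + the pattern
        raise NotFoundError(
            "Unsuccessful TensorSliceReader constructor: "
            "Failed to find any matching files for " + filepattern)
    return fnames


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "vgg_16.ckpt")
    open(ckpt, "w").close()  # empty placeholder file, not a real checkpoint

    # A pattern that points at an existing file is resolved normally
    found = find_checkpoint_files(ckpt)

    # A pattern named "model" matches nothing in this directory
    try:
        find_checkpoint_files(os.path.join(tmp, "model"))
        message = None
    except NotFoundError as err:
        message = str(err)
```

Whatever string reaches the reader as its pattern is echoed verbatim at the end of the message, which is why a trailing model is a clue about the value of the input tensor rather than about the checkpoint on disk.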

Figure: where the value of input 0 of the RestoreV2 op comes from

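As the code analysis above shows, input 0 of the RestoreV2 op carries the all-important checkpoint path pattern. Following the tensor dependencies in the graph, input 0 of RestoreV2 comes from the output of the save/Const op, whose input comes from the save/filename op, whose input in turn is a constant string with the value "model". This explains what happens when the sample runs sess.run(fetch_list):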

  • sess.run(fetch_list) amounts to running the save/Assign op
  • Running save/Assign requires running the save/RestoreV2 op
  • Running save/RestoreV2 requires running the save/Const op, which supplies the checkpoint path pattern that the save/RestoreV2 kernel reads from input 0
  • Running save/Const requires the output of the save/filename op, and that output is the constant string "model". The path pattern seen by the save/RestoreV2 kernel at input 0 is therefore "model", and constructing allocated_reader fails with the "Unsuccessful … Failed to find any matching files for model" error analyzed here
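The dependency chain in the list above can also be checked programmatically instead of by reading the graph picture. The sketch below is a hypothetical helper, `trace_first_input`, that follows input 0 upstream using only `Graph.get_operation_by_name`, `Operation.inputs`, `Operation.type`, and `Tensor.op`. The small fake graph at the bottom exists only so the snippet runs without TensorFlow; its op types and the name save/filename/input are assumptions about what a 1.15 graph would contain, not values taken from the article.

```python
from types import SimpleNamespace


def trace_first_input(graph, op_name, max_depth=10):
    """Follow input 0 upstream from op_name, returning (name, type) pairs."""
    chain = []
    op = graph.get_operation_by_name(op_name)
    for _ in range(max_depth):
        chain.append((op.name, op.type))
        if len(op.inputs) == 0:  # reached a source op such as a constant
            break
        op = op.inputs[0].op
    return chain


# --- stand-in graph so the sketch runs without TensorFlow (assumed names) ---
def _fake_op(name, op_type, upstream=None):
    inputs = [SimpleNamespace(op=upstream)] if upstream is not None else []
    return SimpleNamespace(name=name, type=op_type, inputs=inputs)


_const = _fake_op("save/filename/input", "Const")
_filename = _fake_op("save/filename", "PlaceholderWithDefault", _const)
_save_const = _fake_op("save/Const", "PlaceholderWithDefault", _filename)
_restore = _fake_op("save/RestoreV2", "RestoreV2", _save_const)
_ops = {op.name: op for op in (_const, _filename, _save_const, _restore)}
fake_graph = SimpleNamespace(get_operation_by_name=lambda name: _ops[name])

chain = trace_first_input(fake_graph, "save/RestoreV2")
for name, op_type in chain:
    print(name, op_type)
```

Against the real sample, the call would presumably be trace_first_input(tf.get_default_graph(), "save/RestoreV2"), made after the Saver has been constructed.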

Saver.restore API analysis

//tensorflow/python/training/saver.py

class Saver:
  ...

  def restore(self, sess, save_path):
    """Restores previously saved variables.

    This method runs the ops added by the constructor for restoring variables.

    It requires a session in which the graph was launched.  The variables to
    restore do not have to have been initialized, as restoring is itself a way
    to initialize variables.

    The save_path argument is typically a value previously returned from a
    save() call, or a call to latest_checkpoint().

    Args:
      sess: A Session to use to restore the parameters. None in eager mode.

      save_path: Path where parameters were previously saved.

    Raises:
      ValueError: If save_path is None or not a valid checkpoint.
    """
    if self._is_empty:
      return
    if save_path is None:
      raise ValueError("Can't load save_path when it is None.")

    checkpoint_prefix = compat.as_text(save_path)
...

        # The application passes the checkpoint file path in through the save_path argument
        sess.run(self.saver_def.restore_op_name,
                 {self.saver_def.filename_tensor_name: save_path})

Attentive readers will ask: if executing the RestoreV2 op in the graph raises the error, why does restorer_fc.restore(sess, pretrained_model) in the sample, which also executes RestoreV2, not fail? As the Saver.restore source above shows, the sample passes the checkpoint file path to the restore API, and the TF code feeds that path into the graph through the feed dict of its sess.run call. When the save/RestoreV2 kernel runs, the path pattern in input 0 is therefore the path given to restore rather than the default value "model", so the checkpoint file is found and read correctly.
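If the restore-related ops really have to be fetched by hand, the same feed can be supplied explicitly. The helper below is a minimal sketch, not part of TensorFlow: `run_with_checkpoint_path` is a hypothetical name, and it relies only on `saver.saver_def.filename_tensor_name` and `sess.run(fetches, feed_dict=...)`, the same two pieces that appear in the Saver.restore source.

```python
def run_with_checkpoint_path(sess, saver, fetches, save_path):
    """Run fetches while feeding save_path into the Saver's filename tensor.

    This mirrors the feed that Saver.restore passes to sess.run, so ops that
    depend on save/RestoreV2 see save_path instead of the default "model".
    """
    feed = {saver.saver_def.filename_tensor_name: save_path}
    return sess.run(fetches, feed_dict=feed)
```

In the sample, this would presumably replace the failing call with run_with_checkpoint_path(sess, restorer_fc, fetch_list, pretrained_model). Since restorer_fc.restore already assigns the weights, fetching save/Assign directly is rarely necessary in the first place.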

Original: https://blog.csdn.net/HaoBBNuanMM/article/details/123735318
Author: HaoBBNuanMM
Title: 【问题分析】Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
