Keras training problem: loss is always nan and accuracy stays at a fixed value

While training VGG19 on a classification task, I ran into a problem: the loss was always nan and the accuracy stayed at a fixed value, as shown in the output below. Even adding automatic learning-rate reduction (ReduceLROnPlateau) did not solve it.


reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=10, mode='auto')

earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', verbose=1, patience=30)

history = model_vgg19.fit(train_gen,
                          validation_data=valid_gen,
                          epochs=200,
                          steps_per_epoch=len(train_gen),
                          validation_steps=len(valid_gen),
                          callbacks=[reduce_lr, earlystopping])

Output:

Epoch 1/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 2/200
176/176 [==============================] - 31s 176ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 3/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 4/200
176/176 [==============================] - 31s 176ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 5/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 6/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 7/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 8/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 9/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 10/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 11/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 12/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 13/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 14/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 15/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 16/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 17/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 18/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 19/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 20/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 21/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 22/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 23/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 24/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 25/200
176/176 [==============================] - 31s 178ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 26/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 27/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 28/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 29/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 30/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 31/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-06
Epoch 00031: early stopping

The most common cause is a learning rate that is too high. In a classification problem, too high a learning rate can make the model "stubbornly" assign some samples to the wrong class, driving the probability of the correct class to 0 (actually a floating-point underflow); cross entropy then evaluates to an infinite loss. Once that happens, differentiating that infinity with respect to the parameters yields NaN, and from then on all of the network's parameters are NaN. The fix is to lower the learning rate, or even set it to 0, and see whether the problem persists. If it disappears, the learning rate really was the culprit; if it remains, the freshly initialized network is already broken, which most likely means a bug in the implementation.
Author: 王赟 Maigo
Link: https://www.zhihu.com/question/62441748/answer/232522878
Source: Zhihu
The copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.

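The learning-rate effect described above can be reproduced without Keras at all. The sketch below is a minimal illustration (not the article's actual model): plain gradient descent on f(x) = x², where a learning rate above 1.0 makes the iterates grow without bound until they overflow to inf, after which the update computes inf - inf and the value collapses to nan, exactly the failure mode in the training log.

```python
def gradient_descent(lr, steps=2000, x=1.0):
    """Minimize f(x) = x^2 with fixed-step gradient descent."""
    for _ in range(steps):
        grad = 2.0 * x       # f'(x) = 2x
        x = x - lr * grad    # parameter update
    return x

# A stable learning rate converges toward the minimum at 0.
print(gradient_descent(lr=0.1))

# A learning rate above 1.0 diverges: x overflows to inf, and the
# next update computes inf - inf, which is nan. nan never recovers.
print(gradient_descent(lr=1.5))   # nan
```

Once a single parameter becomes nan, every later update stays nan, which is why the log shows nan from the first reported epoch onward.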

At the start of training the whole network is randomly initialized, which easily produces nan. In that case lower the learning rate: try 0.1, 0.01, 0.001 until the nan disappears. If it never disappears, the problem is probably in the network implementation.
Author: 峻许
Link: https://www.zhihu.com/question/62441748/answer/232704244
Source: Zhihu


Other possible causes of this problem, found while searching for the root cause, are recorded here:

Possible causes:

There are several reasons a loss can become nan:

  • The learning rate is too high.
  • For classification problems, use categorical cross entropy.
  • For regression problems, there may be a division by 0, which can be solved by adding a small constant to the denominator.
  • The data itself may contain NaN; check both the input and the target with numpy.any(numpy.isnan(x)).
  • The target must be something the loss function can compute; for example, the target for a sigmoid activation should be greater than 0. Check the dataset accordingly.

Author: 猪了个去
Link: https://www.zhihu.com/question/62441748/answer/232520044
Source: Zhihu

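The NaN data check from the list above takes one line per array. A self-contained sketch, with synthetic arrays standing in for the real input and target:

```python
import numpy as np

# Synthetic stand-ins for a real input batch and its target labels;
# one input value is deliberately corrupted with NaN.
x = np.array([[0.5, 1.2], [np.nan, 0.3]])
y = np.array([0.0, 1.0])

# numpy.any(numpy.isnan(...)) reports whether any element is NaN.
print(np.any(np.isnan(x)))   # True  -> the input contains NaN
print(np.any(np.isnan(y)))   # False -> the target is clean
```

Running this check on both the training and validation sets before fitting rules out corrupted data as the source of the nan loss.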

Possible cause:

nan stands for infinity or not-a-number. Infinity usually appears when dividing by 0 or computing log(0), so consider whether your network output is 0 at the point the loss is computed, since evaluating log(0) there produces nan.

Copyright notice: the above is from an original article by CSDN blogger 「accumulate_zhang」, under the CC 4.0 BY-SA license; reprints must include the original source link and this notice.
Original link: https://blog.csdn.net/accumulate_zhang/article/details/79890624
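The log(0) failure and the usual epsilon fix can be illustrated with NumPy. This is a minimal sketch; Keras's built-in cross-entropy losses already apply this kind of clipping internally:

```python
import numpy as np

# A predicted distribution where the true class got probability exactly 0
# (in practice, a floating-point underflow), and its one-hot target.
p = np.array([1.0, 0.0, 0.0])
target = np.array([0.0, 1.0, 0.0])
true_class = int(np.argmax(target))

# Naive cross entropy evaluates log(0) = -inf, so the loss is inf.
with np.errstate(divide='ignore'):
    naive_loss = -np.log(p[true_class])
print(naive_loss)   # inf

# Clipping probabilities away from 0 keeps the loss large but finite.
eps = 1e-7
safe_loss = -np.log(np.clip(p[true_class], eps, 1.0))
print(safe_loss)    # about 16.12
```

A finite-but-large loss can still be driven back down by gradient descent; an infinite one cannot, which is why the clip matters.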

Possible cause:

Source: https://discuss.gluon.ai/t/topic/8925/4
Another possibility is that random augmentation is not handled well on your images and produces some extreme augmentations. To verify, turn off some of the random crop and random expand options in the default training augmentation.

However, after I removed the image augmentation from the data preprocessing (data augmentation via ImageDataGenerator), the loss still became nan partway through training, with the following output:

Epoch 1/200
176/176 [==============================] - 27s 84ms/step - loss: 10.1720 - accuracy: 0.0182 - val_loss: 7.0242 - val_accuracy: 0.0252 - lr: 0.0010
Epoch 2/200
176/176 [==============================] - 13s 74ms/step - loss: 10.0744 - accuracy: 0.0221 - val_loss: 5.1386 - val_accuracy: 0.0168 - lr: 0.0010
Epoch 3/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0149 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 4/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 5/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 6/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 7/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 8/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 9/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 10/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 11/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 12/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 13/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 14/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 15/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 16/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 17/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 18/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 19/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 20/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 21/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 22/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 23/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 24/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 25/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 26/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 27/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 28/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 29/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 30/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 31/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 00031: early stopping

This brings the problem back to the original question of adjusting the learning rate.

Possible cause:

Exploding gradients. Remedies: lower the learning rate, clip the gradients, or normalize.

Copyright notice: the above is from an original article by CSDN blogger 「鹅似一颗筱筱滴石头~」, under the CC 4.0 BY-SA license; reprints must include the original source link and this notice.
Original link: https://blog.csdn.net/Drifter_Galaxy/article/details/104004267
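Gradient clipping, one of the remedies listed above, rescales a gradient whose norm exceeds a threshold. A minimal NumPy sketch of the idea (in Keras, the equivalent is the optimizer's clipnorm or clipvalue argument):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])            # an exploding gradient, norm = 50
clipped = clip_by_norm(g, max_norm=5.0)
print(clipped)                         # [3. 4.]
print(np.linalg.norm(clipped))         # 5.0
```

Clipping preserves the gradient's direction while bounding the step size, so a single bad batch cannot blow the parameters up to inf.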

2. The learning rate is generally inversely related to the network depth: the more layers, the smaller the learning rate usually needs to be.
3. Sometimes you can first train 5000 or more iterations with a small learning rate, save the resulting parameters, kill the training manually, and then fine-tune from those parameters; at that point you can raise the learning rate and converge faster.
4. If you train with Caffe and see no output while GPU utilization stays at 0 or very low, the loss has already gone nan; the display interval is just set too large to show it. Set display to 1 and you will see it.
Author: 峻许
Link: https://www.zhihu.com/question/62441748/answer/232704244
Source: Zhihu


Original: https://blog.csdn.net/q2972112/article/details/122149499
Author: 向日葵种植爱好者
Title: Keras training problem: loss is always nan and accuracy stays at a fixed value

