[debug] PyTorch error: ConnectionResetError: [Errno 104] Connection reset by peer

Problem description:

Training with PyTorch 1.10.0 fails with the following error:

ConnectionResetError: [Errno 104] Connection reset by peer

Problem analysis

See this PyTorch issue:

I believe the issue is only triggered for the case that both
persistent_workers and pin_memory are turned on and iteration is
terminated at the time that worker is sending data to queue. First,
persistent worker would keep iterator with workers running without
proper cleaning up (using __del__ in _MultiProcessingDataLoaderIter).

And, if any background worker (daemon process) is terminated when it
is sending data to the _worker_result_queue, such Error would be
triggered as the pin_memory_thread want to get such data from Queue.

I can send a PR
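The conditions described in the quote can be put together in a short sketch. This is a hypothetical minimal setup for illustration, not the reproduction from the issue itself: a tiny dataset so that an epoch finishes almost immediately, with both persistent_workers and pin_memory enabled.

```python
# A hypothetical minimal setup (an illustration, not the repro from the issue):
# both persistent_workers and pin_memory are on, and the epoch ends so quickly
# that the program may exit while a worker is still sending data to the queue.
import torch
from torch.utils.data import DataLoader, Dataset


class TinyDataset(Dataset):
    """A very small dataset so that one pass finishes almost immediately."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.randn(4)


if __name__ == "__main__":
    loader = DataLoader(
        TinyDataset(),
        batch_size=1,              # tiny batches -> iteration completes very fast
        num_workers=2,             # daemon worker processes feed the result queue
        persistent_workers=True,   # workers are kept alive between epochs
        pin_memory=True,           # starts the pin_memory thread that reads from the queue
    )

    for batch in loader:
        pass                       # training step would go here

    # On interpreter shutdown the daemon workers are killed; if one dies while
    # sending a batch, the pin_memory thread can hit "Connection reset by peer".
```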

Solution

For now, the workaround is to increase the batch size; you can also try the other approaches mentioned in the issue, such as the comment below:

I have experienced this issue as well where the dataloader exits with a ConnectionResetError: [Errno 104] Connection reset by peer error. I observed that this error goes away with either a) adding a sleep, or b) using larger batch sizes. I suspect there is a race condition that is triggered if the dataloader completes very quickly. I am running PyTorch 1.10.
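Putting the suggested workarounds into code, a rough sketch might look like the following. The larger batch size and the sleep come straight from the post and the quoted comment; leaving persistent_workers off is only an inference from the analysis above (the bug reportedly needs both flags on), not something stated explicitly here. The dataset and parameter values are illustrative.

```python
# Sketch of the workarounds; the dataset and parameter values are illustrative.
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Dummy stand-in for the real training data.
    dataset = TensorDataset(torch.randn(512, 4), torch.randint(0, 2, (512,)))

    loader = DataLoader(
        dataset,
        batch_size=64,             # workaround (a): a larger batch size
        num_workers=2,
        pin_memory=True,
        persistent_workers=False,  # inferred: the bug needs both flags on, so keep one off
    )

    for epoch in range(3):
        for x, y in loader:
            pass                   # training step would go here

    time.sleep(1)                  # workaround (b) from the comment: brief sleep before exiting
```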

Original: https://blog.csdn.net/qq_41683065/article/details/122643637
Author: Harry嗷
Title: [debug] PyTorch error: ConnectionResetError: [Errno 104] Connection reset by peer
