StarRocks BE node crash, finding the cause and possible fixes: std::bad_alloc

Problem analysis

All five StarRocks BE nodes suddenly went offline within a few minutes. The be.out log on the BE nodes showed the following:

tcmalloc: large alloc 1811947520 bytes == 0x77f9f0000 @  0x384f94f 0x39ce2dc 0x399646a
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
*** Aborted at 1641348199 (unix time) try "date -d @1641348199" if you are using GNU date ***
PC: @     0x7fa8c7db4387 __GI_raise
*** SIGABRT (@0x2ab9) received by PID 10937 (TID 0x7fa7f0658700) from PID 10937; stack trace: ***
    @          0x2da5562 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa8c99cc630 (unknown)
    @     0x7fa8c7db4387 __GI_raise
    @     0x7fa8c7db5a78 __GI_abort
    @          0x12e91ff _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x391d6f6 __cxxabiv1::__terminate()
    @          0x391d761 std::terminate()
    @          0x391d8b5 __cxa_throw
    @          0x12e80de _ZN12_GLOBAL__N_110handle_oomEPFPvS0_ES0_bb.cold
    @          0x39ce27e tcmalloc::allocate_full_cpp_throw_oom()
    @          0x399646a std::__cxx11::basic_string<>::_M_mutate()
    @          0x3996e90 std::__cxx11::basic_string<>::_M_replace_aux()
    @          0x1c5c4fd apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
    @          0x1c5c6ac apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
    @          0x1e3d3c9 apache::thrift::TDispatchProcessor::process()
    @          0x2d91062 apache::thrift::server::TConnectedClient::run()
    @          0x2d88d13 apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
    @          0x2d8ab10 apache::thrift::concurrency::Thread::threadMain()
    @          0x2d7c500 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvSt10shared_ptrIN6apache6thrift11concurrency6ThreadEEES8_EEEEE6_M_runEv
    @          0x3998d40 execute_native_thread_routine
    @     0x7fa8c99c4ea5 start_thread
    @     0x7fa8c7e7c9fd __clone
Looking at the log, the key phrase is: std::bad_alloc

The stack trace shows the failing request: a single allocation of 1811947520 bytes (about 1.69 GiB) made while a std::string was being resized inside Thrift's readStringBody. Clearly the node did not have enough memory for it, and an avalanche effect followed, taking down one BE after another. With more nodes in the cluster, they might not all have gone down.
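To read the two numbers in the log more easily, the allocation size and the abort timestamp can be pulled out with a few lines of Python. This is only a minimal sketch: the function name summarize_crash and the two regular expressions are made up for this example, and the patterns are matched against the exact log lines shown above, not against every possible be.out format.

```python
import re
from datetime import datetime, timezone

# Two lines copied from the be.out excerpt above.
BE_OUT_SAMPLE = """\
tcmalloc: large alloc 1811947520 bytes == 0x77f9f0000 @  0x384f94f 0x39ce2dc 0x399646a
*** Aborted at 1641348199 (unix time) try "date -d @1641348199" if you are using GNU date ***
"""

# Hypothetical patterns, written only for the two line shapes above.
ALLOC_RE = re.compile(r"tcmalloc: large alloc (\d+) bytes")
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")


def summarize_crash(log_text):
    """Return (largest_alloc_bytes, abort_time_utc); either value may be None."""
    allocs = [int(m.group(1)) for m in ALLOC_RE.finditer(log_text)]
    abort = ABORT_RE.search(log_text)
    when = None
    if abort:
        when = datetime.fromtimestamp(int(abort.group(1)), tz=timezone.utc)
    return (max(allocs) if allocs else None, when)


largest, when = summarize_crash(BE_OUT_SAMPLE)
print(f"largest allocation: {largest / 2**30:.2f} GiB")
print(f"aborted at (UTC): {when.isoformat()}")
```

For the excerpt above this reports an allocation of about 1.69 GiB and an abort time of 2022-01-05 02:03:19 UTC, which is the same instant that the date -d command suggested in the log would print in the local time zone.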

BE is written in C++. For an explanation of the error, see: https://www.zhihu.com/question/24926411

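When operator new throws bad_alloc, it signals a fairly serious resource problem: memory cannot be allocated and objects cannot be constructed, so the program cannot carry on with its original logic, and there may not even be enough memory left for cleanup.
In that situation, letting the program die is the right thing to do.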

Solutions

The best fix is, of course, to add memory. Memory use inevitably grows with data volume, and the existing capacity may not cope with a sudden surge in imported data.

The current StarRocks version (1.19) has the following configuration items:

mem_limit=80%   # Share of the machine's total memory that BE may use. No need to set it if BE is deployed on its own; set it explicitly if BE shares the machine with other memory-hungry services
load_process_max_memory_limit_bytes=107374182400    # Upper limit on the memory used by all import (load) threads on a single node, 100GB
load_process_max_memory_limit_percent=80    # Upper limit, as a percentage, on the memory used by all import (load) threads on a single node, 80%

Setting these options limits memory usage.
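As a rough illustration of how the three values interact, the sketch below computes the limits for a machine of a given size. It rests on an assumption that the source does not state: that the effective load limit is the smaller of the percentage-based value (taken relative to the BE memory limit) and the absolute byte cap. The function name effective_load_limit and the 128 GiB machine size are hypothetical.

```python
GIB = 1024 ** 3


def effective_load_limit(machine_mem_bytes, mem_limit_percent,
                         load_limit_percent, load_limit_bytes):
    """Return (be_limit_bytes, load_limit_bytes) for one BE node.

    ASSUMPTION (not confirmed by the article): the load limit is
    min(BE limit * load percent, absolute byte cap).
    """
    be_limit = machine_mem_bytes * mem_limit_percent // 100
    load_by_percent = be_limit * load_limit_percent // 100
    return be_limit, min(load_by_percent, load_limit_bytes)


be_limit, load_limit = effective_load_limit(
    machine_mem_bytes=128 * GIB,      # hypothetical machine size
    mem_limit_percent=80,             # mem_limit=80%
    load_limit_percent=80,            # load_process_max_memory_limit_percent=80
    load_limit_bytes=107374182400,    # load_process_max_memory_limit_bytes (100 GiB)
)
print(f"BE limit:   {be_limit / GIB:.2f} GiB")
print(f"load limit: {load_limit / GIB:.2f} GiB")
```

Under that assumption, on a 128 GiB machine the percentage is the binding limit (roughly 82 GiB for loads out of roughly 102 GiB for BE), while on a much larger machine the 100 GiB byte cap would take over.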

For other memory tuning parameters, see:

It is recommended to set /proc/sys/vm/overcommit_memory to 1 (the current value can be checked with cat /proc/sys/vm/overcommit_memory):

echo 1 | sudo tee /proc/sys/vm/overcommit_memory
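The three values this kernel setting accepts are easy to mix up, so a small check can be handy before and after changing it. The following is a minimal sketch: the helper names read_overcommit_mode and describe_mode are invented for this example, and the descriptions paraphrase the Linux kernel documentation.

```python
# Meaning of each vm.overcommit_memory value, paraphrased from the kernel docs.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit, never refuse an allocation request",
    2: "strict accounting, refuse requests beyond the commit limit",
}


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current mode as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def describe_mode(mode):
    """Return a one-line description such as '1: always overcommit, ...'."""
    return f"{mode}: {OVERCOMMIT_MODES.get(mode, 'unknown mode')}"


current = read_overcommit_mode()
if current is None:
    print("overcommit_memory is not readable on this system")
else:
    print(describe_mode(current))
```

Note that a value written through /proc lasts only until the next reboot; to keep it, the usual approach is to add vm.overcommit_memory=1 to /etc/sysctl.conf.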

Memory tables: StarRocks can cache all of a table's data in memory to speed up queries. Memory tables are suited to dimension tables with a small number of rows.

In practice, however, memory table optimization is still immature, so it is recommended not to use memory tables for now.

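The newer StarRocks release (2.0) optimizes memory management, which can also solve the problem to some extent:

  • Memory management optimization
  • Refactored memory statistics/control framework that tracks memory usage precisely, to resolve OOM at the root
  • Optimized metadata memory usage
  • Fixed the problem of large memory releases blocking execution threads for a long time
  • Graceful process exit mechanism, with support for memory leak checking (#1093)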


Original: https://www.cnblogs.com/quqibing/p/15784766.html
Author: 曲奇饼AI
Title: StarRocks BE节点崩溃原因查找及解决思路:std::bad_alloc
