incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

一、prometheus和grafana 简介

Prometheus 是继 Kubernetes 之后的第二个 CNCF “毕业” 项目,其监控理念传承于由谷歌研发的一款内部监控软件,现主要开发语言为 go,代码目前已经托管在 github 中,遵从 apache 2.0 开源协议,受欢迎的程度非常高,github 地址为:https://github.com/prometheus/。

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:e73f2af2-cb4f-4b88-9b03-a4fdd81bffe1

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:2dcd2316-0118-4e09-8856-355cbfce4a7c

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:118f55b3-4ce7-4b23-bce5-054a5450d817

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:c7c60e92-1147-4996-acea-ac8cc5809918

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d3fe6ea6-566d-4201-8d2b-0c08074bf83f

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:820306be-b72d-4946-9e51-f29786f5091a

prometheus的优势:

  • 易于管理,通俗易懂,prometheus在使用时只有一个二进制的执行文件,安装非常简单,不依赖于任何别的应用。
  • 能够轻易获取服务内部状态,比如jvm等。
    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:49e214bf-a84f-420f-b543-ada3e40d7909
    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:db1f67b4-b1e2-490f-a4cf-85ea41345288

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:fae6f0ac-8bc3-48ed-9d23-46b288afaa52

    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d3af7dba-d664-499f-aa4e-0a0f721a4691

  • 采用http协议,默认pull模式拉取数据,也可以通过中间网关push数据(需要安装push gateway)
  • 支持自动发现(通过服务的方式进行自动发现待监控的目标,可以通过Consul实现服务发现)
    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:eb7c17ae-58cd-4cb0-958f-ba7055dd6305
    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:fb884612-ac73-472e-bd94-20119a7eb708

  • 易集成,可以和grafana 快速集成。

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

备注:此架构图摘自prometheus官方网站

prometheus根据配置定时可以去拉取各个节点的数据,默认使用的拉取方式是pull,当然也可以支持使用pushgateway提供的push方式去获取各个监控节点的数据。将获取到的数据存入TSDB(时序型数据库),pushgateway 就是 外部的应用可以将监控的数据主动推送给pushgateway,然后prometheus 自动从pushgateway进行拉取。此时prometheus已经获取到了监控数据,可以使用内置的PromQL进行查询。它的报警功能使用Alertmanager提供,Alertmanager是prometheus的告警管理和发送报警的一个组件。prometheus原生的图标功能由于过于简单,尤其是可视化的界面比较简单,因此建议将prometheus数据接入grafana,由grafana进行统一管理。

Grafana是开源的可视化监控、数据分析利器,支持多种数据库类型和丰富的套件,目前已支持超过50多个数据源,50多个面板,17个应用程序和1700多个不同的仪表图。(本文作者:张永清,转载请注明来源博客园:https://www.cnblogs.com/laoqing/p/14538635.html)

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:a3b6141b-267d-4ca6-bb3e-69904c71758f

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:1290b56d-0738-4072-b89c-9f5e97fb9ef3

* 支持许多不同的时间序列数据(数据源)存储后端。每个数据源都有一个特定查询编辑器。官方支持数据源:Graphite、infloxdb、opensdb、prometheus、elasticsearch、cloudwatch,mysql ,zabbix等。每个数据源的查询语言和功能有较大差异。可以将来自多个数据源的数据组合到一个仪表板上,但每个面板都要绑定到指定的数据源中,如下图所示。

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d24f762e-6613-4cbe-b418-138c74ccf076

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d420fb9b-6026-4b23-870d-38202c11898b

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:9f090681-d95f-4ce3-acdc-12cc0d737980

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:037f6325-416e-4805-8562-e3f0277d4dce

二、incubator-dolphinscheduler 简介

incubator-dolphinscheduler是一个由国内公司发起的大数据领域的开源调度项目,incubator-dolphinscheduler 能够支撑非常多的应用场景,包括:

  • 以DAG图的方式将Task按照任务的依赖关系关联起来,可实时可视化监控任务的运行状态
  • 支持丰富的任务类型:Shell、MR、Spark、SQL(mysql、postgresql、hive、sparksql),Python,Sub_Process、Procedure,flink,datax,sqoop,http等
  • 支持工作流定时调度、依赖调度、手动调度、手动暂停/停止/恢复,同时支持失败重试/告警、从指定节点恢复失败、Kill任务等操作
    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:1fe249e4-0951-4ddf-81a3-beb02bf50718
    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:4017b9f1-284c-419f-93b9-6cefbeb921c3

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:f2fbe6a0-6b82-4b0f-95a8-fabd41921a7c

    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:c990458e-6be4-4536-a531-5303d40b2614

  • 支持资源文件的在线上传/下载,管理等,支持在线文件创建、编辑
    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:fb1b9701-4fc8-4b0e-aa93-75b98bf7de36
    [En]

    [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:ebdb551e-cbe3-4f21-a008-3f5b8aeecbb1

  • 实现集群HA,通过Zookeeper实现Master集群和Worker集群去中心化
  • 支持对 Master/Worker cpu load,memory,cpu在线查看
  • 支持工作流运行历史树形/甘特图展示、支持任务状态统计、流程状态统计
  • 支持补数
  • 支持多租户
  • 支持国际化

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:493da4bf-562a-4fb0-9ca3-db083d2e8abb

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6e943630-029f-421a-8b00-937afe7d7d51

三、incubator-dolphinscheduler 如何快速接入到 prometheus和grafana 中进行监控

1、通过prometheus中push gateway的方式采集监控指标数据。

需要借助push gateway一起,然后将数据发送到push gateway 地址中,比如地址为http://10.25x.xx.xx:8085,那么就可以写一个shell 脚本,通过crontab调度或者incubator-dolphinscheduler调度,定期运行shell脚本,来发送指标数据到prometheus中。

#!/bin/bash
failedTaskCounts=mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'failed' as failTotal ,count(distinct(process_definition_id))
as failCounts from dolphinscheduler.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'
echo "failedTaskCounts:${failedTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <//10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
failedSchedulingTaskCounts $failedTaskCounts
EOF

这段脚本中failedSchedulingTaskCounts 就是定义的prometheus中的一个指标。脚本通过sql语句查询出失败的任务数,然后发送到prometheus中。

然后在grafana 中就可以选择数据源为prometheus,并且选择对应的指标。

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

比如可以通过如下shell 脚本采集正在运行的任务数,然后通过push gateway 发送到prometheus中。(本文作者:张永清,转载请注明来源博客园:https://www.cnblogs.com/laoqing/p/14538635.html)

#!/bin/bash
runningTaskCounts=mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'running' as runTotal ,count(distinct(process_definition_id))  as runCounts from dolphinscheduler.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'
echo "runningTaskCounts:${runningTaskCounts}"job_name="Scheduling_system"
instance_name="dolphinscheduler"
if [ "${runningTaskCounts}yy" == "yy" ];then
runningTaskCounts=0
fi
cat <//10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
runningSchedulingTaskCounts $runningTaskCounts
EOF

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:be72dc71-b24e-47ea-bc82-00dd72ce3aab

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:8ecd5012-8903-492b-bb70-283d7b1d6986

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

在配置好指标后,还可以配置该指标的告警,如下图所示,点击Create Alert 按钮,按照页面提示,即可以配置告警通知。

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:8edb1ab1-5ab7-47c1-9337-af561ceaa64f

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:19d0689e-4d8b-4edf-ba78-544778e5f0bf

1)、失败的工作流实例数

failedInstnceCounts=mysql -h 10.25x.xx.xx -u username-ppassword -e "select 'failed' as failTotal ,count(1) as failCounts from dolphinscheduler.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'
echo "failedInstnceCounts:${failedInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <//10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
failedSchedulingInstanceCounts $failedInstnceCounts
EOF

2)、等待中的工作任务流数

waittingTaskCounts=mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'waitting' as waitTotal ,count(distinct(process_definition_id)) as waitCounts from dolphinscheduler.t_ds_process_instance where state in(10,11) and start_time>='${sevenDayAgo} 00:00:00'" |grep "waitting"|awk -F " " '{print $2}'
echo "waittingTaskCounts:${waittingTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <//10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
waittingSchedulingTaskCounts $waittingTaskCounts
EOF

3)、运行中的工作流实例数

runningInstnceCounts=mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'running' as runTotal ,count(1)  as runCounts from dolphinscheduler.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'
echo "runningInstnceCounts:${runningInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
if [ "${runningInstnceCounts}yy" == "yy" ];then
runningInstnceCounts=0
fi
cat <//10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
runningSchedulingInstnceCounts $runningInstnceCounts
EOF

2 、通过grafana 直接查询dolphinscheduler自身 的Mysql数据库(也可以是别的数据库)

首先需要在grafana 中定义一个数据源,这个数据源就是dolphinscheduler自身 的Mysql数据库。

然后在grafana 中选择这个数据源,Format as 格式选择table,输入对应的sql语句。(本文作者:张永清,转载请注明来源博客园:https://www.cnblogs.com/laoqing/p/14538635.html)

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

比如如下sql,统计本周以及当日正在运行的调度任务的情况。

select d.*,ifnull(f.today_runCount,0) as today_runCount,ifnull(e.today_faildCount,0) as today_faildCount,ifnull(f.today_avg_timeCosts,0) as today_avg_timeCosts,ifnull(f.today_max_timeCosts,0) as today_max_timeCosts,
ifnull(g.week_runCount,0) as week_runCount,ifnull(h.week_faildCount,0) as week_faildCount,ifnull(g.week_avg_timeCosts,0) as week_avg_timeCosts,ifnull(g.week_max_timeCosts,0) as week_max_timeCosts from
(select a.id,c.name as project_name,a.name as process_name,b.user_name,a.create_time,a.update_time from t_ds_process_definition a,t_ds_user b, t_ds_project c  where a.user_id=b.id and c.id=a.project_id and a.release_state=$status) d
left join
(select count(1) as today_faildCount,process_definition_id from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and  start_timeDATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_id ) e  on d.id=e.process_definition_id
left join
(select count(1) as today_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_max_timeCosts,process_definition_id from
t_ds_process_instance  where start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and  start_timeDATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_id ) f on d.id=f.process_definition_id
left join
(select count(1) as week_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_max_timeCosts,process_definition_id from
t_ds_process_instance  where start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00') and  start_timeDATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_id ) g
on d.id=g.process_definition_id  left join
(select count(1) as week_faildCount,process_definition_id from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00')  and  start_timeDATE_FORMAT( SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_id ) h
on d.id=h.process_definition_id

这些配置完后,保存就可以得到如下的表格:(本文作者:张永清,转载请注明来源博客园:https://www.cnblogs.com/laoqing/p/14538635.html)

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

通过类似上述方式的配置,在输入sql后,选择Graph视图时,还可以生成任务耗时的曲线图,如下所示

select (UNIX_TIMESTAMP(a.end_time)-UNIX_TIMESTAMP(a.start_time)) as timeCosts, UNIX_TIMESTAMP(a.end_time) as time   from t_ds_process_instance a,t_ds_process_definition b where end_time>=DATE_FORMAT( DATE_SUB(CURDATE(), INTERVAL 1 MONTH), '%Y-%m-01 00:00:00')  and end_time is not null and a.process_definition_id=b.id and b.name='$process_name'

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:83745fc6-bd08-4247-b8e7-f9c1deab4b54

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:47f19e71-8641-4330-bc70-0e8d018e22f6

Original: https://www.cnblogs.com/laoqing/p/14538635.html
Author: 张永清
Title: incubator-dolphinscheduler 如何在不写任何新代码的情况下,能快速接入到prometheus和grafana中进行监控

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/562807/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球