After Kafka ETL, How Should We Define the Next Generation of Real-Time Data Integration?

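Over the past decade, big data technology, with Hadoop as its standard-bearer, developed at full speed. Data platforms, data lakes, data middle platforms, and similar products and solutions appeared one after another. Their most common use is to gather enterprise data in one place and analyze that offline data to support decision-making or marketing: traditional BI reports, data dashboards, tag-based user profiles, and so on.

But alongside these analytical workloads (OLAP), enterprises also run interactive workloads with much stricter real-time requirements (OLTP, or operational applications). Examples include unified product or order lookup in e-commerce, real-time risk control in finance, and customer data platforms (CDP) in the service industry. For the business, these are usually mission-critical.

Beyond OLTP, a new generation of operational analytics is also becoming a mainstream data application. Operational analytics likewise needs the latest real-time data from business systems so that the business can respond promptly.

For these interactive applications and operational analytics scenarios, traditional big data platforms offer only limited support for real-time data and cannot serve them effectively. Today we look at the question of real-time data:

  • What is real-time data?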

1. What Is Real-Time Data

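Placing real-time data technology within the full picture of a data architecture may make it easier to understand what real-time data actually means. In the article Emerging Architectures for Modern Data Infrastructure, Matt Bornstein of a16z gives a good summary of the main components of the new generation of data infrastructure. He divides the full lifecycle of data infrastructure into several stages: Ingestion, Storage, Query & Processing, Transformation, and Analysis & Output.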

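Real-time capture and sync / real-time integration

Take digital epidemic prevention as an example. During the special period of pandemic control, the "nucleic acid test code" and the "health code" became necessities for daily travel. Especially when testing was frequent, delays in displaying test results routinely disrupted work and daily life. The core of solving this problem is the real-time, accurate collection of test data.

Real-time computing

Take data visualization dashboards as an example. As a common aid in the digital transformation of traditional manufacturing, a real-time production capacity dashboard can provide the data basis for automated production, monitor operations in real time and raise timely alerts, and help the company analyze product cycles for precise insight and business development. The foundation for all of this is real-time data: collecting and computing data from all systems in real time makes analysis and presentation more efficient.

Real-time analytics

Take anti-fraud platforms in finance as an example. As the user base grows, historical customer behavior data keeps accumulating. Because data must be analyzed and processed without affecting transactions, risk control today faces more real-time challenges, and traditional data warehouse architectures increasingly cannot keep up. Customer behavior data across channels therefore has to be analyzed in real time, with immediate feedback, to quickly separate fraudulent transactions from legitimate ones and support fast decisions.

Real-time query and serving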

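Whichever real-time data processing technology is used, one fact holds: we need a real-time data source. Doris, as a real-time data warehouse, can only deliver timely, useful insight if it receives fresh data first. Flink's real-time stream computing needs Kafka to feed it data. And Kafka, as a message queue, leaves it to users to push source data into Kafka topics. In short, every real-time data technology depends on its source: the acquisition and integration of real-time data.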

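  • Custom API integration or low-code APIs
  • ETL tools
  • ESB or MQ
  • Kafka

Let's look at the pros and cons of each of these solutions.

API, ETL, and ESB

API integration is a solution that requires no third-party tools. The engineering team typically builds custom API endpoints according to the data-sharing requirements, then tests and releases them to support downstream services.

Pros of this approach: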

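* Highly customizable

The cons are also fairly obvious:

  • As requirements grow, development cost becomes high, and API management brings new problems of its own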
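  • The API approach is a poor fit for scenarios that deliver full datasets or large data volumes

ETL copies the required data into downstream systems by extraction. Depending on the volume of requirements, this can be done with hand-written scripts that run on a schedule or with a mature ETL tool.

Pros: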

Cons:

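* ETL pipelines cannot be reused. Every new business needs a fair number of ETL pipelines of its own, so their number balloons and management becomes difficult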
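
The "script that runs on a schedule" style of ETL mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses in-memory SQLite, and the table names `orders` and `orders_copy`, the `updated_at` column, and the function `run_batch_etl` are hypothetical, not taken from any particular tool. It also shows why such a job is not real-time: a change in the source becomes visible downstream only after the next run.

```python
import sqlite3

def run_batch_etl(source, target, last_watermark):
    """Copy rows changed since last_watermark from source to target.

    Returns the new watermark (the largest updated_at value seen so far).
    """
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    for row_id, amount, updated_at in rows:
        # Upsert so that re-running the job stays idempotent.
        target.execute(
            "INSERT OR REPLACE INTO orders_copy (id, amount, updated_at) "
            "VALUES (?, ?, ?)",
            (row_id, amount, updated_at),
        )
        last_watermark = max(last_watermark, updated_at)
    target.commit()
    return last_watermark

# Hypothetical source and downstream databases.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
)
target.execute(
    "CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
)

source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, 10.0, 100), (2, 20.0, 101)]
)
watermark = run_batch_etl(source, target, last_watermark=0)

# A change made after the run stays invisible downstream until the next run.
source.execute("UPDATE orders SET amount = 25.0, updated_at = 105 WHERE id = 2")
stale = target.execute("SELECT amount FROM orders_copy WHERE id = 2").fetchone()[0]

watermark = run_batch_etl(source, target, watermark)
fresh = target.execute("SELECT amount FROM orders_copy WHERE id = 2").fetchone()[0]
```

Between the two runs the downstream copy still holds the old amount; only the second run brings it up to date. Each new downstream consumer would also need its own copy of this kind of job, which is the reuse problem described above.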

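An ESB (enterprise service bus) connects independent software systems through a central bus using a set of interface standards. Any system that needs to pass data or messages to another system can relay them through the ESB. The architectural advantages of an ESB are: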

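* Lower integration cost between systems, since each one only needs to interact with the standard ESB middleware
* Data arrives in real time

But the ESB has proven to be a thing of the past. Its main faults are:

  • Interface definitions are extremely tedious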

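Today's Mainstream: Kafka ETL

Ten years ago, as big data technology took off, a messaging middleware called Kafka became popular and quickly took a dominant architectural position in real-time data integration. As the most mainstream representative of event streaming platforms, Apache Kafka started out as just a distributed log store. Kafka Connect and Kafka Streams were added later. With these capabilities, Kafka can be used to build a real-time ETL pipeline that meets the need for real-time data integration between business systems inside an enterprise.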
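
To make the "log store" idea concrete, here is a toy, single-process sketch of an append-only log with per-consumer offsets, with a tiny extract-transform-load loop on top. It is not the Kafka API: `ToyLog`, `transform`, and the group names are hypothetical stand-ins, and a real pipeline would rely on Kafka Connect, Kafka Streams, or a client library.

```python
from dataclasses import dataclass, field

@dataclass
class ToyLog:
    """An in-memory stand-in for one topic partition (not Kafka itself)."""
    records: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer group -> next offset

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the appended record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        # Advance this group's offset past the records just returned.
        self.offsets[group] = start + len(batch)
        return batch

def transform(order):
    # Example transformation: convert cents to a decimal amount.
    return {"order_id": order["id"], "amount": order["cents"] / 100}

log = ToyLog()
log.append({"id": 1, "cents": 1250})
log.append({"id": 2, "cents": 990})

# The "ETL job" reads from the log, transforms, and loads into a target list.
warehouse = []
for record in log.poll("etl-job"):
    warehouse.append(transform(record))

# A second, independent consumer group reads the same log from the start.
audit_copy = log.poll("audit-job")

# The ETL group has consumed everything so far, so it only sees new records.
log.append({"id": 3, "cents": 500})
new_for_etl = log.poll("etl-job")
```

The point of the sketch is that records stay in the log and each consumer tracks its own position, which is what lets several downstream systems read the same stream independently.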

But the shortcomings of this Kafka-based real-time ETL architecture are very clear:

* Requires Java development, which is beyond the skills of a typical data engineer

DaaS: Solving Data Integration as a Service

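  • API: development is tedious, and the intrusion on source-system performance is high
  • ETL: not real-time enough and not effectively reusable, producing a spaghetti tangle of pipelines
  • ESB: has the advantage of centralization, but is too expensive, too weak in performance, and already outdated
  • Kafka: complex architecture, non-trivial development cost, and very hard troubleshooting for critical data

DaaS (Data as a Service) is a newer data architecture. It consolidates an enterprise's core data, physically or logically, and then serves downstream consumers in a standardized way, as shown in the figure below: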

[Figure: DaaS architecture]

The advantages of this architecture are:

* Standardized interfaces

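But traditional DaaS solutions, such as data middle platform implementations, have one major limitation: data is loaded through batch collection, so it is not fresh or real-time enough, which directly reduces the business value of traditional DaaS.

3. Next-Generation DaaS: A Data Service Platform with Built-In Real-Time ETL

Looking ahead to the next decade, Tapdata Live Data Platform (Tapdata LDP), a new-generation real-time data platform developed entirely in-house, has arrived. As a new kind of data platform implementing the DaaS architecture, its core breakthrough is a complete, self-built capability for CDC-based real-time integration of heterogeneous data, combined with computation and modeling. By continuously capturing changes from dozens of source database types in real time and delivering them into DaaS storage, it keeps the data in the DaaS layer continuously up to date. This realizes the idea of an Incremental DaaS, an incremental data service platform. With this real-time enhancement of DaaS, Tapdata takes on the mission of reducing the number of enterprise ETL pipelines from N to 1. Building on its core "end-to-end real-time" technology, it aims to accelerate connecting and unifying data silos, create a one-stop real-time data service foundation, and supply comprehensive, complete, accurate, and fresh data for "Warehouse Native" data-driven business.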
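
As a rough illustration of how a stream of CDC events can keep a serving store continuously up to date, consider the sketch below. The event shape (`op`, `table`, `key`, `after`) and the function `apply_change` are assumptions made for this example only; they are not Tapdata's actual event format or API.

```python
def apply_change(store, event):
    """Apply one change event to an in-memory store keyed by (table, primary key)."""
    key = (event["table"], event["key"])
    if event["op"] in ("insert", "update"):
        # Merge the changed fields into the current row image.
        row = dict(store.get(key, {}))
        row.update(event["after"])
        store[key] = row
    elif event["op"] == "delete":
        store.pop(key, None)
    else:
        raise ValueError(f"unknown op: {event['op']}")

# Hypothetical change events, in the order the source database produced them.
events = [
    {"op": "insert", "table": "customers", "key": 1,
     "after": {"name": "Ann", "tier": "basic"}},
    {"op": "update", "table": "customers", "key": 1,
     "after": {"tier": "gold"}},
    {"op": "insert", "table": "customers", "key": 2,
     "after": {"name": "Bob", "tier": "basic"}},
    {"op": "delete", "table": "customers", "key": 2},
]

store = {}
for event in events:
    apply_change(store, event)
```

Because each event is applied as it arrives, the store always reflects the latest committed state of the source, in contrast with a batch load that only catches up at the next scheduled run.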

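Features

  • The first real-time data platform to support both TP and AP workloads
  • Two core capabilities: real-time data integration (ETL) and real-time data services (DaaS)
  • The first to solve real-time data integration with a DaaS approach, cutting the number of ETL pipelines several-fold
  • Real-time replication for 40+ data sources, breaking down enterprise data silos right away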
  • For programmers: the first Fluent ETL, which processes data with code rather than SQL
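  • Open and open source: use the PDK to extend integration and processing capabilities yourself, gaining real-time capabilities for free while giving back to the community

Real-Time Use Cases

  • Real-time data pipelines: Tapdata can replace CDC + Kafka + Flink. A complete pipeline for data capture and delivery can be built in minutes, avoiding pitfalls large and small such as error-prone CDC capture, Kafka backlogs, and hard-to-trace pipelines.
  • Real-time ETL: Tapdata can replace ETL tools or scripts such as Kettle, Informatica, or Python. UDFs written in JS or Python extend its processing capabilities, distributed deployment provides higher processing performance, and drag-and-drop development makes data development simpler. Custom operators can also be used to quickly extend the platform's data processing capabilities.
  • Real-time wide tables: From big data analytics to data warehousing to dashboards, data engineers run large numbers of batch jobs to produce the wide tables or views used for display and analysis. Building these wide tables usually consumes substantial resources, and the data is not updated promptly. Tapdata's incremental wide-table building can deliver the freshest results at minimal cost.
  • Real-time data services: During digital transformation, enterprises need to build many new services, which often need data from other business systems. Traditional ETL-based data movement has significant limitations: tangled pipelines, no reuse, and heavy impact on source systems from the sheer number of pipelines. Tapdata's real-time data services perform one final ETL on the data, sync it to a distributed data platform based on MongoDB or TiDB, and, combined with no-code APIs, provide fast data API support for many downstream services directly on the data platform.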
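
The incremental wide-table idea in the list above can be sketched as follows. This is a simplified, hypothetical example rather than Tapdata's implementation: the `customers` and `orders` tables, the handler names, and the in-memory dictionaries are all assumptions, and the sketch ignores deletes and orders that move to a different customer.

```python
from collections import defaultdict

customers = {}                          # customer_id -> customer row
orders = {}                             # order_id -> order row
orders_by_customer = defaultdict(set)   # customer_id -> order ids (join index)
wide = {}                               # order_id -> denormalized wide row

def refresh(order_id):
    """Recompute a single wide row from the current base rows."""
    order = orders[order_id]
    customer = customers.get(order["customer_id"], {})
    wide[order_id] = {**order, "customer_name": customer.get("name")}

def on_order_upsert(order_id, row):
    orders[order_id] = row
    orders_by_customer[row["customer_id"]].add(order_id)
    refresh(order_id)

def on_customer_upsert(customer_id, row):
    customers[customer_id] = row
    # Only the wide rows that join to this customer are recomputed.
    for order_id in orders_by_customer[customer_id]:
        refresh(order_id)

on_customer_upsert(7, {"name": "Ann"})
on_order_upsert(100, {"customer_id": 7, "total": 30})
on_order_upsert(101, {"customer_id": 8, "total": 5})
on_customer_upsert(7, {"name": "Ann Lee"})
```

Each change touches only the rows it affects, so the cost of keeping the wide table fresh scales with the size of the change rather than with the size of the tables, unlike a batch job that rebuilds the whole table.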

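To learn more about the technical details of next-generation real-time data integration architecture and the latest developments in real-time data, join the Tapdata Live Data Platform product launch and open-source briefing next Wednesday, June 29, at 14:30. Expect a purely technical talk, a substantive exchange of ideas, and a roundtable with representatives of both the "suppliers" and the "users" of real-time data. Stay tuned!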

Original: https://www.cnblogs.com/tapdata/p/16407221.html
Author: Tapdata钛铂数据
Title: After Kafka ETL, How Should We Define the Next Generation of Real-Time Data Integration?
