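Hello everyone, I'm 独孤风.
More than half of 2022 has already passed. Over these months we have focused on metadata management tools such as LinkedIn DataHub and Apache Atlas, and looked at the role they play in data governance.
We have also looked at how to use data quality tools such as Apache Griffin.
But this is only the tip of the iceberg in data engineering. Recently Einat Orr, CEO and co-founder of lakeFS, published a 2022 overview map of data engineering that collects and organizes the outstanding projects in the field.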
1. Data ingestion software
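The one to watch here is Airbyte, an open-source project founded in 2020.
Repository: https://github.com/airbytehq/airbyte
Airbyte is an open-source EL(T) platform that helps you replicate data into your data warehouses, data lakes, and databases.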
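To make the EL(T) idea concrete, here is a minimal sketch in plain Python using only the standard-library `sqlite3` module. It does not use Airbyte or any of its APIs. The table names (`orders`, `raw_orders`, `daily_revenue`) and the functions `extract_load`, `transform`, and `demo` are hypothetical names chosen for illustration. Rows are first copied unchanged from a source database into a destination (extract and load), and the transform only runs afterwards, inside the destination.

```python
import sqlite3


def extract_load(source, dest, table, raw_table):
    """Copy every row of `table` in `source` into `raw_table` in `dest`, unchanged."""
    rows = source.execute(f"SELECT id, day, amount FROM {table}").fetchall()
    dest.execute(
        f"CREATE TABLE IF NOT EXISTS {raw_table} "
        "(id INTEGER PRIMARY KEY, day TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load idempotent if the same sync is re-run.
    dest.executemany(f"INSERT OR REPLACE INTO {raw_table} VALUES (?, ?, ?)", rows)
    dest.commit()
    return len(rows)


def transform(dest):
    """The T step: build an aggregate from the raw copy, inside the destination."""
    dest.execute("DROP TABLE IF EXISTS daily_revenue")
    dest.execute(
        "CREATE TABLE daily_revenue AS "
        "SELECT day, SUM(amount) AS revenue FROM raw_orders GROUP BY day"
    )
    dest.commit()


def demo():
    # Hypothetical source system with a few sample rows.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2022-07-01", 10.0), (2, "2022-07-01", 5.0), (3, "2022-07-02", 7.5)],
    )
    source.commit()

    # Hypothetical destination standing in for a warehouse.
    dest = sqlite3.connect(":memory:")
    loaded = extract_load(source, dest, "orders", "raw_orders")
    transform(dest)
    result = dest.execute(
        "SELECT day, revenue FROM daily_revenue ORDER BY day"
    ).fetchall()
    return loaded, result


if __name__ == "__main__":
    print(demo())
```

Keeping the raw copy separate from the transformed table is the point of the sketch: the load can be repeated safely, and the transform can be changed later without extracting from the source again.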
2. Data ingestion frameworks
There is no doubt that, alongside commercial software, open-source technologies such as Spark, Flink, Kafka, and Pulsar will continue to shine.
3. Object storage
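The veterans in this space, Ceph and SwiftStack, do hold a certain market share, but the newer MinIO deserves more attention. We have written up our own hands-on experience with it before.
Earlier article from 大数据流动: "An open-source object storage solution with 29K GitHub stars: a MinIO getting-started guide"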
4. Data lakes
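Hudi and Iceberg have also become the choice of many companies.
For now, the Databricks architecture still appears to be the higher-performing one, although so far they have not open-sourced much more of it.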
5. Data-centric machine learning
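This category covers end-to-end MLOps tools, tools for data-centric machine learning approaches, and machine learning observability and monitoring.
deepchecks was open-sourced in 2022.
https://github.com/deepchecks/deepchecks
deepchecks is a test suite for validating ML models and data. It is a Python package for comprehensively validating your machine learning models and data with minimal effort.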
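To give a feel for what "a test suite for data" means, here is a hand-written sketch in plain Python. It is not the deepchecks API and makes no claim about how deepchecks implements its checks. The function names (`check_missing_values`, `check_duplicate_rows`, `run_suite`), the sample rows, and the 10% threshold are all assumptions made for illustration. The idea it shows is simply that each check is a small function returning a pass/fail result, and a suite runs them all over the same dataset.

```python
def check_missing_values(rows, max_ratio=0.1):
    """Pass if, in every column, the fraction of None values is at most max_ratio."""
    columns = rows[0].keys() if rows else []
    ratios = {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}
    failed = {c: ratio for c, ratio in ratios.items() if ratio > max_ratio}
    return {"check": "missing_values", "passed": not failed, "details": failed}


def check_duplicate_rows(rows):
    """Pass if no row is an exact copy of an earlier row."""
    seen = set()
    duplicates = 0
    for r in rows:
        key = tuple(sorted(r.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"check": "duplicate_rows", "passed": duplicates == 0, "details": duplicates}


def run_suite(rows, checks):
    """Run every check against the same rows and collect the results."""
    return [check(rows) for check in checks]


# Hypothetical sample data: one missing age and one exact duplicate row.
SAMPLE_ROWS = [
    {"id": 1, "age": 34, "label": "a"},
    {"id": 2, "age": None, "label": "b"},
    {"id": 3, "age": 51, "label": "a"},
    {"id": 3, "age": 51, "label": "a"},
]

if __name__ == "__main__":
    for result in run_suite(SAMPLE_ROWS, [check_missing_values, check_duplicate_rows]):
        print(result)
```

A real validation package adds many more checks (for example on the model and on train/test differences) plus reporting, but the suite-of-checks shape is the same.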
6. Data governance
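For now, our focus remains on Atlas and DataHub.
Atlas is a long-standing member of the Hadoop ecosystem, and its support for data lineage is adequate.
With some custom development, Atlas can meet the business needs of most companies.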
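Data lineage can be pictured as a directed graph of "derived from" relationships between datasets. The sketch below is a simplification in plain Python, not Atlas's data model or API. The dataset names in `LINEAGE` and the function `upstream` are hypothetical. It answers the most common lineage question: which upstream datasets does a given table ultimately depend on?

```python
from collections import deque

# Hypothetical lineage edges: each key is a dataset, and its value lists
# the datasets it is directly derived from.
LINEAGE = {
    "report.daily_sales": ["dw.fact_orders", "dw.dim_customer"],
    "dw.fact_orders": ["ods.orders"],
    "dw.dim_customer": ["ods.customers"],
    "ods.orders": [],
    "ods.customers": [],
}


def upstream(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly.

    The result is in breadth-first order: direct parents first, then theirs.
    """
    found = []
    visited = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in visited:  # guards against cycles and repeats
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


if __name__ == "__main__":
    print(upstream("report.daily_sales", LINEAGE))
```

Reversing the edges gives the opposite query, impact analysis: which downstream tables are affected when a source table changes.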
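DataHub, meanwhile, is a rising star that deserves continued attention.
Related article: "A 10,000-word step-by-step guide: offline installation of DataHub, LinkedIn's metadata management platform"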
Which other projects are on the rise in 2022? Which tools are becoming de facto industry standards?
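To join the study group, follow the 大数据流动 account and reply "加群" in the chat.
A hands-on data governance knowledge community, 数据治理工具箱 (Data Governance Toolbox), has also been set up. To join, reply "数据治理工具箱" in the chat.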
Original: https://www.cnblogs.com/tree1123/p/16520058.html
Author: 独孤风
Title: 2022: A Complete Record of Data Science and Data Governance Projects