How to Quickly Sync HDFS Data to ClickHouse

ClickHouse is a distributed, columnar DBMS for OLAP workloads. Our department now stores all of the log data used for data analysis in this excellent data warehouse, at a current volume of 30 billion records per day.

Our previous data processing and storage experience was built around real-time streams, with the data held in Kafka. We use Java or Golang to read, parse, and clean data from Kafka and write it to ClickHouse, so that it becomes queryable quickly. In many scenarios, however, the data is not real-time, and it may be necessary to import data from HDFS or Hive into ClickHouse. Some colleagues write Spark programs to do the import, but is there a simpler, more efficient way?

HDFS to ClickHouse

Assume our logs are stored in HDFS. We need to parse them, filter out the fields we care about, and write those fields into a ClickHouse table.

Log Sample

The logs we store in HDFS have the following format, a very common Nginx access log.
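The sample line is not reproduced in this copy of the post, so here is a minimal hypothetical line in that spirit (hostname, domain, client IP, request time, HTTPDATE timestamp, request line, status code, response bytes); your actual layout will differ:

```
web01 static.example.com 114.250.140.241 0.001s [26/Oct/2018:03:09:32 +0800] "GET /item/1234 HTTP/1.1" 200 1024
```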

ClickHouse Schema

Our ClickHouse CREATE TABLE statement is as follows; the table is partitioned by day.
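The original DDL is also missing from this copy; the following sketch is consistent with the fields parsed from the log above. The database, table, and column names are assumptions, not the author's original schema:

```sql
CREATE TABLE cms.cms_msg
(
    date         Date,
    datetime     DateTime,
    hostname     String,
    domain       String,
    url          String,
    status       String,
    request_time Float32,
    data_size    Int32
)
ENGINE = MergeTree()
PARTITION BY date          -- partitioned by day, as described above
ORDER BY (date, hostname);
```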

Waterdrop with ClickHouse

Next, we describe in detail how Waterdrop can meet the above requirements and write the data in HDFS into ClickHouse.

Waterdrop

Waterdrop is an easy-to-use, high-performance, real-time data processing product built on top of Spark that can handle massive volumes of data. Waterdrop has a rich set of plugins: it supports reading data from Kafka, HDFS, and Kudu, performing a wide variety of transformations, and writing the results to ClickHouse, Elasticsearch, or Kafka.

Prerequisites

First of all, we need to install Waterdrop. Installation is very simple and there is no need to configure any system environment variables:

1. Prepare the Spark environment
2. Install Waterdrop
3. Configure Waterdrop
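As a rough sketch of those three steps (the release version, download URL, and paths below are placeholders, not the post's original instructions; check the project's releases page for the real ones):

```bash
# 1. Prepare Spark: any existing Spark (>= 2.x) installation will do
export SPARK_HOME=/path/to/spark

# 2. Install Waterdrop: download and unpack a release
#    (version and URL are placeholders)
wget https://github.com/InterestingLab/waterdrop/releases/download/v1.5.1/waterdrop-1.5.1.zip
unzip waterdrop-1.5.1.zip
cd waterdrop-1.5.1

# 3. Configure Waterdrop: point it at your Spark installation
echo "SPARK_HOME=${SPARK_HOME}" >> config/waterdrop-env.sh
```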

Waterdrop Pipeline

We only need to write a Waterdrop pipeline configuration file to complete the data import.

The configuration file consists of four parts: Spark, Input, Filter, and Output.
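At a high level the file has this shape (section bodies elided; each part is filled in below):

```
spark {
  # Spark resource configuration
}
input {
  # where the data is read from
}
filter {
  # how each record is transformed
}
output {
  # where the results are written
}
```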

Spark

This section concerns the configuration of Spark, mainly the size of the resources required for Spark execution.
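For example (the resource numbers here are illustrative; tune them to your cluster):

```
spark {
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}
```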

Input

This section defines the data source. Below is an example configuration that reads text-format data from an HDFS path.
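A sketch of such an input block, assuming Waterdrop's hdfs input plugin; the namenode address and path are hypothetical:

```
input {
  hdfs {
    path = "hdfs://namenode:9000/rowlog/accesslog"
    format = "text"
  }
}
```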

Filter

In the Filter section, we configure a series of transformations: regex parsing to split the log into fields, date conversion to turn the HTTPDATE timestamp into a date format that ClickHouse supports, type conversion for numeric fields, and field pruning via SQL.
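A sketch of such a filter chain, assuming Waterdrop's grok, date, and sql filter plugins. The grok pattern is abbreviated to match the hypothetical log line above; a real pattern must match your actual log layout:

```
filter {
  # Regex-parse the raw line into named fields
  grok {
    source_field = "raw_message"
    pattern = '%{NOTSPACE:hostname}\\s%{NOTSPACE:domain}\\s%{IP:remote_addr}\\s%{NUMBER:request_time}s\\s\\[%{HTTPDATE:timestamp}\\]\\s"%{DATA:request}"\\s%{NUMBER:status}\\s%{NUMBER:data_size}'
  }
  # Convert the HTTPDATE timestamp ("dd/MMM/yyyy:HH:mm:ss Z") into a
  # DateTime string that ClickHouse accepts
  date {
    source_field = "timestamp"
    target_field = "datetime"
    source_time_format = "dd/MMM/yyyy:HH:mm:ss Z"
    target_time_format = "yyyy-MM-dd HH:mm:ss"
  }
  # Cast the numeric fields and keep only the columns the table needs
  sql {
    table_name = "access"
    sql = "select substring(datetime, 1, 10) as date, datetime, hostname, domain, request as url, status, cast(request_time as float) as request_time, cast(data_size as int) as data_size from access"
  }
}
```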

Output

Finally, we write the processed, structured data to ClickHouse.
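A sketch of the output block, assuming Waterdrop's clickhouse output plugin; the host, credentials, and table names are placeholders matching the hypothetical schema above:

```
output {
  clickhouse {
    host = "your.clickhouse.host:8123"
    database = "cms"
    table = "cms_msg"
    fields = ["date", "datetime", "hostname", "domain", "url", "status", "request_time", "data_size"]
    username = "username"
    password = "password"
  }
}
```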

Running Waterdrop

We combine the above four parts into our configuration file config/batch.conf.

Execute the command, specify the configuration file, and run Waterdrop, and the data will be written to ClickHouse. Here we take local mode as an example.
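Assuming Waterdrop 1.x's start script, the invocation would look something like this (the --master and --deploy-mode values follow Spark's conventions):

```bash
./bin/start-waterdrop.sh --config config/batch.conf --deploy-mode client --master 'local[2]'
```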

In this article, we showed how to use Waterdrop to import Nginx log files from HDFS into ClickHouse. You can import data quickly through a single configuration file, without writing any code. Besides HDFS data sources, Waterdrop also supports reading data from Kafka in real time, processing it, and writing it to ClickHouse. Our next article will show how to quickly import data from Hive into ClickHouse.

Of course, Waterdrop is not just a tool for writing data to ClickHouse; it also plays an important role in writing to targets such as Elasticsearch and Kafka.

Original: https://www.cnblogs.com/wenBlog/p/16020970.html
Author: DB乐之者
Title: 如何快速同步hdfs数据到ck (How to Quickly Sync HDFS Data to ClickHouse)
