Flink 1.14 Study Notes: Writing Data to Hive & HDFS (Part 2)



Receiving Kafka data and writing it to Hive (approach 1)

Notes

Message structure (JSON format):
{"name":"Fznjui","age":16,"gender":"女"}
Kafka table definition

The Kafka table is configured with the json format, and its schema mirrors the message structure, so each received message can be converted into columns directly.
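For reference, the same source table could also be declared with SQL DDL instead of the TableDescriptor used in the full program below. This is only an illustrative sketch, assuming the default dialect and reusing the topic, brokers, and group id from the example:

tableEnv.executeSql(
  """
    |create temporary table kafkaTable (
    |  name string,
    |  age int,
    |  gender string
    |) with (
    |  'connector' = 'kafka',
    |  'topic' = 'ly_test',
    |  'properties.bootstrap.servers' = '192.168.11.160:9092,192.168.11.161:9092,192.168.11.162:9092',
    |  'properties.group.id' = 'KafkaToHiveTest1',
    |  'scan.startup.mode' = 'latest-offset',
    |  'format' = 'json'
    |)
    |""".stripMargin)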

Custom UDF

Hive does not recognize Chinese characters in partition directory names, so a small conversion function is used to map the Chinese values to ASCII partition values.
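The full program below calls the function inline through the Table API. As a sketch of the alternative, the same ScalarFunction could instead be registered under a name (the name genderConvert here is made up for illustration) and called from SQL:

// register the scalar UDF defined at the bottom of the full example
tableEnv.createTemporarySystemFunction("genderConvert", classOf[GenderConvert])
// then call it from SQL, e.g. to preview the converted partition value
tableEnv.executeSql("select name, age, gender, genderConvert(gender) as sex from kafkaTable").print()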

Complete test code

import cn.hutool.core.io.resource.ResourceUtil
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.Expressions.{$, currentTimestamp, dateFormat}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.api.{ApiExpression, DataTypes, FieldExpression, Schema, SqlDialect, TableDescriptor, call}
import org.apache.flink.table.catalog.hive.HiveCatalog
import org.apache.flink.table.functions.ScalarFunction

import java.util.concurrent.TimeUnit

object KafkaToHiveTest1 {

  private val sourceTopic = "ly_test"
  private val kafkaServers = "192.168.11.160:9092,192.168.11.161:9092,192.168.11.162:9092"

  def main(args: Array[String]): Unit = {

    val parameterTool = ParameterTool.fromArgs(args)

    val hiveConfigDir = parameterTool.get("hiveCfgDir", ResourceUtil.getResource("hive/conf").getPath)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Checkpointing is required: the streaming Hive/filesystem sink only commits files and partitions on completed checkpoints.
    env.enableCheckpointing(TimeUnit.SECONDS.toMillis(1), CheckpointingMode.EXACTLY_ONCE)

    val tableEnv = StreamTableEnvironment.create(env)

    // Register a HiveCatalog pointing at the ly_test database and make it the current catalog.
    val catalogName = "lyTest"
    val hiveCatalog = new HiveCatalog(catalogName, "ly_test", hiveConfigDir)
    tableEnv.registerCatalog(catalogName, hiveCatalog)

    tableEnv.useCatalog(catalogName)

    // Kafka source table: json format, with a schema identical to the message structure.
    val sourceTable = "kafkaTable"
    tableEnv.createTemporaryTable(sourceTable, TableDescriptor
      .forConnector("kafka")
      .schema(Schema.newBuilder()
        .column("name", DataTypes.STRING())
        .column("age", DataTypes.INT())
        .column("gender", DataTypes.STRING())
        .build())
      .option("topic", sourceTopic)
      .option("properties.bootstrap.servers", kafkaServers)
      .option("properties.group.id", "KafkaToHiveTest1")
      .option("scan.startup.mode", "latest-offset")
      .option("format", "json")
      .build())

    // Switch to the Hive dialect so the DDL below is parsed as HiveQL.
    tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)

    val kafka_sink_hive = "kafka_sink_hive"

    // Hive sink table. '$$' escapes '$' in the interpolated string, so Hive receives the
    // partition-time-extractor pattern '$ymd 00:00:00'; committed partitions are added to the metastore.
    tableEnv.executeSql(
      s"""
         |create table if not exists $kafka_sink_hive (
         | name string,
         | age int,
         | gender string,
         | sink_date_time timestamp(9)
         |) partitioned by (ymd string,sex string)
         |  stored as parquet
         |  tblproperties(
         |    'partition.time-extractor.timestamp-pattern' = '$$ymd 00:00:00',
         |    'sink.partition-commit.policy.kind' = 'metastore'
         |  )
         |""".stripMargin)

    // Back to the default dialect for the Table API pipeline below.
    tableEnv.getConfig.setSqlDialect(SqlDialect.DEFAULT)

    // Add the current timestamp, derive the ymd partition value from it,
    // and map the Chinese gender value to an ASCII partition value via the UDF.
    tableEnv.from(sourceTable)
      .addColumns(currentTimestamp() as "now")
      .select($"name", $"age", $"gender", $"now",
        dateFormat($("now"), "yyyy-MM-dd") as "ymd",
        call(classOf[GenderConvert], $("gender")).asInstanceOf[ApiExpression] as "sex")
      .executeInsert(kafka_sink_hive)
  }

  // Scalar UDF: maps the Chinese gender value to an ASCII value usable as a partition directory name.
  class GenderConvert extends ScalarFunction {

    def eval(gender: String): String = {
      gender match {
        case "男" => "boy"
        case "女" => "girl"
        case _ => "other"
      }
    }
  }

}

Run results


View the partitions


View the corresponding HDFS directory


Receiving Kafka data and writing it to Hive (approach 2)

The exercise was essentially finished in the previous step; this part mainly tests a custom UDTF (table-valued function). The goal is to parse the raw message with a custom UDTF and then write the result to Hive. (How many ways are there to write the character 茴, after all?)
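For comparison, such a UDTF could also be registered and invoked from SQL with a lateral join instead of the Table API joinLateral used below; this is only a sketch, and the registration name parseJson is made up for illustration:

// ParseJsonConvert is the table function defined at the bottom of the full example
tableEnv.createTemporarySystemFunction("parseJson", classOf[ParseJsonConvert])
tableEnv.executeSql(
  """
    |select t.name, t.age, t.gender, t.`now`, t.sex
    |from kafkaTable, lateral table(parseJson(message)) as t(name, age, gender, `now`, sex)
    |""".stripMargin).print()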

Complete test code

import cn.hutool.core.date.DateUtil
import cn.hutool.core.io.resource.ResourceUtil
import cn.hutool.json.JSONUtil
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.annotation.{DataTypeHint, FunctionHint}
import org.apache.flink.table.api.Expressions.{$, dateFormat}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.api.{DataTypes, FieldExpression, Schema, SqlDialect, TableDescriptor, call}
import org.apache.flink.table.catalog.hive.HiveCatalog
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

import java.util.Date
import java.util.concurrent.TimeUnit

object KafkaToHiveTest2 {

  private val sourceTopic = "ly_test"
  private val kafkaServers = "192.168.11.160:9092,192.168.11.161:9092,192.168.11.162:9092"

  def main(args: Array[String]): Unit = {

    val parameterTool = ParameterTool.fromArgs(args)

    val hiveConfigDir = parameterTool.get("hiveCfgDir", ResourceUtil.getResource("hive/conf").getPath)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.enableCheckpointing(TimeUnit.SECONDS.toMillis(1), CheckpointingMode.EXACTLY_ONCE)

    val tableEnv = StreamTableEnvironment.create(env)

    val catalogName = "lyTest"
    val hiveCatalog = new HiveCatalog(catalogName, "ly_test", hiveConfigDir)
    tableEnv.registerCatalog(catalogName, hiveCatalog)

    tableEnv.useCatalog(catalogName)

    // Kafka source table: the whole record is read as one raw string column and parsed later by the UDTF.
    val sourceTable = "kafkaTable"
    tableEnv.createTemporaryTable(sourceTable, TableDescriptor
      .forConnector("kafka")
      .schema(Schema.newBuilder()
        .column("message", DataTypes.STRING())
        .build())
      .option("topic", sourceTopic)
      .option("properties.bootstrap.servers", kafkaServers)
      .option("properties.group.id", "KafkaToHiveTest2")
      .option("scan.startup.mode", "latest-offset")
      .option("format", "raw")
      .build())

    tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)

    val kafka_sink_hive = "kafka_sink_hive"

    tableEnv.executeSql(
      s"""
         |create table if not exists $kafka_sink_hive (
         | name string,
         | age int,
         | gender string,
         | sink_date_time timestamp(9)
         |) partitioned by (ymd string,sex string)
         |  stored as parquet
         |  tblproperties(
         |    'partition.time-extractor.timestamp-pattern' = '$$ymd 00:00:00',
         |    'sink.partition-commit.policy.kind' = 'metastore'
         |  )
         |""".stripMargin)

    tableEnv.getConfig.setSqlDialect(SqlDialect.DEFAULT)

    // Lateral-join each raw message with the UDTF output, then derive the ymd partition value.
    tableEnv.from(sourceTable)
      .joinLateral(call(classOf[ParseJsonConvert], $("message")))
      .select($"name", $"age", $"gender", $"now",
        dateFormat($("now"), "yyyy-MM-dd") as "ymd", $"sex")
      .executeInsert(kafka_sink_hive)
  }

  // UDTF: the output row type is declared explicitly so the lateral join exposes named, typed columns.
  @FunctionHint(output = new DataTypeHint("ROW<name STRING, age INT, gender STRING, `now` TIMESTAMP(9), sex STRING>"))
  class ParseJsonConvert extends TableFunction[Row] {

    def eval(message: String): Unit = {
      val json = JSONUtil.parse(message)
      val name = json.getByPath("name", classOf[String])
      val age = json.getByPath("age", classOf[java.lang.Integer])
      val gender = json.getByPath("gender", classOf[String])
      val sex = gender match {
        case "男" => "boy"
        case "女" => "girl"
        case _ => "other"
      }

      // current time shifted forward by one day, converted to LocalDateTime
      val localDateTime = DateUtil.offsetDay(new Date(), 1).toLocalDateTime
      collect(Row.of(name, age, gender, localDateTime, sex))
    }
  }

}

Run results


View the partitions


View the corresponding HDFS directory


Original: https://blog.csdn.net/baidu_32377671/article/details/125815077
Author: lyanjun
Title: Flink1.14学习测试:将数据写入到Hive&Hdfs(二)
