Hive Syntax and Advanced Usage (Part 1)

May 26, 2023, 11:27 PM • Big Data

1、Full Hive CREATE TABLE syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
      [(col_name data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col_name, col_name, ...)
      [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
      [
       [ROW FORMAT row_format]
       [STORED AS file_format]
       | STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]  (Note:  only available starting with 0.6.0)
      ]
      [LOCATION hdfs_path]
      [TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
      [AS select_statement]  (Note: this feature is only available starting with 0.5.0.)

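Notes:

EXTERNAL: creates an external table
(col_name data_type [COMMENT col_comment], …): defines the column names and column types
COMMENT col_comment: adds a comment to a column
COMMENT table_comment: adds a comment to the table
PARTITIONED BY (col_name data_type [COMMENT col_comment], …): partitions the table; lists the partition columns and their comments
CLUSTERED BY (col_name, col_name, …): buckets the table on the given columns
SORTED BY (col_name [ASC|DESC], …) INTO num_buckets BUCKETS: sets the sort columns (ascending or descending) within each bucket and the number of buckets
ROW FORMAT row_format: sets the row and column delimiters (the default row delimiter is \n)
STORED AS file_format: sets the Hive storage format: textFile, rcFile, SequenceFile; the default is textFile
LOCATION hdfs_path: sets the storage location (by default, under the Hive warehouse directory, hive.metastore.warehouse.dir)
TBLPROPERTIES (property_name=property_value, …): often used together with external tables, for example to map an HBase table so that HBase data can be queried with HQL, although such queries are fairly slow
AS select_statement: loads data from another table; select_statement is an SQL query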
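To make the clause order concrete, here is a minimal sketch that combines several of the optional clauses in one statement. The table name `students_demo`, the column comments, the bucket count, the HDFS path, and the `creator` property are all made up for illustration; none of them come from the examples in this article.

```sql
-- Hypothetical table: every name, path and property value here is an assumption.
CREATE EXTERNAL TABLE IF NOT EXISTS students_demo
(
    id   BIGINT COMMENT 'student id',
    name STRING COMMENT 'student name',
    age  INT    COMMENT 'age in years'
)
COMMENT 'demo table combining several optional clauses'
PARTITIONED BY (dt STRING COMMENT 'load date')          -- partition column, not repeated in the column list
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS     -- bucket on id, sort rows inside each bucket
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/students_demo'
TBLPROPERTIES ('creator' = 'demo');
```

The clauses have to appear in the order given by the syntax template; for instance, LOCATION comes after STORED AS, and TBLPROPERTIES comes last.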

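2、Creating a table the default way

create table students01
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Notes:
If no delimiter is specified, Hive falls back to its default field delimiter \001 (Ctrl-A), so a comma-separated line is not split into columns.
A column delimiter is normally specified with the clause below; it can be omitted when the table has only one column:

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';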
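As a rough illustration of the default delimiter, the two statements below should produce tables that parse text files the same way. The table names are hypothetical, and the claim of equivalence rests on Hive's documented default field delimiter being \001 (Ctrl-A).

```sql
-- Hypothetical table names, for illustration only.

-- No ROW FORMAT clause: Hive uses its default field delimiter (\001, Ctrl-A).
CREATE TABLE students_default_delim
(
    id   BIGINT,
    name STRING
);

-- The same default written out explicitly.
CREATE TABLE students_ctrl_a
(
    id   BIGINT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';

-- Inspect the storage information recorded for a table, including its delimiter.
DESCRIBE FORMATTED students_ctrl_a;
```

Loading a comma-separated file into either table would leave the whole line in the first column and the remaining columns NULL, which is why the delimiter is usually set explicitly.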

3、Table creation 2: specifying LOCATION

create table students02
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'data';
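A small sketch of how the chosen location can be checked afterwards. The table name `students_loc` and the absolute HDFS path `/data/students_loc` are assumptions made for this example; an absolute path is used here only because it makes the resulting directory unambiguous.

```sql
-- Hypothetical table name and HDFS path.
CREATE TABLE students_loc
(
    id     BIGINT,
    name   STRING,
    age    INT,
    gender STRING,
    clazz  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/students_loc';

-- The "Location" field in the output shows the directory backing the table.
DESCRIBE FORMATTED students_loc;

-- Any correctly delimited file placed in that directory becomes visible to queries.
SELECT * FROM students_loc LIMIT 10;
```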

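4、Table creation 3: specifying the storage format

create table student_rc
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS rcfile;

Note:

This sets the storage format to rcfile (inputFormat: RCFileInputFormat, outputFormat: RCFileOutputFormat). If no format is specified, the default is textfile.

Note:

Apart from textfile, tables in other storage formats cannot be filled by loading a text file directly, because load data only moves files and does not convert them. Populate such tables by inserting from another table instead.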
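The "load from another table" route can be sketched as a two-step process: load the raw text into a textfile table, then let Hive rewrite it into the target format with an insert. This sketch assumes `students01` (the textfile table from section 2) and `student_rc` (the RCFile table above) both exist; the local file path is the one this article uses in its load data examples.

```sql
-- Step 1: load the comma-separated text file into the textfile staging table.
LOAD DATA LOCAL INPATH '/usr/local/data/students.txt' INTO TABLE students01;

-- Step 2: have Hive read the rows and write them out in RCFile format.
INSERT INTO TABLE student_rc
SELECT id, name, age, gender, clazz
FROM students01;

-- Quick sanity check that the rows arrived.
SELECT COUNT(*) FROM student_rc;
```

The insert runs as a job that deserializes each text row and serializes it again through the RCFile output format, which is the conversion that load data skips.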

5、Table creation 4: loading data from another table
Format:
create table xxxx as select_statement (an SQL query); this approach is used fairly often

Example:
create table students4 as select * from students2;
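A slightly fuller sketch of create table ... as select: the query can pick columns and filter rows, and the new table can be given its own storage format. The target name `students_adults` and the age threshold are invented for this example; `students01` is the table from section 2.

```sql
-- Hypothetical target table and filter condition.
CREATE TABLE students_adults
STORED AS RCFILE
AS
SELECT id, name, age, clazz
FROM students01
WHERE age >= 18;
```

The column names and types of the new table are taken from the query result, so no column list is written. Per the Hive documentation, the target of this form cannot be an external table.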

6、Table creation 5: copying the structure of another table

Format:
create table xxxx like table_name (creates the table only; no data is loaded)

Example:

create table student04 like students;
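To see that only the definition is copied, the new table can be described and counted right after it is created. This is a sketch with a hypothetical table name, `students01_copy`, based on `students01` from section 2.

```sql
-- Hypothetical table name.
CREATE TABLE students01_copy LIKE students01;

-- Same columns, delimiter and storage format as students01.
DESCRIBE FORMATTED students01_copy;

-- No rows are copied, so the new table starts out empty.
SELECT COUNT(*) FROM students01_copy;
```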

7、Loading data into Hive

1、Use hadoop dfs -put '<local data>' '<HDFS directory of the Hive table>' (on current Hadoop releases, hadoop dfs is deprecated in favor of hdfs dfs)

2、Use load data inpath (this moves the file within HDFS; it is a move, not a copy)

3、Use load data local inpath (the most common way; uploads a file from the local file system)

-- overwrite replaces the data already in the table
-- in effect Hadoop performs an rmr followed by a put
For example:

load data local inpath '/usr/local/data/students.txt' overwrite into table student01;

Difference between method 1 and method 2:
1. Uploading data to an HDFS directory has nothing to do with the Hive table. No format matching is required at upload time, although Hive still needs the format to match when it reads the data.
2. Loading data into a Hive table is tied to that table, so the data format must match.

8、Truncating a table

truncate table student01;

Note: truncating removes the data in the table; it does not drop the table.

Another way to load data is insert into table xxxx followed by an SQL query (without as). This can also move data into a Hive table that uses a different storage format. For example:

insert into table student04 select * from student01;

For an overwriting insert, replace into with overwrite. For example:

insert overwrite table student04 select * from student01;

9、Hive managed (internal) tables vs external tables

Difference: dropping a managed table deletes its data as well; dropping an external table removes only the table definition, and the data remains.

Note: in practice, companies mostly use external tables, so that accidentally dropping a table does not also destroy the data.

You cannot tell from a path alone whether it is a plain directory or a Hive table, or whether that table is managed or external.

Creating the tables:

-- managed table
create table students_managed01
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- managed table with an explicit location
create table students_managed02
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/managed';

-- external table
create external table students_external01
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- external table with an explicit location
create external table students_external02
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/external';

Loading the data:

hive> load data local inpath '/usr/local/data/students.txt' into table students_managed01;
hive> load data local inpath '/usr/local/data/students.txt' into table students_managed02;
hive> load data local inpath '/usr/local/data/students.txt' into table students_external01;
hive> load data local inpath '/usr/local/data/students.txt' into table students_external02;

Dropping the tables:

hive> drop table students_managed01;
hive> drop table students_managed02;
hive> drop table students_external01;
hive> drop table students_external02;

Summary of external and managed tables:

When a managed table is dropped, its data (the files on HDFS) is deleted together with the table's metadata.
When an external table is dropped, only the metadata is deleted; the data (the files on HDFS) is left in place.
Companies generally prefer external tables, because the data may need to be used by several programs and this avoids accidental deletion. External tables are usually combined with LOCATION.
External tables can also map data from other sources into Hive, for example HBase or ElasticSearch.
The original purpose of external tables is to decouple a table's metadata from its data.

10、Creating a single-level partitioned table in Hive

1. Create the partitioned table

create table students_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2. Load data

load data local inpath '/usr/local/data/students.txt' into table students_pt partition(month='2021-09-26');

3. Query by partition

Querying a single partition:
select * from students_pt where month='2021-09-26';

Querying several partitions:
select * from students_pt where month='2021-09-26' or month='2021-09-24';

4. Add partitions

Adding a single partition:
alter table students_pt add partition(month='2021-09-25');

Adding several partitions (note that there is no comma between them):
alter table students_pt add partition(month='2021-09-23') partition(month='2021-09-24');

5. Drop partitions

Dropping a single partition:
alter table students_pt drop partition(month='2021-09-23');

Dropping several partitions (note that they are separated by a comma):
alter table students_pt drop partition(month='2021-09-24'),partition(month='2021-09-25');

6. List the partitions of a partitioned table

show partitions students_pt;

7. Show the structure of a partitioned table

desc formatted students_pt;

11、Creating a multi-level partitioned table in Hive

1. Create a two-level partitioned table

hive> create table score_pt(
    > id int,
    > subjectid int,
    > score int)
    > partitioned by (month string,day string)
    > row format delimited fields terminated by ',';

2. Load data

load data local inpath '/usr/local/data/score.txt' into table score_pt partition(month='2021-09',day='01');

3. Query the data

select * from score_pt where month='2021-09' and day='01';

4. Add second-level partitions

hive> alter table score_pt add partition(month='2021-09',day='02');

Adding several at once (no comma, the same as for single-level partitions):
alter table score_pt add partition(month='2021-09',day='03') partition(month='2021-09',day='04');

5. Drop second-level partitions

alter table score_pt drop partition(month='2021-09',day='02');

Dropping several at once (with a comma, the same as for single-level partitions):
alter table score_pt drop partition(month='2021-09',day='03'),partition(month='2021-09',day='04');

12、Dynamic partitioning

Sometimes the data in the original table contains a date column dt, and we want to split the rows into partitions according to the different dates in dt, turning the original table into a partitioned table.
Depending on the version, Hive may not enable dynamic partitioning by default.
Dynamic partitioning: partitions are created according to the distinct values of one or more columns in the data.

Enabling dynamic partition support in Hive

Turn on dynamic partitioning:
hive> set hive.exec.dynamic.partition=true;

Set the dynamic partition mode: strict (at least one static partition must be given alongside the dynamic ones) or nonstrict. In strict mode an insert looks like this:
insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Set the maximum number of dynamic partitions that each node may create, here 1000; adjust it to the workload:
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

Inserting data with dynamic partitioning

1. Create the tables that hold the data

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2. Insert the data (this is the only way that works)

-- The partition column must come last in the select list. The same applies when there are several
-- partition columns: they are matched by position, not by name.
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;

More on partitions: https://developer.aliyun.com/article/81775

Hive bucketing

Bucketing is a further split of the files (the data).

Bucketing is turned off in Hive by default.

Purpose: when rows are inserted into a bucketed table, Hive hashes the column named in clustered by and takes the remainder modulo the number of buckets, so the data is split into that many files. This spreads the data evenly, which can relieve map-side data skew, makes it easy to take samples, and improves map join efficiency.

The bucketing column should be chosen according to the business requirements.

Turning on the bucketing switch

hive> set hive.enforce.bucketing=true;

Creating a bucketed table

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Inserting data into the bucketed table
-- load data on its own does not distribute the rows across buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

-- insert the data this way instead, so that the bucketed table actually takes effect
insert into students_buks select * from students;
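One practical payoff of bucketing mentioned above is sampling. As a rough sketch, assuming students_buks was populated with insert ... select so that the rows really are spread over its 12 buckets, TABLESAMPLE can restrict a query to some of them:

```sql
-- Read only the first of the 12 buckets, sampling on the CLUSTERED BY column.
SELECT *
FROM students_buks TABLESAMPLE (BUCKET 1 OUT OF 12 ON clazz);

-- Read every 4th bucket starting from the first (buckets 1, 5 and 9),
-- which is roughly a quarter of the data if rows are evenly distributed.
SELECT *
FROM students_buks TABLESAMPLE (BUCKET 1 OUT OF 4 ON clazz);
```

Because the sample column matches the bucketing column, Hive can pick whole bucket files instead of scanning the full table. If the table had been filled with load data alone, the rows would not be hashed into buckets and such a sample would not be representative.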

Original: https://www.cnblogs.com/lmandcc/p/15345444.html
Author: lmandcc
Title: Hive Syntax and Advanced Usage (Part 1)

