Hive from Getting Started to Giving Up (4): Partitioning and Bucketing

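Partitioning can improve query efficiency. A Hive partition is in fact just a directory on HDFS, and that directory holds the data files belonging to the partition.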

Basic partition operations

create table partition_table(
    col1 int,
    col2 string
)
partitioned by (part_col string)
row format delimited fields terminated by '\t';

* The partition column must not be one of the columns already declared in the table.
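
A minimal sketch of what this rule means in practice. The table name bad_partition_table is hypothetical; the other names come from the table defined above. The first statement is expected to be rejected by Hive because part_col appears both as a data column and as a partition column, so it is left commented out.

```sql
-- Expected to fail: part_col is declared twice, once as a regular column
-- and once as a partition column (bad_partition_table is a made-up name).
-- create table bad_partition_table(
--     col1 int,
--     part_col string
-- )
-- partitioned by (part_col string);

-- The partition column is not stored inside the data files, but it can be
-- selected and filtered like an ordinary column.
select col1, col2, part_col
from partition_table
where part_col = '20220331';
```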

After creating a partitioned table, use the load command to import data into it:

load data local inpath '/data_dir/data_file'
into table partition_table
partition(part_col='20220331');

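If you instead create the directory on HDFS yourself and upload the data files into it, queries will not find that data, because querying a partitioned table goes through the metadata, which has no record of the new partition.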

If you insist on doing it this way, or have already done so, run the repair command: msck repair table table_name;
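
The scenario above could look roughly like this in the Hive CLI. This is a sketch under assumptions: the warehouse path /user/hive/warehouse is only the common default and may differ on your cluster, the table is assumed to live in the default database, and the partition value 20220401 is made up for illustration.

```sql
-- Create the partition directory by hand and upload a file into it
-- (the warehouse path is an assumption; check hive.metastore.warehouse.dir).
dfs -mkdir -p /user/hive/warehouse/partition_table/part_col=20220401;
dfs -put /data_dir/data_file /user/hive/warehouse/partition_table/part_col=20220401;

-- The metastore does not know about the new directory yet,
-- so this query does not see the uploaded file.
select * from partition_table where part_col='20220401';

-- Scan the table directory and register any partitions missing from the metastore.
msck repair table partition_table;

-- Alternatively, register just this one partition explicitly:
-- alter table partition_table add partition(part_col='20220401');
```

After either the repair or the explicit add partition, the same select should return the uploaded rows.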

-- list all partitions of the table
show partitions partition_table;

-- query a single partition
select * from partition_table where part_col='20220331';

-- add a partition
alter table partition_table add partition(part_col='20220331');

-- drop a partition
alter table partition_table drop partition(part_col='20220331');

Two-level partitioning

create table partition_table(
    col1 int,
    col2 string
)
partitioned by (part_col1 string, part_col2 string)
row format delimited fields terminated by '\t';
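
A short sketch of how such a table might be loaded and queried. The partition values '202203' and '31' are hypothetical, and the example assumes the single-level partition_table from the previous section has been dropped (or that this table was created under another name), since both definitions use the same table name.

```sql
-- With two partition columns, load must name a value for both levels.
load data local inpath '/data_dir/data_file'
into table partition_table
partition(part_col1='202203', part_col2='31');

-- On HDFS the second level becomes a subdirectory of the first:
--   .../partition_table/part_col1=202203/part_col2=31/

-- Filtering on both partition columns lets Hive read only that directory.
select * from partition_table
where part_col1='202203' and part_col2='31';

-- Each partition is listed as part_col1=.../part_col2=...
show partitions partition_table;
```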

Dynamic partitioning

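In a relational database, when you insert data into a partitioned table, the database automatically routes each row to the appropriate partition based on the value of the partition column.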

Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but using it requires some configuration.

Enable dynamic partitioning (default: true, i.e. enabled)

hive.exec.dynamic.partition=true

Set the mode to non-strict

hive.exec.dynamic.partition.mode=nonstrict

The default is strict, which requires at least one partition column to be specified statically; nonstrict allows every partition column to be dynamic.

The maximum total number of dynamic partitions that can be created across all nodes running MR. Default: 1000

hive.exec.max.dynamic.partitions=1000

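The maximum number of dynamic partitions that can be created on each node running MR. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter needs to be set to more than 365; with the default value of 100 the job fails with an error.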

hive.exec.max.dynamic.partitions.pernode=100

insert into partition_table partition(part_col) select * from table_name;
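
Putting the settings and the insert together, a session might look like the sketch below. It assumes the source table table_name has columns named col1, col2 and part_col (these source column names are hypothetical). Hive takes the dynamic partition value from the last column of the select list by position, not by name, so listing the columns explicitly is safer than select *.

```sql
-- Session-level settings; set only affects the current session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the select list.
-- Each distinct value of part_col ends up in its own partition.
insert into table partition_table partition(part_col)
select col1, col2, part_col
from table_name;

-- Check which partitions were created.
show partitions partition_table;
```

If the select list ended with a different column, Hive would silently use that column's values as partition names, so it is worth checking the output of show partitions after the first run.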

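Hive can also bucket data. Unlike partitioning, which classifies data by storage path, bucketing is a technique for dividing up the data at the level of the data files themselves.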

Creating a bucketed table

-- a bucketed table with 6 buckets
create table bucket_table(col1 int, col2 string)
clustered by(col1)
into 6 buckets
row format delimited fields terminated by '\t';

Loading data

Data can be loaded into a bucketed table using either load or insert.
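
As a rough illustration of the insert route, the sketch below copies rows from a plain table into bucket_table. The table staging_table is hypothetical and is assumed to have the same two columns. mapreduce.job.reduces is the Hadoop property that controls the reducer count; treat the exact settings as assumptions to verify against your Hive version.

```sql
-- Let the job pick the number of reducers itself.
set mapreduce.job.reduces=-1;

-- Only relevant on Hive versions before 2.0, where bucketing was not
-- enforced by default (the property was removed in 2.0):
-- set hive.enforce.bucketing=true;

-- Hive hashes col1 and distributes the rows across the 6 bucket files.
insert into table bucket_table
select col1, col2
from staging_table;
```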

Note that the number of reducers should be set to -1, which lets the job decide how many reducers it needs, or the number of reducers …

… Hive can meet this need by sampling the table.

Syntax: tablesample(bucket x out of y)

select * from bucket_table tablesample(bucket 1 out of 3 on col1);

y must be either a multiple or a factor of the table's total number of buckets.

The statement above means: for a table with 6 buckets, a total of 6/y = 6/3 = 2 buckets are sampled, namely bucket x = 1 and bucket x + 3 = 4.
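
A few more variations on the same 6-bucket table may help make the rule concrete. These are illustrative sketches rather than output from a real run; the comments describe what the rule above predicts.

```sql
-- y equals the bucket count: 6/6 = 1 bucket is read, namely bucket 2.
select * from bucket_table tablesample(bucket 2 out of 6 on col1);

-- y is a factor of the bucket count: 6/2 = 3 buckets are read,
-- namely buckets 1, 3 and 5 (x, x+y, x+2y).
select * from bucket_table tablesample(bucket 1 out of 2 on col1);

-- y is a multiple of the bucket count: 6/12 = 1/2,
-- so only half of bucket 1 is read.
select * from bucket_table tablesample(bucket 1 out of 12 on col1);

-- x must not be larger than y; a clause like the following is rejected:
-- select * from bucket_table tablesample(bucket 4 out of 3 on col1);
```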

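This article gave a brief introduction to Hive partitioning, including how to create a partitioned table, add and drop partitions, and use two-level and dynamic partitions, and to bucketed tables, including the concept of bucketing and sampling with tablesample.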

Original: https://www.cnblogs.com/lyuzt/p/16091617.html
Author: 大数据的奇妙冒险
Title: Hive from Getting Started to Giving Up (4): Partitioning and Bucketing
