Hive

2023年11月12日上午3:13 • 大数据 • 阅读 35

一、Hive

1.1 Hive作用及优缺点

1、前端采集数据
2、然后对数据进行预处理，将其变为结构化数据。MR和Spark可以做
3、拿到海量数据，对于海量数据的分析，SQL可以多维度查询，但对海量数据没办法；
    MR可以做分析，但可惜没有丰富的查询能力，于是Hive诞生，它提供SQl语句，然后拿到客户端的Sql语句进行解析转为MR程序，输出结果。
    MySQl中记录着元数据信息：表的位置信息和表结构信息
               Hdfs记录着：原数据信息和分析结果数据

1、什么是hive？
    Hive是用于解决海量结构化数据分析的一种数仓工具。
    本质是将sql语句解析成MR程序，并计算出结果。
    hive是一个工具，不是数据库，它将用户编写的SQL语句转换成MR程序运行，分析处理海量架构化数据。
    hive处理的数据存储在Hdfs上，分析数据的底层实现是MR/Spark；执行程序运行在yarn上。
2、优点：
    1）对SQL有着良好的支持
    2）避免了写MR程序
    3）可以处理大数据，而对于小数据并无优势
    4）运行模式为MR，比较稳定，后期也可改为spark模式（因为MR会将数据缓存到磁盘不会造成内        存溢出）。
3、缺点：
    1）实时性低延迟高，不适合做实时查询。由于Hive在查询数据的时候无索引，需要扫描整个表，       而且MR框架也是比较复杂的，所以延迟会比较高
    2）查询能力有限，迭代式算法无法表达
    3）不擅长数据挖掘
    4)hive自动生成MR程序，通常情况下不够智能
应用场景：
    对实时性要求不高，适合做大量数据的报表数据
4、hive与数据库的区别：
    Hive是数仓工具，并非数据库。Hive除了拥有类似数据库的查询语句，再无类似之处。作为数据库，最起码要可以实现增删改查，而Hive不可以删除、修改数据，基本只能查询。

1.2 Hive安装、启动

Hive--连接着mysql和hdfs，所以hive的使用至少需要SQL和hdfs开启。
    做了开启之后可以查看进程中是否有这个端口。
1、mysql服务d端启动 mysql -uroot -p
    元数据在mysql中；
    mysqlde的服务端口：3306
    而 mysql的开启：service mysqld start
                  service mysqld stop
                  service mysqld status
2、开启HDFS :start-dfs.sh
    HDFS的服务端口：8020
3、启动元数据服务 :hive --service metastore &(&是后台启动)
    hive的服务端口：9083
4、远程连接进到hive客户端：
    hiveserver2  &(&是后台启动)
    beeline
    beeline>!connect jdbc:hive2://linux01:10000
    (退出的时候是 ： ！quit)
查看进程中是否有端口的服务：
    netstat -nltp | grep 端口号
5、进入hive的客户端：bin/hive （不用）
    这是本地连接，不太好使，所以使用远程连接，连接到hive2；hive2做了优化，方便使用。

二、基本语法

2.1 数据类型

byte  smallint(short)  int  bigint(long)  boolean  float  double  timestamp
array
map
struct

2.2 建表–三种建表方式

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 分区
[CLUSTERED BY (col_name, col_name, ...) 分桶
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]   row format delimited fields terminated by "分隔符"
[STORED AS file_format]
[LOCATION hdfs_path]

create (external) table (if not exits) table_name(
    属性 数据类型 （comment...做声明）
    属性 数据类型 （comment...做声明）
)
row format delimited fileds terminated by '分隔符'
(comment table_comment.....做表的声明)
(partitioned by.....)
(clustered   by.....)
(sorted      by.....)
（location path）
；

1、
create table test_1 (
id int,
name string,
sal double
)
row format delimited fields terminated by ','
location '/test/'
;

2、create table test_2 like test_1;

3、 create table test_3 as select id,name,sal from test_1;

2.3 三种数据导入方式

1、load linux本地文件—相当于put操作

load data local inpath /opt/test.csv into table table_name;

2、load Hdfs上的文件—-相当于rm操作，Hdfs上的数据会被移动走

load data inpath '/test' into table table_name;

3、insert导入将查询结果导入已有的表中

insert into table_name select  * from table_name2;

2.4 导出数据

2.5 表的类型

create table test_1..........默认创建管理表(内部表)
crea external table test_2...创建外部表
管理表：在删除表的时候，会连带着把源数据也给删了：一般用于临时业务，且总表还在
外部表：在删除表的时候，不会删除表所在的源数据
注：创建表的时候如果使用location的方式，一定要注意使用external，建立外部表，否则，当你把    表删了的时候，连带着location的数据也给删了。

三、表、库的操作命令

3.1 库的操作命令

建     库：create database my_data;
查看库结构：desc database my_data;
查看库详细信息：desc database extended my_data;
修改库的属性：alter database my_data set dbproperties('auther'='wt')

3.2 表的操作

3.2.1 建表

create table tb_1 ()...

create table as select ....

create table tb_1 like tb_2;

3.2.2 数据导入

    location
    load data local inpath....
    load data inpath......
    insert into ....partition('month'='2021-7') select......
    import....

3.3.3 数据导出

数据导出
    1、将数据格式化后导出到本地
        insert overwrite local directory 'linux目录'
        row format delimited fields terminated by ','
        select * from tb_table;
     2、将数据格式化后导出到Hdfs（没有local）
        insert overwrite  directory 'Hdfs目录'
        row format delimited fields terminated by ','
        select * from tb_table;
     3、Hive命令将数据导出到本地
        hive -e 'select...;select ....;....' > /opt/a.txt
        hive -f sql.txt;
        定时器：定时的去执行一些脚本(不是太好使)
        crontab -e
        *  *  *  *  * CMD
        分 时 日 月 周
        * * * * *    每分钟执行一次
        0 * * * *    每小时执行一次
        30 8 * * *   每天八点半执行一次
        30 8,10,12 * * * 每天八点半、十点半、十二点半各执行一次
        0 1~5 * * * 每天1、2、3、4、5点各执行一次
     4、export将数据导出到Hdfs中
         export table test_3 to '/user/hive/warehouse/export';
         import table test_4 from '/user/hive/warehouse/export';

3.3.4 表命令

查询表结构：desc tb-1；
          desc formatted tb_1;
添加列：alter table tb_1 add columns (name string);
修改表的属性： alter table tb_1 set tblproperties('auther'='wbb');
删除表：drop table tb_1;

3.3 表分区操作

数据
2021-07-01.log
1,2021-07-01
2,2021-07-01
3,2021-07-01
4,2021-07-01
5,2021-07-01

2021-07-02.log
6,2021-07-02
7,2021-07-02
8,2021-07-02

2021-06-01.log
11,2021-06-01
12,2021-06-01
13,2021-06-01

2021-06-02.log
14,2021-06-02
15,2021-06-01
16,2021-06-02

建表：
create table tb_log (
id int ,
ctime string
)
partitioned by (m string,d string)
row format delimited fields terminated by ',';

导入数据：
load  data local inpath '/opt/data/2021-07-01.log' into table tb_log partition(m='2021-07',d='2021-07-01');
load  data local inpath '/opt/data/2021-07-02.log' into table tb_log partition(m='2021-07',d='2021-07-02');
load  data local inpath '/opt/data/2021-06-01.log' into table tb_log partition(m='2021-06',d='2021-06-01');
load  data local inpath '/opt/data/2021-06-02.log' into table tb_log partition(m='2021-06',d='2021-06-02');

于是便完成了表的分区：
tb_log


查看表的所有分区：show partitions tb_name;
删除分区:
alter table tb_log drop partition(d='2021-07-01'),partition(d='2021-06-01');

四：查询语句

SQL的运算优先级：
select       4)运算
             5) 运算逻辑  结果
from
tb_name      1) 源数据
where  and   2)
group by     3)
order by     6) 对结果排序
limit        7) 输出数据的条数

1、求每个人的部门名称
select yg.name,bm.name from yg join bm join gz on yg.uid=gz.uid and bm.bid=gz.bid;
2、每种性别的总工资
select yg.gender ,sum(jb+jj+tc) total_sal from yg  join  gz on  yg.uid = gz.uid
group by yg.gender;

1、case when

case when …then …else …end as…

case when :
3、统计每个年龄段的总工资：
select * ,
     case
     when 条件1 then '取名'
     when 条件2 then '取名2'
     else '取名'
     end  as 取名
 from
 tableName ;

 select * ,case
 when age>=10 and age<20 then '10-20'
 when age>=20 and age<30 then '20-30'
 when age>=30 and age<40 then '30-40'
 when age>=40 and age<50 then '40-50'
 else 'other'
 end as age_stage
 from yg;
+
| yg.uid  | yg.name  | yg.age  | yg.gender  | age_stage  |
+
| 1       | zss      | 23      | M          | 20-30      |
| 2       | lss      | 33      | F          | 30-40      |
| 3       | wbb      | 28      | M          | 20-30      |
| 4       | laona    | 37      | F          | 30-40      |
| 5       | naige    | 17      | M          | 10-20      |
+
 select a.age_stage , sum(jb+jj+tc) total_sal
 from gz join
 (select * ,case
 when age>=10 and age<20 then '10-20'
 when age>=20 and age<30 then '20-30'
 when age>=30 and age<40 then '30-40'
 when age>=40 and age<50 then '40-50'
 else 'other'
 end as age_stage
 from yg) a
 on gz.uid=a.uid group by age_stage;

2、greatest

4、求每个人的 各个工资 和最高工资 和最高工资的类型
select
yg.name,
gz.jb,
gz.jj,
gz.tc,
greatest(gz.jb,gz.jj,gz.tc) higher_sal,
case
when (greatest(gz.jb,gz.jj,gz.tc)=gz.jb) then 'jb'
when (greatest(gz.jb,gz.jj,gz.tc)=gz.jj) then 'jj'
when (greatest(gz.jb,gz.jj,gz.tc)=gz.tc) then 'tc'
end as higher_sal_type
from yg join gz
on yg.uid=gz.uid;

3、if(expr1 , expr2 ,expr3)

if(expr1 , expr2 ,expr3)  相当于三目运算符，若expr1成立则执行expr2，否则执行expr3

4、 with x as (select from ) select from x ;

给子查询起个名，让结构更加清晰一点
with t as (select id from student) select * from s_2 join t on s_2.id=t.id;

5、 left semi join

左半连接
a 1         1 ls
b 2         1 ls
c 3         2 zs
此时直接进行join便会有重复数据两个a ls 所以使用左半连接的时候只会连接到一个a ls

4.1 排序

order by  全表排序，易造成单个reducer任务量过大
distribute by ...sort by...分区，区内排序
    先设置分区：set mapreduce.job.reduces=3;
    然后分区排序：select * from emp distribute by name sort by sal desc;
    区内排序是按照字段取哈希值然后取模，进行分区，之后区内排序，减少了单个reducer的任务量
cluster by：
    要求分区的字段和排序的字段一样，且是升序。只能是升序不能降序。
    select * from emp cluster by sal;

4.2空字段赋值

nvl( string, replace_string) 若string为null则返回replace_string，若不是null则返回本身
select nvl(comm,'无') from emp;

4.2 case when

case when ..then ..else ..end as...

统计每个年龄段的总工资：
select * ,case
 when age>=10 and age<20 then '10-20'
 when age>=20 and age<30 then '20-30'
 when age>=30 and age<40 then '30-40'
 when age>=40 and age<50 then '40-50'
 else 'other'
 end as age_stage
 from yg;

4.3 行转列

concate(a,'-',b)  会将a与b用-连接上，形成a-b

concat_ws('-',array)这个是将数组内的数据取出并用'-'拼上

collect_list(name)分组后，将组内的数据放到一起形成一个数组

concat_ws(',',collect_list(name))

一、列传行

孙悟空 白羊座 A
娜娜  射手座 A
宋宋  白羊座 B
猪八戒 白羊座 A
凤姐  射手座 A
name xz xzid

最后是：xz xzid names
select
    xz-id,
    concat_ws(',',collect_list(name)) as names
from(
    select
        name,concat(xz,'-',xzid) as xz-id
    from t
    ) a
group by xz-id

+
|   xx   |   _c1    |
+
| 射手座,A  | 娜娜,凤姐    |
| 白羊座,A  | 孙悟空,猪八戒  |
| 白羊座,B  | 宋宋       |
+

drop table person_info;
create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
load data local inpath '/opt/data/person_info.txt'into table person_info;

+
| person_info.name  | person_info.constellation  | person_info.blood_type  |
+
| 孙悟空               | 白羊座                        | A                       |
| 娜娜                | 射手座                        | A                       |
| 宋宋                | 白羊座                        | B                       |
| 猪八戒               | 白羊座                        | A                       |
| 凤姐                | 射手座                        | A                       |
+

1、将星座和血型拼接起来
select
name,
concat(constellation,',',blood_type) xx
from person_info;
2、分组后，将组内的数据收集到一个集合中
    collect_list(name)
    将集合中的数据用分隔符拼接起来
    concat_ws('|',list)

3、
select
xx,
concat_ws(',',collect_list(name))
from
(select
name,
concat(constellation,',',blood_type) xx
from person_info) a
group by xx;

4.4 列转行


    这一行是给虚拟表取个名字为temp 然后这个虚拟表只有一个字段，给这个字段取名为cate

《疑犯追踪》  悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》   战争,动作,灾难

《疑犯追踪》      悬疑
《疑犯追踪》      动作
《疑犯追踪》      科幻
《疑犯追踪》      剧情
《Lie to me》   悬疑
《Lie to me》   警匪
《Lie to me》   动作
《Lie to me》   心理
《Lie to me》   剧情
《战狼2》        战争
《战狼2》        动作
《战狼2》        灾难

drop table tb_movie;
create table tb_movie(
    name string,
    category string
)
row format delimited fields terminated by "\t";
load data local inpath "/opt/data/movie.txt" into table tb_movie;

+
| tb_movie.name  | tb_movie.category  |
+
| 《疑犯追踪》         | 悬疑,动作,科幻,剧情        |
| 《Lie to me》    | 悬疑,警匪,动作,心理,剧情     |
| 《战狼2》          | 战争,动作,灾难           |
+

select
explode(split(category,','))
from
tb_movie;

+
| col  |
+
| 悬疑   |
| 动作   |
| 科幻   |
| 剧情   |
| 悬疑   |
| 警匪   |
| 动作   |
| 心理   |
| 剧情   |
| 战争   |
| 动作   |
| 灾难   |
+

select
name,
cate
from
tb_movie
lateral view
explode(split(category,',')) temp as cate;

    这一行是给虚拟表取个名字为temp 然后这个虚拟表只有一个字段，给这个字段取名为cate
    此时这一行的结果便是一个表，不需要
    select explode(split(category,',')) temp as cate from tb_movie；

    是一种左侧视图连接 ，因为此时正常的join无法对应，所以使用lateral view 进行连接
    连接的时候会自动去找explode中的字段是从哪一个name下切开的，然后进行连接
    连接后形式如下：
+
|     name     | cate  |
+
| 《疑犯追踪》       | 悬疑    |
| 《疑犯追踪》       | 动作    |
| 《疑犯追踪》       | 科幻    |
| 《疑犯追踪》       | 剧情    |
| 《Lie to me》  | 悬疑    |
| 《Lie to me》  | 警匪    |
| 《Lie to me》  | 动作    |
| 《Lie to me》  | 心理    |
| 《Lie to me》  | 剧情    |
| 《战狼2》        | 战争    |
| 《战狼2》        | 动作    |
| 《战狼2》        | 灾难    |
+

4.5开窗函数

4.5.1 开窗分组，组内操作


    便会根据name进行分组，开窗，强大之处在于可以对分组后的组内字段数据进行操作

    substring函数，可以将字段进行剪切

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

select
    t.*,a.counts
from(
    select
        name,count(1) as counts
    from t
    group by name
    ) a
join t
on a.name = t.name

select
    name,
    ,tm
    ,count(1) over(partition by name ) as counts
from t
order by name,tm

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/data/business.txt" into table business;

select
*,
count(1) over(partition by name)
from
business
;

+
| business.name  | business.orderdate  | business.cost  | count_window_0  |
+
| jack           | 2017-01-05          | 46             | 5               |
| jack           | 2017-01-08          | 55             | 5               |
| jack           | 2017-01-01          | 10             | 5               |
| jack           | 2017-04-06          | 42             | 5               |
| jack           | 2017-02-03          | 23             | 5               |
| mart           | 2017-04-13          | 94             | 4               |
| mart           | 2017-04-11          | 75             | 4               |
| mart           | 2017-04-09          | 68             | 4               |
| mart           | 2017-04-08          | 62             | 4               |
| neil           | 2017-05-10          | 12             | 2               |
| neil           | 2017-06-12          | 80             | 2               |
| tony           | 2017-01-04          | 29             | 3               |
| tony           | 2017-01-02          | 15             | 3               |
| tony           | 2017-01-07          | 50             | 3               |
+

select
*,
count(1) over(partition by name,substring(orderdate,0,7))
from
business

+
| business.name  | business.orderdate  | business.cost  | count_window_0  |
+
| jack           | 2017-01-05          | 46             | 3               |
| jack           | 2017-01-08          | 55             | 3               |
| jack           | 2017-01-01          | 10             | 3               |
| jack           | 2017-02-03          | 23             | 1               |
| jack           | 2017-04-06          | 42             | 1               |
| mart           | 2017-04-13          | 94             | 4               |
| mart           | 2017-04-11          | 75             | 4               |
| mart           | 2017-04-09          | 68             | 4               |
| mart           | 2017-04-08          | 62             | 4               |
| neil           | 2017-05-10          | 12             | 1               |
| neil           | 2017-06-12          | 80             | 1               |
| tony           | 2017-01-04          | 29             | 3               |
| tony           | 2017-01-02          | 15             | 3               |
| tony           | 2017-01-07          | 50             | 3               |
+

4.5.2行数据相加


 select
 *,
 sum(cost) over (partition by name order by orderdate)
 from business;

 select *,
 sum(cost) over (partition by name order by orderdate rows between unbounded preceding and unbounded following)
 from business;

select *,
 sum(cost) over (partition by name order by orderdate rows between 1 preceding and current row)
 from business;

  select *,
 sum(cost) over (partition by name order by orderdate rows between 1 preceding and 1 following)
 from business;

  select *,
 sum(cost) over (partition by name order by orderdate rows between current row and unbounded following)
 from business;

4.5.3l练习_business表

（1）查询在2017年4月份购买过的顾客及总人数
select
name,
count(*) over()
from
business
where
substring(orderdate,1,7) = '2017-04' ;
group by name;
（2）查询顾客的购买明细及月购买总额
select
*,
sum(cost) over (partition by name,substring(orderdate,0,7))

from
business
;
（3）上述的场景,要将cost按照日期进行累加
select
*,
sum(cost) over(partition by name order by orderdate )
from
business
;
（4）查看顾客每次购买时候的上次的购买时间

select
*,
lag(orderdate,1,'first_buy') over (partition by name order by orderdate)
from
business
;
（5）查询前20%时间的订单信息

select
*
from
(
select
* ,
ntile(5) over(order by orderdate) n
from
business
) t
where t.n = 1;

select
*,
dense_rank() over(partition by subject order by score desc)
from
tb_score;

五、编号函数


 1、
数据：
name  subject score
孙悟空 语文  87
孙悟空 数学  95
孙悟空 英语  68
娜娜  语文  94
娜娜  数学  56
娜娜  英语  84
宋宋  语文  64
宋宋  数学  86
宋宋  英语  84
婷婷  语文  65
婷婷  数学  85
婷婷  英语  78
建表：
create table tb_score(
name string,
subject string,
score double
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath "/opt/data/tb_score.txt" into table tb_score;
表：
+
| tb_score.name  | tb_score.subject  | tb_score.score  |
+
| 孙悟空            | 语文                | 87.0            |
| 孙悟空            | 数学                | 95.0            |
| 孙悟空            | 英语                | 68.0            |
| 娜娜             | 语文                | 94.0            |
| 娜娜             | 数学                | 56.0            |
| 娜娜             | 英语                | 84.0            |
| 宋宋             | 语文                | 64.0            |
| 宋宋             | 数学                | 86.0            |
| 宋宋             | 英语                | 84.0            |
| 婷婷             | 语文                | 65.0            |
| 婷婷             | 数学                | 85.0            |
| 婷婷             | 英语                | 78.0            |
+
1）按科目的成绩进行排序
select
*,
row_number() over(partition by subject order by score desc) rn
from
tb_score;
+
| tb_score.name  | tb_score.subject  | tb_score.score  | rn  |
+
| 孙悟空            | 数学                | 95.0            | 1   |
| 宋宋             | 数学                | 86.0            | 2   |
| 婷婷             | 数学                | 85.0            | 3   |
| 娜娜             | 数学                | 56.0            | 4   |
| 宋宋             | 英语                | 84.0            | 1   |
| 娜娜             | 英语                | 84.0            | 2   |
| 婷婷             | 英语                | 78.0            | 3   |
| 孙悟空            | 英语                | 68.0            | 4   |
| 娜娜             | 语文                | 94.0            | 1   |
| 孙悟空            | 语文                | 87.0            | 2   |
| 婷婷             | 语文                | 65.0            | 3   |
| 宋宋             | 语文                | 64.0            | 4   |
+
2）按科目进行排序
select
*,
rank() over(partition by subject order by score desc) rn
from
tb_score;
+
| tb_score.name  | tb_score.subject  | tb_score.score  | rn  |
+
| 孙悟空            | 数学                | 95.0            | 1   |
| 宋宋             | 数学                | 86.0            | 2   |
| 婷婷             | 数学                | 85.0            | 3   |
| 娜娜             | 数学                | 56.0            | 4   |
| 宋宋             | 英语                | 84.0            | 1   |
| 娜娜             | 英语                | 84.0            | 1   |
| 婷婷             | 英语                | 78.0            | 3   |
| 孙悟空            | 英语                | 68.0            | 4   |
| 娜娜             | 语文                | 94.0            | 1   |
| 孙悟空            | 语文                | 87.0            | 2   |
| 婷婷             | 语文                | 65.0            | 3   |
| 宋宋             | 语文                | 64.0            | 4   |
+
3）按科目进行排序
select
*,
dense_rank() over(partition by subject order by score desc) rn
from
tb_score;
+
| tb_score.name  | tb_score.subject  | tb_score.score  | rn  |
+
| 孙悟空            | 数学                | 95.0            | 1   |
| 宋宋             | 数学                | 86.0            | 2   |
| 婷婷             | 数学                | 85.0            | 3   |
| 娜娜             | 数学                | 56.0            | 4   |
| 宋宋             | 英语                | 84.0            | 1   |
| 娜娜             | 英语                | 84.0            | 1   |
| 婷婷             | 英语                | 78.0            | 2   |
| 孙悟空            | 英语                | 68.0            | 3   |
| 娜娜             | 语文                | 94.0            | 1   |
| 孙悟空            | 语文                | 87.0            | 2   |
| 婷婷             | 语文                | 65.0            | 3   |
| 宋宋             | 语文                | 64.0            | 4   |
+

六、连续登陆天数

需求：求每个人最大的连续消费天数
数据：
a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600
建表：
create table shop(
name string ,
cday string ,
cost int
)
row format delimited fields terminated by "," ;
load data local inpath "/opt/data/shop.txt" into table shop ;
+
| shop.name  |  shop.cday  | shop.cost  |
+
| a          | 2017-02-05  | 200        |
| a          | 2017-02-06  | 300        |
| a          | 2017-02-07  | 200        |
| a          | 2017-02-08  | 400        |
| a          | 2017-02-10  | 600        |
| b          | 2017-02-05  | 200        |
| b          | 2017-02-06  | 300        |
| b          | 2017-02-08  | 200        |
| b          | 2017-02-09  | 400        |
| b          | 2017-02-10  | 600        |
| c          | 2017-01-31  | 200        |
| c          | 2017-02-01  | 300        |
| c          | 2017-02-02  | 200        |
| c          | 2017-02-03  | 400        |
| c          | 2017-02-10  | 600        |
| a          | 2017-03-01  | 200        |
| a          | 2017-03-02  | 300        |
| a          | 2017-03-03  | 200        |
| a          | 2017-03-04  | 400        |
| a          | 2017-03-05  | 600        |
+
查询语句：
第一步：打上编号
select
*,
row_number() over(partition by name order by cday)
from
shop;
+
| shop.name  |  shop.cday  | shop.cost  | row_number_window_0  |
+
| a          | 2017-02-05  | 200        | 1                    |
| a          | 2017-02-06  | 300        | 2                    |
| a          | 2017-02-07  | 200        | 3                    |
| a          | 2017-02-08  | 400        | 4                    |
| a          | 2017-02-10  | 600        | 5                    |
| a          | 2017-03-01  | 200        | 6                    |
| a          | 2017-03-02  | 300        | 7                    |
| a          | 2017-03-03  | 200        | 8                    |
| a          | 2017-03-04  | 400        | 9                    |
| a          | 2017-03-05  | 600        | 10                   |
| b          | 2017-02-05  | 200        | 1                    |
| b          | 2017-02-06  | 300        | 2                    |
| b          | 2017-02-08  | 200        | 3                    |
| b          | 2017-02-09  | 400        | 4                    |
| b          | 2017-02-10  | 600        | 5                    |
| c          | 2017-01-31  | 200        | 1                    |
| c          | 2017-02-01  | 300        | 2                    |
| c          | 2017-02-02  | 200        | 3                    |
| c          | 2017-02-03  | 400        | 4                    |
| c          | 2017-02-10  | 600        | 5                    |
+
第二步：将时间和编号相减
select
name,
date_sub(cday,nu) dnum
from
(
select
*,
row_number() over(partition by name order by cday) nu
from
shop
) a1
;
第三步：分组聚合
 select
 name,
 count(1) over(partition by name,dnum) counts
 from
  select
  name,
  date_sub(cday,nu) dnum
  from
  (
  select
  *,
  row_number() over(partition by name order by cday) nu
  from
  shop
  ) a1
 )a2

;
第四步：找最大连续登陆天数(连续登陆天数大于3天)
select
name,
max(counts) max_in
from
(
 select
 name,
 count(1) over(partition by name,dnum) counts
 from
 (
  select
  name,
  date_sub(cday,nu) dnum
  from
  (
  select
  *,
  row_number() over(partition by name order by cday) nu
  from
  shop
  ) a1
 )a2
)a3
where counts>3
group by name
;
+
| name  | max_in  |
+
| a     | 5       |
| c     | 4       |
+

七、集合数据类型


songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing-
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

songsong,
array<string>  bingbing_lili,
map<string,int> xiao song:18_xiaoxiao song:19,
struct<streat:string,city:string> hui long guan_beijing-

create table tb_dt(
name  string  ,
fs  array<string> ,
chs map<string , int>  ,
addr struct<street:string , city:String>
)
row format delimited fields terminated by ','
collection items  terminated by '_'
map  keys terminated by ':' ;
load  data local inpath '/data/dt.txt' into table tb_dt ;


- arr[index]   取值
- size(arr)     长度
- 数组角标越界返回NULL


map_keys   取所有的key集合
map_value   取所有的value集合
map[key]    根据key取value

八、分区和分桶

动态分区：

- 创建普通表   导入数据
- 开启动态分区
- 开启非严格模式
- 创建分区表
- 通过insert  into partition  select  from  语法导入数据


1,zss,bj
2,lss,bj
3,tg,sh
4,xg,bj
5,ln,sd
5,lln,sd
6,yg,sh
6,yyg,sh


1. 创建普通表  导入数据

drop table tb_teacher ;
create table tb_teacher(
    id string ,
    name string ,
    city string
)
row format delimited fields terminated by ','  ;
load data local inpath '/data/tc.txt' into table  tb_teacher ;
+
| tb_teacher.id  | tb_teacher.name  | tb_teacher.city  |
+
| 1              | zss              | bj               |
| 2              | lss              | bj               |
| 3              | tg               | sh               |
| 4              | xg               | bj               |
| 5              | ln               | sd               |
| 6              | yg               | sh               |
+


2.  设置相关参数

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrick;

3.  创建分区表

create table tb_dynamic_teacher(
    id string ,
    name string ,
    city string
)
partitioned by (ct string) ;



4 导入数据到分区表中

`sql
insert  into  table  tb_dynamic_teacher partition(ct)
select id , name ,city , city as  ct from tb_teacher ;

show  partitions tb_dynamic_teacher ;
+
| partition  |
+
| ct=bj      |
| ct=sd      |
| ct=sh      |
+

分桶：

创建普通表
创建分区表
设置参数
通过查询普通表向分区表中插入数据

数据：
1001    ss1
1002    ss2
1003    ss3
1004    ss4
1005    ss5
1006    ss6
1007    ss7
1008    ss8
1009    ss9
1010    ss10
1011    ss11
1012    ss12
1013    ss13
1014    ss14
1015    ss15
1016    ss16
创建普通表：
create table tb_student (
id int,
name string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/data/student.txt' into table tb_student;
创建分桶表：
create table tb_bluck_student (
id int,
name string
)
clustered by(name) into 3 buckets;
查看分通表的结构：
desc formatted tb_bluck_student;
设置分桶的参数：
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

向分桶表中导入数据：
insert into table tb_bluck_student select * from tb_student;

分区分桶：


创建分区分桶表：按照city分区id分桶
三个区，每个区有两个桶
创建普通表

创建普通表：
tb-teacher
create table tb_teacher2(
    id string ,
    name string ,
    city string
)
row format delimited fields terminated by ','  ;
load data local inpath '/data/teacher.txt' into table  tb_teacher2 ;
创建分区分桶表：按照名字分区id分桶
drop table tb_dynamic_blucks_teacher;
create table tb_dynamic_blucks_teacher(
id string,
 name string,
 city string
)
partitioned by (c string)
clustered by (id) into 2 buckets;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

导入数据：
insert into table tb_dynamic_blucks_teacher partition(c)  select *,city c from tb_teacher2;

九、文件格式

详细描述见具体文档：

常用的三种文件格式：
--TextFile：不做任何压缩
--Orc     ：压缩比最高，但安全性没有parquent高
--Parquent：压缩比高于textFile，低于Orc，安全性高于Orc

在创建表的时候可以指定存储的文件格式，默认是textFile格式
stored as textfile ;

十、自定义函数