Hive Command Notes

For the official Hive documentation, see http://wiki.apache.org/hadoop/Hive/LanguageManual .

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type
    [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name [ASC|DESC], ...)]
  INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses the exception.

The EXTERNAL keyword creates an external table, with a LOCATION clause pointing at the actual data. When Hive creates an internal (managed) table, it moves the data into the warehouse path; when it creates an external table, it only records where the data lives and does not move it. When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the data is left in place.

LIKE lets the user copy an existing table definition without copying its data.
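
A sketch of the options above (the table, column, and path names here are invented for illustration):

```sql
-- External table: Hive records the LOCATION but never moves the data;
-- DROP TABLE removes only the metadata.
CREATE EXTERNAL TABLE IF NOT EXISTS page_view_ext (
  viewTime INT,
  userid BIGINT COMMENT 'id of the viewing user',
  page_url STRING
)
COMMENT 'page views, data managed outside the warehouse'
PARTITIONED BY (dt STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/external/page_view';

-- Copy only the table definition, not the data.
CREATE TABLE page_view_copy LIKE page_view_ext;
```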

Users can customize the SerDe, or use a built-in SerDe, when creating a table. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, a built-in SerDe is used. The table's columns must also be specified at creation time; when a custom SerDe is supplied, Hive determines the data of the table's columns through that SerDe.

If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

Tables with partitions are created with the PARTITIONED BY clause. A table can have one or more partitions, each stored in its own directory. In addition, both tables and partitions can be bucketed on columns with CLUSTERED BY, and the data within each bucket can be ordered with SORTED BY. This can improve performance for certain kinds of queries.

Table and column names are case-insensitive; SerDe and property names are case-sensitive. Table and column comments are strings.

Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and retains the data.

The ALTER TABLE statement lets the user change the structure of an existing table: add columns or partitions, change the SerDe, add table and SerDe properties, or rename the table itself.

Add Partitions

ALTER TABLE table_name ADD
  partition_spec [ LOCATION 'location1' ]
  partition_spec [ LOCATION 'location2' ] ...

partition_spec:
  : PARTITION (partition_col = partition_col_value,
        partition_col = partition_col_value, ...)

Users can add partitions to a table with ALTER TABLE ADD PARTITION. Quote the partition value when it is a string.

ALTER TABLE page_view ADD
    PARTITION (dt='2008-08-08', country='us')
      location '/path/to/us/part080808'
    PARTITION (dt='2008-08-09', country='us')
      location '/path/to/us/part080809';

DROP PARTITION

ALTER TABLE table_name DROP
    partition_spec, partition_spec,...

Users can drop partitions with ALTER TABLE DROP PARTITION. The partition's metadata and data are deleted together.

ALTER TABLE page_view
    DROP PARTITION (dt='2008-08-08', country='us');

RENAME TABLE

ALTER TABLE table_name RENAME TO new_table_name

This command lets the user rename a table. The location of the data and the partition names do not change. In other words, the old table name is not "released": changes made against the old location still affect the new table's data.
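
A minimal sketch (table names invented for illustration):

```sql
ALTER TABLE page_view RENAME TO page_view_archive;
```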

Change Column Name/Type/Position/Comment

ALTER TABLE table_name CHANGE [COLUMN]
  col_old_name col_new_name column_type
    [COMMENT col_comment]
    [FIRST|AFTER column_name]

This command lets the user change a column's name, data type, comment, or position.

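
For example (table and column names here are illustrative):

```sql
-- Rename column a of table test_change to a1.
ALTER TABLE test_change CHANGE a a1 INT;

-- Rename a1 to a2, change its type to STRING, and place it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;

-- Move column c to the first position.
ALTER TABLE test_change CHANGE c c INT FIRST;
```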

Note: changing a column only changes the Hive metadata, not the actual data. It is up to the user to keep the metadata definition consistent with the actual data layout.

Add/Replace Columns

ALTER TABLE table_name ADD|REPLACE
  COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD COLUMNS lets the user append new columns after the existing columns (but before the partition columns).

REPLACE COLUMNS removes the existing columns and adds the new set of columns. This is allowed only for tables with a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
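
A sketch of both forms (names invented for illustration):

```sql
-- Append two new columns at the end (before any partition columns).
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'new column', e INT);

-- Replace the whole column list with a new one (native SerDe tables only).
ALTER TABLE test_change REPLACE COLUMNS (key INT, value STRING);
```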

Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties
table_properties:
  : (property_name = property_value, property_name = property_value, ... )

Users can add metadata to a table with this command. Currently the last_modified_user and last_modified_time properties are managed automatically by Hive. Users can add their own properties to the list, and retrieve them with DESCRIBE EXTENDED table_name.
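
A sketch (the property keys and values here are made up; any user-defined key is allowed):

```sql
-- Attach user-defined metadata to the table.
ALTER TABLE page_view SET TBLPROPERTIES (
  'source' = 'frontend click logs',
  'retention_days' = '90'
);

-- The properties appear in the extended table description.
DESCRIBE EXTENDED page_view;
```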

Add Serde Properties

ALTER TABLE table_name
    SET SERDE serde_class_name
    [WITH SERDEPROPERTIES serde_properties]

ALTER TABLE table_name
    SET SERDEPROPERTIES serde_properties

serde_properties:
  : (property_name = property_value,
    property_name = property_value, ... )

This command lets the user add user-defined metadata to the table's SerDe object. Hive initializes the SerDe with these properties and passes them to the table's SerDe when serializing and deserializing data. This gives users a place to store properties for a custom SerDe.
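
A sketch of both forms; the SerDe class name and regex here are assumptions for illustration:

```sql
-- Switch the table to a regex-based SerDe with its own properties.
ALTER TABLE apache_log
    SET SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*) (.*)');

-- Or change a property of the table's existing SerDe only.
ALTER TABLE apache_log
    SET SERDEPROPERTIES ('serialization.null.format' = '\\N');
```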

Alter Table File Format and Organization

ALTER TABLE table_name SET FILEFORMAT file_format
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS

These commands change the table's physical storage properties.
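
A sketch of both forms (column names borrowed from the earlier page_view example; the bucket count is arbitrary):

```sql
-- Store the table as SequenceFile from now on.
ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;

-- Bucket by userid into 32 buckets, sorted by viewTime within each bucket.
ALTER TABLE page_view CLUSTERED BY (userid)
  SORTED BY (viewTime) INTO 32 BUCKETS;
```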

No transformation is applied to the data when it is loaded into a table. The Load operation simply copies or moves the data files into the location corresponding to the Hive table.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
    INTO TABLE tablename
    [PARTITION (partcol1=val1, partcol2=val2 ...)]

Synopsis:

The Load operation is a pure copy/move: it moves data files into the location corresponding to the Hive table.

  • filepath can be:
  • a relative path, e.g. project/data1
  • an absolute path, e.g. /user/hive/project/data1
  • a full URI with scheme, e.g. hdfs://namenode:9000/user/hive/project/data1
  • The target of the load can be a table or a partition. If the table is partitioned, a partition must be specified by its partition column values.
  • filepath can refer to a single file (in which case Hive moves the file into the table's directory) or to a directory (in which case Hive moves every file in the directory into the table's directory).
  • If LOCAL is specified:
  • the load command looks for filepath in the local file system. A relative path is interpreted relative to the user's current working directory. The user can also give a full URI for a local file, e.g. file:///user/hive/project/data1.
  • the load command copies the files in filepath to the target file system (determined by the table's location attribute), and the copied data files are moved into the location corresponding to the table's data.
  • If the LOCAL keyword is not specified, and filepath is a full URI, Hive uses that URI directly. Otherwise:
  • if no schema or authority is given, Hive uses the schema and authority from the Hadoop configuration, where fs.default.name specifies the Namenode URI;
  • if the path is not absolute, Hive interprets it relative to /user/;
  • Hive moves the files specified by filepath into the path corresponding to the table (or partition).
  • If the OVERWRITE keyword is used, the contents of the target table (or partition), if any, are deleted first, and then the contents of the files/directories pointed to by filepath are added to the table/partition.
  • If the target table (partition) already contains a file whose name conflicts with a file name in filepath, the existing file is replaced by the new one.
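
Putting the above together (paths and partition values invented for illustration):

```sql
-- Load from the local file system, replacing the partition's current contents.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
    OVERWRITE INTO TABLE page_view
    PARTITION (dt='2008-06-08', country='US');

-- Load from HDFS, appending to the table.
LOAD DATA INPATH '/user/hive/staging/pv.txt' INTO TABLE page_view_stg;
```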

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [
    CLUSTER BY col_list
    | [DISTRIBUTE BY col_list]
    [SORT BY col_list]
  ]
  [LIMIT number]

  • A SELECT statement can be part of a union query or a subquery.
  • table_reference is the input of the query; it can be a regular table, a view, a join, or a subquery.
  • Simple query. For example, the following statement retrieves all columns from table t1.

SELECT * FROM t1

WHERE Clause

where_condition is a boolean expression. For example, the following query returns only the sales records with an amount greater than 10 in the US region. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

SELECT * FROM sales WHERE amount > 10 AND region = "US"

ALL and DISTINCT Clauses

The ALL and DISTINCT options control how duplicate records are handled. The default is ALL, which returns all matching records. DISTINCT removes duplicate records from the result.

hive> SELECT col1, col2 FROM t1
    1 3
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1, col2 FROM t1
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1 FROM t1
    1
    2

Partition-Based Queries

A plain SELECT query scans the entire table (unless it is a sampling query). If a table is created with the PARTITIONED BY clause, however, a query can take advantage of input pruning and scan only the part of the table it actually needs. Hive's current implementation applies partition pruning only when the partition predicates appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned on the date column, the following query reads only the data of the qualifying partitions.

SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'

HAVING Clause

Hive does not currently support the HAVING clause. A HAVING clause can be rewritten as a subquery. For example:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

can be expressed as:

SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum
  FROM t1 GROUP BY col1) t2
  WHERE t2.col2sum > 10

LIMIT Clause

LIMIT restricts the number of records returned. The records are chosen at random. The following query retrieves 5 records from table t1 at random:

SELECT * FROM t1 LIMIT 5

Top-k queries. The following query retrieves the 5 sales records with the largest amounts:

SET mapred.reduce.tasks = 1
  SELECT * FROM sales SORT BY amount DESC LIMIT 5

REGEX Column Specification

A SELECT statement can use a regular expression as a column specification. The following query selects all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM sales
Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER]
    JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN
      table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression
Hive supports only equality joins, outer joins, and left semi joins. Hive does not support non-equality join conditions, because they are very hard to translate into map/reduce tasks. Hive does support joins of more than two tables. Keep the following points in mind when writing join queries:

1. Only equality joins are supported. For example:

  SELECT a.* FROM a JOIN b ON (a.id = b.id)
  SELECT a.* FROM a JOIN b
    ON (a.id = b.id AND a.department = b.department)

are correct, whereas:

  SELECT a.* FROM a JOIN b ON (a.id <> b.id)

is wrong.

2. More than two tables can be joined, for example:

  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

3. If every join in the query uses the same join key, the join is converted into a single map/reduce task. For example:

  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c
    ON (c.key = b.key1)

is converted into a single map/reduce task, because only b.key1 is used as a join key. By contrast:

  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
  JOIN c ON (c.key = b.key2)

is converted into two map/reduce tasks, because b.key1 is used in the first join condition and b.key2 in the second.

4. In each map/reduce task of a join, the reducer buffers the records of every table in the join sequence except the last one, and streams the join results to the file system while reading the last table. This implementation helps reduce memory usage on the reduce side; in practice, you should put the largest table last (otherwise a lot of memory is wasted on buffering). For example:

  SELECT a.val, b.val, c.val FROM a
    JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

All tables share the same join key, so one map/reduce task is used: the reduce side buffers the records of tables a and b, then computes the join result each time it receives a record of table c. Similarly:

  SELECT a.val, b.val, c.val FROM a
    JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

uses two map/reduce tasks: the first buffers table a and streams table b; the second buffers the result of the first task and streams table c.

5. The LEFT, RIGHT, and FULL OUTER keywords control how non-matching records are handled. For example:

  SELECT a.val, b.val FROM a LEFT OUTER
    JOIN b ON (a.key=b.key)

outputs one record for every record in table a: a.val, b.val when a matching b.key exists for a.key, and a.val, NULL when no matching record is found in b. The clause "FROM a LEFT OUTER JOIN b" must be written on one line; it means that a is to the left of b, so all records of table a are kept. "a RIGHT OUTER JOIN b" keeps all records of table b instead. OUTER JOIN semantics follow the standard SQL specification.

Joins happen before the WHERE clause. To restrict the output of a join, write the filter either in the WHERE clause or in the join clause. A point that is easy to get wrong involves partitioned tables:

  SELECT a.val, b.val FROM a
  LEFT OUTER JOIN b ON (a.key=b.key)
  WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'

joins a to b (OUTER JOIN) and lists a.val and b.val. Other columns can be used as filters in the WHERE clause, but, as described above, when table b has no record matching table a, all columns of b are NULL, including ds. The WHERE condition on b.ds therefore discards every record of a that had no match in b, so the LEFT OUTER JOIN produces no more than an inner join would. The solution is to move the condition into the OUTER JOIN's ON clause:

  SELECT a.val, b.val FROM a LEFT OUTER JOIN b
  ON (a.key=b.key AND
      b.ds='2009-07-07' AND
      a.ds='2009-07-07')

The result of this query is pre-filtered in the join phase, so the problem does not arise. The same logic applies to RIGHT and FULL joins.

Joins cannot be reordered freely: both LEFT and RIGHT joins are left-associative.

  SELECT a.val1, a.val2, b.val, c.val
  FROM a
  JOIN b ON (a.key = b.key)
  LEFT OUTER JOIN c ON (a.key = c.key)

first joins a to b, discarding every record whose join keys do not match, and then joins the intermediate result to c. A subtle consequence: if a key exists in both a and c but not in b, the whole record is dropped in the first join, a JOIN b (including a.val1, a.val2, and a.key); when the intermediate result is then joined to c, any c.key that would have matched a.key or b.key produces: NULL, NULL, NULL, c.val.

LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Hive does not currently implement IN/EXISTS subqueries, so LEFT SEMI JOIN can be used to rewrite them. The restriction is that the right-hand table of the join may only be referenced in the ON clause, never in the WHERE clause, the SELECT clause, or anywhere else. A query such as:

  SELECT a.key, a.val
  FROM a
  WHERE a.key IN (SELECT b.key FROM b)

can be rewritten as:

   SELECT a.key, a.val
   FROM a LEFT SEMI JOIN b on (a.key = b.key)

Original: https://www.cnblogs.com/hwaggLee/p/7840188.html
Author: 243573295
Title: hive-命令操作记录
