Hive Tutorial
官網原文連結:https://cwiki.apache.org/confluence/display/Hive/Tutorial
- Hive Tutorial
- 1、Concepts
- 2、Usage and Examples
- 2.1、Creating, Showing, Altering, and Dropping Tables
- 2.1.1、Creating Tables
- 2.1.2、Browsing Tables and Partitions
- 2.1.3、Altering Tables
- 2.1.4、Dropping Tables and Partitions
- 2.2、Loading Data
- 2.3、Querying and Inserting Data
- 2.3.1、Simple Query
- 2.3.2、Partition Based Query
- 2.3.3、Joins
- 2.3.4、Aggregations
- 2.3.5、Multi Table/File Inserts
- 2.3.6、Dynamic-Partition Insert
- 2.3.7、Inserting into Local Files
- 2.3.8、Sampling
- 2.3.9、Union All
- 2.3.10、Array Operations
- 2.3.11、Map (Associative Arrays) Operations
- 2.3.12、Custom Map/Reduce Scripts
- 2.3.13、Co-Groups
1、Concepts
1.1、What Is Hive
Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.
Hive是一個基於 Apache Hadoop 的資料倉儲基礎設施。Hadoop 為在商業硬體上儲存和處理資料提供了大規模的向外擴充套件和容錯能力。
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).
Hive 可以方便地對大量資料進行彙總、ad-hoc 查詢和分析。
它提供了 SQL,使使用者能夠輕鬆地進行 ad-hoc 查詢、彙總和資料分析。同時,Hive 的 SQL 給使用者提供了多個地方來整合他們自己的功能來做自定義分析,比如使用者定義函式(UDFs)。
1.2、What Hive Is NOT
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.
Hive 不是為線上事務處理而設計的。它最適合用於傳統的資料倉儲任務。
1.3、Getting Started
For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.
關於 Hive、HiveServer2 和 Beeline 的詳細設定,請參考 GettingStarted guide。
Books about Hive lists some books that may also be helpful for getting started with Hive.
關於 Hive 的書中列出了一些可能對開始使用 Hive 有幫助的書。
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.
在下面的章節中,我們將提供關於該系統功能的教程。我們首先描述資料型別、表和分割槽的概念(與傳統關係 DBMS 非常相似),然後通過一些例子說明 Hive 的功能。
1.4、Data Units
In the order of granularity - Hive data is organized into:
hive 資料有如下組織形式:
Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.
- 資料庫:名稱空間,用於避免表、檢視、分割槽、列等的命名衝突。資料庫還可以用於為一個使用者或一組使用者實施安全性。
Tables: Homogeneous units of data which have the same schema. An example of a table could be page_views table, where each row could comprise of the following columns (schema):
- timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
- userid—which is of BIGINT type that identifies the user who viewed the page.
- page_url—which is of STRING type that captures the location of the page.
- referer_url—which is of STRING type that captures the location of the page from where the user arrived at the current page.
- IP—which is of STRING type that captures the IP address from where the page request was made.
- 表:具有相同schema的同種資料單元。一個表的示例就是 page_views 表,表中每行都由下面的列(schema)組成:
- timestamp:INT型別,對應頁面被瀏覽時的 UNIX 時間戳
- userid:BIGINT型別,標識瀏覽該頁面的使用者
- page_url:STRING型別,記錄頁面的位置
- referer_url:STRING型別,記錄使用者到達當前頁面之前所在頁面的位置
- IP:STRING型別,記錄發出頁面請求的 IP 地址
Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy a specified criteria; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns, they are not part of the data itself but are derived on load.
- 分割槽:
每個表有一個或多個分割槽 key ,這些 key 決定了資料如何被儲存。
除了作為儲存單元之外,分割槽還允許使用者有效地標識滿足指定條件的行。例如,STRING 型別的 date_partition 和 STRING 型別的 country_partition。
分割槽 key 的每個唯一值定義了表的一個分割槽。例如,所有日期為"2009-12-23"的"US"資料就是 page_views 表的一個分割槽。
因此,如果基於"US"下2009-12-23的資料分析,你可以只在相關分割槽下執行查詢。從而,可以提高分析效率。
然而,僅僅因為一個分割槽命名為2009-12-23,並不意味著它包含或僅包含該日期的所有資料,分割槽以日期命名是為了方便。
保證分割槽名和資料內容之間的關係是使用者的工作!分割槽列是虛擬列,它們不是資料本身的一部分,而是在載入時派生的。
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the data.
- 分桶:
通過計算表的某些列的 hash 值,分割槽中的資料再被劃分到桶中。這可以被用來高效地抽樣資料。
例如,page_views 表根據 userid 分桶,userid 是 page_view 表的列之一,而不是分割槽列。
Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
分割槽或分桶並不是必要的。但是這些抽象允許系統在查詢處理期間刪除大量資料,從而加快查詢的執行。
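As a rough sketch of the pruning idea (using the page_view table and the dt/country partition columns that appear in the DDL examples of section 2.1.1 below), a query that restricts on the partition columns only needs to read the matching partition:
SELECT count(*)
FROM page_view
WHERE dt = '2009-12-23' AND country = 'US';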
1.5、Type System
Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.
Hive 支援基本和複雜資料型別,如下所述。有關更多資訊,請參閱 Hive Data Types。
1.5.1、Primitive Types
Types are associated with the columns in the tables. The following Primitive types are supported:
型別和表中的列相關,下面是支援的基本資料型別:
Integers
- TINYINT—1 byte integer
- SMALLINT—2 byte integer
- INT—4 byte integer
- BIGINT—8 byte integer
Boolean type
- BOOLEAN—TRUE/FALSE
Floating point numbers
- FLOAT—single precision
- DOUBLE—Double precision
Fixed point numbers
- DECIMAL—a fixed point value of user defined scale and precision
String types
- STRING—sequence of characters in a specified character set
- VARCHAR—sequence of characters in a specified character set with a maximum length
- CHAR—sequence of characters in a specified character set with a defined length
Date and time types
- TIMESTAMP—a date and time without a timezone ("LocalDateTime" semantics)
- TIMESTAMP WITH LOCAL TIME ZONE—a point in time measured down to nanoseconds ("Instant" semantics)
- DATE—a date
Binary types
- BINARY—a sequence of bytes
The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):
這些型別按以下層次結構組織(父例項是所有子例項的超型別):
- Type
  - Primitive Type
    - Number
      - DOUBLE
        - FLOAT
          - BIGINT
            - INT
              - SMALLINT
                - TINYINT
        - STRING
    - BOOLEAN
This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
這種型別層次結構定義瞭如何在查詢語言中隱式地轉換型別。
允許從子型別到祖先型別的隱式轉換。
因此,當查詢表示式期望型別1且資料為型別2時,如果在型別層次結構中,型別1是型別2的祖先,則型別2將隱式轉換為型別1。
請注意,型別層次結構允許隱式地將 STRING 轉換為 DOUBLE。
Explicit type conversion can be done using the cast operator as shown in the #Built In Functions section below.
顯式型別轉換可以使用強制轉換操作符完成,如下面的 #Built in Functions 一節所示。
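For illustration, a couple of casts against the page_view table used in the later examples (the literal values here are arbitrary):
SELECT CAST('100.5' AS DOUBLE) + 1,   -- explicit STRING-to-DOUBLE conversion
       CAST(userid AS STRING)         -- BIGINT rendered as a STRING
FROM page_view
LIMIT 1;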
1.5.2、Complex Types
Complex Types can be built up from primitive types and other composite types using:
複雜型別可以從基本型別和其他組合型別構建:
Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
- Structs:型別中的元素可以使用點號(.)訪問。例如,c 列的型別是 STRUCT {a INT; b INT},通過 c.a 訪問 a 欄位。
Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed using M['group']
- Maps(鍵值元組):使用 ['element name'] 訪問元素。例如,對映 M 包含 'group' -> gid 的對映,可以通過 M['group'] 訪問 gid 值。
Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
- Arrays(可索引的列表):陣列中的元素必須是相同的型別。可以使用 [n] 來訪問元素,n 是索引(從0開始)。例如,陣列 A 有元素 ['a', 'b', 'c'],那麼 A[1] 返回 'b'。
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:
使用基本型別和用於建立複雜型別的構造,可以建立具有任意巢狀級別的型別。例如,一個 User 型別可能包含以下欄位:
- gender—which is a STRING.
- active—which is a BOOLEAN.
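A minimal sketch that puts the three access notations together (the table name user_profiles and its columns are hypothetical, chosen only for this example):
CREATE TABLE user_profiles (
  userinfo   STRUCT<gender:STRING, active:BOOLEAN>,
  properties MAP<STRING,STRING>,
  friends    ARRAY<BIGINT>
);

SELECT userinfo.gender,       -- DOT notation for struct fields
       properties['group'],   -- ['key'] notation for maps
       friends[0]             -- [n] notation (zero-based) for arrays
FROM user_profiles;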
1.5.3、Timestamp
Timestamps have been the source of much confusion, so we try to document the intended semantics of Hive.
Timestamps 一直是很多困惑的根源,所以我們試圖記錄 Hive 的語義。
Timestamp ("LocalDateTime" semantics)
Java's "LocalDateTime" timestamps record a date and time as year, month, date, hour, minute, and seconds without a timezone. These timestamps always have those same values regardless of the local time zone.
Java 的 “LocalDateTime” 時間戳將日期和時間記錄為年、月、日、時、分和秒,而沒有時區。
無論本地時區是什麼,這些時間戳總是具有相同的值。
For example, the timestamp value of "2014-12-12 12:34:56" is decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available. It does not correspond to any specific instant. It will always be the same value regardless of the local time zone. Unless your application uses UTC consistently, timestamp with local time zone is strongly preferred over timestamp for most applications. When users say an event is at 10:00, it is always in reference to a certain timezone and means a point in time, rather than 10:00 in an arbitrary time zone.
例如,“2014-12-12 12:34:56” 的時間戳值被分解為年、月、日、小時、分鐘和秒欄位,但是沒有時區資訊可用。
它不對應於任何特定的時刻。它將始終是相同的值,無論當地時區是什麼。除非你的應用程式一致使用 UTC,否則具有當地時區的時間戳對於大多數應用程式來說都比時間戳更受歡迎。當使用者說一個事件在 10:00 時,它總是與某個時區有關,意思是一個時間點,而不是任意時區中的 10:00。
Timestamp with local time zone ("Instant" semantics)
Java's "Instant" timestamps define a point in time that remains constant regardless of where the data is read. Thus, the timestamp will be adjusted by the local time zone to match the original point in time.
Java 的 "Instant" 時間戳定義了一個無論從何處讀取資料都保持不變的時間點。因此,時間戳將根據當地時區調整,以匹配原始的時間點。
| Type | Value in America/Los_Angeles | Value in America/New_York |
|---|---|---|
| timestamp | 2014-12-12 12:34:56 | 2014-12-12 12:34:56 |
| timestamp with local time zone | 2014-12-12 12:34:56 | 2014-12-12 15:34:56 |
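A sketch of how the two flavours could be declared, assuming a Hive version that supports TIMESTAMP WITH LOCAL TIME ZONE (3.0 and later); the table name events is hypothetical:
CREATE TABLE events (
  local_ts   TIMESTAMP,                        -- "LocalDateTime" semantics
  instant_ts TIMESTAMP WITH LOCAL TIME ZONE    -- "Instant" semantics
);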
Comparisons with other tools
見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
1.6、Built In Operators and Functions
The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:
下面列出的操作符和函式不一定是最新的。在 Beeline 或 Hive 命令列中,使用這些命令顯示最新的文件:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
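For example, to look up one specific built-in (concat is just an arbitrary choice here):
DESCRIBE FUNCTION concat;
DESCRIBE FUNCTION EXTENDED concat;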
Case-insensitive. All Hive keywords are case-insensitive, including the names of Hive operators and functions.
不區分大小寫。所有 Hive 關鍵字不區分大小寫,包括 Hive 操作符和函式的名稱。
1.6.1、Built In Operators
Relational Operators:The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds or not.
- 關係操作符:下面的操作符比較傳入的運算元,生成一個 TRUE 或 FALSE 值,這取決於運算元之間的比較是否有效。
操作符表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Arithmetic Operators:The following operators support various common arithmetic operations on the operands. All of them return number types.
- 算術操作符:以下操作符支援對運算元進行各種常見的算術操作。它們都返回數字型別。
操作符表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Logical Operators:The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.
- 邏輯操作符:以下操作符提供了對建立邏輯表示式的支援。它們都返回布林值 TRUE 或 FALSE,這取決於運算元的布林值。
操作符表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Operators on Complex Types:The following operators provide mechanisms to access elements in Complex Types
- 在複雜型別上的操作符:下面的操作符提供了訪問複雜型別中元素的機制。
操作符表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
1.6.2、Built In Functions
Hive supports the following built in functions:(Function list in source code: FunctionRegistry.java)
- Hive 支援下面的內建函式:
函式表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
The following built in aggregate functions are supported in Hive:
- Hive 中支援下面內建的聚合函式:
函式表格見原文:https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
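As a small illustration only (the full lists are in the tables linked above), a few common built-ins applied to the page_view table from section 2.1.1:
-- Row-level functions: string concatenation, upper-casing, rounding.
SELECT concat(page_url, '?from=', referrer_url), upper(ip), round(viewTime / 60)
FROM page_view;

-- Aggregate functions over the whole table.
SELECT count(*), count(DISTINCT userid), max(viewTime)
FROM page_view;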
1.7、Language Capabilities
Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:
Hive 的 SQL 提供了基本的 SQL 操作。這些操作在表或分割槽上工作。這些操作是:
- Ability to filter rows from a table using a WHERE clause.
- Ability to select certain columns from the table using a SELECT clause.
- Ability to do equi-joins between two tables.
- Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
- Ability to store the results of a query into another table.
- Ability to download the contents of a table to a local (for example, nfs) directory.
- Ability to store the results of a query in a hadoop dfs directory.
- Ability to manage tables and partitions (create, drop and alter).
- Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.
- 能夠使用 WHERE 子句從表中過濾行。
- 能夠使用 SELECT 子句從表中選擇某些列。
- 能夠在兩個表之間進行 equi-joins。
- 能夠對儲存在一個表中的資料按多個 "group by" 列求聚合。
- 能夠將查詢結果儲存到另一個表中。
- 能夠將表的內容下載到本地目錄(例如 nfs)。
- 能夠將查詢結果儲存在 hadoop dfs 目錄中。
- 能夠管理表和分割槽(建立、刪除和修改)。
- 能夠以自選的語言插入自定義指令碼,用於自定義的 map/reduce 作業。
2、Usage and Examples
NOTE: Many of the following examples are out of date. More up to date information can be found in the LanguageManual.
注意:下面的許多例子是過時的。可以在 LanguageManual 查詢到更多資訊。
The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.
下面的例子突出了該系統的一些顯著特徵。
詳細的查詢測試用例集可以在 Hive Query Test Cases 中找到,相應的結果可以在 Query Test Case Results 中找到。
- Creating, Showing, Altering, and Dropping Tables
- Loading Data
- Querying and Inserting Data
2.1、Creating, Showing, Altering, and Dropping Tables
See Hive Data Definition Language for detailed information about creating, showing, altering, and dropping tables.
見 Hive Data Definition Language 檢視更多關於建立、展示、修改和刪除表的資訊。
2.1.1、Creating Tables
An example statement that would create the page_view table mentioned above would be like:
建立上面提到的 page_view 表的示例語句如下:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
In this example, the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001(ctrl-A) as the field delimiter and newline as the row delimiter.
在本例中,表的列使用相應的型別指定。
可以在列級別和表級別新增註釋。
此外,partitioned by 子句定義了分割槽列,分割槽列與資料列不同,實際上並不與資料一起儲存。
當以這種方式指定時,假設檔案中的資料以 ASCII 001(ctrl-A)作為欄位分隔符,以換行分隔符作為行分隔符。
The field delimiter can be parametrized if the data is not in the above format as illustrated in the following example:
如果資料不是上述格式,則可以引數化欄位分隔符:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
The row delimiter currently cannot be changed since it is determined not by Hive but by Hadoop delimiters.
行分隔符目前不能更改,因為它不是由 Hive 決定的,而是 Hadoop 決定的。
It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:
在某些列上對錶進行分桶是個好主意,以便對資料集執行有效的抽樣查詢。
如果不使用分桶,仍然可以對錶進行隨機抽樣,但由於查詢必須掃描所有資料,所以效率不高。
下面的例子說明了 page_view 表在 userid 列上分桶的情況:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column—in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.
在上面的例子中,表被 userid 的雜湊函式聚集到 32 個桶中。在每個桶中,資料按 viewTime 的遞增順序排序。
這樣的組織允許使用者在聚集列上進行有效的抽樣,在本例中為 userid。
排序屬性允許內部操作符利用已知的資料結構,同時以更高的效率計算查詢。
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
In this example, the columns that comprise of the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above, example represent the system defaults.
在本例中,組成錶行的列的指定方式與型別定義類似。
可以在列級和表級附加註釋。
此外,partitioned by 子句定義了分割槽列,分割槽列與資料列不同,實際上並不與資料一起儲存。
CLUSTERED BY 子句指定使用哪個列進行分桶以及建立多少個桶。
分隔的行格式指定了行在 hive 表中的儲存方式。
對於分隔格式,它指定如何終止欄位,集合(陣列或對映)中的項如何終止,以及如何終止對映鍵。
STORED AS SEQUENCEFILE 表示該資料以二進位制格式(使用hadoop SequenceFiles)儲存在 hdfs 上。
上面示例中顯示的 ROW FORMAT 和 STORED AS 子句的值表示系統預設值。
Table names and column names are case insensitive.
表名和列名不區分大小寫。
2.1.2、Browsing Tables and Partitions
SHOW TABLES;
To list existing tables in the warehouse; there are many of these, likely more than you want to browse.
列出倉庫中已存在的表;有很多,可能比你想瀏覽的要多。
SHOW TABLES 'page.*';
To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).
列出以 “page” 為字首的表。該模式遵循 Java 正規表示式語法(因此句點是萬用字元)。
SHOW PARTITIONS page_view;
To list partitions of a table. If the table is not a partitioned table then an error is thrown.
列出一個表的分割槽。如果表不是一個分割槽表,那麼就丟擲一個錯誤。
DESCRIBE page_view;
To list columns and column types of table.
列出表的列和列型別。
DESCRIBE EXTENDED page_view;
To list columns and all other properties of table. This prints lot of information and that too not in a pretty format. Usually used for debugging.
列出表的列和所有其他的屬性。這列印很多資訊,那也不是一個好看的格式。通常用於除錯。
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
To list columns and all other properties of a partition. This also prints lot of information which is usually used for debugging.
列出一個分割槽的列和所有其他的屬性。這列印很多資訊,那也不是一個好看的格式。通常用於除錯。
2.1.3、Altering Tables
To rename existing table to a new name. If a table with new name already exists then an error is returned:
將已存在的表重新命名為一個新名字。如果具有新名字的表已存在,就返回一個錯誤:
ALTER TABLE old_table_name RENAME TO new_table_name;
To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:
重新命名一個已存在表的列。確保使用相同的列型別,併為每個已存在的列都包含一個對應項:
ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
To add columns to an existing table:
給一個已存在的表新增列:
ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
Note that a change in the schema (such as the adding of the columns), preserves the schema for the old partitions of the table in case it is a partitioned table. All the queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.
注意,模式中的更改(如新增列)將保留表的舊分割槽的模式,以防它是一個分割槽表。
訪問這些列,並在舊分割槽上執行的所有查詢都會隱式地為這些列返回空值或指定的預設值。
In the later versions, we can make the behavior of assuming certain values as opposed to throwing an error in case the column is not found in a particular partition configurable.
在以後的版本中,我們可以讓這種行為變得可配置:當在某個特定分割槽中找不到該列時,是假定某個值,還是丟擲錯誤。
2.1.4、Dropping Tables and Partitions
Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes(this is a future feature) that would have been built on the table. The associated command is:
刪除表非常簡單。對錶進行刪除操作將隱式地刪除在表上構建的任何索引(這是將來的特性)。相關的命令是:
DROP TABLE pv_users;
To drop a partition, alter the table to drop the partition.
刪除一個分割槽。修改表以刪除分割槽。
ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
Note that any data for this table or partitions will be dropped and may not be recoverable.
注意,該表或分割槽的任何資料都將被刪除,並且可能無法恢復。
2.2、Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:
將資料載入到 Hive 表中有多種方式。
使用者可以建立一個外部表,該表指向 HDFS 中的指定位置。在這種特殊的用法中,使用者可以使用 HDFS 的 put 或 copy 命令將檔案複製到指定的位置,並建立一個指向該位置的表,其中包含所有相關的行格式資訊。
一旦完成,使用者就可以轉換資料,並將它們插入到任何其他 Hive 表中。例如,如果檔案 /tmp/pv_2008-06-08.txt 包含了 2008-06-08 的以逗號分隔的頁面瀏覽記錄,並且需要將其載入到 page_view 表的相應分割槽中,下面的命令序列可以實現這一點:
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
This code results in an error due to LINES TERMINATED BY limitation
由於 LINES TERMINATED BY 限制,這個程式碼會導致錯誤
FAILED: SemanticException 6:67 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''12''
See
[HIVE-5999](https://issues.apache.org/jira/browse/HIVE-5999) - Allow other characters for LINES TERMINATED BY OPEN
[HIVE-11996](https://issues.apache.org/jira/browse/HIVE-11996) - Row Delimiter other than '\n' throws error in Hive. OPEN
In the example above, nulls are inserted for the array and map types in the destination tables but potentially these can also come from the external table if the proper row formats are specified.
在上面的示例中,將為目標表中的 array 和 map 型別插入 nulls,但如果指定了適當的行格式,這些 nulls 也可能來自外部表。
This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.
如果 HDFS 中已經有一些遺留資料,使用者想要在這些資料上放一些後設資料,這樣就可以使用 Hive 來查詢和操作這些資料,那麼這種方法非常有用。
Additionally, the system also supports syntax that can load the data from a file in the local files system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:
此外,系統還支援將本地檔案系統中檔案的資料直接載入到 Hive 表中,並且輸入的資料格式與表格式相同。
如果 /tmp/pv_2008-06-08_us.txt 已經包含了 US 的資料,那麼我們就不需要像前面的例子中那樣做額外的過濾。在這種情況下,可以使用以下語法進行載入:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.
path 引數可以接受一個目錄(在這種情況下,該目錄中的所有檔案都被載入)、一個單個檔名或一個萬用字元(在這種情況下,所有匹配的檔案都被上傳)。
如果引數是目錄,則不能包含子目錄。類似地,萬用字元必須只匹配檔名。
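For example, a wildcard load might look like the following (the file name pattern is hypothetical, shown only to illustrate the syntax):
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us_*.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')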
In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can be used to load the data into a Hive table:
如果輸入檔案 /tmp/pv_2008-06-08_us.txt 非常大,使用者可能會決定並行載入資料(使用 Hive 外部的工具)。一旦檔案在 HDFS 中,就可以使用以下語法將資料載入到 Hive 表中:
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
It is assumed that the array and map fields in the input.txt files are null fields for these examples.
對於這些例子,假設 input.txt 檔案中的 array 和 map 欄位是空欄位。
See Hive Data Manipulation Language for more information about loading data into Hive tables, and see External Tables for another example of creating an external table.
2.3、Querying and Inserting Data
The Hive query operations are documented in Select, and the insert operations are documented in Inserting data into Hive Tables from queries and Writing data into the filesystem from queries.
Hive 查詢操作記錄在 Select 中,insert 操作記錄在 Inserting data into Hive Tables from queries 和 Writing data into the filesystem from queries 中。
2.3.1、Simple Query
For all the active users, one can use the query of the following form:
對於所有活躍使用者,可以使用下面形式的查詢:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query in Beeline or the Hive CLI:
注意,與 SQL 不同,我們總是將結果插入表中。我們稍後將演示使用者如何檢查這些結果,甚至將它們轉儲到本地檔案。
你也可以在 Beeline 或 Hive 命令列中執行如下查詢:
SELECT user.*
FROM user
WHERE user.active = 1;
This will be internally rewritten to some temporary file and displayed to the Hive client side.
這將在內部被重寫到一些臨時檔案,並顯示到 Hive 客戶端。
2.3.2、Partition Based Query
What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:
在查詢中,使用什麼分割槽由系統根據分割槽列上的 where 子句條件自動決定。
例如,為了獲得 2008 年 3 月所有由域名 xyz.com 引薦而來的 page_views,可以寫如下查詢:
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
page_views.referrer_url like '%xyz.com';
Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think!
注意,這裡使用 page_views.date 是因為上面的表是用 PARTITIONED BY(date DATETIME, country STRING) 定義的;如果你給你的分割槽起了不同的名字,別指望 .date 會像你想的那樣工作!
2.3.3、Joins
In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:
為了獲得 2008-03-03 的 page_view 的人口統計分類(按性別),需要在 userid 列上連線 page_view 表和 user 表。這可以通過 join 來實現,如下面的查詢所示:
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:
為了進行外部連線,使用者可以使用 LEFT OUTER、RIGHT OUTER 或 FULL OUTER 關鍵字來限定連線,以表明外部連線的型別(左保留、右保留或兩邊保留)。
例如,為了在上面的查詢中執行 full outer join,相應的語法看起來像下面的查詢:
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.
為了檢查另一個表中是否存在一個鍵,使用者可以使用 LEFT SEMI JOIN,如下面的例子所示。
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
In order to join more than one table, the user can use the following syntax:
為了連線多個表,使用者可以使用以下語法:
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';
Note that Hive only supports equi-joins. Also it is best to put the largest table on the rightmost side of the join to get the best performance.
Hive 只支援 equi-joins。另外,最好將最大的表放在連線的最右邊,以獲得最佳效能。
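For example, if page_view happens to be the largest of the three tables (an assumption made only for this illustration), the multi-way join above could be rewritten with it on the rightmost side:
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM friend_list f JOIN user u ON (u.id = f.uid) JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';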
2.3.4、Aggregations
In order to count the number of distinct users by gender one could write the following query:
為了按性別計算不同使用者的數量,可以編寫以下查詢:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, while the following is possible:
多個聚合可以同時進行,但是,任何兩個聚合都不能使用不同的 DISTINCT 列。例如,下面的查詢是允許的:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
however, the following query is not allowed
但是,下面的查詢是不允許的
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
2.3.5、Multi Table/File Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). For example, if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
聚合或簡單選擇的輸出可以進一步傳送到多個表中,甚至傳送到 hadoop dfs 檔案中(然後可以使用 hdfs 實用程式操作這些檔案)。
例如,如果在性別分類的同時,需要找到按年齡劃分的唯一頁面瀏覽量,可以通過以下查詢來實現:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs files.
第一個 insert 子句將第一個 group by 的結果傳送到 Hive 表中,而第二個 insert 子句將結果傳送到 hadoop dfs 檔案中。
2.3.6、Dynamic-Partition Insert
In the previous examples, the user has to know which partition to insert into and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert statement as illustrated below.
在前面的例子中,使用者必須知道要插入哪個分割槽,並且在一條 insert 語句中只能插入一個分割槽。
如果想要載入到多個分割槽,必須使用 multi-insert 語句(一條包含多個 INSERT 子句的語句),如下所示。
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
In order to load data into all country partitions in a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you have to have a priori knowledge of the list of countries that exist in the input data and create the partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.
為了將某一天的資料載入到所有 country 分割槽中,必須在輸入資料中為每個國家新增一條 insert 語句。
這非常不方便,因為你必須事先知道輸入資料中存在的國家列表,並預先建立分割槽。
如果某天列表變了,你必須同時修改 insert DML 和建立分割槽的 DDL。這也是低效的,因為每條 insert 語句都可能變成一個 MapReduce job。
Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduce the Hadoop cluster workload comparing to the multiple insert case.
動態分割槽插入(或多分割槽插入)是為了解決這個問題而設計的,它在掃描輸入表時動態地確定應該建立和填充哪些分割槽。
這是一個新新增的特性,從 0.6.0 版本開始才可用。
在動態分割槽插入中,將計算輸入列的值,以確定應該將該行插入哪個分割槽。如果還沒有建立該分割槽,它將自動建立該分割槽。
使用這個特性,你只需要一條 insert 語句就可以建立和填充所有必要的分割槽。另外,由於只有一條 insert 語句,所以對應的 MapReduce job 也只有一個。
與多次插入相比,這顯著提高了效能並減少了 Hadoop 叢集的工作負載。
Below is an example of loading data to all country partitions using one insert statement:
下面是一個使用 insert 語句將資料載入到所有 country 分割槽的例子:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
There are several syntactic differences from the multi-insert statement:
與多條 insert 語句有幾個語法上的區別:
country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value will be coming from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause with (dt, country='US') because that means you need to update all partitions with any date whose country sub-partition is 'US'.
- country 出現在 PARTITION 規範中,但沒有關聯的值。
在本例中,country 是一個動態分割槽列。另一方面,dt 有一個與之相關聯的值,這意味著它是一個靜態分割槽列。
如果一個列是動態分割槽列,那麼它的值將來自輸入列。
目前,我們只允許動態分割槽列是 partition 子句中的最後一列,因為分割槽列的順序表明了它的層次順序(意味著 dt 是根分割槽,而 country 是子分割槽)。
你不能用 (dt, country='US') 指定分割槽子句,因為這意味著你需要用任一日期更新所有分割槽,而它的 country 子分割槽是 'US'。
An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause. Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.
- 在 select 語句中新增一個額外的 pvs.country 列。
這是動態分割槽列的相應輸入列。
注意,你不需要為靜態 partition 列新增輸入列,因為它的值在 PARTITION 子句中已知。
注意,動態分割槽值是按照順序選擇的,而不是名稱,並作為 select 子句的最後一列。
Semantics of the dynamic partition insert statement:
動態分割槽插入語句的語義:
When there are already non-empty partitions exists for the dynamic partition columns, (for example, country='CA' exists under some ds root partition), it will be overwritten if the dynamic partition insert saw the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.
- 當動態分割槽列已經存在非空的分割槽時(例如,country='CA' 存在於某些 ds 根分割槽下),如果動態分割槽插入在輸入資料中遇到相同的值(比如'CA'),那麼它將被覆蓋。這符合 'insert overwrite' 語義。但是,如果分割槽值 'CA' 沒有出現在輸入資料中,則不會覆蓋現有分割槽。
Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in URI (for example, '%', ':', '/', '#') will be escaped with '%' followed by 2 bytes of its ASCII value.
- 一個 Hive 分割槽對應 HDFS 中的一個目錄,分割槽值必須符合 HDFS 的路徑格式(Java為URI)。任何在 URI 中有特殊含義的字元(例如'%',':','/','#')將用 '%' 後跟它的 ASCII 值的 2 個位元組進行轉義。
If the input column is a type different than STRING, its value will be first converted to STRING to be used to construct the HDFS path.
- 如果輸入列不是 STRING 型別,則首先將其值轉換為 STRING,用於構建 HDFS 路徑。
If the input column value is NULL or empty string, the row will be put into a special partition, whose name is controlled by the hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose value are not valid partition names. The caveat of this approach is that the bad value will be lost and is replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let user specify "bad file" to retain the input partition column values as well.
- 如果輸入的列值為 NULL 或空字串,該行將被放入一個特殊的分割槽,分割槽的名稱由 hive.exec.default.partition.name 引數控制,預設值為 __HIVE_DEFAULT_PARTITION__。基本上,這個分割槽將包含所有值不是有效分割槽名的"壞"行。這種做法的問題是,如果在 Hive 中查詢它們,"壞"值將丟失,並被 __HIVE_DEFAULT_PARTITION__ 替換。JIRA HIVE-1309 是一個讓使用者指定"壞檔案"以保留輸入分割槽列值的解決方案。
Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To get yourself buckled, we define three parameters:
動態分割槽插入可能會佔用大量資源,因為它可能會在很短的時間內生成大量分割槽。為此,我們定義了三個引數:
- hive.exec.max.dynamic.partitions.pernode (default value being 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error will be raised from the mapper/reducer (through a counter) and the whole job will be killed.
- hive.exec.max.dynamic.partitions (default value being 1000) is the total number of dynamic partitions that can be created by one DML. If each mapper/reducer did not exceed the limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.
- hive.exec.max.created.files (default value being 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by updating a Hadoop counter from each mapper/reducer whenever a new file is created. If the total number exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.
- hive.exec.max.dynamic.partitions.pernode(預設值是100):每個 mapper 或 reducer 可以建立的最大動態分割槽數。如果一個 mapper 或 reducer 建立的分割槽超過這個閾值,將從 mapper/reducer(通過計數器)引發一個致命錯誤,整個 job 將被終止。
- hive.exec.max.dynamic.partitions(預設值為1000):一條 DML 可以建立的動態分割槽總數。如果每個 mapper/reducer 都沒有超過限制,但動態分割槽的總數超過了,那麼在 job 結束、中間資料被移動到最終目的地之前,會引發一個異常。
- hive.exec.max.created.files(預設值為100000):所有 mappers 和 reducers 建立的最大檔案總數。每當建立新檔案時,每個 mapper/reducer 都會更新一個 Hadoop 計數器。如果總數超過 hive.exec.max.created.files,將丟擲一個致命錯誤,job 將被終止。
Another situation we want to protect against dynamic partition insert is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic partition case. In the strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partition at all. The default value is false prior to Hive 0.9.0 and true in Hive 0.9.0 and later.
我們希望防止動態分割槽插入的另一種情況是,使用者可能會意外地將所有分割槽指定為動態分割槽,而沒有指定一個靜態分割槽,而最初的目的只是覆蓋一個根分割槽的子分割槽。
我們定義了另一個引數 hive.exec.dynamic.partition.mode=strict 來防止出現全部是動態分割槽的情況。在 strict 模式下,必須至少指定一個靜態分割槽。預設模式為 strict。
此外,我們還有一個引數 hive.exec.dynamic.partition=true/false 來控制是否允許動態分割槽。Hive 0.9.0 之前的預設值為 false,Hive 0.9.0 及之後的預設值為 true。
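Putting these knobs together, a session could configure dynamic partitioning as follows (the values shown are simply the defaults described above, not recommendations):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;
SET hive.exec.max.dynamic.partitions.pernode=100;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.created.files=100000;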
In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
在 Hive 0.6 中,動態分割槽插入不能與 hive.merge.mapfiles=true 或 hive.merge.mapredfiles=true 一起使用,因此它在內部關閉了合併引數。Hive 0.7 支援在動態分割槽插入中合併檔案(詳見 JIRA HIVE-1307)。
Troubleshooting and best practices:
故障排除和最佳實踐:
As stated above, when there are too many dynamic partitions created by a particular mapper/reducer, a fatal error could be raised and the job will be killed. The error message looks something like:
如上所述,一個特定的 mapper/reducer 建立的動態分割槽太多了,可能會引發一個致命錯誤,job 將被終止。錯誤資訊是這樣的:
beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...
The problem of this that one mapper will take a random set of rows and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions will be significantly reduced. The above example query could be rewritten to:
這樣做的問題是,一個 mapper 會取到一個隨機的行集,並且不同的 (dt, country) 對的數量很可能超過 hive.exec.max.dynamic.partitions.pernode 的限制。
解決這個問題的一種方法是按照 mapper 中的動態分割槽列對行進行分組,並將它們分發到將在其中建立動態分割槽的 reducers 中。
在這種情況下,不同動態分割槽的數量將顯著減少。上面的示例查詢可以重寫為:
beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
DISTRIBUTE BY ds, country;
This query will generate a MapReduce job rather than Map-only job. The SELECT-clause will be converted to a plan to the mappers and the output will be distributed to the reducers based on the value of (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer which writes to the dynamic partitions.
該查詢將生成一個 MapReduce job,而不是 Map-only job。
SELECT 子句將被轉換為 mappers 的計劃,並根據 (ds, country)對的值將輸出分配給 reducers。
INSERT 子句將被轉換為 reducer 中的計劃,用於寫入動態分割槽。
2.3.7、Inserting into Local Files
In certain situations you would want to write the output into a local file so that you could load it into an excel spreadsheet. This can be accomplished with the following command:
在某些情況下,可能希望將輸出寫入到本地檔案,以便將其載入到 excel 表格中。這可以通過以下命令完成:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
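In later Hive versions (0.11.0 and up) the directory insert can also carry a row format, which makes the output easier to open directly as a CSV in a spreadsheet; a sketch, assuming such a version:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT pv_gender_sum.*
FROM pv_gender_sum;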
2.3.8、Sampling
The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose 3rd bucket out of the 32 buckets of the pv_gender_sum table:
抽樣子句允許使用者對資料的樣本(而不是整個表)編寫查詢。
目前,抽樣是在 CREATE TABLE 語句的 CLUSTERED BY 子句中指定的列上進行的。
在下面的例子中,我們從 pv_gender_sum 表的 32 個桶中選擇第三個桶:
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
In general the TABLESAMPLE syntax looks like:
一般情況下,TABLESAMPLE 語法如下:
TABLESAMPLE(BUCKET x OUT OF y)
y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The buckets chosen are determined by whether bucket_number modulo y is equal to x. So in the above example the following tablesample clause
y 必須是該表在建立表時指定的桶數的倍數或除數。
所選擇的桶由 bucket_number 對 y 取模是否等於 x 決定。所以在上面的例子中,下面的 tablesample 子句
TABLESAMPLE(BUCKET 3 OUT OF 16)
would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.
選出第3和第19個桶。桶從0開始編號。
On the other hand the tablesample clause
另一方面,tablesample 子句
TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
would pick out half of the 3rd bucket.
會取出第三個桶的一半。
2.3.9、Union All
The language also supports union all, for example, if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:
語言還支援 union all。例如,假設有兩個不同的表,分別記錄哪個使用者釋出了視訊、哪個使用者發表了評論,下面的查詢將 union all 的結果與 user 表進行 join,為所有的視訊釋出和評論釋出事件建立一個帶註釋的單一事件流:
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
SELECT av.uid AS uid, av.date AS date
FROM action_video av
WHERE av.date = '2008-06-03'
UNION ALL
SELECT ac.uid AS uid, ac.date AS date
FROM action_comment ac
WHERE ac.date = '2008-06-03'
) actions JOIN users u ON(u.id = actions.uid);
2.3.10、Array Operations
Array columns in tables can be declared as follows:
表中的陣列列可以像下面這樣宣告:
CREATE TABLE array_table (int_array_column ARRAY<INT>);
Assuming that pv.friends is of the type ARRAY<INT> (i.e. it is an array of integers), the user can get a specific element in the array by its index as shown in the following command:
假設 pv.friends 的型別為 ARRAY<INT>(即整數陣列),使用者可以通過索引獲取陣列中的特定元素,如下所示:
SELECT pv.friends[2]
FROM page_views pv;
The select expression gets the third item in the pv.friends array.
select 表示式獲得 pv.friends 陣列中的第三項。
The user can also get the length of the array using the size function as shown below:
使用者也可以使用 size 函式獲取陣列的長度,如下所示:
SELECT pv.userid, size(pv.friends)
FROM page_view pv;
2.3.11、Map (Associative Arrays) Operations
Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently. We will be extending this soon. For the purpose of the current example assume that pv.properties is of the type map<String, String> i.e. it is an associative array from strings to string. Accordingly, the following query:
對映提供了類似於關聯陣列的集合。目前只能以程式設計方式建立此類結構。我們很快就會擴充套件這個。
對於當前的例子,假設 pv.properties 型別為 map<String, String>,即它是一個從字串到字串的關聯陣列。因此,以下查詢:
INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;
can be used to select the 'page_type' property from the page_views table.
可用於從 page_views 表中選擇 'page_type' 屬性。
Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:
與陣列類似,size 函式也可以用於獲取 map 中元素的數量,如下所示:
SELECT size(pv.properties)
FROM page_view pv;
2.3.12、Custom Map/Reduce Scripts
Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. for example, in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.
使用者還可以使用 Hive 語言本地支援的特性,在資料流中插入自己的自定義 mappers 和 reducers。
例如,為了執行一個自定義 mapper 指令碼 map_script 和一個自定義 reducer 指令碼 reduce_script,使用者可以發出以下命令,該命令使用 TRANSFORM 子句來嵌入 mapper 和 reducer 指令碼。
Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.
注意,在提供給使用者指令碼之前,列將被轉換為字串,並以製表符分隔,使用者指令碼的標準輸出將被視為製表符分隔的字串列。
使用者指令碼可以將除錯資訊輸出到標準錯誤,這些錯誤將顯示在 hadoop 上的 task detail 頁面上。
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;
Sample map script (weekday_mapper.py):
示例 map 指令碼(weekday_mapper.py):
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print ','.join([userid, str(weekday)])
Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:
當然,MAP 和 REDUCE 對於更一般的 select 轉換來說都是“語法糖”。內部查詢也可以這樣寫:
SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;
Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying "AS key, value" because in that case value will only contains the portion between the first tab and the second tab if there are multiple tabs.
無模式的 map/reduce:
如果 USING map_script 後沒有 AS 子句,Hive 假設指令碼輸出包含兩個部分:key 在第一個製表符前,value 在第一個製表符後。
注意,這與指定 “AS key, value” 不同,因為在這種情況下,如果有多個製表符,value 將只包含第一個製表符和第二個製表符之間的部分。
In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. User still needs to know the reduce output schema because that has to match what is in the table that we are inserting to.
通過這種方式,我們允許使用者在不知道 map 輸出模式的情況下遷移舊的 map/reduce 指令碼。
使用者仍然需要知道 reduce 輸出模式,因為它必須與我們要插入的表中的內容相匹配。
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;
Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
Distribute By 和 Sort By:使用者可以指定 Distribute By 和 Sort By,而不是指定 cluster By,因此分割槽列和排序列可以不同。
通常情況下,分割槽列是排序列的字首,但這不是必需的。
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS c1, c2, c3
DISTRIBUTE BY c2
SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.c1, map_output.c2, map_output.c3
USING 'reduce_script'
AS date, count;
2.3.13、Co-Groups
Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns on the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the actions_video and action_comments table on the uid column and send them to the 'reduce_script' custom reducer, the following syntax can be used by the user:
在使用 map/reduce 的使用者社群中,cogroup 是一種相當常見的操作,其中來自多個表的資料被髮送到一個自定義的 reducer,這樣行就按表上某些列的值分組。
通過 UNION ALL 操作符和 CLUSTER BY 規範,Hive 查詢語言可以通過以下方式實現這一點。
假設我們想在 uid 列上對 actions_video 和 action_comments 表中的行進行分組,並將它們傳送到 'reduce_script' 自定義 reducer,使用者可以使用以下語法:
FROM (
FROM (
FROM action_video av
SELECT av.uid AS uid, av.id AS id, av.date AS date
UNION ALL
FROM action_comment ac
SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
) union_actions
SELECT union_actions.uid, union_actions.id, union_actions.date
CLUSTER BY union_actions.uid) map
INSERT OVERWRITE TABLE actions_reduced
SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);