Welcome to my GitHub
https://github.com/zq2599/blog_demos
Contents: a categorized index of all my original articles with accompanying source code, covering Java, Docker, Kubernetes, DevOps, and more.
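Overview
- This is the seventh article in the "Hive Study Notes" series. The previous article covered common HiveQL statements; this one takes a quick tour of the commonly used built-in functions, grouped as follows:
- Math
- String
- JSON processing
- Conversion
- Date
- Conditional
- Aggregate
Preparing the data
- This exercise uses two tables, a student table and an address table. Both have only a few simple fields. The student table has an addressid field that holds the unique ID of a record in the address table.
- First, create the address table: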
create table address (addressid int, province string, city string)
row format delimited
fields terminated by ',';
- Create a file named address.txt with the following content:
1,guangdong,guangzhou
2,guangdong,shenzhen
3,shanxi,xian
4,shanxi,hanzhong
6,jiangshu,nanjing
- Load the data into the address table:
load data
local inpath '/home/hadoop/temp/202010/25/address.txt'
into table address;
- Create the student table; its addressid field refers to the addressid field of the address table:
create table student (name string, age int, addressid int)
row format delimited
fields terminated by ',';
- Create a file named student.txt with the following content:
tom,11,1
jerry,12,2
mike,13,3
john,14,4
mary,15,5
- Load the data into the student table:
load data
local inpath '/home/hadoop/temp/202010/25/student.txt'
into table student;
- The data needed for this exercise is now ready, as shown below:
hive> select * from address;
OK
1 guangdong guangzhou
2 guangdong shenzhen
3 shanxi xian
4 shanxi hanzhong
6 jiangshu nanjing
Time taken: 0.043 seconds, Fetched: 5 row(s)
hive> select * from student;
OK
tom 11 1
jerry 12 2
mike 13 3
john 14 4
mary 15 5
Time taken: 0.068 seconds, Fetched: 5 row(s)
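Because student.addressid points at address.addressid, the two tables can be joined. The query below is a minimal sketch that is not part of the original walkthrough; the aliases s and a are chosen here for illustration. A left outer join is used so that students without a matching address still appear.

```sql
-- Look up each student's province and city through addressid.
-- mary has addressid 5, which has no row in address, so her
-- province and city come back as NULL.
select s.name, s.age, a.province, a.city
from student s
left outer join address a
on s.addressid = a.addressid;
```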
- Now let's try out the built-in functions.
Listing the functions
- Open the Hive console.
- Run show functions; to list the built-in functions:
hive> show functions;
OK
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
cbrt
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
create_union
cume_dist
current_database
current_date
current_timestamp
current_user
date_add
date_format
date_sub
datediff
day
dayofmonth
decode
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
factorial
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
greatest
hash
hex
histogram_numeric
hour
if
in
in_file
index
initcap
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_day
last_value
lcase
lead
least
length
levenshtein
like
ln
locate
log
log10
log2
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
min
minute
month
months_between
named_struct
negative
next_day
ngrams
noop
noopstreaming
noopwithmap
noopwithmapstreaming
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
shiftleft
shiftright
shiftrightunsigned
sign
sin
size
sort_array
soundex
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
trunc
ucase
unbase64
unhex
unix_timestamp
upper
var_pop
var_samp
variance
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~
Time taken: 0.003 seconds, Fetched: 216 row(s)
- Taking lower as an example, run describe function lower; to see its description:
hive> describe function lower;
OK
lower(str) - Returns str with all characters changed to lowercase
Time taken: 0.005 seconds, Fetched: 1 row(s)
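Hive also accepts an extended form of the same command, which typically prints extra detail such as usage examples in addition to the one-line summary. A small sketch; the exact output depends on the Hive version, so it is not reproduced here:

```sql
-- Ask for the longer description of lower
describe function extended lower;
```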
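- Next, let's try the commonly used functions, starting with the math functions.
- First run the following command so that query results include column names: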
set hive.cli.print.header=true;
Math functions
- Addition (+):
hive> select name, age, age+1 as add_value from student;
OK
name age add_value
tom 11 12
jerry 12 13
mike 13 14
john 14 15
mary 15 16
Time taken: 0.098 seconds, Fetched: 5 row(s)
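- Subtraction (-), multiplication (*), and division (/) are used the same way as addition, so they are not repeated here.
- Rounding to the nearest integer with round: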
hive> select round(1.1), round(1.6);
OK
_c0 _c1
1.0 2.0
Time taken: 0.028 seconds, Fetched: 1 row(s)
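round also takes an optional second argument giving the number of decimal places to keep. A short sketch with a literal chosen for illustration:

```sql
-- Keep two decimal places: 3.14159 becomes 3.14
select round(3.14159, 2);
```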
- Rounding up with ceil:
hive> select ceil(1.1);
OK
_c0
2
Time taken: 0.024 seconds, Fetched: 1 row(s)
- Rounding down with floor:
hive> select floor(1.1);
OK
_c0
1
Time taken: 0.024 seconds, Fetched: 1 row(s)
- Exponentiation with pow; for example, pow(2,3) is 2 to the third power, which equals 8:
hive> select pow(2,3);
OK
_c0
8.0
Time taken: 0.027 seconds, Fetched: 1 row(s)
- Modulo with pmod (positive modulo):
hive> select pmod(10,3);
OK
_c0
1
Time taken: 0.059 seconds, Fetched: 1 row(s)
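With positive operands, pmod gives the same answer as the % operator. The difference shows up for a negative left operand: % keeps the sign of the left operand, while pmod returns a non-negative remainder when the divisor is positive. A sketch with values picked for illustration:

```sql
-- % yields -1 here, pmod yields 2
select -7 % 3, pmod(-7, 3);
```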
String functions
- Convert to lowercase with lower and to uppercase with upper:
hive> select lower(name), upper(name) from student;
OK
_c0 _c1
tom TOM
jerry JERRY
mike MIKE
john JOHN
mary MARY
Time taken: 0.051 seconds, Fetched: 5 row(s)
- String length with length:
hive> select name, length(name) from student;
OK
tom 3
jerry 5
mike 4
john 4
mary 4
Time taken: 0.322 seconds, Fetched: 5 row(s)
- String concatenation with concat:
hive> select concat("prefix_", name) from student;
OK
prefix_tom
prefix_jerry
prefix_mike
prefix_john
prefix_mary
Time taken: 0.106 seconds, Fetched: 5 row(s)
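When several pieces share one separator, concat_ws (it appears in the show functions output earlier) takes the separator as its first argument. A sketch that should produce the same values as the concat example:

```sql
-- Join 'prefix' and name with an underscore, e.g. prefix_tom
select concat_ws('_', 'prefix', name) from student;
```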
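- Substrings with substr. Positions are counted from 1: substr(xxx,2) returns everything from the second character to the end, and substr(xxx,2,3) returns three characters starting at the second character: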
hive> select substr("0123456",2);
OK
123456
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> select substr("0123456",2,3);
OK
123
Time taken: 0.08 seconds, Fetched: 1 row(s)
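Two more cases are worth seeing next to the examples above. Starting at position 1 returns the leading characters, and a negative start position counts from the end of the string. A sketch using the same literal:

```sql
-- Position 1 is the first character: returns 012
select substr("0123456", 1, 3);
-- A negative start counts from the right: returns 456
select substr("0123456", -3);
```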
- Remove leading and trailing spaces with trim:
hive> select trim(" 123 ");
OK
123
Time taken: 0.065 seconds, Fetched: 1 row(s)
JSON processing (get_json_object)
To try the JSON functions, first prepare some data:
- Create table t15 with a single field that stores a string:
create table t15(json_raw string)
row format delimited;
- Create a file named 015.txt (the file name used by the load command below) with the following content:
{"name":"tom","age":"10"}
{"name":"jerry","age":"11"}
- Load the data into the t15 table:
load data
local inpath '/home/hadoop/temp/202010/25/015.txt'
into table t15;
- Use get_json_object to parse the json_raw field and extract the name and age attributes:
select
get_json_object(json_raw, "$.name"),
get_json_object(json_raw, "$.age")
from t15;
The result:
hive> select
> get_json_object(json_raw, "$.name"),
> get_json_object(json_raw, "$.age")
> from t15;
OK
tom 10
jerry 11
Time taken: 0.081 seconds, Fetched: 2 row(s)
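get_json_object returns its result as a string, and each call handles the JSON text separately. Two related sketches follow: casting the extracted age to int so it can be used in arithmetic, and using json_tuple with lateral view to pull several attributes in one call. The aliases j and age_next_year are chosen here for illustration and do not appear in the original walkthrough.

```sql
-- The extracted age is a string; cast it before doing arithmetic
select
get_json_object(json_raw, '$.name') as name,
cast(get_json_object(json_raw, '$.age') as int) + 1 as age_next_year
from t15;

-- json_tuple extracts several top-level attributes in one call
select j.name, j.age
from t15
lateral view json_tuple(json_raw, 'name', 'age') j as name, age;
```

The cast is also a small example of type conversion, which the overview lists as one of the function groups.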
Date functions
- Get the current date with current_date:
hive> select current_date();
OK
2020-11-02
Time taken: 0.052 seconds, Fetched: 1 row(s)
- Get the current timestamp with current_timestamp:
hive> select current_timestamp();
OK
2020-11-02 10:07:58.967
Time taken: 0.049 seconds, Fetched: 1 row(s)
- Get the year, month, and day of the month with year, month, and day:
hive> select year(current_date()), month(current_date()), day(current_date());
OK
2020 11 2
Time taken: 0.054 seconds, Fetched: 1 row(s)
- In addition, year, month, and day also work with current_timestamp:
hive> select year(current_timestamp()), month(current_timestamp()), day(current_timestamp());
OK
2020 11 2
Time taken: 0.042 seconds, Fetched: 1 row(s)
- Extract the date part with to_date:
hive> select to_date(current_timestamp());
OK
2020-11-02
Time taken: 0.051 seconds, Fetched: 1 row(s)
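The function list shown earlier also includes date_add, datediff, and date_format, which cover simple date arithmetic and formatting. A minimal sketch with fixed date literals chosen for illustration, so the results do not depend on the day the queries run:

```sql
-- Seven days after 2020-11-02: returns 2020-11-09
select date_add('2020-11-02', 7);
-- Days from the second argument to the first: returns 7
select datediff('2020-11-09', '2020-11-02');
-- Format a date with a pattern: returns 2020-11
select date_format('2020-11-02', 'yyyy-MM');
```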
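Conditional functions
- A conditional expression works much like switch in Java. The syntax is case X when XX then XXX else XXXX end.
- The example below checks the name field: it returns tom_case if the name equals tom, jerry_case if it equals jerry, and other_case in every other case: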
select name,
case name when 'tom' then 'tom_case'
when 'jerry' then 'jerry_case'
else 'other_case'
end
from student;
The result:
hive> select name,
> case name when 'tom' then 'tom_case'
> when 'jerry' then 'jerry_case'
> else 'other_case'
> end
> from student;
OK
tom tom_case
jerry jerry_case
mike other_case
john other_case
mary other_case
Time taken: 0.08 seconds, Fetched: 5 row(s)
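The form above compares one expression against fixed values. When the branches need range checks or other conditions, the searched form of case puts a full condition after each when, and if covers the simple two-way choice. A sketch against the student table; the labels and the aliases age_group and lives_in_guangzhou are made up for illustration:

```sql
-- Searched case: each when carries its own condition.
-- if(condition, a, b) returns a when the condition is true, else b.
select name, age,
case when age < 13 then 'under_13'
else '13_or_over'
end as age_group,
if(addressid = 1, 'yes', 'no') as lives_in_guangzhou
from student;
```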
Aggregate functions
- Count rows with count:
select count(*) from student;
This triggers a MapReduce job. The result:
Total MapReduce CPU Time Spent: 2 seconds 170 msec
OK
5
Time taken: 20.823 seconds, Fetched: 1 row(s)
- Sum within each group with sum (summing the constant 1 counts the rows in each group):
select province, sum(1) from address group by province;
This triggers a MapReduce job. The result:
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
guangdong 2
jiangshu 1
shanxi 2
Time taken: 19.524 seconds, Fetched: 3 row(s)
- Minimum (min), maximum (max), and average (avg) within each group:
select province, min(addressid), max(addressid), avg(addressid) from address group by province;
This triggers a MapReduce job. The result:
Total MapReduce CPU Time Spent: 1 seconds 650 msec
OK
guangdong 1 2 1.5
jiangshu 6 6 6.0
shanxi 3 4 3.5
Time taken: 20.106 seconds, Fetched: 3 row(s)
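Two small variations on the aggregates above, as a sketch rather than part of the original run: count(*) with group by gives the same per-province row counts as sum(1), and count(distinct ...) counts unique values. Like the other aggregate queries here, these would be expected to launch a MapReduce job.

```sql
-- Same per-province counts as the sum(1) query
select province, count(*) from address group by province;
-- Number of distinct provinces in the sample data: returns 3
select count(distinct province) from address;
```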
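- That completes our tour of Hive's commonly used built-in functions. I hope it serves as a useful reference. The next article will try out a commonly used tool: Sqoop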
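You are not alone: Xinchen's original articles are with you all the way
Follow the WeChat official account: 程式設計師欣宸
Search WeChat for「程式設計師欣宸」. I'm Xinchen, and I look forward to exploring the Java world with you...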