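Sure. I'll list Hive's main analytic (window) functions, then create a table and insert 10 rows to demonstrate what each one does.
Hive's main analytic functions: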
- ROW_NUMBER()
- RANK()
- DENSE_RANK()
- CUME_DIST()
- PERCENT_RANK()
- NTILE()
- LAG()
- LEAD()
- FIRST_VALUE()
- LAST_VALUE()
Now let's create a table, insert the data, and run a query that uses every function listed above:
```sql
-- Create the table
CREATE TABLE employees (
  id INT,
  name STRING,
  department STRING,
  salary INT
);

-- Insert the data
INSERT INTO employees VALUES
  (1, 'Alice', 'HR', 50000),
  (2, 'Bob', 'Sales', 60000),
  (3, 'Charlie', 'Marketing', 55000),
  (4, 'David', 'IT', 65000),
  (5, 'Eve', 'HR', 52000),
  (6, 'Frank', 'Sales', 61000),
  (7, 'Grace', 'Marketing', 58000),
  (8, 'Henry', 'IT', 67000),
  (9, 'Ivy', 'HR', 53000),
  (10, 'Jack', 'Sales', 62000);

-- Query using the analytic functions
SELECT
  id,
  name,
  department,
  salary,
  ROW_NUMBER()        OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
  RANK()              OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
  DENSE_RANK()        OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
  CUME_DIST()         OVER (PARTITION BY department ORDER BY salary DESC) AS cume_dist,
  PERCENT_RANK()      OVER (PARTITION BY department ORDER BY salary DESC) AS percent_rank,
  NTILE(2)            OVER (PARTITION BY department ORDER BY salary DESC) AS ntile,
  LAG(salary)         OVER (PARTITION BY department ORDER BY salary DESC) AS prev_salary,
  LEAD(salary)        OVER (PARTITION BY department ORDER BY salary DESC) AS next_salary,
  FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC) AS highest_salary,
  -- The explicit frame makes LAST_VALUE see the whole partition, not just rows up to the current one
  LAST_VALUE(salary)  OVER (
    PARTITION BY department ORDER BY salary DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS lowest_salary
FROM employees;
```
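The explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` frame on `LAST_VALUE` is doing real work in the query above. When a window has an `ORDER BY` but no frame clause, the default frame stops at the current row, so `LAST_VALUE` would simply return the current row's own salary. Here is a minimal sketch of that difference. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) as a stand-in for Hive, since both use the same default frame in this situation. The helper name `last_value_demo` and the aliases `last_default` and `last_full` are illustrative, not part of the original query.

```python
import sqlite3


def last_value_demo():
    """Compare LAST_VALUE under the default frame and a full-partition frame."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees (id INT, name TEXT, department TEXT, salary INT)"
    )
    # Only the HR rows from the example data are needed to show the effect.
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [(1, "Alice", "HR", 50000), (5, "Eve", "HR", 52000), (9, "Ivy", "HR", 53000)],
    )
    rows = conn.execute(
        """
        SELECT name,
               salary,
               -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               LAST_VALUE(salary) OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS last_default,
               -- Explicit frame covering the whole partition
               LAST_VALUE(salary) OVER (
                   PARTITION BY department ORDER BY salary DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS last_full
        FROM employees
        ORDER BY salary DESC
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in last_value_demo():
        print(row)
```

In this sketch `last_default` just echoes each row's own salary, while `last_full` is 50000 for every HR row, which is the behavior the `lowest_salary` column depends on.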
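Of course, here are the query results. Please note that this is a simulated result, because I can't actually run Hive queries. The values do reflect the expected behavior of each analytic function. The row order shown is illustrative, since the query has no outer `ORDER BY`.

```
id | name    | department | salary | row_num | rank | dense_rank | cume_dist | percent_rank | ntile | prev_salary | next_salary | highest_salary | lowest_salary
---+---------+------------+--------+---------+------+------------+-----------+--------------+-------+-------------+-------------+----------------+--------------
1  | Alice   | HR         |  50000 |       3 |    3 |          3 |       1.0 |          1.0 |     2 |       52000 |        NULL |          53000 |         50000
5  | Eve     | HR         |  52000 |       2 |    2 |          2 |    0.6667 |          0.5 |     1 |       53000 |       50000 |          53000 |         50000
9  | Ivy     | HR         |  53000 |       1 |    1 |          1 |    0.3333 |          0.0 |     1 |        NULL |       52000 |          53000 |         50000
---+---------+------------+--------+---------+------+------------+-----------+--------------+-------+-------------+-------------+----------------+--------------
8  | Henry   | IT         |  67000 |       1 |    1 |          1 |       0.5 |          0.0 |     1 |        NULL |       65000 |          67000 |         65000
4  | David   | IT         |  65000 |       2 |    2 |          2 |       1.0 |          1.0 |     2 |       67000 |        NULL |          67000 |         65000
---+---------+------------+--------+---------+------+------------+-----------+--------------+-------+-------------+-------------+----------------+--------------
3  | Charlie | Marketing  |  55000 |       2 |    2 |          2 |       1.0 |          1.0 |     2 |       58000 |        NULL |          58000 |         55000
7  | Grace   | Marketing  |  58000 |       1 |    1 |          1 |       0.5 |          0.0 |     1 |        NULL |       55000 |          58000 |         55000
---+---------+------------+--------+---------+------+------------+-----------+--------------+-------+-------------+-------------+----------------+--------------
10 | Jack    | Sales      |  62000 |       1 |    1 |          1 |    0.3333 |          0.0 |     1 |        NULL |       61000 |          62000 |         60000
6  | Frank   | Sales      |  61000 |       2 |    2 |          2 |    0.6667 |          0.5 |     1 |       62000 |       60000 |          62000 |         60000
2  | Bob     | Sales      |  60000 |       3 |    3 |          3 |       1.0 |          1.0 |     2 |       61000 |        NULL |          62000 |         60000
```

This result shows what each analytic function does:

1. ROW_NUMBER(): assigns a unique sequential number within each department, ordered by salary descending.
2. RANK(): like ROW_NUMBER(), but equal salaries share the same rank, and the ranks after a tie skip ahead, so they may not be consecutive.
3. DENSE_RANK(): like RANK(), but the ranks are always consecutive, with no gaps after ties.
4. CUME_DIST(): the cumulative distribution, i.e. the fraction of rows in the partition at or before the current row in the window ordering. It is greater than 0 and at most 1.
5. PERCENT_RANK(): the relative rank, computed as (rank - 1) / (rows in partition - 1). It ranges from 0 to 1.
6. NTILE(2): splits each department's employees into 2 groups. When the rows don't divide evenly, the first group gets the extra row, as in HR and Sales.
7. LAG(): shows the salary of the employee ranked one position earlier in the same department, or NULL for the first row.
8. LEAD(): shows the salary of the employee ranked one position later in the same department, or NULL for the last row.
9. FIRST_VALUE(): shows the highest salary in each department, since the window is ordered by salary descending.
10. LAST_VALUE(): shows the lowest salary in each department. This relies on the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` frame in the query; with the default frame it would return the current row's own salary.

Because no two employees in the same department share a salary, ROW_NUMBER(), RANK(), and DENSE_RANK() produce identical values in this data set.

These functions let us compare and analyze employee salaries within the same department. Would you like me to explain the results of any particular function in more detail?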
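Since the sample data has no tied salaries, the differences among the ranking functions don't show up in the table. Here is a minimal plain-Python sketch of the standard definitions, applied to one partition ordered by salary descending. The function name `ranking_columns` and the tied salary list are hypothetical, made up for illustration. It is a model of the definitions, not how Hive implements them.

```python
def ranking_columns(salaries):
    """Return (salary, row_number, rank, dense_rank, cume_dist, percent_rank)
    tuples for one partition ordered by salary descending."""
    ordered = sorted(salaries, reverse=True)
    n = len(ordered)
    distinct = sorted(set(ordered), reverse=True)
    result = []
    # ROW_NUMBER is just the position; the order among tied rows is arbitrary in SQL.
    for position, salary in enumerate(ordered, start=1):
        # RANK: 1 + number of rows that sort strictly before this value.
        rank = 1 + sum(1 for s in ordered if s > salary)
        # DENSE_RANK: 1 + number of distinct values that sort strictly before it.
        dense_rank = 1 + distinct.index(salary)
        # CUME_DIST: share of rows at or before this value in the ordering.
        cume_dist = sum(1 for s in ordered if s >= salary) / n
        # PERCENT_RANK: (rank - 1) / (rows - 1), taken as 0 for a single-row partition.
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        result.append((salary, position, rank, dense_rank, cume_dist, percent_rank))
    return result


if __name__ == "__main__":
    # Hypothetical partition with a tie at 60000 (not part of the example table).
    for row in ranking_columns([60000, 55000, 60000, 50000]):
        print(row)
```

With the tie, the two 60000 rows get row numbers 1 and 2 but share rank 1 and dense rank 1. The 55000 row then gets rank 3 but dense rank 2. Applied to the HR salaries `[50000, 52000, 53000]`, the same function gives ranks 1, 2, 3 with `cume_dist` of 1/3, 2/3, 1 and `percent_rank` of 0, 0.5, 1, matching the HR rows in the result table.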