hive經典案列--top N(行轉列\列轉行)

一隻考考拉發表於2020-12-06
列轉行案列:現在有這樣一份資料: exercise_topn.txt
 
需求:
求出每種愛好中,年齡最大的兩個人(愛好,年齡,姓名)注意思考一個問題:如果某個愛好中的第二大年齡
有多個相同的怎麼辦?
 
其中需要注意的是:每一條記錄中的愛好有多個值,以"-"分隔
id,姓名,年齡,愛好
id,name,age,favors

1,huangxiaoming,45,a-c-d-f

2,huangzitao,36,b-c-d-e

3,huanglei,41,c-d-e

4,liushishi,22,a-d-e

5,liudehua,39,e-f-d

6,liuyifei,35,a-d-e

思路總結:

1.先把愛好拆分成一條一條的資料,行轉列我們可以用explode() + lateral view

2.按照愛好進行分組年齡進行排序取rownum<=2  我們可以用rownum over()

實現

1.select a.id,a.name,a.age, favor_view as favor from exercise_topn a 
alter view explode(split(a.favors,“-”))  favor_view as favor;

轉換後的資料:


| 1 | huangxiaoming | 45 | a |
| 1 | huangxiaoming | 45 | c |
| 1 | huangxiaoming | 45 | d |
| 1 | huangxiaoming | 45 | f |
| 2 | huangzitao | 36 | b |
| 2 | huangzitao | 36 | c |
| 2 | huangzitao | 36 | d |
| 2 | huangzitao | 36 | e |
| 3 | huanglei | 41 | c |
| 3 | huanglei | 41 | d |
| 3 | huanglei | 41 | e |
 
SELECT c.id, c.name, c.age, c.favor
FROM (
	SELECT b.id, b.name, b.age, b.favor
		, row_number() OVER (PARTITION BY b.favor ORDER BY b.age DESC) AS rank
	FROM (
		SELECT a.id AS id, a.name AS name, a.age AS age, favor_view.favor
		FROM exercise_topn a
			LATERAL VIEW explode(split(a.favors, '-')) favor_view AS favor
	) b
) c
WHERE c.rank <= 2;
RANK() 生成資料項在分組中的排名,排名相等會在名次中留下空位(比如 1 2 2 4)
DENSE_RANK() 生成資料項在分組中的排名,排名相等會在名次中不會留下空位(比如1,2,2,3)

2.行轉列案列

所有數學課程成績 大於 語文課程成績的學生的學號
資料結構如下:
id,sid,course,score
1,1,yuwen,43 
2,1,shuxue,55 
3,2,yuwen,77 
4,2,shuxue,88 
5,3,yuwen,98 
6,3,shuxue,65 
7,3,yingyu,80

解決方案:
1.行列轉換
SELECT sid
	, max(CASE course
		WHEN 'yuwen' THEN score
		ELSE 0
	END) AS yuwen
	, max(CASE course
		WHEN 'shuxue' THEN score
		ELSE 0
	END) AS shuxue
	, max(CASE course
		WHEN 'yingyu' THEN score
		ELSE 0
	END) AS yingyu
FROM exercise_course
GROUP BY sid;

得到的資料結果為:

+------+--------+---------+---------+ 
| sid | yuwen | shuxue | yingyu | 
+------+--------+---------+---------+ 
| 3 | 98 | 65 | 80 |
| 1 | 43 | 55 | 0 | 
| 2 | 77 | 88 | 0 |
 +------+--------+---------+---------+

2.加過濾條件

SELECT aa.sid
FROM (
	SELECT sid
		, max(CASE course
			WHEN 'yuwen' THEN score
			ELSE 0
		END) AS yuwen
		, max(CASE course
			WHEN 'shuxue' THEN score
			ELSE 0
		END) AS shuxue
		, max(CASE course
			WHEN 'yingyu' THEN score
			ELSE 0
		END) AS yingyu
	FROM exercise_course
	GROUP BY sid
) aa
WHERE aa.shuxue > aa.yuwen;

 

 

相關文章