A window function is called with the following format: function_name(column) over(partition by column order by column)
If you have never worked with window functions, this format may look confusing, but it becomes easy to understand once you are familiar with aggregate functions. Aggregate functions such as sum(), count(), max(), min() and avg() compute a single value over a group of rows and are usually combined with a group by clause. However, returning only a single value per group is not always enough: we often want to know who is the top student in each region, or who the top few students in each class are. Questions like these need several rows per group, and window functions make them very easy to answer. For example, the following query returns the top-scoring student(s) in each class:
select * from
    (select name, class, score, rank() over(partition by class order by score desc) rank from score) as t
where t.rank = 1
The corresponding Spark code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Score is assumed to be a simple case class holding the test data; `class` is a
// reserved word in Scala, so the field is named clazz and renamed to "class" via toDF below.
case class Score(name: String, clazz: Int, score: Int)

object OverFunction extends App {

  val sparkConf = new SparkConf().setAppName("over").setMaster("local[*]")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  import spark.implicits._

  println("//*************** original score table ****************//")
  val scoreDF = spark.sparkContext.makeRDD(Array(
      Score("a", 1, 80),
      Score("b", 1, 78),
      Score("c", 1, 95),
      Score("d", 2, 74),
      Score("e", 2, 92),
      Score("f", 3, 99),
      Score("g", 3, 99),
      Score("h", 3, 45),
      Score("i", 3, 55),
      Score("j", 3, 78))).toDF("name", "class", "score")
  scoreDF.createOrReplaceTempView("score")
  scoreDF.show()

  println("//*************** students with the highest score in each class ***************//")
  println("/******* table produced by the window function *******/")
  spark.sql("select name, class, score, rank() over(partition by class order by score desc) rank from score").show()

  println("/******* final result *******/")
  spark.sql("select * from " +
    "(select name, class, score, rank() over(partition by class order by score desc) rank from score) " +
    "as t " +
    "where t.rank = 1").show()

  // rank() gives tied rows the same rank (both 99s in class 3 are rank 1);
  // row_number() would keep only one of them:
  // spark.sql("select name, class, score, row_number() over(partition by class order by score desc) rank from score").show()

  println("//*************** students with the highest score in each class (group by) ***************//")
  spark.sql("select class, max(score) max from score group by class").show()
  // join back on both class and score; joining on score alone would also return a
  // student whose score happens to equal another class's maximum
  spark.sql("select a.name, b.class, b.max from score a, " +
    "(select class, max(score) max from score group by class) as b " +
    "where a.class = b.class and a.score = b.max").show()

  spark.stop()
}
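The same query can also be expressed with Spark's DataFrame API instead of a SQL string. The snippet below is only a minimal sketch of that alternative (it is not part of the original example) and assumes the scoreDF defined above is still in scope; Window, rank and col all come from Spark's standard API.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Window specification: one partition per class, rows ordered by score descending.
val byClass = Window.partitionBy("class").orderBy(col("score").desc)

// Add the rank column and keep only the top-ranked row(s) of each class.
// Using row_number() instead of rank() would break ties arbitrarily, and
// filtering on rank <= 3 would return the top three students of each class.
scoreDF.withColumn("rank", rank().over(byClass))
  .filter(col("rank") === 1)
  .show()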