This article briefly introduces two ways to register user-defined functions (UDFs) with SQLContext and HiveContext.
The examples below use sqlContext (which in this session is actually a HiveContext instance) and are run in spark-shell:
scala> sc
res5: org.apache.spark.SparkContext = org.apache.spark.SparkContext@35d4035f

scala> sqlContext
res7: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@171b0d3

scala> val df = sc.parallelize(Seq(("張三", 25), ("李四", 30), ("趙六", 27))).toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.registerTempTable("emp")

1) Externally defined function:

scala> def remainWorkYears(age: Int) : Int = {
     |   60 - age
     | }
remainWorkYears: (age: Int)Int

scala> sqlContext.udf.register("remainWorkYears", remainWorkYears _)
res1: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List())

scala> sqlContext.sql("select e.*, remainWorkYears(e.age) as remainedWorkYear from emp e").show
// equivalent call on a HiveContext:
// hiveContext.sql("select e.*, remainWorkYears(e.age) as remainedWorkYear from emp e").show
+----+---+----------------+
|name|age|remainedWorkYear|
+----+---+----------------+
|  張三| 25|              35|
|  李四| 30|              30|
|  趙六| 27|              33|
+----+---+----------------+

2) Anonymous function:

scala> sqlContext.udf.register("remainWorkYears_anoymous", (age: Int) => {
     |   60 - age
     | })
res3: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List())

scala> sqlContext.sql("select e.*, remainWorkYears_anoymous(e.age) as remainedWorkYear from emp e").show
+----+---+----------------+
|name|age|remainedWorkYear|
+----+---+----------------+
|  張三| 25|              35|
|  李四| 30|              30|
|  趙六| 27|              33|
+----+---+----------------+
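
Outside spark-shell, the same two registrations might look roughly like the standalone program below. This is only a sketch: it assumes a Spark 1.x dependency (the SQLContext API used in the session above) is on the classpath, and the object name `UdfRegistrationSketch`, the application name, the `local[*]` master, and the output column names are illustrative choices that do not come from the original session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, used only for this sketch.
object UdfRegistrationSketch {

  // Named function with the same logic as in the shell session.
  def remainWorkYears(age: Int): Int = 60 - age

  def main(args: Array[String]): Unit = {
    // Assumption: local mode and this app name are for illustration only.
    val conf = new SparkConf()
      .setAppName("udf-registration-sketch")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // spark-shell imports these implicits automatically; a standalone
    // program needs them explicitly for .toDF on an RDD of tuples.
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("張三", 25), ("李四", 30), ("趙六", 27)))
      .toDF("name", "age")
    df.registerTempTable("emp")

    // 1) Register an externally defined method; the trailing `_`
    //    converts the method into a function value.
    sqlContext.udf.register("remainWorkYears", remainWorkYears _)

    // 2) Register an anonymous function directly.
    sqlContext.udf.register("remainWorkYears_anoymous", (age: Int) => 60 - age)

    // Both UDFs can then be called by name from SQL.
    sqlContext.sql(
      "select e.*, remainWorkYears(e.age) as byNamed, " +
        "remainWorkYears_anoymous(e.age) as byAnonymous from emp e"
    ).show()

    sc.stop()
  }
}
```

Both registrations go through the same `sqlContext.udf.register` call; the only difference is whether the function value comes from a named method or is written inline.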