A Brief Introduction to Spark SQL (2) -- UDF (User-Defined Functions)
The built-in DataFrame functions already provide the common aggregates, such as count(), countDistinct(), avg(), max(), and min(); a quick sketch of calling these directly is shown just below.
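A minimal sketch, assuming an existing SparkSession named spark and the employees.json sample data used later in this article (empDF is just an illustrative name):
import org.apache.spark.sql.functions.{avg, count, countDistinct, max, min}
val empDF = spark.read.json("examples/src/main/resources/employees.json")
// Apply several built-in aggregates in one pass over the salary column
empDF.agg(count("salary"), countDistinct("salary"), avg("salary"), max("salary"), min("salary")).show()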
We can also define our own aggregate functions. An untyped user-defined aggregate function is defined as follows:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
// Register the function to access it
spark.udf.register("myAverage", MyAverage)
val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
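The registered function can also be invoked from the DataFrame DSL instead of a SQL string. A minimal sketch, assuming the df and the myAverage registration shown above:
import org.apache.spark.sql.functions.{callUDF, col}
// Call the registered UDAF by name directly from the DataFrame API
df.agg(callUDF("myAverage", col("salary")).as("average_salary")).show()
// Should produce the same 3750.0 result as the SQL query above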
Below is a type-safe user-defined aggregate function, built with the Aggregator API:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
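As a side note, from Spark 3.0 onward (the code above targets the Spark 2.x API), a typed Aggregator can also be registered as an untyped SQL function via functions.udaf, provided its input type is the column value itself rather than the whole row. A rough sketch, where MyLongAverage is a hypothetical variant of MyAverage that takes the salary value directly:
import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical variant of MyAverage whose input type is the salary value (Long)
// instead of the whole Employee row, so it can be called on a single column in SQL
object MyLongAverage extends Aggregator[Long, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(buffer: Average, salary: Long): Average = {
    buffer.sum += salary
    buffer.count += 1
    buffer
  }
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Spark 3.0+ only: wrap the Aggregator with functions.udaf and register it under a SQL name
spark.udf.register("myAverage3", functions.udaf(MyLongAverage))
spark.sql("SELECT myAverage3(salary) AS average_salary FROM employees").show()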