A Brief Introduction to Spark SQL (2) -- UDFs (User-Defined Functions)
The built-in DataFrame functions provide the usual aggregate functions, such as count(), countDistinct(), avg(), max(), and min().
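For instance, here is a minimal sketch of the built-in aggregates in action (assuming an active SparkSession named `spark` and the sample employees.json shown later; `employeesDF` is just an illustrative name):

import org.apache.spark.sql.functions.{avg, count, countDistinct, max, min}

val employeesDF = spark.read.json("examples/src/main/resources/employees.json")
// For the sample data below: count = 4, distinct count = 4, avg = 3750.0, max = 4500, min = 3000
employeesDF.select(count("salary"), countDistinct("salary"), avg("salary"), max("salary"), min("salary")).show()

We can also define our own aggregate functions. An untyped user-defined aggregate function is defined as follows: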
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
// Register the function to access it
spark.udf.register("myAverage", MyAverage)
val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
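The registered function is not limited to SQL queries; here is a minimal sketch of invoking it through the DataFrame API via callUDF, reusing the df and registration from above:

import org.apache.spark.sql.functions.{callUDF, col}

// Same computation as the SQL query above, expressed directly on the DataFrame
df.select(callUDF("myAverage", col("salary")).as("average_salary")).show()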
For strongly typed Datasets, the same aggregation can be written as a type-safe user-defined aggregate function:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
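Note that from Spark 3.0 onwards, UserDefinedAggregateFunction is deprecated in favor of registering a typed Aggregator through functions.udaf, so a single typed definition can serve SQL queries as well. A minimal sketch, assuming Spark 3.0+ (MyAverageUdaf is a hypothetical name; the Aggregator's input type must now match the aggregated column, Long, rather than the whole Employee row):

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.{Encoder, Encoders}

// Reuses the Average buffer case class defined above
object MyAverageUdaf extends Aggregator[Long, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(b: Average, x: Long): Average = { b.sum += x; b.count += 1; b }
  def merge(b1: Average, b2: Average): Average = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(r: Average): Double = r.sum.toDouble / r.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register for SQL use and query it just like the untyped version
spark.udf.register("myAverage", udaf(MyAverageUdaf))
spark.sql("SELECT myAverage(salary) AS average_salary FROM employees").show()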