A Brief Introduction to Spark SQL (2) -- UDFs (User-Defined Functions)
The built-in DataFrame functions provide the usual aggregate functions, such as count(), countDistinct(), avg(), max(), and min().
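For instance, here is a minimal sketch of the built-in aggregates in action (assuming an active SparkSession named `spark` and the sample employees.json shown later; `employeesDF` is just an illustrative name):

import org.apache.spark.sql.functions.{avg, count, countDistinct, max, min}

val employeesDF = spark.read.json("examples/src/main/resources/employees.json")
// For the sample data below: count = 4, distinct count = 4, avg = 3750.0, max = 4500, min = 3000
employeesDF.select(count("salary"), countDistinct("salary"), avg("salary"), max("salary"), min("salary")).show()

We can also define our own aggregate functions. An untyped user-defined aggregate function is defined as follows: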
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
// Register the function to access it
spark.udf.register("myAverage", MyAverage)
val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
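The registered function is not limited to SQL queries; here is a minimal sketch of invoking it through the DataFrame API via callUDF, reusing the df and registration from above:

import org.apache.spark.sql.functions.{callUDF, col}

// Same computation as the SQL query above, expressed directly on the DataFrame
df.select(callUDF("myAverage", col("salary")).as("average_salary")).show()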
For strongly typed Datasets, the same aggregation can be written as a type-safe user-defined aggregate function:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
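Note that from Spark 3.0 onwards, UserDefinedAggregateFunction is deprecated in favor of registering a typed Aggregator through functions.udaf, so a single typed definition can serve SQL queries as well. A minimal sketch, assuming Spark 3.0+ (MyAverageUdaf is a hypothetical name; the Aggregator's input type must now match the aggregated column, Long, rather than the whole Employee row):

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.{Encoder, Encoders}

// Reuses the Average buffer case class defined above
object MyAverageUdaf extends Aggregator[Long, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(b: Average, x: Long): Average = { b.sum += x; b.count += 1; b }
  def merge(b1: Average, b2: Average): Average = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(r: Average): Double = r.sum.toDouble / r.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register for SQL use and query it just like the untyped version
spark.udf.register("myAverage", udaf(MyAverageUdaf))
spark.sql("SELECT myAverage(salary) AS average_salary FROM employees").show()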