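Spark in Practice: Combined Usage and a User-Behavior Case Study on Top Products per Region (Spark Commercial Application in Practice)

Published by the Kaixin Cloud technology community (凱新雲技術社群) on 2018-12-25

Original URL: https://juejin.im/post/5c2243936fb9a049ac793108

Copyright notice: this technical column is the author's (Qin Kaixin, 秦凱新) summary and distillation of day-to-day work. It draws its cases from real commercial environments and offers tuning advice for commercial applications as well as cluster capacity planning. Please keep following this blog series. QQ mail: 1120746959@qq.com; feel free to get in touch for any technical discussion.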

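1 UDAF: accumulating city information in an e-commerce workload

MutableAggregationBuffer behaves like an array of fields; here we read its first field with buffer.getString(0).

Each incoming string is appended to the value held in buffer.getString(0).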

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

  class GroupConcatDistinctUDAF extends UserDefinedAggregateFunction {

    // Input data type
    override def inputSchema: StructType = StructType(StructField("cityInfo", StringType) :: Nil)

    // Buffer data type
    override def bufferSchema: StructType = StructType(StructField("bufferCityInfo", StringType) :: Nil)

    // Output data type
    override def dataType: DataType = StringType

    // Deterministic: the same input always produces the same output
    override def deterministic: Boolean = true

    override def initialize(buffer: MutableAggregationBuffer): Unit = {
      buffer(0) = ""
    }

    /**
      * Update.
      * The field values of a group are passed in one at a time;
      * this is where the concatenation logic lives.
      */
    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      // City info string already concatenated in the buffer
      var bufferCityInfo = buffer.getString(0)
      // City info that was just passed in
      val cityInfo = input.getString(0)

      // Deduplicate: append only if this city info has not been concatenated yet.
      // Note: contains is a substring test, so "1:x" also matches inside "11:x";
      // this is safe only while no entry is a substring of another entry.
      if (!bufferCityInfo.contains(cityInfo)) {
        if ("".equals(bufferCityInfo))
          bufferCityInfo += cityInfo
        else {
          // e.g. 1:Beijing
          // then 1:Beijing,2:Shanghai
          bufferCityInfo += "," + cityInfo
        }

        buffer.update(0, bufferCityInfo)
      }
    }

    override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
      var bufferCityInfo1 = buffer1.getString(0)
      val bufferCityInfo2 = buffer2.getString(0)

      // Append every city info from buffer2 that buffer1 does not contain yet
      for (cityInfo <- bufferCityInfo2.split(",")) {
        if (!bufferCityInfo1.contains(cityInfo)) {
          if ("".equals(bufferCityInfo1)) {
            bufferCityInfo1 += cityInfo
          } else {
            bufferCityInfo1 += "," + cityInfo
          }
        }
      }

      buffer1.update(0, bufferCityInfo1)
    }

    override def evaluate(buffer: Row): Any = {
      buffer.getString(0)
    }

  }
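The deduplication in update and merge relies on String.contains, which is a substring test. With the ten single-digit city IDs used later in this post that is harmless, but with IDs such as 1 and 11 an entry like "1:A" would be treated as already present inside "11:A". The following is a minimal sketch of an exact-match variant in plain Scala, without Spark. The object name CityInfoConcat and its helpers are hypothetical and are not part of the original code.

```scala
object CityInfoConcat {
  // Split a comma-separated buffer into its entries; an empty buffer has none.
  private def entries(buffer: String): Vector[String] =
    if (buffer.isEmpty) Vector.empty else buffer.split(",").toVector

  // Append cityInfo unless an identical entry is already present.
  def append(buffer: String, cityInfo: String): String =
    if (entries(buffer).contains(cityInfo)) buffer
    else if (buffer.isEmpty) cityInfo
    else buffer + "," + cityInfo

  // Merge two buffers by appending each entry of the second to the first.
  def merge(buffer1: String, buffer2: String): String =
    entries(buffer2).foldLeft(buffer1)(append)
}
```

The same comparison could be used inside the UDAF's update and merge methods if the IDs in your data can be prefixes or suffixes of one another.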

2 UDAF: an untyped user-defined aggregate function that computes an average

Sample data

  The first column is user_id, the second is item_id, and the third is score.
  0162381440670851711,4,7.0
  0162381440670851711,11,4.0
  0162381440670851711,32,1.0
  0162381440670851711,176,27.0
  0162381440670851711,183,11.0
  0162381440670851711,184,5.0
  0162381440670851711,207,9.0
  0162381440670851711,256,3.0
  0162381440670851711,258,4.0
  0162381440670851711,259,16.0
  0162381440670851711,260,8.0
  0162381440670851711,261,18.0
  0162381440670851711,301,1.0

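1. inputSchema

inputSchema defines the schema of the input data. It must be a StructType, which is built from a sequence of StructField values. To describe the score column, we declare a StructField named score_column with the data type DoubleType. The input has only this one column, so the sequence has a single element.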
```
  override def inputSchema: StructType = StructType(StructField("score_column",DoubleType)::Nil)
```
2. bufferSchema

Computing the average score needs two pieces of intermediate data: the running total sum and the number of scores count. bufferSchema defines them.
```
    override def bufferSchema: StructType = StructType(StructField("sum",DoubleType)::StructField("count",LongType)::Nil)
```
3. dataType

dataType declares the type of the aggregate function's final result. The computed average score is a Double.
```
   override def dataType: DataType = DoubleType
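```
4. deterministic

deterministic is a Boolean flag that declares whether the function is deterministic: when it is true, the same input always yields the same output. The same set of scores must always give the same average, so it is set to true.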
```
  override def deterministic: Boolean = true
```
5. initialize

initialize sets up the aggregation buffer. The buffer here has two fields, sum and count, which are initialized to sum = 0.0 (a Double) and count = 0L (a Long).

    override def initialize(buffer: MutableAggregationBuffer): Unit = {
      // sum = 0.0
      buffer(0) = 0.0
      // count = 0
      buffer(1) = 0L
    }

6. update

update folds each new input row into the buffer. When a new score arrives, its value is added to sum and count is incremented by 1.

    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      // Skip null input
      if (!input.isNullAt(0)) {
        // sum = sum + input score
        buffer(0) = buffer.getDouble(0) + input.getDouble(0)
        // count = count + 1
        buffer(1) = buffer.getLong(1) + 1
      }
    }

7. merge

merge combines two partial aggregation buffers, for example buffers built on different partitions, by adding the sum and count of buffer2 into buffer1.

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        buffer1(0)=buffer1.getDouble(0)+buffer2.getDouble(0)
        buffer1(1)=buffer1.getLong(1)+buffer2.getLong(1)
      }

8. evaluate

evaluate computes the final result from the buffer. Here it returns the average score: average(score) = sum(score) / count(score).
```
  override def evaluate(buffer: Row): Double = buffer.getDouble(0)/buffer.getLong(1)
```
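To see how the eight methods fit together, it can help to trace the buffer by hand. The sketch below mimics the initialize, update, merge, and evaluate steps with plain Scala collections, treating each inner sequence as one partition. It only illustrates the data flow and is not how Spark schedules the work. The names AverageBufferSketch, Buffer, and average are hypothetical.

```scala
object AverageBufferSketch {
  // Mirrors bufferSchema: a running sum and a count.
  final case class Buffer(sum: Double, count: Long)

  // initialize: start from sum = 0.0 and count = 0.
  val initial: Buffer = Buffer(0.0, 0L)

  // update: fold one score into the buffer; a missing (null) score is skipped.
  def update(buffer: Buffer, score: Option[Double]): Buffer = score match {
    case Some(s) => Buffer(buffer.sum + s, buffer.count + 1)
    case None    => buffer
  }

  // merge: combine two partial buffers.
  def merge(b1: Buffer, b2: Buffer): Buffer =
    Buffer(b1.sum + b2.sum, b1.count + b2.count)

  // evaluate: the final average.
  def evaluate(buffer: Buffer): Double = buffer.sum / buffer.count

  // Aggregate every partition on its own, then merge the partial buffers.
  def average(partitions: Seq[Seq[Option[Double]]]): Double = {
    val partials = partitions.map(_.foldLeft(initial)(update))
    evaluate(partials.foldLeft(initial)(merge))
  }
}
```

Because merge only adds sums and counts, the result does not depend on how the rows are split across partitions.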

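3 Type-Safe User-Defined Aggregate Functions

Data holds one record of itemdata.data, and Average holds the intermediate state used to compute the average score. Note that the fields sum and count of Average must be declared as var, because reduce and merge update them in place.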
```
  case class Data(user_id: String, item_id: String, score: Double)
  case class Average(var sum: Double,var count: Long)
```
Full source

Build the aggregate column with toColumn.name("average_score").

Apply the aggregate function with dataDS.select(averageScore).show().

  import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
  import org.apache.spark.sql.expressions.Aggregator

  /**
    * Type-safe user-defined aggregate function
    */
  object TypeSafeMyAverageTest {

    case class Data(user_id: String, item_id: String, score: Double)
    case class Average(var sum: Double, var count: Long)

    object SafeMyAverage extends Aggregator[Data, Average, Double] {

      // zero plays the role of initialize in the untyped UDAF: it creates the initial Average buffer
      override def zero: Average = Average(0.0D, 0L)

      // reduce plays the role of update in the untyped UDAF: fold a new record a into the buffer b
      override def reduce(b: Average, a: Data): Average = {
        b.sum += a.score
        b.count += 1L
        b
      }

      // Combine two partial buffers
      override def merge(b1: Average, b2: Average): Average = {
        b1.sum += b2.sum
        b1.count += b2.count
        b1
      }

      // Compute the final average from the buffer
      override def finish(reduction: Average): Double = reduction.sum / reduction.count

      // Encoder for the buffer type
      override def bufferEncoder: Encoder[Average] = Encoders.product

      // Encoder for the output type
      override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    def main(args: Array[String]): Unit = {
      // Create the Spark SQL entry point
      val spark = SparkSession.builder().master("local").appName("My-Average").getOrCreate()
      // Read itemdata.data from HDFS into an RDD
      val rdd = spark.sparkContext.textFile("hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data")
      // Convert the RDD to a Dataset
      import spark.implicits._
      val dataDS = rdd.map(_.split(",")).map(d => Data(d(0), d(1), d(2).toDouble)).toDS()
      // Build the aggregate column and apply it
      val averageScore = SafeMyAverage.toColumn.name("average_score")
      dataDS.select(averageScore).show()
    }
  }
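The parsing step in main assumes that every line has exactly three comma-separated fields and a numeric score; a blank or malformed line would throw and fail the job. Below is a minimal sketch of a more defensive parser in plain Scala. ItemDataParser and parseLine are hypothetical names, and silently dropping bad lines is an assumption about what you want, not something the original post specifies.

```scala
import scala.util.Try

object ItemDataParser {
  final case class Data(user_id: String, item_id: String, score: Double)

  // Parse "user_id,item_id,score"; return None for a malformed line.
  def parseLine(line: String): Option[Data] =
    line.split(",").map(_.trim) match {
      case Array(userId, itemId, score) =>
        Try(score.toDouble).toOption.map(s => Data(userId, itemId, s))
      case _ => None
    }
}
```

In the Spark job this could replace the two map calls with something like rdd.flatMap(line => ItemDataParser.parseLine(line)), so that lines which fail to parse are skipped instead of aborting the stage.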

4 Hands-on project: top products per region


4.1 Region model

[Figure: analysis flow for top products per region]

4.2 Create the Spark client

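import java.util.UUID
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Task run ID: uniquely identifies the results of this run in the MySQL database
val taskUUID = UUID.randomUUID().toString

// Build the Spark configuration
val sparkConf = new SparkConf().setAppName("SessionAnalyzer").setMaster("local[*]")

// Create the Spark client
val spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val sc = spark.sparkContext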

4.3 Register the UDFs

    // Register the user-defined functions
    spark.udf.register("concat_long_string", (v1: Long, v2: String, split: String) => v1.toString + split + v2)

    spark.udf.register("get_json_object", (json: String, field: String) => {
      val jsonObject = JSONObject.fromObject(json)
      jsonObject.getString(field)
    })
    spark.udf.register("group_concat_distinct", new GroupConcatDistinctUDAF())

    // Read the task date parameters
    val startDate = ParamUtils.getParam(taskParam, Constants.PARAM_START_DATE)
    val endDate = ParamUtils.getParam(taskParam, Constants.PARAM_END_DATE)

4.4 Query city-level click actions within the user-specified date range

val cityid2clickActionRDD = getcityid2ClickActionRDDByDate(spark, startDate, endDate)

def getcityid2ClickActionRDDByDate(spark: SparkSession, startDate: String, endDate: String): RDD[(Long, Row)] = {
    // Query user visit actions from user_visit_action.
    // Filter 1: click_product_id is not null (and not -1), which marks the row as a click action.
    // Filter 2: the row falls within the user-specified date range.

    val sql =
      "SELECT " +
          "city_id," +
          "click_product_id " +
        "FROM user_visit_action " +
        "WHERE click_product_id IS NOT NULL and click_product_id != -1L " +
          "AND date>='" + startDate + "' " +
          "AND date<='" + endDate + "'"

    val clickActionDF = spark.sql(sql)

    // (cityid, row)
    clickActionDF.rdd.map(item => (item.getAs[Long]("city_id"), item))
  }
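The query above is assembled by string concatenation, so whatever is in startDate and endDate ends up inside the SQL text. As a hedged sketch, the same statement can be built with a triple-quoted interpolated string after checking that both parameters really are dates. This assumes the date column holds yyyy-MM-dd strings, which the original post does not state. ClickActionQuery and requireIsoDate are hypothetical names.

```scala
import java.time.LocalDate
import scala.util.Try

object ClickActionQuery {
  // Accept only ISO dates such as 2018-12-25, so arbitrary text cannot reach the SQL string.
  private def requireIsoDate(value: String): String = {
    require(Try(LocalDate.parse(value)).isSuccess, s"not a yyyy-MM-dd date: $value")
    value
  }

  // Build the click-action query for an inclusive date range.
  def build(startDate: String, endDate: String): String = {
    val start = requireIsoDate(startDate)
    val end = requireIsoDate(endDate)
    s"""SELECT city_id, click_product_id
       |FROM user_visit_action
       |WHERE click_product_id IS NOT NULL AND click_product_id != -1L
       |  AND date >= '$start'
       |  AND date <= '$end'""".stripMargin
  }
}
```

The resulting string can be passed to spark.sql exactly like the concatenated version; a parameter that is not a valid date fails fast with an IllegalArgumentException.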

4.5 Query city information

 def getcityid2CityInfoRDD(spark: SparkSession): RDD[(Long, Row)] = {

    // (city_id, city_name, area); the area names match the literals used in the CASE WHEN of section 4.9
    val cityInfo = Array((0L, "Beijing", "China North"), (1L, "Shanghai", "China East"), (2L, "Nanjing", "China East"), (3L, "Guangzhou", "China South"), (4L, "Sanya", "China South"), (5L, "Wuhan", "China Middle"), (6L, "Changsha", "China Middle"), (7L, "Xi'an", "West North"), (8L, "Chengdu", "West South"), (9L, "Harbin", "East North"))
    import spark.implicits._
    val cityInfoDF = spark.sparkContext.makeRDD(cityInfo).toDF("city_id", "city_name", "area")
    cityInfoDF.rdd.map(item => (item.getAs[Long]("city_id"), item))
  }

// (city_id, city info)
val cityid2cityInfoRDD = getcityid2CityInfoRDD(spark)

4.6 Build the temporary table of basic click-product information

// Join the click actions (cityid2clickActionRDD) with the city information (cityid2cityInfoRDD)
// and register the result as tmp_click_product_basic
generateTempClickProductBasicTable(spark, cityid2clickActionRDD, cityid2cityInfoRDD)

def generateTempClickProductBasicTable(spark: SparkSession, cityid2clickActionRDD: RDD[(Long, Row)], cityid2cityInfoRDD: RDD[(Long, Row)]): Unit = {
    // Join the click actions with the city data on city_id
    val joinedRDD = cityid2clickActionRDD.join(cityid2cityInfoRDD)

    // Map the joined pair RDD to an RDD of tuples so that it can be converted to a DataFrame
    val mappedRDD = joinedRDD.map { case (cityid, (action, cityinfo)) =>
      val productid = action.getLong(1)
      val cityName = cityinfo.getString(1)
      val area = cityinfo.getString(2)
      (cityid, cityName, area, productid)
    }
    // Example: the rows "1 Beijing", "2 Shanghai", "1 Beijing",
    // after GROUP BY area, product_id, become "1:Beijing,2:Shanghai"

    // Two functions are used in the next step:
    // UDF concat_long_string(): joins two fields with the given separator
    // UDAF group_concat_distinct(): joins the values of a group with commas and removes duplicates
    import spark.implicits._
    val df = mappedRDD.toDF("city_id", "city_name", "area", "product_id")
    // Register df as a temporary view
    df.createOrReplaceTempView("tmp_click_product_basic")
}

4.7 Build the temporary table of click counts per region and product

    generateTempAreaProductClickCountTable(spark)

    def generateTempAreaProductClickCountTable(spark: SparkSession): Unit = {

        // Group by area and product_id,
        // count the clicks of each product in each area,
        // and collect the concatenated city info string for each (area, product_id) pair
        val sql = "SELECT " +
            "area," +
            "product_id," +
            "count(*) click_count, " +
            "group_concat_distinct(concat_long_string(city_id,city_name,':')) city_infos " +
          "FROM tmp_click_product_basic " +
          "GROUP BY area,product_id "

        val df = spark.sql(sql)

        // Click counts per area and product (plus the city list), registered as another temporary view
        df.createOrReplaceTempView("tmp_area_product_click_count")
  }
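To make the shape of the result concrete, sections 4.6 and 4.7 can be mimicked on small in-memory collections: join clicks to cities, group by (area, product_id), count, and build the distinct city list. This is only an illustrative sketch in plain Scala, not the Spark execution path. AreaProductClickCountSketch, City, and AreaProductCount are hypothetical names.

```scala
object AreaProductClickCountSketch {
  final case class City(cityId: Long, cityName: String, area: String)
  final case class AreaProductCount(area: String, productId: Long, clickCount: Long, cityInfos: String)

  // clicks: (city_id, click_product_id) pairs; cities: the city dimension table.
  def aggregate(clicks: Seq[(Long, Long)], cities: Seq[City]): Seq[AreaProductCount] = {
    val cityById = cities.map(c => c.cityId -> c).toMap

    // Inner join: clicks whose city_id is unknown are dropped.
    val joined = clicks.flatMap { case (cityId, productId) =>
      cityById.get(cityId).map(city => (city, productId))
    }

    joined
      .groupBy { case (city, productId) => (city.area, productId) }
      .map { case ((area, productId), rows) =>
        // concat_long_string(city_id, city_name, ':'), then a distinct comma-joined list
        val cityInfos = rows
          .map { case (city, _) => s"${city.cityId}:${city.cityName}" }
          .distinct
          .mkString(",")
        AreaProductCount(area, productId, rows.size.toLong, cityInfos)
      }
      .toSeq
      .sortBy(r => (r.area, r.productId))
  }
}
```

Each output row corresponds to one row of tmp_area_product_click_count: the click count covers all clicks in the area, while city_infos lists each contributing city once.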

4.8 Build the temporary table of click counts per region and product with full product information

Join tmp_area_product_click_count with product_info to add product details to the per-region, per-product click counts.

generateTempAreaFullProductClickCountTable(spark)

def generateTempAreaFullProductClickCountTable(spark: SparkSession): Unit = {

  // Join the per-region, per-product click count table on product_id
  // with the product information table (product_id, product_name, product_status).
  // product_status needs special handling: 0 means self-operated and 1 means third-party,
  // and the value is stored inside a JSON string (extend_info).
  // get_json_object() extracts the value of a given field from a JSON string.
  // if() then decides: product_status 0 is a self-operated product, 1 is a third-party product.
  // Result columns: area, product_id, click_count, city_infos, product_name, product_status
  //
  // Once you have the top 3 products of a region, whether each one is self-operated
  // or third-party is an important piece of information.
  //
  // Technique: using the built-in if function

  val sql = "SELECT " +
      "tapcc.area," +
      "tapcc.product_id," +
      "tapcc.click_count," +
      "tapcc.city_infos," +
      "pi.product_name," +
      "if(get_json_object(pi.extend_info,'product_status')='0','Self','Third Party') product_status " +
    "FROM tmp_area_product_click_count tapcc " +
      "JOIN product_info pi ON tapcc.product_id=pi.product_id "

  val df = spark.sql(sql)

  df.createOrReplaceTempView("tmp_area_fullprod_click_count")
}

4.9 Use a window function to get the top 3 most-clicked products in each region

    val areaTop3ProductRDD = getAreaTop3ProductRDD(taskUUID, spark)

    def getAreaTop3ProductRDD(taskid: String, spark: SparkSession): DataFrame = {

      // Areas: China North, China East, China South, China Middle, West North, West South, East North
      // A Level: China North, China East
      // B Level: China South, China Middle
      // C Level: West North, West South
      // D Level: East North
      // Note: these literals must match the values stored in the area column exactly

      // CASE WHEN
      // maps several conditions to different values:
      // CASE WHEN ... THEN ... WHEN ... THEN ... ELSE ... END

      val sql = "SELECT " +
          "area," +
          "CASE " +
            "WHEN area='China North' OR area='China East' THEN 'A Level' " +
            "WHEN area='China South' OR area='China Middle' THEN 'B Level' " +
            "WHEN area='West North' OR area='West South' THEN 'C Level' " +
            "ELSE 'D Level' " +
          "END area_level," +
          "product_id," +
          "city_infos," +
          "click_count," +
          "product_name," +
          "product_status " +
        "FROM (" +
          "SELECT " +
            "area," +
            "product_id," +
            "click_count," +
            "city_infos," +
            "product_name," +
            "product_status," +
            "row_number() OVER (PARTITION BY area ORDER BY click_count DESC) rank " +
          "FROM tmp_area_fullprod_click_count " +
          ") t " +
        "WHERE rank<=3"

      spark.sql(sql)
    }
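The two ideas in this query, the CASE WHEN level mapping and the row_number() window used to keep the top 3 rows per area, can be sketched on plain Scala collections as follows. AreaTop3Sketch, ProductCount, areaLevel, and topN are hypothetical names. One difference worth noting: sortBy is stable, so ties keep their input order here, whereas the order that row_number() gives to rows with equal click_count is not guaranteed.

```scala
object AreaTop3Sketch {
  final case class ProductCount(area: String, productId: Long, clickCount: Long)

  // The CASE WHEN expression: map an area to its level.
  def areaLevel(area: String): String = area match {
    case "China North" | "China East"   => "A Level"
    case "China South" | "China Middle" => "B Level"
    case "West North" | "West South"    => "C Level"
    case _                              => "D Level"
  }

  // row_number() OVER (PARTITION BY area ORDER BY click_count DESC), keeping rank <= n.
  def topN(rows: Seq[ProductCount], n: Int = 3): Map[String, Seq[ProductCount]] =
    rows.groupBy(_.area).map { case (area, group) =>
      area -> group.sortBy(r => -r.clickCount).take(n)
    }
}
```

Grouping first and then sorting inside each group is the collection equivalent of PARTITION BY followed by ORDER BY within the partition.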

4.10 Save to the database

import spark.implicits._
val areaTop3ProductDF = areaTop3ProductRDD.rdd.map(row =>
  AreaTop3Product(taskUUID, row.getAs[String]("area"), row.getAs[String]("area_level"), row.getAs[Long]("product_id"), row.getAs[String]("city_infos"), row.getAs[Long]("click_count"), row.getAs[String]("product_name"), row.getAs[String]("product_status"))
).toDS

areaTop3ProductDF.write
  .format("jdbc")
  .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
  .option("dbtable", "area_top3_product")
  .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
  .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
  .mode(SaveMode.Append)
  .save()
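The AreaTop3Product class used above is not shown in this post. A minimal sketch of what it might look like, inferred only from the order and types of the constructor arguments, is given below; every field name here is an assumption.

```scala
// Hypothetical definition, inferred from the constructor call in the save step.
// The field names are assumptions and are not taken from the original project.
final case class AreaTop3Product(
  taskid: String,
  area: String,
  areaLevel: String,
  productid: Long,
  cityInfos: String,
  clickCount: Long,
  productName: String,
  productStatus: String
)
```

The columns of the resulting Dataset are named after the case class fields, so in a real project the field names would presumably be chosen to line up with the columns of the area_top3_product table.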

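5 Summary

Reviewing the old is how one learns the new. This post is a code recap written as a comprehensive review; the content is rough, so please bear with it.

Qin Kaixin (秦凱新), Shenzhen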
