Spark Project: Log Data Analysis and Processing
I. Project Preparation
- The data to be analyzed and processed
- Data dictionary for the log fields
If you need them, click the link to download.
Link: project data. Extraction code: 599q
II. Project Requirements
- Use Spark to complete the following log analysis tasks:
- Log data cleaning
- User retention analysis
- Active user analysis
III. Project Implementation
1. Data Cleaning
- Read the log file and convert it to RDD[Row]
- Split each line on Tab
- Filter out records with fewer than 8 fields
- Clean the data
- Deduplicate on the first and second columns
- Filter out records whose status code is not 200
- Filter out records with an empty event_time
- Split the url on "&" and "="
- Save the data
- Write the data into a MySQL table
Log field split analysis
//Custom log fields:
event_time
url
method
status
sip
user_uip
action_prepend
action_client
- Splitting a sample log line, field by field
2018-09-04T20:27:31+08:00 http://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157 GET 200 192.168.168.64 - - Apache-HttpClient/4.1.2 (java 1.5)
//First split on \t and give each column a field name:
event_time : 2018-09-04T20:27:31+08:00
url : http://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157
method : GET
status : 200
sip : 192.168.168.64
user_uip : -
action_prepend : -
action_client : Apache-HttpClient/4.1.2 (java 1.5)
//Then split the url column on "?":
http://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157
http://datacenter.bdqn.cn/logs/user
actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157
//Then split the second element of the resulting array on "&":
actionBegin=1536150451540
actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0
actionEnd=1536150451668
actionName=startEval
actionTest=0
actionType=3
actionValue=272090
clientType=001_kgc
examType=001
ifEquipment=web
isFromContinue=false
skillIdCount=0
skillLevel=0
testType=jineng
userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3
userUID=272090
userUIP=1.180.18.157
//Then split each pair on "=":
actionBegin 1536150451540
actionClient Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0
actionEnd 1536150451668
actionName startEval
actionTest 0
actionType 3
actionValue 272090
clientType 001_kgc
examType 001
ifEquipment web
isFromContinue false
skillIdCount 0
skillLevel 0
testType jineng
userSID B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3
userUID 272090
userUIP 1.180.18.157
//Take the first element of each resulting array as the key:
actionBegin
actionClient
actionEnd
actionName
actionTest
actionType
actionValue
clientType
examType
ifEquipment
isFromContinue
skillIdCount
skillLevel
testType
userSID
userUID
userUIP
//Merge all fields (a short plain-Scala sketch of the whole split chain follows this list):
event_time
url
actionBegin
actionClient
actionEnd
actionName
actionTest
actionType
actionValue
clientType
examType
ifEquipment
isFromContinue
skillIdCount
skillLevel
testType
userSID
userUID
userUIP
method
status
sip
user_uip
action_prepend
action_client
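Before moving on to the full Spark job, here is a minimal plain-Scala sketch of the split chain just walked through (Tab, then "?", then "&" and "="). The shortened sample line and the names SplitSketch, fields and params are illustrative only.

// A minimal sketch of the split chain described above (plain Scala, no Spark).
// `line` is a trimmed version of the sample log line; names are illustrative.
object SplitSketch {
  def main(args: Array[String]): Unit = {
    val line = "2018-09-04T20:27:31+08:00\thttp://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&userUID=272090\tGET\t200\t192.168.168.64\t-\t-\tApache-HttpClient/4.1.2 (java 1.5)"

    val fields: Array[String] = line.split("\t")   // 8 top-level fields
    require(fields.length == 8)
    val url = fields(1)

    // Split the url on "?"; the second part (if present) is the query string.
    val parts: Array[String] = url.split("\\?")
    val params: Map[String, String] =
      if (parts.length == 2)
        parts(1).split("&")                        // "key=value" pairs
          .map(_.split("="))                       // Array(key, value)
          .filter(_.length == 2)
          .map(kv => kv(0) -> kv(1))
          .toMap
      else Map.empty[String, String]

    println(params.getOrElse("userUID", ""))       // prints 272090
  }
}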
Developing the job in IDEA
import java.util.Properties
import org.apache.commons.lang.StringUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
object DataClear {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]")
      .appName("DataClear").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    // Load the data
    val lineRDD: RDD[String] = sc.textFile("in/project/test.log")
    // Split on Tab, keep only records with exactly 8 fields, and give each column a name
    val RowRDD: RDD[Row] = lineRDD.map(x => x.split("\t")).filter(x => x.length == 8)
      .map(x => Row(x(0), x(1).trim, x(2).trim, x(3).trim, x(4).trim, x(5).trim, x(6).trim, x(7).trim))
    val schema = StructType(
      Array(
        StructField("event_time", StringType),
        StructField("url", StringType),
        StructField("method", StringType),
        StructField("status", StringType),
        StructField("sip", StringType),
        StructField("user_uip", StringType),
        StructField("action_prepend", StringType),
        StructField("action_client", StringType)
      )
    )
    val orgDF: DataFrame = spark.createDataFrame(RowRDD, schema)
    // orgDF.printSchema()
    // orgDF.show()
    // Deduplicate on the first and second columns, drop records whose status code is not 200,
    // and drop records with an empty event_time
    val ds1: Dataset[Row] = orgDF.dropDuplicates("event_time", "url")
      .filter(x => x(3) == "200") // each Row comes in; keep it if the column at index 3 equals "200"
      // .filter(x => x(0).equals("") == false) // equivalent to the line below
      .filter(x => StringUtils.isNotEmpty(x(0).toString))
    // Split the url on "&" and "="
    val detailDF: DataFrame = ds1.map(row => {
      // For each Row, take the value of its url column and split it into an array on "?"
      val strings: Array[String] = row.getAs[String]("url").split("\\?")
      var map: Map[String, String] = Map("params" -> "null")
      if (strings.length == 2) {
        // strings(1).split("&").toString.split("=") // toString on an array yields its reference,
        // .filter(_.length == 2).map(x => (x(0), x(1))).toMap // so nothing passes the length-2 filter and the result stays empty
        val str: Array[String] = strings(1).split("&")
        map = str.map(x => x.split("=")) // Array's own map method (transforming type and values), not the Spark operator
          .filter(x => x.length == 2).map(x => (x(0), x(1))).toMap // turn each (key, value) tuple into a Map entry
      }
      (row.getAs[String]("event_time"),
        map.getOrElse("actionBegin", ""),
        map.getOrElse("actionClient", ""),
        map.getOrElse("actionEnd", ""),
        map.getOrElse("actionName", ""),
        map.getOrElse("actionTest", ""),
        map.getOrElse("actionType", ""),
        map.getOrElse("actionValue", ""),
        map.getOrElse("clientType", ""),
        map.getOrElse("examType", ""),
        map.getOrElse("ifEquipment", ""),
        map.getOrElse("isFromContinue", ""),
        map.getOrElse("testType", ""),
        map.getOrElse("userSID", ""),
        map.getOrElse("userUID", ""),
        map.getOrElse("userUIP", ""),
        row.getAs[String]("method"),
        row.getAs[String]("status"),
        row.getAs[String]("sip"),
        row.getAs[String]("user_uip"),
        row.getAs[String]("action_prepend"),
        row.getAs[String]("action_client"))
    }).toDF(
      "event_time",
      "actionBegin",
      "actionClient",
      "actionEnd",
      "actionName",
      "actionTest",
      "actionType",
      "actionValue",
      "clientType",
      "examType",
      "ifEquipment",
      "isFromContinue",
      "testType",
      "userSID",
      "userUID",
      "userUIP",
      "method",
      "status",
      "sip",
      "user_uip",
      "action_prepend",
      "action_client"
    )
    detailDF.show(2, false)
    // Write the data into MySQL tables
    val url = "jdbc:mysql://192.168.198.201:3306/test"
    val prop = new Properties()
    prop.setProperty("user", "root")
    prop.setProperty("password", "ok")
    prop.setProperty("driver", "com.mysql.jdbc.Driver")
    detailDF.write.mode("overwrite").jdbc(url, "detailDF", prop)
    orgDF.write.mode("overwrite").jdbc(url, "orgDF", prop)
  }
}
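As a hedged alternative sketch (not what the project code above does), the same url parsing can be pushed into Spark SQL's built-in functions: str_to_map turns the query string into a map column in one expression. The names withParams and detailDF2 are illustrative, and only two of the query-string keys are shown.

// Sketch only: url parsing with built-in SQL functions instead of a typed map over ds1.
import org.apache.spark.sql.functions.{col, expr}

val withParams = ds1
  // split(url, '[?]')[1] is the query string (null when the url has no '?');
  // str_to_map then turns "k1=v1&k2=v2" into a map<string,string>.
  .withColumn("params", expr("str_to_map(split(url, '[?]')[1], '&', '=')"))

val detailDF2 = withParams
  .withColumn("userUID", col("params").getItem("userUID"))
  .withColumn("actionName", col("params").getItem("actionName"))
  // ...repeat for the remaining query-string keys...
  .drop("params")

Note that with this variant missing keys come back as null rather than the empty string used in the map-based version above.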
2. User Retention Analysis
- Compute the next-day retention rate
  - Count the total number of users newly registered on a given day, n
  - Intersect that day's new-user IDs with the next day's sign-in IDs to get m, the number of new users who sign in again the next day (the next-day retained users)
  - m / n * 100% (a quick worked example follows this list)
- Compute the next-week retention rate
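As a quick worked example, using the figures that the code below ends up printing: if n = 381 users register on a given day and m = 355 of them sign in again the next day, the next-day retention rate is m / n * 100% = 355 / 381 * 100% ≈ 93.2%.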
package project
import java.text.SimpleDateFormat
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object UserAnalysis {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("useranalysis").master("local[*]").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    // Read the data back from MySQL
    val url = "jdbc:mysql://192.168.198.201:3306/test"
    val user = "root"
    val password = "ok"
    val driver = "com.mysql.jdbc.Driver"
    val prop = new Properties()
    prop.setProperty("user", user)
    prop.setProperty("password", password)
    prop.setProperty("driver", driver)
    val detailDF: DataFrame = spark.read.jdbc(url, "detailDF", prop)
    // detailDF.printSchema()
    // detailDF.show()
    // Register a UDF that truncates the time to year-month-day and converts it to an
    // epoch timestamp in milliseconds, which makes the date arithmetic below easier
    val TimeFun: UserDefinedFunction = spark.udf.register("time", (x: String) => {
      val time: Long = new SimpleDateFormat("yyyy-MM-dd").parse(x.toString.substring(0, 10)).getTime
      time
    })
    // Filter out the registration events
    val registDS: Dataset[Row] = detailDF.filter(detailDF("actionName") === "Registered")
      .withColumnRenamed("event_time", "regist_time")
      .select("userUID", "regist_time")
      .withColumnRenamed("userUID", "registUID")
    // Filter out the sign-in events
    val signDS: Dataset[Row] = detailDF.filter("actionName='Signin'")
      .withColumnRenamed("event_time", "sign_time")
      .select("userUID", "sign_time")
      .withColumnRenamed("userUID", "signUID")
    // Apply the UDF and drop duplicate records
    val registDS1: Dataset[Row] = registDS.select(registDS("registUID"), TimeFun(registDS("regist_time")).as("regist_time")).distinct()
    val signDS1: Dataset[Row] = signDS.select(signDS("signUID"), TimeFun(signDS("sign_time")).as("sign_time")).distinct()
    // Join registrations and sign-ins on UID to get their intersection; an inner join is used
    // so that unmatched rows (nulls) do not enter the time arithmetic below
    val joinDF: DataFrame = registDS1.join(signDS1, registDS1("registUID") === signDS1("signUID"))
    // joinDF.show(3, false)
    println("Next-day retention rate")
    // Keep only next-day sign-ins, then group by time to get, per day, the number of
    // newly registered users who signed in the following day
    val daysignDF: DataFrame = joinDF.filter(joinDF("sign_time") - joinDF("regist_time") === 86400000)
      .groupBy("sign_time") // grouping by regist_time would work just as well here, because the filter
      // above already ties each sign_time to its regist_time; it would also let the join for
      // joinCountDF below use the registration time directly as the join condition
      .count().withColumnRenamed("count", "sign_count")
    // Group registrations by time to get the number of registrations per day
    val dayregistDF: DataFrame = registDS1.groupBy("regist_time").count().withColumnRenamed("count", "regist_count")
    // Join the two DataFrames on "the times differ by one day"
    val joinCountDF: DataFrame = dayregistDF.join(daysignDF, daysignDF("sign_time") - dayregistDF("regist_time") === 86400000)
    // Compute the next-day retention rate
    // joinCountDF columns: regist_time(0), regist_count(1), sign_time(2), sign_count(3)
    val dayKeepDF: DataFrame = joinCountDF.map(x => (x.getAs[Long]("regist_time"),
      x.getAs[Long]("regist_count"),
      x(3).toString.toLong,
      x(3).toString.toDouble / x(1).toString.toDouble
    )
    ).toDF("regist_time", "regist_count", "sign_count", "oneDaykeep")
    dayKeepDF.show()
    println("Next-week retention rate")
    // Keep only sign-ins exactly seven days after registration, then group by time to get,
    // per day, the number of newly registered users who signed in a week later
    val weeksignDF: DataFrame = joinDF.filter(joinDF("sign_time") - joinDF("regist_time") === 86400000 * 7)
      .groupBy("sign_time")
      .count().withColumnRenamed("count", "sign_count")
    // Group registrations by time to get the number of registrations per day
    val dayregistDF1: DataFrame = registDS1.groupBy("regist_time").count().withColumnRenamed("count", "regist_count")
    // Join the two DataFrames on "the times differ by seven days"
    val joinCountDF1: DataFrame = dayregistDF1.join(weeksignDF, weeksignDF("sign_time") - dayregistDF1("regist_time") === 86400000 * 7)
    // Compute the next-week retention rate
    val weekKeepDF1: DataFrame = joinCountDF1.map(x => (x.getAs[Long]("regist_time"),
      x.getAs[Long]("regist_count"),
      x(3).toString.toLong,
      x(3).toString.toDouble / x(1).toString.toDouble
    )
    ).toDF("regist_time", "regist_count", "sign_count", "weekkeep")
    weekKeepDF1.show()
  }
}
/*
Next-day retention rate
+-------------+------------+----------+-----------------+
|  regist_time|regist_count|sign_count|       oneDaykeep|
+-------------+------------+----------+-----------------+
|1535990400000|         381|       355|0.931758530183727|
+-------------+------------+----------+-----------------+
Next-week retention rate
+-----------+------------+----------+--------+
|regist_time|regist_count|sign_count|weekkeep|
+-----------+------------+----------+--------+
+-----------+------------+----------+--------+
*/
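As a sketch of an alternative formulation (not the code used above), the "one day apart" condition can also be expressed with date types and datediff instead of epoch-millisecond arithmetic. It reuses registDS and signDS defined above, which still hold the raw event-time strings; the names regByDay, signByDay and nextDay are illustrative.

// Sketch: the next-day condition expressed with date functions rather than 86400000 ms.
import org.apache.spark.sql.functions.{col, count, datediff, to_date}

val regByDay = registDS
  .withColumn("regist_date", to_date(col("regist_time").substr(1, 10)))
  .select("registUID", "regist_date").distinct()

val signByDay = signDS
  .withColumn("sign_date", to_date(col("sign_time").substr(1, 10)))
  .select("signUID", "sign_date").distinct()

// New users who sign in exactly one day after registering, counted per registration day
val nextDay = regByDay
  .join(signByDay, regByDay("registUID") === signByDay("signUID") &&
    datediff(signByDay("sign_date"), regByDay("regist_date")) === 1)
  .groupBy("regist_date")
  .agg(count("signUID").as("sign_count"))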
3. Active User Analysis
- Statistical requirements
- Read the database and count the number of active users per day
- Counting rule: only users with a course-watching or course-buying action count as active
- Deduplicate on UID
println("統計每天的活躍使用者數")
//detailDF.show(3,false)
val splitTime: UserDefinedFunction = spark.udf.register("splitTime", (x: String) => {
x.substring(0, 10)
})
detailDF.filter("actionName='StartLearn' or actionName='BuyCourse'")
.select($"userUID",splitTime(detailDF("event_time")).as("active_time"),$"actionName")
.dropDuplicates("userUID","active_time")
.groupBy("active_time").count().show()
/*
+-----------+-----+
|active_time|count|
+-----------+-----+
| 2018-09-04|  275|
| 2018-09-05|  255|
+-----------+-----+
*/
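A slightly more compact formulation, shown only as a sketch, counts distinct UIDs per day directly instead of dropDuplicates followed by count; the results match as long as userUID is never null, since countDistinct ignores nulls.

// Sketch: equivalent daily-active-user count using countDistinct.
import org.apache.spark.sql.functions.{col, countDistinct, substring}

detailDF
  .filter("actionName='StartLearn' or actionName='BuyCourse'")
  .withColumn("active_time", substring(col("event_time"), 1, 10))
  .groupBy("active_time")
  .agg(countDistinct("userUID").as("count"))
  .show()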
IV. Project Extension
Processing and analyzing complex JSON-format logs
- Log details: if you need them, click the link to download. Link: project data. Extraction code: 599q
- Development code
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object jsonClear {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("jsonClear").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    val rdd1: RDD[String] = sc.textFile("in/project/op.log")
    // Split each line on '|': treat the first part as an id and the second as a JSON object,
    // then splice the id into the object and build a single-column DataFrame
    rdd1.map(x => x.split('|')).map(x => (x(0), x(1))).map(x => {
      val str: String = x._2.substring(1, x._2.length) // drop the leading '{' of the JSON part
      "{\"id\":\"" + x._1 + "\"," + str                // re-open the object with the id as the first field
    }).toDF("value")
      // Pull the top-level objects out of the JSON string
      .select(get_json_object($"value", "$.id").as("id"), get_json_object($"value", "$.ap").as("ap"),
        get_json_object($"value", "$.cm").as("cm"), get_json_object($"value", "$.et").as("et"))
      // Then pull each field out of the nested "cm" object
      .select($"id", $"ap", get_json_object($"cm", "$.ln").as("ln"), get_json_object($"cm", "$.sv").as("sv"), get_json_object($"cm", "$.os").as("os"),
        get_json_object($"cm", "$.g").as("g"), get_json_object($"cm", "$.mid").as("mid"), get_json_object($"cm", "$.nw").as("nw"),
        get_json_object($"cm", "$.l").as("l"), get_json_object($"cm", "$.vc").as("vc"), get_json_object($"cm", "$.hw").as("hw"),
        get_json_object($"cm", "$.ar").as("ar"), get_json_object($"cm", "$.uid").as("uid"), get_json_object($"cm", "$.t").as("t"),
        get_json_object($"cm", "$.la").as("la"), get_json_object($"cm", "$.md").as("md"), get_json_object($"cm", "$.vn").as("vn"),
        get_json_object($"cm", "$.ba").as("ba"), get_json_object($"cm", "$.sr").as("sr"),
        $"et").show(false)
  }
}
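As a hedged alternative to chaining get_json_object for every key, the nested cm object could be parsed once with from_json against an explicit schema. This is a sketch only: jsonDF stands for the single-column DataFrame built from rdd1 above (used inline there without a name), and the schema lists just a few of the cm fields.

// Sketch: parse "cm" once with from_json instead of one get_json_object call per key.
import org.apache.spark.sql.functions.{col, from_json, get_json_object}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed schema covering a few of the cm fields used above; extend as needed.
val cmSchema = StructType(Seq(
  StructField("ln", StringType),
  StructField("sv", StringType),
  StructField("os", StringType),
  StructField("mid", StringType),
  StructField("uid", StringType)
))

val parsed = jsonDF // the DataFrame built from rdd1 above (name illustrative)
  .withColumn("id", get_json_object(col("value"), "$.id"))
  .withColumn("cm", from_json(get_json_object(col("value"), "$.cm"), cmSchema))
  .select(col("id"), col("cm.ln"), col("cm.sv"), col("cm.os"), col("cm.mid"), col("cm.uid"))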
To be continued...