A few notes on Spark performance tuning

Posted by yunqiublog on 2019-07-25

I. Optimization area: serialization

1. Where this is covered in the official documentation

2. Explanation:


By default Spark uses Java serialization, which works for any class but is slow. The second option, Kryo, is much faster, but it does not necessarily cover every type out of the box.
With Kryo, some custom classes also need to be registered in your code.
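To make the choice concrete, here is a minimal sketch of how the serializer is selected on a SparkConf (the application names are illustrative; spark.serializer and KryoSerializer are the standard Spark settings):

  import org.apache.spark.SparkConf

  // Default: nothing to configure, Spark falls back to Java serialization.
  val javaConf = new SparkConf().setAppName("java-serialization-demo")

  // Opt in to Kryo explicitly; custom classes can additionally be registered.
  val kryoConf = new SparkConf()
    .setAppName("kryo-serialization-demo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")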

3. Code that persists the RDD with StorageLevel.MEMORY_ONLY_SER (plain MEMORY_ONLY shown commented out)


import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object SerializationApp {  // enclosing object name is illustrative

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val names = Array[String]("G304", "G305", "G306")
    val genders = Array[String]("male", "female")
    val addresses = Array[String]("beijing", "shenzhen", "wenzhou", "hangzhou")
    // build one million small records in the driver
    val infos = new ArrayBuffer[Info]()
    for (i <- 1 to 1000000) {
      val name = names(Random.nextInt(3))
      val gender = genders(Random.nextInt(2))
      val address = addresses(Random.nextInt(4))
      infos += Info(name, gender, address)
    }
    val rdd = sc.parallelize(infos)
    // cache as serialized bytes; with the default settings these bytes
    // come from Java serialization
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()
    // rdd.persist(StorageLevel.MEMORY_ONLY)
    sc.stop()
  }

  case class Info(name: String, gender: String, address: String)
}

4. Result: 34.3 (in-memory size of the cached RDD) with the default Java serialization

II. Kryo serialization

1. Configuration: set spark.serializer in spark-defaults.conf
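For reference, the entry would look roughly like this in conf/spark-defaults.conf (the path is relative to the Spark installation; spark.serializer is the standard property name):

  spark.serializer    org.apache.spark.serializer.KryoSerializer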

2. Code


import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object KryoSerializationApp {  // enclosing object name is illustrative

  def main(args: Array[String]): Unit = {
    // spark.serializer is assumed to be set to KryoSerializer in spark-defaults.conf
    val sparkConf = new SparkConf()
    // register the custom class so Kryo does not store its full class name
    sparkConf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(sparkConf)
    val names = Array[String]("G304", "G305", "G306")
    val genders = Array[String]("male", "female")
    val addresses = Array[String]("beijing", "shenzhen", "wenzhou", "hangzhou")
    val infos = new ArrayBuffer[Info]()
    for (i <- 1 to 1000000) {
      val name = names(Random.nextInt(3))
      val gender = genders(Random.nextInt(2))
      val address = addresses(Random.nextInt(4))
      infos += Info(name, gender, address)
    }
    val rdd = sc.parallelize(infos)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()
    sc.stop()
  }

  case class Info(name: String, gender: String, address: String)
}

3. Screenshot of the result when the class is not registered

4. What the official documentation says about unregistered classes

5. Add the line below; otherwise Kryo has to store the full class name with every object and the cached data grows larger.


sparkConf.registerKryoClasses(Array(classOf[Info]))
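To catch classes that slip through without registration, Kryo can also be made strict. The sketch below uses the documented spark.kryo.registrationRequired property and reuses the Info case class from the example above; the rest is illustrative:

  import org.apache.spark.SparkConf

  val strictConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // optional: fail with an exception instead of silently writing full
    // class names when an unregistered class is serialized
    .set("spark.kryo.registrationRequired", "true")

  // several classes (including array types) can be registered in one call
  strictConf.registerKryoClasses(Array(classOf[Info], classOf[Array[Info]]))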

