Configuring and Enabling LZO Compression in Spark

Posted by loveheping on 2018-03-17
References:
   To install and configure LZO on CentOS 7, see:
   http://blog.itpub.net/31511218/viewspace-2151945/
   To configure LZO for Hadoop, see:
   http://blog.itpub.net/31511218/viewspace-2151946/

To configure and enable LZO compression in Spark, follow these steps:

I. spark-env.sh configuration


    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*
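
These exports only help if the native directory actually contains the LZO bindings. The following is a minimal sketch of a pre-flight check, assuming the hadoop-lzo native library was copied into lib/native under its usual name (libgplcompression.*). The helper name has_native_lzo is hypothetical and not part of Spark or Hadoop.

```shell
# Hypothetical helper: succeed if the given directory holds the hadoop-lzo
# native bindings. The file name libgplcompression.* is an assumption about
# how the library was built and installed.
has_native_lzo() {
  dir="$1"
  for f in "$dir"/libgplcompression.*; do
    # When the glob matches nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}
```

For example, `has_native_lzo /app/hadoop-2.6.0-cdh5.7.0/lib/native || echo "LZO native library missing"` could be run before starting spark-shell.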

II. spark-defaults.conf configuration


    spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
    spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
Note: both entries point to the hadoop-lzo JAR produced by the build.
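
A wrong path in either entry only shows up later as a class-loading error, so it can be worth checking the file up front. The sketch below assumes the whitespace-separated "key value" form shown above and a single JAR per entry. The helper names conf_value and lzo_jar_configured are hypothetical.

```shell
# Hypothetical helper: print the value of a property from a
# spark-defaults.conf-style file (whitespace-separated "key value" lines).
conf_value() {
  # $1 = conf file, $2 = property name
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Succeed only if both the driver and executor class paths name an existing file.
# Assumes each entry holds exactly one JAR path, as in the configuration above.
lzo_jar_configured() {
  conf="$1"
  for key in spark.driver.extraClassPath spark.executor.extraClassPath; do
    jar=$(conf_value "$conf" "$key")
    [ -n "$jar" ] && [ -f "$jar" ] || return 1
  done
  return 0
}
```

It would be called with the path to the spark-defaults.conf file in Spark's conf directory as its only argument.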

III. Testing
1. Read an LZO file


    spark-shell --master local[2]
    scala> import com.hadoop.compression.lzo.LzopCodec
    scala> val page_views = sc.textFile("/user/hive/warehouse/page_views_lzo/page_views.dat.lzo")
    scala> page_views.count()  // textFile is lazy; an action is needed to actually read and decompress the file
2. Write an LZO file


    spark-shell --master local[2]
    scala> import com.hadoop.compression.lzo.LzopCodec
    scala> val lzoTest = sc.parallelize(1 to 10)
    scala> lzoTest.saveAsTextFile("/input/test_lzo", classOf[LzopCodec])
Result:
    [hadoop@spark220 common]$ hdfs dfs -ls /input/test_lzo
    Found 3 items
    -rw-r--r--   1 hadoop supergroup          0 2018-03-16 23:24 /input/test_lzo/_SUCCESS
    -rw-r--r--   1 hadoop supergroup         60 2018-03-16 23:24 /input/test_lzo/part-00000.lzo
    -rw-r--r--   1 hadoop supergroup         61 2018-03-16 23:24 /input/test_lzo/part-00001.lzo
This completes the configuration and testing.
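
The write test can also be checked from a script rather than by eye. The sketch below reads `hdfs dfs -ls` output on standard input and looks for the _SUCCESS marker and the .lzo part files seen in the listing from the write test. The helper names count_lzo_parts and has_success_marker are hypothetical.

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print how many
# part files end in .lzo (the path is the last field of each listing line).
count_lzo_parts() {
  awk '$NF ~ /\/part-[0-9]+\.lzo$/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed if the listing on stdin contains a _SUCCESS marker.
has_success_marker() {
  grep -q '/_SUCCESS$'
}
```

For example, `hdfs dfs -ls /input/test_lzo | count_lzo_parts` should print 2 for the output directory written in the test.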

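IV. Problems encountered during configuration and testing
1. The native library cannot be loaded because LD_LIBRARY_PATH is not set
   1.1 Error message: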
    Caused by: java.lang.RuntimeException: native-lzo library not available
      at com.hadoop.compression.lzo.LzopCodec.getDecompressorType(LzopCodec.java:120)
      at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
      at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:111)
      at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
      at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
      at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:108)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
   1.2 Solution: edit spark-env.sh in Spark's conf directory and add the following:
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*
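
To confirm the fix took effect, one option is to check that the native directory really is on LD_LIBRARY_PATH in the environment that launches Spark. This is a minimal sketch; the helper name path_list_contains is hypothetical.

```shell
# Hypothetical helper: succeed if a colon-separated path list contains the
# given directory as an exact entry.
path_list_contains() {
  # $1 = colon-separated list (e.g. "$LD_LIBRARY_PATH"), $2 = directory
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, after sourcing spark-env.sh: `path_list_contains "$LD_LIBRARY_PATH" /app/hadoop-2.6.0-cdh5.7.0/lib/native || echo "native dir not on LD_LIBRARY_PATH"`.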
2. The LzopCodec class cannot be found
   2.1 Error message:


    Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzopCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzopCodec not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
   2.2 Solution: edit spark-defaults.conf in Spark's conf directory and add the following:


    spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
    spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
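
The two failures in this section have distinct messages, so a log can be mapped to the matching fix mechanically. The following is a rough sketch that reads log text on standard input and matches the exact error strings quoted in this section. The helper name diagnose_lzo_error and its output wording are hypothetical.

```shell
# Hypothetical helper: read log text on stdin and point at the likely fix,
# based on the two error messages quoted in this section.
diagnose_lzo_error() {
  log=$(cat)
  case "$log" in
    *"native-lzo library not available"*)
      echo "native library not loaded: check LD_LIBRARY_PATH in spark-env.sh" ;;
    *"com.hadoop.compression.lzo.LzopCodec not found"*)
      echo "codec class missing: check extraClassPath in spark-defaults.conf" ;;
    *)
      echo "no known LZO error found" ;;
  esac
}
```

The function only recognizes these two messages; any other failure falls through to the last branch.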


From @若澤大資料


From the ITPUB blog, link: http://blog.itpub.net/31511218/viewspace-2151948/. Please credit the source when reposting; otherwise legal liability will be pursued.
