Writing a Spark Program in Scala to Compute the Top-N Dwell Times of Mobile Users at Base Stations
1. Requirement: compute a dwell-time Top-N from mobile base station logs
Our phones can provide mobile service because base stations are deployed all over the country. As soon as a phone is powered on, it tries to establish a connection with a nearby base station, and every connect and disconnect event is recorded in the logs on the carrier's base station servers.
Although we do not know a user's exact position, the location of the base station lets us roughly infer the geographic area the user is in, and merchants can then use this location information for targeted recommendations and advertising.
To keep things simple, we simulated some mobile-user log data from base stations. Each record has 4 fields: phone number, timestamp, base station id, and connection type (1 = connect, 0 = disconnect).
User log for base station A (file 19735E1C66.log):
18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1
18611132889,20160327082500,16030401EAFB68F1E3CDF819735E1C66,1
18688888888,20160327170000,16030401EAFB68F1E3CDF819735E1C66,0
18611132889,20160327180000,16030401EAFB68F1E3CDF819735E1C66,0
User log for base station B (file DDE7970F68.log):
18611132889,20160327075000,9F36407EAD0629FC166F14DDE7970F68,1
18688888888,20160327075100,9F36407EAD0629FC166F14DDE7970F68,1
18611132889,20160327081000,9F36407EAD0629FC166F14DDE7970F68,0
18688888888,20160327081300,9F36407EAD0629FC166F14DDE7970F68,0
18688888888,20160327175000,9F36407EAD0629FC166F14DDE7970F68,1
18611132889,20160327182000,9F36407EAD0629FC166F14DDE7970F68,1
18688888888,20160327220000,9F36407EAD0629FC166F14DDE7970F68,0
18611132889,20160327230000,9F36407EAD0629FC166F14DDE7970F68,0
User log for base station C (file E549D940E0.log):
18611132889,20160327081100,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327081200,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327081900,CC0710CC94ECC657A8561DE549D940E0,0
18611132889,20160327082000,CC0710CC94ECC657A8561DE549D940E0,0
18688888888,20160327171000,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327171600,CC0710CC94ECC657A8561DE549D940E0,0
18611132889,20160327180500,CC0710CC94ECC657A8561DE549D940E0,1
18611132889,20160327181500,CC0710CC94ECC657A8561DE549D940E0,0
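As a quick sanity check on this record layout, here is a minimal standalone sketch (plain Scala, no Spark; the case class and field names are my own for illustration, not from the program below) that parses one log line:

//Hypothetical record type for illustration; the real program works with raw tuples instead
case class AccessLog(mobile: String, time: Long, lac: String, eventType: String)

def parseLine(line: String): AccessLog = {
  val f = line.split(",") //4 comma-separated fields
  //phone number, yyyyMMddHHmmss timestamp, base station id, 1=connect/0=disconnect
  AccessLog(f(0), f(1).toLong, f(2), f(3))
}

//parseLine("18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1")
//=> AccessLog(18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1)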
Below is the base station table (file loc_info.txt), with 4 fields: base station id, longitude, latitude, and signal type (signal level, e.g. 2G, 3G, or 4G):
9F36407EAD0629FC166F14DDE7970F68,116.304864,40.050645,6
CC0710CC94ECC657A8561DE549D940E0,116.303955,40.041935,6
16030401EAFB68F1E3CDF819735E1C66,116.296302,40.032296,6
Based on this data, the task is to find, for each phone number, the 2 base station locations (longitude/latitude) where it stayed the longest.
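To make "dwell time" concrete with the sample data: phone 18688888888 connects to station 16030401EAFB68F1E3CDF819735E1C66 (station A) at 20160327082400 and disconnects at 20160327170000, and the program below scores this stay as 20160327170000 − 20160327082400 = 87600. Note that this is a difference of raw yyyyMMddHHmmss values, not a duration in seconds (the real interval is 8 h 36 min); it serves only as a rough ranking proxy for stays within the same day, which is all this Top-N computation needs.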
Approach (a minimal sketch of steps (2)–(4) follows the list):
(1) Read the log data and split each line into fields;
(2) Reshape the fields into tuples: key = (phone number, base station id), value = time;
(3) Aggregate by key, accumulating the times;
(4) Re-key the data as tuples with the base station id as key and (phone number, time) as value, in preparation for the join with the base station table;
(5) Read the base station data and split the fields;
(6) Reshape the fields into tuples: key = base station id, value = (longitude, latitude);
(7) Join the two tuple datasets;
(8) Group the joined result by phone number;
(9) Convert each group to a List, sort it by time, reverse it, and take the top 2;
(10) Write the result to HDFS.
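Here is that sketch of steps (2)–(4), on plain Scala collections so it runs without a cluster (the full Spark version follows in section 4). The key idea: a connect record contributes its timestamp negatively and a disconnect record positively, so summing per (phone, station) key yields disconnect − connect:

val lines = Seq(
  "18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1",
  "18688888888,20160327170000,16030401EAFB68F1E3CDF819735E1C66,0")

//Steps (2)-(3): key by (phone, station), sign-flip connect times, sum per key
val dwell: Map[(String, String), Long] = lines
  .map(_.split(","))
  .map(f => ((f(0), f(2)), if (f(3) == "1") -f(1).toLong else f(1).toLong))
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }
//=> Map((18688888888,16030401EAFB68F1E3CDF819735E1C66) -> 87600)

//Step (4): re-key by station id for the later join with the station table
val byStation = dwell.map { case ((mobile, lac), t) => (lac, (mobile, t)) }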
2. Prepare the test data
1) The mobile-user log data (the 3 .log files above);
2) The base station data (the loc_info.txt file above).
3. The pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.xuebusi</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>

        <!-- Centralized version management for the dependencies -->
        <scala.version>2.10.6</scala.version>
        <spark.version>1.6.2</spark.version>
        <hadoop.version>2.6.4</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <!-- Scala language core library -->
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <!-- Spark core -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <!-- Hadoop client, used to access HDFS -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>

            <plugins>
                <!-- scala-maven-plugin: Maven plugin that compiles Scala code -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- maven-compiler-plugin: Maven plugin that compiles Java code -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <!-- Configuration for the Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Configuration for the Java compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- maven-shade-plugin: Maven plugin for building the fat jar;
                 the filter strips signature files so the merged jar does not
                 fail signature verification -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
4. Writing the Scala program
Create an object named MobileLocation under src/main/scala/:
Complete code of MobileLocation.scala:
package com.xuebusi.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Computes, for each phone number, the 2 base stations with the longest dwell time.
  * Created by SYJ on 2017/1/24.
  */
object MobileLocation {
  def main(args: Array[String]) {
    /**
      * Create the SparkConf.
      *
      * A few notes:
      * To make debugging in IDEA easy, local mode is used here,
      * i.e. the Spark program runs on the local machine. This has
      * one caveat: reading data from HDFS on Linux while running
      * on Windows may throw an exception, because reading the data
      * relies on some native Windows libraries.
      *
      * The same problem occurs when running MapReduce programs from
      * Eclipse on Windows; it does not occur on Linux or macOS.
      *
      * Hadoop's compression and decompression use native Windows
      * libraries written in C/C++, and such library files are not
      * cross-platform, so debugging MapReduce programs on Windows
      * requires installing the native libraries first.
      *
      * The suggestion is to run a Linux VM (with a desktop) on
      * Windows; debugging there works without these issues.
      */
    //Run locally
    val conf: SparkConf = new SparkConf().setAppName("MobileLocation").setMaster("local")

    //Create the SparkConf for running on a cluster (the default)
    //val conf: SparkConf = new SparkConf().setAppName("MobileLocation")

    //Create the SparkContext
    val sc: SparkContext = new SparkContext(conf)

    //Read the data from the file system
    val lines: RDD[String] = sc.textFile(args(0))

    /**
      * Split the data.
      * The commented-out version uses two map calls; this is not
      * recommended, since it can be done in a single map.
      */
    //lines.map(_.split(",")).map(arr => (arr(0), arr(1).toLong, arr(2), arr(3)))

    //Split each line and assemble the tuple in a single map call
    val splited = lines.map(line => {
      val fields: Array[String] = line.split(",")
      val mobile: String = fields(0)
      //val time: Long = fields(1).toLong
      val lac: String = fields(2)
      val tp: String = fields(3)
      //Connect events (type 1) contribute a negative timestamp and disconnect
      //events a positive one, so summing per key yields the dwell time
      val time: Long = if (tp == "1") -fields(1).toLong else fields(1).toLong
      //Assemble the fields into a tuple and return it
      //((phone number, base station id), time)
      ((mobile, lac), time)
    })

    //Aggregate by key
    val reduced: RDD[((String, String), Long)] = splited.reduceByKey(_ + _)

    //Reshape into tuples, in preparation for the join with the station table
    val lacAndMobileTime = reduced.map(x => {
      //(base station id, (phone number, time))
      (x._1._2, (x._1._1, x._2))
    })

    //Read the base station data
    val lacInfo: RDD[String] = sc.textFile(args(1))

    //Split the data and prepare the join
    val splitedLacInfo = lacInfo.map(line => {
      val fields: Array[String] = line.split(",")
      val id: String = fields(0) //base station id
      val x: String = fields(1)  //station longitude
      val y: String = fields(2)  //station latitude
      /**
        * Return the data.
        * Only key-value data can be joined, so we return a tuple
        * keyed by base station id, with the station's coordinates
        * as the value:
        * (base station id, (longitude, latitude))
        */
      (id, (x, y))
    })

    //join
    //Returns: RDD[(base station id, ((phone number, time), (longitude, latitude)))]
    val joined: RDD[(String, ((String, Long), (String, String)))] = lacAndMobileTime.join(splitedLacInfo)
    //ArrayBuffer((CC0710CC94ECC657A8561DE549D940E0,((18688888888,1300),(116.303955,40.041935))), (CC0710CC94ECC657A8561DE549D940E0,((18611132889,1900),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))
    //System.out.println(joined.collect().toBuffer)

    //Group by phone number
    val groupedByMobile = joined.groupBy(_._2._1._1)
    //ArrayBuffer((18688888888,CompactBuffer((CC0710CC94ECC657A8561DE549D940E0,((18688888888,1300),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))), (18611132889,CompactBuffer((CC0710CC94ECC657A8561DE549D940E0,((18611132889,1900),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))))))
    //System.out.println(groupedByMobile.collect().toBuffer)

    /**
      * Convert each group to a List, sort by time, reverse, and take the top 2
      */
    val result = groupedByMobile.mapValues(_.toList.sortBy(_._2._1._2).reverse.take(2))
    //ArrayBuffer((18688888888,List((16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))),
    //(18611132889,List((16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))))))
    //System.out.println(result.collect().toBuffer)

    //Write the result to the file system
    result.saveAsTextFile(args(2))

    //Release resources
    sc.stop()
  }
}
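A small aside on the Top-2 step: sorting ascending and then reversing works, but the same result can be obtained without the extra reversal by sorting on the negated time. A drop-in alternative for that line would be:

//Equivalent to .toList.sortBy(_._2._1._2).reverse.take(2), without the reversal
val result = groupedByMobile.mapValues(_.toList.sortBy(t => -t._2._1._2).take(2))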
5. Local testing
Edit the run configuration:
Click the "+" button to add a new configuration:
Select "Application":
Select the class containing the main method:
Name the configuration, and fill in 3 arguments in the Program arguments box: the two input directories and one output directory:
Run the program locally:
It throws an out-of-memory exception:
17/01/24 17:17:58 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 4.718592E8. Please use a larger heap size.
    at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:198)
    at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:180)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:354)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
    at com.xuebusi.spark.MobileLocation$.main(MobileLocation.scala:37)
    at com.xuebusi.spark.MobileLocation.main(MobileLocation.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
This can be fixed by adding a single line of code to the program:
conf.set("spark.testing.memory", "536870912") //any value of 512 MB or more works
That approach is not very flexible, though, so here we pass the setting to the JVM instead. Edit the run configuration and add "-Xmx512m" or "-Dspark.testing.memory=536870912" to the VM options box, giving the program 512 MB of memory:
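For completeness, if you do use the in-code variant, a sketch of where it sits in main; the property must be set before the SparkContext is created (536870912 bytes = 512 MB, which satisfies the roughly 4.7E8-byte minimum from the error message above):

val conf: SparkConf = new SparkConf().setAppName("MobileLocation").setMaster("local")
conf.set("spark.testing.memory", "536870912") //must happen before new SparkContext(conf)
val sc: SparkContext = new SparkContext(conf)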
Program output log and result:
6. Submitting to the Spark cluster
Running locally and submitting to a cluster differ by one line of code, which needs to be changed, as shown below:
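The change is just the SparkConf line: drop setMaster("local") so that the master is supplied by spark-submit's --master flag instead (this variant is already present, commented out, in the code above):

//Local debugging:
//val conf: SparkConf = new SparkConf().setAppName("MobileLocation").setMaster("local")
//Cluster submission, master chosen by spark-submit:
val conf: SparkConf = new SparkConf().setAppName("MobileLocation")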
Package the finished program into a jar with the Maven plugin (e.g. by running mvn clean package):
Upload the jar to a server in the Spark cluster:
Upload the test data to the HDFS cluster as well (e.g. with hdfs dfs -put):
Run the command:
/root/apps/spark/bin/spark-submit \
--master spark://hadoop01:7077,hadoop02:7077 \
--executor-memory 512m \
--total-executor-cores 7 \
--class com.xuebusi.spark.MobileLocation \
/root/spark-1.0-SNAPSHOT.jar \
hdfs://hadoop01:9000/mobile/input/mobile_logs \
hdfs://hadoop01:9000/mobile/input/loc_logs \
hdfs://hadoop01:9000/mobile/output
7. Output log from the program run
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/01/24 11:27:48 INFO SparkContext: Running Spark version 1.6.2
17/01/24 11:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/24 11:27:52 INFO SecurityManager: Changing view acls to: root
17/01/24 11:27:52 INFO SecurityManager: Changing modify acls to: root
17/01/24 11:27:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/01/24 11:27:54 INFO Utils: Successfully started service 'sparkDriver' on port 41762.
17/01/24 11:27:55 INFO Slf4jLogger: Slf4jLogger started
17/01/24 11:27:55 INFO Remoting: Starting remoting
17/01/24 11:27:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.71.11:49399]
17/01/24 11:27:55 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 49399.
17/01/24 11:27:55 INFO SparkEnv: Registering MapOutputTracker
17/01/24 11:27:55 INFO SparkEnv: Registering BlockManagerMaster
17/01/24 11:27:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-6c4988a1-3ad2-49ad-8cc2-02390384792b
17/01/24 11:27:55 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
17/01/24 11:27:56 INFO SparkEnv: Registering OutputCommitCoordinator
17/01/24 11:28:02 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/01/24 11:28:02 INFO SparkUI: Started SparkUI at http://192.168.71.11:4040
17/01/24 11:28:02 INFO HttpFileServer: HTTP File server directory is /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/httpd-34193396-8d9c-4ea7-9fca-5caf8c712f86
17/01/24 11:28:02 INFO HttpServer: Starting HTTP Server
17/01/24 11:28:02 INFO Utils: Successfully started service 'HTTP file server' on port 58135.
17/01/24 11:28:06 INFO SparkContext: Added JAR file:/root/spark-1.0-SNAPSHOT.jar at http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar with timestamp 1485286086076
17/01/24 11:28:06 INFO Executor: Starting executor ID driver on host localhost
17/01/24 11:28:06 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55934.
17/01/24 11:28:06 INFO NettyBlockTransferService: Server created on 55934
17/01/24 11:28:06 INFO BlockManagerMaster: Trying to register BlockManager
17/01/24 11:28:06 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55934 with 517.4 MB RAM, BlockManagerId(driver, localhost, 55934)
17/01/24 11:28:06 INFO BlockManagerMaster: Registered BlockManager
17/01/24 11:28:09 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
17/01/24 11:28:09 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)
17/01/24 11:28:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:55934 (size: 13.9 KB, free: 517.4 MB)
17/01/24 11:28:09 INFO SparkContext: Created broadcast 0 from textFile at MobileLocation.scala:40
17/01/24 11:28:12 INFO FileInputFormat: Total input paths to process : 3
17/01/24 11:28:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 86.4 KB, free 253.9 KB)
17/01/24 11:28:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 19.3 KB, free 273.2 KB)
17/01/24 11:28:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:55934 (size: 19.3 KB, free: 517.4 MB)
17/01/24 11:28:13 INFO SparkContext: Created broadcast 1 from textFile at MobileLocation.scala:73
17/01/24 11:28:13 INFO FileInputFormat: Total input paths to process : 1
17/01/24 11:28:13 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/01/24 11:28:13 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/01/24 11:28:13 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/01/24 11:28:13 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/01/24 11:28:13 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/01/24 11:28:14 INFO SparkContext: Starting job: saveAsTextFile at MobileLocation.scala:110
17/01/24 11:28:14 INFO DAGScheduler: Registering RDD 2 (map at MobileLocation.scala:50)
17/01/24 11:28:14 INFO DAGScheduler: Registering RDD 7 (map at MobileLocation.scala:76)
17/01/24 11:28:14 INFO DAGScheduler: Registering RDD 4 (map at MobileLocation.scala:67)
17/01/24 11:28:14 INFO DAGScheduler: Registering RDD 11 (groupBy at MobileLocation.scala:98)
17/01/24 11:28:14 INFO DAGScheduler: Got job 0 (saveAsTextFile at MobileLocation.scala:110) with 3 output partitions
17/01/24 11:28:14 INFO DAGScheduler: Final stage: ResultStage 4 (saveAsTextFile at MobileLocation.scala:110)
17/01/24 11:28:14 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
17/01/24 11:28:14 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)
17/01/24 11:28:14 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[7] at map at MobileLocation.scala:76), which has no missing parents
17/01/24 11:28:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 277.0 KB)
17/01/24 11:28:14 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.2 KB, free 279.2 KB)
17/01/24 11:28:14 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:55934 (size: 2.2 KB, free: 517.4 MB)
17/01/24 11:28:14 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/01/24 11:28:14 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[7] at map at MobileLocation.scala:76)
17/01/24 11:28:14 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/01/24 11:28:14 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at map at MobileLocation.scala:50), which has no missing parents
17/01/24 11:28:14 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 283.1 KB)
17/01/24 11:28:14 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 285.3 KB)
17/01/24 11:28:14 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, partition 0,ANY, 2210 bytes)
17/01/24 11:28:14 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:55934 (size: 2.2 KB, free: 517.4 MB)
17/01/24 11:28:14 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/01/24 11:28:14 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at map at MobileLocation.scala:50)
17/01/24 11:28:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
17/01/24 11:28:14 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
17/01/24 11:28:14 INFO Executor: Fetching http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar with timestamp 1485286086076
17/01/24 11:28:15 INFO Utils: Fetching http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar to /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/userFiles-1790c4be-0a0d-45fd-91f9-c66c49e765a5/fetchFileTemp1508611941727652704.tmp
17/01/24 11:28:19 INFO Executor: Adding file:/tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/userFiles-1790c4be-0a0d-45fd-91f9-c66c49e765a5/spark-1.0-SNAPSHOT.jar to class loader
17/01/24 11:28:19 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/loc_logs/loc_info.txt:0+171
17/01/24 11:28:20 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 2255 bytes result sent to driver
17/01/24 11:28:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, localhost, partition 0,ANY, 2215 bytes)
17/01/24 11:28:20 INFO Executor: Running task 0.0 in stage 0.0 (TID 1)
17/01/24 11:28:20 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/19735E1C66.log:0+248
17/01/24 11:28:20 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 5593 ms on localhost (1/1)
17/01/24 11:28:20 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/01/24 11:28:20 INFO DAGScheduler: ShuffleMapStage 1 (map at MobileLocation.scala:76) finished in 5.716 s
17/01/24 11:28:20 INFO DAGScheduler: looking for newly runnable stages
17/01/24 11:28:20 INFO DAGScheduler: running: Set(ShuffleMapStage 0)
17/01/24 11:28:20 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ShuffleMapStage 3, ResultStage 4)
17/01/24 11:28:20 INFO DAGScheduler: failed: Set()
17/01/24 11:28:20 INFO Executor: Finished task 0.0 in stage 0.0 (TID 1). 2255 bytes result sent to driver
17/01/24 11:28:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 2, localhost, partition 1,ANY, 2215 bytes)
17/01/24 11:28:20 INFO Executor: Running task 1.0 in stage 0.0 (TID 2)
17/01/24 11:28:20 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 1) in 401 ms on localhost (1/3)
17/01/24 11:28:20 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/DDE7970F68.log:0+496
17/01/24 11:28:20 INFO Executor: Finished task 1.0 in stage 0.0 (TID 2). 2255 bytes result sent to driver
17/01/24 11:28:20 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 3, localhost, partition 2,ANY, 2215 bytes)
17/01/24 11:28:20 INFO Executor: Running task 2.0 in stage 0.0 (TID 3)
17/01/24 11:28:20 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 2) in 191 ms on localhost (2/3)
17/01/24 11:28:20 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/E549D940E0.log:0+496
17/01/24 11:28:20 INFO Executor: Finished task 2.0 in stage 0.0 (TID 3). 2255 bytes result sent to driver
17/01/24 11:28:20 INFO DAGScheduler: ShuffleMapStage 0 (map at MobileLocation.scala:50) finished in 6.045 s
17/01/24 11:28:20 INFO DAGScheduler: looking for newly runnable stages
17/01/24 11:28:20 INFO DAGScheduler: running: Set()
17/01/24 11:28:20 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ShuffleMapStage 3, ResultStage 4)
17/01/24 11:28:20 INFO DAGScheduler: failed: Set()
17/01/24 11:28:20 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 3) in 176 ms on localhost (3/3)
17/01/24 11:28:20 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/01/24 11:28:20 INFO DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[4] at map at MobileLocation.scala:67), which has no missing parents
17/01/24 11:28:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 288.3 KB)
17/01/24 11:28:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1800.0 B, free 290.0 KB)
17/01/24 11:28:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:55934 (size: 1800.0 B, free: 517.4 MB)
17/01/24 11:28:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
17/01/24 11:28:21 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[4] at map at MobileLocation.scala:67)
17/01/24 11:28:21 INFO TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
17/01/24 11:28:21 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,NODE_LOCAL, 1947 bytes)
17/01/24 11:28:21 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 17 ms
17/01/24 11:28:21 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1376 bytes result sent to driver
17/01/24 11:28:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,NODE_LOCAL, 1947 bytes)
17/01/24 11:28:21 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 352 ms on localhost (1/3)
17/01/24 11:28:21 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
17/01/24 11:28:21 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1376 bytes result sent to driver
17/01/24 11:28:21 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 6, localhost, partition 2,NODE_LOCAL, 1947 bytes)
17/01/24 11:28:21 INFO Executor: Running task 2.0 in stage 2.0 (TID 6)
17/01/24 11:28:21 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 190 ms on localhost (2/3)
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks
17/01/24 11:28:21 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:21 INFO Executor: Finished task 2.0 in stage 2.0 (TID 6). 1376 bytes result sent to driver
17/01/24 11:28:21 INFO DAGScheduler: ShuffleMapStage 2 (map at MobileLocation.scala:67) finished in 0.673 s
17/01/24 11:28:21 INFO DAGScheduler: looking for newly runnable stages
17/01/24 11:28:21 INFO DAGScheduler: running: Set()
17/01/24 11:28:21 INFO DAGScheduler: waiting: Set(ShuffleMapStage 3, ResultStage 4)
17/01/24 11:28:21 INFO DAGScheduler: failed: Set()
17/01/24 11:28:21 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 6) in 162 ms on localhost (3/3)
17/01/24 11:28:21 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/01/24 11:28:21 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[11] at groupBy at MobileLocation.scala:98), which has no missing parents
17/01/24 11:28:21 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 4.1 KB, free 294.2 KB)
17/01/24 11:28:22 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 296.2 KB)
17/01/24 11:28:22 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:55934 (size: 2.1 KB, free: 517.4 MB)
17/01/24 11:28:22 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/01/24 11:28:22 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[11] at groupBy at MobileLocation.scala:98)
17/01/24 11:28:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
17/01/24 11:28:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 7, localhost, partition 0,PROCESS_LOCAL, 2020 bytes)
17/01/24 11:28:22 INFO Executor: Running task 0.0 in stage 3.0 (TID 7)
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/01/24 11:28:22 INFO Executor: Finished task 0.0 in stage 3.0 (TID 7). 1376 bytes result sent to driver
17/01/24 11:28:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 8, localhost, partition 1,PROCESS_LOCAL, 2020 bytes)
17/01/24 11:28:22 INFO Executor: Running task 1.0 in stage 3.0 (TID 8)
17/01/24 11:28:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 7) in 290 ms on localhost (1/3)
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:22 INFO Executor: Finished task 1.0 in stage 3.0 (TID 8). 1376 bytes result sent to driver
17/01/24 11:28:22 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 9, localhost, partition 2,PROCESS_LOCAL, 2020 bytes)
17/01/24 11:28:22 INFO Executor: Running task 2.0 in stage 3.0 (TID 9)
17/01/24 11:28:22 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 8) in 191 ms on localhost (2/3)
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/01/24 11:28:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
17/01/24 11:28:22 INFO Executor: Finished task 2.0 in stage 3.0 (TID 9). 1376 bytes result sent to driver
17/01/24 11:28:22 INFO DAGScheduler: ShuffleMapStage 3 (groupBy at MobileLocation.scala:98) finished in 0.606 s
17/01/24 11:28:22 INFO DAGScheduler: looking for newly runnable stages
17/01/24 11:28:22 INFO DAGScheduler: running: Set()
17/01/24 11:28:22 INFO DAGScheduler: waiting: Set(ResultStage 4)
17/01/24 11:28:22 INFO DAGScheduler: failed: Set()
17/01/24 11:28:22 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 9) in 172 ms on localhost (3/3)
17/01/24 11:28:22 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/01/24 11:28:22 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[14] at saveAsTextFile at MobileLocation.scala:110), which has no missing parents
17/01/24 11:28:23 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 66.4 KB, free 362.6 KB)
17/01/24 11:28:23 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 23.0 KB, free 385.6 KB)
17/01/24 11:28:24 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:55934 (size: 23.0 KB, free: 517.3 MB)
17/01/24 11:28:24 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/01/24 11:28:24 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 4 (MapPartitionsRDD[14] at saveAsTextFile at MobileLocation.scala:110)
17/01/24 11:28:24 INFO TaskSchedulerImpl: Adding task set 4.0 with 3 tasks
17/01/24 11:28:24 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 10, localhost, partition 0,NODE_LOCAL, 1958 bytes)
17/01/24 11:28:24 INFO Executor: Running task 0.0 in stage 4.0 (TID 10)
17/01/24 11:28:24 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:24 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
17/01/24 11:28:26 INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000000_10' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000000
17/01/24 11:28:26 INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000000_10: Committed
17/01/24 11:28:26 INFO Executor: Finished task 0.0 in stage 4.0 (TID 10). 2080 bytes result sent to driver
17/01/24 11:28:26 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 11, localhost, partition 1,NODE_LOCAL, 1958 bytes)
17/01/24 11:28:26 INFO Executor: Running task 1.0 in stage 4.0 (TID 11)
17/01/24 11:28:26 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 10) in 1823 ms on localhost (1/3)
17/01/24 11:28:26 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:26 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:27 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:55934 in memory (size: 2.2 KB, free: 517.3 MB)
17/01/24 11:28:27 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:55934 in memory (size: 2.1 KB, free: 517.3 MB)
17/01/24 11:28:27 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:55934 in memory (size: 1800.0 B, free: 517.3 MB)
17/01/24 11:28:27 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:55934 in memory (size: 2.2 KB, free: 517.4 MB)
17/01/24 11:28:27 INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000001_11' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000001
17/01/24 11:28:27 INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000001_11: Committed
17/01/24 11:28:27 INFO Executor: Finished task 1.0 in stage 4.0 (TID 11). 2080 bytes result sent to driver
17/01/24 11:28:27 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 12, localhost, partition 2,NODE_LOCAL, 1958 bytes)
17/01/24 11:28:27 INFO Executor: Running task 2.0 in stage 4.0 (TID 12)
17/01/24 11:28:27 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 11) in 1675 ms on localhost (2/3)
17/01/24 11:28:27 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
17/01/24 11:28:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/01/24 11:28:28 INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000002_12' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000002
17/01/24 11:28:28 INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000002_12: Committed
17/01/24 11:28:28 INFO Executor: Finished task 2.0 in stage 4.0 (TID 12). 2080 bytes result sent to driver
17/01/24 11:28:28 INFO DAGScheduler: ResultStage 4 (saveAsTextFile at MobileLocation.scala:110) finished in 3.750 s
17/01/24 11:28:28 INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID 12) in 288 ms on localhost (3/3)
17/01/24 11:28:28 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
17/01/24 11:28:28 INFO DAGScheduler: Job 0 finished: saveAsTextFile at MobileLocation.scala:110, took 14.029200 s
17/01/24 11:28:28 INFO SparkUI: Stopped Spark web UI at http://192.168.71.11:4040
17/01/24 11:28:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/01/24 11:28:28 INFO MemoryStore: MemoryStore cleared
17/01/24 11:28:28 INFO BlockManager: BlockManager stopped
17/01/24 11:28:28 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/24 11:28:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/01/24 11:28:28 INFO SparkContext: Successfully stopped SparkContext
17/01/24 11:28:28 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/01/24 11:28:28 INFO ShutdownHookManager: Shutdown hook called
17/01/24 11:28:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e
17/01/24 11:28:28 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/01/24 11:28:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/httpd-34193396-8d9c-4ea7-9fca-5caf8c712f86
8. Inspecting the output
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00000
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00001
(18688888888,List((16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645)))))
(18611132889,List((16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645)))))
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00002
Both phone numbers happened to hash to the same partition, so all results land in part-00001 while part-00000 and part-00002 are empty. For each phone, the Top-2 stations by dwell time are listed with their coordinates, as required.