Spark from Zero to Development (4): Word Count in Three Environments

Posted by weixin_34239169 on 2018-08-18

Implementation 1: spark-shell

Mainly used for testing: before deploying to a cluster, you can feed in a small collection of test data yourself to check that the processing flow works end to end, as in the sketch below.
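
For example, once the shell from section 1.2 is up, a minimal sanity check against an in-memory collection might look like the following sketch; the sample lines ("hello spark", "hello hadoop") are made up purely for illustration:

// Hypothetical in-shell check: build an RDD from a small collection instead of reading HDFS.
val testLines = sc.parallelize(Seq("hello spark", "hello hadoop"))
val testCounts = testLines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
// collect() pulls the (tiny) result back to the driver; expect (hello,2), (spark,1), (hadoop,1) in some order.
testCounts.collect().foreach(println)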

1.1 Upload the file to HDFS

First, upload the text file to the /spark directory on HDFS. The file contents:

[root@s166 fantj]# cat spark.txt 
What is “version control”, and why should you care? Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.

If you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use. It allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more. Using a VCS also generally means that if you screw things up or lose files, you can easily recover. In addition, you get all this for very little overhead.

Local Version Control Systems
Many people’s version-control method of choice is to copy files into another directory (perhaps a time-stamped directory, if they’re clever). This approach is very common because it is so simple, but it is also incredibly error prone. It is easy to forget which directory you’re in and accidentally write to the wrong file or copy over files you don’t mean to.

To deal with this issue, programmers long ago developed local VCSs that had a simple database that kept all the changes to files under revision control.
[root@s166 fantj]# vim spark.txt
[root@s166 fantj]# hadoop fs -mkdir -p /spark
[root@s166 fantj]# hadoop fs -put spark.txt /spark
[root@s166 fantj]# hadoop fs -ls -R /spark
-rw-r--r--   3 root supergroup       1527 2018-07-30 23:12 /spark/spark.txt

1.2 Start the shell

[root@s166 fantj]# spark-shell 
18/07/31 04:53:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/31 04:53:55 INFO spark.SecurityManager: Changing view acls to: root
18/07/31 04:53:55 INFO spark.SecurityManager: Changing modify acls to: root
18/07/31 04:53:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/07/31 04:53:58 INFO spark.HttpServer: Starting HTTP Server
18/07/31 04:53:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/07/31 04:53:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36422
18/07/31 04:53:59 INFO util.Utils: Successfully started service 'HTTP class server' on port 36422.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

...
...
18/07/31 04:57:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/6814e7a5-b896-49ac-bcd8-0b94e1a4b165/_tmp_space.db
18/07/31 04:57:30 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> 

1.3 Run the Scala program

sc.textFile("/spark/spark.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")

sc is the SparkContext object, the entry point for submitting a Spark program.
textFile("/spark/spark.txt") reads the data from HDFS.
flatMap(_.split(" ")) maps each line to words and then flattens the result.
map((_,1)) turns each word into a (word, 1) tuple.
reduceByKey(_+_) reduces by key, accumulating the values.
saveAsTextFile("/spark/out") writes the result to HDFS.

scala> sc.textFile("/spark/spark.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")
18/07/31 04:59:40 INFO storage.MemoryStore: ensureFreeSpace(57160) called with curMem=0, maxMem=560497950
18/07/31 04:59:40 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 55.8 KB, free 534.5 MB)
18/07/31 04:59:44 INFO storage.MemoryStore: ensureFreeSpace(17347) called with curMem=57160, maxMem=560497950
18/07/31 04:59:44 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 534.5 MB)
18/07/31 04:59:44 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37951 (size: 16.9 KB, free: 534.5 MB)
18/07/31 04:59:44 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:22
18/07/31 04:59:49 INFO mapred.FileInputFormat: Total input paths to process : 1
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/07/31 05:00:51 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
18/07/31 05:00:51 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 4730 ms on localhost (1/1)
18/07/31 05:00:51 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/07/31 05:00:51 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTextFile at <console>:22) finished in 4.733 s
18/07/31 05:00:52 INFO scheduler.DAGScheduler: Job 0 finished: saveAsTextFile at <console>:22, took 15.399221 s
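
To inspect the result directly in the shell instead of writing it to HDFS, a small variant of the same chain can sort by count and print the most frequent words. This is just a sketch, assuming the same input file:

val counts = sc.textFile("/spark/spark.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
// Sort by descending count and bring only the ten most frequent words back to the driver.
counts.sortBy(_._2, ascending = false).take(10).foreach(println)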

1.4 View the results

[root@s166 ~]# hadoop fs -cat /spark/out/p*
(simple,,1)
(nearly,1)
(For,1)
(back,2)
(this,4)
(under,1)
(it,2)
(means,1)
(introduced,1)
(revision,1)
(when,,1)
...
...
(To,1)
((which,1)
...
(prone.,1)
(an,2)
(time,,1)
(things,1)
(they’re,1)
...
(might,1)
(would,1)
(issue,,1)
(state,,2)
(Systems,1)
(System,1)
(write,1)
(being,1)
(programmers,1)

Implementation 2: Java, local execution

Mainly used for ad-hoc, temporary processing.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.fantj</groupId>
    <artifactId>bigdata</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>bigdata</name>
    <description>Demo project for Spring Boot</description>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!--<dependency>-->
            <!--<groupId>org.apache.hive</groupId>-->
            <!--<artifactId>hive-jdbc</artifactId>-->
            <!--<version>2.1.0</version>-->
        <!--</dependency>-->
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>ch.cern.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.5.1</version>
            <classifier>sources</classifier>
            <type>java-source</type>
        </dependency>
        <dependency>
            <groupId>ai.h2o</groupId>
            <artifactId>sparkling-water-core_2.10</artifactId>
            <version>1.6.1</version>
            <type>pom</type>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <!-- Exclude the embedded Tomcat -->
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- Add the servlet dependency -->
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <finalName>wordcount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.fantj.bigdata.WordCountCluster</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>


</project>

I built the project on Spring Boot, so the pom has to exclude Spring Boot's embedded Tomcat: we do not need a servlet container to run a Java job. In the end the project is simply packaged into a jar whose manifest points at the main class, so again no container is required. Beyond that, the pom pulls in the Hadoop and Spark related dependencies. If you have no Maven background, learn Maven first.

WordCountLocal:

package com.fantj.bigdata;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * Created by Fant.J.
 */
public class WordCountLocal {
    public static void main(String[] args) {
        /**
         * Create the SparkConf object and set the application's configuration.
         * setMaster sets the URL of the master node of the Spark cluster the
         * application should connect to; "local" means run locally.
         */
        SparkConf conf = new SparkConf().setAppName("WordCountLocal").setMaster("local");
        /**
         * Create the JavaSparkContext object (the most important object).
         */
        JavaSparkContext sc = new JavaSparkContext(conf);
        /**
         * Create an initial RDD from the input source (an HDFS file, a local file, etc.).
         * Since this is a local test, we read a local file.
         * textFile() creates an RDD from a file-based input source.
         * For an HDFS or local file, each element of the resulting RDD corresponds to one line of the file.
         */
        JavaRDD<String> lines = sc.textFile("C:\\Users\\84407\\Desktop\\spark.txt");
        /**
         * Apply transformation operations to the initial RDD.
         * These are usually expressed by creating a function and passing it to
         * operators such as map and flatMap. If the function is simple, an
         * anonymous inner class of the function interface is enough; if it is
         * complex, a separate class implementing that interface is created.
         * Here each line is split into individual words.
         * FlatMapFunction has two type parameters, the input and output types.
         * The flatMap operator splits one RDD element into one or more elements.
         */
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                // Split the line into words
                return Arrays.asList(s.split(" "));
            }
        });
        /**
         * Next, map each word to a (word, 1) pair so the occurrences can be summed per word.
         * mapToPair turns each element into a (v1, v2) element of type Tuple2.
         * If you remember Scala's tuples: this Tuple2 is indeed the Scala type holding two values.
         * mapToPair is used together with PairFunction; its first type parameter is the input type,
         * and the second and third are the types of the first and second values of the output Tuple2.
         * The two type parameters of JavaPairRDD likewise give the types of the tuple's first and second values.
         */
        JavaPairRDD<String,Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String,Integer>(word,1);
            }
        });
        /**
         * Then count the occurrences of each word, using the word as the key.
         * The reduceByKey operator reduces all the values belonging to the same key.
         * For example, if the JavaPairRDD contains (hello,1), (hello,1), (hello,1),
         * reduce combines the first and second values, then combines that result with the third:
         * for "hello" this is 1 + 1 = 2, then 2 + 1 = 3.
         * Each element of the returned JavaPairRDD is again a tuple: the first value is the key
         * and the second is the reduced value, i.e. the count.
         */
        JavaPairRDD<String,Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1+v2;
            }
        });
        /**
         * At this point the word counts have been computed.
         * However, flatMap, mapToPair and reduceByKey are all transformations;
         * a Spark application cannot consist of transformations alone, so foreach
         * is used here as the action that triggers execution.
         */
        wordCounts.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> wordCount) throws Exception {
                System.out.println(wordCount._1 + "appeared "+ wordCount._2 );
            }
        });
    }
}

As you can see, the code is full of anonymous inner classes; we can replace them with lambdas to make the code more concise.

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("WordCountLocal").setMaster("local");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("C:\\Users\\84407\\Desktop\\spark.txt");

        JavaRDD<String> words = lines.flatMap((FlatMapFunction<String, String>) s -> {
            // Split the line into words
            return Arrays.asList(s.split(" "));
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));

        JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey((Integer v1, Integer v2) -> {
            return v1 + v2;
        });

        wordCounts.foreach((Tuple2<String, Integer> wordCount) -> {
            System.out.println(wordCount._1 + "appeared " + wordCount._2);
        });
    }

Then run this main method.

Console output:
Systemsappeared 1
examplesappeared 1
withappeared 2
inappeared 3
specificappeared 1
versionsappeared 1
recallappeared 1
copyappeared 2
Inappeared 1
VCSsappeared 1
controlled,appeared 1
Whatappeared 1
directory,appeared 1
Manyappeared 1
setappeared 1
loseappeared 1
...
...
systemappeared 1
Systemappeared 1
writeappeared 1
beingappeared 1
programmersappeared 1

Implementation 3: Run on the cluster

This is the most common setup, mainly used for offline batch processing of large datasets stored on HDFS.

Preparation: before this, spark.txt must be uploaded to HDFS (see section 1.1).

3.1 Modify the code

To run on the cluster, the code needs to be changed in two places:

        SparkConf conf = new SparkConf().setAppName("WordCountCluster");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs://s166/spark/spark.txt");

setAppName is kept consistent with the Java class name, and setMaster is no longer hard-coded (so the master can be specified when the job is submitted); the input path is changed to the HDFS path of the file.

3.2 Package with Maven

The Java project from Implementation 2 needs to be packaged into a jar (for example with mvn clean package), uploaded to the cluster, and executed through a script.

3.3 Upload to the cluster

3.4 Write the run script wordcount.sh

[root@s166 fantj]# cat wordcount.sh 

/home/fantj/spark/bin/spark-submit \
--class com.fantj.bigdata.WordCountCluster \
--num-executors 1 \
--driver-memory 100m \
--executor-cores 1 \
/home/fantj/worldcount.jar
3.5 Run the script

./wordcount.sh

[root@s166 fantj]# ./wordcount.sh 
18/07/31 09:43:49 INFO spark.SparkContext: Running Spark version 1.5.1
18/07/31 09:43:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/31 09:43:52 INFO spark.SecurityManager: Changing view acls to: root
18/07/31 09:43:52 INFO spark.SecurityManager: Changing modify acls to: root
18/07/31 09:43:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/07/31 09:43:54 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/07/31 09:43:54 INFO Remoting: Starting remoting
18/07/31 09:43:55 INFO util.Utils: Successfully started service 'sparkDriver' on port 41710.
18/07/31 09:43:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.27.166:41710]
18/07/31 09:43:55 INFO spark.SparkEnv: Registering MapOutputTracker
18/07/31 09:43:55 INFO spark.SparkEnv: Registering BlockManagerMaster
18/07/31 09:43:55 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-96c433c9-8f43-40fa-ba4f-1dc888608140
18/07/31 09:43:55 INFO storage.MemoryStore: MemoryStore started with capacity 52.2 MB
18/07/31 09:43:55 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-5d613c5d-e9c3-416f-8d8b-d87bc5e03e02/httpd-3609b712-55f4-4140-9e05-2ecee834b18c
18/07/31 09:43:55 INFO spark.HttpServer: Starting HTTP Server
..
...
18/07/31 09:44:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
simple,appeared 1
nearlyappeared 1
Forappeared 1
backappeared 2
thisappeared 4
underappeared 1
itappeared 2
meansappeared 1
introducedappeared 1
revisionappeared 1
when,appeared 1
previousappeared 2
realityappeared 1
typeappeared 1
developedappeared 1
Localappeared 1
simpleappeared 1
...
causingappeared 1
changesappeared 3
andappeared 5
designerappeared 1
approachappeared 1
modifiedappeared 1
systemappeared 1
Systemappeared 1
writeappeared 1
beingappeared 1
programmersappeared 1
18/07/31 09:44:12 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
18/07/31 09:44:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 200 ms on localhost (1/1)
18/07/31 09:44:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/07/31 09:44:12 INFO scheduler.DAGScheduler: ResultStage 1 (foreach at WordCountCluster.java:44) finished in 0.209 s
18/07/31 09:44:12 INFO scheduler.DAGScheduler: Job 0 finished: foreach at WordCountCluster.java:44, took 2.938418 s
18/07/31 09:44:12 INFO spark.SparkContext: Invoking stop() from shutdown hook
...
..
18/07/31 09:44:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
