Spark from Getting Started to Giving Up — Installing and Starting Spark 2.4.7 (Part 2)

╭⌒若隱_RowYet, published on 2020-12-14

Spark Version

  Spark is a top-level Apache open-source project (see the official website) and is still being actively updated. As of the time this post was published (2020-12-02), the latest release was Spark 3.0.1 (Sep 08, 2020). Since my company currently runs Spark 2.4.7 in production, the rest of this tutorial series is based on Spark 2.4.7.

   Version covered: Spark 2.4.7

  As the saying goes, to do a good job one must first sharpen one's tools. This post walks through installing and starting Spark.

Installation Prerequisites

  As mentioned in Spark from Getting Started to Giving Up — Getting to Know Spark (Part 1), Spark is only a compute framework that replaces MapReduce, so it has to rely on an external file system; for open-source storage we usually go with Hadoop HDFS, typically together with Hive. For installing HDFS and Hive, setting up passwordless SSH across the cluster, and the required Java version, please refer to the following posts; those steps are not repeated here:

  1. Hadoop Cluster Big Data Solutions — Building Hadoop 3.X in HA Mode (Part 2)
  2. Hive from Getting Started to Giving Up — Hive Installation and Basic Usage (Part 2)

  Following the principle of data locality, it is best to install Spark on every node that runs an HDFS NameNode or DataNode, so we will install Spark on node1, node2, node3, and node4 of the test cluster.
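  Before moving on, it may be worth double-checking on each node that these prerequisites are actually in place. A minimal sanity-check sketch is shown below; the paths and commands assume the setup from the earlier Hadoop/Hive posts, so adjust them to your own environment.

# Quick prerequisite checks, run on every node
java -version            # expect Java 1.8.x
hadoop version           # confirms HADOOP_HOME and PATH are set
hdfs dfs -ls /           # confirms HDFS is up and reachable
ssh node2 hostname       # from node1: confirms passwordless SSH works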

Installing Scala

  Spark itself is written in Scala and natively provides a Scala API. Whether you are pulling data ad hoc with Spark or writing a full Spark application, Scala is concise and convenient, and the interactive shell started by spark-shell defaults to Scala as well. All of this makes Scala and Spark a natural pair. Even James Gosling, the father of Java, once said that if he had to pick a language other than Java, he would choose Scala. So Scala must be installed before installing Spark.

  Scala official website: link

  1. Upload scala-2.11.12.tgz to the /data/tools directory on node1 and extract it with the following command:
tar -zxvf scala-2.11.12.tgz
  2. Set the Scala environment variables:
# Edit the environment variable file /etc/profile
sudo vim /etc/profile

# Append the Scala paths at the end of /etc/profile, then save with :wq
export SCALA_HOME=/data/tools/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin

# Reload the environment variable file
source /etc/profile
  3. Verify that Scala is installed. Since the environment variables are set, you can type scala in any folder to start the Scala REPL; enter 1+1 and it returns an Int named res0 equal to 2, as shown below. Congratulations, Scala is installed. Press Ctrl+C (or type :quit) to leave the interactive shell.
[hadoop@node1 tools]$ scala
Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2

scala>
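  If you just want a quick non-interactive check (handy when verifying several nodes), scala -version prints the installed version without entering the REPL; the banner below is indicative of what to expect.

scala -version
# Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL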

  4. Install Scala on node2, node3, and node4 as well; scp the tarball to each node as follows:
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node2:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  40.0MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node3:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  42.8MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node4:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  47.8MB/s   00:00
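  If you prefer a scripted alternative for the remaining per-node steps (extracting the tarball and appending the environment variables), a minimal loop sketch is shown below; it assumes passwordless SSH and passwordless sudo on each node, which you may or may not have. The MobaXterm approach described next works just as well.

# Sketch only: extract Scala and append the environment variables on each remote node
for host in node2 node3 node4; do
  ssh "$host" "cd /data/tools && tar -zxvf scala-2.11.12.tgz"
  ssh "$host" "echo 'export SCALA_HOME=/data/tools/scala-2.11.12' | sudo tee -a /etc/profile"
  ssh "$host" "echo 'export PATH=\$PATH:\$SCALA_HOME/bin' | sudo tee -a /etc/profile"
done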

  5. Rather than scripting these remaining steps (a sketch is shown above), I'll be lazy since there are only a few machines and use MobaXterm's Write commands on all terminals feature, i.e. multi-terminal broadcast, which lets every open remote window receive the same input at once. It's a very handy feature, as shown in the figures below.


Figure 1: Enabling MobaXterm's Write commands on all terminals feature


Figure 2: Using MobaXterm's Write commands on all terminals feature

  At this point the Scala installation is complete.

Installing Spark 2.4.7

Downloading Spark 2.4.7

  Download from the official website: link
  After downloading, upload spark-2.4.7-bin-hadoop2.7.tgz to the /data/tools folder on node1 and extract it with the following command:

tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz

Modifying the Spark Configuration Files

  The Spark configuration files live in the /data/tools/spark-2.4.7-bin-hadoop2.7/conf directory, whose contents look like this:

-rwxr-xr-x. 1 hadoop hadoop    996 9月   8 13:48 docker.properties.template
-rwxr-xr-x. 1 hadoop hadoop   1105 9月   8 13:48 fairscheduler.xml.template
-rwxr-xr-x. 1 hadoop hadoop   2025 9月   8 13:48 log4j.properties.template
-rwxr-xr-x. 1 hadoop hadoop   7801 9月   8 13:48 metrics.properties.template
-rwxr-xr-x. 1 hadoop hadoop    865 9月   8 13:48 slaves.template
-rwxr-xr-x. 1 hadoop hadoop   1292 9月   8 13:48 spark-defaults.conf.template
-rwxr-xr-x. 1 hadoop hadoop   4221 9月   8 13:48 spark-env.sh.template

  The configuration files we need to modify are spark-env.sh, spark-defaults.conf, and slaves. They do not appear in the listing above because they have to be copied out of their .template counterparts first.

Modifying spark-env.sh

  1. First cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the spark-env.sh file
cp ./spark-env.sh.template ./spark-env.sh
  2. Edit spark-env.sh as shown below. Only the uncommented lines at the end need attention, and the values should match your own machines. The line export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath) must be present, otherwise you are likely to hit exceptions, and /data/tools/hadoop-2.8.5/bin/hadoop must point at your own Hadoop path, so do not copy it verbatim!
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS


export JAVA_HOME=/data/tools/jdk1.8.0_211
export SCALA_HOME=/data/tools/scala-2.11.12
export HADOOP_HOME=/data/tools/hadoop-2.8.5
export HADOOP_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop
export YARN_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop

export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077

export SPARK_DRIVER_MEMORY=1G
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=2G

export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=2G
export SPARK_WORKER_INSTANCES=1

export SPARK_LOG_DIR=/data/logs/spark/
export SPARK_WORKER_DIR=/data/logs/spark/worker
export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=100 -Dspark.history.fs.logDirectory=hdfs://dw-cluster:8020/opt/spark/applicationHistory"
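  Before relying on the SPARK_DIST_CLASSPATH line above, it is worth confirming that the hadoop classpath command actually resolves on your machine; a quick check with the Hadoop path used in this post (adjust to yours):

# Should print a long colon-separated list of Hadoop directories and jars;
# if it errors out, fix your Hadoop installation before starting Spark.
/data/tools/hadoop-2.8.5/bin/hadoop classpath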

Modifying spark-defaults.conf

  1. As before, first cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the spark-defaults.conf file
cp ./spark-defaults.conf.template ./spark-defaults.conf
  2. Edit spark-defaults.conf with the content below. node1 is the hostname of one of my nodes, and hdfs://dw-cluster:8020/user/hive/warehouse is the root directory of the Hive data warehouse on my HA HDFS.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

spark.master                            spark://node1:7077
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://dw-cluster:8020/opt/spark/applicationHistory
spark.serializer                        org.apache.spark.serializer.KryoSerializer
spark.eventLog.compress                 true
spark.yarn.historyServer.address        http://node1:18018
spark.sql.warehouse.dir                 hdfs://dw-cluster:8020/user/hive/warehouse
spark.sql.parquet.enableVectorizedReader        false
spark.sql.parquet.writeLegacyFormat             true
spark.debug.maxToStringFields           100
spark.network.timeout                   300000
spark.yarn.jars               hdfs://dw-cluster/tmp/spark_jars/*.jar
spark.port.maxRetries   100
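  The HDFS paths referenced above (spark.eventLog.dir and spark.yarn.jars) should exist before Spark tries to use them. A minimal preparation sketch using the same paths as the configuration above (adjust to your own cluster):

# Create the event-log directory used by spark.eventLog.dir and the history server
hdfs dfs -mkdir -p hdfs://dw-cluster:8020/opt/spark/applicationHistory

# Upload the Spark jars referenced by spark.yarn.jars so YARN executors can fetch them
hdfs dfs -mkdir -p hdfs://dw-cluster/tmp/spark_jars
hdfs dfs -put /data/tools/spark-2.4.7-bin-hadoop2.7/jars/*.jar hdfs://dw-cluster/tmp/spark_jars/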

Modifying slaves

  1. Again, first cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the slaves file
cp ./slaves.template ./slaves
  2. Edit slaves as follows; this file simply lists Spark's worker nodes.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
node2
node3
node4

Adding the hive-site.xml File

  Copy the configuration file from your existing Hive installation into Spark's conf directory with the command below, adjusting the paths to your own Hive and Spark locations:

cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/

  The purpose is to enable Spark on Hive, i.e. to read and manipulate the data stored in Hive through spark-sql.

Adding mysql-connector-java-8.0.13.jar

  Because the Hive metastore lives in MySQL, the MySQL JDBC driver that Hive uses, mysql-connector-java-8.0.13.jar, also has to be copied over, as follows:

cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/

  With this in place, Spark on Hive really works.

   All of the configuration files are now in place!

Copying Spark to the Other Nodes

  First, scp spark-2.4.7-bin-hadoop2.7.tgz from node1 to the other nodes:

scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node2:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node3:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node4:/data/tools/

  Use MobaXterm's Write commands on all terminals feature to extract spark-2.4.7-bin-hadoop2.7.tgz on every node, then copy all of the configuration files from node1 to the other nodes:

scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node2:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node3:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node4:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/

Configuring the Spark 2.4.7 Environment Variables

  Use MobaXterm's Write commands on all terminals feature to configure the Spark environment variables on node1, node2, node3, and node4: run sudo vim /etc/profile and append the following at the end:

export SPARK_HOME=/data/tools/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
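  After appending these lines, reload the profile and confirm the variables took effect on each node; a quick sketch:

source /etc/profile
spark-submit --version    # should print the banner with "version 2.4.7"
which spark-shell         # should resolve to /data/tools/spark-2.4.7-bin-hadoop2.7/bin/spark-shell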

Starting Spark 2.4.7

  1. Pick any node (here we use node1 for the experiment) and start the Spark services first, as follows:
[hadoop@node1 conf]$ cd /data/tools/spark-2.4.7-bin-hadoop2.7/sbin/
[hadoop@node1 sbin]$ ll
總用量 92
-rwxr-xr-x. 1 hadoop hadoop 2803 9月   8 13:48 slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1429 9月   8 13:48 spark-config.sh
-rwxr-xr-x. 1 hadoop hadoop 5689 9月   8 13:48 spark-daemon.sh
-rwxr-xr-x. 1 hadoop hadoop 1262 9月   8 13:48 spark-daemons.sh
-rwxr-xr-x. 1 hadoop hadoop 1190 9月   8 13:48 start-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1274 9月   8 13:48 start-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 2050 9月   8 13:48 start-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1877 9月   8 13:48 start-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1423 9月   8 13:48 start-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1279 9月   8 13:48 start-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 3151 9月   8 13:48 start-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1527 9月   8 13:48 start-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1857 9月   8 13:48 start-thriftserver.sh
-rwxr-xr-x. 1 hadoop hadoop 1478 9月   8 13:48 stop-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1056 9月   8 13:48 stop-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 1080 9月   8 13:48 stop-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1227 9月   8 13:48 stop-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1084 9月   8 13:48 stop-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1067 9月   8 13:48 stop-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1557 9月   8 13:48 stop-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1064 9月   8 13:48 stop-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1066 9月   8 13:48 stop-thriftserver.sh
[hadoop@node1 sbin]$ pwd
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin
[hadoop@node1 sbin]$ ./start-all.sh
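  Once start-all.sh returns, you can confirm that the standalone daemons came up by running jps: node1 should show a Master process and node2, node3, node4 should each show a Worker. A rough check sketch (exact PIDs will differ):

# On node1 (the master)
jps    # expect a Master process alongside your Hadoop daemons

# Check the workers remotely from node1
for host in node2 node3 node4; do ssh "$host" jps; done    # expect a Worker on each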
  2. Open spark-shell or spark-sql for interactive use.
  • First test spark-sql: type spark-sql on node1, as follows:
[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……

  After a burst of loading output, if no errors appear the setup is basically working; if errors do appear, fix the corresponding issues. Next, test querying data stored in Hive. I created a student table in Hive beforehand with just two columns, id and sname, and a single row in it:

Table 1: Contents of the Hive table student

id      sname
1       rowyet

  Our spark-sql test query then behaves as follows:

spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr      cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1       rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)

  • Next test spark-shell: type spark-shell on node1. As shown below, it prints the Spark version banner and then drops into a Scala interactive shell:
[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:20:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/14 03:20:37 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node1:4041
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214032040-0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

  Then check whether spark-shell itself works correctly: as shown below, create a DataFrame with a single column (the concept will be explained later; this is just a test) and display it.

[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:33:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214033402-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val myRange = spark.range(1000).toDF("number")
myRange: org.apache.spark.sql.DataFrame = [number: bigint]

scala> myRange.show
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows



The Spark Monitoring UI

  Spark's default monitoring UI listens on port 8080: open http://node1:8080/ in a browser to see which jobs are running, which have finished, the cluster's resource usage, and so on.

Figure 3: The Spark monitoring UI
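  Since SPARK_HISTORY_OPTS was configured in spark-env.sh earlier, you can also start the history server to browse finished applications; a minimal sketch using the port (4000) and event-log directory set above:

# Start the Spark history server (reads the event logs under
# hdfs://dw-cluster:8020/opt/spark/applicationHistory configured earlier)
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin/start-history-server.sh

# Quick checks from the command line
curl -s http://node1:8080/ | head -n 5    # standalone master UI
curl -s http://node1:4000/ | head -n 5    # history server UI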

Starting Spark on YARN

  Spark on YARN is the most widely used mode in the open-source world. It will be covered in detail later; for now this is just a teaser on how to launch it. The main parameters involved are listed below:

--master: where the master runs
--deploy-mode: whether to run in YARN cluster mode or client mode (the interactive shells below can only run in client mode)
--executor-memory: memory per executor
--executor-cores: cores per executor
--num-executors: number of executors, so the examples below request 20 cores and 32 GB of memory in total (4 executors × 5 cores, 4 × 8 GB)
--queue: the YARN queue the job runs in
--conf: additional configuration properties

# Launch spark-sql (interactive shells must run their driver in client mode)
spark-sql --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# Launch spark-shell (likewise, client mode only)
spark-shell --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# spark-submit (a packaged application can use cluster mode)
spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m  --class cn.ruoying.esapp.App  hdfs:///app/hive_to_es/etl_jar/SparkOnHiveToEs_v1.jar

  With that, the entire Spark 2.4.7 installation and configuration is complete!
