Spark from Getting Started to Giving Up — Installing and Starting Spark 2.4.7 (Part 2)

╭⌒若隱_RowYet, published on 2020-12-14

Spark Version

  Spark is a top-level Apache open-source project (see the official website) and is still being actively updated. As of the time this post was published (2020-12-02), the latest release was Spark 3.0.1 (Sep 08, 2020). Since my company currently runs Spark 2.4.7 in production, the rest of this tutorial series is based on Spark 2.4.7.

   Version covered: Spark 2.4.7

  As the saying goes, to do a good job one must first sharpen one's tools. This post walks through installing and starting Spark.

Installation Prerequisites

  As mentioned in Spark from Getting Started to Giving Up — Getting to Know Spark (Part 1), Spark is only a compute framework that replaces MapReduce, so it has to rely on an external file system; for open-source storage we usually go with Hadoop HDFS, typically together with Hive. For installing HDFS and Hive, setting up passwordless SSH across the cluster, and the required Java version, please refer to the following posts; those steps are not repeated here:

  1. Hadoop Cluster Big Data Solutions — Building Hadoop 3.X in HA Mode (Part 2)
  2. Hive from Getting Started to Giving Up — Hive Installation and Basic Usage (Part 2)

  Following the principle of data locality, it is best to install Spark on every node that runs an HDFS NameNode or DataNode, so we will install Spark on node1, node2, node3, and node4 of the test cluster.
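  Before moving on, it may be worth double-checking on each node that these prerequisites are actually in place. A minimal sanity-check sketch is shown below; the paths and commands assume the setup from the earlier Hadoop/Hive posts, so adjust them to your own environment.

# Quick prerequisite checks, run on every node
java -version            # expect Java 1.8.x
hadoop version           # confirms HADOOP_HOME and PATH are set
hdfs dfs -ls /           # confirms HDFS is up and reachable
ssh node2 hostname       # from node1: confirms passwordless SSH works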

Installing Scala

  Spark itself is written in Scala and natively provides a Scala API. Whether you are pulling data ad hoc with Spark or writing a full Spark application, Scala is concise and convenient, and the interactive shell started by spark-shell defaults to Scala as well. All of this makes Scala and Spark a natural pair. Even James Gosling, the father of Java, once said that if he had to pick a language other than Java, he would choose Scala. So Scala must be installed before installing Spark.

  Scala official website: link

  1. Upload scala-2.11.12.tgz to the /data/tools directory on node1 and extract it with the following command:
tar -zxvf scala-2.11.12.tgz
  2. Set the Scala environment variables:
# Edit the environment variable file /etc/profile
sudo vim /etc/profile

# Append the Scala paths at the end of /etc/profile, then save with :wq
export SCALA_HOME=/data/tools/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin

# Reload the environment variable file
source /etc/profile
  3. Verify that Scala is installed. Since the environment variables are set, you can type scala in any folder to start the Scala REPL; enter 1+1 and it returns an Int named res0 equal to 2, as shown below. Congratulations, Scala is installed. Press Ctrl+C (or type :quit) to leave the interactive shell.
[hadoop@node1 tools]$ scala
Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2

scala>
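  If you just want a quick non-interactive check (handy when verifying several nodes), scala -version prints the installed version without entering the REPL; the banner below is indicative of what to expect.

scala -version
# Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL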

  4. Install Scala on node2, node3, and node4 as well; scp the tarball to each node as follows:
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node2:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  40.0MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node3:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  42.8MB/s   00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node4:/data/tools/
scala-2.11.12.tgz                                                                                                                                                            100%   28MB  47.8MB/s   00:00
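  If you prefer a scripted alternative for the remaining per-node steps (extracting the tarball and appending the environment variables), a minimal loop sketch is shown below; it assumes passwordless SSH and passwordless sudo on each node, which you may or may not have. The MobaXterm approach described next works just as well.

# Sketch only: extract Scala and append the environment variables on each remote node
for host in node2 node3 node4; do
  ssh "$host" "cd /data/tools && tar -zxvf scala-2.11.12.tgz"
  ssh "$host" "echo 'export SCALA_HOME=/data/tools/scala-2.11.12' | sudo tee -a /etc/profile"
  ssh "$host" "echo 'export PATH=\$PATH:\$SCALA_HOME/bin' | sudo tee -a /etc/profile"
done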

  5. Rather than scripting these remaining steps (a sketch is shown above), I'll be lazy since there are only a few machines and use MobaXterm's Write commands on all terminals feature, i.e. multi-terminal broadcast, which lets every open remote window receive the same input at once. It's a very handy feature, as shown in the figures below.


Figure 1: Enabling MobaXterm's Write commands on all terminals feature


Figure 2: Using MobaXterm's Write commands on all terminals feature

  At this point the Scala installation is complete.

Installing Spark 2.4.7

Downloading Spark 2.4.7

  Download from the official website: link
  After downloading, upload spark-2.4.7-bin-hadoop2.7.tgz to the /data/tools folder on node1 and extract it with the following command:

tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz

Modifying the Spark Configuration Files

  The Spark configuration files live in the /data/tools/spark-2.4.7-bin-hadoop2.7/conf directory, whose contents look like this:

-rwxr-xr-x. 1 hadoop hadoop    996 9月   8 13:48 docker.properties.template
-rwxr-xr-x. 1 hadoop hadoop   1105 9月   8 13:48 fairscheduler.xml.template
-rwxr-xr-x. 1 hadoop hadoop   2025 9月   8 13:48 log4j.properties.template
-rwxr-xr-x. 1 hadoop hadoop   7801 9月   8 13:48 metrics.properties.template
-rwxr-xr-x. 1 hadoop hadoop    865 9月   8 13:48 slaves.template
-rwxr-xr-x. 1 hadoop hadoop   1292 9月   8 13:48 spark-defaults.conf.template
-rwxr-xr-x. 1 hadoop hadoop   4221 9月   8 13:48 spark-env.sh.template

  The configuration files we need to modify are spark-env.sh, spark-defaults.conf, and slaves. They do not appear in the listing above because they have to be copied out of their .template counterparts first.

Modifying spark-env.sh

  1. First cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the spark-env.sh file
cp ./spark-env.sh.template ./spark-env.sh
  2. Edit spark-env.sh as shown below. Only the uncommented lines at the end need attention, and the values should match your own machines. The line export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath) must be present, otherwise you are likely to hit exceptions, and /data/tools/hadoop-2.8.5/bin/hadoop must point at your own Hadoop path, so do not copy it verbatim!
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS


export JAVA_HOME=/data/tools/jdk1.8.0_211
export SCALA_HOME=/data/tools/scala-2.11.12
export HADOOP_HOME=/data/tools/hadoop-2.8.5
export HADOOP_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop
export YARN_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop

export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077

export SPARK_DRIVER_MEMORY=1G
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=2G

export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=2G
export SPARK_WORKER_INSTANCES=1

export SPARK_LOG_DIR=/data/logs/spark/
export SPARK_WORKER_DIR=/data/logs/spark/worker
export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=100 -Dspark.history.fs.logDirectory=hdfs://dw-cluster:8020/opt/spark/applicationHistory"
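  Before relying on the SPARK_DIST_CLASSPATH line above, it is worth confirming that the hadoop classpath command actually resolves on your machine; a quick check with the Hadoop path used in this post (adjust to yours):

# Should print a long colon-separated list of Hadoop directories and jars;
# if it errors out, fix your Hadoop installation before starting Spark.
/data/tools/hadoop-2.8.5/bin/hadoop classpath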

Modifying spark-defaults.conf

  1. As before, first cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the spark-defaults.conf file
cp ./spark-defaults.conf.template ./spark-defaults.conf
  2. Edit spark-defaults.conf with the content below. node1 is the hostname of one of my nodes, and hdfs://dw-cluster:8020/user/hive/warehouse is the root directory of the Hive data warehouse on my HA HDFS.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

spark.master                            spark://node1:7077
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://dw-cluster:8020/opt/spark/applicationHistory
spark.serializer                        org.apache.spark.serializer.KryoSerializer
spark.eventLog.compress                 true
spark.yarn.historyServer.address        http://node1:18018
spark.sql.warehouse.dir                 hdfs://dw-cluster:8020/user/hive/warehouse
spark.sql.parquet.enableVectorizedReader        false
spark.sql.parquet.writeLegacyFormat             true
spark.debug.maxToStringFields           100
spark.network.timeout                   300000
spark.yarn.jars               hdfs://dw-cluster/tmp/spark_jars/*.jar
spark.port.maxRetries   100
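  The HDFS paths referenced above (spark.eventLog.dir and spark.yarn.jars) should exist before Spark tries to use them. A minimal preparation sketch using the same paths as the configuration above (adjust to your own cluster):

# Create the event-log directory used by spark.eventLog.dir and the history server
hdfs dfs -mkdir -p hdfs://dw-cluster:8020/opt/spark/applicationHistory

# Upload the Spark jars referenced by spark.yarn.jars so YARN executors can fetch them
hdfs dfs -mkdir -p hdfs://dw-cluster/tmp/spark_jars
hdfs dfs -put /data/tools/spark-2.4.7-bin-hadoop2.7/jars/*.jar hdfs://dw-cluster/tmp/spark_jars/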

Modifying slaves

  1. Again, first cp the file out of its template:
# Enter /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf

# cp the slaves file
cp ./slaves.template ./slaves
  2. Edit slaves as follows; this file simply lists Spark's worker nodes.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
node2
node3
node4

Adding the hive-site.xml File

  Copy the configuration file from your existing Hive installation into Spark's conf directory with the command below, adjusting the paths to your own Hive and Spark locations:

cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/

  The purpose is to enable Spark on Hive, i.e. to read and manipulate the data stored in Hive through spark-sql.

Adding mysql-connector-java-8.0.13.jar

  Because the Hive metastore lives in MySQL, the MySQL JDBC driver that Hive uses, mysql-connector-java-8.0.13.jar, also has to be copied over, as follows:

cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/

  With this in place, Spark on Hive really works.

   All of the configuration files are now in place!

Copying Spark to the Other Nodes

  First, scp spark-2.4.7-bin-hadoop2.7.tgz from node1 to the other nodes:

scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node2:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node3:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node4:/data/tools/

  Use MobaXterm's Write commands on all terminals feature to extract spark-2.4.7-bin-hadoop2.7.tgz on every node, then copy all of the configuration files from node1 to the other nodes:

scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node2:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node3:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node4:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/

Configuring the Spark 2.4.7 Environment Variables

  Use MobaXterm's Write commands on all terminals feature to configure the Spark environment variables on node1, node2, node3, and node4: run sudo vim /etc/profile and append the following at the end:

export SPARK_HOME=/data/tools/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
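  After appending these lines, reload the profile and confirm the variables took effect on each node; a quick sketch:

source /etc/profile
spark-submit --version    # should print the banner with "version 2.4.7"
which spark-shell         # should resolve to /data/tools/spark-2.4.7-bin-hadoop2.7/bin/spark-shell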

Starting Spark 2.4.7

  1. Pick any node (here we use node1 for the experiment) and start the Spark services first, as follows:
[hadoop@node1 conf]$ cd /data/tools/spark-2.4.7-bin-hadoop2.7/sbin/
[hadoop@node1 sbin]$ ll
總用量 92
-rwxr-xr-x. 1 hadoop hadoop 2803 9月   8 13:48 slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1429 9月   8 13:48 spark-config.sh
-rwxr-xr-x. 1 hadoop hadoop 5689 9月   8 13:48 spark-daemon.sh
-rwxr-xr-x. 1 hadoop hadoop 1262 9月   8 13:48 spark-daemons.sh
-rwxr-xr-x. 1 hadoop hadoop 1190 9月   8 13:48 start-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1274 9月   8 13:48 start-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 2050 9月   8 13:48 start-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1877 9月   8 13:48 start-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1423 9月   8 13:48 start-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1279 9月   8 13:48 start-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 3151 9月   8 13:48 start-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1527 9月   8 13:48 start-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1857 9月   8 13:48 start-thriftserver.sh
-rwxr-xr-x. 1 hadoop hadoop 1478 9月   8 13:48 stop-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1056 9月   8 13:48 stop-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 1080 9月   8 13:48 stop-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1227 9月   8 13:48 stop-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1084 9月   8 13:48 stop-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1067 9月   8 13:48 stop-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1557 9月   8 13:48 stop-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1064 9月   8 13:48 stop-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1066 9月   8 13:48 stop-thriftserver.sh
[hadoop@node1 sbin]$ pwd
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin
[hadoop@node1 sbin]$ ./start-all.sh
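  Once start-all.sh returns, you can confirm that the standalone daemons came up by running jps: node1 should show a Master process and node2, node3, node4 should each show a Worker. A rough check sketch (exact PIDs will differ):

# On node1 (the master)
jps    # expect a Master process alongside your Hadoop daemons

# Check the workers remotely from node1
for host in node2 node3 node4; do ssh "$host" jps; done    # expect a Worker on each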
  2. Open spark-shell or spark-sql for interactive use.
  • First test spark-sql: type spark-sql on node1, as follows:
[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……

  After a burst of loading output, if no errors appear the setup is basically working; if errors do appear, fix the corresponding issues. Next, test querying data stored in Hive. I created a student table in Hive beforehand with just two columns, id and sname, and a single row in it:

Table 1: Contents of the Hive table student

id      sname
1       rowyet

  Our spark-sql test query then behaves as follows:

spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr      cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1       rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)

  • Next test spark-shell: type spark-shell on node1. As shown below, it prints the Spark version banner and then drops into a Scala interactive shell:
[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:20:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/14 03:20:37 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node1:4041
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214032040-0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

  Then check whether spark-shell itself works correctly: as shown below, create a DataFrame with a single column (the concept will be explained later; this is just a test) and display it.

[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:33:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214033402-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val myRange = spark.range(1000).toDF("number")
myRange: org.apache.spark.sql.DataFrame = [number: bigint]

scala> myRange.show
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows



The Spark Monitoring UI

  Spark's default monitoring UI listens on port 8080: open http://node1:8080/ in a browser to see which jobs are running, which have finished, the cluster's resource usage, and so on.

Figure 3: The Spark monitoring UI
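  Since SPARK_HISTORY_OPTS was configured in spark-env.sh earlier, you can also start the history server to browse finished applications; a minimal sketch using the port (4000) and event-log directory set above:

# Start the Spark history server (reads the event logs under
# hdfs://dw-cluster:8020/opt/spark/applicationHistory configured earlier)
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin/start-history-server.sh

# Quick checks from the command line
curl -s http://node1:8080/ | head -n 5    # standalone master UI
curl -s http://node1:4000/ | head -n 5    # history server UI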

Starting Spark on YARN

  Spark on YARN is the most widely used mode in the open-source world. It will be covered in detail later; for now this is just a teaser on how to launch it. The main parameters involved are listed below:

--master: where the master runs
--deploy-mode: whether to run in YARN cluster mode or client mode (the interactive shells below can only run in client mode)
--executor-memory: memory per executor
--executor-cores: cores per executor
--num-executors: number of executors, so the examples below request 20 cores and 32 GB of memory in total (4 executors × 5 cores, 4 × 8 GB)
--queue: the YARN queue the job runs in
--conf: additional configuration properties

# Launch spark-sql (interactive shells must run their driver in client mode)
spark-sql --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# Launch spark-shell (likewise, client mode only)
spark-shell --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m

# spark-submit (a packaged application can use cluster mode)
spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m  --class cn.ruoying.esapp.App  hdfs:///app/hive_to_es/etl_jar/SparkOnHiveToEs_v1.jar

  With that, the entire Spark 2.4.7 installation and configuration is complete!
