Spark From Getting Started to Giving Up: Installing and Starting Spark 2.4.7 (Part 2)
Spark Version
Spark is a top-level Apache open-source project (see the official site) and is still under active development; as of the time of this post (2020-12-02) the latest release is Spark 3.0.1, released Sep 08, 2020. Because my company currently runs Spark 2.4.7 in production, the rest of this series is based on Spark 2.4.7.
Version covered: Spark 2.4.7
As the saying goes, to do a good job you must first sharpen your tools! This post walks through installing and starting Spark.
Installation Prerequisites
As the first post in this series (Getting to Know Spark, Part 1) explained, Spark is only a compute framework that replaces MapReduce, so it has to sit on top of a file system; in an open-source stack that usually means Hadoop's HDFS, often accessed through Hive. For installing HDFS and Hive, setting up passwordless SSH across the cluster, the required Java version, and so on, please refer to the earlier posts; I won't repeat all of that here.
Following the principle of data locality, Spark should ideally be installed on every node that hosts the HDFS NameNode or a DataNode, so we install Spark on node1, node2, node3, and node4 of the test cluster.
Installing Scala
The Spark framework itself is written in Scala and naturally ships a Scala API. Whether you are pulling data ad hoc or writing a full Spark application, Scala keeps things concise, and even the interactive shell that spark-shell starts by default speaks Scala. Scala and Spark really are made for each other; James Gosling, the father of Java, is even said to have remarked that if he had to pick a language other than Java, he would pick Scala. So before installing Spark, install Scala first.
Scala official site: link
- Upload scala-2.11.12.tgz to the /data/tools directory on node1 and extract it with the following command;
tar -zxvf scala-2.11.12.tgz
- Set up the Scala environment variables;
# Edit the environment file /etc/profile
sudo vim /etc/profile
# Append the Scala paths at the bottom of /etc/profile, then save with :wq
export SCALA_HOME=/data/tools/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
# Reload the environment file
source /etc/profile
- Check that Scala is installed correctly. Because the environment variables are set, you can type scala from any folder to start the Scala REPL; enter 1+1 and it returns an Int named res0 equal to 2, as shown below. Congratulations, Scala is installed. Press Ctrl+C (or type :quit) to leave the REPL;
[hadoop@node1 tools]$ scala
Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211).
Type in expressions for evaluation. Or try :help.
scala> 1+1
res0: Int = 2
scala>
- Install Scala on node2, node3, and node4 as well; scp the tarball to the other nodes as follows;
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node2:/data/tools/
scala-2.11.12.tgz 100% 28MB 40.0MB/s 00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node3:/data/tools/
scala-2.11.12.tgz 100% 28MB 42.8MB/s 00:00
[hadoop@node1 tools]$ scp ./scala-2.11.12.tgz node4:/data/tools/
scala-2.11.12.tgz 100% 28MB 47.8MB/s 00:00
- I won't bother writing a loop script for just a few machines. Instead I cheat with MobaXterm's Write commands on all terminals feature (multi-screen broadcast), which sends whatever you type to every open remote window at once; it's a very handy feature. If you prefer to script it, a minimal loop is sketched below.
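For reference, a loop over the remaining nodes might look like the following. This is only a sketch: it assumes passwordless SSH is already configured (as required in the prerequisites) and that the tarball has already been scp'd to /data/tools on every node.
# Extract Scala on the remaining nodes over SSH; /etc/profile on each node
# still needs the SCALA_HOME/PATH exports shown earlier.
for host in node2 node3 node4; do
  ssh "$host" "cd /data/tools && tar -zxvf scala-2.11.12.tgz"
done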
That completes the Scala installation.
Installing Spark 2.4.7
Downloading Spark 2.4.7
Download from the official site: link
After downloading, upload spark-2.4.7-bin-hadoop2.7.tgz to the /data/tools folder on node1 and extract it with the following command;
tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz
Modifying the Spark configuration files
The Spark configuration files live under /data/tools/spark-2.4.7-bin-hadoop2.7/conf, as listed below;
-rwxr-xr-x. 1 hadoop hadoop 996 9月 8 13:48 docker.properties.template
-rwxr-xr-x. 1 hadoop hadoop 1105 9月 8 13:48 fairscheduler.xml.template
-rwxr-xr-x. 1 hadoop hadoop 2025 9月 8 13:48 log4j.properties.template
-rwxr-xr-x. 1 hadoop hadoop 7801 9月 8 13:48 metrics.properties.template
-rwxr-xr-x. 1 hadoop hadoop 865 9月 8 13:48 slaves.template
-rwxr-xr-x. 1 hadoop hadoop 1292 9月 8 13:48 spark-defaults.conf.template
-rwxr-xr-x. 1 hadoop hadoop 4221 9月 8 13:48 spark-env.sh.template
The files that need to be edited are spark-env.sh, spark-defaults.conf, and slaves. They don't appear in the listing above because you have to copy them out of their .template files yourself.
Editing the Spark configuration file spark-env.sh
- First copy the file out of its template; the shell commands are as follows;
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf
# Copy out spark-env.sh
cp ./spark-env.sh.template ./spark-env.sh
- Edit spark-env.sh as follows. You only need to care about the uncommented lines at the very end, and the values should match your own machines. The line export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath) is mandatory, otherwise you will easily run into exceptions; adjust /data/tools/hadoop-2.8.5/bin/hadoop to your own Hadoop path rather than copying it verbatim;
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1 Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1 Disable multi-threading of OpenBLAS
export JAVA_HOME=/data/tools/jdk1.8.0_211
export SCALA_HOME=/data/tools/scala-2.11.12
export HADOOP_HOME=/data/tools/hadoop-2.8.5
export HADOOP_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop
export YARN_CONF_DIR=/data/tools/hadoop-2.8.5/etc/hadoop
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077
export SPARK_DRIVER_MEMORY=1G
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=2G
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=2G
export SPARK_WORKER_INSTANCES=1
export SPARK_LOG_DIR=/data/logs/spark/
export SPARK_WORKER_DIR=/data/logs/spark/worker
export SPARK_DIST_CLASSPATH=$(/data/tools/hadoop-2.8.5/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=100 -Dspark.history.fs.logDirectory=hdfs://dw-cluster:8020/opt/spark/applicationHistory"
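Since SPARK_DIST_CLASSPATH is built from the output of hadoop classpath, it is worth checking that the command actually resolves before going any further; a quick sanity check (with the Hadoop path adjusted to your own installation):
# Print the classpath Hadoop exposes; this is exactly what SPARK_DIST_CLASSPATH expands to.
# If it errors out or prints nothing, Spark will later fail with missing Hadoop classes.
/data/tools/hadoop-2.8.5/bin/hadoop classpath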
Editing the Spark configuration file spark-defaults.conf
- Likewise, copy the file out of its template first; the shell commands are as follows;
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf
# Copy out spark-defaults.conf
cp ./spark-defaults.conf.template ./spark-defaults.conf
- Edit spark-defaults.conf as follows. node1 is the hostname of one of my nodes, and hdfs://dw-cluster:8020/user/hive/warehouse is the root directory of my Hive data warehouse on an HA HDFS cluster;
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.master spark://node1:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://dw-cluster:8020/opt/spark/applicationHistory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.compress true
spark.yarn.historyServer.address http://node1:18018
spark.sql.warehouse.dir hdfs://dw-cluster:8020/user/hive/warehouse
spark.sql.parquet.enableVectorizedReader false
spark.sql.parquet.writeLegacyFormat true
spark.debug.maxToStringFields 100
spark.network.timeout 300000
spark.yarn.jars hdfs://dw-cluster/tmp/spark_jars/*.jar
spark.port.maxRetries 100
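One thing the template does not remind you of: spark.eventLog.dir and spark.yarn.jars above both point at HDFS paths that Spark will not create for you. A minimal sketch of preparing them, assuming the paths used in this post and that HDFS is already running:
# Event-log directory; applications fail to start if it is missing while eventLog is enabled
hdfs dfs -mkdir -p hdfs://dw-cluster:8020/opt/spark/applicationHistory
# Jars referenced by spark.yarn.jars; uploading them once avoids re-shipping
# the local jars directory on every YARN submission
hdfs dfs -mkdir -p hdfs://dw-cluster/tmp/spark_jars
hdfs dfs -put /data/tools/spark-2.4.7-bin-hadoop2.7/jars/*.jar hdfs://dw-cluster/tmp/spark_jars/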
Editing the Spark configuration file slaves
- Likewise, copy the file out of its template first; the shell commands are as follows;
# Go to /data/tools/spark-2.4.7-bin-hadoop2.7/conf
cd /data/tools/spark-2.4.7-bin-hadoop2.7/conf
# Copy out slaves
cp ./slaves.template ./slaves
- Edit slaves as follows; it simply lists the Spark worker nodes;
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# A Spark Worker will be started on each of the machines listed below.
node2
node3
node4
Adding hive-site.xml
Copy the configuration file of your existing Hive installation into Spark's conf directory with the command below, adjusting the Hive and Spark paths to your own layout;
cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/
The point is to enable Spark on Hive, i.e. reading and manipulating the data stored in Hive through spark-sql.
Adding mysql-connector-java-8.0.13.jar
Because the Hive metastore lives in MySQL, the MySQL JDBC driver mysql-connector-java-8.0.13.jar that Hive uses to reach it must be copied over as well, as follows;
cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/
With that, Spark on Hive actually works.
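If you want a quick sanity check of the metastore hookup before the cluster is even started, spark-sql can be run in local mode. This is only a sketch: it assumes HDFS and the Hive metastore (with its MySQL backend) are up, and that the event-log directory configured in spark-defaults.conf already exists on HDFS.
# Local-mode smoke test against the Hive metastore; --master local[2] overrides the
# spark://node1:7077 master set in spark-defaults.conf. The full path is used because
# SPARK_HOME is not on PATH yet at this point.
/data/tools/spark-2.4.7-bin-hadoop2.7/bin/spark-sql --master local[2] -e "show databases;"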
All of the configuration files are now in place!
Copying Spark to the other nodes
First scp spark-2.4.7-bin-hadoop2.7.tgz from node1 to the other nodes;
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node2:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node3:/data/tools/
scp /data/tools/spark-2.4.7-bin-hadoop2.7.tgz node4:/data/tools/
Use MobaXterm's Write commands on all terminals to extract spark-2.4.7-bin-hadoop2.7.tgz on each node, then copy all of the configuration files from node1 to the other nodes as follows;
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node2:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node3:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
scp /data/tools/spark-2.4.7-bin-hadoop2.7/conf/* node4:/data/tools/spark-2.4.7-bin-hadoop2.7/conf/
Configuring the Spark 2.4.7 environment variables
Use MobaXterm's Write commands on all terminals to set up the Spark environment variables on node1, node2, node3, and node4: run sudo vim /etc/profile and append the following at the end;
export SPARK_HOME=/data/tools/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
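After appending those lines, reload the profile and confirm the binaries resolve; a quick check might look like this:
# Reload the profile so the new variables take effect in the current shell
source /etc/profile
# Both should now resolve under /data/tools/spark-2.4.7-bin-hadoop2.7/bin
which spark-sql spark-shell
# Prints the Spark 2.4.7 welcome banner and version
spark-submit --version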
Starting Spark 2.4.7
- Pick any node to work from; here we use node1. First start the Spark services as follows;
[hadoop@node1 conf]$ cd /data/tools/spark-2.4.7-bin-hadoop2.7/sbin/
[hadoop@node1 sbin]$ ll
總用量 92
-rwxr-xr-x. 1 hadoop hadoop 2803 9月 8 13:48 slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1429 9月 8 13:48 spark-config.sh
-rwxr-xr-x. 1 hadoop hadoop 5689 9月 8 13:48 spark-daemon.sh
-rwxr-xr-x. 1 hadoop hadoop 1262 9月 8 13:48 spark-daemons.sh
-rwxr-xr-x. 1 hadoop hadoop 1190 9月 8 13:48 start-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1274 9月 8 13:48 start-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 2050 9月 8 13:48 start-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1877 9月 8 13:48 start-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1423 9月 8 13:48 start-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1279 9月 8 13:48 start-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 3151 9月 8 13:48 start-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1527 9月 8 13:48 start-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1857 9月 8 13:48 start-thriftserver.sh
-rwxr-xr-x. 1 hadoop hadoop 1478 9月 8 13:48 stop-all.sh
-rwxr-xr-x. 1 hadoop hadoop 1056 9月 8 13:48 stop-history-server.sh
-rwxr-xr-x. 1 hadoop hadoop 1080 9月 8 13:48 stop-master.sh
-rwxr-xr-x. 1 hadoop hadoop 1227 9月 8 13:48 stop-mesos-dispatcher.sh
-rwxr-xr-x. 1 hadoop hadoop 1084 9月 8 13:48 stop-mesos-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1067 9月 8 13:48 stop-shuffle-service.sh
-rwxr-xr-x. 1 hadoop hadoop 1557 9月 8 13:48 stop-slave.sh
-rwxr-xr-x. 1 hadoop hadoop 1064 9月 8 13:48 stop-slaves.sh
-rwxr-xr-x. 1 hadoop hadoop 1066 9月 8 13:48 stop-thriftserver.sh
[hadoop@node1 sbin]$ pwd
/data/tools/spark-2.4.7-bin-hadoop2.7/sbin
[hadoop@node1 sbin]$ ./start-all.sh
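Once start-all.sh returns, you can verify the daemons with jps: given the configuration above, node1 should show a Master process and node2/node3/node4 should each show a Worker. A quick sketch:
# On node1 (the master) a Master process should be listed
jps
# On the worker nodes a Worker process should appear; check them remotely, e.g.:
ssh node2 jps
ssh node3 jps
ssh node4 jps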
- Open spark-shell and spark-sql for some interactive tests;
- First test spark-sql: type spark-sql on node1, as follows;
[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……
After a flurry of loading, if nothing errors out you are basically in good shape; if something does, fix the reported issue. Then test a query against Hive. I created a student table in Hive beforehand with only two columns, id and sname, and a single row:
| id | sname  |
|----|--------|
| 1  | rowyet |
Then the spark-sql test query looks like this;
spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1 rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)
- Next, test spark-shell: type spark-shell on node1. It prints the Spark version banner and drops you into a Scala interactive shell, as shown below;
[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:20:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/14 03:20:37 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://node1:4041
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214032040-0004).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.7
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Now check that spark-shell actually works: initialize a DataFrame with a single column (the concept will be explained later; this is only a smoke test) and display it, as follows;
[hadoop@node1 ~]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/14 03:33:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20201214033402-0005).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.7
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val myRange = spark.range(1000).toDF("number")
myRange: org.apache.spark.sql.DataFrame = [number: bigint]
scala> myRange.show
+------+
|number|
+------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+------+
only showing top 20 rows
The Spark monitoring UI
Spark's default monitoring UI listens on port 8080: open http://node1:8080/ in a browser to see which jobs are running, which have finished, the cluster's resource usage, and so on.
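If the page does not come up in a browser, a quick check from the shell (assuming curl is installed) tells you whether the master web UI is actually listening:
# Fetch the first few lines of the standalone master's web UI on its default port 8080
curl -s http://node1:8080/ | head -n 5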
Starting Spark on YARN
Spark on YARN is the most widely used mode in the open-source world. It will be covered in detail later; this is just a teaser on how to launch it. The main parameters are as follows;
--master: which master to submit to
--deploy-mode: whether YARN runs in cluster mode or client mode
--executor-memory: memory per executor
--executor-cores: cores per executor
--num-executors: number of executors (with the values below this works out to 4 × 5 = 20 cores and 4 × 8G = 32G of memory)
--queue: which YARN queue the job runs in
--conf: extra configuration properties
# Launch spark-sql (interactive shells only support client deploy mode on YARN)
spark-sql --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m
# Launch spark-shell (client deploy mode as well)
spark-shell --master yarn --deploy-mode client --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m
#spark-submit
spark-submit --master yarn --deploy-mode cluster --executor-memory 8G --executor-cores 5 --num-executors 4 --queue etl --conf spark.kryoserializer.buffer.max=256m --conf spark.kryoserializer.buffer=64m --class cn.ruoying.esapp.App hdfs:///app/hive_to_es/etl_jar/SparkOnHiveToEs_v1.jar
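A common way to verify the YARN setup end to end is to submit the SparkPi example that ships with the distribution. A sketch, assuming the default examples jar location in the prebuilt package and the etl queue used above (swap in a queue and jar name that match your cluster and build):
# Submit the bundled SparkPi example to YARN as an end-to-end check
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 2G --executor-cores 2 --num-executors 2 --queue etl \
  --class org.apache.spark.examples.SparkPi \
  /data/tools/spark-2.4.7-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.7.jar 100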
And with that, the installation and configuration of Spark 2.4.7 is complete!