Spark Core Basic Interview Questions (Part 1)
1. Spark deployment modes and their characteristics
The cluster manager is resolved in SparkSubmit#prepareSubmitEnvironment:
// Set the cluster manager
val clusterManager: Int = args.master match {
case "yarn" => YARN
case m if m.startsWith("spark") => STANDALONE
case m if m.startsWith("mesos") => MESOS
case m if m.startsWith("k8s") => KUBERNETES
case m if m.startsWith("local") => LOCAL
case _ =>
error("Master must either be yarn or start with spark, mesos, k8s, or local")
-1
}
- Local mode
Spark does not have to run on a Hadoop cluster (typically YARN); it can also run locally by launching multiple threads in a single JVM. Running an application this way is mainly for debugging. Local mode has three variants (see the sketch right after this list):
- local: starts a single worker thread (one Executor)
- local[k]: starts k worker threads (k Executors)
- local[*]: starts as many worker threads as there are CPU cores
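A minimal sketch of a self-contained application run in local mode; the object name and app name are illustrative, only the master URL matters here:
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // "local[*]" uses as many worker threads as logical cores; "local[4]" would use exactly 4
    val conf = new SparkConf().setAppName("local-mode-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")  // 5050
    sc.stop()
  }
}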
- Standalone mode
A distributed cluster deployment that ships with a complete set of its own services; resource management and task monitoring are handled by Spark itself. This mode is also the foundation of the other modes.
- Spark on YARN
A distributed deployment in which resource management and task monitoring are delegated to YARN. The Spark client talks to YARN directly, so no separate Spark cluster needs to be built. There are two modes, yarn-client and yarn-cluster; the main difference is where the Driver process runs:
- cluster mode suits production: the Driver runs on a cluster node and can be recovered on failure
- client mode suits debugging: the Driver runs on the client machine
- Spark on K8s
Latest documentation:
http://spark.apache.org/docs/latest/running-on-kubernetes.html
- Standalone mode on Kubernetes
The first workable way to run Spark on a Kubernetes cluster was to deploy Spark in Standalone mode on top of it, but the community soon proposed a mode that uses the native Kubernetes scheduler, i.e. the Native mode.
- Kubernetes Native mode
In Native mode the Driver and the Executors each run as Pods. A job is submitted to the Kubernetes apiserver in much the same way it used to be submitted to YARN, for example:
$ bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
Here master is the address of the Kubernetes apiserver. After submission the job runs as follows: the Driver Pod is started first, and the Driver then launches the Executor Pods.
- Spark Operator
Google Cloud Platform (GCP) has open-sourced a Spark Operator on GitHub; the repository is at:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
A detailed usage guide is available here:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md
- There is also Mesos (omitted here).
2. What does the Driver program do?
A running Spark job includes one Driver process, which is the main process of the job: it contains the main function and a SparkContext instance, and it is the entry point of the program.
Responsibilities: it requests resources from the cluster, registers with the Master, schedules the job, parses the job, builds Stages, and dispatches Tasks to the Executors. It hosts both the DAGScheduler and the TaskScheduler.
3. Hadoop MapReduce and Spark are both parallel computing frameworks; what do they have in common and how do they differ?
Both follow the MapReduce model. In Hadoop, one unit of work is a Job (an Application on YARN), which is split into MapTasks and ReduceTasks; each Task runs in its own process, and the process exits when the Task finishes.
In Spark, a user submission is called an Application, and one Application corresponds to one SparkContext. An Application contains multiple Jobs: every time an Action is triggered, a Job is produced (as sketched below). These Jobs can run in parallel or sequentially. Each Job consists of multiple Stages, which the DAGScheduler carves out of the Job along shuffle boundaries using the dependency relationships between RDDs. Each Stage contains multiple Tasks, which form a TaskSet that the TaskScheduler distributes to the Executors. An Executor lives as long as the Application does, even when no Job is running, so Tasks start quickly and can read data already held in memory. Spark's iterative computation happens in memory, the API offers a rich set of RDD operations such as join and groupBy, and the DAG enables good fault tolerance.
A Hadoop job only has map and reduce operations, which limits expressiveness; intermediate results are repeatedly read from and written to HDFS, causing heavy IO, and dependencies between multiple Jobs have to be managed by the user.
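To make the Application/Job relationship concrete, a minimal sketch (assuming a spark-shell session, where sc is the predefined SparkContext); each Action below shows up as a separate Job in the Spark UI:
val words = sc.parallelize(Seq("a b", "b c", "a c"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache()            // keep the result in Executor memory across Jobs

words.count()         // Action 1 -> Job 1 (computes and caches the RDD)
words.collect()       // Action 2 -> Job 2 (served from the cached data)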
4. RDDs in Spark
RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
* for more details on RDD internals.
*/
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {...}
The five key properties of an RDD (a short introspection sketch follows this list):
- A list of partitions: the data in an RDD lives in a list of partitions
- A function for computing each split: the compute function applied to every partition
- A list of dependencies on other RDDs: an RDD may depend on several other RDDs (the basis of RDD fault tolerance)
- Optionally, a Partitioner for key-value RDDs (e.g. a hash-partitioned RDD)
- Optionally, a list of preferred locations to compute each split on: data locality, i.e. computing close to where the data lives
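These properties can be inspected from the RDD API itself; a short sketch (spark-shell session, predefined sc):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4).reduceByKey(_ + _)

println(pairs.getNumPartitions)     // property 1: the list of partitions (4 here)
println(pairs.partitioner)          // property 4: a HashPartitioner with 4 partitions for this key-value RDD
println(pairs.dependencies)         // property 3: the dependency on the parent RDD (a ShuffleDependency)
// property 2 is the compute function that runs on each partition;
// property 5, preferred locations, matters for sources such as HDFS blocks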
5. Wide vs. narrow dependencies; which of groupByKey, reduceByKey, map, filter, union are which
- Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD. This shows up either as one parent partition mapping to one child partition, or as partitions of two parent RDDs mapping to one child partition. map, filter and union fall into the first category; a join over co-partitioned inputs falls into the second.
- Wide dependency
A child partition depends on all partitions of the parent RDD; this is what Shuffle-style operations produce.
The essence of a Shuffle is a group-by: records with the same key or matching the same rule are brought together (classified via disk and network IO).
- Dependencies of the operators
Transformations such as map, filter and union generally produce narrow dependencies.
Wide dependencies generally come from operations such as groupByKey and reduceByKey, which repartition (shuffle) the data held in the RDD's partitions.
join can be either: if the input RDDs have already been partitioned with the same partitioner, the join is a narrow dependency; otherwise it is a wide dependency.
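The dependency type can be checked directly on the RDDs; a small sketch (spark-shell, predefined sc):
val base    = sc.parallelize(1 to 10, 4)
val mapped  = base.map(_ * 2)                                  // narrow dependency
val reduced = mapped.map(x => (x % 3, x)).reduceByKey(_ + _)   // wide dependency

println(mapped.dependencies)    // List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)   // List(org.apache.spark.ShuffleDependency@...)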
6. How does Spark avoid out-of-memory errors?
- Driver-side OOM
Increase the Driver memory setting:
# default 1g
spark.driver.memory
This parameter sets the Driver's memory. The SparkContext and the DAGScheduler run on the Driver, and the Stage splitting of RDDs is also done on the Driver. If the user program has many steps and produces a large number of Stages, that bookkeeping consumes Driver memory, and the Driver memory then has to be raised.
- OOM caused by creating many objects inside map
This happens when a single map call creates a huge number of objects, e.g. rdd.map(x => for(i <- 1 to 10000) yield i.toString): every element produces 10,000 new objects, which easily exhausts memory. Without adding memory, the fix is to shrink the amount of data each Task handles so that the Executor can hold even the extra objects. Concretely, call repartition before the object-heavy map so that smaller partitions are passed into it, for example:
rdd.repartition(10000).map(x => for(i <- 1 to 10000) yield i.toString)
Note that rdd.coalesce (with its default shuffle = false) cannot be used here: it can only reduce the number of partitions, not increase it, because it does not perform a Shuffle.
- OOM caused by skewed data
Skewed data can cause not only OOM but also performance problems. The remedy is similar to the above: call repartition to redistribute the data.
- OOM after a Shuffle
Shuffle-related OOM is almost always caused by a single post-shuffle partition being too large. Operations such as join and reduceByKey involve a Shuffle and need a Partitioner. Most Shuffle operations in Spark default to HashPartitioner, whose default number of partitions is the largest partition count among the parent RDDs; it is controlled by
# only effective for the default HashPartitioner
spark.default.parallelism
# Spark SQL uses this parameter instead
spark.sql.shuffle.partitions
If the Shuffle OOM is caused by a different Partitioner, the number of partitions has to be increased in that Partitioner's code.
- Use rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) instead of rdd.cache()
When memory runs short, rdd.cache() simply evicts cached blocks and recomputes them on the next use. With StorageLevel.MEMORY_AND_DISK_SER, blocks that do not fit in memory are serialized and spilled to disk, avoiding recomputation at the cost of some extra IO.
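A minimal sketch of the last point (spark-shell, predefined sc; the input path is illustrative):
import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("/tmp/input.txt").filter(_.nonEmpty)
// unlike cache() (MEMORY_ONLY), blocks that do not fit in memory are
// serialized and spilled to disk instead of being dropped and recomputed
cleaned.persist(StorageLevel.MEMORY_AND_DISK_SER)
println(cleaned.count())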
7. Stage, Task and Job: differences and how they are divided
- Job: a parallel computation made up of multiple tasks; a Job is created whenever an Action is executed on an RDD
- Stage: each Job is split into smaller groups of Tasks called Stages; Stages depend on one another and are executed in order (pipelined within a Stage)
- Task: a unit of work sent to an Executor; it is the execution unit of a Stage. In general an RDD with N partitions produces N Tasks, because each Task processes the data of exactly one partition. A short sketch follows below, then the relevant source.
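A small sketch of how the three map onto a concrete program (spark-shell, predefined sc); the counts below can be verified in the Spark UI:
val lines  = sc.parallelize(Seq("a b", "b c", "a c"), 4)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.collect()
// collect() is the only Action            -> 1 Job
// reduceByKey introduces one Shuffle      -> the Job is split into 2 Stages
// each Stage runs one Task per partition  -> 4 Tasks per Stage here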
Stage
private[scheduler] abstract class Stage(
val id: Int,
val rdd: RDD[_],
val numTasks: Int,
val parents: List[Stage],
val firstJobId: Int,
val callSite: CallSite,
val resourceProfileId: Int)
extends Logging {
val numPartitions = rdd.partitions.length
/** Set of jobs that this stage belongs to. */
val jobIds = new HashSet[Int]
/** The ID to use for the next new attempt for this stage. */
private var nextAttemptId: Int = 0
val name: String = callSite.shortForm
val details: String = callSite.longForm
/**
* Pointer to the [[StageInfo]] object for the most recent attempt. This needs to be initialized
* here, before any attempts have actually been created, because the DAGScheduler uses this
* StageInfo to tell SparkListeners when a job starts (which happens before any stage attempts
* have been created).
*/
private var _latestInfo: StageInfo =
StageInfo.fromStage(this, nextAttemptId, resourceProfileId = resourceProfileId)
/**
* Set of stage attempt IDs that have failed. We keep track of these failures in order to avoid
* endless retries if a stage keeps failing.
* We keep track of each attempt ID that has failed to avoid recording duplicate failures if
* multiple tasks from the same stage attempt fail (SPARK-5945).
*/
val failedAttemptIds = new HashSet[Int]
private[scheduler] def clearFailures() : Unit = {
failedAttemptIds.clear()
}
/** Creates a new attempt for this stage by creating a new StageInfo with a new attempt ID. */
def makeNewStageAttempt(
numPartitionsToCompute: Int,
taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit = {
val metrics = new TaskMetrics
metrics.register(rdd.sparkContext)
_latestInfo = StageInfo.fromStage(
this, nextAttemptId, Some(numPartitionsToCompute), metrics, taskLocalityPreferences,
resourceProfileId = resourceProfileId)
nextAttemptId += 1
}
/** Returns the StageInfo for the most recent attempt for this stage. */
def latestInfo: StageInfo = _latestInfo
override final def hashCode(): Int = id
override final def equals(other: Any): Boolean = other match {
case stage: Stage => stage != null && stage.id == id
case _ => false
}
/** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
def findMissingPartitions(): Seq[Int]
def isIndeterminate: Boolean = {
rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
}
}
Task
private[spark] abstract class Task[T](
val stageId: Int,
val stageAttemptId: Int,
val partitionId: Int,
@transient var localProperties: Properties = new Properties,
// The default value is only used in tests.
serializedTaskMetrics: Array[Byte] =
SparkEnv.get.closureSerializer.newInstance().serialize(TaskMetrics.registered).array(),
val jobId: Option[Int] = None,
val appId: Option[String] = None,
val appAttemptId: Option[String] = None,
val isBarrier: Boolean = false) extends Serializable {
@transient lazy val metrics: TaskMetrics =
SparkEnv.get.closureSerializer.newInstance().deserialize(ByteBuffer.wrap(serializedTaskMetrics))
/**
* Called by [[org.apache.spark.executor.Executor]] to run this task.
*
* @param taskAttemptId an identifier for this task attempt that is unique within a SparkContext.
* @param attemptNumber how many times this task has been attempted (0 for the first attempt)
* @param resources other host resources (like gpus) that this task attempt can access
* @return the result of the task along with updates of Accumulators.
*/
final def run(
taskAttemptId: Long,
attemptNumber: Int,
metricsSystem: MetricsSystem,
resources: Map[String, ResourceInformation]): T = {
SparkEnv.get.blockManager.registerTask(taskAttemptId)
...
}
...
}
TaskSet
/**
* A set of tasks submitted together to the low-level TaskScheduler, usually representing
* missing partitions of a particular stage.
*/
private[spark] class TaskSet(
val tasks: Array[Task[_]],
val stageId: Int,
val stageAttemptId: Int,
val priority: Int,
val properties: Properties,
val resourceProfileId: Int) {
val id: String = stageId + "." + stageAttemptId
override def toString: String = "TaskSet " + id
}
8. spark-submit parameters
The source is spark/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
The most important parameters when submitting a job:
- executor-cores: number of cores per executor, default 1; 2-5 is recommended and 4 is a common choice
Spark standalone, YARN and Kubernetes only:
--executor-cores NUM Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all available cores on the worker in standalone mode).
There is also a test suite in the source tree:
spark/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
"--executor-cores", "5",
- num-executors: number of executors to launch, default 2
Spark on YARN and Kubernetes only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of executors will be at least NUM.
- executor-memory: memory per executor, default 1 GB
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
- driver-cores: number of cores used by the Driver, default 1
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
- driver-memory: memory for the Driver, default 1 GB (spark.driver.memory)
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: ${mem_mb}M).
For example, submitting on YARN:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-cores 2 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 2 \
--num-executors 2 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
9. reduceByKey vs. groupByKey: differences and usage
reduceByKey merges the multiple values of each key, and crucially it can perform the merge locally on the map side first; the merge function is user-defined.
JavaPairRDD#reduceByKey works similarly to a "combiner" in MapReduce:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V] =
fromRDD(rdd.reduceByKey(partitioner, func))
groupByKey also operates on all the values of each key, but it only collects them into a sequence; it does not take a user function itself, so any aggregation needs an additional map(func) afterwards.
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
*
* @note If you are grouping in order to perform an aggregation (such as a sum or average) over
* each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
* will provide much better performance.
*/
def groupByKey(partitioner: Partitioner): JavaPairRDD[K, JIterable[V]] =
fromRDD(groupByResultToJava(rdd.groupByKey(partitioner)))
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with into `numPartitions` partitions.
*
* @note If you are grouping in order to perform an aggregation (such as a sum or average) over
* each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
* will provide much better performance.
*/
def groupByKey(numPartitions: Int): JavaPairRDD[K, JIterable[V]] =
fromRDD(groupByResultToJava(rdd.groupByKey(numPartitions)))
On large datasets reduceByKey(func) performs better than groupByKey(), because reduceByKey combines records before the Shuffle and therefore moves much less data over the network than groupByKey.
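A side-by-side sketch (spark-shell, predefined sc); both produce the same result, but reduceByKey combines on the map side before the Shuffle:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)

// partial sums are computed per partition, then shuffled
val viaReduce = pairs.reduceByKey(_ + _)

// every record is shuffled, aggregation happens only afterwards
val viaGroup = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }

viaReduce.collect()   // (a,2) and (b,1); partition order may vary
viaGroup.collect()    // same result, more data moved over the network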
10. foreach vs. map
Looking at the source first, foreach is the first method in the Actions section of RDD:
// Actions (launch a job to return a value to the user program)
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
map is the first method in the Transformations section of RDD:
// Transformations (return a new RDD)
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}
What they have in common: both traverse the dataset and apply the given function to every element.
Differences (a small sketch follows the list):
- foreach returns nothing (strictly, Unit), while map returns a new dataset; foreach traverses a dataset for side effects, map transforms one dataset into another.
- foreach is evaluated eagerly: calling it immediately runs a Job on the executors; map is lazy and does nothing until an Action is triggered.
- map is a Transformation operator, foreach is an Action operator.
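A minimal sketch of the difference (spark-shell, predefined sc):
val nums = sc.parallelize(1 to 5)

val doubled = nums.map(_ * 2)   // Transformation: lazy, returns a new RDD, nothing runs yet
doubled.collect()               // Array(2, 4, 6, 8, 10) -- the Action triggers the Job

nums.foreach(println)           // Action: runs a Job immediately and returns Unit;
                                // the println runs on the executors, so in cluster mode
                                // the output is in the executor logs, not on the driver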
11. map vs. mapPartitions
mapPartitions:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
What they have in common: map and mapPartitions are both Transformation operators.
Differences (a small sketch follows the list):
- In essence
  - map applies the function to each individual element of the RDD
  - mapPartitions applies the function to the iterator of each partition of the RDD
- When each partition holds a modest amount of data
  - map is relatively slow: for a partition with 10,000 records the function is invoked 10,000 times
  - mapPartitions is faster: the function is invoked once per Task and receives the whole partition's iterator, so a single call handles all of the data
- When each partition holds a very large amount of data, say 1,000,000 records
  - map still completes normally
  - mapPartitions hands an entire partition to one function call, which may not fit in memory at once (especially if the iterator is materialized) and can cause an OOM
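A small sketch of the per-partition pattern (spark-shell, predefined sc); the per-partition setup here is just a string prefix, standing in for something expensive such as a database connection:
val data = sc.parallelize(1 to 10, 2)

val viaMap = data.map(x => "row-" + x)            // the function is invoked once per element

val viaPartitions = data.mapPartitions { iter =>
  val prefix = "row-"                             // set up once per partition
  iter.map(x => prefix + x)                       // keep it an iterator; materializing a huge
}                                                 // partition (e.g. iter.toList) risks an OOM

viaPartitions.collect()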
12. foreach vs. foreachPartition
What they have in common: both are Action operators.
Differences (a small sketch follows the source excerpt below):
- foreach processes one record of the RDD at a time
- foreachPartition processes the iterator of one whole partition of the RDD at a time
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
13. groupByKey vs. reduceByKey vs. combineByKey
Question 9 above already covers reduceByKey vs. groupByKey.
- groupByKey
  - operates per key and collects the values into a sequence
  - takes no user-defined function itself
  - moves every key-value pair across the network without local merging
  - therefore causes heavy traffic between cluster nodes and adds transfer latency
- reduceByKey
  - merges the multiple values of each key
  - performs the merge locally on the map side first
  - the merge function is user-defined
- combineByKey
  - combineByKey is a lower-level operator
  - reduceByKey is implemented on top of combineByKey (more precisely, combineByKeyWithClassTag); a sketch of combineByKey follows below
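A sketch of combineByKey computing a per-key average, showing the three functions it takes (spark-shell, predefined sc):
val scores = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)), 2)

val avg = scores.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner: first value of a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: fold further values within a partition
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: merge partial results across partitions
).mapValues { case (sum, cnt) => sum.toDouble / cnt }

avg.collect()   // (a,80.0), (b,80.0)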
14. Is sortByKey a global sort?
Yes, sortByKey produces a globally sorted result. RDD#sortBy:
/**
* Return this RDD sorted by the given key function.
*/
def sortBy[K](
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
}
OrderedRDDFunctions#sortByKey
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
- Before sorting, the data is partitioned by key range with a RangePartitioner (sortBy first turns the records into key-value pairs via keyBy).
- This guarantees that all keys in partition p1 are smaller than those in p2, all keys in p2 are smaller than those in p3, and so on (p1..pn are the partition ids).
- Each partition is then sorted internally, so reading the partitions in order yields a globally sorted dataset (sketch below).
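A minimal sketch (spark-shell, predefined sc): even with several output partitions the concatenated result is globally ordered:
val kv = sc.parallelize(Seq((3, "c"), (1, "a"), (5, "e"), (2, "b"), (4, "d")), 3)

val sorted = kv.sortByKey(ascending = true, numPartitions = 2)
// the RangePartitioner puts the smaller key range into partition 0 and the larger into partition 1,
// and each partition is sorted internally, so collect() returns a globally ordered array
sorted.collect()   // Array((1,a), (2,b), (3,c), (4,d), (5,e))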
15. coalesce vs. repartition
The source first:
coalesce, with shuffle: Boolean = false by default
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T] = withScope {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
if (shuffle) {
/** Distributes elements evenly across output partitions, starting from a random partition. */
val distributePartition = (index: Int, items: Iterator[T]) => {
var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
items.map { t =>
// Note that the hash code of the key will just be the key itself. The HashPartitioner
// will mod it with the number of total partitions.
position = position + 1
(position, t)
}
} : Iterator[(Int, T)]
// include a shuffle step so that our upstream tasks are still distributed
new CoalescedRDD(
new ShuffledRDD[Int, T, T](
mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
new HashPartitioner(numPartitions)),
numPartitions,
partitionCoalescer).values
} else {
new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
}
repartition: coalesce with shuffle = true
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
It is often assumed that coalesce is more efficient than repartition because it avoids a Shuffle, but in practice it depends on the case; coalesce is not always faster and can even be a trap, so use it with care.
Both operators re-divide the partitions of an RDD; repartition simply calls coalesce with the shuffle flag, which defaults to false, set to true.
Suppose an RDD has N partitions and needs to be re-divided into M partitions (a small sketch follows this list):
- If N < M: the N partitions usually hold unevenly distributed data, so a HashPartitioner is used to redistribute the data into M partitions, and shuffle must be set to true.
- If N > M and they are of the same order of magnitude (say N = 1000, M = 100): several of the N partitions can simply be merged into each new partition, so shuffle can stay false. If M > N with shuffle = false, coalesce has no effect: without a Shuffle the parent and child RDDs are in a narrow dependency and the number of partitions cannot grow. In short, with shuffle = false, asking for more partitions than currently exist leaves the partition count unchanged; the number of partitions cannot be increased without a Shuffle.
- If N > M and N is vastly larger than M, the ratio of executors to target partitions matters. If the number of executors is <= the number of target partitions, coalesce is efficient; otherwise coalesce may leave (executors - target partitions) executors idle and hurt overall throughput. When M = 1, it can be worth setting shuffle = true so that the computation before the coalesce still runs with full parallelism.
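A quick sketch of the partition-count behaviour (spark-shell, predefined sc):
val rdd = sc.parallelize(1 to 1000, 1000)

rdd.coalesce(100).getNumPartitions                    // 100  -- merged without a Shuffle (narrow dependency)
rdd.coalesce(2000).getNumPartitions                   // 1000 -- cannot grow without a Shuffle, count unchanged
rdd.coalesce(2000, shuffle = true).getNumPartitions   // 2000 -- the Shuffle redistributes the data
rdd.repartition(2000).getNumPartitions                // 2000 -- repartition == coalesce(..., shuffle = true)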