CS190 Scalable Machine Learning: Spark Introduction
Tags (space-separated): Spark ML
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is computed only when an action runs on it (see the sketch below)
• RDDs can be persisted (cached) in memory or on disk
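A minimal sketch of this laziness in the pySpark shell (assuming the shell's usual SparkContext, sc):
>>> rdd = sc.parallelize(range(1000))
>>> doubled = rdd.map(lambda x: x * 2)  # returns immediately; nothing is computed yet
>>> doubled.cache()                     # only marks the RDD for caching; still nothing runs
>>> doubled.count()                     # action: the whole recipe executes now
Value: 1000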
Working with RDDs
• Create an RDD from a data source (e.g. a file, or a collection via parallelize)
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count
Some Transformations
Transformation | Description |
---|---|
map(func) | return a new distributed dataset formed by passing each element of the source through a function func |
filter(func) | return a new dataset formed by selecting those elements of the source on which func returns true |
distinct([numTasks]) | return a new dataset that contains the distinct elements of the source dataset |
flatMap(func) | similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item) |
Examples:
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2)
RDD: [1, 2, 3, 4] → [2, 4, 6, 8]
>>> rdd.filter(lambda x: x % 2 == 0)
RDD: [1, 2, 3, 4] → [2, 4]
>>> rdd2 = sc.parallelize([1, 4, 2, 2, 3])
>>> rdd2.distinct()
RDD: [1, 4, 2, 2, 3] → [1, 4, 2, 3]
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(lambda x: [x, x+5])
RDD: [1, 2, 3] → [[1, 6], [2, 7], [3, 8]]
>>> rdd.flatMap(lambda x: [x, x+5])
RDD: [1, 2, 3] → [1, 6, 2, 7, 3, 8]
Spark Actions
• Cause Spark to execute the recipe that transforms the source data
• Mechanism for getting results out of Spark
Action | Description |
---|---|
reduce(func) | aggregate the dataset's elements using the function func; func takes two arguments and returns one, and is commutative and associative so that it can be computed correctly in parallel |
take(n) | return an array with the first n elements |
collect() | return all the elements as an array. WARNING: make sure the result will fit in the driver program |
takeOrdered(n, key=func) | return n elements in ascending order, or as specified by the optional key function |
Examples:
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a * b)
Value: 6
#(1 * 2 * 3)
>>> rdd.take(2)
Value: [1,2] # as list
>>> rdd.collect()
Value: [1,2,3] # as list
>>> rdd = sc.parallelize([5,3,1,2])
>>> rdd.takeOrdered(3, lambda s: -1 * s)
Value: [5,3,2] # as list
count() and cache() together:
lines = sc.textFile("...", 4)           # read the file into 4 partitions (path elided)
lines.cache()                           # save, don't recompute!
comments = lines.filter(isComment)      # isComment: a user-defined predicate
print lines.count(), comments.count()   # both actions reuse the cached lines
Spark Program Lifecycle
- Create RDDs from external data or parallelize a collection in your driver program
- Lazily transform them into new RDDs
- cache() some RDDs for reuse
- Perform actions to execute parallel computation and produce results (see the sketch below)
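A minimal sketch tying the four steps together (assuming a SparkContext sc and a hypothetical input file logs.txt):
# 1. Create an RDD from external data
lines = sc.textFile("logs.txt")
# 2. Lazily transform it into a new RDD
errors = lines.filter(lambda line: "ERROR" in line)
# 3. cache() the RDD because it is used by more than one action
errors.cache()
# 4. Perform actions to execute the parallel computation and produce results
numErrors = errors.count()
firstFive = errors.take(5)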
Key-Value RDDs
Key-Value Transformation | Description |
---|---|
reduceByKey(func) | return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V |
sortByKey() | return a new dataset of (K, V) pairs sorted by keys in ascending order |
groupByKey() | return a new dataset of (K, Iterable<V>) pairs |
! Be careful when using groupByKey(): it can require a large amount of data to be moved across the network, and the resulting per-key lists can be very large, potentially exhausting a worker's memory (see the comparison sketch after the examples below).
>>> rdd = sc.parallelize([(1,2), (3,4), (3,6)])
>>> rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]
>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.sortByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] → [(1,'a'), (1,'b'), (2,'c')]
>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.groupByKey()
RDD: [(1,'a'), (1,'b'), (2,'c')] → [(1,['a','b']), (2,['c'])]
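To make the note above concrete, here is a small comparison sketch (not from the course): both pipelines compute the same per-key sums, but reduceByKey combines values within each partition before the shuffle, while groupByKey first ships every individual value across the network.
>>> pairs = sc.parallelize([(1, 2), (3, 4), (3, 6)])
>>> pairs.reduceByKey(lambda a, b: a + b).collect()  # partial sums per partition, then shuffle
Value: [(1, 2), (3, 10)]  # order may vary
>>> pairs.groupByKey().mapValues(sum).collect()      # every value shuffled, then summed
Value: [(1, 2), (3, 10)]  # same result, more network traffic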
pySpark Shared Variables
Broadcast Variables
» Efficiently send large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
# Country code lookup for HAM radio call signs.
# Look up the locations of the call signs on the RDD contactCounts.
# We load a table mapping call-sign prefixes to country codes to support this lookup.
signPrefixes = sc.broadcast(loadCallSignTable())

def processSignCount(sign_count):
    # sign_count is a (callSign, count) pair; read the broadcast value on the worker
    country = lookupCountry(sign_count[0], signPrefixes.value)
    count = sign_count[1]
    return (country, count)

countryContactCounts = (contactCounts
                        .map(processSignCount)
                        .reduceByKey(lambda x, y: x + y))
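The example above relies on helpers defined elsewhere in the course (loadCallSignTable, lookupCountry, the RDD contactCounts). A self-contained sketch of the same pattern, using a made-up lookup table and data:
# Hypothetical read-only lookup table, broadcast once to all workers
codeToName = sc.broadcast({"US": "United States", "DE": "Germany"})
counts = sc.parallelize([("US", 3), ("DE", 1), ("US", 2)])

byName = (counts
          .map(lambda kv: (codeToName.value.get(kv[0], "Unknown"), kv[1]))
          .reduceByKey(lambda x, y: x + y))
byName.collect()  # e.g. [('United States', 5), ('Germany', 1)]; order may vary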
Accumulators
» Aggregate values from workers back to driver
» Only driver can access value of accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in RDD across workers
# Counting empty lines
file = sc.textFile(inputFile)
# Create Accumulator[Int] initialized to 0
blankLines = sc.accumulator(0)

def extractCallSigns(line):
    global blankLines  # make the global accumulator accessible inside the task
    if line == "":
        blankLines += 1
    return line.split(" ")

callSigns = file.flatMap(extractCallSigns)
callSigns.count()  # flatMap is lazy; the accumulator is only updated once an action runs
print "Blank lines: %d" % blankLines.value
---
Further reading:
Introduction to Big Data with Apache Spark
pySpark documentation
pyspark-pictures