Copyright notice: This technical column is a summary and distillation of the author's (Qin Kaixin) day-to-day work. The cases are drawn from real commercial environments, and the column also provides tuning advice for commercial applications and cluster capacity planning. Please keep following this blog series. QQ mailbox: 1120746959@qq.com; feel free to get in touch for any academic exchange.
1 Python Spark SQL Basic Data Processing
- Python Spark DataFrame basics
df = spark.read.parquet('/sql/users.parquet')
df.show()

+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
- Python Spark DataFrame aggregation statistics
Sample of /sql/customers.csv:

CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0001,Male,19,15,39
0002,Male,21,15,81
0003,Female,20,16,6
0004,Female,23,16,77
0005,Female,31,17,40
0006,Female,22,17,76

df = spark.read.csv('/sql/customers.csv',header=True)
df.printSchema()
df.show()

root
 |-- CustomerID: string (nullable = true)
 |-- Genre: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Annual Income (k$): string (nullable = true)
 |-- Spending Score (1-100): string (nullable = true)

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6|
| 0008|Female| 23| 18| 94|
| 0009| Male| 64| 19| 3|
| 0010|Female| 30| 19| 72|
| 0011| Male| 67| 19| 14|
| 0012|Female| 35| 19| 99|
| 0013|Female| 58| 20| 15|
| 0014|Female| 24| 20| 77|
| 0015| Male| 37| 20| 13|
| 0016| Male| 22| 20| 79|
| 0017|Female| 35| 21| 35|
| 0018| Male| 20| 21| 66|
| 0019| Male| 52| 23| 29|
| 0020|Female| 35| 23| 98|
+----------+------+---+------------------+----------------------+

df.agg({"Age": "max","Annual Income (k$)":"mean","Spending Score (1-100)":"mean"}).show()

+---------------------------+-----------------------+--------+
|avg(Spending Score (1-100))|avg(Annual Income (k$))|max(Age)|
+---------------------------+-----------------------+--------+
| 50.2| 60.56| 70|
+---------------------------+-----------------------+--------+
- alias(alias) defines an alias for a DataFrame; later expressions can refer to the DataFrame through that alias, for example in a self-join:
from pyspark.sql.functions import col

df1 = df.alias('cus1')
type(df1)
df2 = df.alias('cus2')
df3 = df1.join(df2, col('cus1.CustomerId') == col('cus2.CustomerId'), 'inner')
df3.count()
200
df3.show(10)

+----------+------+---+------------------+----------------------+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6| 0007|Female| 35| 18| 6|
| 0008|Female| 23| 18| 94| 0008|Female| 23| 18| 94|
| 0009| Male| 64| 19| 3| 0009| Male| 64| 19| 3|
| 0010|Female| 30| 19| 72| 0010|Female| 30| 19| 72|
+----------+------+---+------------------+----------------------+----------+------+---+------------------+----------------------+
only showing top 10 rows
- cache() caches the DataFrame at the corresponding StorageLevel; the default is MEMORY_AND_DISK.
df = spark.read.csv('/sql/customers.csv',header=True)
a = df.cache()
a.show()

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6|
| 0008|Female| 23| 18| 94|
| 0009| Male| 64| 19| 3|
| 0010|Female| 30| 19| 72|
| 0011| Male| 67| 19| 14|
| 0012|Female| 35| 19| 99|
- checkpoint(eager=True) sets a checkpoint on the DataFrame. Introduced in Spark 2.1, this call truncates the logical plan built on the DataFrame and persists its state to checkpoint files, cutting the lineage.
sc
sc.setCheckpointDir('/datas/checkpoint')
a.checkpoint()
a.show()
- coalesce(numPartitions) repartitions the DataFrame; the argument is the desired number of partitions.
Note: a DataFrame created through the read methods has one partition per input file by default, so when the current number of partitions is already smaller than the number passed to coalesce, the call has no effect.

df = spark.read.csv('/sql/customers.csv',header=True)
df.rdd.getNumPartitions()
1
spark.read.csv('/sql/customers.csv',header=True).coalesce(3).rdd.getNumPartitions()
1
df = spark.range(0,20,2,3)
df.rdd.getNumPartitions()
df.coalesce(2).rdd.getNumPartitions()
2
- repartition(numPartitions, *cols) also repartitions the DataFrame, like coalesce(numPartitions), but repartition hash-partitions the data and shuffles it across the whole cluster, which is more expensive. Besides the partition count, repartition can also be told which columns to partition by.
df = spark.read.csv('/sql/customers.csv',header=True)
df.rdd.getNumPartitions()
1
df2 = df.repartition(3)
df2.rdd.getNumPartitions()
3
df2.columns
df3 = df2.repartition(6,'Genre')
df3.show(20)

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6|
| 0008|Female| 23| 18| 94|
| 0010|Female| 30| 19| 72|
| 0012|Female| 35| 19| 99|
| 0013|Female| 58| 20| 15|

df3.rdd.getNumPartitions()
6
- colRegex(colName) selects the columns whose names match a regular expression.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "a"])
df.select(df.colRegex("`(Col1)?+.+`")).show()

+---+
| a|
+---+
| 1|
| 2|
| 3|
+---+
- collect() returns all rows of the DataFrame to the driver; with large datasets this can easily cause the driver node to run out of memory!
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "a"])
df.collect()
[Row(Col1='a', a=1), Row(Col1='b', a=2), Row(Col1='c', a=3)]
- columns returns all column names of the DataFrame as a list.
df = spark.read.csv('/sql/customers.csv',header=True)
df.columns
['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
- Converting a Spark SQL DataFrame into a pandas DataFrame
df = spark.read.csv('/sql/customers.csv',header=True)
pdf = df.toPandas()
- pandas data-processing operations
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID                200 non-null object
Genre                     200 non-null object
Age                       200 non-null object
Annual Income (k$)        200 non-null object
Spending Score (1-100)    200 non-null object
dtypes: object(5)
memory usage: 7.9+ KB

pdf['Age'] = pdf['Age'].astype('int')
pdf["Annual Income (k$)"] = pdf["Annual Income (k$)"].astype('int')
pdf["Spending Score (1-100)"] = pdf["Spending Score (1-100)"].astype('int')
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID                200 non-null object
Genre                     200 non-null object
Age                       200 non-null int64
Annual Income (k$)        200 non-null int64
Spending Score (1-100)    200 non-null int64
dtypes: int64(3), object(2)
memory usage: 7.9+ KB
- Converting a pandas DataFrame into a Spark SQL DataFrame
df1 = spark.createDataFrame(pdf)
df1.corr("Age","Annual Income (k$)")
df1.corr("Spending Score (1-100)","Annual Income (k$)")
0.009902848094037492
- count() returns the number of Rows in the DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
df.count()
200
- createGlobalTempView(name) creates a global temporary view from the DataFrame. Its lifetime is tied to the Spark application: the view stays accessible as long as the application is running and disappears once SparkContext.stop() is called. Global temporary views are stored in the global_temp database.
df = spark.read.csv('/sql/customers.csv',header=True)
df.createGlobalTempView('TT')   # raises an error if a view named TT already exists
spark.sql('select * from global_temp.TT').show()

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6|
- createOrReplaceGlobalTempView(name): the previous method fails when a view with the same name already exists, whereas this one replaces an existing view and creates it otherwise.
df = spark.read.csv('/sql/customers.csv',header=True)
df.createOrReplaceGlobalTempView('TT')
spark.sql('select * from global_temp.TT').show()

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
- crossJoin(other) returns the Cartesian product of the two DataFrames. Use it with care: it is very slow and resource-hungry.
df1 = spark.createDataFrame([('regan',27),('ting',24)],schema=['name','age'])
df2 = spark.createDataFrame([('regan',65),('ting',48)],schema=['name','weight'])
df3 = df1.coalesce(3).crossJoin(df2.coalesce(3))
df3.show()

+-----+---+-----+------+
| name|age| name|weight|
+-----+---+-----+------+
|regan| 27|regan| 65|
|regan| 27| ting| 48|
| ting| 24|regan| 65|
| ting| 24| ting| 48|
+-----+---+-----+------+
- cube(*cols) builds a multi-dimensional data cube on the current DataFrame.
from pyspark.sql.functions import *
df = spark.read.csv('/sql/customers.csv',header=True)
df.cube('Age','Genre').count().orderBy(desc("count"), asc("Age")).show()

+----+------+-----+
| Age| Genre|count|
+----+------+-----+
|null| null| 200|
|null|Female| 112|
|null| Male| 88|
| 32| null| 11|
| 35| null| 9|
| 19| null| 8|
| 31| null| 8|
| 30| null| 7|
| 31|Female| 7|
| 49| null| 7|
| 19| Male| 6|
| 23|Female| 6|
| 23| null| 6|
| 27| null| 6|
| 32|Female| 6|
| 35|Female| 6|
| 36| null| 6|
| 38| null| 6|
| 40| null| 6|
| 47| null| 6|
+----+------+-----+
only showing top 20 rows
- describe(*cols) computes basic statistics for the given columns: count, mean, standard deviation, min and max.
df = spark.read.csv('/sql/customers.csv',header=True)
#df.describe('Age')
df.describe('Age','Genre').show()

+-------+-----------------+------+
|summary| Age| Genre|
+-------+-----------------+------+
| count| 200| 200|
| mean| 38.85| null|
| stddev|13.96900733155888| null|
| min| 18|Female|
| max| 70| Male|
+-------+-----------------+------+

df.describe().show()

+-------+------------------+------+-----------------+------------------+----------------------+
|summary| CustomerID| Genre| Age|Annual Income (k$)|Spending Score (1-100)|
+-------+------------------+------+-----------------+------------------+----------------------+
| count| 200| 200| 200| 200| 200|
| mean| 100.5| null| 38.85| 60.56| 50.2|
| stddev|57.879184513951124| null|13.96900733155888| 26.26472116527124| 25.823521668370173|
| min| 0001|Female| 18| 101| 1|
| max| 0200| Male| 70| 99| 99|
+-------+------------------+------+-----------------+------------------+----------------------+

pdf = df.toPandas()
pdf.describe()
- distinct() returns the DataFrame with duplicate rows removed.
df = spark.createDataFrame([(1,1),(1,2),(1,2),(5,5)])
df.count()
4
df.distinct().count()
3
- drop(*cols) drops the named columns and returns a new DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
df.columns
['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
df1 = df.drop('Age')
df1.columns
['CustomerID', 'Genre', 'Annual Income (k$)', 'Spending Score (1-100)']
- dropDuplicates(subset=None) drops duplicate rows; subset specifies which columns to consider when detecting duplicates.
from pyspark.sql import Row
df = sc.parallelize([
    Row(name='regan', age=27, height=170),
    Row(name='regan', age=27, height=170),
    Row(name='regan', age=27, height=155)], 3).toDF()
df.show()

+---+------+-----+
|age|height| name|
+---+------+-----+
| 27| 170|regan|
| 27| 170|regan|
| 27| 155|regan|
+---+------+-----+

df.dropDuplicates().show()

+---+------+-----+
|age|height| name|
+---+------+-----+
| 27| 155|regan|
| 27| 170|regan|
+---+------+-----+

df.dropDuplicates(subset=['age','name']).show()

+---+------+-----+
|age|height| name|
+---+------+-----+
| 27| 170|regan|
+---+------+-----+
2 Python Spark SQL Advanced Data Processing
- numpy is imported wherever it is needed in the examples below.
- dropna(how='any', thresh=None, subset=None) drops rows containing NA values. The keyword argument how controls the behaviour, with 'any' and 'all' as options; thresh specifies the minimum number of non-null values a row must have to be kept and, when set, overrides how; subset restricts the check to the given columns.
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])
df.show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| NaN|27.0| 170.0|
|44.0|27.0| 170.0|
| NaN| NaN| 170.0|
+----+----+------+

df.dropna(how='any').show()

+----+----+------+
|luck| age|weight|
+----+----+------+
|44.0|27.0| 170.0|
+----+----+------+

df.dropna(how='all').show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| NaN|27.0| 170.0|
|44.0|27.0| 170.0|
| NaN| NaN| 170.0|
+----+----+------+

df.dropna(thresh=2).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| NaN|27.0| 170.0|
|44.0|27.0| 170.0|
+----+----+------+
- dtypes returns a list of (column name, data type) tuples for the DataFrame.
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])
df.dtypes
[('luck', 'double'), ('age', 'double'), ('weight', 'double')]
- fillna(value, subset=None) fills missing values in the DataFrame.
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])
df.show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| NaN|27.0| 170.0|
|44.0|27.0| 170.0|
| NaN| NaN| 170.0|
+----+----+------+

df.na.fill(0.0).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| 0.0|27.0| 170.0|
|44.0|27.0| 170.0|
| 0.0| 0.0| 170.0|
+----+----+------+

df.fillna(0.0).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| 0.0|27.0| 170.0|
|44.0|27.0| 170.0|
| 0.0| 0.0| 170.0|
+----+----+------+

# a boolean fill value does not match the double columns, so nothing is replaced
df.na.fill(False).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| NaN|27.0| 170.0|
|44.0|27.0| 170.0|
| NaN| NaN| 170.0|
+----+----+------+

df.na.fill({'luck':0.0,'age':50.0}).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
| 0.0|27.0| 170.0|
|44.0|27.0| 170.0|
| 0.0|50.0| 170.0|
+----+----+------+
- filter(condition) filters rows by the given condition; where is simply an alias for filter.
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])
df.filter(df.luck != np.nan).show()

+----+----+------+
|luck| age|weight|
+----+----+------+
|44.0|27.0| 170.0|
+----+----+------+

df.filter('luck <> "NaN" ').show()

+----+----+------+
|luck| age|weight|
+----+----+------+
|44.0|27.0| 170.0|
+----+----+------+
- first() returns the first Row of the DataFrame.
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])
df.show()
df.first()
Row(luck=nan, age=27.0, weight=170.0)
- foreach(f) applies the function f to every Row; it is shorthand for df.rdd.foreach, the RDD-level foreach. (The author's test appeared not to produce output; see the comments and the sketch below.)
import numpy as np
df = spark.createDataFrame([(np.nan,27.,170.),(44.,27.,170.),
                            (np.nan,np.nan,170.)],schema=['luck','age','weight'])

def myprint(x):
    print(x.age)

# print() inside foreach runs on the executors, so its output goes to the executor
# stdout/logs rather than the driver console -- which is why nothing seemed to
# happen in the original test.
df.foreach(lambda x: print(x))

def pprint(x):
    for p in x:
        print(p.luck)

df.foreachPartition(pprint)
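If the goal is a per-row side effect that is visible on the driver, an accumulator is a common driver-visible mechanism. A minimal sketch, assuming the same df and an active SparkSession named spark as above:

# count rows with a side effect that can be read back on the driver
acc = spark.sparkContext.accumulator(0)

def count_row(row):
    acc.add(1)          # runs on the executors, aggregated back to the driver

df.foreach(count_row)
print(acc.value)        # 3 for the three-row DataFrame above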
- groupBy(*cols) groups the DataFrame by the given columns and returns a GroupedData object.
df = spark.read.csv('/sql/customers.csv',header=True)
df.columns
['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
df.groupby('Genre').agg({'Age':'mean'}).show()

+------+------------------+
| Genre| avg(Age)|
+------+------------------+
|Female|38.098214285714285|
| Male| 39.80681818181818|
+------+------------------+
- head(n=None) returns the first n rows of the DataFrame; the default is 1 and can be changed with the n keyword argument.
df = spark.read.csv('/sql/customers.csv',header=True)
df.head(6)
- hint(name, *parameters) gives the optimizer a hint when joining two DataFrames, typically to request a broadcast join. As the name says, it is just a suggestion to the engine about which join strategy to use.
df1 = spark.createDataFrame([('regan',23),('ting',24)],schema=['name','age'])
df2 = spark.createDataFrame([('regan',130),('ting',90)],schema=['name','weight'])
df1.join(df2.hint('broadcast'),'name').show()

+-----+---+------+
| name|age|weight|
+-----+---+------+
|regan| 23| 130|
| ting| 24| 90|
+-----+---+------+
- intersect(other) returns the intersection, in the set-theoretic sense, of the two DataFrames.
df1 = spark.createDataFrame([('regan',23),('ting',24)],schema=['name','age'])
df2 = spark.createDataFrame([('regan',23),('ting',90)],schema=['name','age'])
df1.intersect(df2).show()

+-----+---+
| name|age|
+-----+---+
|regan| 23|
+-----+---+
- join(other, on=None, how=None) joins two DataFrames: other is the other DataFrame, on is the column(s) to join on, and how is the join type, one of inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi and left_anti, with inner as the default.
df1 = spark.createDataFrame([('regan',23),('ting',24)],schema=['name','age'])
df2 = spark.createDataFrame([('regan',130),('ting2',90)],schema=['name','weight'])
df1.join(df2,on='name',how='left_outer').show()

+-----+---+------+
| name|age|weight|
+-----+---+------+
|regan| 23| 130|
| ting| 24| null|
+-----+---+------+

df1.join(df2,on='name',how='right_outer').show()

+-----+----+------+
| name| age|weight|
+-----+----+------+
|regan| 23| 130|
|ting2|null| 90|
+-----+----+------+

df1.join(df2,on='name',how='left_semi').show()

+-----+---+
| name|age|
+-----+---+
|regan| 23|
+-----+---+

df1.join(df2,on='name',how='left_anti').show()

+----+---+
|name|age|
+----+---+
|ting| 24|
+----+---+
- limit(num) limits the number of rows returned, which helps prevent an OOM on the driver node when collecting large results.
df1 = spark.createDataFrame([('regan',23),('ting',24)],schema=['name','age'])
df1.limit(1).collect()
- orderBy(*cols, **kwargs) returns a new DataFrame sorted by the given columns.
from pyspark.sql.functions import desc
df = spark.read.csv('/sql/customers.csv',header=True)
df.orderBy('Age').show(3)
df.orderBy('Age',ascending=False).show(3)

+----------+-----+---+------------------+----------------------+
|CustomerID|Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+-----+---+------------------+----------------------+
| 0034| Male| 18| 33| 92|
| 0066| Male| 18| 48| 59|
| 0092| Male| 18| 59| 41|
+----------+-----+---+------------------+----------------------+
only showing top 3 rows

+----------+-----+---+------------------+----------------------+
|CustomerID|Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+-----+---+------------------+----------------------+
| 0061| Male| 70| 46| 56|
| 0071| Male| 70| 49| 55|
| 0058| Male| 69| 44| 46|
+----------+-----+---+------------------+----------------------+
only showing top 3 rows

df.orderBy(desc("Age")).show(3)
df.orderBy(df.Age.desc()).show(3)

# orderBy behaves much like sort
df.sort(desc("Age")).show(3)
df.sort(df.Age.desc()).show(3)
- persist(storageLevel=StorageLevel(True, True, False, False, 1)) sets the caching level of the DataFrame; the default is memory and disk.
from pyspark import StorageLevel
df = spark.read.csv('/sql/customers.csv',header=True)
df.persist(storageLevel=StorageLevel.MEMORY_AND_DISK_2)
DataFrame[CustomerID: string, Genre: string, Age: string, Annual Income (k$): string, Spending Score (1-100): string]
- randomSplit(weights, seed=None) splits the DataFrame into several DataFrames according to the given weights; the seed keyword fixes the random seed so that results can be reproduced.
df = spark.range(0, 30, 2, 3)
df.show()
df.describe().show()
dfs = df.randomSplit([1.0, 4.0], 24)
for df in dfs:
    df.show()
- rdd returns the RDD underlying the DataFrame. All RDD methods are available on it; they are lower-level, but some tasks cannot be expressed with the high-level DataFrame API and have to be done on the RDD.
df = spark.range(0, 30, 2, 3)
rdd = df.rdd
rdd.map(lambda x: x.id ** 2).collect()
- replace(to_replace, value, subset=None) replaces the old values given by the first argument with the new values given by the second; the subset keyword restricts the replacement to the given columns, otherwise the whole DataFrame is used. Example: replace 99 with 100 in the dataset.
Note: during replacement, to_replace and value must have the same type. to_replace may only be a bool, int, long, float, string, list or dict, and value may only be a bool, int, long, float, string, list or None.

df = spark.read.csv('/sql/customers.csv',header=True)
df.columns
df.show()
df2 = df.replace('99','100')
df2.show()
df.replace(['Female','Male'],['F','M'],'Genre').show()

+----------+-----+---+------------------+----------------------+
|CustomerID|Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+-----+---+------------------+----------------------+
| 0001| M| 19| 15| 39|
| 0002| M| 21| 15| 81|
| 0003| F| 20| 16| 6|
| 0004| F| 23| 16| 77|
| 0005| F| 31| 17| 40|
| 0006| F| 22| 17| 76|
| 0007| F| 35| 18| 6|
| 0008| F| 23| 18| 94|
| 0009| M| 64| 19| 3|
| 0010| F| 30| 19| 72|
| 0011| M| 67| 19| 14|
| 0012| F| 35| 19| 99|
| 0013| F| 58| 20| 15|
| 0014| F| 24| 20| 77|
| 0015| M| 37| 20| 13|
| 0016| M| 22| 20| 79|
| 0017| F| 35| 21| 35|
| 0018| M| 20| 21| 66|
| 0019| M| 52| 23| 29|
| 0020| F| 35| 23| 98|
+----------+-----+---+------------------+----------------------+

df.na.replace(['Female','Male'],['F','M'],'Genre').show()
- rollup(*cols) rolls the DataFrame up along the given columns so that aggregate functions can be applied at each level of the hierarchy.
from pyspark.sql.functions import *
df = spark.read.csv('/sql/customers.csv',header=True)
df.rollup('Genre','Age').count().orderBy(desc('count'),'Genre').show()

+------+----+-----+
| Genre| Age|count|
+------+----+-----+
| null|null| 200|
|Female|null| 112|
| Male|null| 88|
|Female| 31| 7|
|Female| 23| 6|
|Female| 49| 6|
|Female| 32| 6|
|Female| 35| 6|
| Male| 19| 6|
|Female| 30| 5|
| Male| 32| 5|
| Male| 48| 5|
|Female| 21| 4|
|Female| 47| 4|
|Female| 50| 4|
|Female| 36| 4|
|Female| 29| 4|
|Female| 27| 4|
|Female| 38| 4|
| Male| 59| 4|
+------+----+-----+
- sample(withReplacement=None, fraction=None, seed=None) draws a sample from the DataFrame: withReplacement selects sampling with (True) or without (False) replacement, fraction is the sampling fraction, and seed is the random seed; the same seed always produces the same sample, which is useful for reproducing a scenario.
df = spark.read.csv('/sql/customers.csv',header=True)
df.count()
200
df2 = df.sample(withReplacement=True,fraction=0.2,seed=1)
df2.count()
35
- sampleBy(col, fractions, seed=None) performs stratified sampling on column col with the per-stratum fractions given by fractions; seed is the random seed for reproducibility.
df = spark.read.csv('/sql/customers.csv',header=True)
df.sampleBy('Genre',{'Male':0.1,'Female':0.15}).groupBy('Genre').count().show()

+------+-----+
| Genre|count|
+------+-----+
|Female| 15|
| Male| 11|
+------+-----+
- select(*cols) selects data from the DataFrame by expression and returns a new DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
df.select('*').count()
df.select('Age','Genre').show(10)
df.select(df.Age.alias('age')).show(10)
- selectExpr(*expr) is a variant of select that accepts SQL expressions and returns a new DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
df.selectExpr('Age * 2','sqrt(Age)').show(10)

+---------+-------------------------+
|(Age * 2)|SQRT(CAST(Age AS DOUBLE))|
+---------+-------------------------+
| 38.0| 4.358898943540674|
| 42.0| 4.58257569495584|
| 40.0| 4.47213595499958|
| 46.0| 4.795831523312719|
| 62.0| 5.5677643628300215|
| 44.0| 4.69041575982343|
| 70.0| 5.916079783099616|
| 46.0| 4.795831523312719|
| 128.0| 8.0|
| 60.0| 5.477225575051661|
+---------+-------------------------+
- show(n=20, truncate=True, vertical=False) prints the first 20 rows by default; truncate cuts off values longer than 20 characters, and vertical prints the rows vertically.
df = spark.read.csv('/sql/customers.csv',header=True)
df.selectExpr('Age * 2','sqrt(Age)').show(10,truncate=False,vertical=False)
- sortWithinPartitions(*cols, **kwargs) and sort(*cols, **kwargs) both sort by the given columns, with kwargs controlling ascending or descending order; sortWithinPartitions sorts within each partition only.
df = spark.read.csv('/sql/customers.csv',header=True)
df.sort(['Age','Genre'],ascending=True).show(10)
df.sort(df.Age.desc()).show(10)
from pyspark.sql.functions import *
df.sortWithinPartitions(['Age','Genre'],ascending=False).show(10)

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0061| Male| 70| 46| 56|
| 0071| Male| 70| 49| 55|
| 0058| Male| 69| 44| 46|
| 0109| Male| 68| 63| 43|
| 0068|Female| 68| 48| 48|
| 0091|Female| 68| 59| 55|
| 0011| Male| 67| 19| 14|
| 0083| Male| 67| 54| 41|
| 0103| Male| 67| 62| 59|
| 0063|Female| 67| 47| 52|
+----------+------+---+------------------+----------------------+

df.sortWithinPartitions(desc('Age')).show(10)
- subtract(other) returns the rows that are in this DataFrame but not in the other, as a new DataFrame.
df1 = spark.createDataFrame([('regan',),('ting',),('yu',)],schema=['name'])
df2 = spark.createDataFrame([('regan',),('ting',),('sha',)],schema=['name'])
df3 = df1.subtract(df2)
df3.show()
- summary(*statistics) returns summary statistics computed with the given methods. With no arguments it computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50% and 75%) and max; *statistics may be any of count, mean, stddev, min, max, or arbitrary approximate percentiles.
df = spark.read.csv('/sql/customers.csv',header=True)
df.summary().show()
df.summary('min','count','75%').show()

+-------+------------------+------+-----------------+------------------+----------------------+
|summary| CustomerID| Genre| Age|Annual Income (k$)|Spending Score (1-100)|
+-------+------------------+------+-----------------+------------------+----------------------+
| count| 200| 200| 200| 200| 200|
| mean| 100.5| null| 38.85| 60.56| 50.2|
| stddev|57.879184513951124| null|13.96900733155888| 26.26472116527124| 25.823521668370173|
| min| 0001|Female| 18| 101| 1|
| 25%| 50.0| null| 28.0| 40.0| 34.0|
| 50%| 100.0| null| 36.0| 61.0| 50.0|
| 75%| 150.0| null| 49.0| 78.0| 73.0|
| max| 0200| Male| 70| 99| 99|
+-------+------------------+------+-----------------+------------------+----------------------+

+-------+----------+------+----+------------------+----------------------+
|summary|CustomerID| Genre| Age|Annual Income (k$)|Spending Score (1-100)|
+-------+----------+------+----+------------------+----------------------+
| min| 0001|Female| 18| 101| 1|
| count| 200| 200| 200| 200| 200|
| 75%| 150.0| null|49.0| 78.0| 73.0|
+-------+----------+------+----+------------------+----------------------+
- take(num) returns a list of the first num Rows of the DataFrame; keep num small to avoid an OOM on the driver node.
df = spark.read.csv('/sql/customers.csv',header=True)
df.take(3)
- toDF(*cols) returns a new DataFrame whose columns are renamed to the given cols.
df = spark.read.csv('/sql/customers.csv',header=True)
df.columns
df1 = df.toDF('id','sex','age','income','score')
df1.columns
df1.show(5)
- toJSON(use_unicode=True) converts each Row of the DataFrame into a JSON string (unicode by default). Note that toJSON returns an RDD, not a DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
df.show(5)
df1 = df.toJSON()
df1.collect()

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
+----------+------+---+------------------+----------------------+
only showing top 5 rows

['{"CustomerID":"0001","Genre":"Male","Age":"19","Annual Income (k$)":"15","Spending Score (1-100)":"39"}',
 '{"CustomerID":"0002","Genre":"Male","Age":"21","Annual Income (k$)":"15","Spending Score (1-100)":"81"}',
 '{"CustomerID":"0003","Genre":"Female","Age":"20","Annual Income (k$)":"16","Spending Score (1-100)":"6"}',
 ......]
- toLocalIterator() returns a local iterator over all rows of the DataFrame; with large data this can easily cause an OOM. (The author's test of this did not pass.)
df = spark.read.csv('/sql/customers.csv',header=True)
results = df.toLocalIterator()
for data in results:
    print(data)
- toPandas() converts the Spark DataFrame into a pandas DataFrame.
df = spark.read.csv('/sql/customers.csv',header=True)
pan_df = df.toPandas()
pan_df
pan_df.head(10)
- union(other) returns the union of the two DataFrames (duplicates are kept).
df1 = spark.createDataFrame([('regan',),('ting',),('yu',)],schema=['name'])
df2 = spark.createDataFrame([('regan',),('ting',),('sha',)],schema=['name'])
df1.union(df2).show()

+-----+
| name|
+-----+
|regan|
| ting|
| yu|
|regan|
| ting|
| sha|
+-----+
- unionByName(other) unions two DataFrames by column name rather than by position, so the column order does not matter as long as the names match; a minimal sketch follows below.
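The original notes give no example for unionByName, so here is a minimal sketch, assuming an active SparkSession named spark; the expected output is shown as comments:

df1 = spark.createDataFrame([('regan', 27)], schema=['name', 'age'])
df2 = spark.createDataFrame([(30, 'ting')], schema=['age', 'name'])

# union() would pair the columns by position; unionByName matches them by name
df1.unionByName(df2).show()
# +-----+---+
# | name|age|
# +-----+---+
# |regan| 27|
# | ting| 30|
# +-----+---+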
- unpersist(blocking=False) removes all data persisted for this DataFrame.
from pyspark import StorageLevel
df1 = spark.createDataFrame([('regan',11),('ting',1),('yu',2)],schema=['name','score'])
df1.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
df1.storageLevel
df1.unpersist()
df1.storageLevel
- where(condition) is similar to filter: it selects rows according to the given condition.
df = spark.read.csv('/sql/customers.csv',header=True)
df.where('Age >= 30').show()
- withColumn(colName, col) returns a new DataFrame with a column named colName added, or replaced if a column of that name already exists.
df = spark.read.csv('/sql/customers.csv',header=True)
df.withColumn('Age',df.Age**2).show(10)
df.withColumn('Age2',df.Age**2).show(10)

+----------+------+------+------------------+----------------------+
|CustomerID| Genre| Age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+------+------------------+----------------------+
| 0001| Male| 361.0| 15| 39|
| 0002| Male| 441.0| 15| 81|
| 0003|Female| 400.0| 16| 6|
| 0004|Female| 529.0| 16| 77|
| 0005|Female| 961.0| 17| 40|
| 0006|Female| 484.0| 17| 76|
| 0007|Female|1225.0| 18| 6|
| 0008|Female| 529.0| 18| 94|
| 0009| Male|4096.0| 19| 3|
| 0010|Female| 900.0| 19| 72|
+----------+------+------+------------------+----------------------+
only showing top 10 rows

+----------+------+---+------------------+----------------------+------+
|CustomerID| Genre|Age|Annual Income (k$)|Spending Score (1-100)| Age2|
+----------+------+---+------------------+----------------------+------+
| 0001| Male| 19| 15| 39| 361.0|
| 0002| Male| 21| 15| 81| 441.0|
| 0003|Female| 20| 16| 6| 400.0|
| 0004|Female| 23| 16| 77| 529.0|
| 0005|Female| 31| 17| 40| 961.0|
| 0006|Female| 22| 17| 76| 484.0|
| 0007|Female| 35| 18| 6|1225.0|
| 0008|Female| 23| 18| 94| 529.0|
| 0009| Male| 64| 19| 3|4096.0|
| 0010|Female| 30| 19| 72| 900.0|
+----------+------+---+------------------+----------------------+------+
only showing top 10 rows
- withColumnRenamed(existing, new) renames an existing column to new; if the column does not exist, the call does nothing.
df = spark.read.csv('/sql/customers.csv',header=True)
df.withColumnRenamed('Age','age').show(10)
df.withColumnRenamed('Age2','age').show(10)

+----------+------+---+------------------+----------------------+
|CustomerID| Genre|age|Annual Income (k$)|Spending Score (1-100)|
+----------+------+---+------------------+----------------------+
| 0001| Male| 19| 15| 39|
| 0002| Male| 21| 15| 81|
| 0003|Female| 20| 16| 6|
| 0004|Female| 23| 16| 77|
| 0005|Female| 31| 17| 40|
| 0006|Female| 22| 17| 76|
| 0007|Female| 35| 18| 6|
| 0008|Female| 23| 18| 94|
| 0009| Male| 64| 19| 3|
| 0010|Female| 30| 19| 72|
+----------+------+---+------------------+----------------------+
- write returns a writer interface used to save the DataFrame to an external storage system.
df = spark.read.csv('/sql/customers.csv',header=True)
df.write
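df.write only returns a DataFrameWriter; nothing is saved until one of its output methods is called. A minimal sketch of actually writing the data out (the output paths below are placeholders, not from the original article):

# save as Parquet, overwriting any existing output at that path
df.write.mode('overwrite').parquet('/sql/customers_parquet')

# save as CSV with a header row
df.write.mode('overwrite').option('header', 'true').csv('/sql/customers_csv')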
3 Spark SQL Advanced Usage: cube and rollup
- group by: groups the query result; rows with the same combination of grouping values collapse into a single row in the result set, and aggregate functions can be applied on top.
- grouping sets: runs a group by for each subset of grouping expressions listed in the grouping set. group by A,B grouping sets(A,B) is equivalent to group by A union group by B. Each element can itself be a set, for example group by A,B,C grouping sets((A,B),(A,C)).
- rollup: creates grouping sets for each hierarchical level of the listed expressions. group by A,B,C with rollup first groups by (A,B,C), then by (A,B), then by (A), and finally over the whole table.
- cube: creates grouping sets for every possible combination of the listed expressions. It first groups by (A,B,C), then by (A,B), (A,C), (A), (B,C), (B), (C), and finally over the whole table.
case class MemberOrderInfo(area: String, memberType: String, product: String, price: Int)

import spark.implicits._
val orders = Seq(
  MemberOrderInfo("深圳", "鑽石會員", "鑽石會員1個月", 25),
  MemberOrderInfo("深圳", "鑽石會員", "鑽石會員1個月", 25),
  MemberOrderInfo("深圳", "鑽石會員", "鑽石會員3個月", 70),
  MemberOrderInfo("深圳", "鑽石會員", "鑽石會員12個月", 300),
  MemberOrderInfo("深圳", "鉑金會員", "鉑金會員3個月", 60),
  MemberOrderInfo("深圳", "鉑金會員", "鉑金會員3個月", 60),
  MemberOrderInfo("深圳", "鉑金會員", "鉑金會員6個月", 120),
  MemberOrderInfo("深圳", "黃金會員", "黃金會員1個月", 15)
)

// turn the Seq into a DataFrame
val memberDF: DataFrame = orders.toDF()
// register the DataFrame as a temporary view (queried below without a database prefix)
memberDF.createOrReplaceTempView("orderTempTable")

// group by
spark.sql("select area,memberType,product,sum(price) as total from orderTempTable group by area,memberType,product").show

+----+----------+--------+-----+
|area|memberType| product|total|
+----+----------+--------+-----+
| 深圳| 鑽石會員| 鑽石會員3個月| 70|
| 深圳| 鑽石會員|鑽石會員12個月| 300|
| 深圳| 鉑金會員| 鉑金會員6個月| 120|
| 深圳| 鉑金會員| 鉑金會員3個月| 120|
| 深圳| 鑽石會員| 鑽石會員1個月| 50|
| 深圳| 黃金會員| 黃金會員1個月| 15|
+----+----------+--------+-----+

spark.sql("select area,memberType,product,sum(price) as total from orderTempTable group by area,memberType,product grouping sets(area,memberType,product)").show

+----+----------+--------+-----+
|area|memberType| product|total|
+----+----------+--------+-----+
|null| null| 鉑金會員3個月| 120|
|null| 鉑金會員| null| 240|
|null| null|鑽石會員12個月| 300|
| 深圳| null| null| 675|
|null| 鑽石會員| null| 420|
|null| null| 鑽石會員1個月| 50|
|null| null| 黃金會員1個月| 15|
|null| null| 鑽石會員3個月| 70|
|null| 黃金會員| null| 15|
|null| null| 鉑金會員6個月| 120|
+----+----------+--------+-----+

spark.sql("select area,memberType,product,sum(price) as total from orderTempTable group by area,memberType,product grouping sets((area,memberType),(memberType,product))").show

+----+----------+--------+-----+
|area|memberType| product|total|
+----+----------+--------+-----+
|null| 鉑金會員| 鉑金會員6個月| 120|
|null| 鑽石會員|鑽石會員12個月| 300|
|null| 鑽石會員| 鑽石會員3個月| 70|
| 深圳| 鑽石會員| null| 420|
|null| 鉑金會員| 鉑金會員3個月| 120|
|null| 黃金會員| 黃金會員1個月| 15|
|null| 鑽石會員| 鑽石會員1個月| 50|
| 深圳| 黃金會員| null| 15|
| 深圳| 鉑金會員| null| 240|
+----+----------+--------+-----+

spark.sql("select area,memberType,product,sum(price) as total from orderTempTable group by area,memberType,product with rollup").show

+----+----------+--------+-----+
|area|memberType| product|total|
+----+----------+--------+-----+
| 深圳| 鑽石會員| 鑽石會員1個月| 50|
| 深圳| 鑽石會員|鑽石會員12個月| 300|
| 深圳| 鉑金會員| 鉑金會員3個月| 120|
| 深圳| 鑽石會員| null| 420|
| 深圳| null| null| 675|
|null| null| null| 675|
| 深圳| 鑽石會員| 鑽石會員3個月| 70|
| 深圳| 黃金會員| 黃金會員1個月| 15|
| 深圳| 黃金會員| null| 15|
| 深圳| 鉑金會員| null| 240|
| 深圳| 鉑金會員| 鉑金會員6個月| 120|
+----+----------+--------+-----+

spark.sql("select area,memberType,product,sum(price) as total from orderTempTable group by area,memberType,product with cube").show

+----+----------+--------+-----+
|area|memberType| product|total|
+----+----------+--------+-----+
| 深圳| null| 黃金會員1個月| 15|
|null| null| 鉑金會員3個月| 120|
| 深圳| null| 鉑金會員6個月| 120|
|null| 鉑金會員| 鉑金會員6個月| 120|
|null| 鉑金會員| null| 240|
| 深圳| 鑽石會員| 鑽石會員1個月| 50|
| 深圳| null| 鑽石會員1個月| 50|
|null| 鑽石會員|鑽石會員12個月| 300|
| 深圳| 鑽石會員|鑽石會員12個月| 300|
| 深圳| 鉑金會員| 鉑金會員3個月| 120|
|null| 鑽石會員| 鑽石會員3個月| 70|
| 深圳| 鑽石會員| null| 420|
|null| null|鑽石會員12個月| 300|
| 深圳| null| null| 675|
|null| 鉑金會員| 鉑金會員3個月| 120|
|null| 鑽石會員| null| 420|
|null| 黃金會員| 黃金會員1個月| 15|
|null| 鑽石會員| 鑽石會員1個月| 50|
|null| null| 鑽石會員1個月| 50|
|null| null| null| 675|
+----+----------+--------+-----+
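The same queries can be issued from PySpark through spark.sql. A minimal sketch, assuming an active SparkSession named spark; the table and column names are carried over from the Scala example above:

orders = [
    ('深圳', '鑽石會員', '鑽石會員1個月', 25),
    ('深圳', '鑽石會員', '鑽石會員1個月', 25),
    ('深圳', '鑽石會員', '鑽石會員3個月', 70),
    ('深圳', '鑽石會員', '鑽石會員12個月', 300),
    ('深圳', '鉑金會員', '鉑金會員3個月', 60),
    ('深圳', '鉑金會員', '鉑金會員3個月', 60),
    ('深圳', '鉑金會員', '鉑金會員6個月', 120),
    ('深圳', '黃金會員', '黃金會員1個月', 15),
]
memberDF = spark.createDataFrame(orders, schema=['area', 'memberType', 'product', 'price'])
memberDF.createOrReplaceTempView('orderTempTable')

# plain group by
spark.sql("""select area, memberType, product, sum(price) as total
             from orderTempTable
             group by area, memberType, product""").show()

# grouping sets, rollup and cube follow the same pattern as in the Scala example
spark.sql("""select area, memberType, product, sum(price) as total
             from orderTempTable
             group by area, memberType, product
             grouping sets((area, memberType), (memberType, product))""").show()

spark.sql("""select area, memberType, product, sum(price) as total
             from orderTempTable
             group by area, memberType, product with cube""").show()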
4 Summary
Bidirectional integration of the Python data stack with Spark gives us a common toolkit for cross-cutting big-data analysis; in fact, a large part of the Spark SQL DataFrame API mirrors what pandas offers.
Qin Kaixin, Shenzhen, 2018-12-17 23:52