Spark Features: Extracting, Transforming and Selecting Features

Posted by 智慧先行者 on 2016-12-02

VectorAssembler: assembling columns into a feature vector
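The examples below operate on a DataFrame named data holding the Affairs dataset (columns affairs, gender, age, yearsmarried, children, religiousness, education, occupation, rating, per the printSchema output further down). The post does not show how data is loaded; a minimal sketch, assuming a hypothetical CSV file path:

import org.apache.spark.sql.DataFrame

// Hypothetical path; the Affairs dataset itself is not included in this post.
val data: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/affairs.csv")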

import org.apache.spark.ml.feature.VectorAssembler

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")

// Assemble the columns into a single feature vector
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")

val vecDF: DataFrame = assembler.transform(data)
vecDF: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 8 more fields]

vecDF.select("features", colArray: _*).show(10, truncate = false)
+----------------------------+----+------------+-------------+---------+----------+------+
|features                    |age |yearsmarried|religiousness|education|occupation|rating|
+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0        |3.0          |18.0     |7.0       |4.0   |
|[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0         |4.0          |14.0     |6.0       |4.0   |
|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0        |1.0          |12.0     |1.0       |4.0   |
|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0        |5.0          |18.0     |6.0       |5.0   |
|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75        |2.0          |17.0     |6.0       |3.0   |
|[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5         |2.0          |17.0     |5.0       |5.0   |
|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75        |2.0          |12.0     |1.0       |3.0   |
|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0        |2.0          |14.0     |4.0       |4.0   |
|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0        |4.0          |16.0     |1.0       |2.0   |
|[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5         |4.0          |14.0     |4.0       |5.0   |
+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows
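Note that colArray includes only the numeric columns: VectorAssembler accepts numeric, boolean and vector inputs, so the string columns gender and children are left out. To include them they would first need to be encoded, for example with StringIndexer; a minimal sketch (the output column name genderIndexed is a hypothetical choice):

import org.apache.spark.ml.feature.StringIndexer

// Encode the string column "gender" to numeric indices so it could be
// assembled into the feature vector as well.
val genderIndexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("genderIndexed")

val dataIndexed = genderIndexer.fit(data).transform(data)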

 

VectorIndexer: automatically identifying categorical features and indexing them

import org.apache.spark.ml.feature.VectorIndexer

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating") 

// Automatically identify categorical features and index them.
// Features with more than 7 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(7)
  .fit(vecDF)


val categoricalFeatures: Set[Int] = featureIndexer.categoryMaps.keys.toSet
categoricalFeatures: Set[Int] = Set(2, 3, 4, 5)

println(s"Chose ${categoricalFeatures.size} categorical features: " +
         categoricalFeatures.mkString(", "))
Chose 4 categorical features: 2, 3, 4, 5

// With maxCategories = 7, 4 of the 6 columns were identified as categorical feature columns.
// Their vector indices are (2, 3, 4, 5), corresponding to elements (2, 3, 4, 5) of colArray,
// i.e. "religiousness", "education", "occupation", "rating".
// Why these 4? See "counting the distinct elements of a column" in my post
// http://www.cnblogs.com/wwxbi/p/6125363.html: each of these 4 columns has at most 7 distinct values.
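You can verify this yourself by counting the distinct values of each input column; a minimal sketch using the colArray and data defined above:

colArray.foreach { col =>
  // Columns with <= 7 distinct values are the ones VectorIndexer
  // (maxCategories = 7) treats as categorical.
  println(s"$col: ${data.select(col).distinct().count()} distinct values")
}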

// Create new column "indexedFeatures" with categorical values transformed to indices
val indexedData = featureIndexer.transform(vecDF)
indexedData: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 9 more fields]

val resColArray = Array("indexedFeatures", "features", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")
resColArray: Array[String] = Array(indexedFeatures, features, age, yearsmarried, religiousness, education, occupation, rating)

indexedData.selectExpr(resColArray: _*).show(10, truncate = false)
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|indexedFeatures            |features                    |age |yearsmarried|religiousness|education|occupation|rating|
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,2.0,5.0,6.0,3.0]|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0        |3.0          |18.0     |7.0       |4.0   |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0         |4.0          |14.0     |6.0       |4.0   |
|[32.0,15.0,0.0,1.0,0.0,3.0]|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0        |1.0          |12.0     |1.0       |4.0   |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0        |5.0          |18.0     |6.0       |5.0   |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75        |2.0          |17.0     |6.0       |3.0   |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5         |2.0          |17.0     |5.0       |5.0   |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75        |2.0          |12.0     |1.0       |3.0   |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0        |2.0          |14.0     |4.0       |4.0   |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0        |4.0          |16.0     |1.0       |2.0   |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5         |4.0          |14.0     |4.0       |5.0   |
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows

import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer().setInputCol("indexedFeatures").setOutputCol("slicerFeatures")
slicer.setIndices(Array(3))  // index 3 here corresponds to the column "education" before indexing

val output = slicer.transform(indexedData)
output.select("indexedFeatures", 
        "slicerFeatures",
        "education").limit(10).orderBy($"education").show(10, truncate = false)
+---------------------------+--------------+---------+
|indexedFeatures            |slicerFeatures|education|
+---------------------------+--------------+---------+
|[32.0,15.0,0.0,1.0,0.0,3.0]|[1.0]         |12.0     |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[1.0]         |12.0     |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[2.0]         |14.0     |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[2.0]         |14.0     |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[2.0]         |14.0     |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[3.0]         |16.0     |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[4.0]         |17.0     |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[4.0]         |17.0     |
|[37.0,10.0,2.0,5.0,6.0,3.0]|[5.0]         |18.0     |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[5.0]         |18.0     |
+---------------------------+--------------+---------+
// This shows that after a categorical feature column is indexed, the index numbering
// follows the sort order of the original values, starting from 0.
// Index numbers (0, 1, 2, 3, 4, 5, 6) correspond to [9.0, 12.0, 14.0, 16.0, 17.0, 18.0, 20.0]
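The fitted model exposes this mapping directly through categoryMaps; a minimal sketch printing the value-to-index map for vector position 3 (education):

// categoryMaps: Map[featurePosition -> Map[originalValue -> categoryIndex]]
featureIndexer.categoryMaps(3).toSeq.sortBy(_._2).foreach {
  case (value, index) => println(s"education value $value -> index $index")
}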

 

VectorSlicer: slicing vectors

import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating") 

// Assemble the columns into a feature vector
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")
val vecDF = assembler.transform(data)

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("slicerFeatures")
// Indices into the vector column "features":
// (2, 3, 4) correspond to the columns ("religiousness", "education", "occupation")
slicer.setIndices(Array(2, 3, 4))

val output = slicer.transform(vecDF)
output.select("features", "slicerFeatures","religiousness", "education", "occupation").show(10, truncate = false)
+----------------------------+--------------+-------------+---------+----------+
|features                    |slicerFeatures|religiousness|education|occupation|
+----------------------------+--------------+-------------+---------+----------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|[3.0,18.0,7.0]|3.0          |18.0     |7.0       |
|[27.0,4.0,4.0,14.0,6.0,4.0] |[4.0,14.0,6.0]|4.0          |14.0     |6.0       |
|[32.0,15.0,1.0,12.0,1.0,4.0]|[1.0,12.0,1.0]|1.0          |12.0     |1.0       |
|[57.0,15.0,5.0,18.0,6.0,5.0]|[5.0,18.0,6.0]|5.0          |18.0     |6.0       |
|[22.0,0.75,2.0,17.0,6.0,3.0]|[2.0,17.0,6.0]|2.0          |17.0     |6.0       |
|[32.0,1.5,2.0,17.0,5.0,5.0] |[2.0,17.0,5.0]|2.0          |17.0     |5.0       |
|[22.0,0.75,2.0,12.0,1.0,3.0]|[2.0,12.0,1.0]|2.0          |12.0     |1.0       |
|[57.0,15.0,2.0,14.0,4.0,4.0]|[2.0,14.0,4.0]|2.0          |14.0     |4.0       |
|[32.0,15.0,4.0,16.0,1.0,2.0]|[4.0,16.0,1.0]|4.0          |16.0     |1.0       |
|[22.0,1.5,4.0,14.0,4.0,5.0] |[4.0,14.0,4.0]|4.0          |14.0     |4.0       |
+----------------------------+--------------+-------------+---------+----------+
only showing top 10 rows


output.printSchema()
root
 |-- affairs: double (nullable = false)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = false)
 |-- yearsmarried: double (nullable = false)
 |-- children: string (nullable = true)
 |-- religiousness: double (nullable = false)
 |-- education: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- rating: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- slicerFeatures: vector (nullable = true)
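Besides positional indices, VectorSlicer can also select by name, because VectorAssembler attaches the source column names as ML attributes of the output vector; a minimal sketch on the same vecDF (the output column name slicedByName is a hypothetical choice):

// Select the sub-vector by attribute name instead of position.
val namedSlicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("slicedByName")
  .setNames(Array("religiousness", "education", "occupation"))

namedSlicer.transform(vecDF).select("slicedByName").show(3, truncate = false)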

 

Bucketizer: discretizing continuous data into specified ranges

import org.apache.spark.ml.feature.Bucketizer

// Double.NegativeInfinity: negative infinity; Double.PositiveInfinity: positive infinity
// Six buckets: [-inf, -100), [-100, -10), [-10, 0), [0, 10), [10, 90), [90, +inf)
val splits = Array(Double.NegativeInfinity, -100, -10, 0.0, 10, 90, Double.PositiveInfinity)

val data: Array[Double] = Array(-180,-160,-100,-50,-70,-20,-8,-5,-3, 0.0, 1,3,7,10,30,60,90,100,120,150)

val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
dataFrame: org.apache.spark.sql.DataFrame = [features: double]

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Transform the original data into bucket indices
val bucketedData = bucketizer.transform(dataFrame)
bucketedData: org.apache.spark.sql.DataFrame = [features: double, bucketedFeatures: double]

bucketedData.show(50, truncate = false)
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|-180.0  |0.0             |
|-160.0  |0.0             |
|-100.0  |1.0             |
|-50.0   |1.0             |
|-70.0   |1.0             |
|-20.0   |1.0             |
|-8.0    |2.0             |
|-5.0    |2.0             |
|-3.0    |2.0             |
|0.0     |3.0             |
|1.0     |3.0             |
|3.0     |3.0             |
|7.0     |3.0             |
|10.0    |4.0             |
|30.0    |4.0             |
|60.0    |4.0             |
|90.0    |5.0             |
|100.0   |5.0             |
|120.0   |5.0             |
|150.0   |5.0             |
+--------+----------------+
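The infinite boundaries at both ends guarantee that every value falls into some bucket. With finite splits, any value outside the covered range makes transform fail; in recent Spark versions Bucketizer also offers setHandleInvalid to skip or keep such rows instead. A minimal sketch, assuming a Spark version that supports setHandleInvalid on Bucketizer:

// Finite splits: only values in [0.0, 90.0] are covered.
val finiteBucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(Array(0.0, 10.0, 90.0))
  .setHandleInvalid("skip")  // drop out-of-range rows instead of raising an error

val bucketedSubset = finiteBucketizer.transform(dataFrame)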

 
