Spark: Extracting, transforming, selecting features

Posted by HoLoong on 2020-09-25

Spark(3) - Extracting, transforming, selecting features

Official documentation: https://spark.apache.org/docs/2.2.0/ml-features.html

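Overview

This chapter covers algorithms for working with features, roughly divided into the following groups:

  • Extraction: extracting features from raw data;
  • Transformation: scaling, converting, or modifying features;
  • Selection: selecting a subset from a larger set of features;
  • Locality Sensitive Hashing (LSH): this class of algorithms combines aspects of feature transformation with other algorithms (at its core, LSH addresses nearest-neighbor search, that is, similarity, over massive high-dimensional data: it maps highly similar items to the same hash value with high probability and dissimilar items to the same hash value with very low probability; a function with this property is called an LSH function);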

Contents:

  • Feature extraction:
    • TF-IDF
    • Word2Vec
    • CountVectorizer
  • Feature transformation:
    • Tokenizer
    • StopWordsRemover
    • n-gram
    • Binarizer
    • PCA
    • PolynomialExpansion
    • Discrete Cosine Transform
    • StringIndexer
    • IndexToString
    • OneHotEncoder
    • VectorIndexer
    • Interaction
    • Normalizer
    • StandardScaler
    • MinMaxScaler
    • MaxAbsScaler
    • Bucketizer
    • ElementwiseProduct
    • SQLTransformer
    • VectorAssembler
    • QuantileDiscretizer
    • Imputer
  • Feature selection:
    • VectorSlicer
    • RFormula
    • ChiSqSelector
  • Locality Sensitive Hashing:
    • LSH Operations:
      • Feature Transformation
      • Approximate Similarity Join
      • Approximate Nearest Neighbor Search
    • LSH Algorithms:
      • Bucketed Random Projection for Euclidean Distance
      • MinHash for Jaccard Distance

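Feature extraction

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a feature vectorization method widely used in text mining to reflect how important a term is to a document in the corpus.

  • TF: both HashingTF and CountVectorizer can be used to generate term frequency vectors;
  • IDF: IDF is an Estimator; calling its fit method produces an IDFModel, which rescales each feature vector so that terms appearing frequently across the corpus are down-weighted;
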
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()
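
As a rough illustration of what the IDF rescaling does, the following pure-Python sketch applies the smoothed formula given in the Spark documentation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), directly to terms rather than to hashed indices. The helper names idf_weights and tf_idf are made up for this example and are not part of pyspark.

```python
import math


def idf_weights(docs):
    """Return {term: idf} with idf = log((|D| + 1) / (DF(t, D) + 1))."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term at most once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Multiply the raw term counts of one document by the IDF weights."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1
    return {t: count * idf[t] for t, count in tf.items()}


docs = [
    "hi i heard about spark".split(),
    "i wish java could use case classes".split(),
    "logistic regression models are neat".split(),
]
idf = idf_weights(docs)
weights = tf_idf(docs[0], idf)
print(weights)
```

Here "i" appears in two of the three documents, so it receives a smaller weight than "spark", which appears in only one; a term that appears in every document gets weight 0.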

Word2Vec

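Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector, and transforms each document into a vector by averaging the vectors of all the words in the document; this vector can then be used as features for prediction, document similarity calculations, and so on.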

from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
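
The averaging step can be sketched in plain Python. This is only an illustration under the assumption that every word in the document already has a vector; the toy vectors and the helper name document_vector are invented for the example and have nothing to do with what a trained Word2VecModel would learn.

```python
def document_vector(words, word_vectors):
    """Average the per-word vectors of a document, component by component."""
    # Assumption: every word has an entry in word_vectors.
    vectors = [word_vectors[w] for w in words]
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


# Hypothetical 3-dimensional word vectors, chosen by hand.
toy_vectors = {"hi": [1.0, 0.0, 2.0], "spark": [3.0, 2.0, 0.0]}
doc_vec = document_vector(["hi", "spark"], toy_vectors)
print(doc_vec)
```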

CountVectorizer

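CountVectorizer and CountVectorizerModel aim to convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.

During fitting, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects fitting by specifying the minimum number of documents (or the fraction of documents, if below 1.0) a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector: if set to True, all nonzero counts are set to 1, which is especially useful for discrete probabilistic models.

Assume we have the following DataFrame with columns id and texts: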

id texts
0 Array("a", "b", "c")
1 Array("a", "b", "b", "c", "a")

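Each row in texts is a document represented as an array of strings. Calling fit on CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c), and the output column "vector" after transformation looks like this: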

id texts vector
0 Array("a", "b", "c") (3,[0,1,2],[1.0,1.0,1.0])
1 Array("a", "b", "b", "c", "a") (3,[0,1,2],[2.0,2.0,1.0])

from pyspark.ml.feature import CountVectorizer

# Input data: Each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)
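
To see where the (size, indices, values) triples come from, here is a minimal pure-Python sketch of the same idea. It assumes minDF is an integer document count and breaks frequency ties alphabetically; both choices, along with the helper names fit_vocabulary and to_sparse, are assumptions of this sketch rather than a description of Spark's internals.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1):
    """Pick the vocab_size most frequent terms that occur in >= min_df documents."""
    term_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    # Assumption: ties in frequency are broken alphabetically.
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]


def to_sparse(doc, vocab, binary=False):
    """Return (size, indices, values), mirroring the sparse vector notation."""
    index = {t: i for i, t in enumerate(vocab)}
    counts = Counter(t for t in doc if t in index)
    indices = sorted(index[t] for t in counts)
    values = [1.0 if binary else float(counts[vocab[i]]) for i in indices]
    return (len(vocab), indices, values)


docs = ["a b c".split(), "a b b c a".split()]
vocab = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [to_sparse(d, vocab) for d in docs]
print(vocab, vectors)
```

Passing binary=True to to_sparse clamps every nonzero count to 1.0, which corresponds to the binary toggle described above.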

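Feature transformation

Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality; the example below shows how to split sentences into sequences of words.

RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter pattern is used as the delimiter for splitting the input text. Alternatively, users can set the parameter gaps to false, indicating that the regex pattern denotes the tokens themselves rather than the gaps between them; all matching occurrences are then returned as the tokenization result.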

from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps=False

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
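
The difference between the two modes of the gaps parameter can be sketched with Python's re module. This is an approximation only: RegexTokenizer uses Java regular expressions, and the function regex_tokenize below is a hypothetical helper whose to_lowercase and min_token_length arguments are modeled on the toLowercase and minTokenLength parameters.

```python
import re


def regex_tokenize(text, pattern, gaps=True, to_lowercase=True, min_token_length=1):
    """Split on the pattern (gaps=True) or collect its matches (gaps=False)."""
    if to_lowercase:
        text = text.lower()
    # Note: the pattern should not contain capture groups in this sketch.
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    # Drop empty strings and tokens shorter than the minimum length.
    return [t for t in tokens if len(t) >= min_token_length]


sentence = "Logistic,regression,models,are,neat"
by_gaps = regex_tokenize(sentence, r"\W")
by_tokens = regex_tokenize(sentence, r"\w+", gaps=False)
print(by_gaps, by_tokens)
```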

StopWordsRemover

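Stop words are words that should be excluded from the input, typically because they appear frequently yet carry little meaning.

StopWordsRemover takes a sequence of strings as input and drops all the stop words from it. The list of stop words is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords (unfortunately there is no Chinese stop word list). The boolean parameter caseSensitive indicates whether matching is case sensitive; it is not by default.

Assume we have the following DataFrame with columns id and raw:

id raw
0 [I, saw, the, red, balloon]
1 [Mary, had, a, little, lamb]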

Applying StopWordsRemover with raw as the input column gives the filtered column:

id raw filtered
0 [I, saw, the, red, balloon] [saw, red, balloon]
1 [Mary, had, a, little, lamb] [Mary, little, lamb]

from pyspark.ml.feature import StopWordsRemover

sentenceData = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)

n-gram

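An n-gram is a sequence of n tokens (typically words); the NGram class transforms input features into n-grams.

NGram takes a sequence of strings (for example, the output of a Tokenizer) as input; the parameter n determines the number of terms in each n-gram.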

from pyspark.ml.feature import NGram

wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
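
The sliding window behind n-grams is short enough to write out by hand. The sketch below is a plain-Python illustration (the function name ngrams is chosen for this example); each n-gram is a space-delimited string of n consecutive words, and an input with fewer than n words yields no output.

```python
def ngrams(words, n=2):
    """Return every run of n consecutive words as one space-delimited string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams(["Hi", "I", "heard", "about", "Spark"], n=2)
print(bigrams)
```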

Binarizer

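Binarization is the process of thresholding numerical features into binary (0/1) features.

Binarizer takes the common parameters inputCol and outputCol, plus a threshold for binarization: feature values greater than the threshold are binarized to 1.0, and values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.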

from pyspark.ml.feature import Binarizer

continuousDataFrame = spark.createDataFrame([
    (0, 0.1),
    (1, 0.8),
    (2, 0.2)
], ["id", "feature"])

binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")

binarizedDataFrame = binarizer.transform(continuousDataFrame)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()
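
The thresholding rule itself is a one-liner. A minimal plain-Python sketch (the helper name binarize is made up for this example) makes the boundary case explicit: a value exactly equal to the threshold maps to 0.0.

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


binarized = binarize([0.1, 0.8, 0.2], threshold=0.5)
print(binarized)
```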

PCA

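PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The PCA class trains a model that projects vectors to a lower-dimensional space; the example below shows how to project 5-dimensional feature vectors onto 3-dimensional principal components.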

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

PolynomialExpansion

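Polynomial expansion is the process of expanding features into a polynomial space, formed by the n-degree combinations of the original dimensions. The PolynomialExpansion class provides this functionality; the example below shows how to expand features into a 3-degree polynomial space.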

from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (Vectors.dense([2.0, 1.0]),),
    (Vectors.dense([0.0, 0.0]),),
    (Vectors.dense([3.0, -1.0]),)
], ["features"])

polyExpansion = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
polyDF = polyExpansion.transform(df)

polyDF.show(truncate=False)

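Discrete Cosine Transform

The Discrete Cosine Transform converts a length-N real-valued sequence in the time domain into another length-N real-valued sequence in the frequency domain; the DCT class provides this functionality.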

from pyspark.ml.feature import DCT
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (Vectors.dense([0.0, 1.0, -2.0, 3.0]),),
    (Vectors.dense([-1.0, 2.0, 4.0, -7.0]),),
    (Vectors.dense([14.0, -2.0, -5.0, 1.0]),)], ["features"])

dct = DCT(inverse=False, inputCol="features", outputCol="featuresDCT")

dctDf = dct.transform(df)

dctDf.select("featuresDCT").show(truncate=False)

StringIndexer

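StringIndexer encodes a string column of labels into a column of label indices, effectively a one-to-one mapping between strings and numbers. The indices are ordered by label frequency, so the most frequent label gets index 0. If the user chooses to keep unseen labels, they are placed at an extra index after the known labels. If the input column is numeric, it is cast to string and the string values are indexed.

Assume we have the following DataFrame with columns id and category: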

id category
0 a
1 b
2 c
3 a
4 a
5 c

category is a string column with three labels: 'a', 'b', and 'c'. Applying StringIndexer with category as the input column and categoryIndex as the output column gives:

id category categoryIndex
0 a 0.0
1 b 2.0
2 c 1.0
3 a 0.0
4 a 0.0
5 c 1.0

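'a' gets index 0 because it is the most frequent, followed by 'c' with index 1 and 'b' with index 2.

In addition, there are three strategies for handling labels that were not seen during fitting:

  • throw an exception (this is the default);
  • skip the rows that contain unseen labels;
  • put unseen labels in a special additional bucket, at the index following the known labels;

Going back to the previous example, this time we reuse the StringIndexer fitted above on the following DataFrame, where 'd' and 'e' are unseen labels: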

id category
0 a
1 b
2 c
3 d
4 e

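If you have not set how StringIndexer handles unseen labels, or have set it to 'error', an exception is thrown. If you set it to 'skip', the following result is generated: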

id category categoryIndex
0 a 0.0
1 b 2.0
2 c 1.0

Notice that the rows containing 'd' or 'e' are skipped.

If you set it to 'keep', the following result is generated:

id category categoryIndex
0 a 0.0
1 b 2.0
2 c 1.0
3 d 3.0
4 e 3.0

Notice that the unseen labels are all mapped to a single separate index, here 3.0.

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
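
The frequency ordering and the three strategies for unseen labels can be reproduced in plain Python. This is a sketch of the behavior shown in the tables above, not Spark's implementation: the helper names fit_labels and index_values are invented here, and the alphabetical tie-break is an assumption (the example data has no ties).

```python
from collections import Counter


def fit_labels(values):
    """Order distinct labels by descending frequency."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    return sorted(counts, key=lambda v: (-counts[v], v))


def index_values(values, labels, handle_invalid="error"):
    """Return (value, index) pairs; unseen labels follow handle_invalid."""
    index = {label: float(i) for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v in index:
            out.append((v, index[v]))
        elif handle_invalid == "keep":
            # All unseen labels share one extra index after the known ones.
            out.append((v, float(len(labels))))
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError("Unseen label: %s" % v)
    return out


labels = fit_labels(["a", "b", "c", "a", "a", "c"])
kept = index_values(["a", "b", "c", "d", "e"], labels, handle_invalid="keep")
skipped = index_values(["a", "b", "c", "d", "e"], labels, handle_invalid="skip")
print(labels, kept, skipped)
```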

IndexToString

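IndexToString can be seen as the inverse of StringIndexer, and a common use case is to pair it with StringIndexer.

Building on the StringIndexer example, assume we have the following DataFrame with columns id and categoryIndex, where categoryIndex was produced by StringIndexer: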

id categoryIndex
0 0.0
1 2.0
2 1.0
3 0.0
4 0.0
5 1.0

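Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we can retrieve our original labels (they are inferred from the column's metadata):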

id categoryIndex originalCategory
0 0.0 a
1 2.0 b
2 1.0 c
3 0.0 a
4 0.0 a
5 1.0 c

from pyspark.ml.feature import IndexToString, StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = indexer.fit(df)
indexed = model.transform(df)

print("Transformed string column '%s' to indexed column '%s'"
      % (indexer.getInputCol(), indexer.getOutputCol()))
indexed.show()

print("StringIndexer will store labels in output column metadata\n")

converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")
converted = converter.transform(indexed)

print("Transformed indexed column '%s' back to original string column '%s' using "
      "labels in metadata" % (converter.getInputCol(), converter.getOutputCol()))
converted.select("id", "categoryIndex", "originalCategory").show()

OneHotEncoder

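One-hot encoding maps a column of label indices to a column of binary vectors. This encoding allows algorithms that expect numeric features, such as logistic regression, to use categorical features.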

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

VectorIndexer

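VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert the original values to category indices. Specifically, it does the following:

  • takes an input column of type Vector and a parameter maxCategories;
  • decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical;
  • computes 0-based category indices for each categorical feature;
  • indexes the categorical features and transforms the original feature values to indices;

The example below reads in a dataset of labeled points and uses VectorIndexer to decide which features should be treated as categorical, transforming the categorical feature values to their indices. The transformed data can then be passed directly to algorithms such as DecisionTreeRegressor for training.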

from pyspark.ml.feature import VectorIndexer

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)
indexerModel = indexer.fit(data)

categoricalFeatures = indexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures), ", ".join(str(k) for k in categoricalFeatures.keys())))

# Create new column "indexed" with categorical values transformed to indices
indexedData = indexerModel.transform(data)
indexedData.show()

Interaction

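Interaction is a Transformer that takes vector or double-valued columns and generates a single vector column containing the product of all combinations of one value from each input column.

For example, if you have 2 vector columns, each with 3 dimensions, you get a 9-dimensional vector (3*3 combinations) as the output column.

Assume we have the following DataFrame with the columns id1, vec1 and vec2: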

id1 vec1 vec2
1 [1.0,2.0,3.0] [8.0,4.0,5.0]
2 [4.0,3.0,8.0] [7.0,9.0,8.0]
3 [6.0,1.0,9.0] [2.0,3.0,6.0]
4 [10.0,8.0,6.0] [9.0,4.0,5.0]
5 [9.0,2.0,7.0] [10.0,7.0,3.0]
6 [1.0,1.0,4.0] [2.0,8.0,4.0]

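Applying Interaction with those input columns (id1, vec1 and vec2, so each product also includes the id1 value) produces interactedCol as the output column: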

id1 vec1 vec2 interactedCol
1 [1.0,2.0,3.0] [8.0,4.0,5.0] [8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]
2 [4.0,3.0,8.0] [7.0,9.0,8.0] [56.0,72.0,64.0,42.0,54.0,48.0,112.0,144.0,128.0]
3 [6.0,1.0,9.0] [2.0,3.0,6.0] [36.0,54.0,108.0,6.0,9.0,18.0,54.0,81.0,162.0]
4 [10.0,8.0,6.0] [9.0,4.0,5.0] [360.0,160.0,200.0,288.0,128.0,160.0,216.0,96.0,120.0]
5 [9.0,2.0,7.0] [10.0,7.0,3.0] [450.0,315.0,135.0,100.0,70.0,30.0,350.0,245.0,105.0]
6 [1.0,1.0,4.0] [2.0,8.0,4.0] [12.0,48.0,24.0,12.0,48.0,24.0,48.0,192.0,96.0]

import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 10, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 10, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("id1", "id2", "id3", "id4", "id5", "id6", "id7")

val assembler1 = new VectorAssembler().
  setInputCols(Array("id2", "id3", "id4")).
  setOutputCol("vec1")

val assembled1 = assembler1.transform(df)

val assembler2 = new VectorAssembler().
  setInputCols(Array("id5", "id6", "id7")).
  setOutputCol("vec2")

val assembled2 = assembler2.transform(assembled1).select("id1", "vec1", "vec2")

val interaction = new Interaction()
  .setInputCols(Array("id1", "vec1", "vec2"))
  .setOutputCol("interactedCol")

val interacted = interaction.transform(assembled2)

interacted.show(truncate = false)
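
The rows of interactedCol can be checked by hand with a short Python sketch. It handles numeric inputs only (scalars and lists of numbers), which is an assumption of this example, and the function name interact is made up for the illustration.

```python
from itertools import product


def interact(*columns):
    """Multiply one value from each input, over all combinations, in order."""
    # Treat a scalar input as a one-element vector.
    vectors = [c if isinstance(c, (list, tuple)) else [c] for c in columns]
    out = []
    for combo in product(*vectors):
        value = 1.0
        for x in combo:
            value *= x
        out.append(value)
    return out


row1 = interact(1, [1.0, 2.0, 3.0], [8.0, 4.0, 5.0])
row2 = interact(2, [4.0, 3.0, 8.0], [7.0, 9.0, 8.0])
print(row1)
print(row2)
```

The second row shows why the table values are larger than the plain products of vec1 and vec2: every entry is also multiplied by id1 = 2.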

Normalizer

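Normalizer is a Transformer that transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes a parameter p, which specifies the p-norm used for normalization. This normalization can help standardize the input data and improve the behavior of learning algorithms.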

from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.5, -1.0]),),
    (1, Vectors.dense([2.0, 1.0, 1.0]),),
    (2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

# Normalize each Vector using $L^1$ norm.
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
l1NormData = normalizer.transform(dataFrame)
print("Normalized using L^1 norm")
l1NormData.show()

# Normalize each Vector using $L^\infty$ norm.
lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")})
print("Normalized using L^inf norm")
lInfNormData.show()
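
What "unit norm" means for the first row of the example can be worked out in plain Python. The helper normalize below is a hypothetical stand-in written for this sketch; leaving an all-zero vector unchanged is a choice made here to avoid dividing by zero.

```python
import math


def normalize(vector, p=2.0):
    """Divide a vector by its p-norm; p may be float('inf') for the max norm."""
    if math.isinf(p):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        # Choice made for this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]


l1 = normalize([1.0, 0.5, -1.0], p=1.0)
l_inf = normalize([1.0, 0.5, -1.0], p=float("inf"))
print(l1, l_inf)
```

With the L^1 norm the absolute values sum to 2.5, so the row becomes roughly [0.4, 0.2, -0.4]; with the L^inf norm the largest absolute value is already 1.0, so the row is unchanged.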

StandardScaler

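StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes the parameters:

  • withStd: True by default; scales the data to unit standard deviation;
  • withMean: False by default; centers the data with the mean before scaling. This produces a dense output, so take care when applying it to sparse input;

StandardScaler is an Estimator that can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset so that its features have unit standard deviation and/or zero mean.

Note: if the standard deviation of a feature is zero, the default value 0.0 is returned for that feature.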

from pyspark.ml.feature import StandardScaler

dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(dataFrame)

# Normalize each feature to have unit standard deviation.
scaledData = scalerModel.transform(dataFrame)
scaledData.show()

MinMaxScaler

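MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range, by default [0, 1]. It takes the parameters:

  • min: 0.0 by default; the lower bound of the range;
  • max: 1.0 by default; the upper bound of the range;

MinMaxScaler computes summary statistics on a dataset and produces a MinMaxScalerModel; the model can then transform each feature individually so that it lies in the given range.

The rescaled value of a feature value e_i of feature E is calculated as:
$$
Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} \times (max - min) + min
$$
Note: since zero values may be transformed to non-zero values, the output of the transformer is a dense vector even for sparse input.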

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(dataFrame)

# rescale each feature to range [min, max].
scaledData = scalerModel.transform(dataFrame)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaledFeatures").show()
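
The rescaling formula can be applied to a single feature column in a few lines of plain Python. The function min_max_scale is a hypothetical helper for this sketch; the branch for a constant column returns 0.5 * (max + min), which is the special case described in the Spark documentation for E_max == E_min and is not covered by the formula above.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column to [new_min, new_max] using the formula above."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant column: the formula would divide by zero.
        return [0.5 * (new_max + new_min) for _ in column]
    return [
        (e - e_min) / (e_max - e_min) * (new_max - new_min) + new_min
        for e in column
    ]


first_feature = min_max_scale([1.0, 2.0, 3.0])
second_feature = min_max_scale([0.1, 1.1, 10.1])
print(first_feature, second_feature)
```

Note that the scaling is done per feature (per column of the vectors), not per row.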

MaxAbsScaler

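MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift or center the data, and thus does not destroy any sparsity.

MaxAbsScaler computes summary statistics on a dataset and produces a MaxAbsScalerModel; the model can then transform each feature individually to the range [-1, 1].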

from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])

scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MaxAbsScalerModel
scalerModel = scaler.fit(dataFrame)

# rescale each feature to range [-1, 1].
scaledData = scalerModel.transform(dataFrame)

scaledData.select("features", "scaledFeatures").show()

Bucketizer

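Bucketizer performs binning: it transforms a column of continuous features into a column of feature buckets, where the buckets are specified by the user. It takes a parameter:

  • splits: the boundaries for mapping continuous values into buckets. With n+1 splits there are n buckets. A bucket defined by splits x, y holds values in the range [x, y), except for the last bucket, which also includes y. The splits must be strictly increasing. Negative and positive infinity must be provided explicitly to cover all values; otherwise, values outside the specified splits are treated as errors;

Note: if you do not know the upper and lower bounds of the target column, add negative and positive infinity as the first and last splits.

Note: the splits you provide must be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn.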

from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]

data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])

bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")

# Transform original data into its bucket index.
bucketedData = bucketizer.transform(dataFrame)

print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
bucketedData.show()
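
The mapping from a value to its bucket index follows directly from the rules for splits and can be sketched with the standard bisect module. The function bucketize is a hypothetical helper for this example; raising ValueError for NaN and out-of-range values is how this sketch models "treated as errors".

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value."""
    if value != value:
        raise ValueError("NaN is not handled in this sketch")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("Value %r is outside the splits" % value)
    if value == splits[-1]:
        # The last bucket also includes its upper bound.
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)


splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
data = [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]
buckets = [bucketize(x, splits) for x in data]
print(buckets)
```

Because each bucket is closed on the left, -0.5 and 0.0 fall into the bucket that starts at that split, not the one that ends there.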

ElementwiseProduct

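ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier, as in the following formula: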
$$
\begin{pmatrix}
v_1 \\
\vdots \\
v_N
\end{pmatrix} \circ \begin{pmatrix}
w_1 \\
\vdots \\
w_N
\end{pmatrix}
= \begin{pmatrix}
v_1 w_1 \\
\vdots \\
v_N w_N
\end{pmatrix}
$$

from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors

# Create some vector data; also works for sparse vectors
data = [(Vectors.dense([1.0, 2.0, 3.0]),), (Vectors.dense([4.0, 5.0, 6.0]),)]
df = spark.createDataFrame(data, ["vector"])
transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
                                 inputCol="vector", outputCol="transformedVector")
# Batch transform the vectors to create new column:
transformer.transform(df).show()
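
For the two vectors in the example, the element-wise product with the scaling vector [0.0, 1.0, 2.0] can be verified with a plain-Python sketch (the helper name elementwise_product is chosen for this illustration):

```python
def elementwise_product(vector, scaling_vec):
    """Multiply two equal-length vectors component by component."""
    if len(vector) != len(scaling_vec):
        raise ValueError("Vectors must have the same length")
    return [v * w for v, w in zip(vector, scaling_vec)]


scaling = [0.0, 1.0, 2.0]
products = [elementwise_product(v, scaling) for v in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])]
print(products)
```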

SQLTransformer

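SQLTransformer implements transformations that are defined by a SQL statement. Currently only SQL syntax like "SELECT ... FROM __THIS__ ..." is supported, where "__THIS__" represents the underlying table of the input dataset. Users can also apply Spark SQL built-in functions and UDFs to the selected columns. For example, SQLTransformer supports statements like:

  • SELECT a, a + b AS a_b FROM __THIS__
  • SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ WHERE a > 5
  • SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b

Assume we have the following DataFrame with columns id, v1 and v2: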

id v1 v2
0 1.0 3.0
2 2.0 5.0

Applying the statement "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__" gives the following result:

id v1 v2 v3 v4
0 1.0 3.0 4.0 3.0
2 2.0 5.0 7.0 10.0

from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()

VectorAssembler

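VectorAssembler is a Transformer that combines a given list of columns into a single vector column. It is typically used to combine raw features and features generated by other feature transformers; for model training, raw features of various types (numeric, boolean, and vector) usually need to be assembled with VectorAssembler before being fed to the model.

Assume we have the following data: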

id hour mobile userFeatures clicked
0 18 1.0 [0.0, 10.0, 0.5] 1.0

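The data contains integer, double, and vector columns; id and clicked should not be assembled. Applying VectorAssembler with hour, mobile and userFeatures as the input columns gives: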

id hour mobile userFeatures clicked features
0 18 1.0 [0.0, 10.0, 0.5] 1.0 [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the vector among the original features is flattened as well, which is very convenient.

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
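
The flattening behavior can be mimicked in plain Python with a dictionary standing in for one row. This is only a sketch: the function assemble is a hypothetical helper, and vectors are represented as ordinary lists.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns of one row into a flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector columns are flattened into the output.
            features.extend(float(x) for x in value)
        else:
            # Numeric and boolean columns become a single double.
            features.append(float(value))
    return features


row = {"id": 0, "hour": 18, "mobile": 1.0, "userFeatures": [0.0, 10.0, 0.5], "clicked": 1.0}
features = assemble(row, ["hour", "mobile", "userFeatures"])
print(features)
```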

QuantileDiscretizer

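QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features (the category index corresponds to the quantile). The number of bins is set by the numBuckets parameter; setting it to 100, for example, gives percentiles. The number of buckets actually used may be smaller than this value, for example when the input has too few distinct values.

NaN values: NaN values are removed from the column during QuantileDiscretizer fitting, which produces a Bucketizer model for making predictions. During transformation, Bucketizer raises an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values by setting the handleInvalid parameter. If the user chooses to keep NaN values, they are placed into a special additional bucket.

Algorithm: the bin ranges are chosen using an approximate algorithm. The precision of the approximation can be controlled with the relativeError parameter; when set to zero, exact quantiles are calculated (note that computing exact quantiles is an expensive operation). The lower and upper bin bounds are negative and positive infinity, covering all real values.

Assume we have the following DataFrame with columns id and hour: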

id hour
0 18.0
1 19.0
2 8.0
3 5.0
4 2.2

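hour is a continuous feature of Double type. We want to turn it into a categorical one; with numBuckets set to 3, that is, three buckets, we get the following DataFrame: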

id hour result
0 18.0 2.0
1 19.0 2.0
2 8.0 1.0
3 5.0 1.0
4 2.2 0.0

from pyspark.ml.feature import QuantileDiscretizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])

discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")

result = discretizer.fit(df).transform(df)
result.show()

Imputer

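Imputer completes missing values in a dataset, using either the mean or the median of the columns in which the missing values are located. The input columns should be of Float or Double type. Currently Imputer does not support categorical features and may create incorrect values for columns containing categorical features.

Note: all null values in the input columns are treated as missing, and so are also imputed.

Assume we have the following DataFrame with columns a and b: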

a b
1.0 Double.NaN
2.0 Double.NaN
Double.NaN 3.0
4.0 4.0
5.0 5.0

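In this example, Imputer replaces every Double.NaN with the mean of the other values in the corresponding column: the mean is 3.0 for column a and 4.0 for column b. After transformation, the NaN values in a and b are replaced by 3.0 and 4.0 respectively in the new columns: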

a b out_a out_b
1.0 Double.NaN 1.0 4.0
2.0 Double.NaN 2.0 4.0
Double.NaN 3.0 3.0 3.0
4.0 4.0 4.0 4.0
5.0 5.0 5.0 5.0

from pyspark.ml.feature import Imputer

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, float("nan")),
    (float("nan"), 3.0),
    (4.0, 4.0),
    (5.0, 5.0)
], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)

model.transform(df).show()
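
The surrogate values 3.0 and 4.0 can be reproduced with a small pure-Python sketch of mean imputation. The helper impute_mean is invented for this example; it treats both NaN and None as missing and computes the mean over the remaining values of the column.

```python
import math


def is_missing(x):
    """Treat None and NaN as missing values."""
    return x is None or math.isnan(x)


def impute_mean(column):
    """Replace missing entries with the mean of the non-missing entries."""
    present = [x for x in column if not is_missing(x)]
    if not present:
        raise ValueError("Column has no non-missing values")
    mean = sum(present) / len(present)
    return [mean if is_missing(x) else x for x in column]


nan = float("nan")
out_a = impute_mean([1.0, 2.0, nan, 4.0, 5.0])
out_b = impute_mean([nan, nan, 3.0, 4.0, 5.0])
print(out_a, out_b)
```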

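Feature selection

VectorSlicer

VectorSlicer is a Transformer that takes a feature vector and outputs a new feature vector containing a sub-array of the original features. It is useful for extracting features from a vector column.

VectorSlicer accepts a vector column with specified indices, then outputs a new vector column whose values are selected via those indices. There are two ways to specify the indices:

  • integer indices that represent positions in the vector, set with setIndices();
  • string indices that represent feature names in the vector, set with setNames(). This requires the vector column to have an AttributeGroup that matches each Attribute to a name;

Specifying by integer and by string are both acceptable, and the two can be used at the same time. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between the selected indices and names. Note that selecting a feature name that does not exist throws an exception.

The output vector orders the features selected by integer index first (in the order given), followed by the features selected by name (in the order given).

Assume we have a DataFrame with the column userFeatures: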

userFeatures
[0.0, 10.0, 0.5]

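userFeatures is a vector column containing three user features. Assume that the first column of userFeatures is all zeros, so we want to remove it and keep only the other two columns. Using setIndices(1, 2) gives the following result: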

userFeatures features
[0.0, 10.0, 0.5] [10.0, 0.5]

Assume also that the three columns of userFeatures are named ["f1", "f2", "f3"]; then setNames("f2", "f3") achieves the same result:

userFeatures features
[0.0, 10.0, 0.5] [10.0, 0.5]
["f1", "f2", "f3"] ["f2", "f3"]

from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

df = spark.createDataFrame([
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]))])

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])

output = slicer.transform(df)

output.select("userFeatures", "features").show()

RFormula

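RFormula selects columns specified by an R model formula. Currently a limited subset of the R operators is supported, including "~", ".", ":", "+", and "-":

  • ~ separates target and terms, similar to the equals sign in an equation;
  • + concatenates terms; "+ 0" means removing the intercept;
  • - removes a term; "- 1" means removing the intercept;
  • : interaction (multiplication for numeric values, or binarized categorical values);
  • . all columns except the target;

Suppose a and b are two columns; the following simple formulas illustrate the effect of RFormula:

  • y ~ a + b: the model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients;
  • y ~ a + b + a:b - 1: the model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients;

RFormula produces a vector column of features and a double or string column of labels. As when formulas are used in R for linear regression, string input columns are one-hot encoded and numeric columns are cast to doubles. If the label column is of type string, it is first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column is created from the response variable specified in the formula.

Assume we have a DataFrame with the columns id, country, hour, and clicked: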

id country hour clicked
7 "US" 18 1.0
8 "CA" 12 0.0
9 "NZ" 15 0.0

If we use RFormula with the formula "clicked ~ country + hour", which indicates that we want to predict clicked based on country and hour, we get the following DataFrame after transformation:

id country hour clicked features label
7 "US" 18 1.0 [0.0, 0.0, 18.0] 1.0
8 "CA" 12 0.0 [0.0, 1.0, 12.0] 0.0
9 "NZ" 15 0.0 [1.0, 0.0, 15.0] 0.0

from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

ChiSqSelector

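ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features, using the Chi-Squared test of independence to decide which features to choose. It supports five selection methods:

  • numTopFeatures: chooses a fixed number of top features according to the chi-squared test;
  • percentile: similar to numTopFeatures, but chooses a fraction of all features instead of a fixed number;
  • fpr: chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection;
  • fdr: chooses all features whose false discovery rate is below a threshold;
  • fwe: chooses all features whose p-values are below a threshold, where the threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection;

By default, the selection method is numTopFeatures, with the number of top features set to 50.

Assume we have a DataFrame with the columns id, features, and clicked, where clicked is the target to be predicted: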

id features clicked
7 [0.0, 0.0, 18.0, 1.0] 1.0
8 [0.0, 1.0, 12.0, 0.0] 0.0
9 [1.0, 0.0, 15.0, 0.1] 0.0

If we use ChiSqSelector with numTopFeatures = 1, then according to the label column clicked, the last column in features is chosen as the most useful feature:

id features clicked selectedFeatures
7 [0.0, 0.0, 18.0, 1.0] 1.0 [1.0]
8 [0.0, 1.0, 12.0, 0.0] 0.0 [0.0]
9 [1.0, 0.0, 15.0, 0.1] 0.0 [0.1]

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

result = selector.fit(df).transform(df)

print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()

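Locality Sensitive Hashing

PS: this article on LSH explains it well and is worth consulting.

LSH is an important class of hashing techniques, commonly used for clustering, approximate nearest neighbor search, and outlier detection on massive datasets.

The general idea is to use a family of functions (an "LSH family") to hash data points into buckets, so that similar points land in the same bucket with high probability, while dissimilar points are very likely to land in different buckets.

In a metric space (M, d), where M is a set and d is a distance function on M, an LSH family is a family of functions h that satisfy the following properties: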
$$
\begin{aligned}
&\forall p, q \in M, \\
&d(p,q) \leq r1 \Rightarrow Pr(h(p)=h(q)) \geq p1 \\
&d(p,q) \geq r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
\end{aligned}
$$
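Such an LSH family is called (r1, r2, p1, p2)-sensitive.

In Spark, different LSH families are implemented in separate classes (for example, MinHash), and each class provides APIs for feature transformation, approximate similarity join, and approximate nearest neighbor search.

LSH Operations

This section describes the major types of operations that LSH can be used for; a fitted LSH model has methods for each of these operations.

Feature Transformation

Feature transformation is the basic functionality: it adds hashed values as a new column of the dataset, which can be useful for dimensionality reduction. Users can specify the input and output columns with inputCol and outputCol.

LSH also supports multiple LSH hash tables; users can specify the number of hash tables with numHashTables. This is also used for OR-amplification in approximate similarity join and approximate nearest neighbor search. Increasing the number of hash tables increases the accuracy but also increases the running time and communication cost.

The type of outputCol is Seq[Vector], where the dimension of the array equals numHashTables and the dimension of the vectors is currently set to 1. In future releases, AND-amplification will be implemented so that users can specify the dimension of these vectors.

Approximate Similarity Join

Approximate similarity join takes two datasets and approximately returns pairs of rows whose distance is smaller than a user-defined threshold. It supports both joining two different datasets and self-joining; self-joining produces some duplicate pairs.

Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it is transformed automatically; in this case, the hash signature is created as outputCol.

In the joined dataset, the original datasets can be queried in datasetA and datasetB. A distance column is added to the output dataset, containing the true distance between each pair of rows returned.

Approximate Nearest Neighbor Search

Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and approximately returns a specified number of rows in the dataset that are closest to the key.

Approximate nearest neighbor search also accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it is transformed automatically; in this case, the hash signature is created as outputCol.

A distance column is added to the output dataset to show the distance between each output row and the key.

Note: approximate nearest neighbor search returns fewer than the specified number of rows when there are not enough candidates in the hash bucket.

LSH Algorithms

LSH algorithms generally correspond one-to-one with distance measures: each distance measure (for example Euclidean distance or cosine distance) has its own LSH algorithm (that is, its own hash function).

Bucketed Random Projection for Euclidean Distance

Bucketed Random Projection is an LSH family for Euclidean distance, which is defined as follows: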
$$
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
$$
Its LSH family projects a feature vector x onto a random unit vector v and divides the projected results into hash buckets:
$$
h(\mathbf{x}) = \Big\lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \Big\rfloor
$$
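where r is a user-defined bucket length. The bucket length can be used to control the average size of the hash buckets; a larger bucket length increases the probability of features being hashed to the same bucket (increasing the numbers of both true and false positives).

Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.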
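
The hash function above is simple enough to try by hand. The sketch below is a plain-Python illustration and not Spark's implementation: the helper names random_unit_vector and brp_hash are made up, the random direction is drawn from Gaussian components (an assumption of this example), and the seed is fixed only to make the run repeatable.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction and scale it to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brp_hash(x, v, bucket_length):
    """h(x) = floor((x . v) / r) for bucket length r."""
    dot = sum(xi * vi for xi, vi in zip(x, v))
    return math.floor(dot / bucket_length)


rng = random.Random(42)
# One random projection per hash table, here 3 tables in 2 dimensions.
projections = [random_unit_vector(2, rng) for _ in range(3)]
hashes = [brp_hash([1.0, 1.0], v, 2.0) for v in projections]
print(hashes)
```

With a fixed direction v = (1, 0) and r = 2.0, the point (1, 1) projects to 0.5 and lands in bucket 0, while (-1, -1) projects to -0.5 and lands in bucket -1; a larger r would put both in the same bucket.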

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = spark.createDataFrame(dataA, ["id", "features"])

dataB = [(4, Vectors.dense([1.0, 0.0]),),
         (5, Vectors.dense([-1.0, 0.0]),),
         (6, Vectors.dense([0.0, 1.0]),),
         (7, Vectors.dense([0.0, -1.0]),)]
dfB = spark.createDataFrame(dataB, ["id", "features"])

key = Vectors.dense([1.0, 0.0])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=2.0,
                                  numHashTables=3)
model = brp.fit(dfA)

# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
print("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).show()

# Compute the locality sensitive hashes for the input rows, then perform approximate nearest
# neighbor search.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`
print("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
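
MinHash for Jaccard Distance

MinHash is an LSH family for Jaccard distance where the input features are sets of natural numbers. The Jaccard distance of two sets is defined by the cardinality of their intersection and union: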
$$
d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
$$
MinHash applies a random hash function g to each element in the set and takes the minimum of all hashed values:
$$
h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
$$
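The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. Both sparse and dense vectors are supported, but sparse vectors are recommended for efficiency. For example, Vectors.sparse(10, Array[(2,1.0),(3,1.0),(5,1.0)]) means there are 10 elements in the space and the set contains elements 2, 3, and 5. All non-zero values are treated as binary "1" values.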

from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
dfA = spark.createDataFrame(dataA, ["id", "features"])

dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
         (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
         (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
dfB = spark.createDataFrame(dataB, ["id", "features"])

key = Vectors.sparse(6, [1, 3], [1.0, 1.0])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
print("Approximately joining dfA and dfB on distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, distCol="JaccardDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()

# Compute the locality sensitive hashes for the input rows, then perform approximate nearest
# neighbor search.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`
# It may return less than 2 rows when not enough approximate near-neighbor candidates are
# found.
print("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
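
Both formulas in this section can be tried out in plain Python. In the sketch below the sets are ordinary Python sets of element indices; the hash functions g(a) = ((1 + a) * c1 + c2) mod PRIME, the constant PRIME, and the helper names are assumptions made for the illustration, not a statement about how MinHashLSH is implemented.

```python
import random

PRIME = 2038074743  # a large prime used as the modulus in this sketch


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two non-empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Both sets are empty")
    return 1.0 - len(a & b) / len(union)


def make_hash_functions(num_tables, seed=0):
    """Draw one random (c1, c2) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_tables)]


def min_hash(elements, hash_functions):
    """For each hash function, keep the minimum hashed value over the set."""
    return [min(((1 + e) * c1 + c2) % PRIME for e in elements) for c1, c2 in hash_functions]


hash_functions = make_hash_functions(num_tables=5)
signature_a = min_hash({0, 1, 2}, hash_functions)
signature_b = min_hash({1, 2, 4}, hash_functions)
distance = jaccard_distance({0, 1, 2}, {1, 2, 4})
print(signature_a, signature_b, distance)
```

The sets {0, 1, 2} and {1, 2, 4} correspond to row 0 of dfA and row 5 of dfB; they share two of four distinct elements, so their Jaccard distance is 0.5, below the 0.6 threshold used in the join above.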

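Finally

You are welcome to browse my GitHub for anything else you might need. At the moment it mainly holds my own machine learning projects, assorted Python scripts and tools, some fun small projects, plus the developers I follow and the projects I have forked: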
https://github.com/NemoHoHaloAi
