39 Machine Learning Libraries for Spark (English)
Apache Spark itself
1. MLlib
AMPLab
Spark originally came out of the Berkeley AMPLab, and even today AMPLab projects, even though they are not part of the Apache Spark project, enjoy a status a bit above that of your everyday GitHub project.
ML Base
Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.
2. MLI
3. ML Optimizer (aka Ghostface)
Ghostface was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and it is included only due to its AMPLab and ML Base status.
Other than ML Base
4. Splash
A recent project from June, 2015, this set of stochastic learning algorithms claims 25x - 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.
5. Keystone ML
Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.
6. Velox
A server to manage a large collection of machine learning models.
7. CoCoA
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".
Frameworks
GPU-based
8. DeepLearning4j
I previously blogged about this in "DeepLearning4j Adds Spark GPU Support".
9. Elephas
Brand new and frankly why I started this list for this blog post. Provides an interface to Keras.
Non-GPU-based
10. DistML
Parameter server for model-parallel rather than data-parallel (as Spark's MLlib is).
11. Aerosolve
From Airbnb, used in their automated pricing
12. Zen
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.
Interfaces to other Machine Learning systems
14. spark-corenlp
Wraps Stanford CoreNLP.
15. Sparkit-learn
Interface to Python's Scikit-learn
16. Sparkling Water
Interface to H2O
17. hivemall-spark
Wraps Hivemall, machine learning in Hive
18. spark-pmml-exporter-validator
Export PMML, an industry standard XML format for transporting machine learning models.
Add-ons that enhance MLlib's existing algorithms
19. MLlib-dropout
Adds dropout capability to Spark MLlib, based on the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".
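To make the dropout idea concrete, here is a minimal sketch in plain Python. It is not the MLlib-dropout API: the function name `dropout` and its parameters are hypothetical, and it uses the common "inverted dropout" variant (rescaling at training time) rather than the test-time weight scaling described in the paper.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout (hypothetical helper, not part of MLlib-dropout).

    During training, each activation is zeroed with probability drop_prob
    and the survivors are divided by the keep probability, so the expected
    value of every unit is unchanged. At inference time the input passes
    through untouched.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

Passing in a seeded `random.Random` instance keeps the mask reproducible, which helps when comparing runs.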
20. generalized-kmeans-clustering
Adds arbitrary distance functions to K-Means
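A rough sketch of what "arbitrary distance functions" means for K-Means, in plain Python rather than the library's own API. All names here (`kmeans`, `manhattan`, `squared_euclidean`) are made up for illustration. As a simplifying assumption, only the assignment step uses the pluggable distance; the update step still takes the coordinate-wise mean, which is the exact minimizer for squared Euclidean distance (and other Bregman divergences) but only an approximation for a distance such as Manhattan.

```python
def manhattan(p, q):
    """L1 distance between two equal-length points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def squared_euclidean(p, q):
    """Squared L2 distance, the distance classic K-Means uses."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, distance, iterations=10):
    """Lloyd-style iteration where the assignment step uses `distance`."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: coordinate-wise mean; an empty cluster keeps its center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centers)
        ]
    return centers
```

Swapping `manhattan` for `squared_euclidean` changes which center each point is assigned to without touching the rest of the loop.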
Visualize the Streaming Machine Learning algorithms built into Spark MLlib
Algorithms
Supervised learning
22. spark-libFM
Factorization Machines
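For readers unfamiliar with Factorization Machines, the second-order model scores an input as a bias, plus a linear term, plus pairwise feature interactions whose weights are dot products of per-feature latent vectors. The sketch below is plain Python, not spark-libFM code; `fm_predict` and its argument names are hypothetical. It uses the standard identity that lets the pairwise term be computed in time linear in the number of features.

```python
def fm_predict(x, w0, w, v):
    """Second-order Factorization Machine prediction (illustrative only).

    x  : list of feature values
    w0 : global bias
    w  : list of linear weights, one per feature
    v  : list of latent vectors, v[i] belongs to feature i
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        # sum over i<j of v[i][f]*v[j][f]*x[i]*x[j]
        #   = 0.5 * ((sum_i v[i][f]*x[i])^2 - sum_i (v[i][f]*x[i])^2)
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```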
23. ScalaNetwork
Recursive Neural Networks (RNNs)
24. dissolve-struct
SVM based on the performant Spark communication framework CoCoA listed above.
25. Sparkling Ferns
Based on Image Classification using Random Forests and Ferns
26. streaming-matrix-factorization
Matrix Factorization Recommendation System
Unsupervised learning
27. PatchWork
40x faster clustering than Spark MLlib K-Means
28. Bisecting K-Means Clustering
K-Means that produces more uniformly-sized clusters, based on A Comparison of Document Clustering Techniques
29. spark-knn-graphs
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)
30. TopicModeling
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)
Algorithm building blocks
31. sparkboost
AdaBoost and MP-Boost
32. spark-tfocs
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.
33. lazy-linalg
Linear algebra operators to work with Spark MLlib's linalg package
Feature extractors
34. spark-infotheoretic-feature-selection
Information-theoretic basis for feature selection, based on Conditional likelihood maximisation: a unifying framework for information theoretic feature selection
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogeneous in terms of data classes. This is a foundational idea behind the CART and ID3 algorithms for generating decision trees. Based on the paper "Multi-interval discretization of continuous-valued attributes for classification learning".
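The core step of this kind of discretization can be sketched in a few lines of plain Python. This is only an illustration with made-up names (`entropy`, `best_cut`), not the library's code: it finds the single cut point that minimizes the weighted class entropy of the two resulting bins, and it leaves out the recursive splitting and the MDL-based stopping rule that the paper adds on top.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split,
    or None if all values are identical and no split is possible."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # A cut is only possible between two distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, score)
    return best
```

Applying `best_cut` again to each side of the returned cut point would give the multi-interval version, assuming some stopping criterion is supplied.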
36. spark-tsne
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.
37. modelmatrix
Sparse feature vectors
Domain-specific
38. Spatial and time-series data
K-Means, Regression, and Statistics
39. Twitter data