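Data Mining (5): Mining Association Rules on Massive Data with Mahout

Published 2015-08-29

The previous post covered association rule mining with the open-source data mining tool Weka. Weka is convenient and practical, but it cannot handle large datasets: once the data no longer fits in memory, no amount of extra time helps. That calls for distributed computing. Mahout is an open-source distributed data mining project built on Hadoop (the word "mahout" originally means a person who rides an elephant). With the basic association rule algorithms and their usage in hand, plus distributed mining, you can handle most routine association rule mining work; in practice what remains is understanding the business and the data.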

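Installing Mahout

A rider needs a sturdy elephant. This post does not cover the elephant, Hadoop, so I assume Hadoop is already installed; for installation instructions, search Google.

Download Mahout 0.8 from the Apache website.

Extract the archive: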

tar -zxvf mahout-distribution-0.8.tar.gz

Move it into place:

sudo mv mahout-distribution-0.8 /usr/local/mahout-8

Configure the environment:

sudo gedit /etc/profile

Add the following lines:

export MAHOUT_HOME=/usr/local/mahout-8
export PATH=$MAHOUT_HOME/bin:$PATH
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH


Log out and back in so that the profile takes effect, then run mahout -version to check that the installation succeeded.
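If the mahout command is not found after logging back in, a quick check of the environment can help. The following is a minimal sketch in Python, assuming the two variables and the bin directories from the profile lines above; the function name check_mahout_env is hypothetical and not part of Mahout or Hadoop.

```python
def check_mahout_env(env):
    """Return a list of problems found in an environment mapping.

    Assumes a Linux-style PATH separated by ':' and the MAHOUT_HOME /
    HADOOP_HOME variables used in this post.
    """
    problems = []
    path_dirs = env.get("PATH", "").split(":")
    for var in ("MAHOUT_HOME", "HADOOP_HOME"):
        home = env.get(var)
        if not home:
            problems.append(f"{var} is not set")
            continue
        # Each home directory should contribute its bin/ directory to PATH.
        bin_dir = home.rstrip("/") + "/bin"
        if bin_dir not in path_dirs:
            problems.append(f"{bin_dir} is not on PATH")
    return problems
```

Calling check_mahout_env(dict(os.environ)) after importing os inspects the current shell environment; an empty list means both variables are set and both bin directories are on PATH.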

Preparing the data

Download the market-basket dataset retail.dat from http://fimi.ua.ac.be/data/.

Upload it to the Hadoop file system:

hadoop fs -mkdir /user/hadoop/mahoutData # create the directory
hadoop fs -put ~/data/retail.dat /user/hadoop/mahoutData

Running the FP-Growth algorithm

mahout fpg -i /user/hadoop/mahoutData/retail.dat -o patterns -method mapreduce -s 1000 -regex '[\ ]'

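-i is the input path, -o is the output path, and -s is the minimum support, given here as an absolute count (1000 transactions). -regex '[\ ]' says that the items within each line are separated by spaces.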
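To make the splitting regex and the minimum support concrete, here is a small Python sketch. It is a brute-force count over a hypothetical toy dataset, not the FP-Growth algorithm that Mahout runs; the sample transactions and the function names are assumptions made only for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, one basket per line, items separated by spaces.
SAMPLE = """39 48 38
39 48
39 41
48 38
39 48 41"""

def count_itemsets(text, max_size=2):
    """Count how many transactions contain each itemset of up to max_size items."""
    counts = Counter()
    for line in text.splitlines():
        # Same separator pattern as the -regex option: a single space.
        items = sorted({tok for tok in re.split(r"[\ ]", line.strip()) if tok})
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts

def frequent_itemsets(text, min_support, max_size=2):
    """Keep only itemsets whose absolute count reaches min_support."""
    counts = count_itemsets(text, max_size)
    return {s: c for s, c in counts.items() if c >= min_support}
```

With this toy data, frequent_itemsets(SAMPLE, 3) keeps the single items 39 and 48 (four baskets each) and the pair (39, 48) (three baskets). Enumerating every combination like this becomes infeasible on large data, which is why a dedicated algorithm such as FP-Growth is used.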

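The job finishes in a minute or two. Because the output path patterns is relative, it resolves under the user's HDFS home directory, which is /user/hadoop here. The generated files are serialized, so viewing them directly shows garbage; use Mahout to dump them back to text:

mahout seqdumper -i /user/hadoop/patterns/fpgrowth/part-r-00000 -o ~/data/patterns.txt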

Output:

Key: 39: Value: ([39],50675)
Key: 48: Value: ([48],42135), ([39, 48],29142)
Key: 38: Value: ([38],15596), ([39, 38],10345), ([48, 38],7944), ([39, 48, 38],6102)
Key: 32: Value: ([32],15167), ([39, 32],8455), ([48, 32],8034), ([39, 48, 32],5402), ([38, 32],2833), ([39, 38, 32],1840), ([48, 38, 32],1646), ([39, 48, 38, 32],1236)
Key: 41: Value: ([41],14945), ([39, 41],11414), ([48, 41],9018), ([39, 48, 41],7366), ([38, 41],3897), ([32, 41],3196), ([39, 38, 41],3051), ([48, 38, 41],2374), ([39, 32, 41],2359), ([48, 32, 41],2063), ([39, 48, 38, 41],1991), ([39, 48, 32, 41],1646)
Key: 65: Value: ([65],4472), ([39, 65],2787), ([48, 65],2529), ([39, 48, 65],1797)
Key: 89: Value: ([89],3837), ([48, 89],2798), ([39, 89],2749), ([39, 48, 89],2125)
Key: 225: Value: ([225],3257), ([39, 225],2351), ([48, 225],1736), ([39, 48, 225],1400)
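For further processing, the dumped text can be read back into a dictionary of support counts. The sketch below is written in Python and assumes the line format shown in this output; the regular expression and the name parse_patterns are my own and are not part of Mahout.

```python
import re

# Matches one "([39, 48],29142)" group: the item list, then the support count.
PATTERN = re.compile(r"\(\[([^\]]*)\],(\d+)\)")

def parse_patterns(lines):
    """Map each frequent itemset (as a frozenset of ints) to its support count."""
    supports = {}
    for line in lines:
        # Skip anything that is not a "Key: ...: Value: ..." record.
        if not line.startswith("Key:"):
            continue
        for items, count in PATTERN.findall(line):
            itemset = frozenset(int(item) for item in items.split(","))
            supports[itemset] = int(count)
    return supports
```

For example, passing the file opened with open("patterns.txt") to parse_patterns would yield entries such as frozenset({39, 48}) mapped to 29142. An itemset that appears under several keys simply overwrites itself with the same count.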


This output contains only the frequent itemsets, but extracting association rules from them is straightforward.
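One common way to do that is to compute, for each frequent itemset, the confidence of every rule "antecedent implies the rest", where confidence is the support of the whole itemset divided by the support of the antecedent. The following Python sketch illustrates the idea on a few counts copied from the output above; the function name rules and the 0.6 threshold are assumptions chosen for the example, not something Mahout provides.

```python
from itertools import combinations

# Support counts copied from the seqdumper output for items 39, 48 and 38.
SUPPORTS = {
    frozenset([39]): 50675,
    frozenset([48]): 42135,
    frozenset([38]): 15596,
    frozenset([39, 48]): 29142,
    frozenset([39, 38]): 10345,
    frozenset([48, 38]): 7944,
    frozenset([39, 48, 38]): 6102,
}

def rules(supports, min_confidence):
    """Return (antecedent, consequent, confidence) tuples, best confidence first."""
    found = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                if antecedent not in supports:
                    continue  # support of the antecedent is unknown
                confidence = count / supports[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return sorted(found, key=lambda rule: -rule[2])
```

With a threshold of 0.6, this yields three rules: {38, 48} implies 39 (confidence about 0.77), 48 implies 39 (about 0.69), and 38 implies 39 (about 0.66). The rule 39 implies 48 falls just below the threshold at about 0.58.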
