Use movie viewing records to make movie recommendations.
Preparation
1. Task description:
There is a well-known open dataset in the recommendation field, available at http://grouplens.org/datasets/movielens/ . It contains three data files, ratings.dat, users.dat, and movies.dat; see README.txt for details.
Write a program that joins ratings.dat and movies.dat to produce the list of movies whose average rating is above 4.0. The dataset used here is ml-1m.
2. Data download
After downloading the archive (about 5.64 MB) and extracting it, you get four files: movies.dat, ratings.dat, README, and users.dat.
3. Sample data
Sample of movies.dat:
MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
13::Balto (1995)::Animation|Children's
14::Nixon (1995)::Drama
15::Cutthroat Island (1995)::Action|Adventure|Romance
16::Casino (1995)::Drama|Thriller
17::Sense and Sensibility (1995)::Drama|Romance
18::Four Rooms (1995)::Thriller
19::Ace Ventura: When Nature Calls (1995)::Comedy
20::Money Train (1995)::Action
Sample of ratings.dat:
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
1::938::4::978301752
1::2398::4::978302281
1::2918::4::978302124
1::1035::5::978301753
1::2791::4::978302188
1::2687::3::978824268
1::2018::4::978301777
1::3105::5::978301713
1::2797::4::978302039
Implementation
Place the data you just downloaded into the project (the layout can be read off the paths used below) and create the main program, movie.scala.
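All the snippets below are assumed to live inside a single program; here is a minimal sketch of what movie.scala might look like (the package and object names are placeholders of mine, not taken from the original project):

package day_20200425

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object Movie {
  def main(args: Array[String]): Unit = {
    // paste the numbered snippets from the steps below here, in order
  }

  // the deleteDir(...) helper from the troubleshooting section goes here as well
}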
1. Set the input and output paths
An array holds the input and output paths, which makes them easy to change and reuse later.
val files = Array("src/main/java/day_20200425/data/movies.dat",
  "src/main/java/day_20200425/data/ratings.dat",
  "src/main/java/day_20200425/output")
2. Configure Spark
val conf = new SparkConf().setAppName("SparkJoin").setMaster("local")
val sc = new SparkContext(conf)
3. Read the ratings file
Read ratings.dat, split each line on ::, pick out the MovieID and Rating fields, and then compute the average rating of each movie.
// Read the ratings file
val textFile = sc.textFile(files(1))

// Extract (movieId, rating) from each line
val rating = textFile.map(line => {
  val fields = line.split("::")
  (fields(1).toInt, fields(2).toDouble)
})

// Compute (movieId, averageRating)
val movieScores = rating
  .groupByKey()
  .map(data => {
    val avg = data._2.sum / data._2.size
    (data._1, avg)
  })
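As an aside, groupByKey collects every rating of a movie into memory before averaging; on larger data the same averages can be computed with reduceByKey over (sum, count) pairs. A sketch (the name movieScores2 is mine):

// (movieId, rating) -> (movieId, (sum, count)) -> (movieId, average)
val movieScores2 = rating
  .mapValues(r => (r, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }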
4. Read the movies file
Both RDDs are keyed by MovieID, so the join performed in the next step yields records of the form (MovieID, ((MovieID, AvgRating), (MovieID, MovieName))).
RDD's keyBy(func) wraps every element into a <key, value> pair, with the key produced by func and the original element kept as the value. It is handy when records have many columns and you want to key (and later join or sort) them by one or more of those columns without throwing the rest of the record away, which is why it is used here.
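A tiny illustration of keyBy, with made-up data:

// keyBy(f) maps each element x to the pair (f(x), x)
val demo = sc.parallelize(Seq((1, "Toy Story (1995)"), (2, "Jumanji (1995)")))
val keyed = demo.keyBy(tup => tup._1)
// keyed now contains (1, (1, "Toy Story (1995)")) and (2, (2, "Jumanji (1995)"))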
val movies = sc.textFile(files(0))
val movieskey = movies.map(line => {
  val fields = line.split("::")
  (fields(0).toInt, fields(1)) // (MovieID, MovieName)
}).keyBy(tup => tup._1)
5. Save the results
Keep only the movies whose average rating is greater than 4.0 and save them.
val result = movieScores
  .keyBy(tup => tup._1)                     // (MovieID, (MovieID, AvgRating))
  .join(movieskey)                          // (MovieID, ((MovieID, AvgRating), (MovieID, MovieName)))
  .filter(f => f._2._1._2 > 4.0)            // keep only average rating > 4.0
  .map(f => (f._1, f._2._1._2, f._2._2._2)) // (MovieID, AvgRating, MovieName)
// .foreach(s => println(s))

// Delete the output directory if it already exists (see Problem 1 below)
val file = new File(files(2))
if (file.exists()) {
  deleteDir(file)
}

result.saveAsTextFile(files(2))
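For comparison, the same join can also be written with Spark SQL DataFrames. This is only a rough sketch under the assumption that spark-sql is on the classpath; every name below is mine, not from the original code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("SparkJoinSQL").master("local").getOrCreate()
import spark.implicits._

val ratingsDF = spark.read.textFile(files(1))
  .map { line => val f = line.split("::"); (f(1).toInt, f(2).toDouble) }
  .toDF("movieId", "rating")

val moviesDF = spark.read.textFile(files(0))
  .map { line => val f = line.split("::"); (f(0).toInt, f(1)) }
  .toDF("movieId", "title")

// movies with average rating above 4.0, joined with their titles
val goodMovies = ratingsDF
  .groupBy("movieId")
  .agg(avg($"rating").as("avgRating"))
  .filter($"avgRating" > 4.0)
  .join(moviesDF, "movieId")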
6. Results
An output folder is generated automatically; it contains four files. _SUCCESS is an empty marker file that simply indicates the job succeeded, and part-00000 holds the data we need.
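To spot-check the result without opening part-00000 by hand, a line like the following can be added after result is built (a quick sketch):

result.take(10).foreach(println)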
Partial results:
(1084,4.096209912536443,Bonnie and Clyde (1967))
(3007,4.013559322033898,American Movie (1999))
(2493,4.142857142857143,Harmonists, The (1997))
(3517,4.5,Bells, The (1926))
(1,4.146846413095811,Toy Story (1995))
(1780,4.125,Ayn Rand: A Sense of Life (1997))
(2351,4.207207207207207,Nights of Cabiria (Le Notti di Cabiria) (1957))
(759,4.101694915254237,Maya Lin: A Strong Clear Vision (1994))
(1300,4.1454545454545455,My Life as a Dog (Mitt liv som hund) (1985))
(1947,4.057818659658344,West Side Story (1961))
(2819,4.040752351097178,Three Days of the Condor (1975))
(162,4.063136456211812,Crumb (1994))
(1228,4.1875923190546525,Raging Bull (1980))
(1132,4.259090909090909,Manon of the Spring (Manon des sources) (1986))
(306,4.227544910179641,Three Colors: Red (1994))
(2132,4.074074074074074,Who's Afraid of Virginia Woolf? (1966))
(720,4.426940639269406,Wallace & Gromit: The Best of Aardman Animation (1996))
(2917,4.031746031746032,Body Heat (1981))
(1066,4.1657142857142855,Shall We Dance? (1937))
(2972,4.015384615384615,Red Sorghum (Hong Gao Liang) (1987))
Problems you may encounter
Problem 1: the output directory already exists
Description:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/D:/Projects/JAVA/Scala/src/main/java/day_20200425/data/output already exist
Analysis: the output directory was left over from a previous run, so it has to be removed before writing again.
Solution: either delete the directory manually, or add the following code.
1. In the main program
val file = new File(files(2))
if (file.exists()) {
  deleteDir(file)
}
2. The delete function
/**
 * Delete a directory and all of its contents.
 * (adapted from https://www.cnblogs.com/honeybee/p/6831346.html)
 *
 * @param dir the directory to delete
 */
def deleteDir(dir: File): Unit = {
  val files = dir.listFiles()
  files.foreach(f => {
    if (f.isDirectory) {
      deleteDir(f)
    } else {
      f.delete()
      println("delete file " + f.getAbsolutePath)
    }
  })
  dir.delete()
  println("delete dir " + dir.getAbsolutePath)
}
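Alternatively, the output directory can be removed through Hadoop's FileSystem API, which also works when the output lives on HDFS rather than the local disk; a sketch:

import org.apache.hadoop.fs.{FileSystem, Path}

// recursively delete the output directory if it exists
val fs = FileSystem.get(sc.hadoopConfiguration)
val outPath = new Path(files(2))
if (fs.exists(outPath)) {
  fs.delete(outPath, true) // true = recursive
}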
Problem 2: missing Hadoop environment variable
Description:
ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException
Analysis:
This happens when no Hadoop environment (winutils) is configured on Windows.
Solution:
Download https://github.com/amihalik/hadoop-common-2.6.0-bin, add its bin directory to the system Path environment variable, and then add the following line to the code. For example, my directory is E:\\Program\\hadoop\\hadoop-common-2.6.0-bin, so I need to add:
System.setProperty("hadoop.home.dir", "E:\\Program\\hadoop\\hadoop-common-2.6.0-bin")