Spark Secondary Sort Implementation

Posted by 破棉襖 on 2015-11-09

Ways to implement secondary sort in Spark, recorded here for future reference.


1. Test file testsortTwo:
  [root@tongji ~]# hadoop fs -cat /user/wzx/testsortTwo
  1444697637.786 180.175.251.34 wv.88mf.com _trackClick 174|139||-17718753 0000436cc2ad45bb8df6a70bd09e146f
  1444695603.085 218.22.168.122 wv.17mf.com _trackPageview 0002a9ed7d754a08957912700e36d731
  1444696305.588 106.110.49.210 wv.88mf.com _trackPageview 00034c9597df47b6a3041635334daa3d
  1444696305.588 221.2.101.146 wv.77mf.com _trackMover 446,650|492,635|520,629 000364344c8649f8bdbf66dba76f8ed1
  1444695543.619 120.193.187.66 c.mfniu.com _trackPageview 00042d7207ee4eb29d1604c724629182
  1444697033.836 183.54.102.45 c.mfniu.com _trackPageview 000436b51eb844aa9e002ff62c21168c
  1444696305.588 58.215.136.139 wv.88mf.com _trackPageview 00051113efbf4ae1a805b2bf262ca26d
  1444697308.329 61.164.41.227 wv.17mf.com _trackPageview 00054c1620814bfcaeba6a88d6b3c54c


Spark code (sort by the first column first, then by the third column):
  val text = sc.textFile("/user/wzx/testsortTwo")
  // Build a composite key (timestamp, domain); the value keeps the remaining
  // fields. Rows vary in length, so each branch emits a different value shape.
  val rdd1 = text.map(x => x.split(" ")).map { x =>
    val len = x.length
    if (len == 5) {
      ((x(0), x(2)), (x(1), x(3), x(4)))
    } else if (len == 6) {
      ((x(0), x(2)), (x(1), x(3), x(4), x(5)))
    } else {
      ((x(0), x(2)), x(1))
    }
  }
  // sortByKey uses the implicit lexicographic Ordering for the key tuple,
  // so records sort by timestamp first, then by domain.
  val rdd2 = rdd1.groupByKey().sortByKey()
  rdd2.collect()


Result:

  ((1444695543.619,c.mfniu.com),CompactBuffer((120.193.187.66,_trackPageview,00042d7207ee4eb29d1604c724629182))),
  ((1444695603.085,wv.17mf.com),CompactBuffer((218.22.168.122,_trackPageview,0002a9ed7d754a08957912700e36d731))),
  ((1444696305.588,wv.77mf.com),CompactBuffer((221.2.101.146,_trackMover,446,650|492,635|520,629,000364344c8649f8bdbf66dba76f8ed1))),
  ((1444696305.588,wv.88mf.com),CompactBuffer((106.110.49.210,_trackPageview,00034c9597df47b6a3041635334daa3d), (58.215.136.139,_trackPageview,00051113efbf4ae1a805b2bf262ca26d))),
  ((1444697033.836,c.mfniu.com),CompactBuffer((183.54.102.45,_trackPageview,000436b51eb844aa9e002ff62c21168c))),
  ((1444697308.329,wv.17mf.com),CompactBuffer((61.164.41.227,_trackPageview,00054c...
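The composite-key trick above works because Scala supplies an implicit lexicographic Ordering for tuples, which sortByKey picks up automatically. A minimal plain-Scala sketch (no Spark cluster required; the sample keys are taken from the test file above) shows the same ordering behavior:

```scala
// Sorting (col1, col3) pairs with the implicit tuple Ordering:
// compare by the first element, break ties with the second.
object CompositeKeySortSketch {
  def main(args: Array[String]): Unit = {
    val keys = Seq(
      ("1444697637.786", "wv.88mf.com"),
      ("1444695603.085", "wv.17mf.com"),
      ("1444696305.588", "wv.88mf.com"),
      ("1444696305.588", "wv.77mf.com")
    )
    // Same ordering sortByKey would apply to the ((col1, col3), value) pairs.
    val sorted = keys.sortBy(identity)
    sorted.foreach(println)
    // The two 1444696305.588 rows are tie-broken by domain:
    // wv.77mf.com comes before wv.88mf.com, matching the Spark output above.
  }
}
```

Note that the timestamps sort correctly as strings only because they all have the same width; for mixed-width numeric fields you would convert to a numeric type first.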




2. Test file testsort:
  wzx 2321
  admin 462
  yxy 21323
  zov 32
  wzx 123
  vi 2
  wzx 3
  wzx 9
  yxy 223

Spark code:
  val text = sc.textFile("/user/wzx/testsort")
  val rdd1 = text.map(x => x.split(" "))
    .map(x => (x(0), x(1).toInt))
    .groupByKey()                                    // gather all values per key
    .sortByKey(true)                                 // keys ascending
    .map(x => (x._1, x._2.toList.sortWith(_ > _)))   // values descending
  rdd1.collect

Result:

  Array((admin,List(462)), (vi,List(2)), (wzx,List(2321, 123, 9, 3)), (yxy,List(21323, 223)), (zov,List(32)))
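The same group-then-sort pattern can be sketched with plain Scala collections, which makes the three steps (group by key, sort keys ascending, sort values descending) easy to verify without a cluster. This is only an illustration of the logic; the sample pairs are the rows of testsort:

```scala
// Plain-Scala sketch of the secondary sort in the second example.
object SecondarySortSketch {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(("wzx", 2321), ("admin", 462), ("yxy", 21323),
                    ("zov", 32), ("wzx", 123), ("vi", 2),
                    ("wzx", 3), ("wzx", 9), ("yxy", 223))
    val result = pairs.groupBy(_._1)                 // like groupByKey
      .toSeq.sortBy(_._1)                            // like sortByKey(true)
      .map { case (k, vs) =>                         // values descending,
        (k, vs.map(_._2).toList.sortWith(_ > _))     // like sortWith(_ > _)
      }
    result.foreach(println)
  }
}
```

Keep in mind that on a real cluster groupByKey pulls every value for a key into memory on one executor; for large key groups the usual advice is repartitionAndSortWithinPartitions with a composite key instead.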





From the ITPUB blog. Link: http://blog.itpub.net/29754888/viewspace-1826229/. Please cite the source when reposting; unauthorized reuse may incur legal liability.
