Storm in Practice: WordCount

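Posted by 541732025 on 2015-03-06
Once the Storm environment is deployed and running correctly, you can start actual Storm development. By convention, we begin with WordCount.
The example is simple. Its core components are one spout, two bolts, and one topology.
The spout reads files from a directory line by line and emits each line to the first bolt. Once a file has been fully processed, it is renamed so that it is not processed again.
The first bolt splits each string received from the spout on whitespace to produce words, and emits them to the next bolt.
The second bolt receives each word and keeps a running count in a hash map.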

1. Define a spout whose job is to continuously emit strings to the bolt.


    import java.io.File;
    import java.io.IOException;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;

    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.filefilter.FileFilterUtils;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class WordReader extends BaseRichSpout {
        private static final long serialVersionUID = 2197521792014017918L;
        private String inputPath;
        private SpoutOutputCollector collector;

        @Override
        @SuppressWarnings("rawtypes")
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            // INPUT_PATH is placed in the topology configuration by WordCountTopo.
            inputPath = (String) conf.get("INPUT_PATH");
        }

        @Override
        public void nextTuple() {
            // List every file in the input directory except those already renamed to *.bak.
            Collection<File> files = FileUtils.listFiles(new File(inputPath),
                    FileFilterUtils.notFileFilter(FileFilterUtils.suffixFileFilter(".bak")), null);
            if (files.isEmpty()) {
                // Nothing new to read: back off briefly instead of rescanning in a tight loop.
                Utils.sleep(1000);
                return;
            }
            for (File f : files) {
                try {
                    // Emit the file one line at a time.
                    List<String> lines = FileUtils.readLines(f, "UTF-8");
                    for (String line : lines) {
                        collector.emit(new Values(line));
                    }
                    // Rename the processed file to *.bak so it is not read again.
                    FileUtils.moveFile(f, new File(f.getPath() + System.currentTimeMillis() + ".bak"));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Each emitted tuple has a single field holding one line of text.
            declarer.declare(new Fields("line"));
        }

    }
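
The "read once, then rename" idea used by the spout can be tried outside Storm. The following is a minimal sketch using only java.nio.file; the class ProcessOnceReader and its methods drain and scanTwice are hypothetical names introduced for illustration, and the renamed file gets a timestamp plus a .bak suffix as an assumed naming scheme.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProcessOnceReader {

    /** Returns the lines of every unprocessed file in dir, then marks those files as done. */
    static List<String> drain(Path dir) {
        try {
            // Collect candidates first so files are not renamed while the directory is being listed.
            List<Path> pending = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
                for (Path p : stream) {
                    if (Files.isRegularFile(p) && !p.getFileName().toString().endsWith(".bak")) {
                        pending.add(p);
                    }
                }
            }
            List<String> lines = new ArrayList<>();
            for (Path p : pending) {
                lines.addAll(Files.readAllLines(p, StandardCharsets.UTF_8));
                // Rename with a timestamp and a .bak suffix so the next scan skips this file.
                String doneName = p.getFileName() + "." + System.currentTimeMillis() + ".bak";
                Files.move(p, p.resolveSibling(doneName));
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one sample file into a fresh temp directory and scans the directory twice. */
    static List<List<String>> scanTwice() {
        try {
            Path dir = Files.createTempDirectory("wordcount-input");
            Files.write(dir.resolve("a.txt"),
                    Arrays.asList("hello storm", "hello world"), StandardCharsets.UTF_8);
            List<List<String>> passes = new ArrayList<>();
            passes.add(drain(dir)); // first scan reads the file
            passes.add(drain(dir)); // second scan sees only the .bak file
            return passes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scanTwice()); // prints [[hello storm, hello world], []]
    }
}
```

The second scan returns nothing because the only file left in the directory carries the .bak suffix, which is the property the spout relies on to avoid emitting the same lines twice.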

2. Define a bolt that receives the strings sent by the spout, splits them into words, and emits each word to the next bolt.


    import org.apache.commons.lang.StringUtils;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class WordSpliter extends BaseBasicBolt {

        private static final long serialVersionUID = -5653803832498574866L;

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Field 0 is the "line" field declared by WordReader.
            String line = input.getString(0);
            // Split on runs of whitespace so tabs and repeated spaces are handled too.
            String[] words = line.split("\\s+");
            for (String word : words) {
                // A line with leading whitespace yields an empty first token; skip it.
                if (StringUtils.isNotBlank(word)) {
                    // Normalize case so "Storm" and "storm" are counted together.
                    word = word.toLowerCase();
                    collector.emit(new Values(word));
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

    }
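
The per-line logic of this bolt can be exercised without a running topology. Below is a minimal sketch in plain Java; LineTokenizer and tokenize are hypothetical names, not part of Storm. It splits on runs of whitespace, which also copes with tabs and repeated spaces, skips empty tokens, and lowercases each word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineTokenizer {

    /** Splits a line into lowercase words, ignoring any amount of surrounding whitespace. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String raw : line.split("\\s+")) {
            // A line that starts with whitespace produces an empty first token; skip it.
            if (!raw.isEmpty()) {
                words.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  Hello   Storm\thello ")); // prints [hello, storm, hello]
    }
}
```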

3. Define a bolt that receives the words and counts them.


    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.concurrent.ConcurrentHashMap;

    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;

    public class WordCounter extends BaseBasicBolt {
        private static final long serialVersionUID = 5683648523524179434L;
        // Read by the reporting thread while execute() writes to it, so a plain HashMap
        // could throw ConcurrentModificationException during iteration.
        private final Map<String, Integer> counters = new ConcurrentHashMap<String, Integer>();
        // Set by execute() whenever a count changes; cleared after each report.
        private volatile boolean edit = false;

        @Override
        @SuppressWarnings("rawtypes")
        public void prepare(Map stormConf, TopologyContext context) {
            // TIME_OFFSET is the reporting interval in seconds, set by WordCountTopo.
            final long timeOffset = Long.parseLong(stormConf.get("TIME_OFFSET").toString());
            Thread reporter = new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                        // Print the counts only if something changed since the last report.
                        if (edit) {
                            for (Entry<String, Integer> entry : counters.entrySet()) {
                                System.out.println(entry.getKey() + " : " + entry.getValue());
                            }
                            System.out.println("WordCounter---------------------------------------");
                            edit = false;
                        }
                        try {
                            Thread.sleep(timeOffset * 1000);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            });
            // Daemon thread, so it does not keep the worker JVM alive on shutdown.
            reporter.setDaemon(true);
            reporter.start();
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String str = input.getString(0);
            // execute() is the only writer, so this read-then-write sequence is safe.
            if (!counters.containsKey(str)) {
                counters.put(str, 1);
            } else {
                Integer c = counters.get(str) + 1;
                counters.put(str, c);
            }
            edit = true;
            System.out.println("WordCounter+++++++++++++++++++++++++++++++++++++++++++");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Last bolt in the topology: it emits nothing, so no fields are declared.
        }
    }
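Note the prepare method of the WordCounter class: it starts a Thread that periodically checks whether the map has changed (a word's count has increased or a new word has been added) and, if so, prints the current counts.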
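
Because the counter map is written by execute and read by the monitoring thread, the two need a map that tolerates concurrent access. The following is a minimal sketch of that idea outside Storm, assuming a ConcurrentHashMap; SafeWordCounter, count and snapshot are hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

class SafeWordCounter {
    private final Map<String, Integer> counters = new ConcurrentHashMap<>();

    /** Atomically adds one to the count for the given word. */
    void count(String word) {
        counters.merge(word, 1, Integer::sum);
    }

    /** Copies the current counts into a sorted map that is safe to print or iterate. */
    Map<String, Integer> snapshot() {
        return new TreeMap<>(counters);
    }

    public static void main(String[] args) throws InterruptedException {
        SafeWordCounter counter = new SafeWordCounter();
        // One thread counts while another repeatedly takes snapshots, as the bolt and its reporter do.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                counter.count("storm");
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                counter.snapshot();
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        System.out.println(counter.snapshot()); // prints {storm=1000}
    }
}
```

Iterating a ConcurrentHashMap while another thread updates it does not throw ConcurrentModificationException, so the reader can take a snapshot at any time.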

4. Define a topology and submit the job.


    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class WordCountTopo {
        public static void main(String[] args) {
            if (args.length != 2) {
                System.err.println("Usage: inputPath timeOffset");
                System.err.println("such as : java -jar WordCount.jar D://input/ 2");
                System.exit(2);
            }
            // Wire the components together: spout -> splitter bolt -> counter bolt.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("word-reader", new WordReader());
            builder.setBolt("word-splitter", new WordSpliter()).shuffleGrouping("word-reader");
            builder.setBolt("word-counter", new WordCounter()).shuffleGrouping("word-splitter");
            String inputPath = args[0];
            String timeOffset = args[1];
            // These entries are read back in WordReader.open() and WordCounter.prepare().
            Config conf = new Config();
            conf.put("INPUT_PATH", inputPath);
            conf.put("TIME_OFFSET", timeOffset);
            conf.setDebug(false);
            // LocalCluster runs the topology in-process (local mode). To run on a real
            // cluster, submit with StormSubmitter.submitTopology instead.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("WordCount", conf, builder.createTopology());
        }
    }
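
A note on the grouping used for the counter bolt: shuffleGrouping spreads tuples randomly across a bolt's tasks. That is fine here because each bolt runs with the default parallelism of one task, but with several counter tasks the same word could reach different tasks and each would hold only a partial count. Storm's fieldsGrouping(componentId, new Fields("word")) sends tuples with equal field values to the same task. The sketch below only simulates that idea with ordinary hashing in plain Java; it is not Storm's implementation, and GroupingSketch, taskFor and countByField are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupingSketch {

    /** Picks a task index from the word itself, so equal words always map to the same task. */
    static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    /** Routes each word to one of numTasks counters by its hash and counts it there. */
    static List<Map<String, Integer>> countByField(List<String> words, int numTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String word : words) {
            tasks.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "hadoop", "storm", "spark", "storm");
        List<Map<String, Integer>> tasks = countByField(words, 3);
        for (int i = 0; i < tasks.size(); i++) {
            // Each word shows up in exactly one task, with its full count.
            System.out.println("task " + i + " -> " + tasks.get(i));
        }
    }
}
```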

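5. Once the code is complete, export a jar (do not specify a Main class when exporting), upload it to the Storm cluster, and submit the job with the command ./storm jar <jar file> com.x.x.WordCountTopo /data/tianzhen/input 2, where <jar file> is the path to the exported jar.
Topology startup, showing the spout and bolts running: [screenshot]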


Counts printed by the monitoring Thread: [screenshot]


After processing, the source file is renamed to *.bak.

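Unlike a Hadoop job, the topology does not stop once the work is done. The spout keeps watching the data source and keeps emitting data to the bolts.
So if the source data changes now, the change should show up right away. I put another text file under the input path, and the result: [screenshot]

As you can see, the counts were updated immediately. This is where Storm's real-time nature shows.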

Source: "ITPUB Blog", link: http://blog.itpub.net/28912557/viewspace-1450885/. If you repost this article, please credit the source; otherwise legal action may be pursued.
