5. DRPC

Posted by fan_rockrock on 2015-12-31

1. Why DRPC?

Storm includes DRPC mainly to harness Storm's real-time computation power for parallelizing CPU-intensive computations. A DRPC Storm topology takes a stream of function arguments as its input and emits the return values of those function calls as its output stream. Strictly speaking, DRPC is not a feature of Storm itself; it is a pattern composed from Storm's primitives of spouts, bolts, and topologies. DRPC could have been packaged separately, but it is so useful that it ships bundled with Storm.

2. How DRPC Works

Distributed RPC is coordinated by a "DRPC server" (Storm ships with an implementation). The DRPC server coordinates the following steps:

 (1) It receives an RPC request.

 (2) It sends the request to the Storm topology.

 (3) It receives the results from the Storm topology.

 (4) It sends the results back to the waiting client.

From the client's perspective, a DRPC call looks just like a regular RPC call. For example, this is how a client invokes the "reach" function with the argument http://twitter.com:

DRPCClient client = new DRPCClient("drpc-host", 3772);
String result = client.execute("reach", "http://twitter.com");



The client sends the DRPC server the name of the function to execute and the arguments for that function. The topology implementing the function uses a DRPCSpout to receive the stream of function invocations from the DRPC server. Each function call is tagged by the DRPC server with a unique id. The topology then computes the result, and a final bolt called ReturnResults connects back to the DRPC server and hands it the result of the call, keyed by that unique id. The DRPC server uses the id to match the result with the waiting client, wakes that client up, and delivers the result.
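LinearDRPCTopologyBuilder, introduced in the next section, automates this wiring, but the same flow can also be assembled by hand from the DRPCSpout and ReturnResults primitives. Below is a minimal sketch of that manual wiring, modeled on the ManualDRPC example from storm-starter; the function name "exclamation" and the bolt body are only illustrative:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.DRPCSpout;
import backtype.storm.drpc.ReturnResults;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ManualDRPC {
  public static class ExclamationBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      // DRPCSpout emits [args, return-info]; return-info tells ReturnResults
      // which DRPC server and client are waiting for this result
      String arg = tuple.getString(0);
      Object returnInfo = tuple.getValue(1);
      collector.emit(new Values(arg + "!", returnInfo));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("result", "return-info"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    LocalDRPC drpc = new LocalDRPC();

    // the spout pulls the invocation stream for "exclamation" from the DRPC
    // server; ReturnResults routes each result back to the waiting client
    builder.setSpout("drpc", new DRPCSpout("exclamation", drpc));
    builder.setBolt("exclaim", new ExclamationBolt(), 3).shuffleGrouping("drpc");
    builder.setBolt("return", new ReturnResults(), 3).shuffleGrouping("exclaim");

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("exclaim", new Config(), builder.createTopology());
    System.out.println(drpc.execute("exclamation", "hello")); // prints hello!
    cluster.shutdown();
    drpc.shutdown();
  }
}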


3. LinearDRPCTopologyBuilder

Storm comes with a topology builder called LinearDRPCTopologyBuilder that automates almost every step of implementing DRPC, including:

  • Setting up the spout
  • Returning the results to the DRPC server
  • Providing bolts the ability to aggregate the tuples of a bounded batch
Example 1: a DRPC topology that appends a "!" to its input argument
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.StormSubmitter;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
  
public class BasicDRPCTopology {
  public static class ExclaimBolt extends BaseBasicBolt { // the main work is overriding execute and declareOutputFields
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String input = tuple.getString(1);
      collector.emit(new Values(tuple.getValue(0), input + "!"));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("id", "result"));
    }
  
  }
  public static void main(String[] args) throws Exception {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation"); // "exclamation" is the DRPC function name
    builder.addBolt(new ExclaimBolt(), 3);
  
    Config conf = new Config();
  
    if (args == null || args.length == 0) { // local mode
      LocalDRPC drpc = new LocalDRPC(); // in-process DRPC server for testing
      LocalCluster cluster = new LocalCluster(); // simulated local Storm cluster
  
      cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc)); // wires the DRPC spout and return bolt around our bolt
  
      for (String word : new String[]{ "hello", "goodbye" }) {
        System.out.println("Result for \"" + word + "\": " + drpc.execute("exclamation", word));
      }
  
      cluster.shutdown();
      drpc.shutdown();
    }
    else { // cluster mode
      conf.setNumWorkers(3);
      StormSubmitter.submitTopology(args[0], conf, builder.createRemoteTopology());
    }
  }
}
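Running this in local mode prints the DRPC results directly, i.e. Result for "hello": hello! and Result for "goodbye": goodbye!.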
The first bolt you declare receives two-field tuples, where the first field is the request-id and the second field is the argument of that request. LinearDRPCTopologyBuilder likewise requires the last bolt of the topology to emit a two-field tuple, where the first field is the request-id and the second field is the result of the function. Finally, the first field of every intermediate tuple must be the request-id.

Example 2: computing reach

First, what is reach? To compute the reach of a URL, we need to:

  • Get everyone who tweeted the URL
  • Get the followers of all those people
  • Deduplicate those followers
  • Count the unique followers; that count is the reach
A single reach computation can involve thousands of database calls and touch millions of user records, which makes it a genuinely CPU-intensive computation. As you will see, implementing it on Storm is remarkably simple.
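For contrast, a naive single-machine version of this computation looks like the sketch below, where getTweeters and getFollowers are hypothetical stand-ins for the database lookups. Every lookup runs sequentially, which is exactly what the topology parallelizes:

Set<String> followers = new HashSet<String>();
for (String tweeter : getTweeters(url)) {      // one database call
    followers.addAll(getFollowers(tweeter));   // one database call per tweeter
}
int reach = followers.size();                  // reach = number of unique followers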
The reach topology is defined like this:

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 4);
builder.addBolt(new GetFollowers(), 12).shuffleGrouping();
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "follower"));
builder.addBolt(new CountAggregator(), 3).fieldsGrouping(new Fields("id"));

This topology executes in four steps:

  • GetTweeters gets all the users who tweeted the given URL. It takes an input stream of [id, url] and emits [id, tweeter]. Each url tuple maps to many tweeter tuples.
  • GetFollowers gets the followers of those tweeters. It takes an input stream of [id, tweeter] and emits [id, follower].
  • PartialUniquer groups the followers with a fields grouping on both id and follower, so for a given request each distinct follower always lands on the same task, which is what deduplicates them (see the routing sketch after this list). It emits [id, partial-count], the number of unique followers counted on that task.
  • Finally, CountAggregator receives all the partial counts and sums them to produce the reach.
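Why the fields grouping in the PartialUniquer step deduplicates: Storm routes a tuple to a task by hashing the selected field values modulo the task count, so every copy of the same (id, follower) pair lands on the same task. Conceptually the routing works like the line below; this illustrates the idea and is not Storm's actual internal code:

int targetTask = Math.abs(Arrays.asList(id, follower).hashCode()) % numTasks;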
The deduplication bolt:
public static class PartialUniquer extends BaseBatchBolt {
    BatchOutputCollector _collector;
    Object _id;
    Set<String> _followers = new HashSet<String>();

    @Override
    public void prepare(Map conf, TopologyContext context,
            BatchOutputCollector collector, Object id) {
        _collector = collector;
        // the batch id; for DRPC it identifies the request (in transactional
        // topologies it is a TransactionAttempt instead; see the chapter on
        // transactional topologies)
        _id = id;
    }

    @Override
    public void execute(Tuple tuple) {
        // called once for every tuple in the batch
        _followers.add(tuple.getString(1));
    }

    @Override
    public void finishBatch() {
        // called once the whole batch has been received
        _collector.emit(new Values(_id, _followers.size()));
    }

    // declareOutputFields omitted here; see the complete code below
}
A new instance of the batch bolt is created for every request id. Under the hood, CoordinatedBolt detects when a given bolt has received all the tuples for a particular request id; in other words, it guarantees that every tuple of the batch has been processed before finishBatch is called.
The rest of the topology is self-explanatory. Every step of the reach computation runs in parallel, and defining the DRPC topology turns out to be very simple.

Here is the complete code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.StormSubmitter;
import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseBatchBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ReachTopology { // computes the reach of a URL
    public static Map<String,List<String>> TWEETERS_DB=new HashMap<String,List<String>>(){
    	{
    		put("url1",Arrays.asList("sally","bob","tim","george","nathan"));
    		put("url2",Arrays.asList("adam","david","sally","nathan"));
    		put("url3",Arrays.asList("tim","mike","john"));
    	}
    };
    public static Map<String,List<String>> FOLLOWERS_DB=new HashMap<String,List<String>>(){
    	{
    		put("sally",Arrays.asList("bob","tim","alice","adam","jim","chris","jai"));
    		put("bob",Arrays.asList("sally","nathan","jim","mary","david","vivian"));
    		put("tim",Arrays.asList("alex"));
    		put("nathan",Arrays.asList("sally","bob","adam","harry","chris","vivian","emily","jordan"));
    		put("adam",Arrays.asList("david","carissa"));
    		put("mike",Arrays.asList("john","bob"));
    		put("john",Arrays.asList("alice","nathan","jim","mike","bob"));
    	}
    };
    
    public static class GetTweeters extends BaseBasicBolt{
		@Override
		public void execute(Tuple input, BasicOutputCollector collector) {
			Object id=input.getValue(0);
			String url=input.getString(1);
			List<String> tweeters=TWEETERS_DB.get(url);
			if(tweeters!=null)
			{
				for(String tweeter:tweeters)
				{
					collector.emit(new Values(id,tweeter));
				}
			}
		}

		@Override
		public void declareOutputFields(OutputFieldsDeclarer declarer) {
			declarer.declare(new Fields("id","tweeter"));
		}
    }
    public static class GetFollowers extends BaseBasicBolt{

		@Override
		public void execute(Tuple input, BasicOutputCollector collector) {
		   Object id=input.getValue(0);
		   String tweeter=input.getString(1);
		   List<String> followers = FOLLOWERS_DB.get(tweeter);
		   if(followers!=null)
		   {
			   for(String follower:followers)
			   {
				   collector.emit(new Values(id,follower));
			   }
		   }
		}

		@Override
		public void declareOutputFields(OutputFieldsDeclarer declarer) {
			declarer.declare(new Fields("id","follower"));
		}
    	
    }
    public static class PartialUniquer extends BaseBatchBolt{
    	BatchOutputCollector _collector;
    	Object _id;
    	Set<String> _followers=new HashSet<String>();
    	
		@Override
		public void prepare(Map conf, TopologyContext context,
				BatchOutputCollector collector, Object id) {
			_collector=collector;
			_id=id;
		}

		@Override
		public void execute(Tuple tuple) { // called once for every tuple in the batch
			_followers.add(tuple.getString(1));
		}

		@Override
		public void finishBatch() {
			_collector.emit(new Values(_id,_followers.size()));
		}

		@Override
		public void declareOutputFields(OutputFieldsDeclarer declarer) {
			declarer.declare(new Fields("id","partial-count"));
		}
    	
    }
    public static class CountAggregator extends BaseBatchBolt{
        BatchOutputCollector _collector;
        Object _id;
        int _count=0;
        
		@Override
		public void prepare(Map conf, TopologyContext context,
				BatchOutputCollector collector, Object id) {
			_collector=collector;
			_id=id;
		}

		@Override
		public void execute(Tuple tuple) {
			_count+=tuple.getInteger(1);
		}

		@Override
		public void finishBatch() {
			_collector.emit(new Values(_id,_count));
		}

		@Override
		public void declareOutputFields(OutputFieldsDeclarer declarer) {
			declarer.declare(new Fields("id","reach"));
		}
    	
    }
	public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
        LinearDRPCTopologyBuilder builder=new LinearDRPCTopologyBuilder("reach");
        builder.addBolt(new GetTweeters(),4);
        builder.addBolt(new GetFollowers(),12).shuffleGrouping();
        builder.addBolt(new PartialUniquer(),6).fieldsGrouping(new Fields("id","follower"));
        builder.addBolt(new CountAggregator(),3).fieldsGrouping(new Fields("id"));
        
        Config conf=new Config();
        if(args==null||args.length==0)
        {
        	conf.setMaxTaskParallelism(3);
        	LocalDRPC drpc=new LocalDRPC();
        	LocalCluster cluster=new LocalCluster();
        	cluster.submitTopology("reach-drpc", conf, builder.createLocalTopology(drpc));
            
        	String[] urls=new String[]{"url3","url2","url1"};
        	for(String url:urls)
        	{
        		System.out.println("Reach of "+url+": "+drpc.execute("reach", url));
        	}
        	cluster.shutdown();
        	drpc.shutdown();
        }
        else {
			conf.setNumWorkers(6);
			StormSubmitter.submitTopology(args[0], conf, builder.createRemoteTopology());
		}
	}

}
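Once this topology is submitted to a cluster whose drpc.servers configuration points at a running DRPC server, any JVM can compute reach remotely. A hypothetical call (drpc-host is a placeholder; 3772 is Storm's default DRPC port; exception handling omitted):

DRPCClient client = new DRPCClient("drpc-host", 3772);
String reach = client.execute("reach", "http://twitter.com");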