[20141228]關於bloom filter.txt

lfree發表於2014-12-30

[20141228]關於bloom filter.txt

--系統升級到11.2.0.4 ,執行計劃經常出現bloom filter,自己對這些一點都不瞭解.
--自己做了google,能查到的資料不多,eygle的blog講解,我水平有限,至少第1次沒看懂,不過裡面指向一個連結:
--

--我仔細閱讀,感覺還是不好理解,決定把裡面的測試例子執行1次,後面做一個測試:

--實際上bloom filter很早就已經存在,1970年由Burton H.Bloom發明的.我轉載裡面的介紹:

A bloom filter is a data structure used to support membership queries.Simply put,a bloom filter is
used to test whether an element is a member of a giver set or not. Its main properties are:

. The amount of space needed to store the bloom filter is small compared to the amount of data
  belonging to the set being tested.
. The time needed to check whether an element is a member of a given set is independent of
  the number of elements contained in the set.
. False negatives not possible.
. False positives are possible, bit their frequency can be controlled. In practice,it is a tradeoff
  between space/time efficiency and the false positive frequency.

A bloom filter is based on an array of an bits (b1, b2, ..., bm) that are initially set to 0. To understand
how a bloom filter works, it is essential to describe how these bits are set and checked. For this
purpose, k independent hash functions (hi, h2, ..., hk), each returning a value between 1 and m,
are used. In order to "store" a given element into the bit array, each hash function must be
applied to it and, based on the return value r of each function (r1, r2, ..., rk), the bit with the offset r
is set to 1. Since there are k hash functions, up to k bits in the bit array are set to 1 (it might be
less because several hash functions might return the same value). The following figure is an
example where m = 16, k=4 and e is the element to be "stored" in the bit array.

r1 = h1(e) = 8  -> set b8 to 1
r2 = h2(e) = 1  -> set b1 to 1
r3 = h3(e) = 6  -> set b6 to 1
r4 = h4(e) = 13 -> set b13 to 1

To check whether an element is /"stored" in the bit array, the process is similar. The only
difference is that instead of setting k bits in the array, it is checked whether any of them are set to
0. If yes, this implies that the element is not /"stored" in the bit array.

Let's take a look at an example that illustrates how a bloom filter might be used in practice. An
application receives input data at a high throughput. Before processing it, it has to check whether
that data belongs to a given set stored in a database table. Note that the rejection rate of this
validation is very high. To perform the validation, the application has to call a remote service.
Unfortunately, the time needed for the network roundtrip and to actually do the lookup is far too
long. Since it is not possible to further optimize it (for example, the network latency is subject to
physical limitations), another method must be implemented. ln such cases, caches come to mind.
The general idea is to duplicate the data stored remotely in a local cache and then to do the
validation locally. In this specific case, let's say that it is not practicable. Not enough resources are
available locally to store the data. In such a situation, a bloom filter might be useful. In fact, as
described above, bloom filters need much less space than the actual set. Since the bloom filter is
subject to false positives, the idea is to use it to discard the maximum amount of input data before
calling the remote service. As a result, the original validation is not made faster but it is used
much less frequently.

--還是透過例子來說明問題:

/* Formatted on 2014/12/28 21:47:07 (QP5 v5.227.12220.39754) */
CREATE OR REPLACE PACKAGE bloom_filter
IS
   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER);

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;
END bloom_filter;
/

CREATE OR REPLACE PACKAGE BODY bloom_filter
IS
   TYPE t_bitarray IS TABLE OF BOOLEAN;

   g_bitarray   t_bitarray;
   g_m          BINARY_INTEGER;
   g_k          BINARY_INTEGER;

   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER)
   IS
   BEGIN
      g_m := p_m;
      g_bitarray := t_bitarray ();
      g_bitarray.EXTEND (p_m);
      FOR i IN g_bitarray.FIRST .. g_bitarray.LAST
      LOOP
         g_bitarray (i) := FALSE;
         --dbms_output.put_line(i);
      END LOOP;

      g_k := CEIL (p_m / p_n * LN (2));
   END init;

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         g_bitarray (dbms_random.VALUE (1, g_m)) := TRUE;
      END LOOP;

      RETURN 1;
   END add_value;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
      l_ret   BINARY_INTEGER := 1;
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         IF NOT g_bitarray (dbms_random.VALUE (1, g_m))
         THEN
            l_ret := 0;
            EXIT;
         END IF;
      END LOOP;

      RETURN l_ret;
   END contain;
END bloom_filter;
/

create table t as select dbms_random.string('u',100) as value from dual connect by level<=1e4;

SCOTT@test> execute bloom_filter.init(16484,1000);
PL/SQL procedure successfully completed.

SCOTT@test> select count(bloom_filter.add_value(value)) from t where rownum<=1000;
COUNT(BLOOM_FILTER.ADD_VALUE(VALUE))
------------------------------------
                                1000

SCOTT@test> select count(*) from t where bloom_filter.contain(value) =1;
  COUNT(*)
----------
      1006

--執行初始化bloom_filter.init,就是設定g_bitarray的16384個二進位制位為0.並且設定 g_k := CEIL (p_m / p_n * LN (2))=12;
SCOTT@test> select ceil(16384/1000*ln(2)) from dual ;
CEIL(16384/1000*LN(2))
----------------------
                    12

--  bloom_filter.add_value 就是透過函式(這裡是取隨機數)變化加入g_bitarray中,相應的二進位制位設定1.

-- 後面 執行的select count(*) from t where bloom_filter.contain(value) =1;檢查1e4個字串轉化的位是否在g_bitarray中,當然存
-- 在並不一定正確.
-- 可以發現存在6個衝突的.

-- 如果能把例子執行,很容易理解bloom filter演算法.實際上利用少量的空間儲存大量的資訊.前面的例子使用16384位=2k的資訊儲存
1000*100的資訊,當然也存在一定的衝突,實際上取決於點陣圖數量以及儲存資訊的數量.


-

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/267265/viewspace-1384617/,如需轉載,請註明出處,否則將追究法律責任。

相關文章