[Big Data Series, Part 4] An Introduction to Probabilistic Data Structures

Published by StoneDemo on 2018-06-10

Preface

  • May and June were once again consumed by an endless stream of project bugs. To be fair, I have become much more fluent at analyzing logs, but my coding skills have not improved much. Bug fixing mostly means adding or tweaking business logic on top of existing code. I do learn a lot when modifying the platform code, but that part rarely breaks; the vast majority of problems come from the logic our own team added or changed. Filling those holes is a long and tiresome process.
  • With limited time, I selected only five short articles for this round of translation, trying to keep the quality high while freeing up some spare time to study other things.
  • Still, this round of translation was very rewarding for me. For two of the articles I took detailed notes and consulted quite a few related materials while reading, and I learned a great deal.
  • One of them is about the art of data visualization. Although it only uses web performance analysis as an example and briefly introduces the scenarios each common chart type suits, that is exactly what I need right now. For work reasons I frequently integrate third-party algorithms and evaluate their performance, but my team has always analyzed performance by printing logs, which is inconvenient and often misses anomalous changes. I needed ways to make our performance analysis more efficient, and that article pointed me in the right direction.
  • The other is this introduction to probabilistic data structures. I chose to translate it because I spotted the Bloom filter, which reminded me of my university days writing crawlers for a professor... I took the opportunity to revisit Bloom filters and to learn the HyperLogLog and Count-Min sketch algorithms. I have a feeling I will be using them before long.
  • As usual, four articles were adopted for this issue:
  • As for copyright, I am honestly not sure whether publishing bilingual translations like this infringes on the English originals. But without the side-by-side translation, posting these would be pointless. If it turns out to be infringement, I will take them down later.

Copyright

Translator: StoneDemo, a member of the Cloud+ Community translation group
Original article: Introduction to Probabilistic Data Structures
Original author: Yin Niu


Introduction to Probabilistic Data Structures

When processing large data sets, we often want to do some simple checks, such as the number of unique items, the most frequent items, and whether certain items exist in the data set. The common approach is to use a deterministic data structure such as a HashSet or Hashtable for these purposes. But when the data set we are dealing with becomes very large, such data structures are simply not feasible because the data is too big to fit in memory. It becomes even more difficult for streaming applications, which typically require data to be processed in one pass with incremental updates.

Probabilistic data structures are a group of data structures that are extremely useful for big data and streaming applications. Generally speaking, these data structures use hash functions to randomize and compactly represent a set of items. Collisions are ignored, but errors can be well controlled under a certain threshold. Compared with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and can therefore be easily parallelized.

This article will introduce three commonly used probabilistic data structures: Bloom filter, HyperLogLog, and Count-Min sketch.

Membership Query - Bloom filter

A Bloom filter is a bit array of m bits, initialized to 0. To add an element, feed it to k hash functions to get k array positions and set the bits at those positions to 1. To query an element, feed it to the same k hash functions to obtain k array positions. If any of the bits at these positions is 0, the element is definitely not in the set. If the bits are all 1, the element might be in the set. A Bloom filter with a 1% false positive rate requires only 9.6 bits per element, regardless of the size of the elements.

(Figure: a Bloom filter bit array with x, y, and z inserted via k = 3 hash functions, and a lookup of w)

For example, suppose we have inserted x, y, and z into the Bloom filter using k = 3 hash functions, as in the picture above. Each of these three elements has three bits set to 1 in the bit array. When we look up w in the set, because one of its bits is not set to 1, the Bloom filter tells us that it is not in the set.

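
The add/query mechanics above can be sketched in a few lines of Python. This is only an illustration: the bit-array size, k = 3, and the use of index-salted SHA-1 digests as the k hash functions are all arbitrary choices for the example, not a production implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # derive k positions by salting one hash function with the index i
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # all k bits set -> "maybe present"; any bit clear -> definitely absent
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
for item in ("x", "y", "z"):
    bf.add(item)
```

As in the figure, a lookup returns False as soon as one of the element's k bits is 0, while a True answer only ever means "possibly present".
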
A Bloom filter has the following properties:

  • False positives are possible when the queried positions are already set to 1, but false negatives are impossible.
  • Query time is O(k).
  • The union and intersection of Bloom filters with the same size and hash functions can be implemented with bitwise OR and AND operations.
  • Elements cannot be removed from the set.

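
The union and intersection properties follow directly from the representation: two filters built with the same m and the same hash functions can be merged bit by bit. A minimal illustration (the 8-bit arrays are made up for the example):

```python
# two Bloom filter bit arrays of the same size, built with the same hash functions
a = [1, 0, 0, 1, 0, 1, 0, 0]   # filter containing some elements
b = [0, 0, 1, 1, 0, 0, 0, 1]   # filter containing some other elements

union        = [x | y for x, y in zip(a, b)]  # bitwise OR -> union of the sets
intersection = [x & y for x, y in zip(a, b)]  # bitwise AND -> approximate intersection
```

Note that the OR-merged filter is exactly the filter that would result from inserting both element sets, while the AND-merged filter is only an approximation of the true intersection.
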
Bloom filter requires the following inputs:

m: size of the bit array

n: estimated number of insertions

p: false positive probability

The optimum number of hash functions k can be determined using the formula:

k = (m / n) · ln 2

Given false positive probability p and the estimated number of insertions n, the length of the bit array can be calculated as:

m = −n · ln p / (ln 2)^2
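
Plugging numbers into the two formulas reproduces the "9.6 bits per element" figure quoted earlier. A quick check for n = 1,000,000 expected insertions at p = 1% (the variable names are just for the example):

```python
import math

n = 1_000_000          # expected number of insertions
p = 0.01               # target false positive probability

m = math.ceil(-n * math.log(p) / math.log(2) ** 2)   # bit array size
k = round(m / n * math.log(2))                        # optimal number of hash functions

bits_per_element = m / n   # ~9.6 bits per element, and k works out to 7
```
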

The hash functions used for a Bloom filter should generally be faster than cryptographic hash algorithms while still having good distribution and collision resistance. Commonly used hash functions for Bloom filters include Murmur hash, the fnv series of hashes, and the Jenkins hashes. Murmur hash is the fastest among them. MurmurHash3 is used by the Google Guava library's Bloom filter implementation.

Cardinality - HyperLogLog

HyperLogLog is a streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets. A HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit-pattern observation that, for a stream of randomly distributed numbers, if the maximum number of leading 0 bits observed in any number x is k, the cardinality of the stream is very likely about 2^k.

For each element si in the stream, a hash function h(si) transforms si into a string of random bits (each bit 0 or 1 with probability 1/2):

The probability P of the bit patterns:

0xxxx… → P = 1/2

01xxx… → P = 1/4

001xx… → P = 1/8

The intuition is that when we see a prefix 0^k 1…, it is likely there are n ≥ 2^(k+1) different strings. By keeping track of the prefixes 0^k 1… that have appeared in the data stream, we can estimate the cardinality to be 2^p, where p is the length of the longest prefix.

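
The key primitive is extracting the position of the leftmost 1-bit of a hash value: k leading zeros correspond to the prefix 0^k 1 and a rank of k + 1. A tiny sketch of this idea (the fixed width of 8 bits and the sample values are just for illustration):

```python
def rank(x, bits):
    """1-based position of the leftmost 1-bit of a fixed-width value,
    i.e. (number of leading zeros) + 1."""
    return bits - x.bit_length() + 1

# a random value starts with the prefix 0^k 1 with probability 2^-(k+1),
# so seeing a maximum rank p in a stream suggests on the order of 2^p values
stream = [0b01101001, 0b00101100, 0b00011010]
estimate = 2 ** max(rank(x, 8) for x in stream)
```
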
Because the variance is very high when using a single counter, the data is split into m sub-streams using the first few bits of the hash in order to get a better estimate. The counters are maintained by m registers, each with a memory space of a multiple of 4 bytes. If the standard deviation of each sub-stream is σ, then the standard deviation of the averaged value is only σ/√m. This is called stochastic averaging.

For instance, for m = 4:

(Figure: the hash bits of each element, with the first 2 bits selecting one of the m = 4 registers)

The elements are split into m streams using the first 2 bits (00, 01, 10, 11), which are then discarded. Each register stores the largest 0^k 1 prefix observed in the remaining hash bits. The values in the m registers are then averaged to obtain the cardinality estimate.

The HyperLogLog algorithm uses the harmonic mean to normalize the result. The algorithm also makes adjustments for small and very large values. The resulting error is equal to 1.04/√m.

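
The pieces described above (sub-streams selected by the first b bits of the hash, per-register maxima of the leading-zero rank, and a harmonic mean with a small-range correction) can be combined into a compact sketch. This is a simplified illustration rather than the full algorithm with all of its corrections; the SHA-1-based 64-bit hash is an arbitrary choice, and the constant 0.7213/(1 + 1.079/m) for α assumes m ≥ 128.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                    # first b bits select a register
        self.m = 1 << b               # m = 2^b registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def _hash64(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        j = x >> (64 - self.b)                     # register index
        w = x & ((1 << (64 - self.b)) - 1)         # remaining bits
        rank = (64 - self.b) - w.bit_length() + 1  # leftmost 1-bit position
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        # harmonic mean of 2^register across the m registers
        estimate = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if estimate <= 2.5 * self.m:               # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog(b=10)                # 1024 registers -> ~3.25% standard error
for i in range(100_000):
    hll.add(i)
```

With m = 1024 the expected standard error is 1.04/√1024 ≈ 3.25%, and re-adding an element never changes any register, so duplicates cost nothing.
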
Each of the m registers uses at most log2 log2 n + O(1) bits when cardinalities ≤ n need to be estimated.

The union of two HyperLogLog counters can be calculated by first taking, for each of the m registers, the maximum value across the two counters, and then calculating the estimated cardinality.

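
Because each register only ever keeps a maximum, this merge is lossless: the element-wise maximum of the register arrays is exactly the counter that would have resulted from observing both streams. A sketch over bare register lists (the values are made up for the example):

```python
# register arrays of two HyperLogLog counters built with the same m
r1 = [3, 0, 5, 2]
r2 = [1, 4, 2, 2]

# per-register maximum; the cardinality is then estimated from `merged` as usual
merged = [max(a, b) for a, b in zip(r1, r2)]
```
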
Frequency - Count-Min Sketch

Count-Min sketch is a probabilistic, sub-linear-space streaming algorithm. It is somewhat similar to a Bloom filter. The main difference is that a Bloom filter represents a set as a bitmap, while a Count-Min sketch represents a multi-set and keeps a summary of the frequency distribution.

The basic data structure is a two-dimensional d × w array of counters, with d pairwise-independent hash functions h1 … hd of range w. Given parameters (ε, δ), set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. ε is the accuracy we want to have and δ is the certainty with which we reach that accuracy. The two-dimensional array consists of w·d counters. To increment the counts, calculate the hash positions with the d hash functions and update the counts at those positions.

(Figure: a d × w array of counters; each of the d hash functions maps an item to one counter in its row)

The estimated count for an item is the minimum of the counts at the array positions determined by the d hash functions.

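
The update and query rules can be sketched directly from the definitions of w and d above. The MD5-with-row-salt hashing stands in for the pairwise-independent hash family and is only an illustrative choice, as are the default parameters:

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.01):
        self.w = math.ceil(math.e / epsilon)     # width: error <= eps * total count
        self.d = math.ceil(math.log(1 / delta))  # depth: holds with prob >= 1 - delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _positions(self, item):
        for row in range(self.d):
            digest = hashlib.md5(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in self._positions(item):
            self.table[row][col] += count

    def estimate(self, item):
        # collisions only inflate counters, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._positions(item))

cms = CountMinSketch()
for item, n in (("a", 100), ("b", 5), ("c", 1)):
    cms.add(item, n)
```

An estimate can never fall below the true count; it can only be inflated by collisions with other items.
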
The space used by a Count-Min sketch is the array of w × d counters. By choosing appropriate values for d and w, very small error can be achieved with high probability.

Examples of Count-Min sketch sizes for different error and probability combinations:

ε        1 − δ    w       d    w·d
0.1      0.9      28      3    84
0.1      0.99     28      5    140
0.1      0.999    28      7    196
0.01     0.9      272     3    816
0.01     0.99     272     5    1360
0.01     0.999    272     7    1904
0.001    0.999    2719    7    19033
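
The w and d columns follow mechanically from w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉; a quick check of two of the rows:

```python
import math

def cms_size(epsilon, delta):
    w = math.ceil(math.e / epsilon)      # width
    d = math.ceil(math.log(1 / delta))   # depth
    return w, d, w * d

row_mid  = cms_size(0.1, 0.01)    # eps = 0.1,   1 - delta = 0.99
row_last = cms_size(0.001, 0.001) # eps = 0.001, 1 - delta = 0.999
```
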

Count-Min sketch has the following properties:

  • Unions can be performed with a cell-wise ADD operation
  • O(d) query time
  • Better accuracy for higher-frequency items (heavy hitters)
  • Can only over-count, never under-count

A Count-Min sketch can be used to query single-item counts or to find "heavy hitters", which can be obtained by additionally keeping a heap structure of all the counts.

Summary

Probabilistic data structures have many applications in modern web and data applications, where the data arrives in a streaming fashion and needs to be processed on the fly using limited memory. Bloom filter, HyperLogLog, and Count-Min sketch are the most commonly used probabilistic data structures. There is a lot of research on various streaming algorithms, synopsis data structures, and optimization techniques that is worth investigating and studying.

If you haven't tried these data structures, you will be amazed at how powerful they can be once you start using them. Understanding the concepts may be a little intimidating at first, but the implementations are actually quite simple. Google Guava has a Bloom filter implementation using Murmur hash. Clearspring's Java library stream-lib and Twitter's Scala library Algebird have implementations of all three data structures, plus other useful data structures you can play with. I have included the links below.

Related links:
