Finding a Missing Value in Massive Data

Posted by JUAN425 on 2015-04-16

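A file contains 4 billion integers. Find an integer that is not in the file.

If each integer takes 4 bytes, the file size is 4 * 10^9 * 4 bytes = 16 * 10^9 bytes = 16 GB, which is too large to load into memory all at once.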

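(1) Method 1: use external sorting. Once the data is sorted, scan from the first number until a gap appears between two adjacent values; a gap means at least one value is missing there. This outputs the first missing value.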
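The scan over the sorted data can be sketched as follows. This is a minimal sketch, not a full external-sort implementation: it assumes the sorted values are already available as an ascending `std::vector<std::uint32_t>` (a real version would stream them from the sorted file) and that the integers are unsigned 32-bit. The name `firstMissingSorted` is hypothetical. Unlike the description above, the sketch starts counting at 0, so a value missing before the first element is also reported.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: finds the smallest unsigned 32-bit value that does not
// occur in `sorted`. Returns true and stores it in `missing`, or returns false
// if every value from 0 to 2^32 - 1 is present.
// Precondition: `sorted` is in ascending order (duplicates are allowed).
bool firstMissingSorted(const std::vector<std::uint32_t>& sorted,
                        std::uint32_t& missing) {
    std::uint64_t expected = 0;  // 64-bit so that it can reach 2^32
    for (std::uint32_t v : sorted) {
        // Ascending input can never fall more than one below `expected`.
        assert(v + std::uint64_t{1} >= expected);
        if (v > expected) break;        // gap found: `expected` is missing
        if (v == expected) ++expected;  // value present, move on
        // v < expected means a duplicate, which is simply skipped
    }
    if (expected > UINT32_MAX) return false;  // no value is missing
    missing = static_cast<std::uint32_t>(expected);
    return true;
}
```

For example, calling it on the sorted values 0, 1, 2, 4, 5 stores 3 in `missing`.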

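(2) Method 2:

Suppose the data consists of 32-bit integers. There are 2^32 possible distinct values in total, about 4 billion (2^32 ≈ 4.29 * 10^9). Since the file holds only 4 * 10^9 integers, fewer than 2^32, at least one value must be missing.

Case 1: we have 1 * 10^9 bytes * 8 bits/byte = 8 billion bits of memory. That is more than enough to represent all 2^32 possible distinct numbers, which need 2^32 bits = 512 MiB, so we can use one bit per integer, that is, a bitmap. This lets us look up any integer's position directly, without sorting. We make one pass over the data and, for every number we see, set the bit at the corresponding index to 1. Then we scan the bitmap for a bit that is still 0; the index of that bit is a missing value.

Case 2: we only have 10 MB = 80 million bits of memory. In that case we can proceed as follows.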
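The bitmap approach can be sketched like this. It is a minimal sketch under stated assumptions: the values arrive as an in-memory `std::vector<std::uint32_t>` rather than a file, and the function name `findMissingWithBitmap` and its `bits` parameter are hypothetical. Passing `bits = 32` corresponds to the 2^32-bit bitmap described above; smaller widths make the idea easy to try on toy data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: finds a value in [0, 2^bits) that does not occur in
// `values`. Returns true and stores it in `missing`, or returns false if
// every value in the range is present.
bool findMissingWithBitmap(const std::vector<std::uint32_t>& values,
                           unsigned bits, std::uint32_t& missing) {
    assert(bits >= 1 && bits <= 32);
    const std::uint64_t universe = std::uint64_t{1} << bits;

    // One bit per possible value, packed into 64-bit words.
    std::vector<std::uint64_t> bitmap((universe + 63) / 64, 0);

    // Single pass over the data: set the bit at index v for every value v.
    for (std::uint32_t v : values) {
        assert(v < universe);
        bitmap[v / 64] |= std::uint64_t{1} << (v % 64);
    }

    // Scan for the first bit that is still 0; its index is a missing value.
    for (std::uint64_t i = 0; i < universe; ++i) {
        if ((bitmap[i / 64] & (std::uint64_t{1} << (i % 64))) == 0) {
            missing = static_cast<std::uint32_t>(i);
            return true;
        }
    }
    return false;
}
```

With `bits = 3` and the values 0, 1, 2, 3, 5, 6, 7, the function stores 4 in `missing`.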

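Split each 32-bit number into a 16-bit prefix (the high 16 bits) and a 16-bit suffix (the low 16 bits). There are 2^16 possible prefixes, and each prefix covers 2^16 possible numbers, as shown below:

00000000 00000000 : 2^16 possible numbers (low bits)

00000000 00000001 : 2^16 possible numbers (low bits)

...

11111111 11111111 : 2^16 possible numbers (low bits)

This amounts to 2^16 buckets, each covering 2^16 possible numbers (one per low-16-bit value). We make one pass over the 4 billion numbers in the file and count: numbers with the same 16-bit prefix fall into the same bucket, and each one increments that bucket's counter. The 2^16 counters take only 256 KB at 4 bytes each, well within 10 MB. Because the file holds fewer than 2^32 numbers in total, at least one bucket must end up with a count below 2^16, and any such bucket is missing at least one number, even if the file contains duplicates. We only need to record the prefix of one such bucket.

Then free all of that memory and make a second pass over the 4 billion numbers, this time looking at suffixes. There are again 2^16 suffix buckets, and one bit per bucket is enough. For every number whose prefix equals the recorded prefix, set the bucket for its suffix to 1 to mark it as present. After the pass, scan the buckets: a bucket that is still 0 corresponds to a suffix that does not occur. Concatenating the recorded prefix with that suffix gives a missing value.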
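The two passes can be sketched as follows. This is a minimal sketch under stated assumptions: the data is an in-memory `std::vector<std::uint32_t>` (each loop over it stands in for one sequential read of the file), and the function name `findMissingTwoPass` and its `halfBits` parameter are hypothetical. `halfBits = 16` corresponds to the 16-bit prefix and suffix used above; smaller values shrink the range to 2^(2 * halfBits) for experimenting.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: looks for a value in [0, 2^(2*halfBits)) that does not
// occur in `values`. Returns true and stores it in `missing`; returns false
// if no prefix bucket is under-full, which can only happen when the input
// holds at least 2^(2*halfBits) values.
bool findMissingTwoPass(const std::vector<std::uint32_t>& values,
                        unsigned halfBits, std::uint32_t& missing) {
    assert(halfBits >= 1 && halfBits <= 16);
    const std::uint32_t buckets = std::uint32_t{1} << halfBits;
    const std::uint32_t lowMask = buckets - 1;

    // Pass 1: count how many values fall into each prefix bucket.
    std::vector<std::uint32_t> count(buckets, 0);
    for (std::uint32_t v : values) {
        assert((v >> halfBits) < buckets);  // value must fit in the range
        ++count[v >> halfBits];
    }

    // A bucket with fewer than 2^halfBits hits cannot contain every suffix.
    std::uint32_t prefix = 0;
    while (prefix < buckets && count[prefix] >= buckets) ++prefix;
    if (prefix == buckets) return false;

    // Release the counters before the second pass.
    count.clear();
    count.shrink_to_fit();

    // Pass 2: mark which suffixes occur together with the recorded prefix.
    std::vector<bool> seen(buckets, false);
    for (std::uint32_t v : values) {
        if ((v >> halfBits) == prefix) seen[v & lowMask] = true;
    }

    // The first unmarked suffix, joined with the prefix, is a missing value.
    for (std::uint32_t suffix = 0; suffix < buckets; ++suffix) {
        if (!seen[suffix]) {
            missing = (prefix << halfBits) | suffix;
            return true;
        }
    }
    return false;  // not reached: an under-full bucket always has a free suffix
}
```

With `halfBits = 2` and every value from 0 to 15 except 9, the first pass finds that prefix 2 has only three hits, and the second pass finds suffix 1 unmarked, so `missing` becomes 9.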

After writing up the interview question above, I happened to come across a way to generate Gray codes. My notes follow.

Problem: Given a number n, generate bit patterns from 0 to 2^n-1 such that successive patterns differ by one bit.

Approach:

n-bit Gray Codes can be generated from list of (n-1)-bit Gray codes using following steps.
1) Let the list of (n-1)-bit Gray codes be L1. Create another list L2 which is reverse of L1.
2) Modify the list L1 by prefixing a ‘0’ in all codes of L1.
3) Modify the list L2 by prefixing a ‘1’ in all codes of L2.
4) Concatenate L1 and L2. The concatenated list is required list of n-bit Gray codes.

For example, following are steps for generating the 3-bit Gray code list from the list of 2-bit Gray code list.
L1 = {00, 01, 11, 10} (List of 2-bit Gray Codes)
L2 = {10, 11, 01, 00} (Reverse of L1)
Prefix all entries of L1 with ‘0’, L1 becomes {000, 001, 011, 010}
Prefix all entries of L2 with ‘1’, L2 becomes {110, 111, 101, 100}
Concatenate L1 and L2, we get {000, 001, 011, 010, 110, 111, 101, 100}

The program is as follows:

// C++ program to generate n-bit Gray codes
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// This function generates all n bit Gray codes and prints the
// generated codes
void generateGrayarr(int n) {
    // base case
    if (n <= 0)
        return;

    // vector of strings, store the gray code
    vector<string> arr;

    // start with one-bit pattern
    arr.push_back("0"); // arr[0] = 0
    arr.push_back("1"); // arr[1] = 1

    // Every iteration of this loop generates 2*i codes from previously
    // generated i codes.
    for (int i = 2; i < (1<<n); i = i<<1) {
        // Append the previously generated codes to arr[] again, in reverse
        // order. Now arr[] holds twice as many codes.
        for (int j = i-1 ; j >= 0 ; j--)
            arr.push_back(arr[j]); // arr[j] is the j-th Gray code

        // prepend 0 to the first half
        for (int j = 0 ; j < i ; j++)
            arr[j] = "0" + arr[j];

        // prepend 1 to the second half
        for (int j = i ; j < 2*i ; j++)
            arr[j] = "1" + arr[j];
    }

    // print contents of arr[]
    for (int i = 0 ; i < static_cast<int>(arr.size()) ; i++ )
        cout << arr[i] << endl;
}

// Driver program to test above function
int main() {
    generateGrayarr(4);
    return 0;
}

The output is as follows:

0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1111
1110
1010
1011
1001
1000
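The sequence that generateGrayarr produces can be cross-checked with a short sketch. Note that this uses a different technique from the list-reflection steps above: the well-known closed form in which the i-th reflected Gray code equals i XOR (i >> 1). The function names `grayCodes` and `differByOneBit` are hypothetical, and the upper limit on n is an arbitrary choice to keep the table small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the n-bit reflected Gray code sequence as integers, using the
// closed form g(i) = i XOR (i >> 1) instead of the list-reflection steps.
std::vector<std::uint32_t> grayCodes(unsigned n) {
    assert(n >= 1 && n <= 20);  // arbitrary cap for this sketch
    const std::uint32_t total = std::uint32_t{1} << n;
    std::vector<std::uint32_t> codes;
    codes.reserve(total);
    for (std::uint32_t i = 0; i < total; ++i)
        codes.push_back(i ^ (i >> 1));
    return codes;
}

// True if every pair of successive codes differs in exactly one bit.
bool differByOneBit(const std::vector<std::uint32_t>& codes) {
    for (std::size_t i = 1; i < codes.size(); ++i) {
        const std::uint32_t diff = codes[i - 1] ^ codes[i];
        // Exactly one bit set means diff is a nonzero power of two.
        if (diff == 0 || (diff & (diff - 1)) != 0) return false;
    }
    return true;
}
```

For n = 3 this yields 0, 1, 3, 2, 6, 7, 5, 4, which are the integer values of {000, 001, 011, 010, 110, 111, 101, 100} from the worked example.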
