NUS-WIDE Dataset Preprocessing

Posted by HackerTom on 2020-11-24

Original URL: https://blog.csdn.net/HackerTom/article/details/110092390

NUS-WIDE^[1] is a multi-label dataset with 269,648 samples and 81 classes. Download the following files from [1]:

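Groundtruth: the labels. Unpacking gives two directories, AllLabels/ and TrainTestLabels/; this post uses only the former. It contains 81 files, one per class, named Labels_*.txt. Each file has 269,648 lines of 0/1 values indicating whether the corresponding sample belongs to that class.
Tags: can serve as the text modality. The download is NUS_WID_Tags.zip. Final_Tag_List.txt holds the 5018 tags mentioned in the paper^[2]; TagList1k.txt appears to be a 1000-tag subset of them, all English, and its size matches the description in DCMH^[6]; All_Tags.txt lists each sample's ID and its tags (my guess is that these tags come from the 5018-tag list, since some of them are not English); AllTags1k.txt is a 269,648 x 1000 0/1 matrix indicating whether each sample has each tag.
Concept List: unpacks to Concepts81.txt, the names of the 81 classes.
Image List: Imagelist.txt in it specifies the image that each sample corresponds to.
Image Urls: the download links for the image data.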

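The downloaded files come with their own train/test split, but this post ignores it and keeps the original sample order; the split is done later according to the experimental setting, as in [5].
The image data can be obtained from other sources, e.g. [3,4]. The images sit under Flickr/, which contains 704 subdirectories whose names match the entries in Imagelist.txt. This post ignores that directory structure and uses only the files in Groundtruth/ to build the labels.
This post assumes all downloaded files are placed under nuswide/, which is also the working directory and the place where the generated data files are saved.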

Common

import os
import numpy as np
import scipy.io as sio
import h5py


N_SAMPLE = 269648

Label

The classes follow the order in Concepts81.txt.
The labels are built from the files in Groundtruth/AllLabels/.

print("--- label ---")
LABEL_P = "Groundtruth/AllLabels"

# class order determined by `Concepts81.txt`
cls_id = {}
with open("Concepts81.txt", "r") as f:
    for cid, line in enumerate(f):
        cn = line.strip()
        cls_id[cn] = cid
# print("\nclass-ID:", cls_id)
id_cls = {cls_id[k]: k for k in cls_id}
# print("\nID-class:", id_cls)
N_CLASS = len(cls_id)
print("\n#classes:", N_CLASS)

class_files = os.listdir(LABEL_P)
# print("\nlabel file:", len(class_files), class_files)
label_key = lambda x: x.split(".txt")[0].split("Labels_")[-1]

labels = np.zeros([N_SAMPLE, N_CLASS], dtype=np.int8)
for cf in class_files:
    c_name = label_key(cf)
    cid = cls_id[c_name]
    print('->', cid, c_name)
    with open(os.path.join(LABEL_P, cf), "r") as f:
        for sid, line in enumerate(f):
            if int(line) > 0:
                labels[sid][cid] = 1
print("labels:", labels.shape, ", cardinality:", labels.sum())
# labels: (269648, 81) , cardinality: 503848
np.save("labels.npy", labels.astype(np.int8))
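
To sanity-check the label-building logic without the full download, the same steps can be run on in-memory toy data. This is only a minimal sketch: the class names, the `toy_files` dictionary and the `build_labels` helper are hypothetical, and only the one-0/1-value-per-line format is taken from the description of Labels_*.txt above.

```python
import numpy as np


def build_labels(class_names, files, n_sample):
    """Build an (n_sample, n_class) 0/1 matrix.

    class_names: class names in the desired column order.
    files: dict mapping "Labels_<class>.txt" -> list of "0"/"1" lines.
    """
    cls_id = {cn: cid for cid, cn in enumerate(class_names)}
    labels = np.zeros([n_sample, len(class_names)], dtype=np.int8)
    for fname, lines in files.items():
        # "Labels_sky.txt" -> "sky"
        c_name = fname.split(".txt")[0].split("Labels_")[-1]
        cid = cls_id[c_name]
        for sid, line in enumerate(lines):
            if int(line) > 0:
                labels[sid, cid] = 1
    return labels


# toy data: 4 samples, 3 classes (made up for illustration)
toy_classes = ["airport", "sky", "water"]
toy_files = {
    "Labels_sky.txt": ["1", "0", "1", "0"],
    "Labels_airport.txt": ["0", "0", "1", "0"],
    "Labels_water.txt": ["0", "1", "1", "0"],
}
toy_labels = build_labels(toy_classes, toy_files, 4)
print(toy_labels)
print("cardinality:", toy_labels.sum())
```

Note that the column order comes from `toy_classes`, not from the order in which the files are visited, which is why the order returned by `os.listdir` does not matter in the real script either.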

Image

Put all images under images/ so they are easy to read later, using symbolic links only^[7].

print("--- image ---")
IMAGE_LIST = "ImageList/Imagelist.txt"
IMAGE_SRC = "/usr/local/dataset/nuswide/Flickr"  # adjust to your actual path
IMAGE_DEST = "/usr/local/dataset/nuswide/images"  # adjust to your actual path
if not os.path.exists(IMAGE_DEST):
    os.makedirs(IMAGE_DEST)

with open(IMAGE_LIST, "r") as f:
    for sid, line in enumerate(f):
        img_p = os.path.join(IMAGE_SRC, line.strip())
        new_img_p = os.path.join(IMAGE_DEST, "{}.jpg".format(sid))
        os.system("ln -s {} {}".format(img_p, new_img_p))
        if sid % 1000 == 0:
            print(sid)
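
As an alternative to calling `ln -s` through `os.system`, the links can be created with `os.symlink` from the standard library, which avoids trouble with spaces or shell metacharacters in paths. Below is a minimal sketch on a throwaway directory tree; the `link_images` helper and the toy file names are hypothetical, and it assumes a platform where unprivileged symlinks are allowed (e.g. Linux or macOS).

```python
import os
import tempfile


def link_images(rel_paths, src_root, dest_root):
    """Create dest_root/<sid>.jpg -> src_root/<rel_path> symlinks."""
    os.makedirs(dest_root, exist_ok=True)
    for sid, rel in enumerate(rel_paths):
        src = os.path.join(src_root, rel)
        dst = os.path.join(dest_root, "{}.jpg".format(sid))
        if os.path.lexists(dst):  # skip links left by an earlier run
            continue
        os.symlink(src, dst)


# toy layout in a temporary directory (paths are made up)
root = tempfile.mkdtemp()
src_root = os.path.join(root, "Flickr")
os.makedirs(os.path.join(src_root, "actor"))
for name in ["a.jpg", "b.jpg"]:
    with open(os.path.join(src_root, "actor", name), "w") as f:
        f.write(name)

rel_paths = ["actor/a.jpg", "actor/b.jpg"]
dest_root = os.path.join(root, "images")
link_images(rel_paths, src_root, dest_root)
link_images(rel_paths, src_root, dest_root)  # second run is a no-op
print(sorted(os.listdir(dest_root)))
```

The `os.path.lexists` check makes the step safe to re-run after an interruption.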

Text

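Following the DCMH setting, the 1k tags are used. There are two ways to build the text data, though, and their results differ!
They are compared with the data provided by DCMH later to decide which one to use.
First, read the 1k tags and fix the tag order along the way, i.e. the order in TagList1k.txt.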

print("--- text ---")
TEXT_P = "NUS_WID_Tags"

# use 1k tags as DCMH
tag_id = {}
with open(os.path.join(TEXT_P, "TagList1k.txt"), "r", encoding='utf-8') as f:
    for tid, line in enumerate(f):
        tn = line.strip()
        tag_id[tn] = tid
id_tag = {tag_id[k]: k for k in tag_id}
print("\ntag-ID:", len(tag_id), list(tag_id)[:10])
N_TAG = len(id_tag)
print("\n#tag:", N_TAG)  # 1000

first way

The first way uses All_Tags.txt and keeps only the tags that belong to the 1k-tag subset.

print("- 1st: from `All_Tags.txt` -")
texts_1 = np.zeros([N_SAMPLE, N_TAG], dtype=np.int8)
with open(os.path.join(TEXT_P, "All_Tags.txt"), "r", encoding='utf-8') as f:
    for sid, line in enumerate(f):
        # format: <sample id> <tags...>
        _tags = line.split()[1:]
        # print(_tags)
        for _t in _tags:
            if _t in tag_id:  # restrict to the 1k tags
                tid = tag_id[_t]
                texts_1[sid][tid] = 1
        if sid % 1000 == 0:
            print(sid)
print("1st texts:", texts_1.shape, ", cardinality:", texts_1.sum())
# 1st texts: (269648, 1000) , cardinality: 1559503
np.save("texts.All_Tags.npy", texts_1.astype(np.int8))
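
The filtering behaviour is easy to see on a toy example. In this minimal sketch the 4-tag vocabulary and the three lines are made up; only the `<sample id> <tags...>` line format follows the comment in the code above.

```python
import numpy as np

# hypothetical 4-tag vocabulary, standing in for TagList1k.txt
toy_tag_id = {"sky": 0, "water": 1, "tree": 2, "dog": 3}

# toy lines in the `<sample id> <tags...>` format
toy_lines = [
    "0001 sky water unknown_tag",  # out-of-vocabulary tag is dropped
    "0002 dog",
    "0003",                        # sample without any tag
]

toy_texts = np.zeros([len(toy_lines), len(toy_tag_id)], dtype=np.int8)
for sid, line in enumerate(toy_lines):
    for t in line.split()[1:]:
        if t in toy_tag_id:
            toy_texts[sid, toy_tag_id[t]] = 1
print(toy_texts)
```

A sample whose tags all fall outside the vocabulary ends up as an all-zero row, just like a sample with no tags at all.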

second way

The second way reads AllTags1k.txt directly.
Note that its cardinality differs from the one obtained by the first way.

print("- 2nd: from `AllTags1k.txt` -")
texts_2 = np.zeros([N_SAMPLE, N_TAG], dtype=np.int8)
with open(os.path.join(TEXT_P, "AllTags1k.txt"), "r") as f:
    for sid, line in enumerate(f):
        # format: 1000-bit multi-hot vector
        line = list(map(int, line.split()))
        # assert len(line) == 1000
        texts_2[sid] = np.asarray(line).astype(np.int8)
        if sid % 1000 == 0:
            print(sid)
print("2nd texts:", texts_2.shape, ", cardinality:", texts_2.sum())
# 2nd texts: (269648, 1000) , cardinality: 1559464
np.save("texts.AllTags1k.npy", texts_2.astype(np.int8))
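
Since AllTags1k.txt is a plain whitespace-separated 0/1 matrix, `np.loadtxt` can in principle parse it in one call. The sketch below only demonstrates this on a made-up 3 x 4 matrix held in memory; it has not been tried here on the full 269,648 x 1000 file, where it may be slow or memory-hungry, so treat it as an option to try rather than a replacement for the loop.

```python
import io

import numpy as np

# toy stand-in for AllTags1k.txt: 3 samples x 4 tags
toy_file = io.StringIO("1 0 0 1\n0 0 0 0\n0 1 1 0\n")

# ndmin=2 keeps the result 2-D even if the file has a single row
toy_texts_2 = np.loadtxt(toy_file, dtype=np.int8, ndmin=2)
print(toy_texts_2.shape, ", cardinality:", toy_texts_2.sum())
```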

compare

Here the texts data from the two ways are compared.
They differ in only a few samples, where the first way yields more tags than the second.

print("- compare 2 texts -")
print("2 text:", texts_1.shape, texts_2.shape, texts_1.sum(), texts_2.sum())

with open(os.path.join(TEXT_P, "All_Tags.txt"), "r", encoding='utf-8') as f1, \
        open(os.path.join(TEXT_P, "AllTags1k.txt"), "r") as f2:
    for i in range(texts_1.shape[0]):
        n1 = texts_1[i].sum()
        n2 = texts_2[i].sum()
        line1 = next(f1)
        line2 = next(f2)
        if n1 == n2:
            # same tag count: make sure the very same tags are set
            diff = (texts_1[i] != texts_2[i]).sum()
            if diff != 0:
                print("tag order diff:", i, diff)
            continue
        print("--- diff:", i, n1, n2)
        tags1 = set([_t for _t in line1.split()[1:] if _t in tag_id])
        line2 = list(map(int, line2.split()))
        tags2 = set(id_tag[j] for j in range(len(line2)) if line2[j] > 0)
        print("tags 1:", tags1)
        print("\ntags 2:", tags2)

        extra1 = tags1 - tags2
        if len(extra1) > 0:
            print("\nextra 1:", extra1)
            for k in extra1:
                if k not in tag_id:
                    print("* ERROR:", k, "not in tag_id")
        extra2 = tags2 - tags1
        if len(extra2) > 0:
            print("\nextra 2:", extra2)
            for k in extra2:
                if k not in tag_id:
                    print("* ERROR:", k, "not in tag_id")
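
Once both matrices are in memory, the differing samples can also be located without re-reading the text files. A minimal sketch on toy matrices follows; `toy_1` and `toy_2` are made-up stand-ins for `texts_1` and `texts_2`.

```python
import numpy as np

# toy stand-ins for texts_1 / texts_2 (3 samples x 3 tags)
toy_1 = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=np.int8)
toy_2 = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]], dtype=np.int8)

# rows where at least one tag bit differs
diff_rows = np.nonzero((toy_1 != toy_2).any(axis=1))[0]
print("#diff samples:", len(diff_rows))
for i in diff_rows:
    only_1 = np.nonzero((toy_1[i] == 1) & (toy_2[i] == 0))[0]
    only_2 = np.nonzero((toy_1[i] == 0) & (toy_2[i] == 1))[0]
    print("sample", i, "tag ids only in 1st:", only_1, "only in 2nd:", only_2)
```

Comparing element-wise also catches rows that have the same number of tags but different tags, which a comparison of row sums alone would miss.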

TC-21, TC-10

Two common settings keep only the 21 or 10 classes with the most samples, i.e. TC-21 and TC-10.
The corresponding label data are built here.

print("--- TC-21, TC-10 ---")
# labels = np.load("labels.npy")
lab_sum = labels.sum(0)
# print("label sum:", lab_sum)
class_desc = np.argsort(lab_sum)[::-1]
tc21 = np.sort(class_desc[:21])
tc10 = np.sort(class_desc[:10])
print("TC-21:", {id_cls[k]: lab_sum[k] for k in tc21})
print("TC-10:", {id_cls[k]: lab_sum[k] for k in tc10})


def make_sub_class(tc):
    n_top = len(tc)
    print("- process TC-{} -".format(n_top))
    with open("class-name-tc{}.txt".format(n_top), "w") as f:
        for i in range(n_top):
            cid = tc[i]
            cn = id_cls[cid]
            n_sample = lab_sum[cid]
            # format: <new class id> <class name> <original class id> <#sample>
            f.write("{} {} {} {}\n".format(i, cn, cid, n_sample))

    sub_labels = labels[:, tc]
    print("sub labels:", sub_labels.shape, ", cardinality:", sub_labels.sum())
    # sub labels: (269648, 21) , cardinality: 411438
    # sub labels: (269648, 10) , cardinality: 332189
    np.save("labels.tc-{}.npy".format(n_top), sub_labels.astype(np.int8))


make_sub_class(tc21)
make_sub_class(tc10)
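
The top-k selection can be checked on a small matrix. In this minimal sketch the 5 x 4 label matrix and `k = 2` are made up for illustration.

```python
import numpy as np

# toy label matrix: 5 samples x 4 classes (made up for illustration)
toy_labels = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=np.int8)

k = 2  # hypothetical: keep the 2 most frequent classes
toy_sum = toy_labels.sum(0)                        # samples per class
order = np.argsort(toy_sum, kind="stable")[::-1]   # most frequent first
top_k = np.sort(order[:k])                         # back to original class order
toy_sub = toy_labels[:, top_k]
print("per-class counts:", toy_sum)
print("kept classes:", top_k)
print("sub labels:", toy_sub.shape, ", cardinality:", toy_sub.sum())
```

If two classes tie exactly at the cut-off, which one is kept depends on the sort, so it is worth looking at the printed per-class counts before trusting the selection.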

Clean Data

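Clean the data and obtain the indices of the clean samples.
There are two ways to sieve: drop only samples whose label is empty, or drop samples whose label or text is empty. Since the text has two versions, each of TC-21 and TC-10 yields 3 variants, 6 results in total.
The mapping from the new indices in the clean data to the indices in the original data is also recorded in clean-full-map.*.txt files (it might come in handy later).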

print("--- clean data ---")
labels_21 = np.load("labels.tc-21.npy")
labels_10 = np.load("labels.tc-10.npy")


def pick_clean(label, text, name, double_sieve):
    clean_id = []
    on_map = {}
    new_id = 0
    for i, (l, t) in enumerate(zip(label, text)):
        # if only sieved by label (`double_sieve` = False)
        # we get 195,834 samples in TC-21, and 186,577 in TC-10,
        # which matches the data DCMH provides
        if (0 == l.sum()):
            continue
        # if sieved by both label & text (`double_sieve` = True)
        # we get 190,421 samples in TC-21, and 181,365 in TC-10
        if double_sieve and (0 == t.sum()):
            continue
        on_map[new_id] = i
        new_id += 1
        clean_id.append(i)
    clean_id = np.asarray(clean_id)
    print("clean id:", clean_id.shape)
    np.save("clean_id.{}.npy".format(name), clean_id)

    with open("clean-full-map.{}.txt".format(name), "w") as f:
        for k in on_map:
            f.write("{} {}\n".format(k, on_map[k]))


for label, ln in zip([labels_21, labels_10], ["tc21", "tc10"]):
    pick_clean(label, label, ln, False)
    for text, tn in zip([texts_1, texts_2], ["All_Tags", "AllTags1k"]):
        pick_clean(label, text, "{}.{}".format(ln, tn), True)
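
The same sieving can be written with boolean masks instead of a Python loop. A minimal sketch on toy data follows; the `pick_clean_ids` helper and the toy matrices are hypothetical.

```python
import numpy as np


def pick_clean_ids(label, text=None):
    """Return original indices of samples with a non-empty label
    (and, if `text` is given, a non-empty text as well)."""
    keep = label.sum(1) > 0
    if text is not None:
        keep &= text.sum(1) > 0
    return np.nonzero(keep)[0]


# toy data: 4 samples, 2 classes, 3 tags (made up for illustration)
toy_label = np.array([[1, 0], [0, 0], [0, 1], [1, 1]], dtype=np.int8)
toy_text = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 0, 1]], dtype=np.int8)

ids_label_only = pick_clean_ids(toy_label)
ids_both = pick_clean_ids(toy_label, toy_text)
print("label only:", ids_label_only)
print("label & text:", ids_both)
```

The returned array is itself the new-to-original index map: position `j` holds the original index of the `j`-th clean sample.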

compare with DCMH

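Here the results are compared with the data provided by DCMH^[8].
Conclusions: the sample order differs; the label sums are equal (so I take the labels to be correct); the text built by the 2nd way has the same sum as DCMH's (so I take it to be correct) and is the one to use.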

print("--- compare with the DCMH provided ---")
L_21_dcmh = sio.loadmat("nus-wide-tc21-lall.mat")["LAll"]
L_10_dcmh = sio.loadmat("nus-wide-tc10-lall.mat")["LAll"]
T_21_dcmh = sio.loadmat("nus-wide-tc21-yall.mat")["YAll"]
# `np.int` was removed in NumPy 1.24; use an explicit integer dtype
T_10_dcmh = h5py.File("nus-wide-tc10-yall.mat", "r")["YAll"][:].T.astype(np.int64)
print(L_21_dcmh.shape, L_10_dcmh.shape, T_21_dcmh.shape, T_10_dcmh.shape)
# (195834, 21) (186577, 10) (195834, 1000) (186577, 1000)

# file names as saved by `pick_clean` with `double_sieve` = False
clean_id_10 = np.load("clean_id.tc10.npy")
clean_id_21 = np.load("clean_id.tc21.npy")

# compare the label sums (on the TC-21 / TC-10 columns): equal
print("label 21:", L_21_dcmh.sum(), labels_21[clean_id_21].sum())
print("label 10:", L_10_dcmh.sum(), labels_10[clean_id_10].sum())
# compare the two text versions: the **second** one matches the DCMH data
print("text 21:", T_21_dcmh.sum(), texts_1[clean_id_21].sum(), texts_2[clean_id_21].sum())
print("text 10:", T_10_dcmh.sum(), texts_1[clean_id_10].sum(), texts_2[clean_id_10].sum())
L_21_my = labels_21[clean_id_21]
L_10_my = labels_10[clean_id_10]
T_21_my = texts_2[clean_id_21]
T_10_my = texts_2[clean_id_10]


def check_sample_order(L_dcmh, L_my, T_dcmh, T_my):
    nc = L_dcmh.shape[1]
    print("---", nc, "---")
    has_diff = False
    for i in range(L_dcmh.shape[0]):
        l1 = L_dcmh[i].sum()
        l2 = L_my[i].sum()
        if l1 != l2:
            print("* label diff:", i, l1, l2)
            has_diff = True
            break
        t1 = T_dcmh[i].sum()
        t2 = T_my[i].sum()
        if t1 != t2:
            print("* text diff:", i, t1, t2)
            has_diff = True
            break
    if not has_diff:
        print("DONE")


# compare the sample order: **not** the same
check_sample_order(L_21_dcmh, L_21_my, T_21_dcmh, T_21_my)
check_sample_order(L_10_dcmh, L_10_my, T_10_dcmh, T_10_my)
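
Because the sample order differs, equal sums are only indirect evidence that the two data sets agree. One possible follow-up, which is not part of the original comparison, is to check whether both matrices hold the same rows regardless of order. A minimal sketch on toy data follows; the `same_rows_any_order` helper is hypothetical, and on the full matrices `np.unique(..., axis=0)` may need a fair amount of memory.

```python
import numpy as np


def same_rows_any_order(x, y):
    """True if x and y hold the same rows with the same multiplicities."""
    if x.shape != y.shape:
        return False
    ux, cx = np.unique(x, axis=0, return_counts=True)
    uy, cy = np.unique(y, axis=0, return_counts=True)
    return bool(np.array_equal(ux, uy) and np.array_equal(cx, cy))


# toy data: `toy_b` is `toy_a` with shuffled rows, `toy_c` differs in one row
toy_a = np.array([[1, 0], [0, 1], [1, 0]], dtype=np.int8)
toy_b = toy_a[[2, 0, 1]]
toy_c = np.array([[1, 0], [0, 1], [0, 1]], dtype=np.int8)

print(same_rows_any_order(toy_a, toy_b))
print(same_rows_any_order(toy_a, toy_c))
```

To check labels and texts jointly, the two matrices could be concatenated column-wise with `np.hstack` before calling the helper, so that each row describes one whole sample.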

Cloud Drive

Baidu Netdisk: https://pan.baidu.com/s/1362XGnPAp5zlL__eF5D_mw, extraction code: hf3r.
NUS-WIDE

References
