[Image Processing] Cleaning Image Datasets with the CleanVision Library

Posted by 落痕的寒假 on 2024-10-24

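CleanVision is an open-source Python library that helps users automatically detect common problems in image datasets that can hurt machine learning projects. It is designed as a first step for computer vision projects, so that dataset problems can be found and fixed before any machine learning is applied. Its core functionality is detecting problem images: exact duplicates, near duplicates, blurry images, low-information images, overly dark or overly bright images, grayscale images, odd aspect ratios, and odd sizes. The open-source repository is CleanVision, and the official documentation is CleanVision-docs.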

Install the basic version of CleanVision with:

pip install cleanvision

Install the full version with:

pip install "cleanvision[all]"

Check the CleanVision version:

# Check the version
import cleanvision
cleanvision.__version__
'0.3.6'

Versions of the libraries required by the code in this article:

# Used to display tables
import tabulate
# tabulate 0.8.10 or later is required
tabulate.__version__
'0.9.0'
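If you would rather check these versions without importing the packages, something like the following sketch may help. It uses only the standard library's importlib.metadata. The helper names installed_version, version_tuple and meets_minimum are made up for this illustration, and the naive parsing only handles plain numeric versions such as 0.8.10.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '0.8.10' into (0, 8, 10); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(package, minimum):
    """True if `package` is installed and its version is at least `minimum`."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)


# Report what is installed; a missing package shows up as None.
print("cleanvision:", installed_version("cleanvision"))
print("tabulate >= 0.8.10:", meets_minimum("tabulate", "0.8.10"))
```

Comparing tuples rather than strings matters here: as strings, "0.8.9" sorts after "0.8.10", which would give the wrong answer.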

Contents
  • 1 Usage
    • 1.1 CleanVision features
    • 1.2 Basic usage
    • 1.3 Custom detection
    • 1.4 Running CleanVision on Torchvision datasets
    • 1.5 Running CleanVision on Hugging Face datasets
  • 2 References

1 Usage

1.1 CleanVision features

CleanVision supports many image file formats and can detect the following types of data issues:

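| Issue type       | Description                                                        | Keyword          |
|:-----------------|:-------------------------------------------------------------------|:-----------------|
| Exact duplicates | Images that are identical                                          | exact_duplicates |
| Near duplicates  | Images that are visually almost identical                          | near_duplicates  |
| Blurry           | Images with blurred details (out of focus)                         | blurry           |
| Low information  | Images lacking content (low entropy in pixel values)               | low_information  |
| Dark             | Abnormally dark images (underexposed)                              | dark             |
| Light            | Abnormally bright images (overexposed)                             | light            |
| Grayscale        | Images lacking color                                               | grayscale        |
| Odd aspect ratio | Images with an unusual aspect ratio                                | odd_aspect_ratio |
| Odd size         | Images whose size is unusual compared with the rest of the dataset | odd_size         |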

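CleanVision detects the issues in the table above mainly with statistical methods. The Keyword column gives the name used for each issue type in CleanVision code. CleanVision is compatible with Linux, macOS, and Windows and runs on Python 3.7 or later.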
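To make the "low entropy in pixel values" idea a little more concrete, here is a minimal sketch that computes the Shannon entropy of a list of pixel values with the standard library. It only illustrates the statistic itself. How CleanVision turns entropy into a low_information score is not covered in this article, and pixel_entropy is a name invented for this example.

```python
import math
from collections import Counter


def pixel_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of pixel values."""
    total = len(pixels)
    counts = Counter(pixels)
    # Sum p * log2(1 / p) over the distinct pixel values.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


flat_image = [128] * 64         # every pixel identical: carries no information
varied_image = list(range(64))  # every pixel different

print(pixel_entropy(flat_image))    # 0.0
print(pixel_entropy(varied_image))  # 6.0
```

An image that is one solid color has an entropy of zero, which is why such images end up being flagged as lacking content.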

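1.2 Basic usage

This section shows how to read the images in a folder and check them for issues. The example below runs a quality check on a folder of 607 images. During the check, CleanVision automatically uses multiple processes to speed things up:

Basic usage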

from cleanvision import Imagelab

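# Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
# Path to the sample images
dataset_path = "./image_files/"

# Create an Imagelab instance for the following steps
imagelab = Imagelab(data_path=dataset_path)

# Images are processed in parallel with multiprocessing; n_jobs sets the number of processes
# n_jobs defaults to None, which lets CleanVision choose the number of processes
# Each image is first checked for image_property issues (image quality)
# Once all images are processed, the duplicate checks are run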
imagelab.find_issues(verbose=False, n_jobs=2)
Reading images from D:/cleanvision/image_files

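When running CleanVision on Windows, the code must be placed inside a main block (guarded by if __name__ == "__main__":) so that the multiprocessing module works correctly. Alternatively, set n_jobs to 1 to use a single process: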

from cleanvision import Imagelab

if __name__ == "__main__":
    # Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
    # Path to the sample images
    dataset_path = "./image_files/"
    imagelab = Imagelab(data_path=dataset_path)
    imagelab.find_issues(verbose=False)
Reading images from D:/cleanvision/image_files

The report function reports the number of images with each issue type in the dataset and displays the most severe examples of each type:

imagelab.report()
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | odd_size         |          109 |
|  1 | grayscale        |           20 |
|  2 | near_duplicates  |           20 |
|  3 | exact_duplicates |           19 |
|  4 | odd_aspect_ratio |           11 |
|  5 | dark             |           10 |
|  6 | blurry           |            6 |
|  7 | light            |            5 |
|  8 | low_information  |            5 | 

--------------------- odd_size images ----------------------

Number of examples with this issue: 109
Examples representing most severe instances of this issue:

--------------------- grayscale images ---------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

------------------ near_duplicates images ------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

Set: 0

Set: 1

Set: 2

Set: 3

----------------- exact_duplicates images ------------------

Number of examples with this issue: 19
Examples representing most severe instances of this issue:

Set: 0

Set: 1

Set: 2

Set: 3

----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 11
Examples representing most severe instances of this issue:

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

To create a custom issue type, see custom_issue_manager.

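The main way to interact with the results is through the Imagelab class. It lets you examine the issues in a dataset at a macro level (a global overview) and at a micro level (the issues and quality scores of each image). It has three main attributes:

  • Imagelab.issue_summary: a summary of the issues
  • Imagelab.issues: the per-image list of issues
  • Imagelab.info: dataset information, including information about similar images

Analyzing the issue results

The issue_summary attribute shows the number of images in each issue category: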

# The result is a pandas DataFrame
res = imagelab.issue_summary
type(res)
pandas.core.frame.DataFrame

View the summary:

res
issue_type num_images
0 odd_size 109
1 grayscale 20
2 near_duplicates 20
3 exact_duplicates 19
4 odd_aspect_ratio 11
5 dark 10
6 blurry 6
7 light 5
8 low_information 5

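The issues attribute shows, for every image, the quality score for each issue type and whether the issue is present. Scores range from 0 to 1, and a lower score means a more severe issue: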

imagelab.issues.head()
odd_size_score is_odd_size_issue odd_aspect_ratio_score is_odd_aspect_ratio_issue low_information_score is_low_information_issue light_score is_light_issue grayscale_score is_grayscale_issue dark_score is_dark_issue blurry_score is_blurry_issue exact_duplicates_score is_exact_duplicates_issue near_duplicates_score is_near_duplicates_issue
D:/cleanvision/image_files/image_0.png 1.0 False 1.0 False 0.806332 False 0.925490 False 1 False 1.000000 False 0.980373 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_1.png 1.0 False 1.0 False 0.923116 False 0.906609 False 1 False 0.990676 False 0.472314 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_10.png 1.0 False 1.0 False 0.875129 False 0.995127 False 1 False 0.795937 False 0.470706 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_100.png 1.0 False 1.0 False 0.916140 False 0.889762 False 1 False 0.827587 False 0.441195 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_101.png 1.0 False 1.0 False 0.779338 False 0.960784 False 0 True 0.992157 False 0.507767 False 1.0 False 1.0 False
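The column names follow a regular pattern: every issue type has a <name>_score column and an is_<name>_issue column. As a small sketch of how that pattern can be used, the hypothetical helper issues_for below lists the issues flagged for one image. It works on a plain dict, so it runs without pandas. The values are copied from the image_101.png row above, restricted to three issue types.

```python
def issues_for(row):
    """Return the issue names whose is_<name>_issue flag is set in `row`."""
    prefix, suffix = "is_", "_issue"
    return [
        column[len(prefix):-len(suffix)]
        for column, value in row.items()
        if column.startswith(prefix) and column.endswith(suffix) and value
    ]


# A subset of the columns of the image_101.png row shown above.
image_101 = {
    "grayscale_score": 0,
    "is_grayscale_issue": True,
    "dark_score": 0.992157,
    "is_dark_issue": False,
    "blurry_score": 0.507767,
    "is_blurry_issue": False,
}

print(issues_for(image_101))  # ['grayscale']
```

With the real DataFrame, a row could presumably be turned into such a dict with the pandas Series.to_dict method, for example imagelab.issues.loc[path].to_dict().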

Because imagelab.issues is a pandas DataFrame, you can filter it for a specific issue type:

# Lower scores are more severe
dark_images = imagelab.issues[imagelab.issues["is_dark_issue"]].sort_values(
    by=["dark_score"]
)
dark_images_files = dark_images.index.tolist()
dark_images_files
['D:/cleanvision/image_files/image_417.png',
 'D:/cleanvision/image_files/image_350.png',
 'D:/cleanvision/image_files/image_605.png',
 'D:/cleanvision/image_files/image_177.png',
 'D:/cleanvision/image_files/image_346.png',
 'D:/cleanvision/image_files/image_198.png',
 'D:/cleanvision/image_files/image_204.png',
 'D:/cleanvision/image_files/image_485.png',
 'D:/cleanvision/image_files/image_457.png',
 'D:/cleanvision/image_files/image_576.png']
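Once you have a list of flagged files like this, one possible next step is to separate them from the rest of the dataset for manual review. The sketch below shows the idea with plain lists. The helper split_clean_and_flagged and the short all_files list are invented for illustration, and nothing is moved or deleted on disk.

```python
def split_clean_and_flagged(all_paths, flagged_paths):
    """Split `all_paths` into (clean, to_review), preserving the original order."""
    flagged = set(flagged_paths)
    clean = [path for path in all_paths if path not in flagged]
    to_review = [path for path in all_paths if path in flagged]
    return clean, to_review


# A made-up slice of the dataset; two of the paths appear in the dark list above.
all_files = [
    "D:/cleanvision/image_files/image_0.png",
    "D:/cleanvision/image_files/image_417.png",
    "D:/cleanvision/image_files/image_1.png",
    "D:/cleanvision/image_files/image_350.png",
]
dark_files = [
    "D:/cleanvision/image_files/image_417.png",
    "D:/cleanvision/image_files/image_350.png",
]

clean, to_review = split_clean_and_flagged(all_files, dark_files)
print(len(clean), "kept,", len(to_review), "to review")
```

Keeping the flagged files in a separate list instead of deleting them right away leaves room to inspect them first, since a low score does not always mean the image is unusable.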

Visualize some of these problem images:

imagelab.visualize(image_files=dark_images_files[:4])

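A more concise way to do the above is to pass the issue_types argument directly to imagelab.visualize, which displays the images with a given issue, sorted by severity:

# issue_types: issue types to show; num_images: number of images to display; cell_size: size of each image in the grid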
imagelab.visualize(issue_types=["low_information"], num_images=3, cell_size=(3, 3))

Viewing image information and similar images

The info attribute gives access to information about the dataset:

# List the available entries
imagelab.info.keys()
dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'odd_size', 'exact_duplicates', 'near_duplicates'])
# List the available statistics
imagelab.info["statistics"].keys()
dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space', 'size'])
# View the size statistics of the dataset
imagelab.info["statistics"]["size"]
count     607.000000
mean      280.830152
std       215.001908
min        32.000000
25%       256.000000
50%       256.000000
75%       256.000000
max      4666.050578
Name: size, dtype: float64
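The fractional maximum (4666.050578) suggests that size is not a raw pixel count. My assumption, which this article does not confirm, is that it is the square root of width times height, which would also explain why 256 x 256 images show up as 256. The sketch below illustrates that assumed formula. The function name size_statistic is made up.

```python
import math


def size_statistic(width, height):
    """Assumed definition of the 'size' statistic: sqrt(width * height)."""
    return math.sqrt(width * height)


print(size_statistic(256, 256))  # a square 256 x 256 image
print(size_statistic(32, 32))    # matches the minimum in the output above
print(size_statistic(640, 480))  # a non-square image gives a fractional value
```

Under this assumption the statistic has the same unit as a side length, so it can be compared directly with values such as the median of 256.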

Check the number of sets of exactly duplicated images in the dataset:

imagelab.info["exact_duplicates"]["num_sets"]
9
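For intuition about what a "set" of exact duplicates is, the sketch below groups files by a hash of their content, a common way to find byte-identical files. This is not CleanVision's implementation. It uses SHA-256 from the standard library on in-memory bytes, and exact_duplicate_sets and the sample data are invented for the example.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_sets(files):
    """Group file names whose content is byte-for-byte identical.

    `files` maps a file name to its raw bytes. Only groups with at least
    two members are returned, each as a sorted list of names.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Made-up contents: a.png and c.png are identical, b.png differs.
sample_files = {
    "a.png": b"\x89PNG-fake-image-data-1",
    "b.png": b"\x89PNG-fake-image-data-2",
    "c.png": b"\x89PNG-fake-image-data-1",
}

duplicate_sets = exact_duplicate_sets(sample_files)
print(len(duplicate_sets), duplicate_sets)
```

The length of the returned list plays the same role as num_sets above: it counts groups of duplicates, not individual duplicated images.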

List the sets of near-duplicate images in the dataset:

imagelab.info["near_duplicates"]["sets"]
[['D:/cleanvision/image_files/image_103.png',
  'D:/cleanvision/image_files/image_408.png'],
 ['D:/cleanvision/image_files/image_109.png',
  'D:/cleanvision/image_files/image_329.png'],
 ['D:/cleanvision/image_files/image_119.png',
  'D:/cleanvision/image_files/image_250.png'],
 ['D:/cleanvision/image_files/image_140.png',
  'D:/cleanvision/image_files/image_538.png'],
 ['D:/cleanvision/image_files/image_25.png',
  'D:/cleanvision/image_files/image_357.png'],
 ['D:/cleanvision/image_files/image_255.png',
  'D:/cleanvision/image_files/image_43.png'],
 ['D:/cleanvision/image_files/image_263.png',
  'D:/cleanvision/image_files/image_486.png'],
 ['D:/cleanvision/image_files/image_3.png',
  'D:/cleanvision/image_files/image_64.png'],
 ['D:/cleanvision/image_files/image_389.png',
  'D:/cleanvision/image_files/image_426.png'],
 ['D:/cleanvision/image_files/image_52.png',
  'D:/cleanvision/image_files/image_66.png']]
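A typical use of these sets is deduplication: keep one image from each set and drop the others. The sketch below does this on the first two sets from the output above. Keeping the first file of each set is an arbitrary choice made for this example, and files_to_drop is a hypothetical helper, not part of CleanVision.

```python
def files_to_drop(duplicate_sets):
    """Keep the first file of every set and return all the others, sorted."""
    drop = []
    for group in duplicate_sets:
        drop.extend(group[1:])
    return sorted(drop)


# The first two near-duplicate sets from the output above.
near_duplicate_sets = [
    [
        "D:/cleanvision/image_files/image_103.png",
        "D:/cleanvision/image_files/image_408.png",
    ],
    [
        "D:/cleanvision/image_files/image_109.png",
        "D:/cleanvision/image_files/image_329.png",
    ],
]

for path in files_to_drop(near_duplicate_sets):
    print("would drop:", path)
```

For near duplicates in particular, it is worth looking at the images before dropping anything, because the two images in a set may differ in ways that matter for your task.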

1.3 Custom detection

Specifying issue types

from cleanvision import Imagelab

# Sample data: https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip
dataset_path = "./image_files/"

# Specify the issue types to check
issue_types = {"blurry":{}, "dark": {}}

imagelab = Imagelab(data_path=dataset_path)

imagelab.find_issues(issue_types=issue_types, verbose=False)
imagelab.report()
Reading images from D:/cleanvision/image_files


Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

If find_issues has already been run, running it again with additional issue types merges the new results with the previous ones:

issue_types = {"light": {}}
imagelab.find_issues(issue_types)
# The report now covers all three issue types
imagelab.report()
Checking for light images ...
Issue checks completed. 21 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 |
|  2 | light        |            5 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

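Saving results

The code below shows how to save and load results. When loading, the data path and the dataset must be the same as when the results were saved: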

save_path = "./results"
# Save the results
# force controls whether existing files are overwritten
imagelab.save(save_path, force=True)
# Load the results
imagelab = Imagelab.load(save_path, dataset_path)
Successfully loaded Imagelab

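Setting thresholds

CleanVision uses thresholds to decide its detection results. exact_duplicates and near_duplicates are detected with image hashes (provided by the imagehash library), while the other checks are controlled by a threshold between 0 and 1. An image is considered to have an issue when its score for that issue type is below the threshold, so a higher threshold flags more images. The hyperparameters are listed below: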

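|   # | Keyword          | Hyperparameters                                                  |
|----:|:-----------------|:-----------------------------------------------------------------|
|   1 | light            | threshold                                                        |
|   2 | dark             | threshold                                                        |
|   3 | odd_aspect_ratio | threshold                                                        |
|   4 | exact_duplicates | N/A                                                              |
|   5 | near_duplicates  | hash_size (int), hash_types (whash, phash, ahash, dhash, chash)  |
|   6 | blurry           | threshold                                                        |
|   7 | grayscale        | threshold                                                        |
|   8 | low_information  | threshold                                                        |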
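The threshold rule itself is simple enough to sketch in a few lines: an image is flagged when its score is below the threshold. The example below applies that rule to the five dark_score values shown earlier in imagelab.issues.head(). The thresholds 0.5, 0.8 and 0.9 are arbitrary values chosen for illustration, and flag_issues is a made-up helper, not a CleanVision function.

```python
def flag_issues(scores, threshold):
    """Map each image name to True when its score is below `threshold`."""
    return {name: score < threshold for name, score in scores.items()}


# dark_score values taken from the imagelab.issues.head() output earlier.
dark_scores = {
    "image_0.png": 1.000000,
    "image_1.png": 0.990676,
    "image_10.png": 0.795937,
    "image_100.png": 0.827587,
    "image_101.png": 0.992157,
}

for threshold in (0.5, 0.8, 0.9):
    flags = flag_issues(dark_scores, threshold)
    flagged = [name for name, is_issue in flags.items() if is_issue]
    print(threshold, flagged)
```

None of these five images is flagged at 0.5, one is flagged at 0.8, and two at 0.9, which shows how raising the threshold makes a check stricter.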

For a single issue type, the threshold is set as follows:

imagelab = Imagelab(data_path=dataset_path)
issue_types = {"dark": {"threshold": 0.5}}
imagelab.find_issues(issue_types)

imagelab.report()
Reading images from D:/cleanvision/image_files
Checking for dark images ...

Issue checks completed. 20 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           20 | 

----------------------- dark images ------------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

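If a certain issue is expected in the dataset, for example dark images in an astronomy dataset, you can set a maximum prevalence (max_prevalence). If the fraction of images with an issue exceeds max_prevalence, the issue is considered normal for the dataset and is not reported. With the default threshold, 10 of the 607 images are dark, a prevalence of about 0.016 (with the threshold of 0.5 used above, 20 of 607, about 0.033). Either value exceeds 0.015, so with max_prevalence set to 0.015 these images are no longer reported as a dark issue: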

imagelab.report(max_prevalence=0.015)
Removing dark from potential issues in the dataset as it exceeds max_prevalence=0.015 
Please specify some issue_types to check for in imagelab.find_issues().
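The arithmetic behind this can be sketched as follows. The counts are the dark, blurry and light counts from the report in the previous part of this section, and the dataset has 607 images. The helper split_by_prevalence is hypothetical, and the strict "greater than" comparison is my reading of the word "exceeds" in the message above, not something checked against CleanVision's source.

```python
def split_by_prevalence(issue_counts, total_images, max_prevalence):
    """Return (reported, skipped) dicts mapping issue name to its prevalence."""
    reported, skipped = {}, {}
    for issue, count in issue_counts.items():
        prevalence = count / total_images
        # An issue that is too common is treated as normal and skipped.
        target = skipped if prevalence > max_prevalence else reported
        target[issue] = round(prevalence, 4)
    return reported, skipped


counts = {"dark": 10, "blurry": 6, "light": 5}
reported, skipped = split_by_prevalence(counts, total_images=607, max_prevalence=0.015)

print("reported:", reported)
print("skipped:", skipped)
```

With these numbers only dark goes over the limit (10 / 607 is roughly 0.0165), while blurry and light stay below it and would still be reported.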

1.4 Running CleanVision on Torchvision datasets

CleanVision can check Torchvision datasets for issues. The code is as follows:

Preparing the dataset

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab

# Prepare the CIFAR10 dataset from torchvision
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Files already downloaded and verified
Files already downloaded and verified
# Check the number of samples in the training and test sets
len(train_set), len(test_set)
(50000, 10000)

To process the training and test sets together, merge them as follows:

dataset = ConcatDataset([train_set, test_set])
len(dataset)
60000
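After concatenation, positions 0 to 49999 belong to the training set and 50000 to 59999 to the test set, because the datasets are joined in the order given. If you later need to trace a position in the merged dataset back to its source, a small helper along these lines may be useful. It is a standard-library sketch that does not need PyTorch, and the name locate is made up for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, sizes):
    """Map an index in the concatenated dataset to (dataset_number, local_index)."""
    boundaries = list(accumulate(sizes))  # e.g. [50000, 60000]
    if not 0 <= index < boundaries[-1]:
        raise IndexError(index)
    dataset_number = bisect_right(boundaries, index)
    offset = boundaries[dataset_number - 1] if dataset_number else 0
    return dataset_number, index - offset


sizes = [50000, 10000]  # len(train_set), len(test_set)

print(locate(0, sizes))      # (0, 0): first training image
print(locate(49999, sizes))  # (0, 49999): last training image
print(locate(50000, sizes))  # (1, 0): first test image
print(locate(59999, sizes))  # (1, 9999): last test image
```

This assumes that results for the merged dataset are indexed by position, which is worth verifying on your own output before relying on it.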

View an image:

dataset[0][0]

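Running CleanVision

Simply pass the torchvision_dataset argument when creating the Imagelab instance to work with a Torchvision dataset. The remaining steps are the same as for images read from a folder: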

imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
# View the results
# imagelab.report()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
# Summary of the results
imagelab.issue_summary
issue_type num_images
0 blurry 118
1 near_duplicates 40
2 dark 11
3 light 3
4 low_information 1
5 grayscale 0
6 odd_aspect_ratio 0
7 odd_size 0
8 exact_duplicates 0

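1.5 Running CleanVision on Hugging Face datasets

CleanVision can also check Hugging Face datasets (provided they are accessible) for issues. The code is as follows:

# datasets is the library for downloading Hugging Face datasets
from datasets import load_dataset
from cleanvision import Imagelab
# Using https://huggingface.co/datasets/mah91/cat as an example
# To download a Hugging Face dataset, set path to the part of its URL that follows datasets/
# split selects which split to load, such as train or test; if split is not given, load_dataset returns all available splits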
dataset = load_dataset(path="mah91/cat", split="train")
Repo card metadata block was not found. Setting CardData to empty.
# Inspect the dataset: it contains 800 images and no annotations
dataset
Dataset({
    features: ['image'],
    num_rows: 800
})
# dataset.features describes each column of the dataset and its type, for example image or audio
dataset.features
{'image': Image(mode=None, decode=True, id=None)}

Pass the hf_dataset argument to load the Hugging Face dataset:

# Load the data into CleanVision; image_key names the column that holds the images
imagelab = Imagelab(hf_dataset=dataset, image_key="image")

Run the checks as follows:

imagelab.find_issues()
# Summary of the results
imagelab.issue_summary
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 4 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
issue_type num_images
0 blurry 3
1 odd_size 1
2 dark 0
3 grayscale 0
4 light 0
5 low_information 0
6 odd_aspect_ratio 0
7 exact_duplicates 0
8 near_duplicates 0

2 References

  • CleanVision
  • CleanVision-docs
  • custom_issue_manager
  • imagehash
