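PyTorch Custom Datasets

Published by GoCodingInMyWay on 2021-02-24

Original URL: https://www.cnblogs.com/gocodinginmyway/p/14439879.html

Preparing the Data

Prepare the COCO128 dataset, which consists of the first 128 images of COCO train2017. The directory is organized the way YOLOv5 expects: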

$ tree ~/datasets/coco128 -L 2
/home/john/datasets/coco128
├── images
│   └── train2017
│       ├── ...
│       └── 000000000650.jpg
├── labels
│   └── train2017
│       ├── ...
│       └── 000000000650.txt
├── LICENSE
└── README.txt

For details, see Train Custom Data.
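Before loading a tree like this, it can help to confirm that every image has a matching label file. The following is a minimal sketch, assuming images and labels share a file stem as in the listing above. The helper name `find_unlabeled_images` and the stand-in file names are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_unlabeled_images(root: Path, name: str) -> List[str]:
    """Return names of files in images/<name> that have no labels/<name>/<stem>.txt."""
    images_dir = root / "images" / name
    labels_dir = root / "labels" / name
    return sorted(
        image.name
        for image in images_dir.iterdir()
        if not (labels_dir / f"{image.stem}.txt").exists()
    )


# Build a tiny stand-in tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images" / "train2017").mkdir(parents=True)
    (root / "labels" / "train2017").mkdir(parents=True)
    for stem in ("0001", "0002"):
        (root / "images" / "train2017" / f"{stem}.jpg").touch()
    # Only the first image gets a label file.
    (root / "labels" / "train2017" / "0001.txt").write_text("45 0.5 0.5 0.2 0.2\n")

    unlabeled = find_unlabeled_images(root, "train2017")
    print(unlabeled)  # ['0002.jpg']
```

An image without a label file is not necessarily an error: it may simply contain no annotated objects.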

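Defining a Dataset

torch.utils.data.Dataset is the abstract class that represents a dataset. A custom dataset must inherit from Dataset and override the following methods:

__len__: lets len(dataset) return the size of the dataset.
__getitem__: lets dataset[i] access the i-th sample.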
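As a minimal sketch of these two methods, here is a toy dataset unrelated to YOLOv5. The class name `SquaresDataset` is hypothetical; sample i is simply the pair (i, i squared).

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2) as float tensors."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __len__(self) -> int:
        # Called by len(dataset).
        return self.n

    def __getitem__(self, idx: int):
        # Called by dataset[idx]; raising IndexError lets plain iteration stop.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2


dataset = SquaresDataset(4)
print(len(dataset))  # 4
print(dataset[3])    # (tensor(3.), tensor(9.))
```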

The following is an example of a custom YOLOv5 dataset implementation:

import os
from pathlib import Path
from typing import Any, Callable, Optional, Tuple

import numpy as np
import torch
import torchvision
from PIL import Image


class YOLOv5(torchvision.datasets.vision.VisionDataset):

  def __init__(
    self,
    root: str,
    name: str,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    transforms: Optional[Callable] = None,
  ) -> None:
    super(YOLOv5, self).__init__(root, transforms, transform, target_transform)
    images_dir = Path(root) / 'images' / name
    labels_dir = Path(root) / 'labels' / name
    # Sort so that indices are stable across runs; iterdir() order is arbitrary.
    self.images = sorted(images_dir.iterdir())
    # For each image, store its label file, or None when no label file exists.
    self.labels = []
    for image in self.images:
      base, _ = os.path.splitext(os.path.basename(image))
      label = labels_dir / f'{base}.txt'
      self.labels.append(label if label.exists() else None)

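  def __getitem__(self, idx: int) -> Tuple[Any, Any]:
    img = Image.open(self.images[idx]).convert('RGB')

    # Each label line is "class x_center y_center width height",
    # with the four box values normalized to [0, 1].
    label_file = self.labels[idx]
    if label_file is not None:  # found
      with open(label_file, 'r') as f:
        labels = [x.split() for x in f.read().strip().splitlines()]
      # reshape keeps the (N, 5) shape even when the file is empty
      labels = np.array(labels, dtype=np.float32).reshape(-1, 5)
    else:  # missing
      labels = np.zeros((0, 5), dtype=np.float32)

    # Convert normalized center/size boxes to pixel corner coordinates.
    boxes = []
    classes = []
    for label in labels:
      x, y, w, h = label[1:]
      boxes.append([
        (x - w/2) * img.width,
        (y - h/2) * img.height,
        (x + w/2) * img.width,
        (y + h/2) * img.height])
      classes.append(label[0])

    target = {}
    # reshape keeps boxes at shape (0, 4) when the image has no annotations
    target["boxes"] = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
    target["labels"] = torch.as_tensor(classes, dtype=torch.int64)

    # self.transforms also wraps transform/target_transform when those were given.
    if self.transforms is not None:
      img, target = self.transforms(img, target)

    return img, target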

  def __len__(self) -> int:
    return len(self.images)

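The implementation above inherits from VisionDataset, which is itself a subclass of Dataset. Its __getitem__ returns:

image: a PIL Image of width W and height H
target: a dict with the following fields:
- boxes (FloatTensor[N, 4]): ground-truth boxes as [x1, y1, x2, y2], with x in [0, W] and y in [0, H]
- labels (Int64Tensor[N]): the class label of each box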
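The boxes field comes from converting YOLO's normalized center/size format into pixel corner coordinates. A small worked sketch of that arithmetic follows. The function name `yolo_to_xyxy` is hypothetical, and the input values were chosen so the result is easy to check by hand.

```python
from typing import Tuple


def yolo_to_xyxy(
    x: float, y: float, w: float, h: float, img_w: int, img_h: int
) -> Tuple[float, float, float, float]:
    """Convert a normalized (x_center, y_center, width, height) box to pixel (x1, y1, x2, y2)."""
    return (
        (x - w / 2) * img_w,  # left edge
        (y - h / 2) * img_h,  # top edge
        (x + w / 2) * img_w,  # right edge
        (y + h / 2) * img_h,  # bottom edge
    )


# A box centered in a 640x480 image, covering half of each dimension.
box = yolo_to_xyxy(0.5, 0.5, 0.5, 0.5, img_w=640, img_h=480)
print(box)  # (160.0, 120.0, 480.0, 360.0)
```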

Reading the Dataset

dataset = YOLOv5(Path.home() / 'datasets/coco128', 'train2017')
print(f'dataset: {len(dataset)}')
print(f'dataset[0]: {dataset[0]}')

Output:

dataset: 128
dataset[0]: (<PIL.Image.Image image mode=RGB size=640x480 at 0x7F6F9464ADF0>, {'boxes': tensor([[249.7296, 200.5402, 460.5399, 249.1901],
        [448.1702, 363.7198, 471.1501, 406.2300],
        ...
        [  0.0000, 188.8901, 172.6400, 280.9003]]), 'labels': tensor([44, 51, 51, 51, 51, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45, 45,
        45, 50, 50, 50, 51, 51, 60, 42, 44, 45, 45, 45, 50, 51, 51, 51, 51, 51,
        51, 44, 50, 50, 50, 45])})

Preview: (image not reproduced here)
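One way such a preview could be produced is to outline the boxes with Pillow's ImageDraw. This is only a sketch and not necessarily how the original preview was made. The helper name `draw_boxes` is hypothetical, and a blank image with one made-up box stands in for a real sample.

```python
from typing import Sequence

from PIL import Image, ImageDraw


def draw_boxes(
    img: Image.Image, boxes: Sequence[Sequence[float]], color: str = "red"
) -> Image.Image:
    """Return a copy of img with each [x1, y1, x2, y2] pixel box outlined."""
    preview = img.copy()
    draw = ImageDraw.Draw(preview)
    for x1, y1, x2, y2 in boxes:
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
    return preview


# Stand-in data so the sketch runs without the dataset.
img = Image.new("RGB", (640, 480), "black")
boxes = [[250, 200, 460, 250]]
preview = draw_boxes(img, boxes)
print(preview.size)  # (640, 480)
```

With a sample from the dataset, the call would look like `draw_boxes(img, target["boxes"].tolist())`, provided img is still a PIL Image, that is, no ToTensor transform has been applied.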

Using DataLoader

Training needs data fetched in batches, which DataLoader provides:

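from torch.utils.data import DataLoader

dataset = YOLOv5(Path.home() / 'datasets/coco128', 'train2017',
  transform=torchvision.transforms.Compose([
    torchvision.transforms.ToTensor()
  ]))

# Images differ in size and targets in box count, so each batch is kept
# as tuples of samples instead of being stacked into a single tensor.
dataloader = DataLoader(dataset, batch_size=64, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))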

for batch_i, (images, targets) in enumerate(dataloader):
  print(f'batch {batch_i}, images {len(images)}, targets {len(targets)}')
  print(f'  images[0]: shape={images[0].shape}')
  print(f'  targets[0]: {targets[0]}')

Output:

batch 0, images 64, targets 64
  images[0]: shape=torch.Size([3, 480, 640])
  targets[0]: {'boxes': tensor([[249.7296, 200.5402, 460.5399, 249.1901],
        [448.1702, 363.7198, 471.1501, 406.2300],
        ...
        [  0.0000, 188.8901, 172.6400, 280.9003]]), 'labels': tensor([44, 51, 51, 51, 51, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45, 45,
        45, 50, 50, 50, 51, 51, 60, 42, 44, 45, 45, 45, 50, 51, 51, 51, 51, 51,
        51, 44, 50, 50, 50, 45])}
batch 1, images 64, targets 64
  images[0]: shape=torch.Size([3, 248, 640])
  targets[0]: {'boxes': tensor([[337.9299, 167.8500, 378.6999, 191.3100],
        [383.5398, 148.4501, 452.6598, 191.4701],
        [467.9299, 149.9001, 540.8099, 193.2401],
        [196.3898, 142.7200, 271.6896, 190.0999],
        [134.3901, 154.5799, 193.9299, 189.1699],
        [ 89.5299, 162.1901, 124.3798, 188.3301],
        [  1.6701, 154.9299,  56.8400, 188.3700]]), 'labels': tensor([20, 20, 20, 20, 20, 20, 20])}
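
The collate_fn used above transposes a list of (image, target) pairs into a tuple of images and a tuple of targets. The sketch below shows this in plain Python, with strings standing in for image tensors. The function name `collate_as_tuples` is hypothetical.

```python
from typing import Any, List, Tuple


def collate_as_tuples(batch: List[Tuple[Any, Any]]) -> Tuple[Tuple[Any, ...], ...]:
    """Transpose [(img, tgt), (img, tgt), ...] into ((img, img, ...), (tgt, tgt, ...))."""
    return tuple(zip(*batch))


# Strings stand in for image tensors; targets have different numbers of labels.
batch = [("image_a", {"labels": [20]}), ("image_b", {"labels": [44, 51]})]
images, targets = collate_as_tuples(batch)
print(images)   # ('image_a', 'image_b')
print(targets)  # ({'labels': [20]}, {'labels': [44, 51]})
```

A named function like this can be passed as collate_fn in place of the lambda. Unlike a lambda, it can be pickled, which may matter when DataLoader worker processes are started with the spawn method.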
