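Machine Learning for Programmers (9): Object Detection with RCNN and Fast-RCNN

Posted by q303248153 on 2020-11-27

Original article: https://www.cnblogs.com/zkweb/p/14048685.html

Because the restaurant business has picked up over the past few months, and because studying Faster-RCNN took a lot of time, I have not updated the blog for a while. Starting with this post I will cover object detection models and how to implement them. This post introduces the simplest ones, RCNN and Fast-RCNN; the next post will cover Faster-RCNN, and the one after that YOLO.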

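Image classification and object detection

In the previous posts we saw how to use a CNN model to recognize what kind of object an image contains, or to recognize a fixed number of characters in an image (i.e. a CAPTCHA). Because the model takes the whole image as input and produces a fixed-size output, the image can only contain one main object or a fixed number of characters.

If an image contains several objects and we want to know which objects are present and where each one is, a plain CNN classifier is not enough. We need a model that finds which regions of the image contain objects and decides what each region contains. Such a model is called an object detection model. One of the earliest deep-learning-based object detection models is RCNN, which was followed by Fast-RCNN (SPPnet), Faster-RCNN, YOLO and others. Because object detection has to process a lot of data, it tends to be slow (RCNN, for example, may need tens of seconds to detect the objects in a single image), yet it is often expected to run in real time (for example when the source is video from a camera). Improving speed is therefore a central concern, and the later Faster-RCNN and YOLO models can process from several up to dozens of images per second, depending on the backbone and hardware.

Object detection has a wide range of applications: face recognition, license plate recognition, autonomous driving and so on all rely on it. It is one of the frontiers of machine learning today; the Mask-RCNN model, developed in 2017, can even detect the outline of each object.

The more magical something looks, the harder it is to implement, and object detection models are considerably harder than the models introduced so far, so be prepared.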

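Training data required by object detection models

Before introducing specific models, let's look at what kind of training data an object detection model needs:

An object detection model needs every image to be annotated with the regions it contains and the label of each region, so the training data for each image is a list. Regions usually come in one of two formats: (x, y, w, h), the top-left corner plus width and height, and (x1, y1, x2, y2), the top-left and bottom-right corners. The two formats can be converted into each other; you only need to keep track of which one you are handling. Besides the classes to be recognized, the labels also need a special non-object (background) label, meaning that the region contains no recognizable object. Non-object regions can usually be generated automatically, so the training data does not need to include them.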
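The conversion between the two region formats can be sketched as follows. This is a minimal illustration, not code from the original listings; the function names `xywh_to_xyxy` and `xyxy_to_xywh` and the `sample` dictionary layout are assumptions made for this example.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

# Hypothetical layout of one training sample: an image plus a list of
# (region, label) pairs. Background regions are not stored because they
# can be generated automatically.
sample = {
    "image": "example.jpg",
    "regions": [
        ((30, 40, 20, 25), "face"),
        ((90, 35, 22, 24), "face"),
    ],
}

corner_boxes = [xywh_to_xyxy(box) for box, _ in sample["regions"]]
print(corner_boxes)
```

Whichever format is stored, it helps to convert once at load time and use a single format everywhere else.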

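RCNN

RCNN (Region Based Convolutional Neural Network) is one of the earliest object detection models and is relatively simple to implement. It works in the following steps:

1. Use some algorithm to select 2000 regions of the image that may contain objects
2. Crop these 2000 regions into 2000 sub-images, then scale them to a fixed size
3. Classify each of the 2000 sub-images with an ordinary CNN model
4. Discard the regions classified as "non-object"
5. Return the remaining regions as the result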
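The steps above can be sketched as a small skeleton. It is only an outline under stated assumptions: `rcnn_detect` and the three toy callables are hypothetical stand-ins (a real implementation would plug in selective search, image resizing and a trained CNN), and the "image" is a plain nested list.

```python
def rcnn_detect(image, propose_regions, crop_and_resize, classify, max_regions=2000):
    """Run the RCNN steps with pluggable parts (all callables are placeholders)."""
    results = []
    for box in propose_regions(image)[:max_regions]:  # step 1: candidate regions
        patch = crop_and_resize(image, box)            # step 2: crop (and resize)
        label = classify(patch)                        # step 3: classify each patch
        if label != "background":                      # step 4: drop non-objects
            results.append((box, label))
    return results                                     # step 5: remaining regions

# Toy stand-ins so the skeleton can run without any libraries
def toy_propose(image):
    return [(0, 0, 2, 2), (1, 1, 2, 2), (2, 2, 2, 2)]

def toy_crop(image, box):
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def toy_classify(patch):
    total = sum(sum(row) for row in patch)
    return "object" if total >= 3 else "background"

image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
detections = rcnn_detect(image, toy_propose, toy_crop, toy_classify)
print(detections)
```

The loop makes the cost visible: the classifier runs once per candidate region, which is why one image costs roughly as much as 2000 classifications.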

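As you may already have noticed from the steps, RCNN has several big problems:

The accuracy of the result depends heavily on the algorithm used to select regions
The region selection algorithm is fixed and does not take part in learning; if it fails to select a region that contains an object, no amount of training will let the model detect that region
It is slow, really slow: detecting objects in 1 image effectively means classifying 2000 images

The models introduced later solve these problems, but first we need to understand the simplest one, RCNN. Let's take a closer look at a few important parts of an RCNN implementation.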

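Selecting regions that may contain objects

There are many algorithms for selecting regions that may contain objects, for example sliding window and selective search. Sliding window is very simple: pick a fixed-size region, then move it by a fixed distance to get the next region. It is easy to implement, but it produces a huge number of regions with very low precision, so it is generally not used unless the objects have a fixed size and appear at predictable positions.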
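A minimal sliding window sketch shows why the number of regions grows so quickly. The function name `sliding_windows` and the image size, window size and stride used here are assumptions chosen only for illustration.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, w, h) windows that lie fully inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# A single 64x64 window moved in steps of 32 pixels over a 640x480 image
windows = list(sliding_windows(640, 480, 64, 64, 32))
print(len(windows), windows[0], windows[-1])
```

This counts only one window size; covering several sizes and aspect ratios with a smaller stride multiplies the count, which is the drawback described above.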

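Selective search is more sophisticated. The following short explanation is taken from an OpenCV article:

You can also refer to this article or the original paper for the details of the computation.

If you find it hard to follow, feel free to skip it, because we will simply use the selective search function provided by the OpenCV library. Selective search is not very precise either, and the models introduced later use better methods.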

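# Example of using the selective search function provided by the OpenCV library
# (cv2.ximgproc ships with the opencv-contrib-python package)
import cv2

img = cv2.imread("path/to/image")
s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
s.setBaseImage(img)
s.switchToSelectiveSearchFast()
boxes = s.process() # all regions that may contain objects, sorted by likelihood
candidate_boxes = boxes[:2000] # take the first 2000 regions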

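Deciding whether each region contains an object by overlap ratio (IOU)

The regions selected by the algorithm usually do not coincide exactly with the actual regions; they only overlap partially. During training we need to decide in advance, based on the ground-truth regions at hand, whether each selected region contains an object, and then tell the model whether its prediction was correct. This decision is based on the overlap ratio (IOU, Intersection Over Union): the area where two regions overlap divided by the area of their union, as shown in the figure below.

We can define that candidate regions with an overlap ratio above 70% contain an object, that regions with an overlap ratio below 30% do not, and that regions between 30% and 70% should not take part in training. This gives the model clear-cut data, which improves learning.

The code for computing the overlap ratio is shown below; if two regions do not overlap, the ratio is 0: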

def calc_iou(rect1, rect2):
    """Compute the ratio of the overlapping area to the union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou
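To make the 70% / 30% thresholds concrete, here is a small worked example. The helper `label_candidate` and the sample boxes are assumptions added for illustration; `calc_iou` is repeated only so the snippet runs on its own.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1 + w1, x2 + w2) - xi
    hi = min(y1 + h1, y2 + h2) - yi
    if wi > 0 and hi > 0:
        area_overlap = wi * hi
        return area_overlap / (w1 * h1 + w2 * h2 - area_overlap)
    return 0

def label_candidate(candidate_box, true_boxes, pos_threshold=0.7, neg_threshold=0.3):
    """Return 1 (object), 0 (background) or None (ambiguous, excluded from training)."""
    ious = [calc_iou(candidate_box, true_box) for true_box in true_boxes]
    if any(iou > pos_threshold for iou in ious):
        return 1
    if all(iou < neg_threshold for iou in ious):
        return 0
    return None

true_boxes = [(10, 10, 20, 20)]
for candidate in [(11, 11, 20, 20), (12, 12, 20, 20), (40, 40, 10, 10)]:
    iou = calc_iou(candidate, true_boxes[0])
    print(candidate, round(iou, 3), label_candidate(candidate, true_boxes))
```

Shifting a 20x20 box by a single pixel in each direction still counts as a positive sample, while a shift of two pixels already falls into the ambiguous band, which shows how strict a 70% threshold is for small regions.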

Original paper

If you want to read the original RCNN paper, it is available at:

https://arxiv.org/pdf/1311.2524.pdf

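Using RCNN to detect faces in images

By now you should have a rough idea of how RCNN works, so let's try using it to learn to detect something in images.

Collecting and annotating images is exhausting, so to save effort this post again uses a ready-made dataset. The dataset below contains images of faces together with annotations of the region of each face, in the format (x1, y1, x2, y2). Downloading it requires registering an account, but it is free.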

https://www.kaggle.com/vin1234/count-the-number-of-faces-present-in-an-image

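After downloading and extracting it, the images are under train/image_data and the annotations are in bbox_train.csv.

For example, the following image:

corresponds to the following annotations in the csv: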

Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
10001.jpg,612,408,321,176,366,222
10001.jpg,612,408,355,183,387,214

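The fields have the following meaning:

Name: file name
width: width of the whole image
height: height of the whole image
xmin: x coordinate of the top-left corner of the face region
ymin: y coordinate of the top-left corner of the face region
xmax: x coordinate of the bottom-right corner of the face region
ymax: y coordinate of the bottom-right corner of the face region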
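As a quick check of the format, the four rows shown above can be grouped per image and converted to (x, y, w, h) with the standard library alone. This is only a sketch: the helper name `load_box_map` is an assumption, and the full listing uses pandas for the same job.

```python
import csv
import io
from collections import defaultdict

# The four annotation rows quoted above
CSV_TEXT = """Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
10001.jpg,612,408,321,176,366,222
10001.jpg,612,408,355,183,387,214
"""

def load_box_map(csv_file):
    """Build { file name: [(x, y, w, h), ...] } from corner-format rows."""
    box_map = defaultdict(list)
    for row in csv.DictReader(csv_file):
        x1, y1 = int(row["xmin"]), int(row["ymin"])
        x2, y2 = int(row["xmax"]), int(row["ymax"])
        box_map[row["Name"]].append((x1, y1, x2 - x1, y2 - y1))
    return box_map

box_map = load_box_map(io.StringIO(CSV_TEXT))
print(box_map["10001.jpg"])
```

All four faces in this image are between roughly 30 and 46 pixels wide, which is small relative to the 612x408 image.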

The code that uses RCNN to learn and detect the face regions in these images is as follows:

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import pandas
import torchvision
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Size that each region is scaled to
REGION_IMAGE_SIZE = (32, 32)
# Directory containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file defining the face regions in each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Used to enable GPU support
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MyModel(nn.Module):
    """Classifies whether a region is a face (ResNet-18)"""
    def __init__(self):
        super().__init__()
        # ResNet implementation
        # Outputs two classes [non-face, face]
        self.resnet = torchvision.models.resnet18(num_classes=2)

    def forward(self, x):
        # Apply ResNet
        y = self.resnet(x)
        return y

def save_tensor(tensor, path):
    """Save a tensor object to a file"""
    with gzip.GzipFile(path, "wb") as f:
        torch.save(tensor, f)

def load_tensor(path):
    """Load a tensor object from a file"""
    with gzip.GzipFile(path, "rb") as f:
        return torch.load(f)

def image_to_tensor(img):
    """Convert an opencv image object to a tensor object"""
    # Note that opencv uses BGR, but this does not affect training so there is no need to convert to RGB
    img = cv2.resize(img, dsize=REGION_IMAGE_SIZE)
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # convert dimensions from H,W,C to C,W,H
    t = t / 255.0 # normalize values to the range 0 ~ 1
    return t

def calc_iou(rect1, rect2):
    """Compute the ratio of the overlapping area to the union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou

def selective_search(img):
    """Compute the regions of an opencv image that may contain objects, returning only the first 2000"""
    # Algorithm reference: https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_labels):
    """Prepare training - save the data of a single batch"""
    # Build the input and output tensor objects
    tensor_in = torch.stack(image_tensors) # shape: B,C,W,H
    tensor_out = torch.tensor(image_labels, dtype=torch.long) # shape: B

    # Split into training set (80%), validation set (10%) and test set (10%)
    random_indices = torch.randperm(tensor_in.shape[0])
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = (tensor_in[training_indices], tensor_out[training_indices])
    validating_set = (tensor_in[validating_indices], tensor_out[validating_indices])
    testing_set = (tensor_in[testing_indices], tensor_out[testing_indices])

    # Save to disk
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

def prepare():
    """Prepare training"""
    # The dataset is saved under the data directory after being converted to tensors
    if not os.path.isdir("data"):
        os.makedirs("data")

    # Load the csv file and build an index from image to region list { file name: [ region, region, .. ] }
    box_map = defaultdict(list)
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

    # Extract face (positive) and non-face (negative) sub-images from the images
    batch_size = 1000
    batch = 0
    image_tensors = []
    image_labels = []
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img = cv2.imread(path) # load the original image
        candidate_boxes = selective_search(img) # find candidate regions
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
            # If the candidate overlaps any true region by more than 70%, treat it as a positive sample
            # If the candidate overlaps every true region by less than 30%, treat it as a negative sample
            # Each image adds at most (number of positive samples + 10) negative samples;
            # enough negative samples are needed to avoid false positives
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                # Uncomment to check that the computation is correct
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_tensors.append(image_to_tensor(child_img))
                image_labels.append(int(result))
                if len(image_tensors) >= batch_size:
                    # Save the batch
                    prepare_save_batch(batch, image_tensors, image_labels)
                    image_tensors.clear()
                    image_labels.clear()
                    batch += 1
    # Save the remaining batch
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_labels)

def train():
    """Start training"""
    # Create the model instance
    model = MyModel().to(device)

    # Create the loss function
    loss_function = torch.nn.CrossEntropyLoss()

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())

    # Track the accuracy history of the training and validation sets
    training_accuracy_history = []
    validating_accuracy_history = []

    # Track the highest validation accuracy
    validating_accuracy_highest = -1
    validating_accuracy_highest_epoch = 0

    # Helper that reads batches
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ t.to(device) for t in load_tensor(path) ]

    # Helper that computes accuracy; positive and negative accuracy are computed separately, then averaged
    def calc_accuracy(actual, predicted):
        predicted = torch.max(predicted, 1).indices
        acc_positive = ((actual > 0.5) & (predicted > 0.5)).sum().item() / ((actual > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual <= 0.5) & (predicted <= 0.5)).sum().item() / ((actual <= 0.5).sum().item() + 0.00001)
        acc = (acc_positive + acc_negative) / 2
        return acc

    # Helper that splits input and output
    def split_batch_xy(batch, begin=None, end=None):
        # shape = batch_size, channels, width, height
        batch_x = batch[0][begin:end]
        # shape = batch_size (class index of each sample)
        batch_y = batch[1][begin:end]
        return batch_x, batch_y

    # Start the training loop
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # Train on the training set and update the parameters
        model.train()
        training_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
            # Split into mini-batches, which helps the model generalize
            training_batch_accuracy_list = []
            for index in range(0, batch[0].shape[0], 100):
                # Split input and output
                batch_x, batch_y = split_batch_xy(batch, index, index+100)
                # Compute the predictions
                predicted = model(batch_x)
                # Compute the loss
                loss = loss_function(predicted, batch_y)
                # Compute gradients from the loss by automatic differentiation
                loss.backward()
                # Update the parameters with the optimizer
                optimizer.step()
                # Clear the gradients
                optimizer.zero_grad()
                # Record the accuracy of this mini-batch; torch.no_grad temporarily disables autograd
                with torch.no_grad():
                    training_batch_accuracy_list.append(calc_accuracy(batch_y, predicted))
            # Print the batch accuracy
            training_batch_accuracy = sum(training_batch_accuracy_list) / len(training_batch_accuracy_list)
            training_accuracy_list.append(training_batch_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: batch accuracy: {training_batch_accuracy}")
        training_accuracy = sum(training_accuracy_list) / len(training_accuracy_list)
        training_accuracy_history.append(training_accuracy)
        print(f"training accuracy: {training_accuracy}")

        # Check the validation set (no gradients needed, which saves memory)
        model.eval()
        validating_accuracy_list = []
        with torch.no_grad():
            for batch in read_batches("data/validating_set"):
                batch_x, batch_y = split_batch_xy(batch)
                predicted = model(batch_x)
                validating_accuracy_list.append(calc_accuracy(batch_y, predicted))
        validating_accuracy = sum(validating_accuracy_list) / len(validating_accuracy_list)
        validating_accuracy_history.append(validating_accuracy)
        print(f"validating accuracy: {validating_accuracy}")

        # Record the highest validation accuracy and the model state at that time,
        # and check whether the record has not been beaten for 20 epochs
        if validating_accuracy > validating_accuracy_highest:
            validating_accuracy_highest = validating_accuracy
            validating_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest validating accuracy updated")
        elif epoch - validating_accuracy_highest_epoch > 20:
            # The record has not been beaten for 20 epochs, stop training
            print("stop training because highest validating accuracy not updated in 20 epoches")
            break

    # Use the model state that reached the highest accuracy
    print(f"highest validating accuracy: {validating_accuracy_highest}",
        f"from epoch {validating_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # Check the test set
    testing_accuracy_list = []
    with torch.no_grad():
        for batch in read_batches("data/testing_set"):
            batch_x, batch_y = split_batch_xy(batch)
            predicted = model(batch_x)
            testing_accuracy_list.append(calc_accuracy(batch_y, predicted))
    testing_accuracy = sum(testing_accuracy_list) / len(testing_accuracy_list)
    print(f"testing accuracy: {testing_accuracy}")

    # Plot the accuracy history of the training and validation sets
    pyplot.plot(training_accuracy_history, label="training")
    pyplot.plot(validating_accuracy_history, label="validating")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """Use the trained model"""
    # Create the model instance, load the trained state, then switch to evaluation mode
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # Ask for an image path and mark all regions that are likely to be faces
    while True:
        try:
            # Select the regions that may contain objects
            image_path = input("Image path: ")
            if not image_path:
                continue
            img = cv2.imread(image_path)
            candidate_boxes = selective_search(img)
            # Build the input
            image_tensors = []
            for candidate_box in candidate_boxes:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                image_tensors.append(image_to_tensor(child_img))
            tensor_in = torch.stack(image_tensors).to(device)
            # Predict the output (gradients are not needed for inference)
            with torch.no_grad():
                tensor_out = model(tensor_in)
            # Use softmax to compute the probability of being a face
            tensor_out = nn.functional.softmax(tensor_out, dim=1)
            tensor_out = tensor_out[:, 1]
            # Treat probabilities above 99% as faces, draw their borders on the image and save it
            img_output = img.copy()
            indices = torch.where(tensor_out > 0.99)[0]
            result_boxes = []
            result_boxes_all = []
            for index in indices:
                box = candidate_boxes[index]
                for exists_box in result_boxes_all:
                    # Skip the region if it overlaps an already found region by more than 30%
                    if calc_iou(exists_box, box) > 0.30:
                        break
                else:
                    result_boxes.append(box)
                result_boxes_all.append(box)
            for box in result_boxes:
                x, y, w, h = box
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """Entry point"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        sys.exit(1)

    # Seed the random number generators so that every run produces the same random numbers
    # This makes the process reproducible; you can also choose not to do it
    random.seed(0)
    torch.random.manual_seed(0)

    # Choose the operation based on the command line argument
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()

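Like the code examples in the previous posts, this code is split into three parts: prepare, train and eval. prepare selects regions and extracts the sub-images of positive samples (regions containing a face) and negative samples (regions not containing a face); train trains an ordinary ResNet model on the sub-images; eval selects regions in a given image and decides whether each of them contains a face.

Apart from selecting regions and extracting sub-images, it is basically the same as the CNN models introduced before.

After running the following commands: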

python3 example.py prepare
python3 example.py train

the final output is:

epoch: 101, batch: 106: batch accuracy: 0.9999996838862198
epoch: 101, batch: 107: batch accuracy: 0.999218446914751
epoch: 101, batch: 108: batch accuracy: 0.9999996211125055
training accuracy: 0.999441394076678
validating accuracy: 0.9687856357743619
stop training because highest validating accuracy not updated in 20 epoches
highest validating accuracy: 0.9766918253771755 from epoch 80
testing accuracy: 0.9729761086851993

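The accuracy of the training set and the validation set changes as follows:

The accuracy looks high, but it only measures how well the selected regions are classified. Because the selection algorithm performs only moderately well and the number of samples is fairly small, the end result is not really satisfying.

Run the following command and then enter an image path to detect faces with the trained model: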

python3 example.py eval

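Here are some of the detection results:

The precision is so-so.

Fast-RCNN

RCNN is slow mainly because classifying thousands of sub-images is computationally very expensive, and in particular because many of these sub-images overlap, which leads to a lot of repeated computation. Fast-RCNN focuses on improving this part: it first generates feature data for the whole image, with the same width and height as the image (or scaled down proportionally), then crops the feature data according to the regions that may contain objects, and finally classifies each region from its cropped features. The difference between RCNN and Fast-RCNN is shown in the figure below:

Unfortunately, as implemented here, Fast-RCNN mainly improves speed rather than accuracy. However, the example below introduces an important additional step, adjusting the region bounds, which lets the model output regions closer to the actual ones.

The following are some processing details of the Fast-RCNN model.

Scaling the source image

In RCNN the images passed to the CNN model are scaled sub-images, whereas in Fast-RCNN we pass the original image to the CNN model, so the original image needs to be scaled as well. The scaling uses padding, as shown in the figure below:

The code for scaling images is as follows (opencv version):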

import cv2

IMAGE_SIZE = (128, 88)

def calc_resize_parameters(sw, sh):
    """Compute the parameters for scaling an image"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # pad left and right
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # pad top and bottom
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """Scale an opencv image, padding it when the aspect ratio differs"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    # Pass the border color by keyword; the 7th positional parameter is dst, not value
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

After scaling the image, the region coordinates need to be converted as well. The conversion code is as follows (all rather dull code):

IMAGE_SIZE = (128, 88)

def map_box_to_resized_image(box, sw, sh):
    """Map a region of the original image to the corresponding region of the scaled image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """Map a region of the scaled image back to the region of the original image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h
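A round trip through both mappings shows how much precision the integer truncation costs. The sketch below repeats compact copies of the functions above so that it runs on its own (the bounds checks are omitted for brevity, which is a simplification), and uses the 612x408 image and first face box from the dataset sample.

```python
IMAGE_SIZE = (128, 88)  # (width, height) of the scaled image, as in the snippet above

def calc_resize_parameters(sw, sh):
    dw, dh = IMAGE_SIZE
    sw_new, sh_new, pad_w, pad_h = sw, sh, 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2
    return sw_new, sh_new, pad_w, pad_h

def map_box_to_resized_image(box, sw, sh):
    x, y, w, h = box
    sw_new, _, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    return int((x + pad_w) * scale), int((y + pad_h) * scale), int(w * scale), int(h * scale)

def map_box_to_original_image(box, sw, sh):
    x, y, w, h = box
    sw_new, _, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    return int(x / scale - pad_w), int(y / scale - pad_h), int(w / scale), int(h / scale)

params = calc_resize_parameters(612, 408)
resized = map_box_to_resized_image((192, 199, 38, 36), 612, 408)
restored = map_box_to_original_image(resized, 612, 408)
print(params)    # padded size and padding
print(resized)   # box in the 128x88 image
print(restored)  # box mapped back to the 612x408 image
```

With a target size this small, a 38x36 face shrinks to 7x7 pixels and comes back as 33x33, so a few pixels are lost in each direction. That is one reason the full listing uses a larger IMAGE_SIZE.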

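Computing region features

As we learned in the previous posts, a CNN model can be divided into convolutional layers, pooling layers and fully connected layers. The convolutional and pooling layers extract features from each area of the image, and the fully connected layers flatten the features and hand them to a linear model. In Fast-RCNN we do not need the features of the whole image, only those of some regions, so the CNN model used by Fast-RCNN only needs convolutional and pooling layers (some models omit the pooling layers). The convolutional layers usually output more channels than the image has, and the width and height shrink in proportion to the original image. For example, when the original image has size 3,256,256, the output might be 512,32,32, meaning that every 8x8 pixel area corresponds to 512 features.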
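When the feature map is smaller than the image, a region given in pixels has to be translated into feature-map cells. The sketch below shows one common convention (floor the start, ceil the end, so the cells fully cover the pixel box); the function name `box_to_feature_coords` and this rounding rule are assumptions for illustration, not something the code in this post needs.

```python
import math

def box_to_feature_coords(box, image_size, feature_size):
    """Map a pixel box (x, y, w, h) to feature-map cells that fully cover it."""
    x, y, w, h = box
    stride_x = image_size[0] / feature_size[0]  # pixels per feature cell, horizontally
    stride_y = image_size[1] / feature_size[1]  # pixels per feature cell, vertically
    fx1 = math.floor(x / stride_x)
    fy1 = math.floor(y / stride_y)
    fx2 = math.ceil((x + w) / stride_x)
    fy2 = math.ceil((y + h) / stride_y)
    return (fx1, fy1, fx2 - fx1, fy2 - fy1)

# 256x256 image, 32x32 feature map: each cell covers 8x8 pixels
feature_box = box_to_feature_coords((40, 64, 50, 30), (256, 256), (32, 32))
print(feature_box)
```

Rounding to whole cells shifts the region slightly, which is the imprecision that interpolation-based cropping is meant to reduce.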

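To keep it easy to understand, the Fast-RCNN code in this post makes the CNN model output exactly the same width and height as the original image, so extracting region features only requires the [] operator.

Extracting region features (ROI Pooling)

After Fast-RCNN has generated features for the whole image, the next step is extracting region features (Region of Interest Pooling). Put simply, this means cropping the feature data at the region's position in the image and then scaling the result to a fixed size, as shown in the figure below:

The layer that extracts region features is also called the ROI layer.

If the features have the same width and height as the image, cropping them only takes a simple [] operation. If the features are smaller than the image, cropping requires nearest neighbor interpolation or bilinear interpolation; an ROI layer that crops using bilinear interpolation is also called ROI Align. The scaling after cropping can use MaxPool, nearest neighbor interpolation, bilinear interpolation and so on.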
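The crop-then-max-pool variant can be illustrated without any libraries. This is a minimal sketch on a single-channel feature map stored as a nested list in row-major (row, column) order; the function names are assumptions, and the bin boundaries follow the usual adaptive pooling rule (start = floor(i * in / out), end = ceil((i + 1) * in / out)).

```python
import math

def adaptive_max_pool_2d(region, out_h, out_w):
    """Max-pool a 2D list to a fixed out_h x out_w size."""
    in_h, in_w = len(region), len(region[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            row.append(max(region[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

def roi_pool(feature_map, box, out_h, out_w):
    """Crop the (x, y, w, h) box with slicing, then pool it to a fixed size."""
    x, y, w, h = box
    region = [row[x:x + w] for row in feature_map[y:y + h]]
    return adaptive_max_pool_2d(region, out_h, out_w)

# 6x6 feature map whose value encodes its position: value = row * 6 + column
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_pool(feature_map, (1, 2, 4, 3), 2, 2)
print(pooled)
```

Regions of any size end up as the same fixed-size grid, which is what lets a single linear layer process all of them. Note that the tensors in this post's listings use C,W,H order, so the slicing order there is x first, then y.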

For a better understanding of ROI Align and bilinear interpolation you can refer to this article.

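Adjusting the region bounds

As mentioned earlier, regions selected by algorithms such as selective search may deviate somewhat from the region where the object actually is, and this deviation can be corrected by the model. As a simple example, if a region contains the left half of a face, then after training the model should be able to tell that the region needs to extend to the right.

The region adjustment consists of four parameters:

Adjustment of the top-left x coordinate
Adjustment of the top-left y coordinate
Adjustment of the width
Adjustment of the height

Coordinates and sizes vary: the same left half of a face gives different x and y coordinates depending on whether it appears in the top-left or the bottom-right of the image, and different widths and heights depending on the distance. We therefore need to normalize the adjustments. The normalization formulas are:

x1, y1, w1, h1 = candidate region
x2, y2, w2, h2 = true region
x offset = (x2 - x1) / w1
y offset = (y2 - y1) / h1
w offset = log(w2 / w1)
h offset = log(h2 / h1)

After normalization the offsets are ratios rather than absolute values, so they are not affected by the specific coordinates and sizes. In addition, the formulas use log to damp the growth of the offsets, so that the model can still learn well when the size difference is large.

The code that computes the region offsets and adjusts a region by its offsets is as follows: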

import math

def calc_box_offset(candidate_box, true_box):
    """Compute the offsets between a candidate region and the true region"""
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Adjust a candidate region by the given offsets"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)
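A short numeric check illustrates the two properties claimed above: the offsets do not depend on absolute position or scale, and applying them to the candidate region recovers the true region. The two functions are repeated so the snippet runs on its own; the sample boxes are made up for illustration.

```python
import math

def calc_box_offset(candidate_box, true_box):
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    return ((x2 - x1) / w1, (y2 - y1) / h1, math.log(w2 / w1), math.log(h2 / h1))

def adjust_box_by_offset(candidate_box, offset):
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    return (w1 * x_offset + x1, h1 * y_offset + y1,
            math.exp(w_offset) * w1, math.exp(h_offset) * h1)

# The same relative situation at two different positions and scales
offset_small = calc_box_offset((100, 100, 40, 40), (110, 90, 50, 60))
offset_large = calc_box_offset((200, 200, 80, 80), (220, 180, 100, 120))
print(offset_small)
print(offset_large)

# Applying the offset to the candidate recovers the true region (up to rounding)
restored = adjust_box_by_offset((100, 100, 40, 40), offset_small)
print(restored)
```

Both candidates give x and y offsets of 0.25 and -0.25 and size offsets of log(1.25) and log(1.5), so the model sees identical targets for the two cases.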

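Computing the loss

The Fast-RCNN model outputs two results for each region: the first is the region's label (face, non-face) and the second is the region offset described above, and the parameters need to be adjusted according to both. Adjusting for several results at once can be done by adding the losses together and then computing the gradient of each parameter (in the pseudocode below, "-" stands for comparing a prediction with its target using a loss function):

features of each region = ROI layer(CNN model(image data))

label linear model(features of each region) - true labels = label loss
offset linear model(features of each region) - true offsets = offset loss

loss = label loss + offset loss

One thing to note: in this example the label loss must be computed separately for positive and negative samples, otherwise the trained model will only output negative results. The linear model may output positive (face) or negative (non-face) for the extracted features, and many of the features extracted by the ROI layer overlap, i.e. they come from the same source. When there are more negative samples than positive ones, the overall direction leans towards negative, so every parameter update pushes towards outputting negative. If the losses are computed separately, the non-overlapping features can be pushed towards positive and negative respectively, and the model can learn.

In addition, the offset loss should only be computed from positive samples; negative samples have no need to learn offsets.

The final loss computation is as follows:

features of each region = ROI layer(CNN model(image data))

label linear model(features of each region)[positive samples] - true labels[positive samples] = positive label loss
label linear model(features of each region)[negative samples] - true labels[negative samples] = negative label loss
offset linear model(features of each region)[positive samples] - true offsets[positive samples] = positive offset loss

loss = positive label loss + negative label loss + positive offset loss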
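The effect of splitting the label loss can be seen in a small pure-Python sketch. Everything here is an assumption made for illustration: cross entropy is used for labels and mean squared error for offsets, and `combined_loss` is a hypothetical helper rather than the article's implementation.

```python
import math

def cross_entropy(logits, label):
    """Cross entropy of one sample from raw scores (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def mse(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def mean(values):
    return sum(values) / len(values) if values else 0.0

def combined_loss(pred_logits, pred_offsets, true_labels, true_offsets):
    """Positive label loss + negative label loss + positive offset loss."""
    pos = [i for i, label in enumerate(true_labels) if label == 1]
    neg = [i for i, label in enumerate(true_labels) if label == 0]
    loss_pos = mean([cross_entropy(pred_logits[i], 1) for i in pos])
    loss_neg = mean([cross_entropy(pred_logits[i], 0) for i in neg])
    loss_offset = mean([mse(pred_offsets[i], true_offsets[i]) for i in pos])
    return loss_pos + loss_neg + loss_offset

# 1 positive and 3 negative regions; a model that always answers "non-face"
true_labels = [1, 0, 0, 0]
true_offsets = [(0.25, -0.25, 0.2, 0.4)] + [(0, 0, 0, 0)] * 3
always_negative = [[5.0, -5.0]] * 4
perfect_offsets = list(true_offsets)

plain_mean = mean([cross_entropy(l, t) for l, t in zip(always_negative, true_labels)])
split = combined_loss(always_negative, perfect_offsets, true_labels, true_offsets)
print(plain_mean, split)
```

Averaged over all samples, the always-negative model is penalized only a quarter as much for the missed face, and the gap widens as the share of negatives grows. Averaging each group separately keeps the positive error at full weight.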

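Merging result regions

Because the region selection algorithm returns many overlapping regions to begin with, several regions may have an overlap ratio with the true region above the threshold (70%) at the same time, so all of them are considered to contain the object:

After training, the model may also return such overlapping regions when predicting on an image. There are several ways to merge them:

1. Use the leftmost, rightmost, topmost or bottommost region
2. Use the first region (the region selection algorithm sorts regions by how likely they are to contain an object)
3. Combine all overlapping regions (if the region adjustment does not work well, the resulting region may be much larger than the true region)

The RCNN code example above already merges result regions with the second method, and the example below does the same. The Faster-RCNN in the next post will use the third method, because its region adjustment works comparatively well.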
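The second method can be sketched as a standalone function that mirrors the loop in the RCNN eval code. The name `keep_first_regions` and the sample boxes are assumptions; `calc_iou` is repeated so the snippet runs on its own.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    wi = min(x1 + w1, x2 + w2) - max(x1, x2)
    hi = min(y1 + h1, y2 + h2) - max(y1, y2)
    if wi > 0 and hi > 0:
        return wi * hi / (w1 * h1 + w2 * h2 - wi * hi)
    return 0

def keep_first_regions(boxes, iou_threshold=0.3):
    """Keep a box only if it does not overlap any earlier box, kept or skipped."""
    kept = []
    seen = []
    for box in boxes:
        if all(calc_iou(previous, box) <= iou_threshold for previous in seen):
            kept.append(box)
        seen.append(box)
    return kept

boxes = [
    (10, 10, 20, 20),    # first region, kept
    (12, 12, 20, 20),    # overlaps the first region, skipped
    (100, 100, 20, 20),  # far away, kept
    (18, 18, 20, 20),    # overlaps the skipped second region, skipped as well
]
kept = keep_first_regions(boxes)
print(kept)
```

Comparing against skipped boxes too means a chain of slightly shifted regions collapses into its first member instead of producing a new box every few pixels.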

Original paper

If you want to read the original Fast-RCNN paper, it is available at:

https://arxiv.org/pdf/1504.08083.pdf

Using Fast-RCNN to detect faces in images

Code time. This code uses the Fast-RCNN model to detect faces in images, with the same dataset as the previous example.

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import math
import pandas
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Size that images are scaled to
IMAGE_SIZE = (256, 256)
# Directory containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file defining the face regions in each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Used to enable GPU support
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BasicBlock(nn.Module):
    """Basic block used by ResNet"""
    expansion = 1 # how many times channels_out the real number of output channels is; fixed to 1 in this implementation
    def __init__(self, channels_in, channels_out, stride):
        super().__init__()
        # Create a 3x3 convolutional layer
        # With stride = 1 the output size equals the input size, e.g. (32-3+2)//1+1 == 32
        # With stride = 2 the output size is half the input size, e.g. (32-3+2)//2+1 == 16
        # Also, the 3x3 convolutional layers of resnet do not use a bias
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels_in, channels_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # Define another 3x3 convolutional layer whose output has the same dimensions as its input
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels_out, channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # Adding the original input to the output requires matching dimensions; if they differ they must be aligned
        self.identity = nn.Sequential()
        if stride != 1 or channels_in != channels_out * self.expansion:
            self.identity = nn.Sequential(
                nn.Conv2d(channels_in, channels_out * self.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channels_out * self.expansion))

    def forward(self, x):
        # x => conv1 => relu => conv2 => + => relu
        # |                              ^
        # |==============================|
        tmp = self.conv1(x)
        tmp = nn.functional.relu(tmp, inplace=True)
        tmp = self.conv2(tmp)
        tmp += self.identity(x)
        y = nn.functional.relu(tmp, inplace=True)
        return y

class MyModel(nn.Module):
    """Fast-RCNN (a variant based on ResNet-18)"""
    def __init__(self):
        super().__init__()
        # Number of output channels of the previous layer
        self.previous_channels_out = 4
        # Convert 3 channels to 4 channels, keeping width and height
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, self.previous_channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(self.previous_channels_out))
        # ResNet that extracts the features of each area of the image (without AvgPool and the fully connected layer)
        # Unlike the original ResNet, the output width and height equal those of the input, so the ROI layer can extract features by region
        # Also, the number of channels is reduced so the model can run with 4GB of GPU memory
        self.layer1 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer3 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        self.layer4 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        # The ROI layer extracts the features of each sub-region and converts them to a fixed size
        self.roi_pool = nn.AdaptiveMaxPool2d((5, 5))
        # Outputs two classes [non-face, face]
        self.fc_labels_model = nn.Sequential(
            nn.Linear(8*5*5, 32),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(32, 2))
        # Computes the region offsets, outputting the offsets of x, y, w, h
        self.fc_offsets_model = nn.Sequential(
            nn.Linear(8*5*5, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 4))

    def _make_layer(self, block_type, channels_out, num_blocks, stride):
        blocks = []
        # Add the first block
        blocks.append(block_type(self.previous_channels_out, channels_out, stride))
        self.previous_channels_out = channels_out * block_type.expansion
        # Add the remaining blocks; they always use stride 1 and do not change width or height
        for _ in range(num_blocks-1):
            blocks.append(block_type(self.previous_channels_out, self.previous_channels_out, 1))
            self.previous_channels_out *= block_type.expansion
        return nn.Sequential(*blocks)

    def _roi_pooling(self, feature_mapping, roi_boxes):
        result = []
        for box in roi_boxes:
            image_index, x, y, w, h = map(int, box.tolist())
            feature_sub_region = feature_mapping[image_index][:,x:x+w,y:y+h]
            fixed_features = self.roi_pool(feature_sub_region).reshape(-1) # flatten along the way
            result.append(fixed_features)
        return torch.stack(result)

    def forward(self, x):
        images_tensor = x[0]
        candidate_boxes_tensor = x[1]
        # Convert the number of channels
        tmp = self.conv1(images_tensor)
        tmp = nn.functional.relu(tmp)
        # Apply each layer of ResNet
        # Result shape is B,8,W,H
        tmp = self.layer1(tmp)
        tmp = self.layer2(tmp)
        tmp = self.layer3(tmp)
        tmp = self.layer4(tmp)
        # Use the ROI layer to extract the features of each sub-region and convert them to a fixed size
        # Result shape is N,8*5*5 (N = total number of candidate regions)
        tmp = self._roi_pooling(tmp, candidate_boxes_tensor)
        # Compute the class (face or not) and the region offsets from the extracted sub-region features
        labels = self.fc_labels_model(tmp)
        offsets = self.fc_offsets_model(tmp)
        y = (labels, offsets)
        return y

def save_tensor(tensor, path):
    """Save a tensor object to a file"""
    with gzip.GzipFile(path, "wb") as f:
        torch.save(tensor, f)

def load_tensor(path):
    """Load a tensor object from a file"""
    with gzip.GzipFile(path, "rb") as f:
        return torch.load(f)

def calc_resize_parameters(sw, sh):
    """Compute the parameters for scaling an image"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # pad left and right
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # pad top and bottom
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """Scale an opencv image, padding it when the aspect ratio differs"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    # Pass the border color by keyword; the 7th positional parameter is dst, not value
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

def image_to_tensor(img):
    """Convert an opencv image object to a tensor object"""
    # Note that opencv uses BGR, but this does not affect training so there is no need to convert to RGB
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # convert dimensions from H,W,C to C,W,H
    t = t / 255.0 # normalize values to the range 0 ~ 1
    return t

def map_box_to_resized_image(box, sw, sh):
    """Map a region of the original image to the corresponding region of the scaled image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """Map a region of the scaled image back to the region of the original image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def calc_iou(rect1, rect2):
    """Compute the ratio of the overlapping area to the union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou

def calc_box_offset(candidate_box, true_box):
    """Compute the offsets between a candidate region and the true region"""
    # The offsets computed here are ratios, independent of the specific position and size
    # w and h use log to reduce the influence of very large values
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Adjust a candidate region by the given offsets"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)

def selective_search(img):
    """Compute the regions of an opencv image that may contain objects, returning only the first 2000"""
    # Algorithm reference: https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets):
    """準備訓練 - 儲存單個批次的資料"""
    # 按索引值列表生成輸入和輸出 tensor 物件的函式
    def split_dataset(indices):
        image_in = []
        candidate_boxes_in = []
        labels_out = []
        offsets_out = []
        for new_image_index, original_image_index in enumerate(indices):
            image_in.append(image_tensors[original_image_index])
            for box, label, offset in zip(image_candidate_boxes, image_labels, image_box_offsets):
                box_image_index, x, y, w, h = box
                if box_image_index == original_image_index:
                    candidate_boxes_in.append((new_image_index, x, y, w, h))
                    labels_out.append(label)
                    offsets_out.append(offset)
        # Check whether the calculation is correct
        # for box, label in zip(candidate_boxes_in, labels_out):
        #    image_index, x, y, w, h = box
        #    child_img = image_in[image_index][:, x:x+w, y:y+h].transpose(0, 2) * 255
        #    cv2.imwrite(f"{image_index}_{x}_{y}_{w}_{h}_{label}.png", child_img.numpy())
        tensor_image_in = torch.stack(image_in) # shape: B,C,W,H
        tensor_candidate_boxes_in = torch.tensor(candidate_boxes_in, dtype=torch.float) # shape: N,5 (index, x, y, w, h)
        tensor_labels_out = torch.tensor(labels_out, dtype=torch.long) # shape: N
        tensor_box_offsets_out = torch.tensor(offsets_out, dtype=torch.float) # shape: N,4 (x_offset, y_offset, ..)
        return (tensor_image_in, tensor_candidate_boxes_in), (tensor_labels_out, tensor_box_offsets_out)

    # Split into training set (80%), validation set (10%) and testing set (10%)
    random_indices = torch.randperm(len(image_tensors))
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = split_dataset(training_indices)
    validating_set = split_dataset(validating_indices)
    testing_set = split_dataset(testing_indices)

    # Save to disk
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

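def prepare():
    """Prepare training"""
    # After conversion to tensors, the dataset is saved under the data folder
    if not os.path.isdir("data"):
        os.makedirs("data")

    # Load the csv file and build an index from image to region list { filename: [ region, region, .. ] }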
    box_map = defaultdict(lambda: [])
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

    # Extract face (positive) and non-face (negative) samples from the images
    batch_size = 50
    max_samples = 10
    batch = 0
    image_tensors = [] # list of images
    image_candidate_boxes = [] # candidate regions of each image
    image_labels = [] # labels of the candidate regions of each image (1 face, 0 non-face)
    image_box_offsets = [] # offsets between the candidate regions and the true regions of each image
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img_original = cv2.imread(path) # load the original image
        sh, sw, _ = img_original.shape # size of the original image
        img = resize_image(img_original) # resize the image
        candidate_boxes = selective_search(img) # find candidate regions
        true_boxes = [ map_box_to_resized_image(b, sw, sh) for b in true_boxes ] # scale the true regions
        image_index = len(image_tensors) # index of the image within the batch
        image_tensors.append(image_to_tensor(img.copy()))
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
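            # If the candidate region overlaps any true region by more than 70% (IOU), treat it as a positive sample
            # If the candidate region overlaps every true region by less than 30% (IOU), treat it as a negative sample
            # Each image adds at most (number of positive samples + 10) negative samples; enough negatives are needed to avoid false positives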
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                # Check whether the calculation is correct
                # child_img = img[y:y+h, x:x+w].copy()
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_candidate_boxes.append((image_index, x, y, w, h))
                image_labels.append(int(result))
                if positive_index is not None:
                    image_box_offsets.append(calc_box_offset(
                        candidate_box, true_boxes[positive_index])) # positive samples record the offset
                else:
                    image_box_offsets.append((0, 0, 0, 0)) # negative samples have no offset
            if positive_samples >= max_samples:
                break
        # Save the batch
        if len(image_tensors) >= batch_size:
            prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)
            image_tensors.clear()
            image_candidate_boxes.clear()
            image_labels.clear()
            image_box_offsets.clear()
            batch += 1
    # Save the remaining batch
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)

def train():
    """Start training"""
    # Create the model instance
    model = MyModel().to(device)

    # Create the multi-task loss calculator
    celoss = torch.nn.CrossEntropyLoss()
    mseloss = torch.nn.MSELoss()
    def loss_function(predicted, actual):
        # The label loss must be computed separately for positive and negative samples,
        # otherwise the model ends up always predicting negative
        positive_indices = actual[0].nonzero(as_tuple=True)[0] # indices of positive samples
        negative_indices = (actual[0] == 0).nonzero(as_tuple=True)[0] # indices of negative samples
        loss1 = celoss(predicted[0][positive_indices], actual[0][positive_indices]) # label loss of positive samples
        loss2 = celoss(predicted[0][negative_indices], actual[0][negative_indices]) # label loss of negative samples
        loss3 = mseloss(predicted[1][positive_indices], actual[1][positive_indices]) # offset loss, computed for positive samples only
        return loss1 + loss2 + loss3

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())

    # Record how the training and validation accuracy change over time
    training_label_accuracy_history = []
    training_offset_accuracy_history = []
    validating_label_accuracy_history = []
    validating_offset_accuracy_history = []

    # Record the highest validation accuracy
    validating_label_accuracy_highest = -1
    validating_label_accuracy_highest_epoch = 0
    validating_offset_accuracy_highest = -1
    validating_offset_accuracy_highest_epoch = 0

    # Helper that reads the batches
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ [ tt.to(device) for tt in t ] for t in load_tensor(path) ]

    # Helper that computes the accuracy
    def calc_accuracy(actual, predicted):
        # Label accuracy: the accuracy of positive and negative samples is computed separately, then averaged
        predicted_i = torch.max(predicted[0], 1).indices
        acc_positive = ((actual[0] > 0.5) & (predicted_i > 0.5)).sum().item() / ((actual[0] > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual[0] <= 0.5) & (predicted_i <= 0.5)).sum().item() / ((actual[0] <= 0.5).sum().item() + 0.00001)
        acc_label = (acc_positive + acc_negative) / 2
        # print(acc_positive, acc_negative)
        # Offset accuracy
        valid_indices = actual[1].nonzero(as_tuple=True)[0]
        if valid_indices.shape[0] == 0:
            acc_offset = 1
        else:
            acc_offset = (1 - (predicted[1][valid_indices] - actual[1][valid_indices]).abs().mean()).item()
            acc_offset = max(acc_offset, 0)
        return acc_label, acc_offset

    # Start the training loop
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # Train on the training set and update the parameters
        model.train()
        training_label_accuracy_list = []
        training_offset_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
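            # Split into input and output
            batch_x, batch_y = batch
            # Compute the predicted values
            predicted = model(batch_x)
            # Compute the loss
            loss = loss_function(predicted, batch_y)
            # Compute the gradients from the loss by automatic differentiation
            loss.backward()
            # Update the parameters with the optimizer
            optimizer.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Record the accuracy of this batch; torch.no_grad temporarily disables automatic differentiation
            with torch.no_grad():
                training_batch_label_accuracy, training_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            # Print the batch accuracy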
            training_label_accuracy_list.append(training_batch_label_accuracy)
            training_offset_accuracy_list.append(training_batch_offset_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: " +
                f"batch label accuracy: {training_batch_label_accuracy}, offset accuracy: {training_batch_offset_accuracy}")
        training_label_accuracy = sum(training_label_accuracy_list) / len(training_label_accuracy_list)
        training_offset_accuracy = sum(training_offset_accuracy_list) / len(training_offset_accuracy_list)
        training_label_accuracy_history.append(training_label_accuracy)
        training_offset_accuracy_history.append(training_offset_accuracy)
        print(f"training label accuracy: {training_label_accuracy}, offset accuracy: {training_offset_accuracy}")

        # Check the validation set
        model.eval()
        validating_label_accuracy_list = []
        validating_offset_accuracy_list = []
        for batch in read_batches("data/validating_set"):
            batch_x, batch_y = batch
            predicted = model(batch_x)
            validating_batch_label_accuracy, validating_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            validating_label_accuracy_list.append(validating_batch_label_accuracy)
            validating_offset_accuracy_list.append(validating_batch_offset_accuracy)
        validating_label_accuracy = sum(validating_label_accuracy_list) / len(validating_label_accuracy_list)
        validating_offset_accuracy = sum(validating_offset_accuracy_list) / len(validating_offset_accuracy_list)
        validating_label_accuracy_history.append(validating_label_accuracy)
        validating_offset_accuracy_history.append(validating_offset_accuracy)
        print(f"validating label accuracy: {validating_label_accuracy}, offset accuracy: {validating_offset_accuracy}")

        # Record the highest validation accuracy and the model state at that point,
        # and check whether the record has gone unbroken for 20 epochs
        if validating_label_accuracy > validating_label_accuracy_highest:
            validating_label_accuracy_highest = validating_label_accuracy
            validating_label_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest label validating accuracy updated")
        elif validating_offset_accuracy > validating_offset_accuracy_highest:
            validating_offset_accuracy_highest = validating_offset_accuracy
            validating_offset_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest offset validating accuracy updated")
        elif (epoch - validating_label_accuracy_highest_epoch > 20 and
            epoch - validating_offset_accuracy_highest_epoch > 20):
            # The record has not been broken for 20 epochs, stop training
            print("stop training because highest validating accuracy not updated in 20 epochs")
            break

    # Use the model state from when the highest accuracy was reached
    print(f"highest label validating accuracy: {validating_label_accuracy_highest}",
        f"from epoch {validating_label_accuracy_highest_epoch}")
    print(f"highest offset validating accuracy: {validating_offset_accuracy_highest}",
        f"from epoch {validating_offset_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # Check the testing set
    testing_label_accuracy_list = []
    testing_offset_accuracy_list = []
    for batch in read_batches("data/testing_set"):
        batch_x, batch_y = batch
        predicted = model(batch_x)
        testing_batch_label_accuracy, testing_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
        testing_label_accuracy_list.append(testing_batch_label_accuracy)
        testing_offset_accuracy_list.append(testing_batch_offset_accuracy)
    testing_label_accuracy = sum(testing_label_accuracy_list) / len(testing_label_accuracy_list)
    testing_offset_accuracy = sum(testing_offset_accuracy_list) / len(testing_offset_accuracy_list)
    print(f"testing label accuracy: {testing_label_accuracy}, offset accuracy: {testing_offset_accuracy}")

    # Plot how the training and validation accuracy change over time
    pyplot.plot(training_label_accuracy_history, label="training_label_accuracy")
    pyplot.plot(training_offset_accuracy_history, label="training_offset_accuracy")
    pyplot.plot(validating_label_accuracy_history, label="validating_label_accuracy")
    pyplot.plot(validating_offset_accuracy_history, label="validating_offset_accuracy")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """Use the trained model"""
    # Create the model instance, load the trained state, then switch to evaluation mode
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # Ask for an image path and mark all regions that may be faces
    while True:
        try:
            # Select the regions that may contain objects
            image_path = input("Image path: ")
            if not image_path:
                continue
            img_original = cv2.imread(image_path) # load the original image
            sh, sw, _ = img_original.shape # size of the original image
            img = resize_image(img_original) # resize the image
            candidate_boxes = selective_search(img) # find candidate regions
            # Build the input
            image_tensor = image_to_tensor(img).unsqueeze(dim=0).to(device) # shape: 1,C,W,H
            candidate_boxes_tensor = torch.tensor(
                [ (0, x, y, w, h) for x, y, w, h in candidate_boxes ],
                dtype=torch.float).to(device) # shape: N,5
            tensor_in = (image_tensor, candidate_boxes_tensor)
            # Predict the output
            labels, offsets = model(tensor_in)
            labels = nn.functional.softmax(labels, dim=1)
            labels = labels[:,1] # probability of the face class, shape: N
            # Treat regions with a probability above 99% as faces, adjust them by the offsets, draw the boxes on the image and save it
            img_output = img_original.copy()
            for box, label, offset in zip(candidate_boxes, labels, offsets):
                if label.item() <= 0.99:
                    continue
                box = adjust_box_by_offset(box, offset.tolist())
                x, y, w, h = map_box_to_original_image(box, sw, sh)
                if w == 0 or h == 0:
                    continue
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """Main entry point"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        exit()

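    # Seed the random number generators so that every run produces the same random numbers
    # This makes the process reproducible; you can also choose not to do it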
    random.seed(0)
    torch.random.manual_seed(0)

    # Select the operation according to the command line argument
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()
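
The two helpers calc_box_offset and adjust_box_by_offset in the script above are meant to be inverses of each other: one encodes how far a candidate region is from the true region, the other applies such an offset back onto the candidate region. Below is a minimal standalone sketch that copies their formulas and checks the round trip. The box coordinates are made-up values used only for illustration.

```python
import math

def calc_box_offset(candidate_box, true_box):
    """Offset from a candidate box to the true box, both given as (x, y, w, h)."""
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    return ((x2 - x1) / w1, (y2 - y1) / h1, math.log(w2 / w1), math.log(h2 / h1))

def adjust_box_by_offset(candidate_box, offset):
    """Apply an offset to a candidate box and return the adjusted (x, y, w, h)."""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    return (w1 * x_offset + x1, h1 * y_offset + y1,
            math.exp(w_offset) * w1, math.exp(h_offset) * h1)

candidate = (40, 60, 100, 80)  # made-up candidate region
truth = (50, 50, 120, 100)     # made-up labelled region
offset = calc_box_offset(candidate, truth)
restored = adjust_box_by_offset(candidate, offset)
print(offset)    # position offsets are fractions of the candidate size, size offsets are log ratios
print(restored)  # equals the true box up to floating point error
```

Because the position offsets are relative to the candidate's width and height and the size offsets are log ratios, the same offset values describe the same relative adjustment for both large and small candidate regions.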

After running the following commands:

python3 example.py prepare
python3 example.py train

The output after 31 epochs of training is shown below (training takes a really long time, so I cut it short here):

epoch: 31, batch: 112: batch label accuracy: 0.9805490565092065, offset accuracy: 0.9293316006660461
epoch: 31, batch: 113: batch label accuracy: 0.9776784565994586, offset accuracy: 0.9191392660140991
epoch: 31, batch: 114: batch label accuracy: 0.9469732184008024, offset accuracy: 0.9101274609565735
training label accuracy: 0.9707166603858259, offset accuracy: 0.9191886570142663
validating label accuracy: 0.9306134214845806, offset accuracy: 0.9205827381299889
highest offset validating accuracy updated
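
A note on reading the log above: the label accuracy reported by calc_accuracy is not the plain fraction of correct predictions, but the average of the hit rate on positive samples and the hit rate on negative samples. The pure-Python sketch below shows why that matters when negative samples dominate. The function name balanced_label_accuracy and the sample data are hypothetical and not part of the script.

```python
def balanced_label_accuracy(actual, predicted):
    """Average of the positive-class and negative-class hit rates."""
    eps = 0.00001  # same small constant as in calc_accuracy, avoids division by zero
    pos_total = sum(1 for a in actual if a == 1)
    neg_total = sum(1 for a in actual if a == 0)
    pos_hit = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    neg_hit = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return (pos_hit / (pos_total + eps) + neg_hit / (neg_total + eps)) / 2

actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # made-up labels: 2 faces, 8 non-faces
always_negative = [0] * len(actual)       # a model that never predicts "face"

plain = sum(1 for a, p in zip(actual, always_negative) if a == p) / len(actual)
balanced = balanced_label_accuracy(actual, always_negative)
print(plain)     # plain accuracy looks good although no face was found
print(balanced)  # the balanced value is close to 0.5
```

With this measure, a model that always answers "non-face" scores roughly 0.5 instead of 0.8, so the numbers in the log are harder to inflate by ignoring the positive class.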

Run the following command and then enter an image path to detect faces with the trained model:

python3 example.py eval
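
In eval mode the script converts the two label outputs of each candidate region into probabilities with softmax and keeps only the regions whose face probability is above 0.99 (the `label.item() <= 0.99` check in eval_model). Here is a rough pure-Python sketch of that filtering step. The helper name face_probability and the logit values are made up for illustration; the script itself does this with torch tensors.

```python
import math

def face_probability(logits):
    """Softmax over (non-face, face) logits, returning the face probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)

THRESHOLD = 0.99  # same cut-off as used in eval_model

# made-up (non-face, face) logits for four candidate regions
logits_per_box = [(-3.0, 3.0), (0.0, 2.0), (2.0, -1.0), (-4.0, 4.0)]
kept = [index for index, logits in enumerate(logits_per_box)
        if face_probability(logits) > THRESHOLD]
print(kept)  # prints [0, 3]
```

A threshold this strict drops many uncertain regions, which trades some missed faces for fewer false positives.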

Here are some of the detection results:

Before adjusting the regions

After adjusting the regions

Before adjusting the regions

After adjusting the regions

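The accuracy is about the same as RCNN, and in some cases slightly lower (the images are scaled down so that the model fits in 4GB of GPU memory). The detection speed, however, improves a lot: in the same environment, Fast-RCNN needs only 0.4~0.5 seconds to process a single image, while RCNN needs around 2 seconds.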

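Closing thoughts

The RCNN and Fast-RCNN models introduced in this article only serve as an entry point to object detection and have little practical value (they are slow and their detection accuracy is low). The Faster-RCNN model covered in the next article can be used in production, but it is also a level more complex.

In addition, the code in this article and the next one is not exactly the same as the implementations in the papers or other implementations found online. This is because my machine has little GPU memory, and because I want to implement the same principles with as little code as possible so that the code is easier to understand (many implementations online are split across a pile of files, and some even implement part of the logic as c/c++ extensions, which helps performance but gives beginners a headache).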

