Training a PaddleOCR Text Direction Classification Model

Posted by Eslzzyl on 2024-08-27

Original URL: https://www.cnblogs.com/eslzzyl/p/18378393


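I have recently been working on a project that needs to determine the orientation of scanned answer sheets. One approach is to train a text direction classification model to make that call. This post records the training process.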

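Environment setup

Clone the PaddleOCR repository somewhere with plenty of free disk space: https://github.com/PaddlePaddle/PaddleOCR

The PaddleOCR repository is fairly large; cloning it downloads about 700 MB of data.

Create a new virtual environment and install the PaddlePaddle framework by following the installation guide linked from the original post.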

Note: do not install NumPy 2.0 or later, because PaddleOCR may not be compatible with it. The latest NumPy 1.x release is 1.26.4.
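To catch an incompatible NumPy early, a small guard like the one below could be run before training. This is a minimal sketch and not part of PaddleOCR: the helper name `numpy_is_compatible` is hypothetical, and it assumes the only constraint is that the major version stays below 2.

```python
def numpy_is_compatible(version: str) -> bool:
    """Return True if the version string belongs to a release older than 2.0."""
    # Assumption: only the major version matters for compatibility here.
    major = int(version.split(".")[0])
    return major < 2


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("NumPy is not installed")
    else:
        ok = numpy_is_compatible(numpy.__version__)
        status = "OK" if ok else "too new for this setup"
        print(f"NumPy {numpy.__version__}: {status}")
```

If the check reports a 2.x release, pinning the package with `pip install "numpy<2"` is one way to go back to the 1.x line.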

Then change into the root directory of the PaddleOCR repository and install the required dependencies:

pip install -r requirements.txt
pip install albumentations

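Preparing the training data

My dataset contains about 20,000 scanned answer sheet images, split into a training set and a test (validation) set at a 9:1 ratio. I assume all images are upright to begin with, and rotate each image by 180 degrees with probability 0.2 to produce upside-down samples. The orientation labels are generated at the same time as the random rotation.

The original dataset consists of several directories, each holding the answer sheet images for one subject. I created a train directory and a test directory alongside them to hold the actual training data.

The following bash script copies the images from the original dataset into the train and test directories: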

#!/bin/bash

# Source directories (one per subject) and target directories.
# Bash array elements are separated by whitespace, not commas.
source_dirs=("科目1" "科目2" "科目3" "科目4" "科目5")
target_train="train"
target_test="test"

mkdir -p "$target_train" "$target_test"

# Iterate over all source directories
for dir in "${source_dirs[@]}"; do
    # Collect all PNG files in the directory, shuffled once
    mapfile -t files < <(find "$dir" -maxdepth 1 -type f -name "*.png" | shuf)

    # 90% of the files go to train; the rest go to test
    total_files=${#files[@]}
    train_count=$((total_files * 90 / 100))

    # Copy the first train_count shuffled files to the train directory
    for file in "${files[@]:0:train_count}"; do
        cp "$file" "$target_train"
    done

    # Copy the remaining files to the test directory,
    # so the two sets never overlap
    for file in "${files[@]:train_count}"; do
        cp "$file" "$target_test"
    done
done
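For comparison, the same 9:1 split can be sketched in Python. This is only an illustration under stated assumptions: the function name `split_train_test` and the sample file names are hypothetical. The list is shuffled once and cut at the 90% mark, so no file can end up in both sets.

```python
import random


def split_train_test(files, train_percent=90, seed=None):
    """Shuffle the files once, then cut the list into train and test parts."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic, matching the bash expression total * 90 / 100
    train_count = len(shuffled) * train_percent // 100
    return shuffled[:train_count], shuffled[train_count:]


if __name__ == "__main__":
    names = [f"img_{i:03d}.png" for i in range(20)]  # hypothetical file names
    train, test = split_train_test(names, seed=0)
    print(len(train), len(test))
```

Passing a fixed `seed` makes the split reproducible, which helps when the dataset has to be regenerated later.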

Then use the following Python code to randomly rotate the images and generate the labels:

import os
import random
from PIL import Image
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def process_image(filename, directory):
    """Rotate the image by 180 degrees with probability 0.2 and return its label line."""
    filepath = os.path.join(directory, filename)

    # Randomly decide whether to rotate the image
    if random.random() > 0.8:
        # Rotate the image and overwrite the original file
        with Image.open(filepath) as img:
            img_rotated = img.rotate(180)
        img_rotated.save(filepath)
        # Label for a 180-degree rotation
        return f"{filepath}\t180\n"
    # Not rotated: label is 0 degrees
    return f"{filepath}\t0\n"

def rotate_and_label_images(directory, label_file):
    # Collect all PNG files
    files = [filename for filename in os.listdir(directory) if filename.endswith('.png')]

    # Process the files with a thread pool. The label lines are collected
    # and written once, so threads never write to the label file concurrently.
    with ThreadPoolExecutor() as executor:
        lines = list(tqdm(executor.map(lambda x: process_image(x, directory), files), total=len(files)))

    with open(label_file, 'w') as f:
        f.writelines(lines)

# Directories and label file paths
train_dir = 'train'
test_dir = 'test'
train_label_file = 'train.txt'
test_label_file = 'test.txt'

# Process the train directory
rotate_and_label_images(train_dir, train_label_file)

# Process the test directory
rotate_and_label_images(test_dir, test_label_file)

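Most of the code above was generated with Tongyi Qianwen.

After running the two scripts in order, about 20% of the images in the train and test directories will be upside down, and train.txt and test.txt will be created alongside those two directories, storing the file names and labels. Both files roughly follow this format: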

test/8D64250B-5B95-11EF-8A7E-3024A9806847.png	0
test/1F83BFEB-5B83-11EF-8A7E-3024A9806847.png	0
test/CC126431-5B92-11EF-8A7E-3024A9806847.png	0
test/87249B09-5B8B-11EF-8A7E-3024A9806847.png	180
test/G669F21C-5B8B-11EF-8A7E-3024A9806847.png	0
test/7AA0DB14-5B8C-11EF-8A7E-3024A9806847.png	180
test/EC082795-5B84-11EF-8A7E-3024A9806847.png	0
test/80B03DC5-3296-11EF-9E58-5CBAEF6F52AE.png	0
test/FEA16C22-24BC-11EF-BDD7-E86A6470B412.png	180
test/B1722A7E-5B8B-11EF-8A7E-3024A9806847.png	180
test/065A748A-5B99-11EF-8A7E-3024A9806847.png	0

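Note that the file name and the label must be separated by a tab character (\t), not a space. Otherwise the training script will not be able to parse them.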
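A quick check of the label files before training can catch separator mistakes. The sketch below is a hypothetical helper rather than anything shipped with PaddleOCR, and it assumes the only valid labels are 0 and 180, as in this post.

```python
def parse_label_line(line: str):
    """Split one label line into (path, label); raise ValueError if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one tab separator: {line!r}")
    path, label = parts
    # Assumption: the dataset only uses the labels 0 and 180
    if label not in {"0", "180"}:
        raise ValueError(f"unexpected label {label!r} in line: {line!r}")
    return path, label


def check_label_file(label_file: str) -> int:
    """Parse every line of a label file and return the number of valid lines."""
    count = 0
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            parse_label_line(line)
            count += 1
    return count
```

Calling `check_label_file("train.txt")` raises a `ValueError` that names the first malformed line, for example one that uses a space instead of a tab.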

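Switch to the root of the PaddleOCR repository, create a new train_data directory, and inside it create a symbolic link named cls that points to the directory containing the dataset above.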

ln -s /path/to/the/dataset cls

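Starting training

Throughout training I could never get PaddlePaddle to use the GPU for computation. I reinstalled several times using different methods, and a simple convolutional network trained with PaddlePaddle used the GPU without problems. Inside the PaddleOCR environment, however, it did not work: some GPU memory was allocated, but the GPU did no computation at all while the CPU was fully loaded. I found no similar reports in the GitHub issues. Since the text direction classification model is lightweight and converges quickly on my training data, I simply trained on the CPU.

Back in the PaddleOCR repository root, open configs/cls/cls_mv3.yml and edit it as needed. I made the following changes: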

@@ -1,6 +1,6 @@
 Global:
-  use_gpu: true
-  epoch_num: 100
+  use_gpu: false
+  epoch_num: 10
   log_smooth_window: 20
   print_batch_step: 10
   save_model_dir: ./output/cls/mv3/
@@ -61,7 +61,7 @@ Train:
           channel_first: False
       - ClsLabelEncode: # Class handling label
       - BaseDataAugmentation:
-      - RandAugment:
+      # - RandAugment:
       - ClsResizeImg:
           image_shape: [3, 48, 192]
       - KeepKeys:

The changes above are based on commit 1752c56.

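As noted earlier, I could not get the GPU to work during training no matter what I tried, so I set use_gpu to false (note that the value should start with a lowercase letter) and reduced the number of epochs to match the observed convergence speed. I also disabled the RandAugment data augmentation.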
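As an alternative to editing the YAML file, the two Global settings could in principle be passed on the command line. The sketch below only builds the argument list. It assumes that tools/train.py accepts the same `-o key=value` overrides that the export command later in this post uses; the helper name `build_train_command` is hypothetical.

```python
def build_train_command(config_path, overrides):
    """Build an argument list for tools/train.py with optional -o overrides."""
    cmd = ["python", "tools/train.py", "-c", config_path]
    if overrides:
        cmd.append("-o")
        for key, value in overrides.items():
            # Write booleans in lowercase (true/false), as in the YAML config
            if isinstance(value, bool):
                value = str(value).lower()
            cmd.append(f"{key}={value}")
    return cmd


if __name__ == "__main__":
    command = build_train_command(
        "configs/cls/cls_mv3.yml",
        {"Global.use_gpu": False, "Global.epoch_num": 10},
    )
    print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` from the repository root. Commenting out RandAugment still requires editing the config file, because it is an entry in a list of transforms and not a simple key.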

Run

python tools/train.py -c configs/cls/cls_mv3.yml

to start training.

After about 4 epochs the model converged, with an accuracy of about 99.4%. Once all 10 epochs have finished, the model is saved to ``

Model conversion

To run inference, the model from the training stage must first be converted into an inference model. Run

python3 tools/export_model.py -c configs/cls/cls_mv3.yml -o Global.pretrained_model=./output/cls/mv3/latest Global.save_inference_dir=./inference/cls/

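to perform the conversion. Afterwards, the model can be referred to by the name of the directory that contains the inference model.

Inference

The paddleocr command-line tool does not appear to support a pure text direction classification task on its own. The inference script for the text direction classification model does ship in the PaddleOCR repository, but as mentioned earlier, that repository is large. To avoid cloning it on the inference side, we can lightly modify the inference script so that it depends only on the prebuilt PaddleOCR PyPI package.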

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import cv2
import copy
import numpy as np
import math
import time
import traceback
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

import paddleocr.tools.infer.utility as utility
from paddleocr.ppocr.postprocess import build_post_process
from paddleocr.ppocr.utils.logging import get_logger
from paddleocr.ppocr.utils.utility import get_image_file_list, check_and_read

__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(__dir__)
sys.path.insert(0, os.path.abspath(os.path.join(__dir__, "../..")))

os.environ["FLAGS_allocator_strategy"] = "auto_growth"

logger = get_logger()


class TextClassifier(object):
    def __init__(self, args):
        self.cls_image_shape = [int(v) for v in args.cls_image_shape.split(",")]
        self.cls_batch_num = args.cls_batch_num
        self.cls_thresh = args.cls_thresh
        postprocess_params = {
            "name": "ClsPostProcess",
            "label_list": args.label_list,
        }
        self.postprocess_op = build_post_process(postprocess_params)
        (
            self.predictor,
            self.input_tensor,
            self.output_tensors,
            _,
        ) = utility.create_predictor(args, "cls", logger)
        self.use_onnx = args.use_onnx

    def resize_norm_img(self, img):
        imgC, imgH, imgW = self.cls_image_shape
        h = img.shape[0]
        w = img.shape[1]
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
        resized_image = resized_image.astype("float32")
        if self.cls_image_shape[0] == 1:
            resized_image = resized_image / 255
            resized_image = resized_image[np.newaxis, :]
        else:
            resized_image = resized_image.transpose((2, 0, 1)) / 255
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

    def __call__(self, img_list):
        img_list = copy.deepcopy(img_list)
        img_num = len(img_list)
        # Calculate the aspect ratio of all text bars
        width_list = []
        for img in img_list:
            width_list.append(img.shape[1] / float(img.shape[0]))
        # Sorting can speed up the cls process
        indices = np.argsort(np.array(width_list))

        cls_res = [["", 0.0]] * img_num
        batch_num = self.cls_batch_num
        elapse = 0
        for beg_img_no in range(0, img_num, batch_num):
            end_img_no = min(img_num, beg_img_no + batch_num)
            norm_img_batch = []
            max_wh_ratio = 0
            starttime = time.time()
            for ino in range(beg_img_no, end_img_no):
                h, w = img_list[indices[ino]].shape[0:2]
                wh_ratio = w * 1.0 / h
                max_wh_ratio = max(max_wh_ratio, wh_ratio)
            for ino in range(beg_img_no, end_img_no):
                norm_img = self.resize_norm_img(img_list[indices[ino]])
                norm_img = norm_img[np.newaxis, :]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            norm_img_batch = norm_img_batch.copy()

            if self.use_onnx:
                input_dict = {self.input_tensor.name: norm_img_batch}
                outputs = self.predictor.run(self.output_tensors, input_dict)
                prob_out = outputs[0]
            else:
                self.input_tensor.copy_from_cpu(norm_img_batch)
                self.predictor.run()
                prob_out = self.output_tensors[0].copy_to_cpu()
                self.predictor.try_shrink_memory()
            cls_result = self.postprocess_op(prob_out)
            elapse += time.time() - starttime
            for rno in range(len(cls_result)):
                label, score = cls_result[rno]
                cls_res[indices[beg_img_no + rno]] = [label, score]
                if "180" in label and score > self.cls_thresh:
                    img_list[indices[beg_img_no + rno]] = cv2.rotate(
                        img_list[indices[beg_img_no + rno]], 1
                    )
        return img_list, cls_res, elapse


def cls(image_dir: str, cls_model_dir: str, use_gpu=False):
    args = utility.parse_args()
    args.image_dir = image_dir
    args.cls_model_dir = cls_model_dir
    args.use_gpu = use_gpu

    image_file_list = get_image_file_list(args.image_dir)
    text_classifier = TextClassifier(args)
    upside_img_list = []

    # Process images in batches of 10
    batch_size = 10
    for i in range(0, len(image_file_list), batch_size):
        batch_files = image_file_list[i:i + batch_size]
        # Keep file names and images aligned: a file that fails to load
        # is skipped in both lists
        valid_files = []
        batch_imgs = []
        for image_file in batch_files:
            img, flag, _ = check_and_read(image_file)
            if not flag:
                img = cv2.imread(image_file)
            if img is None:
                logger.info("error in loading image:{}".format(image_file))
                continue
            valid_files.append(image_file)
            batch_imgs.append(img)
        try:
            batch_imgs, cls_res, predict_time = text_classifier(batch_imgs)
        except Exception as E:
            logger.info(traceback.format_exc())
            logger.info(E)
            sys.exit(1)
        for ino in range(len(batch_imgs)):
            # Record and log every image classified as rotated by 180 degrees
            if "180" in cls_res[ino][0] and cls_res[ino][1] > args.cls_thresh:
                upside_img_list.append(valid_files[ino])
                logger.info(
                    "The image is upside down: {}, score: {}".format(
                        valid_files[ino], cls_res[ino][1]
                    )
                )
    return upside_img_list


def process_image(path):
    img = cv2.imread(path)
    img = cv2.rotate(img, cv2.ROTATE_180)
    cv2.imwrite(path, img)


def check_and_rotate(path):
    cls_model_dir = "./cls"
    upside_img_list = cls(path, cls_model_dir, False)
    print(f"len of upside_img_list: {len(upside_img_list)}")
    with ThreadPoolExecutor() as executor:
        list(tqdm(executor.map(process_image, upside_img_list), total=len(upside_img_list)))
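The preprocessing in resize_norm_img can be summarised with two small pure-Python helpers. This is an illustrative sketch, not code from PaddleOCR: the helper names are hypothetical, and the defaults of 48 and 192 are taken from the image_shape entry [3, 48, 192] in the config shown earlier.

```python
import math


def resized_width(h: int, w: int, img_h: int = 48, img_w: int = 192) -> int:
    """Width after scaling the image to height img_h, capped at img_w."""
    ratio = w / float(h)
    return min(img_w, int(math.ceil(img_h * ratio)))


def normalize_pixel(value: float) -> float:
    """Map a pixel value from [0, 255] to [-1, 1]: (value / 255 - 0.5) / 0.5."""
    return (value / 255 - 0.5) / 0.5


if __name__ == "__main__":
    # A hypothetical portrait scan, 1000 pixels high and 700 pixels wide
    print(resized_width(1000, 700))
    print(normalize_pixel(0), normalize_pixel(255))
```

Every image is scaled to a fixed height while keeping its aspect ratio, and anything narrower than the target width is zero-padded on the right, so a full-page scan reaches the classifier as a very small input.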

Results

False-positive test: out of 2263 upright images, 1 was misclassified.
Miss test: out of 2263 upside-down images, 2263-2056=207 were missed, a miss rate of 9.1%.
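The two rates follow directly from the counts above. The short calculation below reproduces them; the variable and function names are chosen here for illustration only.

```python
def rate_percent(errors: int, total: int) -> float:
    """Return errors as a percentage of total."""
    return errors / total * 100


upright_total = 2263      # upright images tested
false_alarms = 1          # upright images flagged as upside down
flipped_total = 2263      # upside-down images tested
flipped_detected = 2056   # upside-down images correctly detected
misses = flipped_total - flipped_detected

print(f"false-positive rate: {rate_percent(false_alarms, upright_total):.2f}%")
print(f"miss rate: {rate_percent(misses, flipped_total):.1f}%")
```

The gap between the two rates suggests the model is far more reliable on upright pages than on upside-down ones.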
