文字識別——檢測部分 CTPN論文翻譯

言希0127發表於2018-09-13

原文翻譯地址:https://blog.csdn.net/quincuntial/article/details/79475339

翻譯論文彙總:https://github.com/SnailTyan/deep-learning-papers-translation

Detecting Text in Natural Image with Connectionist Text Proposal Network

Abstract

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin. The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.

摘要

我們提出了一種新穎的連線文字提議網路(CTPN),它能夠準確定位自然影象中的文字行。CTPN直接在卷積特徵對映中的一系列細粒度文字提議中檢測文字行。我們開發了一個垂直錨點機制,聯合預測每個固定寬度提議的位置和文字/非文字分數,大大提高了定位精度。序列提議通過迴圈神經網路自然地連線起來,該網路無縫地結合到卷積網路中,從而形成端到端的可訓練模型。這使得CTPN可以探索豐富的影象上下文資訊,使其能夠檢測極其模糊的文字。CTPN在多尺度和多語言文字上可靠地工作,不需要進一步的後處理,這不同於以前需要多步後過濾的自底向上方法。它在ICDAR 2013和2015基準資料集上分別達到了0.88和0.61的F-measure,大大超過了最近的結果[8,35]。通過使用非常深的VGG16模型[27],CTPN的計算效率很高,每張影象僅需0.14s。線上演示地址:http://textdet.com/

Keywords

Scene text detection, convolutional network, recurrent neural network, anchor mechanism

關鍵詞

場景文字檢測;卷積網路;迴圈神經網路;錨點機制

1. Introduction

Reading text in natural image has recently attracted increasing attention in computer vision [8,14,15,10,35,11,9,1,28,32]. This is due to its numerous practical applications such as image OCR, multi-language translation, image retrieval, etc. It includes two sub tasks: text detection and recognition. This work focus on the detection task [14,1,28,32], which is more challenging than recognition task carried out on a well-cropped word image [15,9]. Large variance of text patterns and highly cluttered background pose main challenge of accurate text localization.

1. 引言

在自然影象中閱讀文字最近在計算機視覺領域引起了越來越多的關注[8,14,15,10,35,11,9,1,28,32]。這是由於它有許多實際應用,如影象OCR、多語言翻譯、影象檢索等。它包括兩個子任務:文字檢測和識別。這項工作的重點是檢測任務[14,1,28,32],它比在裁剪良好的單詞影象上進行的識別任務[15,9]更具挑戰性。文字模式的巨大變化和高度雜亂的背景構成了精確文字定位的主要挑戰。

Current approaches for text detection mostly employ a bottom-up pipeline [28,1,14,32,33]. They commonly start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification. These multi-step bottom-up approaches are generally complicated with less robustness and reliability. Their performance heavily rely on the results of character detection, and connected-components methods or sliding-window methods have been proposed. These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background. However, they are not robust by identifying individual strokes or characters separately, without context information. For example, it is more confident for people to identify a sequence of characters than an individual one, especially when a character is extremely ambiguous. These limitations often result in a large number of non-text components in character detection, causing main difficulties for handling them in following steps. Furthermore, these false detections are easily accumulated sequentially in bottom-up pipeline, as pointed out in [28]. To address these problems, we exploit strong deep features for detecting text information directly in convolutional maps. We develop text anchor mechanism that accurately predicts text locations in fine scale. Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.

目前的文字檢測方法大多采用自下而上的流程[28,1,14,32,33]。它們通常從低階的字元或筆畫檢測開始,後面通常跟隨若干後續步驟:非文字元件過濾、文字行構建和文字行驗證。這些自底向上的多步驟方法通常較為複雜,魯棒性和可靠性較差。它們的效能嚴重依賴於字元檢測的結果,為此已經提出了連線元件方法或滑動視窗方法。這些方法通常利用低階特徵(例如基於SWT[3,13]、MSER[14,33,23]或HoG[28])來區分候選文字和背景。但是,在沒有上下文資訊的情況下,它們無法魯棒地單獨識別各個筆畫或字元。例如,人們識別一個字元序列比識別單個字元更有把握,特別是當某個字元非常模糊時。這些限制通常導致字元檢測產生大量非文字元件,給後續步驟的處理帶來主要困難。此外,正如[28]所指出的,這些誤檢很容易在自下而上的流程中逐步累積。為了解決這些問題,我們利用強大的深度特徵直接在卷積對映中檢測文字資訊。我們開發了文字錨點機制,能在細粒度上精確預測文字位置。然後,我們提出了一種網內迴圈架構,用於按順序連線這些細粒度的文字提議,使它們能夠編碼豐富的上下文資訊。

Deep Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. The state-of-the-art method is Faster Region-CNN (R-CNN) system [25] where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps. Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection. However, it is difficult to apply these general object detection systems directly to scene text detection, which generally requires a higher localization accuracy. In generic object detection, each object has a well-defined closed boundary [2], while such a well-defined boundary may not exist in text, since a text line or word is composed of a number of separate characters or strokes. For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from major part of it. By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers a full region of a text line or word. Therefore, text detection generally requires a more accurate localization, leading to a different evaluation standard, e.g., the Wolf’s standard [30] which is commonly employed by text benchmarks [19,21].

深度卷積神經網路(CNN)最近大幅推進了通用目標檢測[25,5,6]。最先進的方法是Faster Region-CNN(R-CNN)系統[25],其中提出了區域提議網路(RPN),直接從卷積特徵對映中生成高質量、類別不可知的目標提議。然後將RPN提議輸入Fast R-CNN[5]模型進行進一步的分類和細化,從而在通用目標檢測上實現了最先進的效能。然而,很難將這些通用目標檢測系統直接應用於場景文字檢測,因為場景文字檢測通常需要更高的定位精度。在通用目標檢測中,每個目標都有一個明確定義的封閉邊界[2],而文字中可能不存在這樣的明確邊界,因為文字行或單詞是由許多分離的字元或筆畫組成的。對於目標檢測,典型的正確檢測是鬆散定義的,例如檢測到的邊界框與其真實邊界框之間的重疊>0.5(如PASCAL標準[4]),因為人們可以很容易地從目標的主要部分識別出它。相比之下,完整地閱讀文字是一項細粒度的識別任務,要求正確的檢測覆蓋文字行或單詞的整個區域。因此,文字檢測通常需要更準確的定位,從而導致不同的評估標準,例如文字基準[19,21]中常用的Wolf標準[30]。

In this work, we fill this gap by extending the RPN architecture [25] to accurate text line localization. We present several technical developments that tailor generic object detection model elegantly towards our problem. We strive for a further step by proposing an in-network recurrent mechanism that allows our model to detect text sequence directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.

在這項工作中,我們通過將RPN架構[25]擴充套件到準確的文字行定位來填補這一空白。我們提出了幾項技術改進,將通用目標檢測模型優雅地調整到適用於我們的問題。我們更進一步,提出了一種網內迴圈機制,使我們的模型能夠直接在卷積對映中檢測文字序列,避免了通過額外且昂貴的CNN檢測模型進行進一步的後處理。

1.1 Contributions

We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers. This overcomes a number of main limitations raised by previous bottom-up approaches building on character detection. We leverage the advantages of strong deep convolutional features and sharing computation mechanism, and propose the CTPN architecture which is described in Fig. 1. It makes the following major contributions:

Figure 1

Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. Color of each box indicates the text/non-text score. Only the boxes with positive scores are presented.

1.1 貢獻

我們提出了一種新穎的連線文字提議網路(CTPN),它可以直接在卷積層中定位文字序列。這克服了以前建立在字元檢測基礎上的自下而上方法所帶來的一些主要限制。我們利用強大的深度卷積特徵和共享計算機制的優勢,提出了如圖1所示的CTPN架構。它的主要貢獻如下:

Figure 1

圖1:(a)連線文字提議網路(CTPN)的架構。我們在VGG16模型[27]的最後一個卷積對映(conv5)上密集地滑動3×3空間視窗。每行的序列視窗通過雙向LSTM(BLSTM)[7]迴圈連線,其中每個視窗的卷積特徵(3×3×C)被用作256維BLSTM(包括兩個128維的LSTM)的輸入。RNN層連線到512維的全連線層,再接輸出層,聯合預測k個錨點的文字/非文字分數、y軸座標和邊緣細化偏移量。(b)CTPN輸出連續的固定寬度細粒度文字提議。每個框的顏色表示其文字/非文字分數。圖中只顯示得分為正的框。

First, we cast the problem of text detection into localizing a sequence of fine-scale text proposals. We develop an anchor regression mechanism that jointly predicts vertical location and text/non-text score of each text proposal, resulting in an excellent localization accuracy. This departs from the RPN prediction of a whole object, which is difficult to provide a satisfied localization accuracy.

首先,我們將文字檢測問題轉化為定位一系列細粒度文字提議的問題。我們開發了一個錨點回歸機制,聯合預測每個文字提議的垂直位置和文字/非文字分數,從而獲得出色的定位精度。這不同於RPN對整個目標的預測,後者難以提供令人滿意的定位精度。

Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps. This connection allows our detector to explore meaningful context information of text line, making it powerful to detect extremely challenging text reliably.

其次,我們提出了一種在卷積特徵對映中優雅連線序列文字提議的網路內迴圈機制。通過這種連線,我們的檢測器可以探索文字行有意義的上下文資訊,使其能夠可靠地檢測極具挑戰性的文字。

Third, both methods are integrated seamlessly to meet the nature of text sequence, resulting in a unified end-to-end trainable model. Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post filtering or refinement.

第三,兩種方法無縫整合,以符合文字序列的性質,從而形成統一的端到端可訓練模型。我們的方法能夠在單個過程中處理多尺度和多語言的文字,避免進一步的後過濾或細化。

Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015). Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].

第四,我們的方法在許多基準資料集上取得了新的最先進結果,顯著改善了最近的結果(例如,在ICDAR 2013上以0.88的F-measure超過了[8]中的0.83,在ICDAR 2015上以0.61的F-measure超過了[35]中的0.54)。此外,通過使用非常深的VGG16模型[27],該方法在計算上是高效的,在ICDAR 2013上的執行時間為每張影象0.14秒。

2. Related Work

Text detection. Past works in scene text detection have been dominated by bottom-up approaches which are generally built on stroke or character detection. They can be roughly grouped into two categories, connected-components (CCs) based approaches and sliding-window based methods. The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3]. The sliding-window based methods detect character candidates by densely moving a multi-scale window through an image. The character or non-character window is discriminated by a pre-trained classifier, by using manually-designed features [28,29], or recent CNN features [16]. However, both groups of methods commonly suffer from poor performance of character detection, causing accumulated errors in following component filtering and text line construction steps. Furthermore, robustly filtering out non-character components or confidently verifying detected text lines are even difficult themselves [1,33,14]. Another limitation is that the sliding-window methods are computationally expensive, by running a classifier on a huge number of the sliding windows.

2. 相關工作

文字檢測。過去的場景文字檢測工作一直以自下而上的方法為主,一般建立在筆畫或字元檢測之上。它們可以大致分為兩類:基於連線元件(CC)的方法和基於滑動視窗的方法。基於CC的方法使用快速濾波器來區分文字和非文字畫素,然後利用低階屬性(例如強度、顏色、梯度等[33,14,32,13,3])將文字畫素貪婪地分組為筆畫或候選字元。基於滑動視窗的方法通過在影象中密集地滑動多尺度視窗來檢測候選字元。字元或非字元視窗由預先訓練的分類器進行區分,其使用手動設計的特徵[28,29]或最近的CNN特徵[16]。然而,這兩類方法通常都受到字元檢測效能較差的影響,導致在接下來的元件過濾和文字行構建步驟中錯誤不斷累積。此外,魯棒地過濾掉非字元元件或可靠地驗證檢測到的文字行本身就很困難[1,33,14]。另一個限制是,滑動視窗方法需要在大量滑動視窗上執行分類器,計算開銷很大。

Object detection. Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. A common strategy is to generate a number of object proposals by employing inexpensive low-level features, and then a strong CNN classifier is applied to further classify and refine the generated proposals. Selective Search (SS) [4] which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5]. Recently, Ren et al. [25] proposed a Faster R-CNN system for object detection. They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps. The RPN is fast by sharing convolutional computation. However, the RPN proposals are not discriminative, and require a further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5]. More importantly, text is different significantly from general objects, making it difficult to directly apply general object detection system to this highly domain-specific task.

目標檢測。卷積神經網路(CNN)近來大幅推進了通用目標檢測[25,5,6]。一個常見的策略是先使用廉價的低階特徵生成大量目標提議,然後使用強大的CNN分類器對生成的提議進一步分類和細化。生成類別不可知目標提議的選擇性搜尋(Selective Search,SS)[4]是近期領先的目標檢測系統(如Region CNN(R-CNN)[6]及其擴充套件[5])中應用最廣泛的方法之一。最近,Ren等人[25]提出了用於目標檢測的Faster R-CNN系統。他們提出了一個區域提議網路(RPN),可以直接從卷積特徵對映中生成高質量的類別不可知的目標提議。通過共享卷積計算,RPN是快速的。然而,RPN提議不具有判別性,需要通過額外的、成本高昂的CNN模型(如Fast R-CNN模型[5])進一步細化和分類。更重要的是,文字與一般目標有很大不同,因此很難將通用目標檢測系統直接應用到這個高度領域化的任務中。

3. Connectionist Text Proposal Network

This section presents details of the Connectionist Text Proposal Network (CTPN). It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.

3. 連線文字提議網路

本節介紹連線文字提議網路(CTPN)的細節。它包括三個關鍵的貢獻,使文字定位可靠和準確:檢測細粒度提議中的文字,迴圈連線文字提議和邊緣細化。

3.1 Detecting Text in Fine-scale Proposals

Similar to Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size. It detects a text line by densely sliding a small window in the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel width) text proposals, as shown in Fig. 1 (b).

3.1 在細粒度提議中檢測文字

類似於區域提議網路(RPN)[25],CTPN本質上是一個全卷積網路,允許任意大小的輸入影象。它通過在卷積特徵對映中密集地滑動小視窗來檢測文字行,並且輸出一系列細粒度的(例如,寬度為固定的16個畫素)文字提議,如圖1(b)所示。

We take the very deep 16-layer vggNet (VGG16) [27] as an example to describe our approach, which is readily applicable to other deep models. Architecture of the CTPN is presented in Fig. 1 (a). We use a small spatial window, 3×3, to slide the feature maps of last convolutional layer (e.g., the conv5 of the VGG16). The size of conv5 feature maps is determined by the size of input image, while the total stride and receptive field are fixed as 16 and 228 pixels, respectively. Both the total stride and receptive field are fixed by the network architecture. Using a sliding window in the convolutional layer allows it to share convolutional computation, which is the key to reduce computation of the costly sliding-window based methods.

我們以非常深的16層vggNet(VGG16)[27]為例來描述我們的方法,該方法很容易應用於其他深度模型。CTPN的架構如圖1(a)所示。我們使用一個小的空間視窗3×3來滑動最後的卷積層特徵對映(例如,VGG16的conv5)。conv5特徵對映的大小由輸入影象的大小決定,而總步長和感受野分別固定為16個和228個畫素。網路架構決定總步長和感受野。在卷積層中使用滑動視窗允許它共享卷積計算,這是減少昂貴的基於滑動視窗的方法的計算量的關鍵。
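
To make the fixed stride and receptive field above concrete, here is a short Python sketch (our own illustration; the layer list and helper names are not from the paper) that accumulates stride and receptive field over the VGG16 layers up to conv5_3. One way to arrive at the quoted 228 pixels is to also count the 3×3 detection window slid on conv5, which behaves like one more 3×3 convolution at stride 16.

```python
# Accumulate total stride and receptive field for VGG16 up to conv5_3,
# then account for the extra 3x3 sliding window used by the CTPN detector.
# Illustrative sketch; the layer list and names are assumptions.

VGG16_LAYERS = (
    [("conv", 3, 1)] * 2 + [("pool", 2, 2)] +   # conv1_x
    [("conv", 3, 1)] * 2 + [("pool", 2, 2)] +   # conv2_x
    [("conv", 3, 1)] * 3 + [("pool", 2, 2)] +   # conv3_x
    [("conv", 3, 1)] * 3 + [("pool", 2, 2)] +   # conv4_x
    [("conv", 3, 1)] * 3                        # conv5_x (no pool5 here)
)

def stride_and_rf(layers):
    stride, rf = 1, 1
    for _, k, s in layers:
        rf += (k - 1) * stride   # receptive field grows by (k-1) * current stride
        stride *= s              # stride multiplies at each down-sampling layer
    return stride, rf

stride, rf = stride_and_rf(VGG16_LAYERS)
# The 3x3 window slid on conv5 acts like one more 3x3 conv at stride 16.
rf_with_window = rf + 2 * stride
print(stride, rf, rf_with_window)   # -> 16 196 228
```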

Generally, sliding-window methods adopt multi-scale windows to detect objects of different sizes, where one window scale is fixed to objects of similar size. In [25], Ren et al. proposed an efficient anchor regression mechanism that allows the RPN to detect multi-scale objects with a single-scale window. The key insight is that a single window is able to predict objects in a wide range of scales and aspect ratios, by using a number of flexible anchors. We wish to extend this efficient anchor mechanism to our text task. However, text differs from generic objects substantially, which generally have a well-defined enclosed boundary and center, allowing inferring whole object from even a part of it [2]. Text is a sequence which does not have an obvious closed boundary. It may include multi-level components, such as stroke, character, word, text line and text region, which are not distinguished clearly between each other. Text detection is defined in word or text line level, so that it may be easy to make an incorrect detection by defining it as a single object, e.g., detecting part of a word. Therefore, directly predicting the location of a text line or word may be difficult or unreliable, making it hard to get a satisfied accuracy. An example is shown in Fig. 2, where the RPN is directly trained for localizing text lines in an image.

Figure 2

Fig. 2: Left: RPN proposals. Right: Fine-scale text proposals.

通常,滑動視窗方法採用多尺度視窗來檢測不同尺寸的目標,其中一種尺度的視窗對應一類尺寸相近的目標。在[25]中,Ren等人提出了一種高效的錨點回歸機制,允許RPN使用單尺度視窗檢測多尺度目標。其關鍵見解是,藉助多個靈活的錨點,單個視窗能夠預測各種尺度和長寬比的目標。我們希望將這種高效的錨點機制擴充套件到我們的文字任務。然而,文字與通用目標有本質區別:通用目標通常具有明確的封閉邊界和中心,甚至可以從其一部分推斷出整個目標[2],而文字是一個序列,沒有明顯的封閉邊界。它可能包含多層次的元件,如筆畫、字元、單詞、文字行和文字區域等,這些元件之間沒有明確的區分。文字檢測是在單詞或文字行級別上定義的,因此如果將其定義為單個目標,很容易產生錯誤的檢測,例如只檢測到單詞的一部分。因此,直接預測文字行或單詞的位置可能很困難或不可靠,很難獲得令人滿意的精度。圖2給出了一個例子,其中RPN被直接訓練用於定位影象中的文字行。

Figure 2

圖2:左:RPN提議。右:細粒度的文字提議。

We look for a unique property of text that is able to generalize well to text components in all levels. We observed that word detection by the RPN is difficult to accurately predict the horizontal sides of words, since each character within a word is isolated or separated, making it confused to find the start and end locations of a word. Obviously, a text line is a sequence which is the main difference between text and generic objects. It is natural to consider a text line as a sequence of fine-scale text proposals, where each proposal generally represents a small part of a text line, e.g., a text piece with 16-pixel width. Each proposal may include a single or multiple strokes, a part of a character, a single or multiple characters, etc. We believe that it would be more accurate to just predict the vertical location of each proposal, by fixing its horizontal location which may be more difficult to predict. This reduces the search space, compared to the RPN which predicts 4 coordinates of an object. We develop a vertical anchor mechanism that simultaneously predicts a text/non-text score and y-axis location of each fine-scale proposal. It is also more reliable to detect a general fixed-width text proposal than identifying an isolate character, which is easily confused with part of a character or multiple characters. Furthermore, detecting a text line in a sequence of fixed-width text proposals also works reliably on text of multiple scales and multiple aspect ratios.

我們尋找文字的一種獨特屬性,它能夠很好地推廣到各個層次的文字元件。我們觀察到,RPN進行的單詞檢測很難準確預測單詞的水平兩側,因為單詞中的每個字元都是孤立或分離的,這使得確定單詞的起始和結束位置變得困難。顯然,文字行是一個序列,這是文字與通用目標之間的主要區別。將文字行視為一系列細粒度的文字提議是很自然的,其中每個提議通常代表文字行的一小部分,例如寬度為16個畫素的文字塊。每個提議可能包含單個或多個筆畫、字元的一部分、單個或多個字元等。我們認為,只預測每個提議的垂直位置會更準確,因為其水平位置被固定,而水平位置本身更難預測。與預測目標4個座標的RPN相比,這減少了搜尋空間。我們開發了垂直錨點機制,可以同時預測每個細粒度提議的文字/非文字分數和y軸位置。檢測一般的固定寬度文字提議也比識別孤立字元更可靠,後者容易與字元的一部分或多個字元混淆。此外,在一系列固定寬度的文字提議中檢測文字行,也可以在多尺度和多長寬比的文字上可靠地工作。

To this end, we design the fine-scale text proposal as follow. Our detector investigates each spatial location in the conv5 densely. A text proposal is defined to have a fixed width of 16 pixels (in the input image). This is equal to move the detector densely through the conv5 maps, where the total stride is exactly 16 pixels. Then we design k vertical anchors to predict y-coordinates for each proposal. The k anchors have a same horizontal location with a fixed width of 16 pixels, but their vertical locations are varied in k different heights. In our experiments, we use ten anchors for each proposal, k = 10, whose heights are varied from 11 to 273 pixels (by ÷0.7 each time) in the input image. The explicit vertical coordinates are measured by the height and y-axis center of a proposal bounding box. We compute relative predicted vertical coordinates (v) with respect to the bounding box location of an anchor as,

v_c = (c_y − c_y^a)/h^a,    v_h = log(h/h^a)        (1)

v_c^* = (c_y^* − c_y^a)/h^a,    v_h^* = log(h^*/h^a)        (2)

where v = {v_c, v_h} and v^* = {v_c^*, v_h^*} are the relative predicted coordinates and ground truth coordinates, respectively. c_y^a and h^a are the center (y-axis) and height of the anchor box, which can be pre-computed from an input image. c_y and h are the predicted y-axis coordinates in the input image, while c_y^* and h^* are the ground truth coordinates. Therefore, each predicted text proposal has a bounding box with size of h×16 (in the input image), as shown in Fig. 1 (b) and Fig. 2 (right). Generally, a text proposal is largely smaller than its effective receptive field which is 228×228.


為此,我們按如下方式設計細粒度文字提議。我們的檢測器密集地檢查conv5中的每個空間位置。文字提議被定義為具有16個畫素的固定寬度(在輸入影象中)。這相當於讓檢測器在conv5的對映上密集移動,其總步長恰好為16個畫素。然後,我們設計k個垂直錨點來預測每個提議的y座標。這k個錨點具有相同的水平位置和固定的16畫素寬度,但其垂直位置在k個不同的高度上變化。在我們的實驗中,我們對每個提議使用十個錨點,k=10,其高度在輸入影象中從11個畫素變化到273個畫素(每次÷0.7)。明確的垂直座標由提議邊界框的高度和y軸中心來度量。我們計算相對於錨點邊界框位置的相對預測垂直座標(v),如下所示:

v_c = (c_y − c_y^a)/h^a,    v_h = log(h/h^a)        (1)

v_c^* = (c_y^* − c_y^a)/h^a,    v_h^* = log(h^*/h^a)        (2)

其中v={v_c, v_h}和v^*={v_c^*, v_h^*}分別是相對的預測座標和真實座標。c_y^a和h^a是錨框的中心(y軸)和高度,可以從輸入影象中預先計算。c_y和h是輸入影象中預測的y軸座標,而c_y^*和h^*是真實座標。因此,如圖1(b)和圖2(右)所示,每個預測的文字提議都有一個大小為h×16的邊界框(在輸入影象中)。一般來說,文字提議遠小於它的有效感受野228×228。
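
As a concrete illustration of Eqs. (1) and (2), the NumPy sketch below builds the k = 10 anchor heights (from 11 to roughly 273 pixels, each divided by 0.7) and encodes/decodes the vertical coordinates. The function names and sample numbers are ours; this is a hedged sketch of the parameterization, not the authors' code.

```python
import numpy as np

# k = 10 vertical anchors: fixed 16-px width, heights from 11 to ~273 px
# (each height is the previous one divided by 0.7, as described above).
K = 10
ANCHOR_HEIGHTS = np.array([11 / (0.7 ** i) for i in range(K)])

def encode(cy_gt, h_gt, cy_a, h_a):
    """Eq. (2): ground-truth (cy*, h*) -> regression targets (vc*, vh*)."""
    vc = (cy_gt - cy_a) / h_a
    vh = np.log(h_gt / h_a)
    return vc, vh

def decode(vc, vh, cy_a, h_a):
    """Inverse of Eq. (1): predicted (vc, vh) -> y-center and height."""
    cy = vc * h_a + cy_a
    h = np.exp(vh) * h_a
    return cy, h

# Example: an anchor centered at y = 120 with the 5th height.
cy_a, h_a = 120.0, ANCHOR_HEIGHTS[4]
vc_t, vh_t = encode(cy_gt=132.0, h_gt=40.0, cy_a=cy_a, h_a=h_a)
print(decode(vc_t, vh_t, cy_a, h_a))   # -> (132.0, 40.0), round-trip check
```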


The detection processing is summarised as follow. Given an input image, we have W×H×C conv5 feature maps (by using the VGG16 model), where C is the number of feature maps or channels, and W×H is the spatial arrangement. When our detector is sliding a 3×3 window densely through the conv5, each sliding-window takes a convolutional feature of 3×3×C for producing the prediction. For each prediction, the horizontal location (x-coordinates) and k-anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image. Our detector outputs the text/non-text scores and the predicted y-coordinates (v) for k anchors at each window location. The detected text proposals are generated from the anchors having a text/non-text score of >0.7 (with non-maximum suppression). By the designed vertical anchor and fine-scale detection strategy, our detector is able to handle text lines in a wide range of scales and aspect ratios by using a single-scale image. This further reduces its computation, and at the same time, predicting accurate localizations of the text lines. Compared to the RPN or Faster R-CNN system [25], our fine-scale detection provides more detailed supervised information that naturally leads to a more accurate detection.

檢測處理總結如下。給定輸入影象,我們得到W×H×C的conv5特徵對映(使用VGG16模型),其中C是特徵對映或通道的數目,W×H是空間佈局。當我們的檢測器在conv5上密集地滑動3×3視窗時,每個滑動視窗使用3×3×C的卷積特徵來產生預測。對於每個預測,水平位置(x座標)和k個錨點的位置是固定的,可以通過將conv5中的空間視窗位置對映到輸入影象上來預先計算。我們的檢測器在每個視窗位置輸出k個錨點的文字/非文字分數和預測的y座標(v)。檢測到的文字提議由文字/非文字分數>0.7的錨點生成(並經過非極大值抑制)。藉助所設計的垂直錨點和細粒度檢測策略,我們的檢測器能夠使用單尺度影象處理各種尺度和長寬比的文字行。這進一步減少了計算量,同時能夠預測文字行的準確位置。與RPN或Faster R-CNN系統[25]相比,我們的細粒度檢測提供了更詳細的監督資訊,自然會帶來更精確的檢測。
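
The following sketch (our own, with assumed tensor layouts) shows how the per-location outputs could be turned into fixed-width proposals: each conv5 column maps to a 16-pixel-wide slice of the input image, the inverse of Eq. (1) is applied per anchor, and only anchors scoring above 0.7 are kept before NMS.

```python
import numpy as np

STRIDE, PROPOSAL_WIDTH, SCORE_THRESH = 16, 16, 0.7

def generate_proposals(scores, v_pred, anchor_heights):
    """scores: (H, W, k) text probabilities; v_pred: (H, W, k, 2) with (vc, vh)."""
    boxes = []
    H, W, k = scores.shape
    for row in range(H):
        for col in range(W):
            x_left = col * STRIDE                 # conv5 column -> input-image x
            cy_a = row * STRIDE + STRIDE / 2.0    # assumed anchor y-center at this row
            for a in range(k):
                if scores[row, col, a] <= SCORE_THRESH:
                    continue
                vc, vh = v_pred[row, col, a]
                cy = vc * anchor_heights[a] + cy_a        # Eq. (1) inverted
                h = np.exp(vh) * anchor_heights[a]
                boxes.append((x_left, cy - h / 2.0,
                              x_left + PROPOSAL_WIDTH, cy + h / 2.0,
                              scores[row, col, a]))
    return boxes  # (x1, y1, x2, y2, score); NMS would follow
```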

3.2 Recurrent Connectionist Text Proposals

To improve localization accuracy, we split a text line into a sequence of fine-scale text proposals, and predict each of them separately. Obviously, it is not robust to regard each isolated proposal independently. This may lead to a number of false detections on non-text objects which have a similar structure as text patterns, such as windows, bricks, leaves, etc. (referred as text-like outliers in [13]). It is also possible to discard some ambiguous patterns which contain weak text information. Several examples are presented in Fig. 3 (top). Text have strong sequential characteristics where the sequential context information is crucial to make a reliable decision. This has been verified by recent work [9] where a recurrent neural network (RNN) is applied to encode this context information for text recognition. Their results have shown that the sequential context information is greatly facilitate the recognition task on cropped word images.

Figure 3

Fig. 3: Top: CTPN without RNN. Bottom: CTPN with RNN connection.

3.2 迴圈連線文字提議

為了提高定位精度,我們將文字行分成一系列細粒度的文字提議,並分別預測每一個提議。顯然,孤立地考慮每個提議並不魯棒。這可能會導致對與文字模式結構類似的非文字目標的誤檢,如視窗、磚塊、樹葉等(在文獻[13]中稱為類文字異常值),也可能丟棄一些含有微弱文字資訊的模糊模式。圖3(上)給出了幾個例子。文字具有很強的序列特性,序列上下文資訊對做出可靠的決策至關重要。最近的工作[9]已經證實了這一點,其中應用迴圈神經網路(RNN)來編碼用於文字識別的上下文資訊。他們的結果表明,序列上下文資訊極大地促進了對裁剪單詞影象的識別任務。

Figure 3

圖3:上:沒有RNN的CTPN。下:有RNN連線的CTPN。

Motivated from this work, we believe that this context information may also be of importance for our detection task. Our detector should be able to explore this important context information to make a more reliable decision, when it works on each individual proposal. Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals. RNN provides a natural choice for encoding this information recurrently using its hidden layers. To this end, we propose to design a RNN layer upon the conv5, which takes the convolutional feature of each window as sequential inputs, and updates its internal state recurrently in the hidden layer, H_t,

H_t = φ(H_{t−1}, X_t),    t = 1, 2, ..., W        (3)

where X_t ∈ R^(3×3×C) is the input conv5 feature from the t-th sliding-window (3×3). The sliding-window moves densely from left to right, resulting in t = 1, 2, ..., W sequential features for each row. W is the width of the conv5. H_t is a recurrent internal state that is computed jointly from both current input (X_t) and previous states encoded in H_{t−1}. The recurrence is computed by using a non-linear function φ, which defines exact form of the recurrent model. We exploit the long short-term memory (LSTM) architecture [12] for our RNN layer. The LSTM was proposed specially to address vanishing gradient problem, by introducing three additional multiplicative gates: the input gate, forget gate and output gate. Details can be found in [12]. Hence the internal state in RNN hidden layer accesses the sequential context information scanned by all previous windows through the recurrent connection. We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., 228 × width. We use a 128D hidden layer for each LSTM, resulting in a 256D RNN hidden layer, H_t ∈ R^256.


受到這項工作的啟發,我們認為這種上下文資訊對於我們的檢測任務也很重要。我們的檢測器應該能夠探索這些重要的上下文資訊,以便在處理每個單獨的提議時做出更可靠的決策。此外,我們的目標是直接在卷積層中編碼這些資訊,從而實現細粒度文字提議優雅而無縫的網內連線。RNN提供了一種自然的選擇,可以使用其隱藏層對這些資訊進行迴圈編碼。為此,我們提出在conv5上設計一個RNN層,它將每個視窗的卷積特徵作為序列輸入,並在隱藏層中迴圈更新其內部狀態H_t:

H_t = φ(H_{t−1}, X_t),    t = 1, 2, ..., W        (3)

其中X_t ∈ R^(3×3×C)是第t個滑動視窗(3×3)的輸入conv5特徵。滑動視窗從左向右密集移動,為每一行產生t = 1, 2, ..., W的序列特徵。W是conv5的寬度。H_t是迴圈內部狀態,由當前輸入(X_t)和編碼在H_{t−1}中的先前狀態聯合計算得到。遞迴通過非線性函式φ來計算,它定義了迴圈模型的確切形式。我們採用長短時記憶(LSTM)架構[12]作為我們的RNN層。LSTM通過引入三個附加的乘法門(輸入門、遺忘門和輸出門),專門用於解決梯度消失問題,細節可以在[12]中找到。因此,RNN隱藏層中的內部狀態可以通過迴圈連線訪問所有先前視窗掃描過的序列上下文資訊。我們通過使用雙向LSTM進一步擴充套件RNN層,使其能夠在兩個方向上編碼迴圈上下文,從而使連線的感受野能夠覆蓋整個影象寬度,例如228×寬度。我們對每個LSTM使用128維的隱藏層,從而得到256維的RNN隱藏層,H_t ∈ R^256。


The internal state in H_tH_t is mapped to the following FC layer, and output layer for computing the predictions of the t-th proposal. Therefore, our integration with the RNN layer is elegant, resulting in an efficient model that is end-to-end trainable without additional cost. The efficiency of the RNN connection is demonstrated in Fig. 3. Obviously, it reduces false detections considerably, and at the same time, recovers many missed text proposals which contain very weak text information.

H_t中的內部狀態被對映到後面的FC層以及輸出層,用於計算第t個提議的預測。因此,我們與RNN層的整合非常優雅,形成了一個高效的模型,可以在無需額外成本的情況下進行端到端的訓練。RNN連線的功效如圖3所示。顯然,它大大減少了誤檢,同時還能夠恢復許多包含非常微弱文字資訊的漏檢文字提議。

3.3 Side-refinement

The fine-scale text proposals are detected accurately and reliably by our CTPN. Text line construction is straightforward by connecting continuous text proposals whose text/non-text score is >0.7. Text lines are constructed as follow. First, we define a paired neighbour (B_j) for a proposal B_i as B_j −> B_i, when (i) B_j is the nearest horizontal distance to B_i, and (ii) this distance is less than 50 pixels, and (iii) their vertical overlap is >0.7. Second, two proposals are grouped into a pair, if B_j −> B_i and B_i −> B_j. Then a text line is constructed by sequentially connecting the pairs having a same proposal.

3.3 邊緣細化

我們的CTPN能夠準確可靠地檢測細粒度的文字提議。通過連線文字/非文字分數>0.7的連續文字提議,文字行的構建非常簡單。文字行按如下方式構建。首先,當(i)B_j是距離B_i水平距離最近的提議,(ii)該距離小於50畫素,並且(iii)它們的垂直重疊>0.7時,我們為提議B_i定義一個配對鄰居B_j,記為B_j −> B_i。其次,如果B_j −> B_i且B_i −> B_j,則將兩個提議歸為一對。然後,通過順序連線具有相同提議的對來構建文字行。
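
A minimal Python sketch of this grouping rule follows, assuming each proposal is an (x1, y1, x2, y2, score) tuple; the 50-pixel and 0.7 thresholds come from the text, while the helper names and the exact distance/overlap definitions are our assumptions.

```python
def vertical_overlap(a, b):
    # Intersection-over-union of the two proposals' vertical extents.
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[3] - a[1]) + (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def paired_neighbour(i, proposals):
    """Index j such that B_j -> B_i: nearest horizontally, < 50 px away,
    vertical overlap > 0.7; returns None if no such proposal exists."""
    best, best_dist = None, 50.0
    for j, b in enumerate(proposals):
        if j == i:
            continue
        dist = abs(b[0] - proposals[i][0])
        if dist < best_dist and vertical_overlap(proposals[i], b) > 0.7:
            best, best_dist = j, dist
    return best

def build_text_lines(proposals):
    pairs = set()
    for i in range(len(proposals)):
        j = paired_neighbour(i, proposals)
        if j is not None and paired_neighbour(j, proposals) == i:  # mutual pairing
            pairs.add((min(i, j), max(i, j)))
    # Sequentially merge pairs sharing a proposal into text lines.
    lines = []
    for pair in sorted(pairs):
        for line in lines:
            if line & set(pair):
                line.update(pair)
                break
        else:
            lines.append(set(pair))
    return lines  # each line is a set of proposal indices
```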

The fine-scale detection and RNN connection are able to predict accurate localizations in vertical direction. In horizontal direction, the image is divided into a sequence of equal 16-pixel width proposals. This may lead to an inaccurate localization when the text proposals in both horizontal sides are not exactly covered by a ground truth text line area, or some side proposals are discarded (e.g., having a low text score), as shown in Fig. 4. This inaccuracy may be not crucial in generic object detection, but should not be ignored in text detection, particularly for those small-scale text lines or words. To address this problem, we propose a side-refinement approach that accurately estimates the offset for each anchor/proposal in both left and right horizontal sides (referred as side-anchor or side-proposal). Similar to the y-coordinate prediction, we compute relative offset as,

o = (x_side − c_x^a)/w^a,    o^* = (x_side^* − c_x^a)/w^a        (4)

where x_side is the predicted x-coordinate of the nearest horizontal side (e.g., left or right side) to current anchor. x_side^* is the ground truth (GT) side coordinate in x-axis, which is pre-computed from the GT bounding box and anchor location. c_x^a is the center of anchor in x-axis. w^a is the width of anchor, which is fixed, w^a = 16. The side-proposals are defined as the start and end proposals when we connect a sequence of detected fine-scale text proposals into a text line. We only use the offsets of the side-proposals to refine the final text line bounding box. Several detection examples improved by side-refinement are presented in Fig. 4. The side-refinement further improves the localization accuracy, leading to about 2% performance improvements on the SWT and Multi-Lingual datasets. Notice that the offset for side-refinement is predicted simultaneously by our model, as shown in Fig. 1. It is not computed from an additional post-processing step.


Figure 4

Fig.4: CTPN detection with (red box) and without (yellow dashed box) the side-refinement. Color of fine-scale proposal box indicate a text/non-text score.

細粒度的檢測和RNN連線可以在垂直方向上預測準確的位置。在水平方向上,影象被分成一系列等寬(16個畫素)的提議。如圖4所示,當兩側的文字提議沒有恰好被真實文字行區域覆蓋,或者某些側邊提議被丟棄(例如文字得分較低)時,這可能會導致不準確的定位。這種不準確性在通用目標檢測中可能並不重要,但在文字檢測中不應忽視,特別是對於那些小尺度的文字行或單詞。為了解決這個問題,我們提出了一種邊緣細化方法,可以精確地估計左右兩側每個錨點/提議在水平方向上的偏移量(稱為邊緣錨點或邊緣提議)。與y座標預測類似,我們計算相對偏移為:

o = (x_side − c_x^a)/w^a,    o^* = (x_side^* − c_x^a)/w^a        (4)

其中x_side是最靠近當前錨點的水平邊(例如左邊或右邊)的預測x座標。x_side^*是x軸上的真實(GT)邊緣座標,它由真實邊界框和錨點位置預先計算得到。c_x^a是錨點在x軸上的中心。w^a是錨點的寬度,固定為w^a=16。當我們將一系列檢測到的細粒度文字提議連線成文字行時,邊緣提議被定義為其開始和結束提議。我們只使用邊緣提議的偏移量來優化最終的文字行邊界框。圖4給出了通過邊緣細化改進的幾個檢測示例。邊緣細化進一步提高了定位精度,使SWT和Multi-Lingual資料集上的效能提高了約2%。請注意,邊緣細化的偏移量是由我們的模型同時預測的,如圖1所示,而不是通過額外的後處理步驟計算的。
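
To illustrate Eq. (4), the short sketch below encodes a ground-truth side coordinate into an offset target and applies a predicted offset to refine a text line boundary; the function names and example numbers are ours.

```python
ANCHOR_WIDTH = 16.0  # w^a is fixed to 16 pixels

def encode_side_offset(x_side_gt, cx_a, w_a=ANCHOR_WIDTH):
    """Eq. (4): o* = (x*_side - c_x^a) / w^a."""
    return (x_side_gt - cx_a) / w_a

def apply_side_offset(o_pred, cx_a, w_a=ANCHOR_WIDTH):
    """Invert Eq. (4) to get the refined side x-coordinate."""
    return o_pred * w_a + cx_a

# Example: a side-anchor centered at x = 408; the predicted offset moves the
# text line's right boundary from the proposal edge to the refined position.
print(apply_side_offset(0.25, cx_a=408.0))   # -> 412.0
```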


Figure 4

圖4:有(紅色框)和沒有(黃色虛線框)邊緣細化的CTPN檢測結果。細粒度提議框的顏色表示其文字/非文字分數。

3.4 Model Outputs and Loss Functions

The proposed CTPN has three outputs which are jointly connected to the last FC layer, as shown in Fig. 1 (a). The three outputs simultaneously predict text/non-text scores (s), vertical coordinates (v = {v_c, v_h}) in E.q. (2) and side-refinement offset (o). We explore k anchors to predict them on each spatial location in the conv5, resulting in 2k, 2k and k parameters in the output layer, respectively.

3.4 模型輸出與損失函式

提出的CTPN有三個輸出,共同連線到最後的FC層,如圖1(a)所示。這三個輸出同時預測公式(2)中的文字/非文字分數(s)、垂直座標(v={v_c, v_h})和邊緣細化偏移量(o)。我們在conv5的每個空間位置上使用k個錨點來預測它們,從而在輸出層分別得到2k、2k和k個引數。
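
For readers who want to map Fig. 1(a) and these output sizes onto code, the PyTorch sketch below is one plausible wiring of the CTPN head on top of VGG16's 512-channel conv5. It is an illustrative sketch rather than the authors' Caffe implementation: the 3×3 window is realized here as a 3×3 convolution, whereas the paper feeds the raw 3×3×C window feature to the BLSTM, and all layer names are assumptions.

```python
import torch
import torch.nn as nn

class CTPNHead(nn.Module):
    """3x3 sliding window on conv5 -> per-row BLSTM -> 512D FC -> 3 outputs."""
    def __init__(self, in_channels=512, k=10):
        super().__init__()
        # 3x3 window over conv5, implemented here as a 3x3 convolution.
        self.window = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Bi-directional LSTM: two 128D LSTMs -> 256D recurrent state per step.
        self.blstm = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 512)
        self.score = nn.Linear(512, 2 * k)      # text/non-text scores
        self.vertical = nn.Linear(512, 2 * k)   # (v_c, v_h) per anchor
        self.side = nn.Linear(512, k)           # side-refinement offset o

    def forward(self, conv5):                   # conv5: (N, C, H, W)
        x = torch.relu(self.window(conv5))      # (N, 512, H, W)
        n, c, h, w = x.shape
        # Treat each of the H rows as a left-to-right sequence of length W.
        seq = x.permute(0, 2, 3, 1).reshape(n * h, w, c)
        seq, _ = self.blstm(seq)                # (N*H, W, 256)
        feat = torch.relu(self.fc(seq))         # (N*H, W, 512)
        return (self.score(feat).reshape(n, h, w, -1),
                self.vertical(feat).reshape(n, h, w, -1),
                self.side(feat).reshape(n, h, w, -1))

head = CTPNHead()
scores, verticals, sides = head(torch.randn(1, 512, 38, 50))
print(scores.shape, verticals.shape, sides.shape)  # (1,38,50,20) (1,38,50,20) (1,38,50,10)
```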

We employ multi-task learning to jointly optimize model parameters. We introduce three loss functions, L^cl_s, L^re_v and L^re_o, which compute errors of text/non-text score, coordinate and side-refinement, respectively. With these considerations, we follow the multi-task loss applied in [5,25], and minimize an overall objective function (L) for an image as,

L(s_i, v_j, o_k) = (1/N_s) Σ_i L^cl_s(s_i, s_i^*) + (λ_1/N_v) Σ_j L^re_v(v_j, v_j^*) + (λ_2/N_o) Σ_k L^re_o(o_k, o_k^*)        (5)

where each anchor is a training sample, and i is the index of an anchor in a mini-batch. s_i is the predicted probability of anchor i being a true text. s_i^* = {0, 1} is the ground truth. j is the index of an anchor in the set of valid anchors for y-coordinates regression, which are defined as follow. A valid anchor is a defined positive anchor (s_j^* = 1, described below), or has an Intersection-over-Union (IoU) > 0.5 overlap with a ground truth text proposal. v_j and v_j^* are the prediction and ground truth y-coordinates associated with the j-th anchor. k is the index of a side-anchor, which is defined as a set of anchors within a horizontal distance (e.g., 32-pixel) to the left or right side of a ground truth text line bounding box. o_k and o_k^* are the predicted and ground truth offsets in x-axis associated to the k-th anchor. L^cl_s is the classification loss which we use Softmax loss to distinguish text and non-text. L^re_v and L^re_o are the regression loss. We follow previous work by using the smooth L_1 function to compute them [5, 25]. λ_1 and λ_2 are loss weights to balance different tasks, which are empirically set to 1.0 and 2.0. N_s, N_v and N_o are normalization parameters, denoting the total number of anchors used by L^cl_s, L^re_v and L^re_o, respectively.


我們採用多工學習來聯合優化模型引數。我們引入三個損失函式:L^cl_s、L^re_v和L^re_o,分別計算文字/非文字分數、座標和邊緣細化的誤差。考慮到這些因素,我們遵循[5,25]中應用的多工損失,最小化影象的總體目標函式(L):

L(s_i, v_j, o_k) = (1/N_s) Σ_i L^cl_s(s_i, s_i^*) + (λ_1/N_v) Σ_j L^re_v(v_j, v_j^*) + (λ_2/N_o) Σ_k L^re_o(o_k, o_k^*)        (5)

其中每個錨點都是一個訓練樣本,i是小批量資料中某個錨點的索引。s_i是錨點i為真實文字的預測概率。s_i^* = {0, 1}是真實標籤。j是y座標迴歸的有效錨點集合中錨點的索引,其定義如下:有效錨點是定義的正錨點(s_j^* = 1,如下所述),或者與真實文字提議的交併比(IoU)重疊>0.5的錨點。v_j和v_j^*是與第j個錨點關聯的預測和真實y座標。k是邊緣錨點的索引,邊緣錨點定義為與真實文字行邊界框的左側或右側水平距離在一定範圍內(例如32個畫素)的一組錨點。o_k和o_k^*是與第k個錨點關聯的x軸上的預測和真實偏移量。L^cl_s是分類損失,我們使用Softmax損失來區分文字和非文字。L^re_v和L^re_o是迴歸損失,我們遵循以前的工作,使用smooth L_1函式來計算它們[5,25]。λ_1和λ_2是用來平衡不同任務的損失權重,經驗地設定為1.0和2.0。N_s、N_v和N_o是歸一化引數,分別表示L^cl_s、L^re_v和L^re_o所使用的錨點總數。
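
A compact sketch of Eq. (5) in PyTorch could look like the following, with λ_1 = 1.0 and λ_2 = 2.0 as stated above; the tensor shapes, masks and variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def ctpn_loss(cls_logits, cls_labels, v_pred, v_gt, v_mask, o_pred, o_gt, o_mask,
              lambda1=1.0, lambda2=2.0):
    """Eq. (5): softmax classification + smooth-L1 vertical and side regression.

    cls_logits: (N, 2) text/non-text logits for the sampled anchors
    cls_labels: (N,)   0 = non-text, 1 = text
    v_pred/v_gt: (M, 2) with mask v_mask selecting valid anchors for y-regression
    o_pred/o_gt: (M,)   with mask o_mask selecting side-anchors
    """
    # L^cl_s: softmax (cross-entropy) loss, averaged over the N_s sampled anchors.
    loss_cls = F.cross_entropy(cls_logits, cls_labels)

    # L^re_v: smooth L1 over valid anchors (positives or IoU > 0.5), mean over N_v.
    loss_v = F.smooth_l1_loss(v_pred[v_mask], v_gt[v_mask])

    # L^re_o: smooth L1 over side-anchors, mean over N_o.
    loss_o = F.smooth_l1_loss(o_pred[o_mask], o_gt[o_mask])

    return loss_cls + lambda1 * loss_v + lambda2 * loss_o
```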


3.5 Training and Implementation Details

The CTPN can be trained end-to-end by using the standard back-propagation and stochastic gradient descent (SGD). Similar to RPN [25], training samples are the anchors, whose locations can be pre computed in input image, so that the training labels of each anchor can be computed from corresponding GT box.

3.5 訓練和實現細節

通過使用標準的反向傳播和隨機梯度下降(SGD),可以對CTPN進行端對端訓練。與RPN[25]類似,訓練樣本是錨點,其位置可以在輸入影象中預先計算,以便可以從相應的實際邊界框中計算每個錨點的訓練標籤。

Training labels. For text/non-text classification, a binary label is assigned to each positive (text) or negative (non-text) anchor. It is defined by computing the IoU overlap with the GT bounding box (divided by anchor location). A positive anchor is defined as: (i) an anchor that has an > 0.7 IoU overlap with any GT box; or (ii) the anchor with the highest IoU overlap with a GT box. By the condition (ii), even a very small text pattern can assign a positive anchor. This is crucial to detect small-scale text patterns, which is one of key advantages of the CTPN. This is different from generic object detection where the impact of condition (ii) may be not significant. The negative anchors are defined as < 0.5 IoU overlap with all GT boxes. The training labels for the y-coordinate regression (v^*) and offset regression (o^*) are computed as E.q. (2) and (4) respectively.

訓練標籤。對於文字/非文字分類,為每個正(文字)錨點或負(非文字)錨點分配一個二值標籤,它通過計算與真實邊界框的IoU重疊(按錨點位置)來定義。正錨點定義為:(i)與任何真實邊界框具有>0.7的IoU重疊的錨點;或者(ii)與某個真實邊界框具有最高IoU重疊的錨點。根據條件(ii),即使是非常小的文字模式也可以分配到一個正錨點。這對於檢測小尺度文字模式至關重要,是CTPN的主要優勢之一。這不同於通用目標檢測,在通用目標檢測中條件(ii)的影響可能並不顯著。負錨點定義為與所有真實邊界框的IoU重疊<0.5。y座標迴歸(v^*)和偏移回歸(o^*)的訓練標籤分別按公式(2)和(4)計算。
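
The labelling rule could be implemented roughly as below, assuming an IoU matrix of shape (num_anchors, num_gt) has already been computed; the value -1 marks anchors that are ignored during sampling.

```python
import numpy as np

def label_anchors(iou_matrix, pos_thresh=0.7, neg_thresh=0.5):
    """iou_matrix: (num_anchors, num_gt) IoU between every anchor and GT box.
    Returns labels: 1 = positive, 0 = negative, -1 = ignored."""
    num_anchors = iou_matrix.shape[0]
    labels = np.full(num_anchors, -1, dtype=np.int64)

    max_iou = iou_matrix.max(axis=1)
    labels[max_iou < neg_thresh] = 0   # below 0.5 with all GT boxes -> negative
    labels[max_iou > pos_thresh] = 1   # condition (i): IoU > 0.7 with any GT box

    # Condition (ii): the best-matching anchor for each GT box is positive,
    # so even very small text patterns get at least one positive anchor.
    best_anchor_per_gt = iou_matrix.argmax(axis=0)
    labels[best_anchor_per_gt] = 1
    return labels
```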

Training data. In the training process, each mini-batch samples are collected randomly from a single image. The number of anchors for each mini-batch is fixed to N_s = 128, with 1:1 ratio for positive and negative samples. A mini-batch is padded with negative samples if the number of positive ones is fewer than 64. Our model was trained on 3,000 natural images, including 229 images from the ICDAR 2013 training set. We collected the other images ourselves and manually labelled them with text line bounding boxes. All self-collected training images are not overlapped with any test image in all benchmarks. The input image is resized by setting its short side to 600 for training, while keeping its original aspect ratio.

訓練資料。在訓練過程中,每個小批量樣本從單張影象中隨機收集。每個小批量的錨點數量固定為N_s = 128,正負樣本的比例為1:1。如果正樣本的數量少於64,則用負樣本填充該小批量。我們的模型在3000張自然影象上訓練,其中包括來自ICDAR 2013訓練集的229張影象。其餘影象由我們自己收集,並手工標註了文字行邊界框。在所有基準測試集中,所有自行收集的訓練影象都不與任何測試影象重疊。訓練時,通過將輸入影象的短邊設定為600來調整其大小,同時保持原始長寬比。
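
Building on the labels from the previous sketch, sampling 128 anchors per image with a 1:1 positive/negative ratio (and padding with negatives when fewer than 64 positives exist) might look like this; this is our own hedged sketch.

```python
import numpy as np

def sample_minibatch(labels, batch_size=128, rng=np.random.default_rng()):
    """labels: per-anchor labels (1 positive, 0 negative, -1 ignored).
    Returns indices of the sampled anchors."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)

    num_pos = min(len(pos), batch_size // 2)   # at most 64 positives
    num_neg = batch_size - num_pos             # pad with negatives if needed

    pos_idx = rng.choice(pos, size=num_pos, replace=False)
    neg_idx = rng.choice(neg, size=min(num_neg, len(neg)), replace=False)
    return np.concatenate([pos_idx, neg_idx])
```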

Implementation Details. We follow the standard practice, and explore the very deep VGG16 model [27] pre-trained on the ImageNet data [26]. We initialize the new layers (e.g., the RNN and output layers) by using random weights with Gaussian distribution of 0 mean and 0.01 standard deviation. The model was trained end-to-end by fixing the parameters in the first two convolutional layers. We used 0.9 momentum and 0.0005 weight decay. The learning rate was set to 0.001 in the first 16K iterations, followed by another 4K iterations with 0.0001 learning rate. Our model was implemented in Caffe framework [17].

實現細節。我們遵循標準做法,採用在ImageNet資料[26]上預訓練的非常深的VGG16模型[27]。我們使用均值為0、標準差為0.01的高斯分佈隨機權重來初始化新層(例如RNN層和輸出層)。該模型通過固定前兩個卷積層的引數進行端到端訓練。我們使用0.9的動量和0.0005的權重衰減。前16K次迭代的學習率設定為0.001,隨後以0.0001的學習率再進行4K次迭代。我們的模型在Caffe框架[17]中實現。
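
For reference, the same hyper-parameters expressed in PyTorch are sketched below; the original model was implemented in Caffe, so this is only an equivalent-looking configuration, and model/new_layers are placeholders.

```python
import torch
import torch.nn as nn

def init_new_layers(new_layers):
    """Gaussian init (mean 0, std 0.01) for newly added conv/linear layers
    (LSTM parameter init is omitted in this sketch)."""
    for m in new_layers.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def make_optimizer(model):
    # SGD with 0.9 momentum and 0.0005 weight decay, as described above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    # lr = 0.001 for the first 16K iterations, then 0.0001 for another 4K.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[16000], gamma=0.1)
    return optimizer, scheduler
```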

4. Experimental Results and Discussions

We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual dataset [24]. In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or in-network recurrent connection. The ICDAR 2013 is used for this component evaluation.

4. 實驗結果和討論

我們在五個文字檢測基準資料集上評估CTPN,即ICDAR 2011[21]、ICDAR 2013[19]、ICDAR 2015[18]、SWT[3]和Multilingual[24]資料集。在我們的實驗中,我們首先單獨驗證所提出的每個元件的有效性,例如細粒度文字提議檢測或網內迴圈連線。ICDAR 2013用於這一元件評估。

4.1 Benchmarks and Evaluation Metric

The ICDAR 2011 dataset [21] consists of 229 training images and 255 testing ones, where the images are labelled in word level. The ICDAR 2013 [19] is similar as the ICDAR 2011, and has in total 462 images, including 229 images and 233 images for training and testing, respectively. The ICDAR 2015 (Incidental Scene Text - Challenge 4) [18] includes 1,500 images which were collected by using the Google Glass. The training set has 1,000 images, and the remained 500 images are used for test. This dataset is more challenging than previous ones by including arbitrary orientation, very small-scale and low resolution text. The Multilingual scene text dataset is collected by [24]. It contains 248 images for training and 239 for testing. The images include multi-languages text, and the ground truth is labelled in text line level. Epshtein et al. [3] introduced the SWT dataset containing 307 images which include many extremely small-scale text.

4.1 基準資料集和評估標準

ICDAR 2011資料集[21]由229張訓練影象和255張測試影象組成,影象以單詞級別標註。ICDAR 2013[19]與ICDAR 2011類似,共有462張影象,其中包括229張訓練影象和233張測試影象。ICDAR 2015(Incidental Scene Text —— Challenge 4)[18]包括使用Google Glass收集的1500張影象。訓練集有1000張影象,剩餘的500張影象用於測試。這個資料集比以前的資料集更具挑戰性,包含任意方向、非常小尺度和低解析度的文字。Multilingual場景文字資料集由[24]收集,包含248張訓練影象和239張測試影象。影象包含多種語言的文字,並且真實值以文字行級別標註。Epshtein等人[3]引入了包含307張影象的SWT資料集,其中包含許多極小尺度的文字。

We follow previous work by using standard evaluation protocols which are provided by the dataset creators or competition organizers. For the ICDAR 2011 we use the standard protocol proposed by [30], the evaluation on the ICDAR 2013 follows the standard in [19]. For the ICDAR 2015, we used the online evaluation system provided by the organizers as in [18]. The evaluations on the SWT and Multilingual datasets follow the protocols defined in [3] and [24] respectively.

我們遵循以前的工作,使用由資料集建立者或競賽組織者提供的標準評估協議。對於ICDAR 2011,我們使用[30]提出的標準協議,對ICDAR 2013的評估遵循[19]中的標準。對於ICDAR 2015,我們使用了由組織者提供的線上評估系統[18]。SWT和Multilingual資料集的評估分別遵循[3]和[24]中定義的協議。

4.2 Fine-Scale Text Proposal Network with Faster R-CNN

We first discuss our fine-scale detection strategy against the RPN and Faster R-CNN system [25]. As can be found in Table 1 (left), the individual RPN is difficult to perform accurate text localization, by generating a large amount of false detections (low precision). By refining the RPN proposals with a Fast R-CNN detection model [5], the Faster R-CNN system improves localization accuracy considerably, with a F-measure of 0.75. One observation is that the Faster R-CNN also increases the recall of original RPN. This may benefit from joint bounding box regression mechanism of the Fast R-CNN, which improves the accuracy of a predicted bounding box. The RPN proposals may roughly localize a major part of a text line or word, but they are not accurate enough by the ICDAR 2013 standard. Obviously, the proposed fine-scale text proposal network (FTPN) improves the Faster R-CNN remarkably in both precision and recall, suggesting that the FTPN is more accurate and reliable, by predicting a sequence of fine-scale text proposals rather than a whole text line.

Table 1: Component evaluation on the ICDAR 2013, and State-of-the-art results on the SWT and MULTILINGUAL.

Table 1

4.2 具有Faster R-CNN的細粒度文字提議網路

我們首先將我們的細粒度檢測策略與RPN和Faster R-CNN系統[25]進行比較。如表1(左)所示,單獨的RPN難以實現準確的文字定位,因為它會產生大量誤檢(精度低)。通過使用Fast R-CNN檢測模型[5]細化RPN提議,Faster R-CNN系統顯著提高了定位精度,其F-measure為0.75。一個觀察結果是,Faster R-CNN也提高了原始RPN的召回率。這可能得益於Fast R-CNN的聯合邊界框迴歸機制,其提高了預測邊界框的準確性。RPN提議可以粗略定位文字行或單詞的主要部分,但按照ICDAR 2013的標準還不夠準確。顯然,所提出的細粒度文字提議網路(FTPN)在精確率和召回率上都顯著改進了Faster R-CNN,這表明通過預測一系列細粒度文字提議而不是整個文字行,FTPN更精確、更可靠。

表1:ICDAR 2013上的元件評估,以及在SWT和MULTILINGUAL資料集上的最新結果。

Table 1

4.3 Recurrent Connectionist Text Proposals

We discuss impact of recurrent connection on our CTPN. As shown in Fig. 3, the context information is greatly helpful to reduce false detections, such as text-like outliers. It is of great importance for recovering highly ambiguous text (e.g., extremely small-scale ones), which is one of main advantages of our CTPN, as demonstrated in Fig. 6. These appealing properties result in a significant performance boost. As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially from a F-measure of 0.80 to 0.88.

Figure 6

Fig.6: CTPN detection results on extremely small-scale cases (in red boxes), where some ground truth boxes are missed. Yellow boxes are the ground truth.

4.3 迴圈連線文字提議

我們討論迴圈連線對CTPN的影響。如圖3所示,上下文資訊對於減少誤檢非常有用,例如類似文字的異常值。對於恢復高度模糊的文字(例如極小的文字)來說,上下文資訊也非常重要,這是我們CTPN的主要優勢之一,如圖6所示。這些吸引人的特性帶來了顯著的效能提升。如表1(左)所示,使用我們的迴圈連線,CTPN大幅改善了FTPN,將F-measure從0.80提高到0.88。

Figure 6

圖6:CTPN在極小尺度文字上的檢測結果(紅色框),其中一些真實邊界框被遺漏。黃色框是真實值。

Running time. The implementation time of our CTPN (for whole detection processing) is about 0.14s per image with a fixed short side of 600, by using a single GPU. The CTPN without the RNN connection takes about 0.13s/image GPU time. Therefore, the proposed in-network recurrent mechanism increase model computation marginally, with considerable performance gain obtained.

執行時間。通過使用單個GPU,我們的CTPN(用於整個檢測處理)的執行時間為每張影象大約0.14s,固定短邊為600。沒有RNN連線的CTPN每張影象GPU時間大約需要0.13s。因此,所提出的網內迴圈機制稍微增加了模型計算,並獲得了相當大的效能增益。

4.4 Comparisons with state-of-the-art results

Our detection results on several challenging images are presented in Fig. 5. As can be found, the CTPN works perfectly on these challenging cases, some of which are difficult for many previous methods. It is able to handle multi-scale and multi-language efficiently (e.g., Chinese and Korean).

Figure 5

Fig. 5: CTPN detection results on several challenging images, including multi-scale and multi-language text lines. Yellow boxes are the ground truth.

4.4 與最新結果的比較

我們在幾張具有挑戰性的影象上的檢測結果如圖5所示。可以看到,CTPN在這些具有挑戰性的情況下都能很好地工作,其中一些情況對於許多以前的方法來說是困難的。它能夠有效地處理多尺度和多語言文字(例如中文和韓文)。

Figure 5

圖5:CTPN在幾個具有挑戰性的影象上的檢測結果,包括多尺度和多語言文字行。黃色邊界框是真實值。

The full evaluation was conducted on five benchmarks. Image resolution is varied significantly in different datasets. We set short side of images to 2000 for the SWT and ICDAR 2015, and 600 for the other three. We compare our performance against recently published results in [1,28,34]. As shown in Table 1 and 2, our CTPN achieves the best performance on all five datasets. On the SWT, our improvements are significant on both recall and F-measure, with marginal gain on precision. Our detector performs favourably against the TextFlow on the Multilingual, suggesting that our method generalize well to various languages. On the ICDAR 2013, it outperforms recent TextFlow [28] and FASText [1] remarkably by improving the F-measure from 0.80 to 0.88. The gains are considerable in both precision and recall, with more than +5% and +7% improvements, respectively. In addition, we further compare our method against [8,11,35], which were published after our initial submission. It consistently obtains substantial improvements on F-measure and recall. This may due to strong capability of CTPN for detecting extremely challenging text, e.g., very small-scale ones, some of which are even difficult for human. As shown in Fig. 6, those challenging ones are detected correctly by our detector, but some of them are even missed by the GT labelling, which may reduce our precision in evaluation.

Table 2: State-of-the-art results on the ICDAR 2011, 2013 and 2015.

Table 2

全面評估是在五個基準資料集上進行的。不同資料集的影象解析度差異顯著。我們將SWT和ICDAR 2015的影象短邊設定為2000,其他三個資料集設定為600。我們將我們的效能與最近公佈的結果[1,28,34]進行了比較。如表1和表2所示,我們的CTPN在所有五個資料集上都達到了最佳效能。在SWT上,我們在召回率和F-measure上的改進都很顯著,精確率也有小幅提升。我們的檢測器在Multilingual上優於TextFlow,表明我們的方法能很好地泛化到各種語言。在ICDAR 2013上,它顯著優於最近的TextFlow[28]和FASText[1],將F-measure從0.80提高到0.88。精確率和召回率均有可觀的提升,分別超過+5%和+7%。此外,我們進一步將我們的方法與[8,11,35]進行比較,它們是在我們首次提交之後發表的。我們的方法在F-measure和召回率上始終取得顯著提升。這可能得益於CTPN檢測極具挑戰性文字的強大能力,例如非常小的文字,其中一些甚至對人來說都很難辨認。如圖6所示,那些具有挑戰性的文字能被我們的檢測器正確檢測到,但其中一些甚至被真實標註遺漏,這可能會降低評估中我們的精確率。

表2:ICDAR 2011,2013和2015上的最新結果。

Table 2

We further investigate running time of various methods, as compared in Table 2. FASText [1] achieves 0.15s/image CPU time. Our method is slightly faster than it by obtaining 0.14s/image, but in GPU time. Though it is not fair to compare them directly, the GPU computation has become mainstream with recent great success of deep learning approaches on object detection [25,5,6]. Regardless of running time, our method outperforms the FASText substantially with 11% improvement on F-measure. Our time can be reduced by using a smaller image scale. By using the scale of 450, it is reduced to 0.09s/image, while obtaining P/R/F of 0.92/0.77/0.84 on the ICDAR 2013, which are compared competitively against Gupta et al.'s approach [8] using 0.07s/image with GPU.

我們進一步比較了各種方法的執行時間,如表2所示。FASText[1]的CPU時間為每張影象0.15s。我們的方法略快於它,達到每張影象0.14s,不過是GPU時間。儘管直接比較它們並不公平,但隨著深度學習方法最近在目標檢測上的巨大成功[25,5,6],GPU計算已經成為主流。不考慮執行時間,我們的方法大幅優於FASText,F-measure提高了11%。通過使用較小的影象尺度,我們的執行時間還可以進一步縮短。當使用450的尺度時,執行時間降低到每張影象0.09s,同時在ICDAR 2013上獲得0.92/0.77/0.84的P/R/F,與Gupta等人使用GPU達到每張影象0.07s的方法[8]相比具有競爭力。

5. Conclusions

We have presented a Connectionist Text Proposal Network (CTPN) —— an efficient text detector that is end-to-end trainable. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional maps. We develop vertical anchor mechanism that jointly predicts precise location and text/non-text score for each proposal, which is the key to realize accurate localization of text. We propose an in-network RNN layer that connects sequential text proposals elegantly, allowing it to explore meaningful context information. These key technical developments result in a powerful ability to detect highly challenging text, with less false detections. The CTPN is efficient by achieving new state-of-the-art performance on five benchmarks, with 0.14s/image running time.

5. 結論

我們提出了連線文字提議網路(CTPN),一種可端到端訓練的高效文字檢測器。CTPN直接在卷積對映中的一系列細粒度文字提議中檢測文字行。我們開發了垂直錨點機制,聯合預測每個提議的精確位置和文字/非文字分數,這是實現文字準確定位的關鍵。我們提出了一個網內RNN層,可以優雅地連線順序的文字提議,使其能夠探索有意義的上下文資訊。這些關鍵技術的發展帶來了檢測極具挑戰性文字的強大能力,同時減少了誤檢。CTPN在五個基準資料集上取得了新的最佳效能,每張影象的執行時間為0.14s,十分高效。
