ECCV 2020
Huazhong University of Science and Technology (Xiang Bai), China & Facebook AI, USA
github: https://github.com/MhLiao/MaskTextSpotterV3
Mask TextSpotter V1 是第一篇可以完全 end-to-end 的 ocr 論文
雖然在V1之前有兩篇 text spotter 方面的論文,但是訓練起來不能完全的 end-to-end,通常是分開訓練 detection 和 recognition 後再聯合訓練,或者是先訓練 detection 後再加入 recognition 一起訓練。
Mask-TextSpotter V2 基於空間 attention 的方式進行識別
與 V1 相比, V2 的主要改進是在識別部分,在 V1 中識別是對單個字符的識別。V2 中提出了SAM(Spatial Attention Module),採用了基於空間 attention 的方式進行識別。
Mask-TextSpotter V3 採用 Segmentation Proposal Network (SPN) 更加精確的得到文本行特徵
前兩版的都是基於mask-rcnn的,它們對於密集的傾斜文本和長文本效果不好,主要是因為 1.預先設定的 anchor box 無法滿足長寬比很大的文本行;2.當文本密集時,一個矩形框中會包含多個文本行。
![]()
RPN relies heavily on manually designed anchors and its proposals are represented with axis-aligned rectangles.
The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances into a single proposal, in cases of densely oriented text.
To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN.
Mask TextSpotter v3 consists of:
Given a text segmentation map, , whose values are in the range of , we first binarize into a binary map :
Here, and are the indices of the segmentation or binary map and is set to .
We dilate them by un-clipping pixels, where is calculated as . Here, and are the area and perimeter of the predicted shrunk text regions. is set to according to the value of the shrink ratio .
in Mask TextSpotter v2:
sequence recognition loss in
dice loss:
It is first pre-trained with SynthText and then fine-tuned with a mixture of SynthText, the ICDAR 2013 dataset, the ICDAR 2015 dataset, the SCUT dataset, and the Total-Text dataset (2 : 2 : 2 : 1 : 1) for 250k iterations.
We test for rotation robustness by conducting experiments on the RoIC13 dataset.
Aspect ratio robustness is verified by our experimental results on the MSRATD500 dataset, which contains many text lines of extreme aspect ratios.
Robustness to shape variations is evaluated with end-to-end recognition performance on the Total-Text dataset, which contains text instances of various shapes, including horizontal, oriented, and curved shapes.
The challenges in the IC15 dataset mainly lie in the low-resolution and small text instances.
Although Mask TextSpotter v3 is far more robust to rotated text variations than the existing state-of-the-art scene text spotters, it still suffers minor performance disturbance with some extreme rotation angles, e.g. 90◦ , as shown in Fig. 4.3.