Text Detection, Tracking and Recognition in Complex Images and Videos. Xu-Cheng Yin (殷绪成), Ph.D./Prof., Pattern Recognition and Information Retrieval Lab

Pattern Recognition and Information Retrieval Lab, Department of Computer Science and Technology, University of Science and Technology Beijing

Background: Text in Video (Scenes)
Layered Caption Text Embedded Caption Text Scene Text

Background: Applications of Image/Video Text Extraction
Video Understanding and Retrieval NIST TRECVID MED (Multimedia Event Detection) *Results from “Retrieving videos using content and concepts” (J. Dalton, P. Mirajkar and J. Allan, CIKM’14).

Background: Reading in the Wild (Applications)
Signage text is widely used as a visual indicator for navigation and notification in scenes. Example applications: the handheld AR translator, the autonomous robot, the guide dog, user navigation, and the mobile head-mounted device.

Background: Challenges for text detection, tracking and recognition in video and scenes
Complex Background Complicated Foreground (Text) Varied Videos (Specifically)

Outline Unified Framework for Video Text Detection, Tracking and Recognition Robust Text Detection: From Horizontal (Near Horizontal) Scene Text, to Multi-Orientation Scene Text, to Video Text End-to-End Scene Text Recognition Future Directions and Discussions

Unified Framework for Video Text Detection, Tracking and Recognition

Text Detection, Tracking and Recognition in Video (Scene Images): A Survey
Video Text Detection and Recognition with Frame by Frame Video Text Detection, Tracking and Recognition with Multiple Frames * Xu-Cheng Yin, Ze-Yu Zuo, Shu Tian, and Cheng-Lin Liu, “Text Detection, Tracking and Recognition: A Comprehensive Survey,” IEEE TIP, submitted (with revision), 2015.

Robust Text Detection: From Horizontal (Scene) Text, to Multi-orientation, to Video Text
Horizontal (near horizontal) scene text detection (Adaptive hierarchical clustering with distance metric learning) Multi-orientation scene text detection Tracking based text detection in scene videos

(Horizontal) Text detection in natural scenes: Background
Challenges with scene text detection Complex background Variations of font and size Variations of text color Variations of illumination

Text detection in natural scenes: Review
Previous text detection technologies
Region-based (sliding window-based) methods:
K. Kim et al., “Texture-based approach for text detection in images using SVM …”, TPAMI 2003.
X. Chen and A. Yuille, “Detecting and reading text in natural scenes”, CVPR 2004.
T. Wang, D.J. Wu, A. Coates, and A. Y. Ng, “End-to-end text recognition with CNN”, ICPR 2012.
Drawback: very slow (every pixel, at multiple scales).
Connected components-based methods:
B. Epshtein et al., “Detecting text in natural scenes with stroke width transform (SWT)”, CVPR 2010.
C. Yao, X. Bai et al., Detecting texts of arbitrary orientations in natural images…, CVPR 2012, TIP 2014.
W. Huang et al., Text localization in natural images with Stroke Feature Transform …, ICCV 2013, ECCV 2014.
C. Yi and Y. Tian, Text string detection from natural scenes with boundary clustering, stroke segmentation, structure modeling, …, TIP 2011, TIP 2012, CVIU 2013.
Y.-F. Pan, X. Hou and C.-L. Liu, “A hybrid approach to detect and localize texts in natural scene images”, TIP 2011.
Drawback: fragility of the connected component (CC) computation.

Text detection in natural scenes: Review
Recent MSER/ER-based text detection technologies
Maximally Stable Extremal Region (MSER/ER): robust to color, size, illumination, and resolution.
MSER/ER-based detection: a specific category of CC-based methods that uses MSERs/ERs as character candidates (these have become the focus of recent projects).
L. Neumann and J. Matas, (Realtime) Text localization and recognition in real-world images, ACCV 2010, ICDAR 2011/2013, CVPR 2012, ICCV 2013.
H.I. Koo and D.H. Kim, “Scene text detection via connected component clustering and nontext filtering”, TIP 2013.
C. Shi, C. Wang, B. Xiao, et al., Scene text detection using graph model, MSER, CRF, …, Pattern Recognition Letters 2013, CVPR 2013, ICDAR 2013, TCSVT 2014, PR 2014.
L. Sun, Q. Hou, et al., Robust text detection in natural scene images by Generalized Color enhanced contrasting extremal region, … ICPR 2012, ICDAR 2013, ICPR 2014.
L. Kang, D. Doermann, et al., Orientation robust text line detection with HOCC…, CVPR 2014.
X.-C. Yin, et al., “Robust text detection in natural scenes,” TPAMI 2014.

Text detection in natural scenes: Motivation
Main pitfalls of MSER/ER-based text detection methods, each followed by our remedy in parentheses: Most of the detected character candidates (MSERs/ERs) correspond to non-characters (MSER pruning). Text candidates construction is insufficient, with time-consuming and error-prone pruning that relies on parameter tuning in rule-based methods (adaptive hierarchical clustering with metric learning). The text candidate classifier is trained on unbalanced data (eliminating most non-text candidates with the character classifier).

Text detection in natural scenes: System overview
Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, “Robust text detection in natural scene images,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), 36(5): , 2014.

Text detection in natural scenes: Highlights
An MSER pruning algorithm that minimizes regularized variations is proposed to remove most of the non-characters. Character candidates are clustered into text candidates by an adaptive single-link clustering algorithm, in which the distance weights and the threshold are learned simultaneously with a self-training metric learning algorithm. The posterior probabilities that text candidates correspond to non-text are estimated with the character classifier, and text candidates with a high non-text probability are removed efficiently.

Text detection in natural scenes: Key technologies
Character candidates extraction with MSER pruning Text candidates construction with adaptive hierarchical clustering and distance metric learning Text candidates elimination with the character classifier

Character Candidates Extraction

Character Candidates Extraction (Variation regularization)

Text Candidates Construction
Clustering-based text candidates grouping from character candidates (MSERs) Clustering: single-link clustering (elongated clusters) Similarity: weighted distance Threshold: threshold for deciding the number of clusters

Adaptive single-link clustering with distance metric learning
Feature space (similarity)

Weighted distance Clusters How to select weights and threshold? Rule-based: time consuming and error-prone Clustering-based: a separate two-stage learning style (first weights, then threshold) Adaptive (single-link) clustering where distance weights and threshold are learned simultaneously using a self-training metric learning algorithm.
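The grouping step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature layout (x-centre, y-centre, height), the weights, and the threshold are hypothetical values chosen by hand, whereas the slides learn the weights and the threshold from data. Two candidates are linked whenever their weighted distance is at most the threshold, and single-link clusters are the connected groups of such links.

```python
from itertools import combinations


def weighted_distance(u, v, weights):
    """Weighted sum of absolute feature differences between two candidates."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))


def single_link_clusters(features, weights, threshold):
    """Single-link clustering: link every pair whose distance is <= threshold."""
    parent = list(range(len(features)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(features)), 2):
        if weighted_distance(features[i], features[j], weights) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical character candidates: (x-centre, y-centre, height).
candidates = [(10, 50, 20), (32, 51, 21), (55, 49, 20), (200, 120, 40)]
clusters = single_link_clusters(candidates, weights=(0.05, 0.5, 0.5), threshold=2.7)
print(clusters)
```

In this toy input, candidates 0 and 2 are farther apart than the threshold, yet they end up in one cluster because both link to candidate 1. That chaining is what produces the elongated clusters single-link clustering is chosen for.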

(1) Sample selection Focus on the hardest part (closest and farthest data)

(2) Weight conversion Original: Converted: ( weights and threshold learned simultaneously)

(3) Model determination With the logistic regression loss, a discriminative model is designed by Distance metric learning:
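The equations for this step did not survive in the transcript, so the following is only a sketch of the general idea under stated assumptions. It assumes each character pair is described by a vector d of non-negative feature differences, labelled 1 if the pair belongs to the same text and 0 otherwise, and it models the probability of "same text" as sigmoid(eps - w·d). Fitting that model with the logistic loss learns the weights w and the threshold eps at the same time. The function name, the toy data, and the plain gradient descent are all hypothetical; the self-training loop of step (4) is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_weights_and_threshold(pairs, labels, lr=0.1, epochs=5000):
    """Fit w and eps so that sigmoid(eps - w.d) approximates P(same text | d)."""
    dim = len(pairs[0])
    w = [0.0] * dim
    eps = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_eps = 0.0
        for d, y in zip(pairs, labels):
            p = sigmoid(eps - sum(wi * di for wi, di in zip(w, d)))
            err = p - y  # derivative of the logistic loss w.r.t. the logit
            grad_eps += err
            for k in range(dim):
                grad_w[k] -= err * d[k]  # the logit depends on -w.d
        eps -= lr * grad_eps / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, eps


# Hypothetical pair features: small differences for same-text pairs.
pair_features = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
                 (1.5, 1.0), (1.2, 1.8), (2.0, 1.4)]
labels = [1, 1, 1, 0, 0, 0]
w, eps = train_weights_and_threshold(pair_features, labels)

# A pair is linked when its weighted distance is at most the learned threshold.
linked = [sum(wi * di for wi, di in zip(w, d)) <= eps for d in pair_features]
print(w, eps, linked)
```

Treating the threshold as the bias term of the classifier is what lets one optimization produce both quantities, instead of the two-stage "first weights, then threshold" style.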

(4) Self-training algorithm

Adaptive hierarchical clustering with distance metric learning
Algorithm’s application Scene character candidates grouping (scene text detection) [1] Handwriting text grouping [3] Algorithm’s extension From single-link clustering to general hierarchical clustering [2] From hierarchical clustering to partitional clustering (e.g., k-means ??) [1] Xu-Cheng Yin, et al., “Robust text detection in natural scene images,” IEEE TPAMI, 36(5), 2014. [2] Xu-Cheng Yin, et al., “Multi-orientation scene text detection with adaptive clustering”, IEEE TPAMI, 2015. [3] Adrien Delaye et al., “…online document segmentation by pairwise stroke distance learning,” Pattern Recognition, 2015.

Text Candidates Elimination
Empirical results: in the ICDAR 2011 competition training set, only 9% of the text candidates correspond to true text, so it is hard to train an effective text classifier on such an unbalanced dataset. Text candidates elimination: most methods are based on rules and heuristics. Our discriminative method uses a character classifier to estimate the posterior probabilities that text candidates correspond to non-text, and removes candidates with a high non-text probability.
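The slides do not give the formula that turns character-classifier outputs into a posterior for a whole text candidate, so the sketch below uses one simple stand-in: it assumes the character classifier returns, for each character candidate, the probability that it is a real character, and it treats the candidates as independent so that the probability of "all non-characters" is the product of the complements. The function names and the 0.5 cut-off are hypothetical.

```python
from math import prod


def nontext_posterior(char_probs):
    """Probability that no member is a real character, assuming independence."""
    return prod(1.0 - p for p in char_probs)


def eliminate_nontext(text_candidates, max_nontext=0.5):
    """Keep only the candidates whose non-text probability is not too high."""
    return [name for name, char_probs in text_candidates
            if nontext_posterior(char_probs) <= max_nontext]


# Hypothetical text candidates with per-character classifier outputs.
text_candidates = [
    ("EXIT sign", [0.9, 0.8, 0.95, 0.7]),
    ("fence bars", [0.05, 0.1, 0.02]),
]
kept = eliminate_nontext(text_candidates)
print(kept)
```

Because the decision reuses the character classifier, no separate text-level classifier has to be trained on the heavily unbalanced candidate set.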

Text Candidates Elimination

Experiments: on the ICDAR 2011 Robust Reading Competition set (Challenge 2: Reading Text in Scene Images). Table notes: 1, 2, 3, 4: the top 4 winners of ICDAR 2011 (Kim’s, Yi’s, TH-TextLoc System, and Neumann’s); Shi et al.’s (Pattern Recognition Letters, 2013(2)); Neumann and Matas’s (CVPR 2012).

Experiments: speed on the ICDAR 2011 data set
Method | Time (s) per image | Remarks
Our method | 0.43 | A Linux laptop with an Intel(R) Core(TM)2 Duo 2.00 GHz CPU
Shi et al.’s | 1.5 | A PC with an Intel(R) Core(TM)2 Duo 2.33 GHz CPU
Neumann and Matas’s | 1.8 | A “standard PC”

Experiments (ICDAR 2011 Samples)
Notice the robustness against low contrast, complex background and font variations.

Experiments: on a publicly available multilingual (Chinese and English) dataset. Table notes: 1, 2, 3. Scheme III: constructed on the ICDAR 2011 training set. Scheme IV: constructed on the multilingual training set. Pan et al.’s method (Yifeng Pan, Xinwen Hou, and Cheng-Lin Liu, IEEE TIP 20(3), 2011). The speed of Pan et al.'s method was measured on a PC with a Pentium D 3.4 GHz CPU.

Experiments (Multilingual Samples)

Demos Online Demos

Demos APP Demo on Android Mobile Phones, iPhone and Tablets

ICDAR 2013 Robust Reading Competition Results
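Technology Awards: At the 2013 international document analysis and recognition technical competitions, our technology won three first places in the Robust Reading Competition, the most closely watched contest of the event: “Text Detection in Natural Scenes”, “Text Detection in Web Images”, and “Text Extraction in Web Images”. The winning results for “Text Extraction in Web Images” and “Text Detection in Web Images” were 19.36% and 8.37% higher, respectively, than those of the second place. In particular, since the “Text Detection in Natural Scenes” task was established at the conference in 2003, its technical difficulty and practical importance have attracted nearly thirty teams from more than ten countries, including the United States, Germany, China, France, Singapore, Russia, and Japan. The participants have included world-leading research teams in document analysis and recognition, pattern recognition, computer vision, and artificial intelligence from the University of California, the City University of New York, Tsinghua University, the Institute of Automation of the Chinese Academy of Sciences, and the National University of Singapore. The competition has become the most important international contest and benchmark for evaluating the latest research progress in text detection and recognition in natural scenes and images. This year, our technology achieved the best performance in the ten-year history of the competition, and it is the first time that a Chinese research institution has won this first place. At the conference, the technology drew close attention from researchers at multinational companies such as Samsung, Google, Microsoft, Amazon, Motorola, Canon, and Fujitsu. Note: the International Conference on Document Analysis and Recognition (ICDAR), organized by the International Association for Pattern Recognition (IAPR), is one of the world's most important international academic conferences in document analysis and recognition and in pattern recognition. It is held every two years; starting with the first edition in 1991, twelve editions have been held up to this year (2013).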

Results for the ICDAR 2013 Robust Reading Competition (Challenge2: Text Localization in Real Scenes)

Results for the ICDAR 2013 Robust Reading Competition (Challenge1: Text Localization in Born-Digital Images (Web and ))

Multi-orientation text detection: Background
Most scene text detection methods are designed specifically for (near) horizontal text. Few methods handle multi-orientation scene text. Very few methods handle multi-view scene text. [1] C. Yao, et al., CVPR’12 / C. Yao, et al., PLOS One 2013 / IEEE TIP 2014. [2] L. Gomez and D. Karatzas, ICDAR’13. [3] X.-C. Yin, et al., IEEE TPAMI 2014 / IEEE TPAMI 2015. [4] T. Phan, et al., ICCV’13.

Multi-orientation scene text detection: Background
Features for text (line) region grouping Results from X.-C. Yin, et al., IEEE TPAMI, 2014.

Multi-orientation scene text detection: Background
Challenges for multi-orientation and multi-view texts. Text (top and bottom) line alignment features: alignments no longer stay in the (near) horizontal orientation; they hold across multiple orientations and views (coarse-to-fine grouping with adaptive clustering). Other features (interval, size, …) are unstable and more complicated: parameters and classifiers are learned with multi-orientation and multi-view features.

Multi-orientation scene text detection: System overview
Our (previous) efficient infrastructure of scene text detection Two schemes In X.-C. Yin, et al., IEEE TPAMI 2014.

Multi-orientation and multi-view scene text detection: Scheme-I
(Additional Stage) Orientation Computation

Arbitrary orientation computation with “Forward-Backward” algorithm
Component (pair) expansion: First, sort pairs by priority: (char, char), (char, non-char), (non-char, non-char). Then, begin with the highest-priority pair (Fig. a) and expand in the forward direction (c1 → c2) (Fig. b). If the forward direction is no longer expandable, turn to backward expansion (Fig. d). Next, begin detecting the next line, starting from the pair with the next highest priority. Expansion requirements: angle (orientation) requirement; distance requirement; (minor) optimal component.
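A rough sketch of the forward-backward expansion for a single seed pair is given below. It is a simplification under stated assumptions: components are reduced to centroid points, the angle and distance limits are hypothetical values, the nearest admissible component stands in for the "optimal component" rule, and the priority ordering of seed pairs is left out.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two directions, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def expand(line, pool, direction, max_angle, max_dist, at_end):
    """Greedily extend one end of `line` along `direction`."""
    while True:
        tip = line[-1] if at_end else line[0]
        best = None
        for c in pool:
            heading = math.atan2(c[1] - tip[1], c[0] - tip[0])
            close_enough = math.dist(tip, c) <= max_dist
            aligned = angle_diff(heading, direction) <= max_angle
            if close_enough and aligned:
                if best is None or math.dist(tip, c) < math.dist(tip, best):
                    best = c
        if best is None:
            return  # this direction is no longer expandable
        pool.remove(best)
        if at_end:
            line.append(best)
        else:
            line.insert(0, best)


def forward_backward(seed, components,
                     max_angle=math.radians(15), max_dist=40.0):
    """Grow a text line from the seed pair (c1, c2): forward, then backward."""
    c1, c2 = seed
    pool = [c for c in components if c not in seed]
    forward = math.atan2(c2[1] - c1[1], c2[0] - c1[0])
    line = [c1, c2]
    expand(line, pool, forward, max_angle, max_dist, at_end=True)
    expand(line, pool, forward + math.pi, max_angle, max_dist, at_end=False)
    return line


# Hypothetical centroids: five on a 45-degree line plus one outlier.
components = [(0, 0), (10, 10), (20, 20), (30, 30), (40, 40), (20, -30)]
line = forward_backward(((20, 20), (30, 30)), components)
print(line)
```

To detect several lines, the same routine would be called again on the components that remain, starting from the next seed pair in priority order.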

Multi-orientation and multi-view scene text detection: Scheme-II
With multi-orientation and multi-view features: grouping with multi-orientation and multi-view features and classifiers; classifiers on multi-orientation and multi-view features. In X.-C. Yin, et al., IEEE TPAMI 2015.

Text candidates construction with coarse-to-fine grouping via adaptive clustering
Morphology clustering Morphology-based grouping: Group character candidates with similar appearance together (color, size, …). Orientation clustering Orientation-based grouping: Group character pairs with consistent orientation together. Projection clustering Projection-based grouping: Separate text lines in the same orientation.

Text candidates construction with coarse-to-fine grouping via adaptive clustering
(a) Original image (b) Character candidates from MSERs (c) Groups constructed by Morphology Clustering (d) Groups constructed by Orientation Clustering (e) Groups constructed by Projection Clustering (f) Final results

Morphology clustering
Adaptive agglomerative hierarchical (single-linking) clustering with distance metric learning

Orientation clustering
Assumptions Character pairs in one consistent partition (group) should fall into a “narrow” space (orientation and intercept) Represented with the partition’s interval. Should have a distance “compactness” Represented with the partition’s central moment.
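To make the two assumptions concrete, the sketch below computes the orientation and intercept of the line through each character pair, then summarizes a partition by the interval (range) of its orientations and by a second central moment. The slides do not state which quantity the central moment is taken over; taking it over the intercepts is an assumption made here, and all function names are hypothetical.

```python
import math


def pair_line(u, v):
    """Orientation in (-pi/2, pi/2] and y-intercept of the line through u and v."""
    dx, dy = v[0] - u[0], v[1] - u[1]
    theta = math.atan2(dy, dx)
    if theta > math.pi / 2:
        theta -= math.pi
    elif theta <= -math.pi / 2:
        theta += math.pi
    # A vertical line has no y-intercept.
    intercept = None if dx == 0 else u[1] - dy / dx * u[0]
    return theta, intercept


def interval(values):
    """Width of the range covered by the values."""
    return max(values) - min(values)


def central_moment(values, order=2):
    """Central moment of the given order (order 2 is the variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


# Hypothetical character pairs (centroids) from one nearly horizontal line.
pairs = [((10, 50), (30, 50)), ((30, 50), (50, 52)), ((50, 52), (70, 52))]
lines = [pair_line(u, v) for u, v in pairs]
thetas = [theta for theta, _ in lines]
intercepts = [b for _, b in lines]
print(interval(thetas), central_moment(intercepts))
```

A partition with a small orientation interval and a small central moment is "narrow" and "compact" in the sense of the assumptions, which is the condition a divisive clustering step would aim for when splitting.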

Step-I: Clustering Clusters Threshold normalization Adaptive hierarchical (divisive) clustering with a distance metric learning framework

Step-II: Verifying with “misregistration”. Basic idea: include pairs of two similar characters at a fair distance, and exclude pairs of different characters and pairs of characters that are too close. “Misregistration” measure function; kernel functions; one simple and specific instance. Verifying: for one orientation, a pair of two similar characters at a fair distance, which yields a fair size difference dsize, can be included in this orientation’s group with high probability (a large misreg). In contrast, a pair of two different and distant characters (very large dsize) will probably be excluded. A pair of two very close characters (very small dsize) will probably be excluded as well, because a small change in the size of these two characters can cause a large orientation difference. Notation: dis(u,v) is the centroid distance between u and v; dsize is the size difference similarity.

Projection clustering
Intercept (modified) Clustering The same divisive binary clustering algorithm. Verifying with “misregistration”

Experiments Performance comparison on MSRA-TD500 database (multi-orientation text dataset)

Experiments Performance on USTB-SV1K database (multi-orientation and multi-view text dataset)

Tracking based text detection in scene videos
Multi-strategy tracking based text detection in scene videos Tracking based text detection in video with dynamic programming *Z-Y Zuo, S. Tian, X.-C. Yin, et al., “Multi-strategy tracking based text detection in scene videos,” ICDAR, 2015.

Challenges of Scene Text Detection, Tracking and Recognition
Multi-orientation Low resolution Motion of the camera or text Varied illumination

How to detect text in scene videos?
Text detection with single images (individual video frames) E.g., Xu-Cheng Yin et al., TPAMI 2014; Xu-Cheng Yin et al. TPAMI 2015; Qixiang Ye and David Doermann, TPAMI 2015; Lei Sun et al., PR 2015. Text detection with multiple frames (Tracking based text detection) Temporal-spatial information based methods Reducing false alarms, e.g., M. Tanaka and H. Goto, ICPR’08/ICDAR’09. Fusion based methods (Multiple frame integration) Reducing the influence of complex background, e.g., multi-frame averaging, time-based minimum/maximum pixel value searching (B. Wang et al., SPIE’13; X. Rong et al., ICME’14, etc.). The combination of scene text detection and tracking techniques is an effective way to improve the accuracy of text detection in scene videos.
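The fusion-based idea mentioned above (multi-frame averaging and temporal minimum/maximum pixel search) can be illustrated with a few lines of plain Python. This is a toy sketch, not any cited method: it assumes the frames are already aligned so that the text stays at the same pixels, and it represents each grayscale frame as a list of rows.

```python
def integrate_frames(frames, mode="mean"):
    """Combine aligned grayscale frames pixel by pixel."""
    reducers = {
        "mean": lambda px: sum(px) / len(px),  # multi-frame averaging
        "min": min,                            # temporal minimum search
        "max": max,                            # temporal maximum search
    }
    reduce_pixels = reducers[mode]
    # zip(*frames) walks the frames row by row; zip(*rows) walks each row
    # column by column, yielding the values of one pixel over time.
    return [[reduce_pixels(px) for px in zip(*rows)] for rows in zip(*frames)]


# Hypothetical 2x2 frames: bright static text (200) over a changing background.
frames = [
    [[200, 10], [30, 200]],
    [[200, 90], [120, 200]],
    [[200, 50], [60, 200]],
]
averaged = integrate_frames(frames, "mean")
darkest = integrate_frames(frames, "min")
print(averaged, darkest)
```

With bright text on a varying background, the temporal minimum keeps the static text pixels at their value while pushing the background towards its darkest observation, which raises the contrast between the two.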

How to track text in scene videos?
Typical tracking strategies. Tracking with template matching (linear or nonlinear), e.g., H. Goto et al., ICPR’06/ICDAR’07; V. Fragoso et al., WACV'11: template matching tackles text blur, but it has difficulty handling multi-scale challenges. Tracking with particle filtering (nonlinear filtering), e.g., M. Mirmehdi et al., CBDAR'07/11: addresses the problem that state variables do not follow a Gaussian distribution. Tracking-by-detection (tracking by fusing detection results), e.g., P. X. Nguyen et al., WACV’14; X. Rong et al., ICME’14: associates detected results in successive frames to initialize new trajectories. Our approach: multi-strategy tracking based text detection in scene videos.

Multi-Strategy Tracking
Overview of our method (figure): video frames; text detection; text tracking; multi-strategy tracking, frame by frame.

Scene text detection with adaptive clustering in individual frames: Highlights
An MSER pruning algorithm that minimizes regularized variations is proposed to remove most of the non-characters. Character candidates are clustered into text candidates by an adaptive (single-link) clustering algorithm, in which the distance weights and the threshold are learned simultaneously with a self-training metric learning algorithm. The posterior probabilities that text candidates correspond to non-text are estimated with the character classifier, and text candidates with a high non-text probability are removed efficiently. For details, see: [1] Xu-Cheng Yin, et al., “Robust text detection in natural scene images,” IEEE TPAMI, 2014. [2] Xu-Cheng Yin, et al., “Multi-orientation scene text detection with adaptive clustering,” IEEE TPAMI, 2015.

Multi-Strategy text tracking
Goal: to improve the precision and recall of scene text detection in video. Candidate positions (figure legend): the position from text detection; the position predicted by STCL (Spatio-Temporal Context Learning); the position predicted by linear prediction. The tracking-by-detection method uses detection outputs to create and initialize new trajectories and to amend tracking outputs. Linear prediction is used to predict linear motion, while STCL can be applied to nonlinear motion. The tracking algorithm can be extended or replaced with other tracking techniques to get better performance. An effective integration and selection strategy adaptively determines which candidate text block is the best match. (Figure: the flow chart of our tracking approach.)
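As a small illustration of the linear-prediction candidate, the sketch below extrapolates a text box from its last two positions under a constant-velocity assumption. The box layout (x, y, width, height) and the function name are hypothetical, and the slides do not specify the exact predictor used.

```python
def predict_linear(prev_box, cur_box):
    """Constant-velocity extrapolation: next = current + (current - previous)."""
    return tuple(2 * c - p for p, c in zip(prev_box, cur_box))


# Hypothetical text box (x, y, width, height) in two consecutive frames.
box_t0 = (100, 50, 80, 20)
box_t1 = (104, 52, 80, 20)
predicted = predict_linear(box_t0, box_t1)
print(predicted)  # (108, 54, 80, 20)
```

Such a predictor is only reasonable for roughly linear motion, which is why the approach pairs it with STCL for nonlinear motion and then selects among the candidate positions.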

Tracking by detection The tracking by detection method makes use of detection outputs to create and initialize new trajectories and amend tracking outputs. Hungarian Algorithm
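The association step can be sketched as an assignment problem between existing trajectories and new detections. The Hungarian algorithm solves this problem in polynomial time; to keep the example dependency-free, the sketch below swaps in an exhaustive search over permutations, which finds the same optimum but is only practical for a handful of boxes. Using IoU as the matching score, the 0.3 cut-off, and the box layout (x1, y1, x2, y2) are assumptions made for illustration.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Return (matches, new_detections) maximizing the total IoU."""
    n, m = len(tracks), len(detections)
    best_perm, best_score = (), -1.0
    # Indices >= m act as dummy detections, i.e. "track left unmatched".
    for perm in permutations(range(max(n, m)), n):
        score = sum(iou(tracks[t], detections[d])
                    for t, d in enumerate(perm) if d < m)
        if score > best_score:
            best_perm, best_score = perm, score
    matches = [(t, d) for t, d in enumerate(best_perm)
               if d < m and iou(tracks[t], detections[d]) >= min_iou]
    matched = {d for _, d in matches}
    new_detections = [d for d in range(m) if d not in matched]
    return matches, new_detections


# Hypothetical boxes: two existing trajectories and three detections.
tracks = [(10, 10, 50, 30), (100, 100, 160, 120)]
detections = [(102, 101, 162, 121), (12, 11, 52, 31), (300, 300, 340, 320)]
matches, new_detections = associate(tracks, detections)
print(matches, new_detections)
```

Matched detections amend the corresponding trajectories, and the unmatched ones are the detections that would create and initialize new trajectories.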

Spatio-Temporal Context Learning (STCL)
[24] K. Zhang, L. Zhang, M.-H. Yang, and D. Zhang, “Fast tracking via spatio-temporal context learning,” arXiv preprint arXiv: , 2013

Text Candidates Integration and Selection
(Rule-Based Selection)

Experimental Results. Compared with the text detection results of Yin et al.’s method, text tracking increases the average f-score by 14%. In contrast, the increase for Minetto et al.’s method is only 8%.


Experimental Results. Table 2. Comparative tracking based scene text performance (%). Table labels: scene text detection; the evaluation scheme (ICDAR competition).
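For readers unfamiliar with the metric, the f-score reported in these results is the harmonic mean of precision and recall. The sketch below computes it from simple counts; the counts are made up, and the actual ICDAR evaluation scheme matches detected and ground-truth boxes in a more elaborate way that is not reproduced here.

```python
def f_score(true_pos, false_pos, false_neg):
    """Precision, recall, and their harmonic mean from detection counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 correct boxes, 20 false alarms, 40 missed boxes.
precision, recall, f = f_score(60, 20, 40)
print(precision, recall, f)
```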

Tracking Based Text Detection with Dynamic Programming

ICDAR 2015 Robust Reading Competition: TASK 3
ICDAR 2015 Robust Reading Competition: TASK 3.1 “(Scene) Text Localisation in Video” Our New Technology The Table is from the Ref: D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, et al., "ICDAR 2015 Competition on Robust Reading," ICDAR 2015.

End-to-End Scene Text Recognition

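An End-to-End Text Recognition Framework Based on Information Sharing and Feedback Strategies
Information sharing and information feedback mechanisms. A text recognition framework based on detection-recognition-understanding information feedback: image preprocessing, text detection, character recognition, and recognition post-processing (semantic understanding).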

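Technology awards: first places in the 2015 international document analysis and recognition technical competitions (Robust Reading Competition). Slide callouts: end-to-end recognition of natural scene text and web images (Generic), 12.12% higher than the second place; 35.03% higher than the second place; first place in video text detection, 16.53% higher than the second place.
We won four first places in the “Robust Reading Competition”, the most closely watched of the ICDAR 2015 technical competitions: “End-to-End Scene Text Recognition (Generic)”, “End-to-End Web Image Text Recognition (Generic)”, “End-to-End Web Image Text Recognition (Weak)”, and “Video Text Detection”. The 2015 technical competitions comprised eleven contests and attracted more than one hundred high-level teams in pattern recognition, document analysis and recognition, computer vision, and related fields from dozens of countries, including China, the United States, Germany, France, the United Kingdom, Japan, Korea, and India.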

Latest results: ICDAR 2015 Robust Reading Competition, End-to-End Text Recognition in Born-Digital Images

Latest results: ICDAR 2015 Robust Reading Competition, End-to-End Text Recognition in Scene Images

Future Directions and Discussions: Scene Text Extraction in Images and Videos
Distorted text detection and recognition Skew (multi-orientation), curved, perspective and unaligned distortions Multilingual text detection and recognition Text tracking in complex videos (scene videos) Unified frameworks for tracking based text detection and recognition Text recognition and retrieval in web pictures and videos Robust text reading in the wild

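Thanks. Q & A. Prof. Xu-Cheng Yin (殷绪成), Ph.D., doctoral supervisor. xuchengyin@ustb.edu.cn
Pattern Recognition and Information Retrieval Lab, Department of Computer Science, University of Science and Technology Beijing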
