Precise News Video Text Detection and Text Extraction Based on Multiple Frames Integration
Advisor: Dr. Shwu-Huey Yen
Student: Hsiao-Wei Chang



Outline
1 Introduction
2 Related work
3 Edge detection
4 Proposed Method
5 Experimental Results
6 Conclusion

1 Introduction
Video text can be scene text or caption text. Caption text, static or scrolling, is superimposed at a later stage of video production. Most static text provides a concise and direct description of the content presented in a news video.

The procedure of video textual information extraction can be divided into two stages: detection and extraction.

2 Related work
Video text detection methods can be classified into three classes. The first class is texture-based. The second class presumes that a text string has a uniform color. The third class is edge-based.

Text extraction methods can be classified into two classes: global methods, which use a single threshold for the whole image, and local methods, which adapt the threshold per window.

Niblack proposed calculating the threshold by shifting a window across the image. The threshold is determined by the formula T(x,y) = m(x,y) + k * s(x,y), where m(x,y) and s(x,y) are the local mean and standard deviation, respectively, and k is a parameter.

Sauvola's method builds on Niblack's algorithm. The threshold is determined by the formula T(x,y) = m(x,y) * (1 + k * (s(x,y)/R - 1)), where m(x,y) and s(x,y) are as in Niblack's formula, R is the dynamic range of the standard deviation, and k is a parameter.
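A minimal NumPy/SciPy sketch of both local thresholds, written directly from the two formulas above; the window size and the values of k and R here are illustrative assumptions, not values taken from this work.

import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(gray, window=15):
    # Local mean m(x,y) and standard deviation s(x,y) over a sliding window.
    g = gray.astype(np.float64)
    m = uniform_filter(g, size=window)
    s = np.sqrt(np.maximum(uniform_filter(g * g, size=window) - m * m, 0.0))
    return m, s

def niblack_threshold(gray, window=15, k=-0.2):
    m, s = local_stats(gray, window)
    return m + k * s                      # T(x,y) = m(x,y) + k * s(x,y)

def sauvola_threshold(gray, window=15, k=0.5, R=128.0):
    m, s = local_stats(gray, window)
    return m * (1.0 + k * (s / R - 1.0))  # T(x,y) = m(x,y) * (1 + k*(s/R - 1))

# Usage: binary = gray > sauvola_threshold(gray)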

3 Edge detection
3.1 Sobel edge detection
The Sobel masks are
the horizontal mask: Gx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
the vertical mask: Gy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

The magnitude of the gradient is then calculated using the formula |G| = sqrt(Gx^2 + Gy^2).
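As a sketch, the masks and the magnitude formula above translate directly into a few lines of NumPy/SciPy:

import numpy as np
from scipy.ndimage import convolve

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # horizontal mask
KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float)  # vertical mask

def sobel_magnitude(gray):
    gx = convolve(gray.astype(float), KX)
    gy = convolve(gray.astype(float), KY)
    return np.sqrt(gx * gx + gy * gy)     # |G| = sqrt(Gx^2 + Gy^2)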

3.2 Canny edge detection
The Canny algorithm has three parameters: the size of the Gaussian filter and two thresholds. The use of two thresholds with hysteresis allows more flexibility than a single-threshold approach.
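A brief OpenCV sketch of where the three parameters enter; the kernel size and the two thresholds here are illustrative choices, not the ones used in this work.

import cv2

def canny_edges(gray, ksize=5, low=50, high=150):
    # Parameter 1: size of the Gaussian smoothing filter.
    smoothed = cv2.GaussianBlur(gray, (ksize, ksize), 0)
    # Parameters 2 and 3: hysteresis thresholds. Pixels above `high` are
    # strong edges; pixels between `low` and `high` are kept only if they
    # connect to a strong edge.
    return cv2.Canny(smoothed, low, high)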

(a) original image, (b) Sobel edge map, (c) Canny edge map.

4 The Proposed Method
4.1 Text Detection
People need 2 seconds or more to process a complex scene. Thus, if a video is played at f frames per second, we are interested in video text that stays at a fixed location for at least 2f consecutive frames. Let k be the nearest integer that is not less than f.

On the m-th round, four reference frames, namely frames (m-1)*k + i for i = 1, ⌊k/3⌋, ⌊2k/3⌋, ⌊3k/3⌋, are chosen. Let h and w be the height and width of the presumed character size.
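Under the floor reading of the garbled brackets above, the reference-frame indices for round m reduce to a one-liner; this is a sketch of that assumption, not necessarily the thesis's exact choice:

def reference_frames(m, k):
    # Frames (m-1)*k + i for i = 1, floor(k/3), floor(2k/3), k.
    return [(m - 1) * k + i for i in (1, k // 3, (2 * k) // 3, k)]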

Fig. 1. The flowchart of the proposed approach.

Step 1: Get the four reference frames from the given round of video frames and transform them into grayscale images using Eq. (1):
Y = 0.299 R + 0.587 G + 0.114 B, (1)
where Y is the intensity value.
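A one-function sketch of Eq. (1), assuming the standard BT.601 weights given above:

import numpy as np

def to_grayscale(rgb):
    # Y = 0.299 R + 0.587 G + 0.114 B, Eq. (1)
    return (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1]
            + 0.114 * rgb[..., 2]).astype(np.uint8)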

Fig. 2. Four color reference frames.

Fig. 3. Four grayscale reference images.

Step 2: Execute the edge detection. The Canny edge detector is applied to each grayscale image, yielding an edge map. This is followed by a simple deletion of horizontal and vertical lines that are too long.

Fig. 4. Four Canny edge detection maps after removing lines that are too long.

Step 3: Do a logical AND on the Canny edge maps. Note that after the AND operation, a position (i, j) is true (an edge pixel) only if all four Canny edge maps are true at (i, j).
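Step 3 is a single reduction over the four boolean edge maps; a minimal NumPy sketch:

import numpy as np

def and_edge_map(edge_maps):
    # (i, j) is an edge pixel only if it is an edge in all four maps.
    return np.logical_and.reduce(edge_maps)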

Fig. 5. The AND-edge-map: the result of the AND operation on the four Canny edge maps of Fig. 4.

Step 4: Mask text location. A three-stage technique is designed to find the text mask. (a) A window of size w × h (the presumed character size) slides from left to right (per column) and top to bottom (per row) over the AND-edge-map.

Within each window, the value BWT counts the transitions from black to white or from white to black, where b(·) = 1 if a pixel is black and 0 otherwise. If BWT is larger than the threshold T_BWT, the window is masked.

Fig. 6. (a) Edge of the string. (b) Edge of the English letter “I”.

The threshold T_BWT depends on the character size. Among English letters, “I” has the fewest black-and-white transitions. In Fig. 6, each letter measures approximately 10 × 18, and the BWT in Fig. 6(b) is 72, which is 0.4 of 10 × 18. With T_BWT = α(w × h), the experimental results on many multilingual video texts are satisfying when α is 0.35.
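A sketch of stage (a) with the T_BWT rule above. Counting transitions along both the rows and the columns of the window is an assumption about the exact BWT definition, and the brute-force scan is written for clarity, not speed.

import numpy as np

def rough_text_mask(and_map, w=20, h=20, alpha=0.35):
    b = and_map.astype(np.int64)             # 1 = edge (black), 0 = white
    mask = np.zeros_like(b)
    t_bwt = alpha * w * h                    # T_BWT = alpha * (w x h)
    for y in range(b.shape[0] - h + 1):
        for x in range(b.shape[1] - w + 1):
            win = b[y:y + h, x:x + w]
            bwt = (np.abs(np.diff(win, axis=1)).sum()
                   + np.abs(np.diff(win, axis=0)).sum())
            if bwt > t_bwt:
                mask[y:y + h, x:x + w] = 1   # mask the whole window
    return mask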

(b) The rough mask is examined. A horizontal line segment of length w comprising the points (i, j), …, (i+w-1, j) is eliminated if none of these points is an edge pixel, and a vertical line segment of length h comprising the points (i, j), …, (i, j+h-1) is eliminated if none of them is an edge pixel.

(c) Because the background and contrast vary across the reference frames, the Canny edge maps of the same text on different frames may differ by a few pixels. This usually causes characters to lose some pixels in the AND operation.

To alleviate this problem, a morphologically compensated image M is obtained by first applying a closing. Small isolated blobs are then removed, followed by a dilation to extend the text region.
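An OpenCV sketch of this compensation step; the kernel sizes and the minimum blob area are illustrative assumptions.

import cv2
import numpy as np

def compensate(mask, close_ksize=3, min_area=20, dilate_ksize=3):
    kernel = np.ones((close_ksize, close_ksize), np.uint8)
    m = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    # Remove small isolated blobs.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(m, connectivity=8)
    for i in range(1, n):                    # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < min_area:
            m[labels == i] = 0
    # Dilate to extend the text region.
    return cv2.dilate(m, np.ones((dilate_ksize, dilate_ksize), np.uint8))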

(a) The obtained rough mask. (b) Non-text pixel removal. (c) Isolated noise removal. (d) Morphological compensation.

Step 5: Exact text location. For the refinement, we examine every white pixel. If a white pixel is located at position (i, j), then the points above, below, left, and right of (i, j) in the edge map are examined.

In checking the upward direction on the binary edge map, if the points (i, j-1), (i, j-2), …, (i, j-k+1) are all white and (i, j-k) is the first edge pixel (black point), then every point (i, j-1), (i, j-2), …, (i, j-k+1) in the rough text (grayscale) image must accordingly be non-text. Thus, any non-white pixels among them are converted to white in the rough text image.
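A sketch of the upward check only (the other three directions are symmetric); it assumes (i, j) indexes (column, row) as in the text, with white meaning non-edge in the edge map.

def refine_up(edge, rough, i, j):
    # Walk upward from (i, j) in the binary edge map to the first edge pixel.
    k = 1
    while j - k >= 0 and not edge[j - k, i]:
        k += 1
    if j - k >= 0:                   # first edge pixel found at (i, j - k)
        # The run between it and (i, j) is non-text: force it to white.
        rough[j - k + 1:j, i] = 255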

(a) The rough text image. (b) The result after refinement. The effectiveness of the text refinement, before and after.

Finally, to label the detected texts, we do a simple binarization on the refined text image followed by a morphological operation to connect texts that are close to each other.

Fig. 9. Some results of our method.

4.2 Text Extraction
The flowchart of the proposed approach is given in Fig. 12.

Fig. 12. The flowchart of the proposed approach.

First, for video text captured in a tightly bounded box, we extend both the upper and lower boundaries of the box by at least 1 pixel to ensure that sufficient pixels are included for the application of the Canny edge detector.

We scan the pixel intensities of each column both from top to bottom and from bottom to top until we reach a pixel identified as a Canny edge, or the end of the boundary if there is no Canny edge. When the scan is complete, we obtain the range of the background intensity.

Let b_min and b_max be the smallest and largest intensity values among the pixels encountered. The range of the background intensity is thus defined as B = [b_min, b_max]. It is possible that B covers the entire intensity range. Based on B, we discuss the binarization in two cases.
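A sketch of the column scan that produces B; it walks each column from both ends and stops at the first Canny edge pixel, as described above.

import numpy as np

def background_range(gray, canny):
    values = []
    rows, cols = gray.shape
    for x in range(cols):
        for order in (range(rows), range(rows - 1, -1, -1)):
            for y in order:              # top-down, then bottom-up
                if canny[y, x]:          # stop at the first edge pixel
                    break
                values.append(int(gray[y, x]))
    return (min(values), max(values)) if values else None  # B = [b_min, b_max]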

Case (1): B does not cover the entire range of the histogram. We use the interval B to determine the range of the foreground intensity F. Since the text polarity is initially unknown, the midpoint of B is used to estimate it.

If the intensity of the midpoint of B is bright (> 128), the text has positive polarity; otherwise it has negative polarity.

Fig.: grayscale image, histogram, and intensity information.

We define the text intensity range T as T = [p - kσ, p + kσ], where p and σ are the peak and the standard deviation of F, and k is a constant that is fixed in our experiments.

(a) original image, (b) extended image, (c) grayscale image, (d) Canny edge map, (e) final image for OCR.

Case (2): B covers the entire range of the histogram.
Fig. 15. Original image.
Fig. 16. Binarization results of Otsu’s, Lin’s, Niblack’s, and Sauvola’s methods.

To solve this problem, we reverse the refined image to recover the original text polarity and repeat the steps, i.e., producing the histogram, the Canny edge map, and the background range B on this reversed image. Now the background intensity range B is [0, 97].

Fig.: refined image, reversed image, Canny map, histogram, and intensity information.

For each background pixel P, if its intensity lies within the text intensity range T, its intensity value is replaced by a new value computed from I(P), where I(P) is the intensity of P.

Fig. 20. (a) Modified background intensities of Fig. 15; (b) binarization result of the proposed algorithm.

5 Experimental Results
5.1 Text Detection Results
The proposed text detection algorithm was evaluated on multilingual video clips totaling approximately 30 minutes. All of these videos have a resolution of 400 × 300. The presumed largest character size w × h was set to 20 × 20, and α in T_BWT was set to 0.35.

Fig. 22. Three bounding boxes for the same text, from large (a) to small (c).
Fig. 23. Detected boxes (in yellow) for a ground-truth text (in blue): (a) and (b) fail to detect it; (c)–(e) truly detect it, but only (e) matches accurately.

In Fig. 22, the dimensions of the bounding boxes are 88 × 20, 86 × 18, and 84 × 16 for (a), (b), and (c), respectively. Any of them could be considered correct, yet the area of (c) is only 76.4% of that of (a). To accommodate such cases, we take the medium bounding box, the one in (b), as the ground truth.

A detected box truly detects the text if the area ratio r, defined in (3), is at least 50%:
r = Area(D_BOX ∩ G_BOX) / Area(D_BOX ∪ G_BOX), (3)
where D_BOX is a detected box, like the yellow boxes in Fig. 23, and G_BOX is the ground-truth text box, like the blue boxes in Fig. 23.
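A sketch of the area ratio under the intersection-over-union reading of (3) reconstructed above; boxes are (x, y, width, height) tuples.

def area_ratio(d_box, g_box):
    ax, ay, aw, ah = d_box
    bx, by, bw, bh = g_box
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# r >= 0.5: truly detected; r >= 0.8: accurately detected.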

To focus on bounding preciseness, if the area ratio r is at least 80%, we say the text is accurately detected. We use recall (R), precision (P), and quality of bounding preciseness (Q), as in (14), (15), and (16), to measure the efficacy of the algorithms.

R = D_BOX_T / (# of G. boxes) (14)
P = D_BOX_T / D_BOX_F (15)
Q = (# of boxes with r ≥ 80%) / D_BOX_T (16)
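Reading (14)–(16) off the totals in Table 2 gives a quick sanity check:

def metrics(n_gbox, d_box_f, d_box_t, n_accurate):
    recall = d_box_t / n_gbox            # (14)
    precision = d_box_t / d_box_f        # (15)
    quality = n_accurate / d_box_t       # (16)
    return recall, precision, quality

# The reported averages: metrics(3104, 3106, 3025, 2903)
# -> (0.9745..., 0.9739..., 0.9596...)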

Fig. 24. The ambiguities in defining a ground truth for “TALK & VIRGINIA SHOWDOWN”: cases (a)–(d).

Take the characters “TALK & VIRGINIA SHOWDOWN” in Fig. 24 as an example. We accept all of cases (a)–(d) as ground truths. However, cases (a) and (b) are better, because detected regions should be as tight as possible.

Fig. 25. Some results of our method (size 400 × 300). (a) and (b) are CNN videos, (c) an ESPN video, (d) an NHK video from Japan, and (e) and (f) two different news videos from Taiwan.

Table 1. Results on different video sources

Video Source | Type/Language | Length | # of G. boxes | D_BOX_F | D_BOX_T | # of False Alarms | # w. r < 50% | # w. 50% ≤ r < 80% | # w. r ≥ 80%
CNN   | News/English  | 9'16"  | 404  | --   | 400  | -- | -- | 9   | 391
ESPN  | Sport/English | 5'07"  | 111  | 113  | 100  | -- | -- | 29  | 71
NHK   | News/Japanese | 5'35"  | 310  | --   | 310  | -- | -- | 0   | 310
ETTV  | News/Chinese  | 5'29"  | 939  | --   | 939  | -- | -- | 27  | 912
TVBS  | News/Chinese  | 4'51"  | 1340 | --   | 1276 | -- | -- | 57  | 1219
Total | --/--         | 30'18" | 3104 | 3106 | 3025 | -- | -- | 122 | 2903

Table 2. Results on R, P, and Q

Channel | Recall             | Precision          | Quality
CNN     | 400/404 = 99.01%   | --                 | 391/400 = 97.75%
ESPN    | 100/111 = 90.09%   | 100/113 = 88.50%   | 71/100 = 71.00%
NHK     | 310/310 = 100.00%  | --                 | 310/310 = 100.00%
ETTV    | 939/939 = 100.00%  | --                 | 912/939 = 97.12%
TVBS    | 1276/1340 = 95.22% | --                 | 1219/1276 = 95.53%
Average | 3025/3104 = 97.45% | 3025/3106 = 97.39% | 2903/3025 = 95.97%

We summarize the contributions of the proposed algorithm as follows:
- Precise boxes.
- Both positive and negative text polarities.
- Any alignment and length of text strings.
- Various sizes, fonts, and multi-color characters.

The worst experimental result is on the ESPN video, due to its small, isolated, and sparsely located characters. We repeated the experiment with the presumed character size w × h halved (10 × 10).

Fig. 28. Some results of other methods: (a) Hua et al., 2001; (b) Wang et al., 2004; (c) Anthimopoulos et al., 2008; (d) Huang et al.

Our proposed algorithm also works well on images of a different size (640 × 480).

5.2 Text Extraction Results
We evaluated the performance of our proposed method by comparing it with the Otsu, Lin, Niblack, and Sauvola methods.

Fig.: original image; results of Otsu’s, Lin’s, Niblack’s, and Sauvola’s methods; and our method.

Fig.: original image; results of Otsu’s, Lin’s, Niblack’s, and Sauvola’s methods (with w = 20, k = 0.2); and our method.

Fig.: results of Niblack’s method, Sauvola’s method, and our method.

Fig.: original image (cited from [30]); results of Niblack’s method, the approach of [25], and the approach of [30] (all cited from [30]); and our method.

Fig.: original image (cited from [29]); results of Otsu’s and Lin’s methods; Niblack’s, Sauvola’s, adaptive Niblack’s, and adaptive Sauvola’s methods (cited from [29]); and our method.

Fig.: original image (cited from [29]); results of Otsu’s and Lin’s methods; Niblack’s, Sauvola’s, adaptive Niblack’s, and adaptive Sauvola’s methods (cited from [29]); and our method.

6 Conclusion
We proposed a general text detection algorithm that is applicable to multilingual news videos without any limitations on text color, font, size, alignment, or length of text strings. The algorithm achieves excellent performance in recall (R), precision (P), and quality of bounding preciseness (Q).

For text extraction, our proposed method combines the merits of global and adaptive methods. It is parameter-free, computationally efficient, and robust to varied video text, including different colors, font sizes, and alignments.

Thank you!