Mining Historical Archives for Near- Duplicate Figures Thanawin Rakthanmanon, Qiang Zhu, and Eamonn J. Keogh.

Slides:



Advertisements
Similar presentations
Applications of one-class classification
Advertisements

5th Intensive Course on Soil Micromorphology Naples th - 14th September Image Analysis Lecture 5 Thresholding/Segmentation.
5th Intensive Course on Soil Micromorphology Naples th - 14th September Image Analysis Lecture 5 Thresholding/Segmentation.
How to use Motif Join UI. 1. Open Command Window and type Motif_Join to call Motif_Join.m.
Indexing DNA Sequences Using q-Grams
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
QR Code Recognition Based On Image Processing
Mining Mouse Vocalizations Jesin Zakaria Department of Computer Science and Engineering University of California Riverside.
Chapter 3: The Efficiency of Algorithms Invitation to Computer Science, Java Version, Third Edition.
Theoretical Analysis. Objective Our algorithm use some kind of hashing technique, called random projection. In this slide, we will show that if a user.
Image Analysis Preprocessing Image Quantization Binary Image Analysis
Chapter 3: The Efficiency of Algorithms Invitation to Computer Science, C++ Version, Third Edition Additions by Shannon Steinfadt SP’05.
15 PARTIAL DERIVATIVES.
Jessica Lin, Eamonn Keogh, Stefano Loardi
Detecting Image Region Duplication Using SIFT Features March 16, ICASSP 2010 Dallas, TX Xunyu Pan and Siwei Lyu Computer Science Department University.
13. The Weak Law and the Strong Law of Large Numbers
Preprocessing ROI Image Geometry
Chapter 3: The Efficiency of Algorithms Invitation to Computer Science, C++ Version, Fourth Edition.
Texture Reading: Chapter 9 (skip 9.4) Key issue: How do we represent texture? Topics: –Texture segmentation –Texture-based matching –Texture synthesis.
A Novel 2D To 3D Image Technique Based On Object- Oriented Conversion.
Copyright © 2012 Elsevier Inc. All rights reserved.. Chapter 9 Binary Shape Analysis.
Radial Basis Function Networks
Understanding Topology for Soil Survey
Microsoft Word 2000 Presentation 5. Major Word Topics Columns Tables Lists.
CS 6825: Binary Image Processing – binary blob metrics
Automated Face Detection Peter Brende David Black-Schaffer Veni Bourakov.
Chapter 2 Frequency Distributions
A Statistical Approach to Speed Up Ranking/Re-Ranking Hong-Ming Chen Advisor: Professor Shih-Fu Chang.
September 23, 2014Computer Vision Lecture 5: Binary Image Processing 1 Binary Images Binary images are grayscale images with only two possible levels of.
XP New Perspectives on Microsoft PowerPoint 2002 Tutorial 2 1 Microsoft PowerPoint 2002 Tutorial 2 – Applying and Modifying Text and Graphic Objects.
Laboratory Exercise # 9 – Inserting Graphics to Documents Office Productivity Tools 1 Laboratory Exercise # 9 Inserting Graphics to Documents Objectives:
© 2008 The McGraw-Hill Companies, Inc. All rights reserved. WORD 2007 M I C R O S O F T ® THE PROFESSIONAL APPROACH S E R I E S Lesson 15 Advanced Tables.
Adobe InDesign CS2—Revealed SETTING UP A DOCUMENT.
Adobe InDesign CS2--Revealed WORKING WITH TEXT. Chapter 2 Working with Text Chapter Objectives Format text Format paragraphs Create and apply styles Edit.
CS654: Digital Image Analysis Lecture 25: Hough Transform Slide credits: Guillermo Sapiro, Mubarak Shah, Derek Hoiem.
Using Pro-Engineer to Create 3 Dimensional Shapes Kevin Manner Kevin Manner Tim Reynolds Tim Reynolds Thuy Tran Thuy Tran Vuong Nguyen Vuong Nguyen.
Digital Media Dr. Jim Rowan ITEC So far… We have compared bitmapped graphics and vector graphics We have discussed bitmapped images, some file formats.
Semi-Supervised Time Series Classification Li Wei Eamonn Keogh University of California, Riverside {wli,
Chapter 3 ADOBE INDESIGN CS3 Chapter 3 SETTING UP A DOCUMENT.
CVPR2013 Poster Detecting and Naming Actors in Movies using Generative Appearance Models.
Synthesizing Natural Textures Michael Ashikhmin University of Utah.
Strategy Using Strategy1. Scan Path / Strategy It is important to visualize the scan path you want for a feature before you begin taking points on your.
Fast Shapelets: All Figures in Higher Resolution.
October 1, 2013Computer Vision Lecture 9: From Edges to Contours 1 Canny Edge Detector However, usually there will still be noise in the array E[i, j],
Machine Vision Edge Detection Techniques ENT 273 Lecture 6 Hema C.R.
Examples of Motifs This presentation provide examples of motifs discovered by our algorithm.
Instructor: Mircea Nicolescu Lecture 5 CS 485 / 685 Computer Vision.
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.
Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets Chin-Chia Michael Yeh, Yan.
Adobe Flash Professional CS5 – Illustrated
Mean Shift Segmentation
Word Lesson 6 Working with Graphics
Computer Vision Lecture 5: Binary Image Processing
Chapter 3: The Efficiency of Algorithms
Benchmark Series Microsoft Word 2016 Level 2
Mining Historical Documents for Near-Duplicate Figures
Shape representation in the inferior temporal cortex of monkeys
Spatial operations and transformations
E. Olofsen, J.W. Sleigh, A. Dahan  British Journal of Anaesthesia 
Binocular Disparity and the Perception of Depth
Handwritten Characters Recognition Based on an HMM Model
Serial Dependence in the Perception of Faces
Neuronal Selectivity and Local Map Structure in Visual Cortex
Prediction of Orientation Selectivity from Receptive Field Architecture in Simple Cells of Cat Visual Cortex  Ilan Lampl, Jeffrey S. Anderson, Deda C.
Top 40 Motifs from Artificial Book with Different Masking Ratios
INSTRUCTIONAL NOTES There are many similarities between Photoshop and Illustrator. We have attempted to place tools and commands in the context of where.
Albert K. Lee, Matthew A. Wilson  Neuron 
Spatial operations and transformations
Volume 85, Issue 6, Pages (December 2003)
Presentation transcript:

Mining Historical Archives for Near- Duplicate Figures Thanawin Rakthanmanon, Qiang Zhu, and Eamonn J. Keogh

Biddulphia alternans (J.W. Bailey) Van Heurck Synonym(s): Triceratium alternans J.W. Bailey Image source:digitised drawing Literature reference: W. Smith: British Diatomaceae Vol.1 (1853), plate 5, fig. 45 View type: Valve view Scale: Image height equivalent 53µm; Image width equivalent 57µm Biddulphia alternans (J.W. Bailey) Van Heurck Synonym(s): Triceratium alternans J.W. Bailey Image source:digitised drawing Literature reference: J. Ralfs in Pritchard: A History of Infusoria (1861), plate 6, fig. 21a View type: Valve view Scale: Image height equivalent 59µm; Image width equivalent 76µm Figure 1. Two plates from 19th-century texts on Diatoms. Plate 6 of [15] and plate 5 of [20]. middle) A zoom-in of the same species, Biddulphia alternans appearing in both texts.

Figure 2. left) A figure from page 7 of [6], a 1915 text on peerage. The original text is monochrome. right) A figure from page 109 of [3], an 1858 text on honors and decorations. [3] Burke, J. B Book of Orders of Knighthood and Decorations of Honour of all Nations, London: Hurst and Blackett. [6] Dod, C. R. and Dod, R. P Dod’s Peerage, Baronetage and Knightage of Great Britain and Ireland for 1915, London: Simpkin, Marshall, Hamilton, Kent. ltd.

Figure 3. Examples of texts with “holes”.

Figure 4. The distance measure we use is offset-invariant, so the distance between any pair of windows, left, center or right above, is exactly zero. This simple fact can be exploited to greatly reduce the search space of motif discovery. Since a pattern from another book that matches one of the above with a distance X must match all with distance X, we only need to include any one of the above in our search.

Figure 5. An illustration of our notation. Here the document D consists of two pages, separated by null values. Intuitively we expect the “T” shape in window W a to match the shape shown in W b. However, note that the trivial matching pair of W c and W d (also pair W e and W f ) are actually more similar, and need to be excluded to prevent pathological results.

Figure 6. An illustration of a pathological solution to finding the top two motif pairs between two century-old texts. top) The desirable solution finds the crescent and label (rotated “E”). bottom) A redundant and undesirable solution that we must explicitly exclude is finding one pattern (the label) twice.

Figure 7. A) Two figures from table 16 of a 1907 text on Native American rock art [13] (one image recolored red for clarity). B) No matter how we shift these two figures, no more than 16% of their pixels overlap. C) Downsampled versions of the figures share 87.2% of their pixels (D). A B D C

Figure 8. A) If we randomly choose some locations (masks) on the underlying bitmap grid on which the two figures (B) shown in Figure 7 lie, and then remove those pixels from the figures, then the distance between the edited figures (C) can only stay the same or decrease. Several random attempts at removing ¼ of the pixels in the two figures eventually produced two identical edited figures (D). A C D Mask template B

Figure 9. The summation of the number of black pixels in windows. Only windows corresponding to peaks above the threshold (the red line) need to be tested. The arrows show the center position of six potential windows.

Figure 10. Samples showing the interclass variability in the hand-drawn datasets. left) Samples from the music datasets. right) Samples from the architectural dataset.

Figure 11. left) Two typical pages from Californian petroglyphs [21]. right) Two typical pages from [13]. Note that the minor artifacts are from the original Google scanning. [13] Koch-Grünberg, T Su ̈ damerikanische Felszeichnungen (South American petroglyphs), Berlin, E. Wasmuth A-G. [21] Smith, G. A. and Turner, W. G Indian Rock Art of Southern California with Selected Petroglyph Catalog, San Bernardino County, Museum Association.

Figure 12. Six random motif pairs from the top fifty pairs created by joining the two texts [13] and [21]. Note that these results suggest that our algorithm is robust to line thickness, solid vs. hollow shapes, and various other distortions. [13] Koch-Grünberg, T Su ̈ damerikanische Felszeichnungen (South American petroglyphs), Berlin, E. Wasmuth A-G. [21] Smith, G. A. and Turner, W. G Indian Rock Art of Southern California with Selected Petroglyph Catalog, San Bernardino County, Museum Association.

Figure 13. The top two inter-book motifs discovered when linking a 1921 text, British Heraldry [4] (left), with a 1909 text, English Heraldic Book-Stamps, Figured and Described [5] (center), and (right). [4]Davenport, C British Heraldry, Methuen. [5] Davenport, C English heraldic book-stamps, figured and described, London: Archibald Constable. ltd.

Figure 14. A zoom-in of the motifs discovered in Figure 13.

Figure 15. left) The 14-segment template used to create characters. We can turn on/off each segment independently to generate a vast alphabet. middle) An example of a page which is generated from the process. right) A page of the book after adding polynomial distortion (top half), and Gaussian noise with mean 0 and variance 0.10 (bottom half).

Figure 16. Time to discover motifs in books of increasing size. Our algorithm can find a motif in 512 pages in 5.5 minutes and 2048 pages in 33 minutes. (inset) As a sanity check we confirmed that the discovered motifs are plausible, as here (noise removed for clarity).

Figure 17. Effect of Gaussian noise. Our algorithm can handle significant amounts of noise. An example of a page containing noise at var=0.10 is shown in Figure 15.right.

Figure 18. The total execution time of three search algorithms: an exact motif search, an exact motif search on just the potential windows, and our algorithm ApproxMotif. We compared the running times of: 1. Exact motif search over the entire document by applying best known motif discovery technique in [27] 2. Exact motif search over just the potential windows 3. Our proposed algorithm, ApproxMotif x 10 4 Execution Time (sec) Number of Pages Exact search(all Windows) Exact search(potential Windows) ApproxMotif [27]Mueen, A. and Keogh, E. J., and Shamlo, N. B Finding Time Series Motifs in Disk-Resident Data. ICDM,

Figure 19. The effect of parameters on our algorithm. We test on artificial books with polynomial distortion and each result is averaged over ten runs. The bold/red line represents the parameters learned from just the first two pages.

Figure 20. The average distance from top-20 motifs from our algorithm and the exact search algorithm. The bold/red line shows the default parameters. This shows that the quality of motifs is not sensitive to different parameter settings and very close to the result from the exact search algorithm Iteration=5 Iteration=9 Iteration=10 Iteration=11 Iteration=20 Exact search Number of Iterations C Number of pages Average Distance HDS=2 (4:1) HDS=3 (9:1) Exact search Hash Downsampling B Number of pages Average Distance Masking Ratio A Mask 60% Mask 50% Mask 40% Mask 30% Mask 20% Exact search