An Architecture for Information Extraction from Figures in Digital Libraries Sagnik Ray Choudhury C. Lee Giles

Presentation transcript:

An Architecture for Information Extraction from Figures in Digital Libraries Sagnik Ray Choudhury C. Lee Giles Information Sciences and Technology Pennsylvania State University

Fully Automated Data Extraction from Line Graphs [Example line graph; text in the figure: "client number", "times", "global locking", "object", "range locking".]

Example from Real Data: Figure Extraction Batch extraction of vector graphics from PDFs is hard. We built a machine-learning-based tool for it (only one other tool exists, and it appeared after our work).

Example from Real Data: Understanding Figure Type Is this figure a line graph, a bar graph or a pie chart? We designed an automated classification tool that uses unsupervised feature learning.

Example from Real Data: Extracting and Classifying Text from the Figure Our system automatically extracts and classifies the text from figures. What are the possible text classes?
1. Legend: Black
2. X-axis value: Blue
3. Y-axis value: Green
4. X-axis label: Yellow
5. Y-axis label: Purple
6. Figure label: Not present
7. Other text: Not present

Example from Real Data: Separating Out the Curves How many colors are in the plot? "Visually distinguishable" colors: 4 colors in the curves, plus black, grey and white in other regions. Number of colors according to image processing software: > Our system automatically identifies the "visually distinguishable" colors and segments the image based on them. Proper curve and text identification implies automated data extraction.

Another Example: Data Extraction from Line Graphs

Summary We have proposed an architecture for data extraction and semantic understanding of figures in scholarly documents (Accepted, KET 2015). We designed a batch extractor for vector graphics from PDFs (In Submission). We designed algorithms for automated data extraction from color line graphs in scholarly papers (In preparation). Future work: algorithms for automated data extraction from monochrome line graphs and other types of figures (bar graphs, pie charts); creating a natural language summary of figures in scholarly papers.

Figure Extraction from PDFs: Challenges Goal: extract figures and their associated metadata (caption, mention) from documents. Tasks: figure extraction, metadata extraction, figure-metadata matching [JCDL 13, ICDAR 13]. Figure extraction: Input: a PDF page. Output: locations of the figure regions on the page. How are figures embedded in documents? As raster graphics (PNG, JPEG) or as vector graphics (PS, EPS, SVG). It is easy to extract raster graphics from PDF documents. Each PDF operation is of the form … Raster graphics are embedded as … It is easy for a parser to extract the content stream and output it as an image. Extraction of vector graphics is hard. Each figure is a set of graphics operations. Operations are of the form ( ). Each operation is included separately in the PDF, not grouped as an image. Grouping is tricky: no software exists to batch extract vector graphics from PDFs.

Figure Extraction from PDF Documents: Approaches Image processing (KET 2015): Split a PDF into pages, convert each page into an image, and use page segmentation to generate text/graphics regions. Not very scalable: PDF-to-image conversion takes more than 2.5 seconds of CPU time. Vector images are extracted as raster ones, which loses information. PDF processing (Submitted, ICDAR 2015): Process the PDF primitives to extract figure locations. Fast, but the grouping of PDF primitives is hard. Why?

Graphics in PDF documents PDF primitives: path, image and character. How are paths drawn in a PDF? Path construction: draws a path (line or Bézier curve) from point (x1,y1) to (x2,y2). Path painting: paints a path by stroking or filling it, following one of several rules. Clipping paths: define the region in which paths are painted. Graphics state: determines how and where the paths are painted on the screen, using a transformation matrix.

An Object Model for PDF Challenges in extracting path locations: operators, graphics state and clipping paths all have to be handled. Most "graphics" operators are not implemented in existing PDF parsers such as PDFBox. It would be beneficial to have an "object" model of the PDF. Input: a PDF document. Output: bounding boxes of paths. There can be too many paths; we want the visually distinguishable ones. A recent work released a software tool, pdfXtk, to "obtain a simplified representation of the most important lines and boxes which are of material importance for layout analysis".
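
To make the role of the transformation matrix concrete, here is a minimal sketch of how a path's page-space bounding box could be computed from its construction points. It assumes the standard six-number PDF matrix (a, b, c, d, e, f), under which a point maps to (a*x + c*y + e, b*x + d*y + f). The function names are hypothetical and not part of any parser mentioned in the slides, and clipping paths are ignored.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# PDF-style transformation matrix, written as the six numbers (a, b, c, d, e, f).
Matrix = Tuple[float, float, float, float, float, float]


def apply_ctm(ctm: Matrix, point: Point) -> Point:
    """Map a point from path (user) space to page space."""
    a, b, c, d, e, f = ctm
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def path_bounding_box(ctm: Matrix, points: List[Point]) -> Tuple[float, float, float, float]:
    """Bounding box (x0, y0, x1, y1) of the transformed construction points.

    For Bezier segments, passing the control points gives a conservative box,
    because a Bezier curve lies inside the convex hull of its control points.
    """
    transformed = [apply_ctm(ctm, p) for p in points]
    xs = [p[0] for p in transformed]
    ys = [p[1] for p in transformed]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical example: scale by 2, then translate by (10, 20).
ctm = (2.0, 0.0, 0.0, 2.0, 10.0, 20.0)
segment = [(0.0, 0.0), (5.0, 3.0)]
bbox = path_bounding_box(ctm, segment)
```

In this example the line from (0, 0) to (5, 3) ends up with the page-space box (10, 20, 20, 26).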

Using pdfXtk for Figure Extraction pdfXtk produces paths, but they can belong to figures, tables, symbols, equations etc. Some of these paths need to be filtered out. When the page contains multiple figures, paths belonging to figure regions need to be combined.

Processing PDF Objects for Figure Extraction Classification: classify each path as part of a figure region (positive) or noise (negative). Clustering: cluster the positively classified paths. Merging and evaluation: merge the paths in each cluster to produce the final bounding boxes, then evaluate the results. Can we build a heuristic-independent model for the task? [Diagram: training data, test data, figure locations.]

Dataset for Experiment 200 randomly selected PDFs from CiteSeerX were split into 1800 pages. Test data: 85 pages selected from a sample of 300 pages. Each test page contains more than one figure and more than 5 paths. Training data (for the classifier): 50 pages containing at least one figure (3000 paths, 85% positive). Negative: 50 pages containing no figures but tables. Finally, 4000 paths, with a positive-to-negative ratio of 2:1. Positive instances were manually labeled using LabelMe. There are many paths per page, so exhaustive enumeration is not possible.

Classification of Paths Features:
1. Character density ratio: character density within a bounding box around the path / character density in the whole document.
2. Distance from boundary: minimum of the distances from all boundaries.
3. Number of paths in the eps neighborhood: motivated by DBSCAN.
4. Area.
Classifiers: linear-kernel SVM, decision tree, logistic regression and Gaussian naive Bayes. 70:30 stratified division, 200 runs for each classifier. It is important to have high recall for the negative class, at the expense of the other metrics.
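
The four path features can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: boxes are (x0, y0, x1, y1) tuples, characters are reduced to points, density means characters per unit area, the page stands in for "the whole document", the window around a path is its box grown by a hypothetical `margin`, and the eps neighborhood is measured between box centers.

```python
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def box_center(box: Box) -> Point:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def char_density(chars: List[Point], region: Box) -> float:
    """Characters per unit area inside a region (assumed definition of density)."""
    x0, y0, x1, y1 = region
    inside = sum(1 for (x, y) in chars if x0 <= x <= x1 and y0 <= y <= y1)
    region_area = box_area(region)
    return inside / region_area if region_area > 0 else 0.0


def path_features(index: int, paths: List[Box], chars: List[Point],
                  page: Box, eps: float, margin: float = 0.0) -> Dict[str, float]:
    """Compute the four features for the path at position `index`."""
    x0, y0, x1, y1 = paths[index]
    window = (x0 - margin, y0 - margin, x1 + margin, y1 + margin)
    page_density = char_density(chars, page)
    ratio = char_density(chars, window) / page_density if page_density > 0 else 0.0
    px0, py0, px1, py1 = page
    boundary_distance = min(x0 - px0, y0 - py0, px1 - x1, py1 - y1)
    cx, cy = box_center(paths[index])
    neighbors = 0
    for j, other in enumerate(paths):
        ox, oy = box_center(other)
        if j != index and hypot(ox - cx, oy - cy) <= eps:
            neighbors += 1
    return {
        "char_density_ratio": ratio,
        "distance_from_boundary": boundary_distance,
        "eps_neighbors": float(neighbors),
        "area": box_area(paths[index]),
    }


# Hypothetical page with three paths and four characters.
page = (0.0, 0.0, 100.0, 100.0)
paths = [(40.0, 40.0, 60.0, 60.0), (45.0, 45.0, 55.0, 55.0), (0.0, 0.0, 10.0, 10.0)]
chars = [(10.0, 10.0), (12.0, 10.0), (90.0, 90.0), (50.0, 50.0)]
features = path_features(0, paths, chars, page, eps=5.0)
```

Each feature dictionary could then be turned into a row of the training matrix for any of the listed classifiers.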

Classification of Paths: Results The best classifier is a decision tree with max depth = 3, which can also be used to create rules. Table columns: Recall (Negative, Positive), Precision (Negative, Positive), F1-score (Negative, Positive). Values: 73.2 ± … ± 4.5

Clustering and Merging Paths K-means with a Euclidean center-distance function. Initialization: 1. nearest point to a figure caption, or 2. K-means++. Merging: combine all rectangles belonging to a cluster into one large rectangle. Evaluation: Gold standard: a set of rectangles (R_g) denoting the actual figure locations on a page. Predicted: a set of rectangles (R_p) denoting the predicted figure locations on a page. First find a correspondence between the predicted and gold standard data. Then, for each pair (R_p, R_g), calculate figure-precision, figure-recall and figure-F1-score.
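
The clustering and merging steps can be sketched as below. This is a simplified illustration, not the authors' code: it runs a plain K-means on the centers of the path rectangles, takes the initial centers as given (for example, points derived from figure captions, which is an assumption about how the caption-based initialization is used), and merges each cluster into its enclosing rectangle. All names are hypothetical.

```python
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]


def box_center(box: Box) -> Point:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def nearest(point: Point, centers: List[Point]) -> int:
    """Index of the center closest to the point (Euclidean distance)."""
    distances = [hypot(point[0] - c[0], point[1] - c[1]) for c in centers]
    return distances.index(min(distances))


def kmeans_labels(points: List[Point], initial_centers: List[Point],
                  iterations: int = 10) -> List[int]:
    """Plain K-means with caller-supplied initial centers."""
    centers = list(initial_centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels


def merge_boxes(boxes: List[Box]) -> Box:
    """Smallest rectangle enclosing all given rectangles."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def cluster_and_merge(path_boxes: List[Box], initial_centers: List[Point]) -> Dict[int, Box]:
    labels = kmeans_labels([box_center(b) for b in path_boxes], initial_centers)
    return {k: merge_boxes([b for b, label in zip(path_boxes, labels) if label == k])
            for k in sorted(set(labels))}


# Hypothetical page with two figures; initial centers placed near two captions.
path_boxes = [(10.0, 10.0, 20.0, 20.0), (15.0, 30.0, 40.0, 35.0),
              (200.0, 10.0, 220.0, 30.0), (210.0, 40.0, 240.0, 60.0)]
caption_points = [(25.0, 50.0), (220.0, 70.0)]
merged = cluster_and_merge(path_boxes, caption_points)
```

For this input the two merged regions are (10, 10, 40, 35) and (200, 10, 240, 60).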

Clustering and Merging Paths: Results
Figure-precision: Area(R_p ∩ R_g) / Area(R_p)
Figure-recall: Area(R_p ∩ R_g) / Area(R_g)
Figure-F1-score: harmonic mean of figure-precision and figure-recall.
Table columns: Initialization, Figure-precision, Figure-recall, Figure-F1-score. Rows: nearest point to figure caption; K-means++.
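
The three area-based scores translate directly into code. The sketch below assumes axis-aligned rectangles given as (x0, y0, x1, y1) and a single matched pair of predicted and gold rectangles; the step that finds the correspondence between predictions and gold data is not shown.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two axis-aligned rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def figure_scores(predicted: Box, gold: Box) -> Tuple[float, float, float]:
    """Return (figure-precision, figure-recall, figure-F1-score)."""
    overlap = intersection_area(predicted, gold)
    precision = overlap / box_area(predicted) if box_area(predicted) > 0 else 0.0
    recall = overlap / box_area(gold) if box_area(gold) > 0 else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical pair: the prediction covers only the lower half of the gold figure.
precision, recall, f1 = figure_scores((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 20.0))
```

Here the prediction lies entirely inside the gold region, so precision is 1.0 while recall is 0.5.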

PDF Processing for Figure Extraction: Error Analysis There are four types of errors:
1. Intersecting boxes
2. Wrong classification
3. High density
4. Wrong initialization
Types 1 and 3 can be handled with the heuristic that a figure caption box should not intersect a merged region.

Document Analysis Module: Summary We show that it is possible to extract figures and their metadata in a scalable, heuristic-independent way. Novel features are proposed, and evaluation methods and a dataset are prepared for future work. We validate the PDF object model as an appropriate tool for information extraction in digital libraries.

Image Processing Module Figures such as line graphs are generated from data (from a table). Given a figure, we want to recreate the table from which it was generated. Tasks: Preprocess the figure to identify possible sub-figures. Identify the type of the figure. Define a metadata structure for each figure type. Extract data from the figure to populate the metadata structure. Current work focuses on 2D line graphs (plots containing axes and curves). Most figures in academic documents are line graphs. Some previous work has explored data extraction from, and the semantics of, line graphs. It is easy to define a metadata structure: a line graph is almost always generated from a table. Classification and the challenges in automated data extraction from line graphs are discussed here. Future work explores the data extraction process.

Classification of Figures: Features and Results Binary classification: positive (2D line graph), negative (everything else). Simple unsupervised feature learning (similar to [6]):
1. Rescale the image to 128x128 without maintaining the aspect ratio.
2. Binarize and divide into 4x4 patches.
3. Extract N random patches and cluster them into K clusters. Our experiments suggest N = 100 and K = 5.
4. Generate the feature vector by concatenating the 5 cluster centers: 80-dimensional features.
Data (training / test): positive 206 / 27; negative …; total 424 / 54.
Cross-validation results: table of accuracy, precision, recall and F-score for LDA, QDA, SVM and Random Forest.
Results on test data: accuracy 85%, comparable with [7].
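
The feature-learning steps above can be sketched in plain Python as follows. Several details are assumptions made for illustration and are not stated in the slides: nearest-neighbor rescaling, a fixed binarization threshold of 128, a basic K-means with random initialization, and sorting the cluster centers so that the concatenation order is deterministic. A 4x4 patch has 16 pixels, so 5 centers give the 80 dimensions mentioned above.

```python
import random
from typing import List

SIZE, PATCH, N_PATCHES, K = 128, 4, 100, 5

Image = List[List[int]]  # grayscale rows, values 0..255


def rescale(image: Image, size: int = SIZE) -> Image:
    """Nearest-neighbor resize to size x size, ignoring the aspect ratio."""
    height, width = len(image), len(image[0])
    return [[image[r * height // size][c * width // size] for c in range(size)]
            for r in range(size)]


def binarize(image: Image, threshold: int = 128) -> Image:
    return [[1 if value >= threshold else 0 for value in row] for row in image]


def extract_patches(image: Image, patch: int = PATCH) -> List[List[float]]:
    """Flatten every non-overlapping patch x patch block into one vector."""
    size = len(image)
    return [[float(image[r + i][c + j]) for i in range(patch) for j in range(patch)]
            for r in range(0, size, patch) for c in range(0, size, patch)]


def kmeans_centers(vectors: List[List[float]], k: int, rng: random.Random,
                   iterations: int = 20) -> List[List[float]]:
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for v in vectors:
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            groups[distances.index(min(distances))].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster is empty
                centers[i] = [sum(column) / len(group) for column in zip(*group)]
    return centers


def figure_features(image: Image, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    binary = binarize(rescale(image))
    sampled = rng.sample(extract_patches(binary), N_PATCHES)
    centers = sorted(kmeans_centers(sampled, K, rng))  # sorting is an assumption
    return [value for center in centers for value in center]


# Hypothetical 48x64 test image: white background with a dark diagonal line.
image = [[0 if abs(c - r) < 2 else 255 for c in range(64)] for r in range(48)]
features = figure_features(image)
```

The resulting 80-dimensional vector could then be passed to any of the listed classifiers.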

Data Extraction from Line Graphs: Challenges Generic steps in data extraction from line graphs:
1. Extract: axes values, legends and curves.
2. Map: each pixel (x,y) in the plotting region to the graph scale (x',y') using the axes values.
3. Associate: each "curve" point in the plotting region with one of the legends.
Easy cases: curves are drawn in separate colors, and the plotting region contains only curves and legends. Hard cases: curves are drawn in the same color and/or the plotting region contains text or graphics other than curves and legends. The example here shows an easy case. Our analysis (WIP, DocEng 2015) of 300 line graphs sampled from 10,000 computer science papers suggests: 52% of all the plots are color plots and should be addressed first. 58% of the plots have noise in the plotting region, but in 87% of such cases the noise is due to a grid structure and should be easy to remove. While there is a limited number of "visually distinguishable" colors (red and blue here), there are more than a thousand "actual colors" due to anti-aliasing. This problem needs to be solved. [Figure annotations: figure label, axes values, legends, axes labels.]
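
The "map" step can be illustrated with a small sketch. It assumes linear axes and that two tick marks per axis have already been located, each with a pixel coordinate and the numeric value read from its label; logarithmic axes would need a different mapping. The helper names are hypothetical.

```python
from typing import Callable, Tuple


def axis_mapper(pixel_a: float, value_a: float,
                pixel_b: float, value_b: float) -> Callable[[float], float]:
    """Build a linear pixel-to-value mapping from two known ticks on one axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def map_point(pixel_xy: Tuple[float, float],
              x_map: Callable[[float], float],
              y_map: Callable[[float], float]) -> Tuple[float, float]:
    """Convert a pixel (x, y) in the plotting region to graph scale (x', y')."""
    return (x_map(pixel_xy[0]), y_map(pixel_xy[1]))


# Hypothetical ticks: x = 0 at pixel column 50 and x = 100 at column 450;
# y = 0.0 at pixel row 400 and y = 1.5 at row 100 (image rows grow downwards).
x_map = axis_mapper(50.0, 0.0, 450.0, 100.0)
y_map = axis_mapper(400.0, 0.0, 100.0, 1.5)
graph_x, graph_y = map_point((250.0, 250.0), x_map, y_map)
```

Because image rows increase downwards while graph values usually increase upwards, the y mapping simply ends up with a negative scale; no special case is needed.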

Conclusion, Ongoing and Future Work Figures are important in scholarly documents and need to be analyzed. We present a modular architecture for such analysis and describe two modules here (the paper also discusses the search engine module, omitted here). We identify the challenges in automated data extraction from line graphs by analyzing figures extracted from a large collection of scholarly papers. Our future work involves such data extraction and natural language summary generation for line graphs.