1
An Architecture for Information Extraction from Figures in Digital Libraries
Sagnik Ray Choudhury (sagnik@psu.edu), C. Lee Giles (giles@ist.psu.edu)
Information Sciences and Technology, Pennsylvania State University
2
Fully Automated Data Extraction from Line Graphs
[Example line graph: x-axis "client number", y-axis "times", legend entries "global locking" and "object range locking".]
3
Example from Real Data: Figure Extraction
Batch extraction of vector graphics from PDFs is hard. We built a machine learning based tool for this task (only one other such tool exists, and it appeared after our work).
4
Example from Real Data: Understanding Figure Type
Is this figure a line graph? A bar graph? A pie chart? We designed an automated classification tool using unsupervised learning.
5
Example from Real Data: Extracting and Classifying Text from the Figure
Our system automatically extracts and classifies the text from figures. What are the possible texts?
1. Legend: Black
2. X-axis value: Blue
3. Y-axis value: Green
4. X-axis label: Yellow
5. Y-axis label: Purple
6. Figure label: Not present
7. Other text: Not present
6
Example from Real Data: Separating Out the Curves
How many colors are in the plot? In terms of "visually distinguishable colors": 4 colors in the curves, plus black, grey and white in other regions. Number of colors according to image processing software: more than 1,500. Our system automatically identifies the "visually distinguishable colors" and segments the image based on them. Proper curve and text identification then enables automated data extraction.
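The talk does not spell out how the "visually distinguishable colors" are found; the sketch below shows one plausible approach, k-means color quantization with a coverage cutoff, where the cluster count k and the min_fraction threshold are assumptions rather than values from this work.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def distinguishable_colors(image_path, k=6, min_fraction=0.01):
    """Quantize an image's colors into k clusters and keep only the
    clusters covering a non-trivial fraction of pixels.  k and
    min_fraction are guesses, not parameters from the paper."""
    pixels = np.asarray(Image.open(image_path).convert("RGB")).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    labels, counts = np.unique(km.labels_, return_counts=True)
    keep = counts / counts.sum() >= min_fraction
    return km.cluster_centers_[labels[keep]].astype(int)

# Example: distinguishable_colors("line_graph.png") -> a few RGB centers
# instead of the >1,500 raw colors produced by anti-aliasing.
```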
7
Another Example: Data Extraction from Line Graphs
8
Summary
We have proposed an architecture for data extraction and semantic understanding of figures in scholarly documents (accepted, KET 2015).
Designed a batch extractor for vector graphics from PDFs (in submission).
Designed algorithms for automated data extraction from color line graphs in scholarly papers (in preparation).
Future work: algorithms for automated data extraction from monochrome line graphs and other types of figures (bar graphs, pie charts); creating natural language summaries of figures in scholarly papers.
9
Figure Extraction from PDFs: Challenges
Goal: extract each figure and its associated metadata (caption, mention) from documents. Tasks: figure extraction, metadata extraction, figure-metadata matching [JCDL 13, ICDAR 13].
Figure extraction: input: a PDF page; output: locations of the figure regions on the page.
How are figures embedded in documents? As raster graphics (PNG, JPEG) or as vector graphics (PS, EPS, SVG).
It is easy to extract raster graphics from PDF documents. Each PDF operation is of the form (operands, operator), and a raster graphic is embedded as a single self-contained image object, so it is easy for a parser to extract its content stream and output it as an image.
Extraction of vector graphics is hard. Each figure is a set of graphics operations of the form (operands, operator). Each operation is included separately in the PDF, not grouped as an image, and grouping them is tricky: no software exists to batch extract vector graphics from PDFs.
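To illustrate the raster/vector distinction, here is a small sketch using pdfminer.six (not the tool described in this talk): each raster figure surfaces as a single image object, while a vector figure is scattered over many separate path objects, which is exactly what makes grouping hard.

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage, LTCurve

def count_graphics(pdf_path):
    """Count raster images vs. vector path objects on each page."""
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        images, paths = 0, 0
        stack = list(page)
        while stack:
            obj = stack.pop()
            if isinstance(obj, LTFigure):     # container: recurse into it
                stack.extend(obj)
            elif isinstance(obj, LTImage):    # one object per embedded raster image
                images += 1
            elif isinstance(obj, LTCurve):    # LTLine/LTRect subclass LTCurve
                paths += 1
        print(f"page {page_no}: {images} raster images, {paths} vector path objects")
```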
10
Figure Extraction from PDF Documents: Approaches
Image processing (KET 2015): split a PDF into pages, convert each page into an image, and use page segmentation to generate text/graphics regions. Not very scalable: PDF-to-image conversion takes more than 2.5 seconds of CPU time, and vector images are extracted as raster ones, which loses information.
PDF processing (submitted, ICDAR 2015): process the PDF primitives to extract figure locations. Fast, but grouping the PDF primitives is hard. Why?
11
Graphics in PDF Documents
PDF primitives: path, image and character. How are paths drawn in a PDF?
Path construction: draw a path (line / Bézier curve) from point (x1, y1) to (x2, y2).
Path painting: paint a path by filling it, following several rules.
Clipping paths: define the region in which paths will be painted.
Graphics state: determines how and where paths will be painted on the screen, using a transformation matrix.
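For concreteness, the transformation matrix in the graphics state maps user-space coordinates to device space as x' = a*x + c*y + e and y' = b*x + d*y + f. The generic sketch below (not the talk's code) applies such a matrix to a path's points to obtain the device-space bounding box that later processing works with.

```python
def apply_ctm(points, ctm):
    """Map (x, y) user-space points through a PDF transformation matrix
    [a, b, c, d, e, f]: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = ctm
    return [(a * x + c * y + e, b * x + d * y + f) for x, y in points]

def path_bbox(points, ctm):
    """Device-space bounding box of a path's control points."""
    pts = apply_ctm(points, ctm)
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)

# Example: path_bbox([(0, 0), (100, 50)], (1, 0, 0, 1, 72, 72))
# -> (72, 72, 172, 122) for a 72-point translation.
```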
12
An Object Model for PDF
Challenges in extracting path locations: one must take care of operators, the graphics state and clipping paths, and most "graphics" operators are not implemented in existing PDF parsers such as PDFBox. It would be beneficial to have an "object" model of the PDF. Input: a PDF document; output: bounding boxes of paths. There can be too many paths; we want the visually distinguishable ones. A recent work released a tool, pdfXtk, to "obtain a simplified representation of the most important lines and boxes which are of material importance for layout analysis".
13
Using pdfXtk for Figure Extraction
pdfXtk produces paths, but they can belong to figures, tables, symbols, equations, etc. Some of these paths need to be filtered out. When a page contains multiple figures, the paths belonging to each figure region need to be combined.
14
Processing PDF Objects for Figure Extraction
Classification: classify each path as part of a figure region (positive) or as noise (negative).
Clustering: cluster the positively classified paths.
Merging and evaluation: merge the paths in each cluster to produce the final bounding boxes, then evaluate the results.
Can we build a heuristic-independent model for the task? (A sketch of the pipeline is given below.)
[Pipeline diagram: training data and test data feed the pipeline, which outputs figure locations.]
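A minimal sketch of this three-stage pipeline, assuming the featurize function and the trained classifier come from the feature/classifier slides that follow, and that the number of clusters is taken from the number of figure captions found on the page.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_figures(paths, classifier, featurize, n_figures):
    """paths: list of (x0, y0, x1, y1) bounding boxes of PDF path objects.
    classifier and featurize are assumed to be defined elsewhere;
    n_figures could come from the figure captions on the page."""
    X = np.array([featurize(p) for p in paths])
    keep = [p for p, label in zip(paths, classifier.predict(X)) if label == 1]
    if not keep:
        return []
    centers = np.array([((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in keep])
    k = max(1, min(n_figures, len(keep)))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(centers)
    boxes = []
    for cluster_id in set(labels):
        members = [p for p, l in zip(keep, labels) if l == cluster_id]
        xs0, ys0, xs1, ys1 = zip(*members)
        boxes.append((min(xs0), min(ys0), max(xs1), max(ys1)))  # merge cluster into one box
    return boxes
```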
15
Dataset for the Experiment
200 randomly selected PDFs from CiteSeerX were split into 1,800 pages.
Test data: 85 pages selected from a sample of 300 pages; each test page contains more than one figure and at least 5 paths.
Training data (for the classifier): 50 pages containing at least one figure (3,000 paths, 85% positive) and, as negatives, 50 pages containing no figures but tables. In total, 4,000 paths with a positive-to-negative ratio of 2:1. Positive instances were manually labeled using LabelMe.
There are many paths per page, so exhaustive enumeration is not feasible.
16
Classification of Paths
Features:
Character density ratio: character density within a bounding box around the path divided by the character density in the whole document.
Distance from boundary: minimum of the distances from all page boundaries.
Number of paths in the eps-neighborhood: motivated by DBSCAN.
Area.
Classifiers: linear-kernel SVM, decision tree, logistic regression and Gaussian naive Bayes. 70:30 stratified split, 200 runs for each classifier. It is important to have high recall for the negative class, even at the expense of other metrics. (A training sketch follows below.)
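A minimal sketch of one classifier run under the stated setup, assuming the four features above have already been computed into a matrix X with labels y (1 = figure path, 0 = noise).

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

def train_path_classifier(X, y):
    """X: (n_paths, 4) array of [char_density_ratio, boundary_distance,
    eps_neighborhood_count, area]; y: 1 = figure path, 0 = noise.
    One 70:30 stratified run; the talk averages over 200 such runs."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
    clf = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
    # High recall on the negative (noise) class matters most here.
    print(classification_report(y_te, clf.predict(X_te)))
    return clf
```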
17
Classification of Paths: Results
The best classifier is a decision tree with max depth = 3, which can also be used to create rules.
Negative class: recall 73.2 ± 6.6, precision 53.7 ± 5.8, F1-score 61.7 ± 4.8
Positive class: recall 73.1 ± 6.6, precision 86.8 ± 2.8, F1-score 79.1 ± 4.5
18
Clustering and Merging Paths
K-means clustering with a Euclidean distance function. Initialization: 1. the nearest point to a figure caption, or 2. k-means++.
Merging: combine all rectangles belonging to a cluster into one large rectangle.
Evaluation: the gold standard is a set of rectangles (R_g) denoting the actual figure locations on a page; the prediction is a set of rectangles (R_p) denoting the predicted figure locations. First find a correspondence between predicted and gold-standard rectangles, then for each pair (R_p, R_g) calculate figure-precision, figure-recall and figure-F1-score.
19
Clustering and Merging Paths: Results
Figure-precision: Area(R_p ∩ R_g) / Area(R_p)
Figure-recall: Area(R_p ∩ R_g) / Area(R_g)
Figure-F1-score: harmonic mean of figure-precision and figure-recall.
Initialization "nearest point to figure caption": figure-precision 81.9, figure-recall 85.0, figure-F1-score 80.9
Initialization "k-means++": figure-precision 78.4, figure-recall 80.4, figure-F1-score 76.6
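Since all regions are axis-aligned rectangles, the three scores reduce to simple intersection areas; a small sketch, assuming rectangles given as (x0, y0, x1, y1).

```python
def area(r):
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def intersection(rp, rg):
    x0, y0 = max(rp[0], rg[0]), max(rp[1], rg[1])
    x1, y1 = min(rp[2], rg[2]), min(rp[3], rg[3])
    return (x0, y0, x1, y1) if x1 > x0 and y1 > y0 else (0, 0, 0, 0)

def figure_scores(rp, rg):
    """Figure-precision, figure-recall and figure-F1 for one (R_p, R_g) pair."""
    inter = area(intersection(rp, rg))
    p = inter / area(rp) if area(rp) else 0.0
    r = inter / area(rg) if area(rg) else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```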
20
PDF Processing for Figure Extraction: Error Analysis
There are four types of errors:
1. Intersecting boxes
2. Wrong classification
3. High density
4. Wrong initialization
Types 1 and 3 can be handled with the heuristic that a figure caption box should not intersect a merged region.
21
Document Analysis Module: Summary
We show that it is possible to extract figures and their metadata in a scalable, heuristic-independent way. Novel features are proposed, and an evaluation method and dataset are prepared for future work. We validate the PDF object model as an appropriate tool for information extraction in digital libraries.
22
Image Processing Module
Figures such as line graphs are generated from data (from a table). Given a figure, we want to recreate the table from which it was generated.
Tasks: preprocess the figure to identify possible sub-figures; identify the type of the figure; define a metadata structure for each figure type; extract data from the figure to populate the metadata structure.
Current work focuses on 2D line graphs (plots containing axes and curves): most figures in academic documents are line graphs, some previous work has explored data extraction and the semantics of line graphs, and it is easy to define a metadata structure because a line graph is almost always generated from a table (an assumed example of such a structure is sketched below).
The classification of line graphs and the challenges in automated data extraction are discussed here; future work explores the data extraction process.
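The slides do not show the metadata structure itself; the dataclass below is only an assumed illustration of what a record for a 2D line graph could look like, and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Curve:
    legend: str                                                   # e.g. "global locking"
    points: List[Tuple[float, float]] = field(default_factory=list)  # (x, y) in data units

@dataclass
class LineGraphMetadata:
    """Hypothetical record for a 2D line graph (field names are assumptions)."""
    caption: str
    x_label: str
    y_label: str
    curves: List[Curve] = field(default_factory=list)
```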
23
Classification of Figures: Features and Results
Binary classification: positive (2D line graph) vs. negative (everything else).
Simple unsupervised feature learning (similar to [6]): rescale the image to 128x128 without maintaining the aspect ratio, binarize it and divide it into 4x4 patches, then extract N random patches and cluster them into K clusters. Our experiments suggest N = 100 and K = 5. The feature vector is generated by concatenating the 5 cluster centers: an 80-dimensional feature. (A sketch of this feature construction follows below.)
Data (number of figures): training: 206 positive, 218 negative, 424 total; test: 27 positive, 27 negative, 54 total.
Cross-validation results (LDA, QDA, SVM, Random Forest):
Accuracy: 0.68, 0.72, 0.73
Precision: 0.61, 0.52, 0.64
Recall: 0.94, 0.73, 0.97, 0.94
F-score: 0.74, 0.61, 0.77, 0.76
Results on test data: accuracy 85%, comparable with [7].
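A hedged sketch of this patch-based feature construction (rescale to 128x128, binarize, 4x4 patches, 100 random patches clustered into 5 centers, concatenated into an 80-dimensional vector); the binarization threshold is an assumption.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def figure_features(image_path, n_patches=100, k=5, seed=0):
    rng = np.random.default_rng(seed)
    img = np.asarray(Image.open(image_path).convert("L").resize((128, 128)))
    binary = (img > 128).astype(float)              # threshold is an assumption
    # Cut the 128x128 image into non-overlapping 4x4 patches -> 1024 patches of 16 values.
    patches = binary.reshape(32, 4, 32, 4).swapaxes(1, 2).reshape(-1, 16)
    sample = patches[rng.choice(len(patches), n_patches, replace=False)]
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sample)
    return km.cluster_centers_.flatten()            # k * 16 = 80-dimensional feature
```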
24
Data Extraction from Line Graphs: Challenges
Generic steps in data extraction from line graphs: extract 1. the axes values, 2. the legends and 3. the curves; map each pixel (x, y) in the plotting region to the graph scale (x', y') using the axes values (see the sketch below); and associate each "curve" point in the plotting region with one of the legends.
Easy cases: curves are drawn in separate colors and the plotting region contains only curves and legends (as in the example shown here). Hard cases: curves are drawn in the same color and/or the plotting region contains text or graphics other than curves and legends.
Our analysis (WIP, DocEng 2015) of 300 line graphs sampled from 10,000 computer science papers suggests: 52% of all the plots are color plots and should be addressed first; 58% of the plots have noise in the plotting region, but in 87% of such cases the noise is due to a grid structure and should be easy to remove. While there is a limited number of "visually distinguishable" colors (red and blue here), there are more than a thousand "actual colors" due to anti-aliasing; this problem needs to be solved.
[Annotated example figure showing the figure label, axes values, legends and axes labels.]
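Once two ticks per axis have been recovered, mapping a plot pixel to data coordinates is linear interpolation; a minimal sketch, assuming linear axes and (pixel, value) tick pairs read from the extracted axis text.

```python
def pixel_to_data(px, py, x_ticks, y_ticks):
    """x_ticks / y_ticks: two (pixel_coord, data_value) pairs per axis,
    e.g. x_ticks = [(50, 0.0), (450, 100.0)].  Assumes linear axes."""
    (xp0, xv0), (xp1, xv1) = x_ticks
    (yp0, yv0), (yp1, yv1) = y_ticks
    x = xv0 + (px - xp0) * (xv1 - xv0) / (xp1 - xp0)
    y = yv0 + (py - yp0) * (yv1 - yv0) / (yp1 - yp0)
    return x, y

# Example: pixel_to_data(250, 120, [(50, 0.0), (450, 100.0)], [(300, 0.0), (60, 50.0)])
# -> (50.0, 37.5); y ticks are given top-down because image rows grow downward.
```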
25
Conclusion, Ongoing and Future Work Figures are important in scholarly documents and need to be analyzed. We present a modular architecture for such analysis and describe two modules here (the paper also discusses the search engine module, omitted here). We identify the challenges in automated data extraction from line graphs by analyzing figures extracted from a large collection of scholarly papers. Our future work involves such data extraction and natural language summary generation for line graphs.