DCO-DS Boundary Activity: Data Extraction from Tables and Plots in Scanned PDF Publications Congrui Li 1 John Erickson 1


Congrui Li 1 (lic10@rpi.edu), John Erickson 1 (erickj4@rpi.edu), Xiaogang Ma 1 (max7@rpi.edu), Patrick West 1 (westp@rpi.edu), Mark Ghiorso 2 (ghiorso@ofm-research.org), and Peter Fox 1 (pfox@cs.rpi.edu)

1 Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY 12180, United States
2 OFM Research Inc, Seattle, WA 98115, United States

Abstract

Reusability of data is of major importance in scientific research. We often want to reuse data from older publications, from the 1960s or earlier. However, the data in those publications are normally not ready for direct reuse because they are not yet in machine-readable formats. The document formats used are commonly not geared toward reusability. The Portable Document Format (PDF) is particularly difficult to reuse, as it was never designed for this purpose. This DCO boundary activity focused on retrieving data from tables and plots in scanned PDF publications as efficiently and accurately as possible. Optical character recognition (OCR) is the key technique for this task: it is the process of extracting machine-encoded characters from input images (usually scanned documents). A variety of open source programs have been tested for different use cases. Some issues remain to be resolved; they are listed below.
Use Case of Data Extraction from Tables:
- original scanned PDF file
- image of Table I (.png)
- machine-readable .txt file after OCR (using PyTesser)

Available Tools for Data Extraction from Tables:
- PyTesser (https://code.google.com/p/pytesser/)
- OCRopus (https://code.google.com/p/ocropus/)
- TableSeer (http://tableseer.sourceforge.net/)
- ChemXSeerTableExtractor (http://chemxseer.ist.psu.edu/ChemXSeerTableExtractor/TableExtractorServlet)
- Apache Tika (http://tika.apache.org/)
- Google Docs (http://docs.google.com/)
- FreeOCR (http://www.paperfile.net/) (not open source)

Problems Left to Be Solved:
- precision of OCR, especially for irregular characters (superscripts, subscripts, Greek letters, math symbols, etc.)
- preservation of table structure after OCR
- automatic table detection
- manually double-checking the OCR results is very time consuming

Available Tools for Data Extraction from Plots:
- Plot Digitizer (http://plotdigitizer.sourceforge.net/)
  o use autotrace (http://autotrace.sourceforge.net/) to make it semi-automatic
- WebPlotDigitizer (http://arohatgi.info/WebPlotDigitizer/)
- Plot Digitizer (http://www.southalabama.edu/physics/software/plotdigitizer.htm)

Use Case of Data Extraction from Plots:
- original scanned PDF file
- image of Fig. 1 (.png)
- In Plot Digitizer, indicate where the line is on the plot with a thick paint brush; the program then attempts to automatically separate the data from the grid lines. This auto-digitizing feature depends on an image vectorization program called "autotrace".
- machine-readable .csv file output from Plot Digitizer with the autotrace feature (278 data points in total):

X        Y
6.96537  4.00383
9.41672  6.33851
14.3473  7.34124
19.2881  7.01065
26.7057  5.68145
31.5093  23.3506
31.5195  22.0173
33.9962  21.0187
35.2028  24.686
40.1766  20.0221
……       ……
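Once a plot has been digitized, the .csv output can be loaded and screened before reuse. The following is a minimal sketch, not part of the original workflow: it assumes the file holds comma-separated X and Y columns with an optional header row, and the helper names `read_digitized_points` and `flag_jumps` as well as the jump threshold are hypothetical. The sample rows are modeled on the first values of the Plot Digitizer output, and the jump check is only one possible way to point a manual reviewer at places where the tracing may have picked up something other than the intended curve.

```python
import csv
import io


def read_digitized_points(csv_text):
    """Parse X,Y rows from digitizer CSV text into (x, y) floats sorted by x."""
    points = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and rows with fewer than two columns.
        if len(row) < 2:
            continue
        try:
            points.append((float(row[0]), float(row[1])))
        except ValueError:
            # Skip the header row or any row with non-numeric cells.
            continue
    points.sort(key=lambda p: p[0])
    return points


def flag_jumps(points, max_dy):
    """Return index pairs of consecutive points whose Y values differ by more than max_dy."""
    return [
        (i, i + 1)
        for i in range(len(points) - 1)
        if abs(points[i + 1][1] - points[i][1]) > max_dy
    ]


# Illustrative sample; the comma-separated layout is an assumption.
sample = """X,Y
6.96537,4.00383
9.41672,6.33851
14.3473,7.34124
19.2881,7.01065
26.7057,5.68145
31.5093,23.3506
"""

points = read_digitized_points(sample)
suspect = flag_jumps(points, max_dy=10.0)  # threshold chosen arbitrarily for the example
print(len(points), "points read; index pairs to review:", suspect)
```

A flagged pair is not necessarily an error; it only marks a spot worth comparing against the original figure.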

