Enabling Efficient Chinese Jiapu Information Extraction

Slides:



Advertisements
Similar presentations
Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.
Advertisements

By Professor Vasile AVRAM, PhD Informatics in Economy Department Academy of Economic Studies – Bucharest ROMANIA Defining Metrics to Automate the Quantitative.
Automating the Extraction of Genealogical Information from Historical Documents Aaron P. Stewart David W. Embley March 20, 2011.
© 2011 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Kiran Kaja | Accessibility Engineer Ensuring Accessibility in Document Conversion.
Computer Science Research for Family History and Genealogy David W. Embley Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan.
Scott N. Woodfield David W. Embley Stephen W. Liddle Brigham Young University.
1 Automating the Extraction of Genealogical Information from the Web GeneTIQS Troy Walker & David W. Embley Family History Technology Conference March.
Enabling Search for Facts and Implied Facts in Historical Documents David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer,
High-Level View of a Source-Centric Genealogical Model: “The Model with Four Boxes” Randy Wilson March 9, 2005.
1 CBioC: Collaborative Bio- Curation Chitta Baral Department of Computer Science and Engineering Arizona State University.
BYU Craigslist Alerter Oliver Nina, Meher Shaikh Andrew Zitzelberger.
ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.
1/20 Document Segmentation for Image Compression 27/10/2005 Emma Jonasson Supervisor: Dr. Peter Tischer.
Optical Character Recognition for Logistics Reporting Contributors: Joy Kamunyori, Mike Frost, Ashraf Islam A recording of the WebEx session can be found.
P OTENTIAL OCR S OFTWARE FOR N UTRITION F ACTS L ABELS Dennis Given.
October 29, Marla Roll Director Shannon Lavey Service Coordinator and Provider Allison Kidd Assistive Technology IT Coordinator Accessibility Specialist.
Objective 5.01: Understand database tables used in business Database Fundamentals.
Image Search SMX West 2009 Case Study. Image Search Facts.
A summary of the report written by W. Alink, R.A.F. Bhoedjang, P.A. Boncz, and A.P. de Vries.
Diego Saavedra Data Capture and Validation at Point of Collection.
Get more out of your OKI MFP
--Caesar Cat.  Write an optical character recognition application that identifies and recognizes printed text within an image.
Incident reporting in the Swedish Consumer Price Index Martin Kullendorff, Oxana Tarassiouk Statistics Sweden Economic Statistics Department, Price Statistics.
OCR and SALIX Parsing Daryl Lafferty Arizona State University October, 2012.
Accounting Automation Solutions. Is Bookkeeping Stressful and Costly? Inaccurate Data Low Productivity Inaccurate Data Low Productivity Storage Cost Resource.
Turning Audio Search and Speech Analytics into Business Intelligence.
Avoiding Segmentation in Multi-digit Numeral String Recognition by Combining Single and Two-digit Classifiers Trained without Negative Examples Dan Ciresan.
Image Workflow Processes Elspeth Haston, Robert Cubey, Martin Pullan & David J Harris.
Alison Mancusi February 12, 2011 Overview of Exalead.
Affordable Computerized Maintenance Management Solutions (CMMS) Gabi Miles Hach Company May 22, 2009.
Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.
“Automating Reasoning on Conceptual Schemas” in FamilySearch — A Large-Scale Reasoning Application David W. Embley Brigham Young University More questions.
FAMILYSEARCH INDEXING IS WORLDWIDE. INDEXING 1.WHAT IS INDEXING? - A PROCESS WHERE A PERSON CAN TRANSCRIBE DATA FROM A DIGITAL IMAGE WHICH IS THEN POSTED.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Cage: A Keyword.
(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents Elder David W. Embley.
Instruction For the Chinese vocabulary: Record your pronunciation of the character or word. Provide the English meaning. Use the mouse write the character.
Automatic Discovery and Processing of EEG Cohorts from Clinical Records Mission: Enable comparative research by automatically uncovering clinical knowledge.
OntoSoar: Soar Finds Facts in Text Peter Lindes, Deryle Lonsdale, David Embley Brigham Young University 33 rd Soar Workshop, June 2013 pl 6/6/201333rd.
“The Future of Family History Technology” in Academic Research FHTW15 – Panel David W. Embley.
Challenges and Opportunities. Common Pedigree Research Issues Record Linkage Data Standardization Efficient Data Access Expert Finding.
GreenFIE-HD: A “Green” Form-based Information Extraction Tool for Historical Documents Tae Woo Kim.
Artificial Intelligence, simulation and modelling.
OCR Nationals ICT – Unit 1 Task 2 Grade C Task Overview As part of your task organising the French trip, you need to find out about different options.
Melissadata.in. Your Partner for Global Address and Contact Data Quality Your Partner for Global Address and Contact Data Quality Since 1985, Melissa.
Automation Living in a Paper Oriented World and The Steps to Automation.
DIGITIZATION IN THEORY AND PRACTICE WEBSITE: Helen Nneka Okpala Presentation done at University of.
Enhance Zone Label OCR Text/ Bitmaps Text/ Bitmaps Database Full Text Index/ Search Full Text Index/ Search Index/ Search by Word ROI Pattern Index/ Search.
Donna Morrell, CTR NAACCR 2014 Annual Conference Ottawa, Ontario, Canada June 25, 2014 Using Scanners and Optical Character Recognition for Pathology Report.
Model Rockets By: David Bivens. Big question Does size change the height of how high it goes? Because rockets come in different sizes.
The Neural Engineering Data Consortium Mission: To focus the research community on a progression of research questions and to generate massive data sets.
Multiplication Timed Tests.
BIG DATA Initiative SMART SubstationBig Data Solution.
Optical Character Recognition
S.Rajeswari Head , Scientific Information Resource Division
Robotic Process Automation Training| RPA online Training at GoLogica
Database Fundamentals
Stephen W. Liddle, Deryle W. Lonsdale, and Scott N. Woodfield
Vision for an Automatically Constructed FH-WoK
Pragmatic Quality Assessment for Automatically Extracted Data
فصل3: ایجاد تغییر در فرهنگ سازمانی
ایجاد تغییر در فرهنگ سازمان
(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents Elder David W. Embley.
Extracting Full Names from Diverse and Noisy Scanned Document Images
Temple Ready within an Hour of Collection Capture
NEW FEATURES ON OUR WEBSITE
ListReader: Wrapper Induction for Lists in OCRed Documents
New Platform to Support Digital Humanities in the Czech Republic
Warmup Chapter P Review
Complete the family of four
Presentation transcript:

Enabling Efficient Chinese Jiapu Information Extraction Stephen W. Liddle BYU Information Systems Department Derek Dobson, David W. Embley, Chuck Liu FamilySearch

Chinese Jiapu: Clan Family Record Top right image left page: lineage with 5 generations (17, 18, 19, 20, 21), starting from the head (top right, in the large frame), with two sons and a third on the continuation page Bottom right image right page: continuation of top right page, with more sons Bottom right image left page: vital information, 3rd column from the right is the head, 4th is his wife 12,871,979 Jiapu images ~ half billion fact assertions

Click-Only (or at least Mostly) Extraction Tool COMET Click-Only (or at least Mostly) Extraction Tool

COMET with Jiapu Add additional blow up (not animated on) that shows the IME (Input Method Editor).

COMET with Jiapu Add additional blow up (not animated on) that shows the IME (Input Method Editor).

COMET with Jiapu Add additional blow up (not animated on) that shows the IME (Input Method Editor).

Structured Data Stored for Query and Search

Requirements for COMET to work well with Jiapu Good alignment Good OCR

Requirements for COMET to work well with Jiapu Good alignment Good OCR not OK OK

Requirements for COMET to work well with Jiapu Good alignment Good OCR four characters as one bad OCR ?? missing OCR many incorrect characters

OCR Resolution Currently insufficient: Acrobat Professional Tesseract Abbyy Is there a better OCR engine for Chinese? Can we do better with what we have? Image enhancement Resize glyphs Use an ensemble of OCR engines Train Tesseract for Jiapu peculiarities

Alignment Resolution Currently Engineering solution (but lots of work) Open source PDFBox: needs work for Acrobat Pro Tesseract & Abbyy interact incorrectly with PDFBox Engineering solution (but lots of work)

Automating Jiapu Data Extraction Future Work Automating Jiapu Data Extraction

Future Work Verify & Correct

Conclusions “Failed” experiment ? A way out Potential Alignment engineering OCR: find or do R&D Potential Click-Only (Mostly) Extraction – a win Semi-automatic extraction – a big win

Conclusions “Failed” experiment ? A way out Potential Alignment engineering OCR: find or do R&D Potential Click-Only (Mostly) Extraction – a win Semi-automatic extraction – a big win