© Copyright 2015 ABBYY ANALYSING OFFICIAL DOCUMENTS USING ABBYY COMPRENO TECHNOLOGY ABBYY HQ 2015 Anatoly Starostin Maria Stepanova.

Complexity of the task
Unstructured fields; structured fields
●Terms and Conditions: Amount of contract: $24,500; Warranty period: 90 days; Place of contract: New York; Date of effectiveness: June 30, 1997
●Vehicle data: Mileage: 10,230
●Buyer: Name: Susan McMillan; Address: 3122 Sixth Avenue, New York, N.Y.
●Title: BILL OF SALE – MOTOR VEHICLE
●Vehicle data: Make: Ford; Model: Mustang; Year: 1997; VIN: M15325VG1223
●Metadata: Author: Jack Smith; File name: Bill_vehicle.docx; Creation date: 2015/09/0
●Seller: Name: Joseph Torentino; Address: Queens Blvd., New York, N.Y.

Information Extraction using ABBYY Compreno (basic principles) Introduction

Information extraction task
«Ivan has been working in a Moscow store since 2010»
●Find all instances of the entities and facts of a specific domain ontology that are mentioned in the given natural-language text

ABBYY Compreno syntactic-semantic parser
●Language-independent representation
●Many different language phenomena are taken into account

ABBYY Compreno technology
Syntactic-semantic parser
●Based on universal (complete) language modeling
●Involves corpus-trained statistical models
●Analysis is complete: every word in the text is mapped onto some node in the language model
Information extraction
●Rule-based approach: rules are applied to fragments of syntactic-semantic trees
●Rules create data consistent with a certain domain ontology
●Domain ontologies are not complete: they only describe the parts of the world relevant to the customer
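The rule-based step can be illustrated with a small Python sketch. It is not the Compreno rule language: the `Node` structure, the semantic class names (`TO_WORK`, `PERSON`, `ORGANIZATION`), the slot names and the `Employment` fact are all hypothetical, chosen only to show a rule firing on a fragment of a tree and producing ontology-style data.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node of a toy syntactic-semantic tree (hypothetical structure)."""
    lemma: str
    sem_class: str                     # language-independent semantic class
    slot: str = "Root"                 # role of this node relative to its parent
    children: list[Node] = field(default_factory=list)

    def child(self, slot: str, sem_class: str) -> Optional[Node]:
        """Return the first child that fills `slot` and has class `sem_class`."""
        for c in self.children:
            if c.slot == slot and c.sem_class == sem_class:
                return c
        return None

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for c in self.children:
            yield from c.walk()


def employment_rule(node: Node) -> Optional[dict]:
    """Fire on a TO_WORK node with a PERSON agent and an ORGANIZATION locative."""
    if node.sem_class != "TO_WORK":
        return None
    person = node.child("Agent", "PERSON")
    employer = node.child("Locative", "ORGANIZATION")
    if person is None or employer is None:
        return None
    return {"fact": "Employment", "employee": person.lemma, "employer": employer.lemma}


def extract(root: Node, rules) -> list[dict]:
    """Apply every rule to every node and collect the facts that fire."""
    facts = []
    for node in root.walk():
        for rule in rules:
            fact = rule(node)
            if fact is not None:
                facts.append(fact)
    return facts


# Hand-built tree for a sentence like "Ivan works in a store".
tree = Node("work", "TO_WORK", children=[
    Node("Ivan", "PERSON", slot="Agent"),
    Node("store", "ORGANIZATION", slot="Locative"),
])
facts = extract(tree, [employment_rule])
```

Because the rule tests semantic classes and slots rather than surface words, the same rule would apply to any sentence whose tree has this shape, which is the property the slide attributes to working on a language-independent representation.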

REAL LIFE CASE 1: purchase and sale agreements

REAL LIFE CASE 1: purchase and sale agreements
●land, sq. m, Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, rubles
●non-residential building, sq. m, Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, rubles
●garage, sq. m, Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, rubles

ABBYY Compreno mechanisms to meet the challenges
●Syntactic-semantic parser
●Non-tree links (coreferential)
●Local identification rules
●Anaphoric rules (complex anaphora cases)

REAL LIFE CASE 2: extracting data from tables

Extracting data from tables ABBYY COMPRENO

Example of information extraction
●Input: a table embedded in natural-language text; its cells may also contain natural-language fragments

Example of information extraction
●Information extraction is consistent with a certain model (ontology)
●Model ontology: the basic concept is “Real Estate”, with several attributes

Example of information extraction
●Goal: to extract information on real estate: price, location, owner, number of rooms and so on
●The table below contains 3 real estate properties
Price, Location, Attributes
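A minimal Python sketch of such a model ontology follows. The slides do not show the actual ontology, so the class and field names are assumptions based only on the attributes listed above (price, location, owner, number of rooms). Every attribute is optional because a given table row may not mention all of them.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Money:
    amount: float
    currency: Optional[str] = None     # may come from the table head, not the cell


@dataclass
class Owner:
    name: str
    kind: str                          # "Person" or "Organization"


@dataclass
class RealEstate:
    """The basic concept of the toy ontology; field names are hypothetical."""
    location: Optional[str] = None
    price: Optional[Money] = None
    owner: Optional[Owner] = None
    rooms: Optional[int] = None

    def filled_attributes(self) -> list[str]:
        """Names of the attributes that extraction managed to fill."""
        return [name for name, value in vars(self).items() if value is not None]


# Invented example values, not taken from the presentation.
flat = RealEstate(location="New York", rooms=3)
filled = flat.filled_attributes()
```

Keeping `Money` and `Owner` as separate objects lets an extraction step create them first and attach them to a `RealEstate` instance later.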

Example of information extraction
●For some cases of entity extraction, information about the structure of the table is not necessary
Person, Organization

Example of information extraction
For the extraction of other objects the table structure is important
●Information in a whole table row corresponds to a single real estate property; information in different rows corresponds to different real estate properties
Real Estate property

Example of information extraction
For the extraction of other objects the table structure is important
●The Person/Organization in the first column is the owner of the real estate extracted from the same row
Real Estate

Example of information extraction
For the extraction of other objects the table structure is important
●The currency is in the head of the table (the “Currency” column); the specific values are in the cells below
Money, Currency

Table analysis pipeline
●Preliminary automatic tagging of a table
●Syntactic-semantic parsing of the cells
●Extraction of basic entities and facts (ordinary rules)
●Extraction of information with the help of rules for table parsing

Table analysis pipeline
●Preliminary automatic tagging of a table
●Automatic detection of the table heading
–Can be challenging: the head may consist of several lines. It is not necessarily formatted
–Various techniques are possible. We use a number of heuristics
●Table cells annotation
–Annotate the cells with row and column numbers
–Take into account the cases when several cells are joined in a single row/column
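The cell annotation step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the `Cell` type and its `rowspan`/`colspan` fields are hypothetical, and the heuristics for detecting the table heading are not shown. The sketch only demonstrates how joined cells can be expanded so that every grid position receives a row and column number.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1   # number of rows this cell covers
    colspan: int = 1   # number of columns this cell covers


def annotate(rows: list[list[Cell]]) -> dict[tuple[int, int], str]:
    """Map every (row, column) grid position to the text of the cell covering it."""
    grid: dict[tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already taken by a cell joined down from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell.text
            c += cell.colspan
    return grid


# Invented table with a two-line head: "Owner" spans two rows, "Property" two columns.
table = [
    [Cell("Owner", rowspan=2), Cell("Property", colspan=2)],
    [Cell("Location"), Cell("Price")],
    [Cell("J. Smith"), Cell("New York"), Cell("100")],
]
grid = annotate(table)
```

After annotation, position (1, 0) still resolves to "Owner" and position (0, 2) to "Property", so later steps can look up the full heading of any column even when the head consists of several lines.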

Table analysis pipeline
●Extraction of information with the help of rules for table parsing
●Auxiliary ontology which models tables
●The first layer of rules: extraction of objects in the head of the table
●The second layer of rules: extraction of objects in the body of the table. The rules have access to the objects extracted from the head of the table
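The two layers can be sketched in Python as below. This is a simplified stand-in: real rules work on parsed cells, whereas here layer 1 uses a hypothetical keyword table (`HEADER_ROLES`) to decide what each column describes, and layer 2 builds one object per body row using those decisions. All names and sample data are invented for illustration.

```python
from __future__ import annotations

# Hypothetical mapping from header keywords to ontology attributes.
HEADER_ROLES = {
    "owner": "owner",
    "location": "location",
    "address": "location",
    "price": "price",
    "currency": "currency",
}


def head_layer(header: list[str]) -> dict[int, str]:
    """Layer 1: decide which attribute each column describes from its header text."""
    roles: dict[int, str] = {}
    for col, text in enumerate(header):
        for keyword, role in HEADER_ROLES.items():
            if keyword in text.lower():
                roles[col] = role
                break
    return roles


def body_layer(rows: list[list[str]], roles: dict[int, str]) -> list[dict[str, str]]:
    """Layer 2: build one object per row, using the roles found in the head."""
    objects = []
    for row in rows:
        # Columns whose header was not recognised are ignored.
        obj = {roles[col]: text for col, text in enumerate(row) if col in roles}
        objects.append(obj)
    return objects


header = ["Owner", "Address", "Price", "Currency"]
rows = [
    ["J. Smith", "New York", "100", "USD"],
    ["Acme Ltd", "Boston", "250", "USD"],
]
objs = body_layer(rows, head_layer(header))
```

In this sketch layer 2 looks columns up by role rather than by position, so the same rows presented in a different column order, or with an extra unrecognised column, yield the same objects.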

Table analysis pipeline
●The rules for extraction of information from the table head (Layer 1):

Table analysis pipeline
●The rules for extraction of information from the table body (Layer 2):

Advantages of the approach
●The rules can be quite general and can handle paraphrasing
●The rules are language-independent and can analyse tables in different languages

Advantages of the approach
●The system can handle some changes in table structure (movement of columns, deletion of columns, new columns)

REAL LIFE CASE 3: bank statements

REAL LIFE CASE 4: CV’s
EDUCATION
Period of study | Name of educational institution | Faculty | Specialty | Grade point average
2000 to 2005 | Khabarovsk Institute of Infocommunications (branch) of the State Educational Institution “Siberian State University of Telecommunications and Information Sciences” | Communication networks and switching systems | Communications engineer | Diploma with honours
2006 to 2009 | Far Eastern State Transport University, Institute of Integrated Forms of Education | Management | Financial management | 4.3
Foreign language | Proficiency level
English | Intermediate II

Thank you for your attention Questions?

Contacts ABBYY Headquarters Phone: +7 (495) Fax: +7 (495) Web: