Published by Gladys Bryan. Modified over 8 years ago.
1
ANALYSING OFFICIAL DOCUMENTS USING ABBYY COMPRENO TECHNOLOGY
Anatoly Starostin, Maria Stepanova
ABBYY HQ, 2015
© Copyright 2015 ABBYY
2
Complexity of the task
Unstructured fields / Structured fields
Title: BILL OF SALE – MOTOR VEHICLE
Metadata:
  Author: Jack Smith
  File name: Bill_vehicle.docx
  Creation date: 2015/09/0
Seller:
  Name: Joseph Torentino
  Address: 12213 Queens Blvd., New York, N.Y.
Buyer:
  Name: Susan McMillan
  Address: 3122 Sixth Avenue, New York, N.Y.
Vehicle data:
  Make: Ford
  Model: Mustang
  Year: 1997
  VIN: M15325VG1223
Vehicle data:
  Mileage: 10,230
Terms and Conditions:
  Amount of contract: $24,500
  Warranty period: 90 days
  Place of contract: New York
  Date of effectiveness: June 30, 1997
3
Introduction
Information Extraction using ABBYY Compreno (basic principles)
4
Information extraction task
●Find all instances of entities and facts from a specific domain ontology that are mentioned in a given natural-language text
Example: "Ivan has been working in a Moscow shop since 2010"
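To make the task concrete, the result for the example sentence can be pictured as a handful of typed objects. This is only a minimal sketch: the class names `Person`, `Organization`, and `Employment`, their fields, and the function `expected_extraction` are assumptions invented for illustration, not the ontology or API of ABBYY Compreno.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical ontology classes, invented for this illustration only.
@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Organization:
    kind: str
    city: Optional[str] = None


@dataclass(frozen=True)
class Employment:
    employee: Person
    employer: Organization
    start_year: Optional[int] = None


def expected_extraction():
    """Objects a system might return for the example sentence about Ivan."""
    ivan = Person(name="Ivan")
    shop = Organization(kind="shop", city="Moscow")
    fact = Employment(employee=ivan, employer=shop, start_year=2010)
    return [ivan, shop, fact]
```

The two entities and the fact that links them are separate objects, so the same person or organization can take part in several facts.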
5
ABBYY Compreno syntactic-semantic parser
●Language-independent representation
●Many different language phenomena are taken into account
6
ABBYY Compreno technology
●Syntactic-semantic parser
  –Based on universal (complete) language modeling
  –Involves corpus-trained statistical models
  –Analysis is complete: every word in the text is mapped onto some node in the language model
●Information extraction
  –Rule-based approach: rules are applied to fragments of syntactic-semantic trees
  –Rules create data consistent with a certain domain ontology
  –Domain ontologies are not complete: they describe only the parts of the world relevant to the customer
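The idea of applying a rule to a fragment of a syntactic-semantic tree can be sketched with a toy example. Everything here is an assumption made for illustration: the `Node` structure, the class label `TO_WORK`, and the role names `Agent` and `Locative` are invented and do not reproduce the Compreno tree format or rule language.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for a node of a syntactic-semantic tree."""
    lemma: str
    semantic_class: str          # hypothetical class label, e.g. "TO_WORK"
    role: str = "ROOT"           # hypothetical role relative to the parent
    children: List["Node"] = field(default_factory=list)

    def child(self, role: str) -> Optional["Node"]:
        """Return the first child with the given role, if any."""
        return next((c for c in self.children if c.role == role), None)


def employment_rule(node: Node) -> Optional[dict]:
    """Fire on a 'work' node that has both an agent and a locative child."""
    if node.semantic_class != "TO_WORK":
        return None
    agent, place = node.child("Agent"), node.child("Locative")
    if agent is None or place is None:
        return None
    return {"type": "Employment", "employee": agent.lemma, "employer": place.lemma}


def apply_rules(root: Node) -> List[dict]:
    """Walk the whole tree and collect every fact produced by the rule."""
    facts, stack = [], [root]
    while stack:
        node = stack.pop()
        fact = employment_rule(node)
        if fact:
            facts.append(fact)
        stack.extend(node.children)
    return facts


tree = Node("work", "TO_WORK", children=[
    Node("Ivan", "PERSON_NAME", role="Agent"),
    Node("shop", "SHOP", role="Locative"),
])
```

Because the toy rule tests semantic classes and roles rather than surface words, the same rule would fire on any sentence whose parse yields the same fragment, whatever the wording.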
7
REAL LIFE CASE 1: purchase and sale agreements
10
REAL LIFE CASE 1: purchase and sale agreements
11
REAL LIFE CASE 1: purchase and sale agreements
land | 15552 sq. m | Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, 20 | 1 000 000 rubles
non-residential building | 1273.67 sq. m | Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, 20 | 505 000 rubles
garage | 854.46 sq. m | Russian Federation, Belgorod region, Graivoronsky district, village Bubaykovo, Lenin street, 20 | 500 000 rubles
12
ABBYY Compreno mechanisms to meet the challenges
●Syntactic-semantic parser
●Non-tree links (coreferential)
●Local identification rules
●Anaphoric rules (complex anaphora cases)
13
REAL LIFE CASE 2: extracting data from tables
14
Extracting data from tables ABBYY COMPRENO
15
Example of information extraction
●Input: a table embedded in natural-language text; its cells may also contain natural-language fragments
16
Example of information extraction
●Information extraction is consistent with a certain model (ontology)
●Model ontology: the basic concept is “Real Estate”, with several attributes
17
Example of information extraction
●Goal: to extract information on real estate (price, location, owner, number of rooms, and so on)
●The table on the slide contains 3 real estate properties
[Figure: table with callouts Price, Location, Attributes]
18
Example of information extraction
For some cases of entity extraction, information on the structure of the table is not necessary
[Figure: table with callouts Person, Organization]
19
Example of information extraction
For the extraction of other objects, the table structure is important
●Information in a whole table row corresponds to a single real estate property; information in different rows corresponds to different real estate properties
[Figure: table with callout Real Estate property]
20
Example of information extraction
For the extraction of other objects, the table structure is important
●The Person/Organization in the first column is the owner of the real estate extracted from the same row
[Figure: table with callout Real Estate]
21
Example of information extraction
For the extraction of other objects, the table structure is important
●The currency is given in the head of the table (the “Currency” column); the specific values are in the cells below
[Figure: table with callouts Money, Currency]
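The dependence on table structure can be illustrated for the Money case. The sketch below assumes one possible reading of the slide: a column headed "Currency" holds the currency of each row, and a column headed "Price" holds the amount. The sample table, the `Money` class, and the function `extract_money` are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Money:
    amount: float
    currency: str


def extract_money(table: List[List[str]]) -> List[Money]:
    """Pair each row's amount with the value found under the 'Currency' header."""
    head, body = table[0], table[1:]
    normalized = [h.strip().lower() for h in head]
    price_col = normalized.index("price")        # assumed header wording
    currency_col = normalized.index("currency")  # header named on the slide
    result = []
    for row in body:
        # Drop thousands separators before converting the amount.
        amount = float(row[price_col].replace(" ", "").replace(",", ""))
        result.append(Money(amount, row[currency_col].strip()))
    return result


# Invented sample data.
table = [
    ["Owner", "Price", "Currency"],
    ["J. Smith", "120 000", "USD"],
    ["Acme Ltd", "95,500", "EUR"],
]
```

A bare number such as "120 000" cannot be read as a Money object on its own; it only becomes one once the head cell tells the rule which column supplies the currency.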
22
Table analysis pipeline
●Preliminary automatic tagging of a table
●Syntactic-semantic parsing of the cells
●Extraction of basic entities and facts (ordinary rules)
●Extraction of information with the help of rules for table parsing
23
Table analysis pipeline
●Preliminary automatic tagging of a table
  ●Automatic detection of the table heading
    –Can be challenging: the head may span several lines and is not necessarily marked by formatting
    –Various techniques are possible; we use a number of heuristics
  ●Table cell annotation
    –Annotate the cells with row and column numbers
    –Take into account cases where several cells are merged within a row or column
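The cell annotation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the tagging used in ABBYY Compreno: the input is assumed to arrive as rows of cells that carry HTML-like `rowspan` and `colspan` values, and the names `RawCell`, `AnnotatedCell`, and `annotate` are hypothetical. Head detection is not covered.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class RawCell:
    text: str
    rowspan: int = 1
    colspan: int = 1


@dataclass
class AnnotatedCell:
    text: str
    row: int
    col: int
    rowspan: int
    colspan: int


def annotate(rows: List[List[RawCell]]) -> List[AnnotatedCell]:
    """Assign grid coordinates to cells, skipping positions covered by merged cells."""
    occupied: Set[Tuple[int, int]] = set()
    out: List[AnnotatedCell] = []
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in occupied:   # position taken by a cell merged from above
                c += 1
            out.append(AnnotatedCell(cell.text, r, c, cell.rowspan, cell.colspan))
            # Mark every grid position that this (possibly merged) cell covers.
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    occupied.add((r + dr, c + dc))
            c += cell.colspan
    return out


# Invented sample: a two-line head with merged cells, then one body row.
rows = [
    [RawCell("Owner", rowspan=2), RawCell("Price", colspan=2)],
    [RawCell("Amount"), RawCell("Currency")],
    [RawCell("J. Smith"), RawCell("120 000"), RawCell("USD")],
]
```

With these coordinates, a later rule can tell that "USD" sits in the same column as the "Currency" head cell even though the head spans two lines.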
24
Table analysis pipeline
●Extraction of information with the help of rules for table parsing
  ●An auxiliary ontology models tables
  ●The first layer of rules extracts objects from the head of the table
  ●The second layer of rules extracts objects from the body of the table; these rules have access to the objects extracted from the head
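The two layers can be sketched in a few lines. This is a simplified illustration, not the actual rule language: real rules operate on the parsed content of cells, whereas this sketch substitutes a plain string lookup. The vocabulary `HEAD_CONCEPTS`, the function names, and the sample tables used to exercise it are assumptions.

```python
from typing import Dict, List

# Hypothetical vocabulary: header wordings mapped to ontology attributes.
HEAD_CONCEPTS = {
    "owner": "owner", "proprietor": "owner",
    "price": "price", "cost": "price",
    "location": "location", "address": "location",
}


def extract_head(head: List[str]) -> Dict[int, str]:
    """Layer 1: map column index -> ontology attribute named in the head cell."""
    columns = {}
    for index, text in enumerate(head):
        concept = HEAD_CONCEPTS.get(text.strip().lower())
        if concept is not None:          # unknown columns are simply ignored
            columns[index] = concept
    return columns


def extract_body(body: List[List[str]], columns: Dict[int, str]) -> List[dict]:
    """Layer 2: build one Real Estate object per row, using the Layer 1 result."""
    objects = []
    for row in body:
        obj = {"type": "RealEstate"}
        for index, attribute in columns.items():
            if index < len(row):
                obj[attribute] = row[index].strip()
        objects.append(obj)
    return objects


def extract_table(table: List[List[str]]) -> List[dict]:
    """Run both layers; the first row is assumed to be the head."""
    return extract_body(table[1:], extract_head(table[0]))
```

Since the body rules consult the head objects rather than fixed column positions, a table with reordered, renamed, or extra columns produces the same Real Estate objects, provided the head wording is recognised.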
25
Table analysis pipeline
●The rules for extraction of information from the table head (Layer 1):
26
Table analysis pipeline
●The rules for extraction of information from the table head (Layer 1):
27
Table analysis pipeline
●The rules for extraction of information from the table body (Layer 2):
28
Advantages of the approach
●The rules can be quite general and can handle paraphrasing
●The rules are language-independent and can analyse tables in different languages
29
Advantages of the approach
●The system can handle some changes in table structure (movement of columns, deletion of columns, new columns)
30
REAL LIFE CASE 3: bank statements
31
REAL LIFE CASE 3: bank statements
32
REAL LIFE CASE 4: CVs
EDUCATION
Period of study | Educational institution | Faculty | Speciality | Average grade
2000 to 2005 | Khabarovsk Institute of Infocommunications (branch) of the State General Education Institution "Siberian State University of Telecommunications and Informatics" | Communication networks and switching systems | Communications engineer | Diploma with honours
2006 to 2009 | Far Eastern State Transport University, Institute of Integrated Forms of Education | Management | Financial management | 4.3
Foreign language | Proficiency level
English | Intermediate II
33
REAL LIFE CASE 4: CVs
34
Thank you for your attention
Questions?
35
Contacts
ABBYY Headquarters
Phone: +7 (495) 783 3700
Fax: +7 (495) 783 2663
E-mail: office@abbyy.com
Web: www.abbyy.com