DataBase and Information System … on Web
The term information system refers to a system of people, data records and activities that process the data and information in an organization; it includes the organization's manual and automated processes. A database is a structured collection of records or data stored in a computer system. The structure is achieved by organizing the data according to a database model; the model in most common use today is the relational model.
Querying unstructured sources
Structured query over unstructured documents
Extract/Select/Annotate politicianNews
From
Where politicianNews(X,Y,Z), Z:politician(name:N), N=hillaryClinton
[Fill database uri]
This kind of query can be executed over a database or over unstructured documents; only the rewriting strategy changes.
Information Extraction and Annotation
Information extraction (IE) makes it possible to acquire information contained in unstructured documents and store it in structured form.
Turning the current Web into a Semantic Web requires automatic approaches for annotating existing data, since manual annotation approaches do not scale in general.
More scalable semi-automatic approaches, known from ontology learning, deal with the extraction of ontologies from texts (also in tabular form).
An ontology-based system for information extraction from semi-structured and unstructured Web documents
Motivations
Existing IE approaches mainly exploit the syntactic structure of information, not its actual semantics.
Much work on IE from HTML documents:
  There is no single winning approach.
  Extraction rules can identify tabular information only when such a structure is explicitly declared.
  The variability of HTML and the use of Cascading Style Sheets make classic HTML approaches not robust.
Too little work on IE from PDF documents:
  No ontology-based approaches.
  Existing table recognition and information extraction approaches have distinct scopes.
State of the Art: Existing Approaches and Systems
Manual approaches: TSIMMIS, Minerva, W4F, XWRAP, JEDI, FLORID
Supervised approaches: SRV, RAPIER, WHISK, WIEN, STALKER, SoftMealy, NoDoSe, DEByE, LixTo
Unsupervised approaches: STAVIES, DeLa, RoadRunner, EXALG, DEPTA
NLP-oriented systems: GATE, RAPIER, SRV, WHISK, TextRunner, SnowBall
PDF-oriented approaches: Flesca et al. (fuzzy system) 06, Gottlob et al. 06
Document Understanding techniques
PDF Document: the standard format for document publication, sharing and exchange
IE from Adobe Portable Document Format (PDF)
PDF is one of the most widespread unstructured document formats.
PDF documents are completely unstructured and their internal encoding is visualization-oriented.
The PDF document description language represents a PDF document as a collection of 2-dimensional typographic elements contained in content streams.
Traditional wrapping/IE systems cannot be applied.
Goals
Information extraction from documents by means of extraction rules that:
i. Exploit a human-oriented document representation: the 2-dimensional representation
ii. Exploit the semantics of the information represented in a Knowledge Base
iii. Directly populate (enrich) the Knowledge Base with the extracted information
iv. Handle both natural language and document structures (by exploiting an embedded table recognition approach)
v. Allow (semantic) annotation of unstructured sources, enabling semantic classification and search
Proposed Approach
To exploit the semantics represented in a Knowledge Base
To recognize information organized in both textual and tabular form
To store the extracted information directly in the Knowledge Base
2-Dimensional Document Representation
Semantics is given by position: for example, a value about Operating revenues, obtained in the year 2007.
Internal Document Representation: Input Document
2-Dimensional Document Representation: Document Portion
[Figure: the page laid out on X and Y axes with origin (0,0); a document portion with the coordinates (1,32) and (4,33) marked; the portioning process.]
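The idea of a document portion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how coordinates are interpreted, so the sketch treats (1,32) and (4,33) as two opposite corners of a rectangle on an integer grid. The names `Element`, `Portion` and `select`, and the sample page content, are hypothetical and not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A typographic element placed at grid coordinates (x, y). Hypothetical type."""
    x: int
    y: int
    text: str


@dataclass(frozen=True)
class Portion:
    """A rectangular document portion given by two opposite corners (assumption)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, e: Element) -> bool:
        # An element belongs to the portion if it lies inside the rectangle.
        return self.x0 <= e.x <= self.x1 and self.y0 <= e.y <= self.y1


def select(elements, portion):
    """Return the elements inside the portion, ordered by row (y), then column (x)."""
    inside = [e for e in elements if portion.contains(e)]
    return sorted(inside, key=lambda e: (e.y, e.x))


# Made-up page content, for illustration only.
page = [
    Element(1, 32, "Operating revenues"),
    Element(4, 32, "2007"),
    Element(2, 33, "1,234.50"),
    Element(6, 40, "footer"),
]

portion = Portion(1, 32, 4, 33)
texts = [e.text for e in select(page, portion)]
print(texts)
```

Selecting by position rather than by markup is what lets the same rule apply to formats, such as PDF, that carry no logical structure.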
Attribute Grammars
Example: math expression
E → [+ | −] T [ (+ | −) T ]*
T → F [ (* | /) F ]*
F → NUM | (E)
Each grammar symbol carries an attribute, and local attributes are used as aids (ris = result, segno = sign). The semantic actions then compute the value of the expression:
E → {double E.ris; int segno=1;} [+ | − {segno=−1;}] T1 {E.ris=segno*T1.ris;} [ (+ {segno=1;} | − {segno=−1;}) T2 {E.ris=E.ris+segno*T2.ris;} ]*
T → {double T.ris; int oper;} F1 {T.ris=F1.ris;} [ (* {oper=1;} | / {oper=2;}) F2 {T.ris=(oper==1)?T.ris*F2.ris : T.ris/F2.ris;} ]*
F → {double F.ris;} NUM {F.ris=NUM.val;} | (E) {F.ris=E.ris;}
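The attributed grammar above maps naturally onto a recursive-descent evaluator, where each nonterminal becomes a function whose return value plays the role of the synthesized attribute ris. The following Python sketch is one possible rendering, not the implementation used in the system; the tokenizer, the class name `Parser` and the function `evaluate` are assumptions made for the example, and the ASCII "-" stands in for the "−" sign of the grammar.

```python
import re

# A number, or any single non-space character treated as an operator/bracket.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|(.))")


def tokenize(text):
    tokens = []
    for number, op in TOKEN.findall(text):
        if number:
            tokens.append(("NUM", float(number)))
        elif op.strip():  # skip stray whitespace
            tokens.append((op, None))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_E(self):
        # E -> [+ | -] T [ (+ | -) T ]*
        segno = 1
        if self.peek() in ("+", "-"):
            if self.advance()[0] == "-":
                segno = -1
        ris = segno * self.parse_T()
        while self.peek() in ("+", "-"):
            segno = 1 if self.advance()[0] == "+" else -1
            ris = ris + segno * self.parse_T()
        return ris

    def parse_T(self):
        # T -> F [ (* | /) F ]*
        ris = self.parse_F()
        while self.peek() in ("*", "/"):
            oper = 1 if self.advance()[0] == "*" else 2
            f2 = self.parse_F()
            ris = ris * f2 if oper == 1 else ris / f2
        return ris

    def parse_F(self):
        # F -> NUM | (E)
        kind = self.peek()
        if kind == "NUM":
            return self.advance()[1]
        if kind == "(":
            self.advance()
            ris = self.parse_E()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return ris
        raise SyntaxError(f"unexpected token: {kind!r}")


def evaluate(text):
    parser = Parser(text)
    ris = parser.parse_E()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token: {parser.peek()!r}")
    return ris


print(evaluate("-2 + 3 * (4 - 1)"))
```

The local variables segno and oper mirror the local attributes of the grammar, and each assignment to ris corresponds to one semantic action.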
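Simple Extraction Patterns: regex
Recognize a float number (digits with an optional two-digit decimal part): \d+(\.\d{2})?
Mail address: (C|c)ittà (matches the Italian word "Città" or "città", meaning "city")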
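The two patterns can be tried directly with Python's standard `re` module. This is a small sketch; the sample strings are made up for illustration.

```python
import re

FLOAT = re.compile(r"\d+(\.\d{2})?")
CITY = re.compile(r"(C|c)ittà")

# fullmatch requires the whole string to fit the pattern.
whole_ok = FLOAT.fullmatch("12.50") is not None      # two decimals: accepted
integer_ok = FLOAT.fullmatch("12") is not None       # decimal part is optional
one_decimal = FLOAT.fullmatch("12.5") is not None    # one decimal: rejected

# finditer scans running text and returns every match in order.
numbers = [m.group(0) for m in FLOAT.finditer("Operating revenues 1250.75 in 2007")]

# The city pattern accepts both capitalizations of the first letter.
upper = CITY.search("Città: Roma").group(0)
lower = CITY.search("nome della città").group(0)

print(whole_ok, integer_ok, one_decimal, numbers, upper, lower)
```

Note that the float pattern accepts exactly two decimal digits or none, so it suits monetary amounts better than general floating-point numbers.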
Knowledge Representation Formalism
[Figure: class tables. place(ID, name); city(ID, name, population, inState), with the example object chicago: name "Chicago", inState illinois; cityClimate(city, koppenClimate).]
Self-Describing/Populating Ontology (SDO)
An SDO is an ontology in which objects and classes can be equipped with a set of rules called descriptors.
Descriptors are object-oriented grammatical rules that:
  Allow objects to be recognized in and extracted from documents, populating classes with the newly extracted objects
  Exploit the knowledge contained in the OOKB for the extraction
  Can exploit each other to describe more complex objects
Descriptors
Class descriptors that handle 2-D capabilities:
class weatherRecord( wCity:city, wWarns:warnings, Temp:temperature, wHumid:percentage, wPress:pressure, wDescr:weatherDescription, wWind:wind).
<weatherRecord(C,Wa,T,H,P,D,Wi)> ->
    <X:city()>{C:=X;}
    (<X:warnings()>{Wa:=X;})?
    <X:temperature()>{T:=X;}
    <X:percentage()>{H:=X}
    <X:pressure()>{P:=X;}
    <X:wind()>{Wi:=X;}
    2D-BOTH.
    <X:weatherDescription()>{D:=X;}
General or Domain Specific Knowledge
The system architecture
Attribute Transition Network (ATN) implemented as logic programs in the OntoDLP language
The system architecture Direct use of Chart Parsing Algorithms for AG parsing
The system architecture: 2-D matcher
Direct use of Chart Parsing Algorithms for AG parsing
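To give a feel for chart parsing, here is a minimal CYK recognizer in Python. The slides do not say which chart parsing algorithm the system uses, so CYK is chosen here only as a classic, compact instance of the family; attribute evaluation and the 2-D extensions are left out. The function `cyk` and the toy grammar (a record made of a city followed by a temperature and a percentage) are assumptions made for this example.

```python
from itertools import product


def cyk(tokens, unary, binary, start):
    """Recognize a token sequence with a grammar in Chomsky normal form.

    unary:  maps a terminal to the set of nonterminals A with A -> terminal
    binary: maps a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that derive tokens[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1] = set(unary.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((b, c), set())
    return start in chart[0][n]


# Toy grammar: R -> C M, M -> T H, with C, T, H produced by token types.
unary = {"city": {"C"}, "temperature": {"T"}, "percentage": {"H"}}
binary = {("C", "M"): {"R"}, ("T", "H"): {"M"}}

accepted = cyk(["city", "temperature", "percentage"], unary, binary, "R")
rejected = cyk(["city", "percentage", "temperature"], unary, binary, "R")
print(accepted, rejected)
```

The chart stores every partial analysis once, so sub-results are reused instead of being recomputed, which is the property that makes chart parsing attractive for ambiguous input.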