Integration of Information from Multiple Sources of Textual Data


1 Integration of Information from Multiple Sources of Textual Data
November 22, 1999 전 승 원

2 1. Introduction
- The number of information sources on the Internet is increasing exponentially.
- Information is highly heterogeneous, both in its structure and in its origin.
- Data types: textual data, images, sounds, etc.

3 Integration from Textual Data
- Goal: to organize data (often a huge amount) coming from multiple heterogeneous sources into easily accessible structures.
- A research topic for several research communities: databases, artificial intelligence, and information retrieval.
- Two scenarios: known sources and unknown sources.

4 The First Scenario (Known Sources)
- The sources of heterogeneous textual data are known.
- Widely investigated in the database area: decision support systems, integration of heterogeneous databases, and data warehouses.
- DARPA Intelligent Integration of Information (I3) research program.

5 The Second Scenario (Unknown Sources)
- The Information Discovery problem, arising mainly from the Internet explosion.
- In this scenario:
  - First: identify, among a huge number of sources of heterogeneous textual data, a small number of relevant sources.
  - Second: face the problem of scenario 1.
- Out of scope here.
- We will discuss problems and solutions of the extraction and integration of information from highly heterogeneous multiple sources of textual data in order to provide true information.

6 Heterogeneity
- Heterogeneous sources: databases (RDB, OODB, etc.), file systems, knowledge bases, digital libraries, information retrieval systems, and electronic mail systems.
- Structural and implementation heterogeneity: differences in hardware platforms, DBMSs, data models, and data languages.
- Semantic heterogeneity:
  - Different names are employed to represent the same information.
  - Different modeling constructs are used to represent the same piece of information in different sources.
- How to cope with heterogeneity: two fundamental approaches, structural and semantic.

7 Structural Approach
Characterized as follows:
- Self-describing model: each data item has an associated descriptive label, and there is no strong typing system.
- Semantic information is effectively encoded in rules, which perform the integration.
Arguments in favor of the Structural Approach:
- Flexibility, generality, and conciseness of a self-describing model, which make it a good candidate for the integration of widely heterogeneous and semi-structured information sources.
- A form of first-order logic language is provided, which allows the declarative specification of a mediator. (A mediator specification is a set of rules that defines the mediator view of the data, together with the set of functions that are invoked to translate objects from one format to another.)
- Useful when a client does not know in advance the structure of the objects of a source (traditional data models vs. conventional OO languages).
Schema-less property
The TSIMMIS Project

8 Semantic Approach
Characterized as follows:
- For each source, meta-data (a conceptual schema) must be available.
- Semantic information is encoded in the schema.
- Partial or total schema unification is performed.
- It adopts conventional OO data models.
Arguments in favor of the Semantic Approach:
- It allows us to organize extensional knowledge and to give a high-level, abstract view of information.
- It allows us to check the consistency of instances with respect to their descriptions, and thus to preserve the quality of data.
- It allows us to extract information efficiently through query optimization.
- A relevant effort has been devoted to developing OO standards: CORBA (object exchange among heterogeneous systems) and ODMG93 (object-oriented databases).
Schema nature
The MOMIS Project

9 Virtual Approach
- First proposed in multidatabase models in the early 1980s.
- More recently, developed using description logics.
- Conjunctive queries (select, project, join)
- Open World Assumption
- Top-down approach for the schema

10 2. The TSIMMIS Project
- The Stanford-IBM Manager of Multiple Information Sources
- Goal: to develop tools that facilitate the rapid integration of heterogeneous textual sources.

11 TSIMMIS Architecture
[Figure: architecture diagram. Labels: Application, Mediator, Mediator Generator, Wrapper Generator, Wrapper, Info.]

12 TSIMMIS Architecture (continued)
- Wrappers (translators) and mediators
- Common model: OEM (Object Exchange Model)
- Query languages: OEM-QL and MSL (Mediator Specification Language)
- Possible bottlenecks:
  - An ad hoc translator must be developed for each information source.
  - Implementing a mediator can be complicated and time-consuming.
- Important goals: to provide a translator generator and a mediator generator that automatically or semi-automatically generate translators and mediators.

13 The OEM Model and the MSL Language
OEM model: a self-describing model
<ob1: person, set, {sub1, sub2, sub3, sub4, sub5}>
<sub1: last_name, str, ‘Smith’>
<sub2: first_name, str, ‘John’>
<sub3: role, str, ‘faculty’>
<sub4: department, str, ‘cs’>
<sub5: telephone, str, ‘ ’>
MSL language
- A first-order logic language that allows the declarative specification of mediators.
- Rule: head :- tail
  - head: the pattern of the top-level integrated object supported by the mediator
  - tail: the pattern of the object to be fetched from the source
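The OEM listing above can be mirrored in a few lines of Python. This is only an illustrative sketch: the class `OEMObject` and the helper `to_nested` are hypothetical names, not part of TSIMMIS, and the telephone sub-object is left out because its value is not given on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class OEMObject:
    oid: str                      # object identifier, e.g. "ob1"
    label: str                    # descriptive label, e.g. "person"
    type: str                     # "set" or an atomic type such as "str"
    value: Union[str, List[str]]  # atomic value, or the ids of the sub-objects


def to_nested(oid: str, store: Dict[str, OEMObject]):
    """Resolve sub-object ids recursively into a {label: value} structure.

    Simplification: sub-objects that share a label would overwrite each other.
    """
    obj = store[oid]
    if obj.type == "set":
        return {store[sub].label: to_nested(sub, store) for sub in obj.value}
    return obj.value


objects = [
    OEMObject("ob1", "person", "set", ["sub1", "sub2", "sub3", "sub4"]),
    OEMObject("sub1", "last_name", "str", "Smith"),
    OEMObject("sub2", "first_name", "str", "John"),
    OEMObject("sub3", "role", "str", "faculty"),
    OEMObject("sub4", "department", "str", "cs"),
]
store = {o.oid: o for o in objects}
person = to_nested("ob1", store)
print(person)
```

Because every object carries its own label and type, a client can walk the structure without knowing a schema in advance, which is the point of a self-describing model.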

14 The TSIMMIS Wrapper Generator
- OEM Support Libraries: to quickly implement wrappers, mediators, and end-user interfaces
- The architecture of wrappers:
  - Client Support Library (the client is either a mediator or an application)
  - Server Support Library (the server is either a translator or a mediator)
  - Converter: MSL → native query, using QDTL (Query Description and Translation Language)
  - Extractor
  - Packager
  - Filter Processor

15 TSIMMIS Wrapper
[Figure: wrapper diagram. Labels: Client, Client Support Library, Server Support Library, Filter Processor, QDTL, Converter, Driver, Packager, Extractor, DEX, Information Source.]

16 Converter and QDTL
Example: WHOIS information source
> lookup -ln ‘ss’
> lookup -ln ‘ss’ -fn ‘ff’
QDTL (Query Description and Translation Language)
D1:
(QT1.1) Query ::= *0 :- <0 person {<last_name $LN>}>
(AC1.1) {printf (lookup_query, ‘lookup -ln %s’, $LN);}
(QT2.2) Query ::= *0 :- <0 person {<last_name $LN> <first_name $FN>}>
(AC2.2) {printf (lookup_query, ‘lookup -ln %s -fn %s’, $LN, $FN);}
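The pairing of a query template with an action can be imitated in Python as a rough sketch. It is not QDTL itself: the names `TEMPLATES` and `convert` are hypothetical, and a query is represented simply as a label plus a dictionary of attribute conditions.

```python
from typing import Dict, List, Optional, Tuple

# Each entry pairs the attributes a query must bind (the template) with a
# format string for the native command (the action).
TEMPLATES: List[Tuple[Tuple[str, ...], str]] = [
    (("last_name",), "lookup -ln '{last_name}'"),
    (("last_name", "first_name"), "lookup -ln '{last_name}' -fn '{first_name}'"),
]


def convert(label: str, conditions: Dict[str, str]) -> Optional[str]:
    """Return the native WHOIS command for a directly supported query."""
    if label != "person":
        return None
    for attrs, action in TEMPLATES:
        if set(attrs) == set(conditions):
            return action.format(**conditions)
    return None  # no template has this query shape


q1 = convert("person", {"last_name": "ss"})
q2 = convert("person", {"last_name": "ss", "first_name": "ff"})
q3 = convert("person", {"role": "student"})
print(q1)
print(q2)
print(q3)
```

Only queries whose shape matches a template exactly are translated here; the third query has no matching template and yields `None`.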

17 Converter and QDTL (continued)
The Converter exploits each template to describe many more queries:
- Directly supported queries: queries with a syntax analogous to the template.
- Logically supported queries: a query q is logically supported by a template t if q is logically equivalent to, or subsumed by, a query q’ directly supported by t.
- Indirectly supported queries: a query q is indirectly supported by a template t if q can be decomposed into a query q’ directly supported by t and a filter that is applied to the results of q’.
Example:
(Q6) *Q :- <Q person {<last_name ‘Smith’> <role ‘student’>}>
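Indirect support can be sketched in Python using query (Q6) as the input. The function names `decompose` and `apply_filter`, the dictionary representation of queries, and the two fetched records are all assumptions made for illustration; the actual Converter works on MSL queries.

```python
# Attribute sets that the source can evaluate natively (one per template).
SUPPORTED = [("last_name",), ("last_name", "first_name")]


def decompose(conditions):
    """Split a query into (natively evaluated conditions, residual filter)."""
    best = None
    for attrs in SUPPORTED:
        if all(a in conditions for a in attrs):
            # Prefer the template that covers the most conditions.
            if best is None or len(attrs) > len(best):
                best = attrs
    if best is None:
        raise ValueError("query is not supported by any template")
    native = {a: conditions[a] for a in best}
    residual = {a: v for a, v in conditions.items() if a not in best}
    return native, residual


def apply_filter(objects, residual):
    """Keep only the fetched objects that satisfy the residual conditions."""
    return [o for o in objects if all(o.get(a) == v for a, v in residual.items())]


# (Q6): persons with last_name 'Smith' and role 'student'
native, residual = decompose({"last_name": "Smith", "role": "student"})

# Hypothetical objects returned by the source for the native part of the query.
fetched = [
    {"last_name": "Smith", "first_name": "John", "role": "faculty"},
    {"last_name": "Smith", "first_name": "Ann", "role": "student"},
]
answer = apply_filter(fetched, residual)
print(native, residual, answer)
```

The source only sees the `last_name` condition; the `role` condition is checked afterwards on the returned objects, which is the role the slides assign to the Filter Processor.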

18 Extractor, DEX Templates, and Filter Processor
- Extractor
  - Input: a query result expressed in an unstructured format, and DEX templates
- Packager
  - Output: a set of OEM objects
- Filter Processor
  - The filter is an MSL query built by the Converter.

19 The TSIMMIS Mediator Generator
The MedMaker system: the TSIMMIS component developed for declaratively specifying mediators
Example
CS objects in OEM:
<&e1, employee, set, {&f1, &l1, &t1, &rep1}>
<&f1, first_name, string, ‘Joe’>
<&l1, last_name, string, ‘Chung’>
<&t1, title, string, ‘professor’>
<&rep1, reports_to, string, ‘John Hennessy’>
WHOIS objects in OEM:
<&p1, person, set, {&n1, &d1, &rel1, &elem1}>
<&n1, name, string, ‘Joe Chung’>
<&d1, dept, string, ‘cs’>
<&rel1, relation, string, ‘employee’>
<&elem1, e_mail, string,

20 The TSIMMIS Mediator Generator (continued)
An object exported by ‘MED’:
<&cp1, cs_person, set, {&mn1, &mrel1, &t1, &rep1, &elem1}>
<&mn1, name, string, ‘Joe Chung’>
<&mrel1, relation, string, ‘employee’>
<&t1, title, string, ‘professor’>
<&rep1, reports_to, string, ‘John Hennessy’>
<&elem1, e_mail, string,
Rules of ‘MED’:
(MS1) Rules:
<cs_person {<name N> <rel R> Rest1 Rest2}> :-
    <person {<name N> <dept ‘cs’> <relation R> | Rest1}> @whois
    AND decomp(N, LN, FN)
    AND <R {<first_name FN> <last_name LN> |
External:
decomp(string, string, string) (bound, free, free) impl by name_to_lnfn
decomp(string, string, string) (free, bound, bound) impl by lnfn_to_name
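The effect of rule (MS1) on the example objects can be approximated in Python. This is a sketch under stated assumptions: the last pattern of the rule is cut off in the transcript, so the code assumes it is matched against the CS source and binds the remaining sub-objects to Rest2; the function `med_rule`, the dictionary encoding of OEM objects, and the placeholder e-mail value are hypothetical.

```python
def name_to_lnfn(name):
    """decomp with (bound, free, free): split 'First Last' into (last, first)."""
    first, last = name.split(" ", 1)
    return last, first


def med_rule(whois_objects, cs_objects):
    """Fuse WHOIS persons of department 'cs' with matching CS-source objects."""
    results = []
    for person in whois_objects:
        if person.get("dept") != "cs":
            continue
        ln, fn = name_to_lnfn(person["name"])
        rel = person["relation"]
        # Rest1: every WHOIS sub-object not named explicitly in the pattern.
        rest1 = {k: v for k, v in person.items()
                 if k not in ("name", "dept", "relation")}
        # Assumption: objects labeled R are looked up in the CS source.
        for obj in cs_objects.get(rel, []):
            if obj.get("first_name") == fn and obj.get("last_name") == ln:
                # Rest2 (assumed): the remaining CS sub-objects.
                rest2 = {k: v for k, v in obj.items()
                         if k not in ("first_name", "last_name")}
                results.append({"name": person["name"], "rel": rel,
                                **rest1, **rest2})
    return results


whois = [{"name": "Joe Chung", "dept": "cs", "relation": "employee",
          "e_mail": "<elided in source>"}]  # placeholder, value not in the slides
cs = {"employee": [{"first_name": "Joe", "last_name": "Chung",
                    "title": "professor", "reports_to": "John Hennessy"}]}

cs_persons = med_rule(whois, cs)
print(cs_persons)
```

The resulting object carries the name and relation from WHOIS, the title and reports_to values from the CS source, and the e-mail sub-object passed through Rest1, which matches the shape of the object exported by ‘MED’ above.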

21 Architecture and Implementation of MSI
- The Mediator Specification Interpreter (MSI) processes a query on the basis of the rules expressed in MSL.
- Three modules:
  - VE&AO (View Expander and Algebraic Optimizer): builds the logical datamerge program
  - Cost-based optimizer: builds the physical datamerge program (execution plan)
  - Datamerge engine

22 3. The MOMIS Project
- Mediator envirOnment for Multiple Information Sources
- Goal: to allow a user to pose a single query and to receive a single unified answer
- “Read-only views”
- Common data model and language: ODMI3 and ODLI3, a subset of the corresponding ODMG93 ODM and ODL
- olcd: Object Language with Complements and Descriptive cycles
- ODB-Tools
- GARLIC, SIMS

23 Information Integration in MOMIS
- Extraction and analysis process
  - derives a Common Thesaurus of terminological relationships
  - constructs clusters
- olcd description logics
  - to set up the Thesaurus by inferring relationships
  - to optimize the queries against the global schema
- Unification process
  - builds an integrated global schema
- Hierarchical clustering
  - allows automated identification of ODLI3 classes

24 Architecture
- Wrappers
  - lie above each source.
  - are responsible for translating the structure of the data source into the common ODLI3 and for translating OQLI3 queries into local requests.
- Mediators
  - lie above the wrappers.
  - are software modules that combine, integrate, and refine the ODLI3 schemata received from the wrappers.
  - generate the OQLI3 queries for the wrappers.
- ODB-Tools
- ARTEMIS

25 ODLI3
- A data description language used for communication between wrappers and mediator engines.
- Based on ODL; adds features for the intelligent information integration system:
  - if-then rules
  - mapping rules
- A source-independent language

26 Phases of Intelligent Schema Integration
- Generation of a Common Thesaurus
  - Terminological relationships are derived in a semi-automatic way by analyzing the structure and context of classes in the schema.
  - SYN (Synonym), BT (Broader Term), NT (Narrower Term), RT (Related Term)
  - ODB-Tools and olcd
- Affinity analysis of ODLI3 classes
  - to evaluate the level of affinity between classes, both within a source and across sources.
- Clustering ODLI3 classes
- Generation of the mediator global schema
  - A class is defined for each cluster.

27 Example
University source (S1):
Research_Staff(first_name, last_name, relation, , dept_code, selection_code)
School_Member(first_name, last_name, faculty, year)
Department(dept_name, dept_code, budget, dept_area)
Section(section_name, section_code, length, room_code)
Room(room_code, seats_number, notes)
Computer_Science source (S2):
CS_Person(name)
Professor:CS_Person(title, belongs_to:Division, rank)
Student:CS_Person(year, takes:set<Course>, rank)
Division(description, address:Location, fund, sector, employee_nr)
Location(city, street, number, county)
Course(course_name, taught_by:Professor)
Tax_Position source (S3):
University_Student(name, student_code, faculty_name, tax_fee)

28 Example (continued)
[Figure: affinity tree over the classes Location, Room, Division, Department, Research_Staff, Section, Course, CS_Person, Professor, University_Student, School_Member, and Student. Node values shown: 0.25, 0.35, 0.35, 0.39, 0.73, 0.66, 0.54, 0.6, 0.6, 0.6, 0.65.]

29 Global Class Specification in ODLI3
Interface University_Person
( extent Research_Staffers, School_Members, CS_Persons,
         Professors, Students, University_Students
  key name )
{ attribute string name
    mapping_rule (University.Research_Staff.first_name and
                  University.Research_Staff.last_name),
                 (University.School_Member.first_name and
                  University.School_Member.last_name),
                 ……
                 Tax_Position.University_Student.name;
  attribute string rank
    mapping_rule University.Research_Staff = ‘Professor’,
                 University.School_Member = ‘Student’,
                 … }
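How such mapping rules might be applied to local records can be sketched in Python. Several points are assumptions, not taken from the slides: the `and` rule is read as concatenating first and last name with a space, the `= ‘Professor’` form is read as a default value for records of that local class, and the function `university_person` and the sample records are hypothetical. Rules elided with "…" in the specification are not filled in.

```python
# Default values for the global attribute `rank`, per (source, local class).
DEFAULT_RANK = {
    ("University", "Research_Staff"): "Professor",
    ("University", "School_Member"): "Student",
}

# Local classes whose name is split into first_name and last_name.
SPLIT_NAME = {
    ("University", "Research_Staff"),
    ("University", "School_Member"),
}


def university_person(source, cls, record):
    """Map one local record to the global class University_Person."""
    key = (source, cls)
    if key in SPLIT_NAME:
        # `and` mapping rule, read here as concatenation (assumption).
        name = f"{record['first_name']} {record['last_name']}"
    elif key == ("Tax_Position", "University_Student"):
        name = record["name"]
    else:
        raise KeyError(f"no mapping rule sketched for {source}.{cls}")
    # No rank rule is visible for some classes, so rank may be None.
    return {"name": name, "rank": DEFAULT_RANK.get(key)}


p1 = university_person("University", "Research_Staff",
                       {"first_name": "Joe", "last_name": "Chung"})
p2 = university_person("Tax_Position", "University_Student",
                       {"name": "Ann Lee", "student_code": "s1"})
print(p1)
print(p2)
```

The point of the sketch is that the global class hides where each attribute comes from: a query on `name` is answered uniformly even though two sources store it as two fields and one stores it as a single field.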

30 The Global Schema Builder
- SIM1 (Schemata Integrator Module, first version)
  - reads the local schemata descriptions to derive the Common Thesaurus,
  - interacting with the user and using description logics (supported by ODB-Tools).
- ARTEMIS
  - Affinity coefficients are computed between all the pairs of local classes to be integrated.
  - Similar classes are grouped together using clustering techniques.
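The grouping step can be illustrated with a small agglomerative clustering sketch in Python. It is not the ARTEMIS procedure: the slides do not give the formula for the affinity coefficients, so the values in `AFFINITY` are invented for illustration, and the single-linkage merge criterion, the threshold, and the function `cluster` are assumptions.

```python
from itertools import combinations

# Hypothetical pairwise affinity coefficients (unlisted pairs count as 0.0).
AFFINITY = {
    frozenset(("Research_Staff", "Professor")): 0.7,
    frozenset(("School_Member", "Student")): 0.65,
    frozenset(("Student", "University_Student")): 0.6,
    frozenset(("Department", "Division")): 0.66,
    frozenset(("Professor", "Student")): 0.3,
}


def cluster(classes, affinity, threshold):
    """Repeatedly merge the two clusters with the highest affinity until
    no pair of clusters reaches the threshold."""
    clusters = [frozenset([c]) for c in classes]

    def aff(a, b):
        # Cluster-to-cluster affinity: best pairwise value (single linkage).
        return max(affinity.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: aff(*pair))
        if aff(a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


classes = ["Research_Staff", "Professor", "School_Member", "Student",
           "University_Student", "Department", "Division"]
result = cluster(classes, AFFINITY, threshold=0.5)
for group in result:
    print(sorted(group))
```

Each resulting cluster is a candidate for one global class in the mediator schema; lowering the threshold produces fewer, broader global classes.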

31 Description Logics and ODB-Tools
- DL (Description Logics language)
  - also known as Concept Languages or Terminological Logics
  - reasoning techniques
- CODM (Complex Object Data Model)
- The expressiveness gave rise to new problems.
- odl (Object Description Logics)
- olcd
  - included in ODB-Tools.
- ODB-Tools
  - Two modules: ODB-Designer and ODB-QOptimizer

32 Discussion
- The TSIMMIS system
  - structural approach
  - Drawbacks:
    - inefficient retrieval of the data to be integrated
    - inability to answer queries that are not predefined
- The MOMIS system
  - semantic approach
  - generation of the global schema for the mediator is a semi-automated process
- More points (not mentioned):
  - query decomposition and optimization
  - object fusion in mediator systems
  - integration of semi-structured data

