Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.

Presentation transcript:

Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM 2005)

Motivation Internet Where is the Information?

Outline
- Two Typical Search Paradigms
- Problems of Current Search Approaches
- Local Hybrid Keyword Search
- Hybrid Search on Distributed Databases

Two Typical Search Paradigms
- Searching over structured data: Relational Databases
- Searching over unstructured data: Information Retrieval
- Internet Environment
  - Semistructured Data (XML): Keyword Search in DB
  - Web Search Engines: Technologies from Information Retrieval
- Hybrid Keyword Search?

Current Approaches: Keyword-only Search
Web Search Engines
- Web crawlers visit Web pages and collect keyword-based text indexes.
- Fast information retrieval
Keyword Search in Databases
- Web integration on legacy DBMS
- Dynamic Web publication through embedded DB
- Easy to use without knowledge of the DB schema

Problems of Current Approaches: Keyword-based
Web Search Engines
- Cannot collect every connected resource
- Query results are often unrelated
Keyword Search in Databases
- Loses the inherent meaning of the schema
- Query results are not based on the semantic schema

Current Approaches: Semantic
Semantic Web
- Multiple relation links form directed labeled graphs, so machines can understand the relationships between different resources
- Describes metadata about resources
- Represents the relations among objects on the Web; object terms are defined under a specific description, an ontology

Problems of Current Approaches: Semantic Web
- Ontology design is sophisticated
- Lack of unified definition
- Limited adoption

Our Approach
Hybrid search mechanisms: Semantic metadata + Keyword search
Semantic Solution
- Semantic Web might be better than Hybrid search
- Hybrid search must be better than Web search engines
Simplicity
- Hybrid search is simpler than Semantic Web

Hybrid Keyword Search Service
- A search service retrieves target data in response to a search query.
- Unstructured data: a file containing data (MS Word, PDF, PS documents)
- Metadata: structured or semistructured data (XML)
- We used an XML-enabled relational DBMS, and a native XML DB together with a text search library (Apache Xindice + Jakarta Lucene), to handle searches against both metadata and text.

How to Combine? (1)
- Two entity sets and a relationship in a relational DBMS
- We can obtain the hybrid search result using a nested subquery
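The nested-subquery combination can be sketched as follows. This is only an illustration under stated assumptions: the table names (`metadata`, `document`, `describes`), the sample rows, and the use of SQLite with `LIKE` as a stand-in for a real text index are all hypothetical and are not taken from the presentation, which does not show its schema or query.

```python
import sqlite3

# Hypothetical schema: two entity sets (metadata, document) and one
# relationship (describes) linking a metadata record to a document.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata  (meta_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
CREATE TABLE document  (doc_id  INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE describes (meta_id INTEGER, doc_id INTEGER);
""")
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                 [(1, "Kim", 2005), (2, "Fox", 2004), (3, "Kim", 2003)])
conn.executemany("INSERT INTO document VALUES (?, ?)",
                 [(10, "hybrid keyword search"),
                  (11, "grid computing"),
                  (12, "keyword index")])
conn.executemany("INSERT INTO describes VALUES (?, ?)",
                 [(1, 10), (2, 11), (3, 12)])


def hybrid_search(conn, author, keyword):
    """Return ids of documents that contain `keyword` AND are described
    by a metadata record whose author equals `author`."""
    rows = conn.execute(
        """
        SELECT doc_id FROM document
        WHERE body LIKE ?                      -- keyword condition (stand-in for a text index)
          AND doc_id IN (                      -- nested subquery over the relationship
              SELECT d.doc_id FROM describes d
              WHERE d.meta_id IN (
                  SELECT meta_id FROM metadata WHERE author = ?))
        ORDER BY doc_id
        """,
        (f"%{keyword}%", author),
    ).fetchall()
    return [row[0] for row in rows]


print(hybrid_search(conn, "Kim", "keyword"))
```

The inner query selects on metadata, and the outer query applies the keyword condition only to documents linked to those records.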

How to Combine? (2)
- A hash table is used for joining search results in a non-DBMS-based system (Apache Xindice + Lucene)
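A hash-table join of two independent result lists might look like the sketch below. The input shapes are assumptions made for illustration: XML hits are taken to be `(metadata key, document id)` pairs and keyword hits `(document id, score)` pairs. The actual result formats of Xindice and Lucene are not shown in the slides, and no call into either library is made here.

```python
def hash_join(xml_hits, keyword_hits):
    """Join metadata hits and keyword hits on the document id."""
    # Build phase: index the XML (metadata) hits by document id.
    by_doc = {}
    for meta_key, doc_id in xml_hits:
        by_doc.setdefault(doc_id, []).append(meta_key)

    # Probe phase: keep only keyword hits whose document also matched
    # the metadata query.
    joined = []
    for doc_id, score in keyword_hits:
        for meta_key in by_doc.get(doc_id, []):
            joined.append((meta_key, doc_id, score))
    return joined


# Hypothetical sample results from the two separate searches.
xml_hits = [("rec-1", "doc-a"), ("rec-2", "doc-b"), ("rec-3", "doc-c")]
keyword_hits = [("doc-b", 0.9), ("doc-c", 0.4), ("doc-z", 0.7)]

result = hash_join(xml_hits, keyword_hits)
print(result)
```

Building the table costs one pass over the metadata hits and probing costs one pass over the keyword hits, so the join is roughly linear in the combined result sizes.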

Local Query Processing: XML (1)
XML-enabled RDB
- DBLP XML records (1,000 to 10,000)
- Non-indexed matches, except the year match, are bound by the number of matches.
- Combined query time depends on the number of year query results
[Figure: Average XML Query Time]

Local Query Processing: XML (2)
Apache Xindice
- DBLP XML records (1,000 to 10,000)
- Indexed approximate matches on text elements in XML instances perform as badly as non-indexed queries
- Exact matches are bound by the number of matches.
[Figure: Average XML Query Time]

Local Query Processing: Hybrid (1)
Hybrid search query performance measurement
- XML-enabled RDB
- For 100,000 XML instances and 100,000 text documents
- Small result set: 4 XML and a keyword matches
- Large result set: 7,752 XML and 41,889 documents (3,227)

Metadata        Author      Year (Nested subquery)   Year (Hash table)
Few Keywords    0.04 Sec.   [illegible] Sec.         5.70 Sec.
Many Keywords   0.48 Sec.   Half hour                6.96 Sec.

Local Query Processing: Hybrid (2)
Hybrid search query performance measurement
- Apache Xindice + Jakarta Lucene
- For 10,000 XML instances and 10,000 text documents
- Small result set: 2 XML and a keyword matches
- Large result set: 192 XML and 4,562 documents (41)

Discussion: Local Hybrid Search
- The XML-enabled RDB provides acceptable response times except under some extreme query loads.
  - The older version has an inefficient query plan and query optimization; a newer version performs better.
- A native XML DB (Apache Xindice) had very limited scalability (no accurate query results beyond 16,000 XML instances).
- We will generalize hybrid search to a distributed environment.

Hybrid Search on Distributed Databases
- Data Independence: logically and physically independent; the same schema with no change; data encapsulation in each machine
- Network Transparency: depends on the MOM or P2P framework
- No replication: restricted to a computer cluster
- Fragment: full partition; horizontal fragmentation
- The query result for the distributed databases is the collection of query results from individual database queries.
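The last point, that the distributed result is the collection of per-database results under full horizontal fragmentation without replication, can be illustrated with a small sketch. The round-robin partitioning, the record layout, and the function names are assumptions chosen for the example; the slides do not say how records were assigned to machines.

```python
def fragment(records, n):
    """Horizontally partition records into n disjoint fragments
    (round-robin; every record lands in exactly one fragment)."""
    return [records[i::n] for i in range(n)]


def local_search(frag, predicate):
    """Query a single fragment, as one search service would."""
    return [r for r in frag if predicate(r)]


def distributed_search(fragments, predicate):
    """Collect the results of the individual fragment queries."""
    results = []
    for frag in fragments:
        results.extend(local_search(frag, predicate))
    return results


# Hypothetical data: 24 records spread over 8 "machines".
records = [{"id": i, "year": 2000 + i % 6} for i in range(24)]
fragments = fragment(records, 8)

hits = distributed_search(fragments, lambda r: r["year"] == 2005)
print(sorted(h["id"] for h in hits))
```

Because the fragments are disjoint and together cover every record, collecting the local results yields the same set as querying one undivided database, with no duplicate elimination needed.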

Scalable Hybrid Search Architecture on DDBS
[Figure: clients and search services connected through a message broker]
- Client: publisher for a query topic; subscriber for a temporary topic
- Search Service: subscriber for a query topic; publisher for a temporary topic
- Query messages and result messages pass between clients and search services through the message broker
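The topic roles named on this slide can be sketched with a toy in-memory broker. This is not the NaradaBrokering API: the `Broker`, `SearchService`, and `Client` classes, the topic name `"query"`, the message fields, and the synchronous delivery are all hypothetical simplifications used only to show the pattern of a shared query topic plus a per-request temporary topic.

```python
import uuid
from collections import defaultdict


class Broker:
    """Toy publish/subscribe broker with synchronous delivery."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)


class SearchService:
    """Subscribes to the query topic; publishes to the temporary topic."""

    def __init__(self, broker, name, docs):
        self.broker, self.name, self.docs = broker, name, docs
        broker.subscribe("query", self.on_query)

    def on_query(self, msg):
        hits = [d for d in self.docs if msg["keyword"] in d]
        self.broker.publish(msg["reply_topic"],
                            {"service": self.name, "hits": hits})


class Client:
    """Publishes to the query topic; subscribes to a temporary topic."""

    def __init__(self, broker):
        self.broker = broker

    def search(self, keyword):
        reply_topic = f"tmp-{uuid.uuid4().hex}"   # temporary topic for this query
        results = []
        self.broker.subscribe(reply_topic, results.append)
        self.broker.publish("query",
                            {"keyword": keyword, "reply_topic": reply_topic})
        return results   # complete here only because delivery is synchronous


broker = Broker()
SearchService(broker, "s1", ["hybrid search", "grid"])
SearchService(broker, "s2", ["keyword search"])
answers = Client(broker).search("search")
print(answers)
```

In a real messaging system delivery is asynchronous, so a client would need a timeout or a completion rule to decide when all search services have answered.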

Cooperating Broker Network Distributed Databases based on NaradaBrokering Network

Query Processing: DDBS (1)
- 100,000 XML and 100,000 documents on 8 machines (12,500 each)
- Few keyword matches (1-3), on 1 machine only
- RDB: 0.04 Sec. for few keyword matches
[Figure: Avg. response time for an author exact match query over 8 search services]

Query Processing: DDBS (2)
- 100,000 XML and 100,000 documents on 8 machines (12,500 each)
- RDB: half hour, or 6.96 Sec. (Hash table)
[Figure: Avg. response time for a year match query over 8 search services]

Coupling vs. Scalability From ICDE 2002 Tutorial

Query Propagate and Results back on a P2P Network

Peer group architecture of the P2P Search

Conclusion
- We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web
- Our architecture contributed a performance improvement for some queries
- Extended the scalability of Xindice XML queries, which are limited to a small data size on a single machine