Hybrid Keyword Search across Peer-to-Peer Federated Data PhD Dissertation Defense Florida State University Jungkee (Jake) Kim.

Presentation transcript:

Hybrid Keyword Search across Peer-to-Peer Federated Data PhD Dissertation Defense Florida State University Jungkee (Jake) Kim

Motivation Internet Where is the Information?

Outline
- Two Typical Search Paradigms
- Problem Statements of Current Approaches
- Hybrid Keyword Search
- Hybrid Search on Distributed Databases
- Hybrid Search across Peer-to-Peer Federated Databases

Two Typical Search Paradigms
- Searching over structured data: relational databases
- Searching over unstructured data: information retrieval
- Internet environment
  - Semistructured data – XML
  - Keyword search in DB
  - Web search engines – technologies from information retrieval
- Hybrid keyword search?

Current Approaches – Keyword-only Search
- Web Search Engines
  - Web crawlers visit Web pages and collect keyword-based text indexes.
  - Fast information retrieval
- Keyword Search in Databases
  - Web integration on legacy DBMS
  - Dynamic Web publication through embedded DB
  - Easy to use without knowledge of the DB schema

Problems of Current Approaches – Keyword-based
- Web Search Engines
  - Cannot collect every connected resource
  - Query results are often unrelated
- Keyword Search in Databases
  - The inherent meaning of the schema is lost
  - Query results are not based on the semantic schema

Current Approaches – Semantic
- Semantic Web
  - Multiple relation links form directed labeled graphs, so machines can understand the relationships between different resources
  - Describes metadata about resources
  - Represents the relations of objects on the Web; the object terms are defined under a specific description, an ontology

Problems of Current Approaches – Semantic Web
- Ontology design is sophisticated
- Lack of a unified definition
- Limited adoption

Our Approach
- Hybrid search mechanisms: semantic metadata + keyword search
- Semantic solution
  - The Semantic Web might be better than hybrid search
  - Hybrid search must be better than Web search engines
- Simplicity
  - Hybrid search is simpler than the Semantic Web

Hybrid Keyword Search Service
- A search service retrieves the target data that matches a search query.
- Unstructured data: a file containing data, such as MS Word, PDF, or PS documents
- Metadata: structured or semistructured data (XML)
- We used an XML-enabled relational DBMS, and a native XML DB together with a text search library (Apache Xindice + Jakarta Lucene), to search against both metadata and text.

How to Combine? (1)
- Two entities and a relationship in a relational DBMS
- We can obtain the hybrid search result using a nested subquery
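The nested-subquery idea on this slide can be sketched as follows. This is a minimal illustration, assuming SQLite as a stand-in for the XML-enabled RDBMS; the table names (`metadata`, `document`, `describes`), the columns, and the sample rows are hypothetical and do not come from the presentation. A plain `LIKE` substring match stands in for a real text index.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database: two entities and one relationship (all hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE metadata (meta_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
        CREATE TABLE document (doc_id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE describes (meta_id INTEGER, doc_id INTEGER);
    """)
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                     [(1, "Kim", 2004), (2, "Fox", 2004), (3, "Kim", 2002)])
    conn.executemany("INSERT INTO document VALUES (?, ?)",
                     [(10, "hybrid keyword search"),
                      (20, "hybrid grid portal"),
                      (30, "earthquake science")])
    conn.executemany("INSERT INTO describes VALUES (?, ?)",
                     [(1, 10), (2, 20), (3, 30)])
    return conn

def hybrid_search(conn, author, keyword):
    """Return ids of documents that contain `keyword` AND whose metadata names `author`."""
    sql = """
        SELECT d.doc_id FROM document AS d
        WHERE d.body LIKE '%' || ? || '%'          -- keyword condition on the text
          AND d.doc_id IN (                         -- nested subquery: metadata condition
              SELECT r.doc_id FROM describes AS r
              JOIN metadata AS m ON m.meta_id = r.meta_id
              WHERE m.author = ?)
        ORDER BY d.doc_id
    """
    return [row[0] for row in conn.execute(sql, (keyword, author))]
```

Calling `hybrid_search(build_demo_db(), "Kim", "hybrid")` returns only the documents that satisfy both the metadata condition and the keyword condition.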

How to Combine? (2)
- A hash table is used for joining search results in the non-DBMS-based system (Apache Xindice + Lucene)
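A hash-table join of two result sets might look like the sketch below. It assumes that the metadata search and the keyword search each return a list of hits sharing a document identifier; the field names (`doc_id`, `score`) and the function name are hypothetical, and no Xindice or Lucene API is used here.

```python
def hash_join(metadata_hits, keyword_hits):
    """Join metadata hits and keyword hits on doc_id using a hash table (a dict)."""
    # Build phase: index the metadata hits by document id.
    table = {hit["doc_id"]: hit for hit in metadata_hits}
    # Probe phase: keep only keyword hits whose document also matched the metadata query.
    joined = []
    for hit in keyword_hits:
        meta = table.get(hit["doc_id"])
        if meta is not None:
            joined.append({**meta, "score": hit["score"]})
    return joined
```

Each side is scanned once, so the cost grows with the sum of the two result-set sizes rather than with their product.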

Local Query Processing – XML (1)
XML-enabled RDB
- DBLP XML records (1,000 – 10,000)
- Non-indexed matches, except the year match, are bound by the number of matches.
- Combined query time depends on the number of year query results.
[Figure: Average XML Query Time]

Local Query Processing – XML (2)
Apache Xindice
- DBLP XML records (1,000 – 10,000)
- Indexed approximate matches for text elements in XML instances perform as badly as non-indexed queries.
- Exact matches are bound by the number of matches.
[Figure: Average XML Query Time]

Local Query Processing – Hybrid (1)
Hybrid search query performance measurement
- XML-enabled RDB
- For 100,000 XML instances and 100,000 text documents
- Small result set: 4 XML and a keyword matches
- Large result set: 7,752 XML and 41,889 documents

Keywords   Metadata: Author   Metadata: Year (nested subquery)   Metadata: Year (hash table)
Few        0.04 Sec.          [value missing] Sec.               [value missing] Sec.
Many       0.48 Sec.          Half hour                          6.96 Sec.

Local Query Processing – Hybrid (2)
Hybrid search query performance measurement
- Apache Xindice + Jakarta Lucene
- For 10,000 XML instances and 10,000 text documents
- Small result set: 2 XML and a keyword matches
- Large result set: 192 XML and 4,562 documents

Discussion – Local Hybrid Search
- The XML-enabled RDB gives acceptable response times except under some extreme query loads.
- The native XML DB (Apache Xindice) had very limited scalability (no accurate query results beyond 16,000 XML instances).
- We will generalize hybrid search to a distributed environment.

Hybrid Search on Distributed Databases
- Data independence: logically and physically independent; the same schema (no change); data encapsulation in each machine
- Network transparency: depends on the MOM or P2P framework
- No replication: restricted to a computer cluster
- Fragmentation: full partition; horizontal fragmentation
- The query result for the distributed databases is the collection of query results from the individual database queries.
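The last point, that the distributed result is the collection of the per-database results over horizontal fragments, can be sketched as a simple fan-out and gather. This is only an illustration: each fragment is modeled as an in-memory list of rows, and the function names and the thread pool are assumptions, not the messaging layer used in the presentation.

```python
from concurrent.futures import ThreadPoolExecutor

def search_fragment(fragment, predicate):
    """Run the query against one horizontal fragment (here: a plain list of rows)."""
    return [row for row in fragment if predicate(row)]

def distributed_search(fragments, predicate):
    """Send the same query to every fragment and collect all partial results."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
        partials = pool.map(search_fragment, fragments, [predicate] * len(fragments))
        # With no replication and full horizontal partitioning, the partial
        # results are disjoint, so concatenation gives the complete answer.
        for partial in partials:
            results.extend(partial)
    return results
```

Because every fragment holds a disjoint slice of the same schema, no cross-fragment join or duplicate elimination is needed in this sketch.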

Scalable Hybrid Search Architecture on DDBS
[Figure: clients and search services connected through a message broker. Labels: client as publisher for a query topic and subscriber for a temporary topic; search service as subscriber for a query topic and publisher for a temporary topic; query messages; result messages.]
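One plausible reading of this architecture is a publish/subscribe exchange: the client publishes a query on a shared query topic and listens on a temporary topic, while each search service subscribes to the query topic and publishes its results to the temporary topic named in the query. The sketch below uses a hypothetical in-memory `Broker` with synchronous delivery; it is a stand-in for illustration and does not use the NaradaBrokering or JMS APIs.

```python
import uuid
from collections import defaultdict

class Broker:
    """Hypothetical in-memory topic broker with synchronous delivery."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)

class SearchService:
    """Subscriber for the query topic, publisher for the client's temporary topic."""
    def __init__(self, broker, documents):
        self.broker = broker
        self.documents = documents
        broker.subscribe("query", self.on_query)

    def on_query(self, message):
        hits = [doc for doc in self.documents if message["keyword"] in doc]
        self.broker.publish(message["reply_topic"], hits)

def client_search(broker, keyword):
    """Publisher for the query topic, subscriber for a fresh temporary topic."""
    reply_topic = f"tmp-{uuid.uuid4()}"
    results = []
    broker.subscribe(reply_topic, results.extend)  # gather result messages
    broker.publish("query", {"keyword": keyword, "reply_topic": reply_topic})
    return results
```

The temporary topic keeps the results of concurrent queries apart, since each client only hears replies addressed to the topic it created.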

Cooperating Broker Network Distributed Databases based on NaradaBrokering Network

Query Processing – DDBS (1)
- 100,000 XML instances and 100,000 documents on 8 machines (12,500 each)
- Few keyword matches (1-3), on 1 machine only
- RDB: 0.04 Sec. for few keyword matches
[Figure: Avg. response time for an author exact match query over 8 search services]

Query Processing – DDBS (2)
- 100,000 XML instances and 100,000 documents on 8 machines (12,500 each)
- RDB: half hour, or 6.96 Sec. (hash table)
[Figure: Avg. response time for a year match query over 8 search services]

Data Integration Hub
- Partial integration: a possible method to increase the portion of data queried (cf. supernode in P2P)
- We designed a partial integration architecture through a message-oriented middleware, the NaradaBrokering system
- NaradaBrokering system
  - JMS-compliant topic-based communication
  - Scalability through hierarchical broker connections
  - Passive queries / static binding
- We attached an RDBMS to store the metadata and index the contents of the data

Architecture of Data Integration Hub

Coupling vs. Scalability From ICDE 2002 Tutorial

Query Propagate and Results back on a P2P Network

Peer group architecture of the P2P Search

Performance Test for Peer Group Communication (JXTA)
[Figure: a client peer, a rendezvous peer, and search service peers spread over Subnets A, B, and C; labels: group propagation, point-to-point pipe connection.]

Performance for Group Peer Communication – 1 Peer per Node Average Response Time for a Query

Performance for Group Peer Communication – Multiple Peers per Node Allowed (1) Average Response Time for a Query with Multiple Peers per Node Allowed

Performance for Group Peer Communication – Multiple Peers per Node Allowed (2) Message Response Time for 32 Group Peers

Related Works (1)
- Distributed lookup in routing to reduce unnecessary communication
  - Distributed Hash Table (DHT): Chord, CAN, Pastry, and Tapestry
  - JXTA: DHT + multiple random walks
- Looking up peers based on reputation
- Hristidis et al.: exploiting a context on an existing RDBMS while reducing the schema loss of keyword search in DB

Related Works (2)

Method           Metadata (XML)   Contents       Note
PlanetP          No               Yes            Gossiping; thousands of peers
ODISSEA          No               Yes            Distributed global index; Pastry
Galanis et al.   Yes              No             Distributed directories; Chord; thousands
XRANK            Yes              Yes (in XML)   No P2P

Conclusion
- We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web
- Low-cost scalability over heterogeneous resources through customized overlay networks
- A practical bridging role on the road towards the ideal of information represented by the Semantic Web?

Contributions
- Demonstration of a hybrid search: combining metadata search with a keyword search over unstructured context data
- A way to increase locality and integrate several dispersed resources through a data integration hub
- Extension of the scalability of a native XML database, and performance improvement for some queries compared to those on a single machine
- Generalization of our hybrid search architecture on a potentially more scalable P2P overlay network

Publications
- J. Kim and G. Fox. Scalable Hybrid Search on Distributed Databases. Accepted for presentation in 3rd International Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM) in conjunction with ICCS. To appear in Lecture Notes in Computer Science. May.
- J. Kim and G. Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. In Proceedings of 8th East-European Conference on Advances in Databases and Information Systems (ADBIS), September.
- J. Kim, O. Balsoy, M. Pierce, and G. Fox. Design of a Hybrid Search in the Online Knowledge Center. In Proceedings of IASTED International Conference on Information and Knowledge Sharing, November.
- G. Aydin, H. Altay, M. S. Aktas, M. N. Aysan, G. Fox, C. Ikibas, J. Kim, A. Kaplan, A. E. Topcu, M. Pierce, B. Yildiz, and O. Balsoy. Online Knowledge Center Tools for Metadata Management. Technical report, DoD HPCMP Users Group Meeting, June.
- O. Balsoy, M. S. Aktas, G. Aydin, M. N. Aysan, C. Ikibas, A. Kaplan, J. Kim, M. Pierce, A. Topcu, B. Yildiz, and G. Fox. The Online Knowledge Center: Building a Component Based Portal. In Proceedings of the International Conference on Information and Knowledge Engineering, June.
- G. Fox, S. Ko, M. Pierce, O. Balsoy, J. Kim, S. Lee, K. Kim, S. Oh, X. Rao, M. Varank, H. Bulut, G. Gunduz, X. Qiu, S. Pallickara, A. Uyar, and C. Youn. Grid services for earthquake science. Concurrency and Computation: Practice and Experience, 14, May–June 2002.