P2P Content Search: Give the Web Back to the People Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum Max-Planck-Institut for Informatics,


P2P Content Search: Give the Web Back to the People
Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum (Max-Planck-Institut for Informatics, Saarbrücken, Germany)
Peter Triantafillou (University of Patras, Greece)
IPTPS, The 5th International Workshop on Peer-to-Peer Systems, Santa Barbara, California, USA, February 27-28, 2006

Outline of the Talk
1. Feasibility of P2P Web Search
2. Problem Statement
3. Learning from Queries
4. Exploiting Correlation
5. Experiments

P2P and Web Search: Marriage in Heaven
P2P Web Search has potential advantages:
- Highly distributed data
- Better processing power
Li, Loo, Hellerstein, Kaashoek, Karger, Morris questioned the Feasibility of Peer-to-Peer Web Indexing and Search (IPTPS 2003)
But: Authors assume distribution of full term-document index, which is non-scalable!
Better: light-weight approach with distributed term-peer directory
Variety of projects following this line: PlanetP (Rutgers), Pepper (CMU), Galanx (Wisconsin), Odissea (Brooklyn), Minerva (MPII), and others

Architectural Model
- Each peer has a full-fledged local search engine (with crawler / importer, indexer, query processor)
- Each peer has autonomously compiled (e.g. crawled) its own content according to the user's thematic interests, resulting in peer-specific collections
- Peers are connected by an overlay network (e.g. DHT, random graph) and IP
- Peers can post summaries / synopses / metadata / QoS info to a (distributed) network-wide directory with efficient per-key lookup
- When a query is issued by a peer, it is first executed locally and then possibly routed to carefully selected other peers

Minerva System Architecture
- Built on top of a scalable, churn-resilient DHT
- Conceptually global but physically distributed meta-data directory
- Query routing driven by statistics on peer quality
[Figure: peers P1 to P8. A query peer with a local index issues a query with terms a, b, c. Directory peers hold peer lists together with peer ranking and statistics: term a: P1, P4, P8; term b: P3, P5, P8; term c: P2, P4, P6; term d: P1, P3; term e: P1, P2, P5; term f: P2, P4, P6.]

Problem Statement
Example query q: "native american music"
- Ask the global directory for three single-term PeerLists
- Combine them into a single PeerList for the complete query
- Ask the top peers for their best documents
- Combine all documents into a single result
What can happen?
- Great results: the top peers for q are selected!
- Bad results: the selected peers are good for individual terms, but mediocre for the complete query.
[Figure: query peer Pq and peers Pi, Pj, Pk posting "native", "american" and "music" to the directory. PeerLists: native: P27, P4, P8, P112, P36, ...; american: P1, P4, P18, P108, P25, ...; music: P13, P4, P88, P36, ... Example documents: Doc 1 "american music", Doc 2 "native american".]
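The routing step "combine the single-term PeerLists into one PeerList" can be sketched as follows. This is only an illustration: the slides do not say how lists are merged, so summing per-term scores is an assumption, and the function name `combine_peer_lists` and all score values are hypothetical.

```python
from collections import defaultdict

def combine_peer_lists(peer_lists):
    """Merge per-term PeerLists into one ranking by summing per-term scores.

    Assumption: each PeerList maps a peer to a quality score for that term.
    """
    totals = defaultdict(float)
    for scored_peers in peer_lists.values():
        for peer, score in scored_peers.items():
            totals[peer] += score
    # Highest combined score first; ties broken by peer name for determinism.
    return sorted(totals, key=lambda p: (-totals[p], p))

# Hypothetical per-term scores (peer names follow the example query).
peer_lists = {
    "native":   {"P27": 0.9, "P4": 0.8, "P8": 0.7},
    "american": {"P1": 0.9, "P4": 0.8, "P18": 0.6},
    "music":    {"P13": 0.9, "P4": 0.7, "P88": 0.6},
}
ranking = combine_peer_lists(peer_lists)
print(ranking[0])  # P4, the only peer that scores for all three terms
```

Note that such a merge only sees per-term statistics: a peer ranks high because it is good for each term separately, which says nothing about whether its documents contain the terms together.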

Problem: Term Correlations
Architectural compromise:
- Best peers for $q = \{t_1, \dots, t_{|q|}\}$ may not be in $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ and possibly not even in $\bigcup_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$
- Also possible: $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ is empty!
- Name and phrase recognition helps but is insufficient
- Lack of correlation-awareness is standard in IR, but more severe in P2P because of the peer-granularity directory
Queries with correlated or specifically "associated" termsets:
- "Michael Jordan", "Lake Superior", "Bell Labs", "hurricane Katrina", "Native American Music", "PhD admission", "black magic", "ice hockey Honolulu", "Natalya Kournikova"
The solution:
- Special handling of correlated termsets as termset posts in the directory, but efficiency & scalability are critical!
Consider correlated termsets for query routing!
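A small sketch of the top-k effect, using the visible prefixes of the three PeerLists from the "native american music" example. The helper name `top_k_candidates` is hypothetical, and the lists are assumed to be ordered best-first; the slides truncate them with "...".

```python
def top_k_candidates(peer_lists, k):
    """Return (intersection, union) of the top-k prefixes of all PeerLists."""
    tops = [set(peers[:k]) for peers in peer_lists.values()]
    return set.intersection(*tops), set.union(*tops)

# Prefixes of the single-term PeerLists, assumed to be ranked best-first.
peer_lists = {
    "native":   ["P27", "P4", "P8", "P112", "P36"],
    "american": ["P1", "P4", "P18", "P108", "P25"],
    "music":    ["P13", "P4", "P88", "P36"],
}

inter1, union1 = top_k_candidates(peer_lists, k=1)
inter2, union2 = top_k_candidates(peer_lists, k=2)

print(inter1)          # set(): no peer is top-1 for every term
print("P4" in union1)  # False: P4 is second everywhere, so top-1 misses it
print(inter2)          # {'P4'}
```

With k=1 the intersection is empty and P4, the only peer listed for all three terms, is not even in the union of the truncated lists.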

Critical Issues and what remains to be done?
1. How to decide that a termset is correlated?
2. How to store termset posts in the directory?
3. How to exploit termset posts for queries?

Possible Approaches
Possible sources of correlated termsets:
- Names and phrases from dictionaries or thesauri: incomplete!
- Frequent itemset mining on data: computationally expensive!
Extraction of all possible term pairs out of the documents:
- Brute-force precomputation of termset posts
- But: quadratic explosion, and what about triples, quadruples, ...?
Impossible to predict all correlated termsets of interest!
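To make the "quadratic explosion" concrete, the sketch below enumerates all term pairs of a tiny document and counts pairs and triples for a larger vocabulary. The vocabulary size of 1000 distinct terms is an arbitrary illustrative assumption, and `term_pairs` is a hypothetical helper.

```python
from itertools import combinations
from math import comb

def term_pairs(doc_terms):
    """All unordered pairs of distinct terms in one document."""
    return list(combinations(sorted(set(doc_terms)), 2))

pairs = term_pairs(["native", "american", "music", "native"])
print(pairs)
# [('american', 'music'), ('american', 'native'), ('music', 'native')]

# Candidate termsets for a document with 1000 distinct terms (assumed size).
print(comb(1000, 2))  # 499500 pairs
print(comb(1000, 3))  # 166167000 triples
```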

Our Approach...
Exploit query logs to learn correlated termsets, driven by "Give the Web back to the people"
Advantages of query logs:
- Reflect the real behavior of millions of users
- Only termsets of interest need to be learned as correlated
- As we will see: integration in the existing architecture for free
Looking at query logs to validate that logs are useful to recognize correlated termsets:
- Excite Search Engine Log (1999) with about 2 million real web queries
Queries are a gold mine!

Learning Correlated Termsets from Queries
- PeerList request: piggybacking the complete query
- Directory peers remember the query as termsets
Learning included in Query Routing
[Figure: peers P1 to P8. Directory entries: american: P1, P4, P8; native: P3, P5, P8; music: P2, P4, P6. The complete query "native american music" travels with each single-term PeerList request.]
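A minimal sketch of how a directory peer might remember piggybacked queries. The class `DirectoryPeer`, its method names, and the `min_support` threshold are all hypothetical; the slides only state that directory peers remember queries as termsets, not how they decide that a termset is correlated.

```python
from collections import Counter

class DirectoryPeer:
    """Directory peer responsible for one term (hypothetical sketch)."""

    def __init__(self, term, min_support=2):
        self.term = term
        self.min_support = min_support  # assumed threshold, not from the slides
        self.query_counts = Counter()

    def handle_peerlist_request(self, full_query):
        """Record the complete query that was piggybacked on the request."""
        termset = frozenset(full_query)
        if self.term in termset and len(termset) > 1:
            self.query_counts[termset] += 1

    def correlated_termsets(self):
        """Termsets seen at least min_support times."""
        return {ts for ts, n in self.query_counts.items() if n >= self.min_support}

peer = DirectoryPeer("american")
peer.handle_peerlist_request(["native", "american", "music"])
peer.handle_peerlist_request(["american", "music", "native"])
peer.handle_peerlist_request(["american", "idol"])
learned = peer.correlated_termsets()
print(learned == {frozenset({"native", "american", "music"})})  # True
```

Using `frozenset` makes the termset independent of term order, so both spellings of the query count toward the same entry.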

Collecting and Storing Termset Posts
- Directory peers manage termset posts
- Posting procedure extended with termset posting
No extra communication protocol needed!
[Figure: directory entries american: P1, P4, P8; native: P3, P5, P8; music: P2, P4, P6, with learned termsets such as "american music native", "american native" and "music native" at the directory peers. A peer posting "american" and "native" also posts "american native", which yields the directory entry american native: P8.]

Exploiting Termset Postings
- Integrated in standard query execution
- Fallback option always possible
No additional communication round!
[Figure: directory entries american: P1, P4, P8; native: P3, P5, P8, P2; music: P2, P4, P6, P8; american music native: P8; native music: P8, P4. The directory returns PeerLists for "native", "music native" and "american music native"; the last one is the PeerList for the complete query.]
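The lookup with fallback can be sketched as below, using the directory entries shown for this slide. The function `route_query` is hypothetical, and the fallback ranking (count how many single-term lists contain a peer) is an assumption standing in for whatever merge the real system uses.

```python
def route_query(query_terms, term_posts, termset_posts):
    """Prefer a termset post for the complete query; else merge single terms."""
    termset = frozenset(query_terms)
    if termset in termset_posts:
        return list(termset_posts[termset])
    # Fallback (assumed ranking): peers appearing in more single-term lists first.
    counts = {}
    for term in query_terms:
        for peer in term_posts.get(term, []):
            counts[peer] = counts.get(peer, 0) + 1
    return sorted(counts, key=lambda p: (-counts[p], p))

term_posts = {
    "american": ["P1", "P4", "P8"],
    "native":   ["P3", "P5", "P8", "P2"],
    "music":    ["P2", "P4", "P6", "P8"],
}
termset_posts = {
    frozenset({"american", "music", "native"}): ["P8"],
    frozenset({"native", "music"}): ["P8", "P4"],
}

full = route_query(["native", "american", "music"], term_posts, termset_posts)
fallback = route_query(["american", "music"], term_posts, termset_posts)
print(full)      # ['P8']
print(fallback)  # ['P4', 'P8', 'P1', 'P2', 'P6']
```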

No Termset for Complete Query
- Especially for large queries
- Covering problem!
Integrated into Query Routing!
[Figure: peers P1 to P8 and the query "a b c d e". Directory peers hold termsets such as "a b c", "a b d", "b c e", "c e", "d e" and "e", none of which equals the complete query.]
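One way to picture the covering problem is a greedy set cover over the termsets the directory already knows. Greedy selection is a standard heuristic swapped in here for illustration; the slides do not say which strategy the authors use, and `greedy_cover` is a hypothetical name.

```python
def greedy_cover(query_terms, available_termsets):
    """Cover the query with known termsets, largest gain first (greedy heuristic).

    Terms that no termset covers fall back to single-term posts.
    """
    query = set(query_terms)
    uncovered = set(query)
    # Only termsets fully contained in the query are useful for routing it.
    candidates = [frozenset(ts) for ts in available_termsets if set(ts) <= query]
    cover = []
    while uncovered:
        best = max(candidates, key=lambda ts: len(ts & uncovered), default=frozenset())
        if not best & uncovered:
            cover.extend(frozenset([t]) for t in sorted(uncovered))
            break
        cover.append(best)
        uncovered -= best
    return cover

known = [["a", "b", "d"], ["a", "b", "c"], ["b", "c", "e"], ["c", "e"], ["d", "e"], ["e"]]
cover = greedy_cover(["a", "b", "c", "d", "e"], known)
print([sorted(ts) for ts in cover])  # [['a', 'b', 'd'], ['b', 'c', 'e']]
```

Greedy choice is not guaranteed to find the smallest cover in general, which is one reason cover optimization remains an open question for queries with many terms.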

What about Networking Costs?
Big concern: too many messages and too much bandwidth consumption?
Our approach is still scalable because:
- Learning correlated termsets is integrated in the query routing process
- Asking for termsets is integrated in the posting process
- Exploiting correlated termsets in query processing comes for free and includes the fallback option, too
All messages are piggybacked, no extra costs! It's all free!!

Experimental Evaluation
- Experiments: 750 peers with .GOV partitions (~1.2 million web documents)
- Running 50 expanded queries from the TREC-2003 Web Track (examples: "robots research artificial" or "shipwrecks accident")
Major gain in benefit/cost

Conclusion and Future Work
- Reconcile scalability with good search-result quality
- No extra networking costs and greatly improved benefit/cost for query routing and processing
- Consider and benefit from user and community behavior
- Optimization of termset covers for queries with many terms
- Real-life testbed with real users!
Thank You for Your Attention!