Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências,


Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática XLDB Research Group

Portuguese Web
– There is an identifiable community Web that we call the Portuguese Web: the Web of the people directly related to Portugal
– This is NOT a small community Web
  – Population of Portugal: 10M
  – 3+ M users
  – 4+ M pages

Tumba! (Temos um Motor de Busca Alternativo!, "We have an alternative search engine!")
Public service
– Community Web search engine
– Web archive
– Research infrastructure
See it in action at

Statistics
– Up to 20,000 queries/day
– 3.5 million documents under .PT (the deepest crawl!)
– 95% of responses in under 0.5 s

Tumba! (diagram): Web Crawlers; Repository; Indexing Engine; Ranking Engine; Presentation Engine; SIDRA

Crawling + archiving (diagram): Web; ViúvaNegra (Crawling Engine); WebStore (Contents Repository); Versus (Meta-data Repository); Seed URLs; .PT DNS Authority; User Input

Query Processing Architecture (indexing phase) (diagram): WebStore (Contents Repository); Versus (Meta-data Repository); Index DataStructs Generator; Word Index; Page Attributes (Authority)

SIDRA - Word Index Data Structure
– 2 files
– Term {docID} {hit}
– Hit = position + attrib
– DocID assigned in static rank order
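The layout above (term, then docIDs, then hits) can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the SIDRA implementation: the names `Hit` and `build_index` and the attribute labels ("title", "bold", "plain") are assumptions, and the split into 2 files is not modelled. The one property taken directly from the slide is that docIDs are assigned in static rank order, so every posting list comes out already sorted by static rank.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    position: int  # word offset inside the document
    attrib: str    # hypothetical labels, e.g. "title", "bold", "plain"


def build_index(docs_by_rank):
    """Build term -> {docID: [Hit, ...]}.

    docs_by_rank is a list of documents, best static rank first; each
    document is a list of (word, attrib) pairs. The docID is simply the
    position in that list, so lower docID means better static rank.
    """
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs_by_rank):
        for position, (word, attrib) in enumerate(tokens):
            index[word.lower()].setdefault(doc_id, []).append(Hit(position, attrib))
    return index


docs = [
    [("web", "title"), ("search", "plain")],    # docID 0 (best static rank)
    [("search", "bold"), ("engine", "plain")],  # docID 1
]
index = build_index(docs)
print(list(index["search"]))  # docIDs for "search", in static rank order
```

Because documents are visited in rank order, no separate sort of the posting lists is needed at query time.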

SIDRA – Index Range Partitioning

SIDRA - Ranking Engine (diagram): Clients; Query Broker; Page Attributes; Query Server; Word Index (×3)

Matching & Ranking Algorithm
Phase 1: Query Matching
– QueryServers fetch matching docIDs (pre-sorted in static ranking order)
– QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
– Pick the first N (1000) results from phase 1
– Compute the final rank using hits data
  – Are the terms also in the title?
  – What is the distance among query terms in the page?
  – Are terms in bold or italic?
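The two phases can be sketched as follows. This is a minimal single-process sketch, not the authors' code: `broker_merge`, `final_score`, `rerank`, and all the weights are hypothetical, and the slide does not say how the three signals are combined. It assumes each QueryServer returns an ascending list of docIDs (ascending docID equals static rank order) and that every query term has at least one hit in each candidate document.

```python
import heapq
from itertools import islice


def broker_merge(server_results, n=1000):
    """Phase 1: merge sorted docID lists from the QueryServers, keep the first n."""
    return list(islice(heapq.merge(*server_results), n))


def final_score(hits_by_term):
    """Phase 2: score one document from its hits.

    hits_by_term maps each query term to a list of (position, attrib)
    pairs. The weights below are illustrative assumptions.
    """
    score = 0.0
    for hits in hits_by_term.values():
        attribs = {attrib for _, attrib in hits}
        if "title" in attribs:
            score += 2.0                      # term also appears in the title
        if attribs & {"bold", "italic"}:
            score += 0.5                      # term is emphasised
    term_hits = list(hits_by_term.values())
    if len(term_hits) >= 2:
        # Smallest distance between occurrences of two different terms.
        gap = min(
            abs(p - q)
            for i, a in enumerate(term_hits)
            for b in term_hits[i + 1:]
            for p, _ in a
            for q, _ in b
        )
        score += 1.0 / (1 + gap)
    return score


def rerank(candidates, hits_lookup):
    """Sort candidates by final score; ties keep static rank order (stable sort)."""
    return sorted(candidates, key=lambda d: -final_score(hits_lookup[d]))


candidates = broker_merge([[0, 3, 6], [1, 4], [2, 5]], n=4)
print(candidates)
```

Only the first N candidates ever reach the more expensive phase 2 scoring, which is what makes pre-sorting by static rank worthwhile.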

Architecture

Index design
– Horizontal/global partition: each QueryServer contains all documents for one criterion, e.g. a keyword
– Allows searches on different criteria in parallel (partition parallelism)
– Brokers merge results received in parallel as they are being produced (pipeline parallelism)
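The two kinds of parallelism named above can be illustrated with threads and queues. This is a sketch under stated assumptions, not the SIDRA protocol: `query_server`, `stream`, and `parallel_search` are hypothetical names, each thread stands in for one QueryServer that emits its docIDs in ascending order, and the broker is a lazy generator that starts yielding merged results before the servers have finished.

```python
import heapq
import queue
import threading

_DONE = object()  # sentinel marking the end of one server's result stream


def query_server(doc_ids, out):
    """Stand-in for a QueryServer: push sorted docIDs one by one."""
    for doc_id in doc_ids:
        out.put(doc_id)
    out.put(_DONE)


def stream(q):
    """Turn a queue into an iterator that ends at the sentinel."""
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def parallel_search(partitions):
    """Broker: query all partitions at once, merge results as they arrive."""
    queues = [queue.Queue() for _ in partitions]
    threads = [
        threading.Thread(target=query_server, args=(part, q))
        for part, q in zip(partitions, queues)
    ]
    for t in threads:
        t.start()                      # partition parallelism
    # heapq.merge is lazy, so results flow out while servers still produce.
    yield from heapq.merge(*(stream(q) for q in queues))
    for t in threads:
        t.join()


for doc_id in parallel_search([[0, 3], [1, 4], [2]]):
    print(doc_id)
```

The merge only needs the next pending docID from each stream, so the broker never has to buffer a full result list from any server.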

Addressing Multi-dimensionality
– Generalization: page-rank (page importance measure) is but one of the possible ranking contexts
– Query Servers may index data according to other dimensions
  – time
  – location
  – ...
– Query Brokers perform the results fusion
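The slide does not say how Query Brokers fuse results from different dimensions. As one possible illustration, the sketch below uses reciprocal rank fusion, a generic rank-fusion technique chosen here for simplicity and not taken from the presentation; the function name and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked docID lists into one.

    Each list is one ranking context (e.g. importance, time, location),
    best result first. A document scores 1 / (k + rank) in every list
    where it appears; ties are broken by docID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


by_importance = [1, 2, 3]
by_time = [2, 3, 1]
print(reciprocal_rank_fusion([by_importance, by_time]))
```

A method that works on rank positions rather than raw scores avoids having to calibrate scores across dimensions that are measured on different scales.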

Flexibility / Scalability
– User requests may be balanced among multiple Presentation Engines
– Contents may be replicated
– Requests may be balanced among multiple Query Brokers
– Page Attributes may be replicated
– Query Brokers may balance requests to multiple Query Servers
– Multiple Query Servers for a Word Index
– Word Indexes may be replicated
(diagram: Word Index (×3); Query Server; Query Broker; Page Attributes; Presentation Engine; WebStore (Contents Repository))

Non-functional properties
– Load balancing: components distribute requests to multiple replicas (round-robin or least loaded)
– Fault tolerance: components can detect high response times and redirect requests
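Both properties can be sketched with a small replica pool. Everything here is hypothetical (the `ReplicaPool` class, the `send` callback, the 0.5 s threshold); the slide only states that components choose replicas round-robin or by load, and that they detect high response times and redirect requests. In this sketch a replica that answers too slowly is marked as suspect and skipped by later requests.

```python
import itertools
import time


class ReplicaPool:
    def __init__(self, replicas, slow_after=0.5):
        self.replicas = list(replicas)           # e.g. host names (hashable)
        self._cycle = itertools.cycle(self.replicas)
        self.in_flight = {r: 0 for r in self.replicas}
        self.slow_after = slow_after             # seconds; assumed threshold
        self.suspect = set()                     # replicas seen responding slowly

    def pick_round_robin(self):
        for _ in self.replicas:
            replica = next(self._cycle)
            if replica not in self.suspect:
                return replica
        raise RuntimeError("no healthy replica")

    def pick_least_loaded(self):
        healthy = [r for r in self.replicas if r not in self.suspect]
        if not healthy:
            raise RuntimeError("no healthy replica")
        return min(healthy, key=self.in_flight.get)

    def call(self, send, request, pick=None):
        """Send request via send(replica, request); mark slow replicas."""
        replica = (pick or self.pick_round_robin)()
        self.in_flight[replica] += 1
        start = time.monotonic()
        try:
            return send(replica, request)
        finally:
            self.in_flight[replica] -= 1
            if time.monotonic() - start > self.slow_after:
                self.suspect.add(replica)        # later requests are redirected


pool = ReplicaPool(["qs1", "qs2"])
print(pool.call(lambda replica, q: f"{replica} answered {q}", "tumba"))
```

A fuller version would also retry the slow request on another replica and let suspect replicas back in after a cool-down period; both are left out to keep the sketch short.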

Results
– With 1 QueryServer and 1 Broker: responds to workloads of 50 requests per second with an average time of seconds
– With 2 QueryServers and 1 Broker: responds to workloads of 110 requests per second with an average time of seconds
– Extensive discussion in upcoming dissertation

Tumba!
– Modest effort: 1 Prof., 4-5 graduate students, 4-5 servers for 2 years
– Still beta!
  – Fault tolerance will require substantially more hardware (replication)
  – Periodic updates will demand more storage
  – Full-time operators?
– Encouraging feedback