Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática XLDB Research Group mjs@di.fc.ul.pt

Portuguese Web
There is an identifiable community Web, which we call the Portuguese Web
– The web of the people directly related to Portugal
This is NOT a small community web
– Population of Portugal: 10M
– 3+ M users
– 4+ M pages

Tumba! (Temos um Motor de Busca Alternativo!, "We Have an Alternative Search Engine!")
Public service
– Community Web Search Engine
– Web Archive
– Research infrastructure
See it in action at http://tumba.pt

Statistics
– Up to 20,000 queries/day
– 3.5 million documents under .PT (the deepest crawl!)
– 95% of responses in under 0.5 sec

Tumba! (architecture diagram; components shown: Web, Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine, SIDRA)

Crawling + archiving (diagram; components shown: Web, ViúvaNegra (Crawling Engine), WebStore (Contents Repository), Versus (Meta-data Repository), Seed URLs, .PT DNS Authority, User Input)

Query Processing Architecture (indexing phase) (diagram; components shown: WebStore (Contents Repository), Versus (Meta-data Repository), Index DataStructs Generator, Word Index, Page Attributes (Authority))

SIDRA - Word Index Data Structure
– 2 files
– Term {docID} {hit}
– Hit = position + attrib
– DocID assigned in Static Rank order
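The layout above can be sketched in memory as follows. This is only an illustration, not SIDRA's actual code: the names `Hit`, `assign_doc_ids` and `build_index`, the attribute labels ("title", "bold", "plain") and the convention that docID 0 is the best-ranked page are all assumptions. The slide mentions two files on disk; this sketch keeps everything in one dict and does not try to model the file layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    position: int  # word offset inside the document
    attrib: str    # hypothetical label such as "title", "bold" or "plain"


def assign_doc_ids(static_rank):
    """Map URL -> docID so that docID 0 is the page with the highest static rank."""
    ordered = sorted(static_rank, key=lambda url: (-static_rank[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}


def build_index(docs, doc_ids):
    """docs maps URL -> list of (word, attrib).

    Returns term -> list of (docID, [Hit, ...]) sorted by docID, which is
    also static-rank order because of how docIDs were assigned.
    """
    postings = defaultdict(dict)
    for url, tokens in docs.items():
        for position, (word, attrib) in enumerate(tokens):
            by_doc = postings[word.lower()]
            by_doc.setdefault(doc_ids[url], []).append(Hit(position, attrib))
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}


static_rank = {"a.pt": 0.2, "b.pt": 0.9}
docs = {
    "a.pt": [("Lisboa", "title"), ("web", "plain")],
    "b.pt": [("web", "bold")],
}
doc_ids = assign_doc_ids(static_rank)
index = build_index(docs, doc_ids)
```

Because docIDs follow static rank, every posting list is already in ranking order once it is sorted by docID, which is what lets the matching phase return results without a separate sort.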

SIDRA – Index Range Partitioning

SIDRA - Ranking Engine (diagram; components shown: Clients, Query Broker, Page Attributes, Query Server, three Word Indexes)

Matching & Ranking Algorithm
Phase 1: Query Matching
– QueryServers fetch matching docIDs (pre-sorted in static ranking order)
– QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
– Pick the first N (1000) results from phase 1
– Compute the final rank using hits data
  – Are terms also in the title?
  – What is the distance among query terms in the page?
  – Are terms in bold or italic?
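A minimal sketch of the two phases, assuming that a lower docID means a higher static rank. The function names, the scoring weights and the data shapes are hypothetical; the slide lists the ranking signals but does not say how they are combined.

```python
import heapq
from itertools import islice


def broker_merge(server_streams):
    """Phase 1: merge ascending docID streams from several QueryServers.

    heapq.merge keeps the ascending (static-rank) order without
    materializing the inputs; repeated docIDs are emitted once.
    """
    last = None
    for doc_id in heapq.merge(*server_streams):
        if doc_id != last:
            yield doc_id
            last = doc_id


def final_score(doc_hits, terms):
    """Phase 2 score for one document. Weights are illustrative assumptions.

    doc_hits maps term -> list of (position, attrib).
    """
    score = 0.0
    for term in terms:
        attribs = {attrib for _, attrib in doc_hits.get(term, [])}
        if "title" in attribs:
            score += 2.0
        if attribs & {"bold", "italic"}:
            score += 0.5
    if len(terms) >= 2:
        # Proximity: smallest distance between the first two query terms.
        first = [pos for pos, _ in doc_hits.get(terms[0], [])]
        second = [pos for pos, _ in doc_hits.get(terms[1], [])]
        if first and second:
            gap = min(abs(x - y) for x in first for y in second)
            score += 1.0 / max(gap, 1)
    return score


def search(server_streams, hits, terms, n=1000):
    """Take the first n merged docIDs, then reorder them by final score."""
    candidates = list(islice(broker_merge(server_streams), n))
    # sorted() is stable, so ties keep their static-rank order.
    return sorted(candidates, key=lambda d: -final_score(hits[d], terms))


hits = {
    0: {"web": [(5, "plain")], "search": [(20, "plain")]},
    1: {"web": [(0, "title")], "search": [(1, "title")]},
    3: {"web": [(2, "bold")], "search": [(3, "plain")]},
}
merged = list(broker_merge([[0, 3, 7], [1, 3, 9]]))
ranked = search([[0, 3, 7], [1, 3, 9]], hits, ["web", "search"], n=3)
```

Cutting off at N before scoring bounds the cost of phase 2: only the candidates that static rank already favors are re-examined with the more expensive hit-level signals.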

Architecture

Index design
– Horizontal/global partition ~ each QueryServer contains all documents matching one criterion, e.g. a keyword
– Allows searches on different criteria in parallel (partition parallelism)
– Brokers merge results received in parallel as they are being produced (pipeline parallelism)
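The partition parallelism described above might look like the following sketch, in which each term's complete posting list lives on exactly one server and a broker queries the relevant servers concurrently. Several choices here are assumptions rather than details from the slides: routing terms by a CRC32 hash, conjunctive (AND) query semantics, and the names `QueryServer`, `server_for`, `build_partitions` and `broker_query`. Pipeline parallelism is not modeled; the broker waits for full lists before combining them.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class QueryServer:
    """Holds the complete posting lists for the terms assigned to it."""

    def __init__(self):
        self.postings = {}

    def lookup(self, term):
        return self.postings.get(term, [])


def server_for(term, n_servers):
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(term.encode("utf-8")) % n_servers


def build_partitions(index, n_servers):
    """Distribute term -> docID lists over n_servers QueryServers."""
    servers = [QueryServer() for _ in range(n_servers)]
    for term, doc_ids in index.items():
        servers[server_for(term, n_servers)].postings[term] = sorted(doc_ids)
    return servers


def broker_query(servers, terms):
    """Fetch each term's list in parallel and keep docIDs present in all lists."""
    def fetch(term):
        return servers[server_for(term, len(servers))].lookup(term)

    with ThreadPoolExecutor(max_workers=max(1, len(terms))) as pool:
        lists = list(pool.map(fetch, terms))
    if not lists:
        return []
    return sorted(set(lists[0]).intersection(*lists[1:]))


index = {"web": [0, 2, 5], "search": [2, 5, 9], "lisboa": [5]}
servers = build_partitions(index, 3)
both = broker_query(servers, ["web", "search"])
```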

Addressing Multi-dimensionality
Generalization: page-rank (a page importance measure) is but one of the possible ranking contexts.
Query Servers may index data according to other dimensions
– Time
– Location
– ...
Query Brokers perform the results fusion
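The slides do not say how brokers fuse results from different dimensions. As one possible illustration, the sketch below uses reciprocal rank fusion, a standard technique substituted here for whatever SIDRA actually does. The function name `fuse`, the constant `k=60` and the example lists are assumptions.

```python
from collections import defaultdict


def fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.

    ranked_lists holds one best-first list of docIDs per ranking dimension.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by docID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


by_importance = ["a", "b", "c"]  # e.g. order from a page-importance index
by_time = ["c", "a", "d"]        # e.g. order from a time-based index
fused = fuse([by_importance, by_time])
```

Rank-based fusion needs only the order each Query Server returns, so the dimensions do not have to produce comparable scores.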

Flexibility / Scalability
– User requests may be balanced among multiple Presentation Engines
– Contents may be replicated
– Requests may be balanced among multiple Query Brokers
– Page Attributes may be replicated
– Query Brokers may balance requests to multiple Query Servers
– Multiple Query Servers for a Word Index
– Word Indexes may be replicated
(diagram; components shown: Presentation Engine, WebStore (Contents Repository), Query Broker, Page Attributes, Query Server, three Word Indexes)

Non-functional properties
– Load-balancing ~ components distribute requests to multiple replicas (round-robin or least loaded)
– Fault-tolerance ~ components can detect high response times and redirect requests
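Both properties can be sketched with a small balancer. This is a simulation under stated assumptions, not SIDRA's implementation: each replica reports a fixed, made-up latency instead of being timed, the 1.0-second threshold is arbitrary, and the names `Replica` and `Balancer` are hypothetical.

```python
import itertools


class Replica:
    def __init__(self, name, latency):
        self.name = name
        self.latency = latency  # simulated response time in seconds
        self.in_flight = 0      # requests currently being served

    def handle(self, request):
        return f"{self.name}:{request}", self.latency


class Balancer:
    def __init__(self, replicas, max_latency=1.0):
        self.replicas = list(replicas)
        self.max_latency = max_latency
        self._cycle = itertools.cycle(self.replicas)
        self.suspect = set()  # names of replicas that answered too slowly

    def pick_round_robin(self):
        return next(self._cycle)

    def pick_least_loaded(self):
        return min(self.replicas, key=lambda r: r.in_flight)

    def send(self, request):
        """Round-robin dispatch that redirects away from slow replicas."""
        for _ in range(len(self.replicas)):
            replica = self.pick_round_robin()
            if replica.name in self.suspect:
                continue
            response, elapsed = replica.handle(request)
            if elapsed <= self.max_latency:
                return response
            self.suspect.add(replica.name)  # too slow: try the next replica
        raise RuntimeError("no responsive replica")


balancer = Balancer([Replica("a", 0.1), Replica("b", 5.0), Replica("c", 0.2)])
answers = [balancer.send(q) for q in ("q1", "q2", "q3", "q4")]
```

In this sketch a slow replica stays excluded once marked; a real deployment would presumably probe it again after some interval.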

Results
– With 1 QueryServer and 1 Broker, the system responds to workloads of 50 requests per second with an average response time of 0.779 seconds
– With 2 QueryServers and 1 Broker, the system responds to workloads of 110 requests per second with an average response time of 0.871 seconds
– Extensive discussion in upcoming dissertation

Tumba!
Modest effort:
– 1 Prof., 4-5 graduate students, 4-5 servers for 2 years
Still beta!
– Fault-tolerance will require substantially more hardware (replication)
– Periodic updates will demand more storage
– Full-time operators?
Encouraging feedback
http://tumba.pt
