Published by Kolton Hatchell. Modified over 10 years ago.
1
Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search
Miguel Costa, Mário J. Silva
Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática
XLDB Research Group
mjs@di.fc.ul.pt
2
Portuguese Web
There is an identifiable community Web that we call the Portuguese Web
– The web of the people directly related to Portugal
This is NOT a small community web
– Population of Portugal: 10M
– 3+ M users
– 4+ M pages
3
Tumba! (Temos um Motor de Busca Alternativo!, "We have an alternative search engine!")
Public service
– Community Web search engine
– Web archive
– Research infrastructure
See it in action at http://tumba.pt
4
Statistics
Up to 20,000 queries/day
3.5 million documents under .PT – the deepest crawl!
95% of responses in under 0.5 sec
5
Tumba!
[Diagram labels: Web; Crawlers; Repository; Indexing Engine; Ranking Engine; Presentation Engine; SIDRA]
6
Crawling + archiving
[Diagram labels: Web; ViúvaNegra (Crawling Engine); WebStore (Contents Repository); Versus (Meta-data Repository); Seed URLs; .PT DNS Authority; User Input]
7
Query Processing Architecture (indexing phase)
[Diagram labels: WebStore (Contents Repository); Versus (Meta-data Repository); Index DataStructs Generator; Word Index; Page Attributes (Authority)]
8
SIDRA - Word Index Data Structure
2 files
Term {docID} {hit}
Hit = position + attrib
DocID assigned in Static Rank order
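A minimal in-memory sketch of the layout described on this slide, assuming Python. The names `Hit`, `assign_doc_ids`, and `build_word_index`, the attribute strings, and the sample data are all hypothetical; the two-file on-disk format is not modelled because the slide does not describe it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    position: int  # word offset of the term inside the document
    attrib: str    # hypothetical attribute tag, e.g. "title", "bold", "plain"


def assign_doc_ids(static_rank):
    """Map URL -> docID so that a smaller docID means a higher static rank."""
    ordered = sorted(static_rank, key=static_rank.get, reverse=True)
    return {url: doc_id for doc_id, url in enumerate(ordered)}


def build_word_index(docs, doc_ids):
    """Build term -> list of (docID, [hits]), sorted by docID."""
    index = {}
    for url, tokens in docs.items():
        per_term = {}
        for position, (word, attrib) in enumerate(tokens):
            per_term.setdefault(word.lower(), []).append(Hit(position, attrib))
        for term, hits in per_term.items():
            index.setdefault(term, []).append((doc_ids[url], hits))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[0])
    return index


# Made-up static ranks and tokenized pages, for illustration only.
static_rank = {"a.pt": 0.2, "b.pt": 0.9, "c.pt": 0.5}
docs = {
    "a.pt": [("Lisboa", "title"), ("search", "plain")],
    "b.pt": [("search", "bold"), ("engine", "plain")],
    "c.pt": [("engine", "title"), ("search", "plain"), ("search", "plain")],
}
doc_ids = assign_doc_ids(static_rank)
index = build_word_index(docs, doc_ids)
```

Because docIDs follow static rank, reading a posting list front to back already visits documents from the highest static rank to the lowest, with no sort at query time.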
9
SIDRA – Index Range Partitioning
10
SIDRA - Ranking Engine
[Diagram labels: Clients; Query Broker; Query Server; Word Index (three instances); Page Attributes]
11
Matching & Ranking Algorithm
Phase 1: Query Matching
– QueryServers fetch matching docIDs (pre-sorted in static ranking order)
– QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
– Pick the first N (1000) results from phase 1
– Compute the final rank using hits data
  – Are terms also in the title?
  – What is the distance among query terms in the page?
  – Are terms in bold or italic?
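The two phases can be sketched as follows, assuming Python. This is an illustration, not the SIDRA implementation: the score weights, the function names, and the sample hits are hypothetical, and every candidate is assumed to contain all query terms.

```python
import heapq


def broker_merge(sorted_streams, n):
    """Phase 1: merge pre-sorted docID streams, keep the first n distinct docIDs."""
    out = []
    for doc_id in heapq.merge(*sorted_streams):
        if not out or out[-1] != doc_id:
            out.append(doc_id)
            if len(out) == n:
                break
    return out


def min_distance(positions_a, positions_b):
    """Smallest gap, in words, between any occurrence of two terms."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def final_score(doc_hits, terms):
    """Phase 2: score one document from its hits (term -> [(position, attrib)])."""
    score = 0.0
    for term in terms:
        attribs = {attrib for _, attrib in doc_hits[term]}
        if "title" in attribs:
            score += 2.0  # hypothetical weight for a title match
        if attribs & {"bold", "italic"}:
            score += 1.0  # hypothetical weight for emphasized text
    for first, second in zip(terms, terms[1:]):
        gap = min_distance([p for p, _ in doc_hits[first]],
                           [p for p, _ in doc_hits[second]])
        score += 1.0 / (1 + gap)  # closer terms score higher
    return score


def rank(candidates, hits, terms):
    # sorted() is stable, so ties keep the static-rank order from phase 1.
    return sorted(candidates, key=lambda d: -final_score(hits[d], terms))


streams = [[0, 4, 7], [1, 4, 9], [2, 3]]  # one sorted stream per QueryServer
candidates = broker_merge(streams, n=5)
hits = {
    0: {"web": [(5, "plain")], "search": [(40, "plain")]},
    1: {"web": [(0, "title")], "search": [(1, "title")]},
    2: {"web": [(3, "bold")], "search": [(4, "plain")]},
    3: {"web": [(10, "plain")], "search": [(11, "plain")]},
    4: {"web": [(2, "plain")], "search": [(30, "italic")]},
}
ranked = rank(candidates, hits, ["web", "search"])
```

Since docIDs are assigned in static rank order, a plain ascending merge in phase 1 is enough to preserve that order, and only the first N candidates pay the cost of the hit-based scoring in phase 2.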
12
Architecture
13
Index design
horizontal/global partition ~ each QueryServer contains all documents matching one criterion, e.g. one keyword
allows searches on different criteria in parallel (partition parallelism)
Brokers merge results received in parallel as they are being produced (pipeline parallelism)
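A small sketch of both kinds of parallelism, assuming Python, with threads standing in for separate QueryServer machines. The function names, the queue size, and the sample partitions are hypothetical; each partition is assumed to return docIDs in ascending order.

```python
import heapq
import queue
import threading

_DONE = object()  # sentinel marking the end of one server's results


def query_server(postings, out):
    """Runs in its own thread: emit docIDs one at a time, then the sentinel."""
    for doc_id in postings:
        out.put(doc_id)
    out.put(_DONE)


def stream(q):
    """Lazy iterator over a queue; blocks until the next docID arrives."""
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def broker(partitions):
    """Yield merged docIDs while the servers are still producing."""
    queues = [queue.Queue(maxsize=2) for _ in partitions]
    for postings, q in zip(partitions, queues):
        # Partition parallelism: one worker per partition.
        threading.Thread(target=query_server, args=(postings, q),
                         daemon=True).start()
    # Pipeline parallelism: heapq.merge is lazy, so it yields a docID as soon
    # as it holds the current head of every stream.
    yield from heapq.merge(*(stream(q) for q in queues))


merged = list(broker([[0, 3, 6], [1, 4, 7], [2, 5, 8]]))
```

The bounded queues keep each server at most two results ahead of the broker, so in this sketch no side buffers a full result list.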
14
Addressing Multi-dimensionality
Generalization: page-rank (a page importance measure) is but one of the possible ranking contexts.
Query Servers may index data according to other dimensions
– Time
– Location
– ...
Query Brokers perform the results fusion
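The slide does not say how the brokers fuse results from different dimensions. As one possible illustration, the sketch below uses reciprocal rank fusion, a generic technique swapped in here and not claimed to be what SIDRA does. The dimension names and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked docID lists into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken by docID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-dimension rankings returned by three groups of Query Servers.
rankings = {
    "authority": [1, 2, 3],
    "time": [3, 1, 4],
    "location": [3, 2, 1],
}
fused = reciprocal_rank_fusion(rankings)
```

A rank-based fusion only needs the order of each list, so the dimensions do not have to produce comparable scores.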
15
Flexibility / Scalability
User requests may be balanced among multiple Presentation Engines
Contents may be replicated
Requests may be balanced among multiple Query Brokers
Page Attributes may be replicated
Query Brokers may balance requests to multiple Query Servers
Multiple Query Servers for a Word Index
Word Indexes may be replicated
[Diagram labels: Presentation Engine; WebStore (Contents Repository); Query Broker; Page Attributes; Query Server; Word Index (three instances)]
16
Non-functional properties
load-balancing ~ components distribute requests to multiple replicas (round-robin or least loaded)
fault-tolerance ~ components can detect high response times and redirect requests
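Both properties can be sketched in a few lines, assuming Python. The `Replica` and `Balancer` classes, the timeout value, and the replica names are hypothetical, and response times are simulated by a fixed `latency` field in place of real measurements.

```python
import itertools


class Replica:
    def __init__(self, name, latency):
        self.name = name
        self.latency = latency  # simulated response time, in seconds
        self.active = 0         # requests currently in flight

    def handle(self, request):
        return f"{self.name}:{request}"


class Balancer:
    def __init__(self, replicas, timeout=1.0):
        self.replicas = replicas
        self.timeout = timeout
        self._cycle = itertools.cycle(replicas)

    def round_robin(self):
        """Pick replicas in a fixed rotating order."""
        return next(self._cycle)

    def least_loaded(self):
        """Pick the replica with the fewest requests in flight."""
        return min(self.replicas, key=lambda r: r.active)

    def send(self, request, pick):
        """Try the picked replica; if it is too slow, redirect to the others."""
        first = pick()
        order = [first] + [r for r in self.replicas if r is not first]
        for replica in order:
            if replica.latency <= self.timeout:  # simulated timeout check
                return replica.handle(request)
        raise RuntimeError("no replica answered within the timeout")


replicas = [Replica("qs1", 0.2), Replica("qs2", 5.0), Replica("qs3", 0.3)]
balancer = Balancer(replicas, timeout=1.0)
rotation = [balancer.round_robin().name for _ in range(4)]

replicas[0].active, replicas[1].active, replicas[2].active = 3, 2, 1
idlest = balancer.least_loaded().name

# qs2 is slower than the timeout, so the request is redirected.
redirected = balancer.send("q", lambda: replicas[1])
```

A real component would time actual calls and update `active` around each request; the sketch only shows the selection and redirect logic.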
17
Results
With 1 QueryServer and 1 Broker, the system responds to workloads of 50 requests per second with an average response time of 0.779 seconds
With 2 QueryServers and 1 Broker, the system responds to workloads of 110 requests per second with an average response time of 0.871 seconds
Extensive discussion in upcoming dissertation
18
Tumba!
Modest effort:
– 1 Prof., 4-5 graduate students, 4-5 servers for 2 years
Still beta!
– Fault-tolerance will require substantially more hardware (replication)
– Periodic updates will demand more storage
– Full-time operators?
Encouraging feedback
http://tumba.pt