Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search
Miguel Costa, Mário J. Silva
Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática
XLDB Research Group
Portuguese Web
There is an identifiable community Web, which we call the Portuguese Web
  – The web of the people directly related to Portugal
This is NOT a small community web
  – 10M population (PT)
  – 3+ M users
  – 4+ M pages
Tumba! (Temos um Motor de Busca Alternativo! / "We Have an Alternative Search Engine!")
Public service
  – Community Web Search Engine
  – Web Archive
  – Research infrastructure
See it in action at
Statistics
Up to 20,000 queries/day
3.5 million documents under .PT – the deepest crawl!
95% of responses under 0.5 sec
Tumba!
[Diagram labels: Web, Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine, SIDRA]
Crawling + archiving
[Diagram labels: Web, ViúvaNegra (Crawling Engine), WebStore (Contents Repository), Versus (Meta-data Repository), Seed URLs, .PT DNS Authority, User Input]
Query Processing Architecture (indexing phase)
[Diagram labels: WebStore (Contents Repository), Versus (Meta-data Repository), Index DataStructs Generator, Word Index, Page Attributes (Authority)]
SIDRA - Word Index Data Structure
2 files
Term {docID} {hit}
Hit = position + attrib
DocID assigned in Static Rank order
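The word index layout above can be illustrated with a minimal in-memory sketch. It is not the SIDRA implementation: the on-disk two-file layout is not modelled, and the names `Hit`, `assign_doc_ids`, `build_word_index`, the attribute tags, and the sample pages are all assumptions made for illustration. The point it shows is that numbering documents by static rank makes every posting list come out sorted by static rank.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    position: int  # word offset inside the document
    attrib: str    # hypothetical attribute tag, e.g. "title", "bold", "plain"


def assign_doc_ids(static_rank):
    """Number documents so that the best static rank gets docID 0."""
    ordered = sorted(static_rank, key=lambda url: (-static_rank[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}


def build_word_index(docs, doc_ids):
    """docs maps url -> list of (term, attrib) tokens in document order."""
    index = defaultdict(dict)  # term -> {docID: [Hit, ...]}
    for url, tokens in docs.items():
        doc_id = doc_ids[url]
        for position, (term, attrib) in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(Hit(position, attrib))
    # Sorting by docID is the same as sorting by static rank.
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Made-up static ranks and pages, for illustration only.
static_rank = {"a.pt": 0.2, "b.pt": 0.9, "c.pt": 0.5}
docs = {
    "a.pt": [("lisboa", "title"), ("web", "plain")],
    "b.pt": [("web", "bold"), ("search", "plain")],
    "c.pt": [("web", "plain"), ("lisboa", "plain")],
}
doc_ids = assign_doc_ids(static_rank)
word_index = build_word_index(docs, doc_ids)
```

Here `b.pt` has the highest static rank, so it receives docID 0 and appears first in the posting list of every term it contains.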
SIDRA – Index Range Partitioning
SIDRA - Ranking Engine
[Diagram labels: Word Index (×3), Query Server, Query Broker, Page Attributes, Clients]
Matching & Ranking Algorithm
Phase 1: Query Matching
  – QueryServers fetch matching docIDs (pre-sorted in static ranking order)
  – QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
  – Pick the first N (1000) results from phase 1
  – Compute final rank using hits data
    – Are terms also in the title?
    – What is the distance among query terms in the page?
    – Are terms in bold or italic?
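The two phases can be sketched as follows, on a single machine and under stated assumptions. Conjunctive (AND) query semantics, the scoring weights, the crude proximity measure based on first occurrences, and the names `match`, `final_score` and `rank` are all hypothetical; the slides only state that hit data (title, distance, bold/italic) feeds the final rank.

```python
import heapq
from itertools import groupby, islice


def match(posting_streams, n_terms):
    """Phase 1: k-way merge of docID-sorted streams, one stream per query term.

    The merged stream stays in docID (static rank) order. A docID is kept
    only if every term contributed it (AND semantics, an assumption).
    """
    merged = heapq.merge(*posting_streams)
    for doc_id, group in groupby(merged):
        if sum(1 for _ in group) == n_terms:
            yield doc_id


def final_score(doc_id, query_terms, hits):
    """Phase 2 score; hits[(term, doc_id)] is a list of (position, attrib)."""
    score = 0.0
    first_positions = []
    for term in query_terms:
        term_hits = hits[(term, doc_id)]
        first_positions.append(min(pos for pos, _ in term_hits))
        if any(attrib == "title" for _, attrib in term_hits):
            score += 2.0  # hypothetical weight for a title match
        if any(attrib in ("bold", "italic") for _, attrib in term_hits):
            score += 0.5  # hypothetical weight for emphasised text
    if len(first_positions) > 1:
        # Closer query terms score higher (crude proximity measure).
        score += 1.0 / (1 + max(first_positions) - min(first_positions))
    return score


def rank(candidates, query_terms, hits, n=1000):
    """Re-rank only the first n candidates produced by phase 1."""
    top = list(islice(candidates, n))
    # sorted() is stable, so ties keep their static rank order.
    return sorted(top, key=lambda d: -final_score(d, query_terms, hits))


# Made-up postings and hits for the query "web lisboa".
terms = ["web", "lisboa"]
streams = [iter([0, 1, 2]), iter([1, 2])]
hits = {
    ("web", 1): [(0, "plain")], ("lisboa", 1): [(1, "plain")],
    ("web", 2): [(1, "plain")], ("lisboa", 2): [(0, "title")],
}
ranked = rank(match(streams, len(terms)), terms, hits)
```

Phase 1 never looks at hit data, so it can stop as soon as N candidates have been produced; the more expensive scoring in phase 2 then touches at most N documents.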
Architecture
Index design
Horizontal/global partition ~ each QueryServer contains all documents matching a criterion, e.g. a keyword
  – allows searches on different criteria in parallel (partition parallelism)
  – Brokers merge results received in parallel as they are being produced (pipeline parallelism)
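A minimal sketch of the two ideas above, using Python generators to stand in for network streams. The alphabet-range routing rule in `server_for`, the `QueryServer` class, and the sample postings are assumptions for illustration; the slides do not say how terms are assigned to servers. The `sent` counter is only there to make the pipelining visible.

```python
import heapq
from itertools import islice


class QueryServer:
    """Holds the complete posting lists of the terms assigned to it."""

    def __init__(self, postings):
        self.postings = postings  # term -> docIDs sorted by static rank
        self.sent = 0             # number of docIDs actually streamed out

    def stream(self, term):
        for doc_id in self.postings.get(term, []):
            self.sent += 1
            yield doc_id


def server_for(term, servers):
    """Hypothetical range rule: split the alphabet evenly across servers."""
    slot = (ord(term[0].lower()) - ord("a")) * len(servers) // 26
    return servers[min(max(slot, 0), len(servers) - 1)]


def broker(terms, servers):
    """Merge per-term streams lazily, as results are being produced."""
    streams = [server_for(term, servers).stream(term) for term in terms]
    return heapq.merge(*streams)


# Two servers: terms starting a-m on the first, n-z on the second.
servers = [
    QueryServer({"lisboa": [1, 2, 7, 9]}),
    QueryServer({"web": [0, 1, 2, 5, 8]}),
]
first_three = list(islice(broker(["web", "lisboa"], servers), 3))
streamed = sum(server.sent for server in servers)
```

Because the merge is lazy, asking for three results pulls only a few docIDs from each server instead of the full posting lists. In a real deployment the streams would come from separate machines working in parallel.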
Addressing Multi-dimensionality
Generalization: page-rank (a page importance measure) is but one of the possible ranking contexts.
Query Servers may index data according to other dimensions
  – time
  – location
  – ...
Query Brokers perform the results fusion
Flexibility / Scalability
User requests may be balanced among multiple Presentation Engines
Contents may be replicated
Requests may be balanced among multiple Query Brokers
Page Attributes may be replicated
Query Brokers may balance requests to multiple Query Servers
Multiple Query Servers for a Word Index
Word Indexes may be replicated
[Diagram labels: Word Index (×3), Query Server, Query Broker, Page Attributes, Presentation Engine, WebStore (Contents Repository)]
Non-functional properties
load-balancing ~ components distribute requests to multiple replicas (round-robin or least loaded)
fault-tolerance ~ components can detect high response times and redirect requests
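Both properties can be sketched in a few lines. This is an illustration under assumptions, not the Tumba! code: the `Replica` and `Balancer` classes, the smoothed latency estimate, and the 1-second `slow_after` threshold are all hypothetical.

```python
import itertools


class Replica:
    def __init__(self, name):
        self.name = name
        self.in_flight = 0      # requests currently being served
        self.avg_latency = 0.0  # exponentially smoothed response time (s)

    def record(self, seconds, alpha=0.5):
        """Fold one observed response time into the running average."""
        self.avg_latency = alpha * seconds + (1 - alpha) * self.avg_latency


class Balancer:
    def __init__(self, replicas, slow_after=1.0):
        self.replicas = replicas
        self.slow_after = slow_after  # hypothetical threshold, in seconds
        self._cycle = itertools.cycle(replicas)

    def healthy(self):
        """Replicas whose response time is acceptable (all, if none are)."""
        ok = [r for r in self.replicas if r.avg_latency <= self.slow_after]
        return ok or self.replicas

    def round_robin(self):
        """Next replica in turn, skipping replicas that respond slowly."""
        healthy = self.healthy()
        while True:
            replica = next(self._cycle)
            if replica in healthy:
                return replica

    def least_loaded(self):
        """Healthy replica with the fewest requests in flight."""
        return min(self.healthy(), key=lambda r: r.in_flight)


a, b, c = Replica("qs-a"), Replica("qs-b"), Replica("qs-c")
lb = Balancer([a, b, c])
picks = [lb.round_robin().name for _ in range(4)]
b.record(5.0)  # qs-b starts responding slowly
picks_after = [lb.round_robin().name for _ in range(3)]
a.in_flight, c.in_flight = 3, 1
least = lb.least_loaded()
```

Once `qs-b` reports a high response time it is skipped by both policies, which is the redirection behaviour the slide describes; a fuller version would also re-probe slow replicas so they can rejoin.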
Results
With 1 QueryServer and 1 Broker
  – responds to workloads of 50 requests per second with an average time of seconds
With 2 QueryServers and 1 Broker
  – responds to workloads of 110 requests per second with an average time of seconds
Extensive discussion in upcoming dissertation
Tumba!
Modest effort:
  – 1 Prof., 4-5 graduate students, 4-5 servers for 2 years
Still beta!
  – Fault-tolerance will require substantially more hardware (replication)
  – Periodic update will demand more storage
  – Full-time operators?
Encouraging feedback