1
Practical considerations for a web-scale search engine
Michael Isard, Microsoft Research Silicon Valley
2
Search and research
Lots of research motivated by web search
– Explore specific research questions
– Small to moderate scale
A few large-scale production engines
– Many additional challenges
– Not all purely algorithmic/technical
What are the extra constraints for a production system?
3
Production search engines
Scale up
– Tens of billions of web pages, images, etc.
– Tens of thousands to millions of computers
Geographic distribution
– For performance and reliability
Continuous crawling and serving
– No downtime, need fresh results
Long-term test/maintenance
– Simplicity a core goal
4
Disclaimer
Not going to describe any particular web-scale search engine
– No detailed public description of any engine
But general principles apply
5
Outline
Anatomy of a search engine
Query serving
Link-based ranking
Index generation
6
Structure of a search engine
[Diagram: components include the Web, document crawling, index building, link structure analysis, page feature training, ranker training, user behavior analysis, auxiliary answers, and query serving]
7
Some index statistics
Tens of billions of documents
– Each document contains thousands of terms
– Plus metadata
– Plus snippet information
Billions of unique terms
– Serial numbers, etc.
Hundreds of billions of nodes in web graph
Latency a few ms on average
– Well under a second worst-case
8
Query serving pipeline
[Diagram: the Web, front-end web servers, caches, etc., and the index servers]
9
Page relevance
Query-dependent component
– Query/document match, user metadata, etc.
Query-independent component
– Document rank, spam score, click rate, etc.
Ranker needs:
– Term frequencies and positions
– Document metadata
– Near-duplicate information
– …
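As a rough illustration of how such signals might be combined (not the actual ranker, which is a learned model over many more features), here is a toy Python sketch; the feature names and weights are hypothetical.

```python
def page_score(query_doc_features, doc_features, weights):
    """Toy blend of query-dependent and query-independent signals.

    Feature names and the linear combination are made up for the example;
    production rankers are trained models over many more features.
    """
    query_dependent = (weights["match"] * query_doc_features["text_match"]
                       + weights["proximity"] * query_doc_features["term_proximity"])
    query_independent = (weights["rank"] * doc_features["static_rank"]
                         + weights["click"] * doc_features["click_rate"]
                         - weights["spam"] * doc_features["spam_score"])
    return query_dependent + query_independent

print(page_score({"text_match": 0.8, "term_proximity": 0.5},
                 {"static_rank": 0.3, "click_rate": 0.1, "spam_score": 0.05},
                 {"match": 2.0, "proximity": 1.0, "rank": 1.5, "click": 1.0, "spam": 3.0}))
```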
10
Single-box query outline
Query: Hello world + {EN-US,…}
Term → posting list (doc.position entries):
– a: 1.2, 1.10, 1.16, …, 1040.23, …
– hello: 3.76, …, 45.48, …, 1125.3, …
– world: 7.12, …, 45.29, …, 1125.4, …
– matches: (45.48, 45.29), (1125.3, 1125.4), …
Doc → metadata:
– 1: foo.com/bar, EN-US, …
– 45: go.com/hw.txt, EN-US, …
– 1125: bar.com/a.html, EN-US, …
Doc → snippet data:
– 1: “once a week …”, …
The ranker combines the matched postings (1125.3, 45.48, …) with document metadata and snippet data to produce results.
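A minimal Python sketch of the lookup step above, assuming uncompressed in-memory posting lists of (doc id, position) pairs; real indexes are compressed and orders of magnitude larger.

```python
# Minimal single-box sketch: posting lists mirror the example above.
postings = {
    "hello": [(3, 76), (45, 48), (1125, 3)],
    "world": [(7, 12), (45, 29), (1125, 4)],
}

def candidate_docs(terms):
    """Return doc ids that contain every query term."""
    doc_sets = [{doc for doc, _ in postings.get(t, [])} for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(candidate_docs(["hello", "world"]))  # {45, 1125}
```

The surviving candidates would then be scored by the ranker using positions, metadata, and snippet data.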
11
Query statistics
Small number of terms (fewer than 10)
Posting lists length 1 to 100s of millions
– Most terms occur once
Potentially millions of documents to rank
– Response is needed in a few ms
– Tens of thousands of near duplicates
– Sorting documents by QI rank may help
Tens or hundreds of snippets
12
Distributed index structure
Tens of billions of documents
Thousands of queries per second
Index is constantly updated
– Most pages turn over in at most a few weeks
– Some very quickly (news sites)
– Almost every page is never returned
How to distribute?
13
Distributed index: split by term
Each computer stores a subset of terms
Each query goes only to a few computers
Document metadata stored separately
[Diagram: the query “Hello world” + {EN-US,…} is routed to term-range partitions A-G, H-M, N-S, T-Z, with the ranker and a separate metadata store]
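A toy sketch of the front-end routing this partitioning implies, assuming terms are assigned to servers by hashing; as the later “cons” slides note, simple hashing may be too unbalanced for long or hot posting lists.

```python
import hashlib

# Hypothetical term-to-server routing for a term-partitioned index.
NUM_TERM_SERVERS = 4

def server_for_term(term: str) -> int:
    """Assign a term to an index server by hashing it."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_TERM_SERVERS

query_terms = ["hello", "world"]
print({t: server_for_term(t) for t in query_terms})
```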
14
Split by term: pros
Short queries only touch a few computers
– With high probability all are working
Long posting lists improve compression
– Most words occur many times in corpus
15
Split by term: cons (1)
Must ship posting lists across network
– Multi-term queries make things worse
– But maybe pre-computing can help?
  – Intersections of lists for common pairs of terms
  – Needs to work with constantly updating index
Extra network roundtrip for doc metadata
– Too expensive to store in every posting list
Where does the ranker run?
– Hundreds of thousands of ranks to compute
16
Split by term: cons (2)
Front-ends must map terms to computers
– Simple hashing may be too unbalanced
– Some terms may need to be split/replicated
  – Long posting lists
  – “Hot” posting lists
Sorting by QI rank is a global operation
– Needs to work with index updates
17
Distributed index: split by document
Each computer stores a subset of docs
Each query goes to many computers
Document metadata stored inline
[Diagram: the query “Hello world” + {EN-US,…} is fanned out to document partitions (docs 1-1000, 1001-2000, 2001-3000, 3001-4000), each with its own ranker; an aggregator merges the results]
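A toy Python sketch of the scatter-gather pattern in this layout: each partition ranks its own documents and returns only a small top-k list, which an aggregator merges. The term-count scorer is a stand-in for the real ranker.

```python
import heapq

def toy_score(doc_text, query_terms):
    """Placeholder scorer: count query-term occurrences in the document."""
    words = doc_text.lower().split()
    return sum(words.count(t) for t in query_terms)

def rank_partition(partition, query_terms, k=10):
    """partition: dict of doc_id -> text owned by one index server."""
    scored = [(toy_score(text, query_terms), doc_id) for doc_id, text in partition.items()]
    return heapq.nlargest(k, scored)

def aggregate(per_partition_results, k=10):
    """Merge the small per-partition result lists into a global top-k."""
    return heapq.nlargest(k, (hit for hits in per_partition_results for hit in hits))

partitions = [
    {1: "hello world", 2: "hello there"},
    {1001: "world news", 1002: "hello hello world"},
]
query = ["hello", "world"]
print(aggregate([rank_partition(p, query) for p in partitions], k=3))
```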
18
Split by document: pros
Ranker on same computer as document
– All data for a given doc in the same place
– Ranker computation is distributed
Can get low latency
Sorting by QI rank local to each computer
Only ranks+scores need to be aggregated
– Hundreds of results, not millions
19
Split by document: cons
A query touches hundreds of computers
– One slow computer makes query slow
– Computers per query is linear in corpus size
– But query speeds are not i.i.d.
Shorter posting lists: worse compression
– Each word split into many posting lists
20
Index replication
Multiple copies of each partition
– Needed for redundancy, performance
Makes things more complicated
– Can mitigate latency variability: ask two replicas, one will probably return quickly
– Interacts with data layout: split by document may be simpler
Consistency may not be essential
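A minimal sketch of the “ask two replicas” idea using Python threads; query_replica is a hypothetical RPC stub and the timeout is illustrative.

```python
import concurrent.futures

def hedged_query(query, replicas, query_replica, timeout=0.1):
    """Issue the query to every replica and return whichever answers first.

    query_replica(replica, query) is a hypothetical blocking RPC call.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(query_replica, replica, query) for replica in replicas]
    done, _ = concurrent.futures.wait(
        futures, timeout=timeout,
        return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)  # don't block on the slower replica
    return next(iter(done)).result() if done else None
```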
21
Splitting: word vs document
Original Google paper split by word
All major engines split by document now?
– Tens of microseconds to rank a document
22
Link-based ranking
Intuition: “quality” of a page is reflected somehow in the link structure of the web
Made famous by PageRank
– Can be seen as stationary distribution of a random walk on the web graph
– Google’s original advantage over AltaVista?
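For reference, a small power-iteration sketch of PageRank as the stationary distribution of a damped random walk; the tiny graph and damping factor are illustrative, not how a production system computes it over hundreds of billions of nodes.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each node to the list of nodes it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, out_links in graph.items():
            if not out_links:  # dangling node: spread its rank everywhere
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
            else:
                for target in out_links:
                    new_rank[target] += damping * rank[node] / len(out_links)
        rank = new_rank
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```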
23
Some hints
PageRank is (no longer) very important
Anchor text contains similar information
– BM25F includes a lot of link structure
Query-dependent link features may be useful
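A simplified BM25F-style scorer, sketching how anchor text can enter ranking as just another weighted field alongside body and title; the field weights and parameters below are made up for the example.

```python
# Illustrative BM25F-style scoring: per-field term frequencies are combined
# with field weights and length normalization before BM25 saturation.
FIELD_WEIGHTS = {"body": 1.0, "title": 3.0, "anchor": 5.0}
FIELD_B = {"body": 0.75, "title": 0.5, "anchor": 0.5}
K1 = 1.2

def bm25f(query_terms, doc_fields, avg_field_len, idf):
    """doc_fields: field name -> list of tokens; idf: term -> idf value."""
    score = 0.0
    for term in query_terms:
        weighted_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            norm = 1.0 + FIELD_B[field] * (len(tokens) / avg_field_len[field] - 1.0)
            weighted_tf += FIELD_WEIGHTS[field] * tf / norm
        score += idf.get(term, 0.0) * weighted_tf / (K1 + weighted_tf)
    return score

doc = {"body": "hello world hello".split(),
       "title": "hello".split(),
       "anchor": "hello world".split()}
avg_len = {"body": 100.0, "title": 5.0, "anchor": 10.0}
print(bm25f(["hello", "world"], doc, avg_len, idf={"hello": 1.2, "world": 2.5}))
```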
24
Comparing the Effectiveness of HITS and SALSA, M. Najork, CIKM 2007
25
Query-dependent link features
[Diagram: example neighborhood link graph over pages A-N]
26
Real-time QD link information
Lookup of neighborhood graph
Followed by SALSA
In a few ms
Seems like a good topic for approximation/learning
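A rough sketch of SALSA authority scores on a small neighborhood graph, computed by alternating backward (in-link) and forward (out-link) half-steps of the random walk; a production system must do this, or an approximation of it, within a few milliseconds.

```python
def salsa_authority(edges, iterations=50):
    """edges: list of (source, target) links in the neighborhood graph."""
    out_links, in_links = {}, {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)
        in_links.setdefault(dst, []).append(src)
    authorities = list(in_links)
    score = {a: 1.0 / len(authorities) for a in authorities}
    for _ in range(iterations):
        # Backward half-step: authority mass flows to hubs via in-links.
        hub = {}
        for a in authorities:
            for h in in_links[a]:
                hub[h] = hub.get(h, 0.0) + score[a] / len(in_links[a])
        # Forward half-step: hub mass flows back to authorities via out-links.
        new_score = {a: 0.0 for a in authorities}
        for h, mass in hub.items():
            for a in out_links[h]:
                new_score[a] += mass / len(out_links[h])
        score = new_score
    return score

print(salsa_authority([("a", "c"), ("b", "c"), ("b", "d"), ("e", "d")]))
```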
27
Index building
Catch-all term
– Create inverted files
– Compute document features
– Compute global link-based statistics
– Which documents to crawl next?
– Which crawled documents to put in the index?
Consistency may be needed here
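A minimal sketch of creating an inverted file from crawled documents; real index builders run as large parallel jobs over partitioned storage and compress their output.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """documents: dict of doc_id -> text; returns term -> [(doc_id, position)]."""
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        for position, term in enumerate(documents[doc_id].lower().split()):
            postings[term].append((doc_id, position))
    return dict(postings)

print(build_inverted_file({1: "hello world", 2: "hello again"}))
# {'hello': [(1, 0), (2, 0)], 'world': [(1, 1)], 'again': [(2, 1)]}
```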
28
Index lifecycle
[Diagram: a loop through the Web, page crawling, index selection, query serving, and usage analysis]
29
Experimentation
A/B testing is best
– Ranking, UI, etc.
– Immediate feedback on what works
– Can be very fine-grained (millions of queries)
Some things are very hard
– Index selection, etc.
– Can run parallel build processes
Long time constants: not easy to do brute force
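A hypothetical sketch of how fine-grained A/B bucketing might be done: hash a stable identifier into buckets so a small, consistent slice of traffic sees the experimental treatment. The identifier choice and bucket counts are assumptions for illustration.

```python
import hashlib

def experiment_bucket(user_id: str, num_buckets: int = 1000) -> int:
    """Deterministically map a stable id (e.g. a user id) to a bucket."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def use_experimental_ranker(user_id: str, treatment_buckets=range(10)) -> bool:
    """Route ~1% of traffic (10 of 1000 buckets) to the experimental ranker."""
    return experiment_bucket(user_id) in treatment_buckets
```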
30
Implementing new features
Document-specific features much “cheaper”
– Spam probability, duplicate fingerprints, language
Global features can be done, but with a higher bar
– Distribute anchor text
– PageRank et al.
Danger of “butterfly effect” on system as a whole
31
Distributing anchor text
[Diagram: crawlers extract anchor text and ship it to the indexer responsible for the target document’s partition (e.g. docs f0-ff)]
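A toy sketch of the routing shown above, assuming documents are partitioned by a hash of their URL (in the spirit of the “docs f0-ff” ranges); anchor text extracted by crawlers is grouped by the partition that owns the link target.

```python
import hashlib
from collections import defaultdict

def partition_for_url(url: str, num_partitions: int = 256) -> int:
    """Assign a document URL to an index partition by hashing it."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def route_anchors(links):
    """links: iterable of (target_url, anchor_text) pairs from crawled pages."""
    by_partition = defaultdict(list)
    for target_url, anchor_text in links:
        by_partition[partition_for_url(target_url)].append((target_url, anchor_text))
    return by_partition
```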
32
Distributed infrastructure
Things are improving
– Large-scale partitioned file systems
  – Files commonly contain many TB of data
  – Accessed in parallel
– Large-scale data-mining platforms
– General-purpose data repositories
Data-centric
– Traditional supercomputing is cycle-centric
33
Software engineering
Simple always wins
Hysteresis
– Prove a change will improve things
  – Big improvement needed to justify big change
– Experimental platforms are essential
34
Summary
Search engines are big and complicated
Some things are easier to change than others
Harder changes need more convincing experiments
Small datasets are not good predictors for large datasets
Systems/learning may need to collaborate