Yahoo! BOSS
Open up Yahoo!’s Search data via web services
Developer & Custom Tracks
Big Goal
  – If you’re in a vertical and you perform a search, you should be confident that the results you get back will be at least as good as those on Google or Yahoo!, and better because that vertical has additional relevant information
Yahoo! BOSS Developer
Unrestricted RESTful APIs
  – Presentation/Ranking control & Query limits Off
  – Web, News, Spelling, Images, Site Explorer
Disclosing once internal-only data
  – Delicious bookmarks metadata
  – Searchmonkey (microformats e.g. LinkedIn profiles)
  – Extracted Entities (with scores, term variants)
  – Larger Abstracts
100’s of Developer Apps
Model
It’s not a Search API, it’s really a Data API
Search happens to be an easy way to retrieve data from billions of varying documents
Slowly moving beyond keyword match
  – searchmonkeyid, site restricts, doc type, inurl, intitle, lang, region, date, flickr
Defer re-ordering, blending to user
  – Scale: Tens of millions BOSS QPD
  – Difficult to universalize ranking models
Yahoo! BOSS Custom
Most Common Requests
  – (1) Search fresh data not on web, (2) Do thousands of site restricts
Solution: Hosted Vertical Search in Yahoo!’s Cloud
  – Near real-time indexing of millions of documents
  – Data may be structured with fields, indexable properties
Schemas, Schema-less, Filters, Range Queries
Access to more search ranking features
API primitives for federating custom & developer search results
  – Very basic priority stacking
  – Backfill developer results to capture comprehensiveness for tail vertical queries
Create your own “view” of web, vertical search
  – More ranking control server-side
  – Logically, physically isolated from core web search engine
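The "very basic priority stacking" and backfill ideas on this slide can be sketched roughly as follows. This is only an illustration under assumptions, not the BOSS federation API: the function name `stack_with_backfill`, the `url` key, and the `limit` parameter are all hypothetical. Custom (vertical) results are placed first, and web results fill the remaining slots, skipping URLs already shown.

```python
from typing import Dict, List

# Hypothetical result shape: a dict with at least a "url" key.
Result = Dict[str, str]


def stack_with_backfill(custom: List[Result],
                        web: List[Result],
                        limit: int = 10) -> List[Result]:
    """Priority stacking: custom results first, then web results as backfill.

    Duplicate URLs are dropped so a page found by both sources appears once,
    at the position given by the higher-priority source.
    """
    blended: List[Result] = []
    seen = set()
    for source in (custom, web):  # order of this tuple is the priority order
        for result in source:
            if len(blended) >= limit:
                return blended
            url = result["url"]
            if url in seen:
                continue
            seen.add(url)
            blended.append(result)
    return blended
```

For a tail query where the vertical index returns only a couple of documents, the web results make up the rest of the page, which is the comprehensiveness point the slide makes.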
Blending Vertical + Web
Key to comprehensiveness
Right now TechCrunch search does basic backfilling
Can we do better?
Learning transfer functions
  – Normalizing two sets of results on same scale
Ex. delicious + web
  – X: | Y: delicious count
  – Machine learn the delicious counts => f
  – Now do a web search, sort by f(web result); works well
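A minimal sketch of the transfer-function idea, under loud assumptions: the slide does not say what X is or which learner was used, so this example takes X to be a single hypothetical numeric feature of a web result (called `feature` here) and fits an ordinary least-squares line to log-scaled delicious counts. The names `fit_transfer` and `rank_web_results` are illustrative only.

```python
import math
from typing import Callable, Dict, List, Sequence


def fit_transfer(xs: Sequence[float],
                 counts: Sequence[int]) -> Callable[[float], float]:
    """Fit f(x) ~ log(1 + delicious count) with one-variable least squares."""
    ys = [math.log1p(c) for c in counts]  # damp the heavy tail of counts
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        slope = 0.0  # feature carries no signal; f becomes a constant
    else:
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def rank_web_results(results: List[Dict],
                     f: Callable[[float], float]) -> List[Dict]:
    """Sort web results by predicted (log) delicious count, highest first."""
    return sorted(results, key=lambda r: f(r["feature"]), reverse=True)
```

Because f maps a web result onto the delicious-count scale, its output can be compared directly with real counts from delicious results, which is what puts the two result sets on the same scale. A real system would presumably use several features and a richer model.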
Questions
Ranking/Blending interfaces.
Learning models. Which features to reveal? Spam concerns.
Would Search APIs benefit from a standardized structured language?
How much research needs APIs versus raw web crawl dumps for specialized one-off analysis?
Should ranking be done API server-side or client-side?