Data Cloud Yury Lifshits Yahoo! Research
My Beliefs
The key challenge in web search is structured search
–Part 1: What is structured search?
The key challenge in structured search is collecting data
–Part 2: Data distribution & idea of Data Cloud
–Part 3: Demo: numeric data distribution
The key challenge in collecting data is incentive design
–Part 4: Economics of data distribution
Structured Search
Data
Structured data
Entity unit:
Identifier
Metadata:
–Explicit key-value pairs
–Relational properties
–Evaluation
Semi-structured data
Content unit:
Body: text, video, audio, or image
Metadata:
–Explicit key-value pairs
–Relational properties
–Evaluation
Data = data of entities + data of content
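One way to picture the two unit types on this slide is as a pair of record types. The sketch below is only an illustration: the class and field names (`Metadata`, `EntityUnit`, `ContentUnit`, `media_type`, and so on) and the sample values are assumptions, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Metadata:
    """Metadata shared by both unit types (all field names are hypothetical)."""
    key_values: Dict[str, str] = field(default_factory=dict)        # explicit key-value pairs
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation name, target identifier)
    evaluations: List[float] = field(default_factory=list)          # e.g. ratings or votes


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body (text, video, audio, or image) plus metadata."""
    body: bytes
    media_type: str  # "text", "video", "audio", or "image"
    metadata: Metadata = field(default_factory=Metadata)


# Made-up example: a piece of content that refers to an entity.
concert = EntityUnit("event:example-concert", Metadata(key_values={"city": "SF"}))
review = ContentUnit(
    body=b"Great show",
    media_type="text",
    metadata=Metadata(relations=[("about", concert.identifier)]),
)
```

Sharing one `Metadata` type mirrors the slide, where both unit types list the same three kinds of metadata.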
Structured Search
Factoid search
"what's the value of property X of object Y"
Entity hubs
–Domain hubs
Structured object search
"all concerts this weekend in SF under $20 sorted by popularity"
–Time focus
–Ranking focus
–Relations focus
Structured content search
"all videos with Tom Brady"
"all comments and blog posts about Bing"
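The structured object query on this slide ("all concerts this weekend in SF under $20 sorted by popularity") combines a type filter, a time filter, a price filter, and a ranking. A minimal sketch over in-memory records follows; the records, field names, and the `search_events` helper are all made up for illustration.

```python
from datetime import date

# Made-up sample records; every field name and value here is hypothetical.
EVENTS = [
    {"name": "Indie Night", "type": "concert", "city": "SF", "date": date(2024, 6, 15), "price": 15.0, "popularity": 120},
    {"name": "Jazz Brunch", "type": "concert", "city": "SF", "date": date(2024, 6, 16), "price": 10.0, "popularity": 300},
    {"name": "Arena Show", "type": "concert", "city": "SF", "date": date(2024, 6, 15), "price": 80.0, "popularity": 900},
    {"name": "Garage Gig", "type": "concert", "city": "Oakland", "date": date(2024, 6, 15), "price": 5.0, "popularity": 50},
    {"name": "Tech Talk", "type": "lecture", "city": "SF", "date": date(2024, 6, 16), "price": 0.0, "popularity": 400},
    {"name": "Late Set", "type": "concert", "city": "SF", "date": date(2024, 6, 22), "price": 12.0, "popularity": 200},
]


def search_events(events, *, kind, city, start, end, max_price):
    """Filter by type, city, date range, and price; rank by popularity (descending)."""
    hits = [
        e for e in events
        if e["type"] == kind
        and e["city"] == city
        and start <= e["date"] <= end
        and e["price"] < max_price
    ]
    return sorted(hits, key=lambda e: e["popularity"], reverse=True)


weekend = search_events(
    EVENTS,
    kind="concert",
    city="SF",
    start=date(2024, 6, 15),
    end=date(2024, 6, 16),
    max_price=20.0,
)
names = [e["name"] for e in weekend]
print(names)  # ['Jazz Brunch', 'Indie Night']
```

The query is only answerable because each record carries explicit typed fields, which is the point of the time, ranking, and relations focus listed on the slide.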
Yury's Wishlist
Business-generated data
–Products, services, news, wishlists, contact data
Reality stream, sensors
–What has happened, and where
Expert knowledge
–Glossary, issues, typical solutions, object databases, related objects graph
Events
–Sport, concerts, education, corporate, community, private
Market graph & signals
–Like, interested, use, following, want to buy; votes and ratings
Search as a Platform (diagram labels: App 1, App 2, App 3, App 4, Classic search, Structured Data, Web index, Post analysis, Query analysis)
Data Cloud How to collect all structured data in one place?
Data Producers
People: forums, wikis, mailing lists, blogs, social networks
Enterprises: product profiles, corporate news, professional content
Sensors: GPS modules, web cameras, traffic sensors, RFID
Transactional data
Data Distributors
A data distributor is any technical solution that accumulates, organizes, and provides access to structured and semi-structured data
Data publisher: the original distributor of some data
Data retailer: a consumer-facing distributor of some data
Data Consumers
Humans
–Aggregators: news, friend feeds, RSS readers
–Search
–Browsing / random walks
Intelligence projects
–Recommendation systems
–Trend mining
Data Cloud
A Data Cloud is a centralized, fully functional data distribution service
Success metric for a data cloud strategy = the total "value" of data on the cloud
To-Cloud Solutions
Extraction
–DBpedia.org, "web tables"
Semantic markup, data APIs
–Yahoo! SearchMonkey
Feeds
–Yahoo! Shopping
–Disqus.com, js-kit.com, Facebook Connect
Direct publishing
On-Cloud Solutions Ontology maintenance – Freebase Normalization, de-duplication, antispam Named entity recognition, metadata inference, ranking Data recycling (cross-references) – Amazon Public Data Sets – Viral license Hosted search –Yahoo! BOSS
From-Cloud Solutions Search, audience –Y! SearchMonkey, Google Base Data API, dump access, update stream Custom notifications –Gnip.com Data cloud as a primary backend Access control –Ad distribution. (AT&T and Yahoo! Local deal)
Demo: webNumbr.com Joint work with Paul Tarjan
webNumbr.com: Import
Crawl numbers from the web
URL + XPath + regex
Create "numbr pages"
Update their values every hour
Keep the history
Anyone can create a numbr
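The "URL + XPath + regex" recipe can be sketched with the Python standard library. This is not webNumbr's actual implementation: the page content, the `extract_number` helper, and the history format are assumptions. The page is an inline string so the example runs without network access. `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so a real crawler would likely need an HTML-tolerant parser.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Stand-in for a fetched page; a real crawler would download the URL instead.
PAGE = """<html><body>
  <div id="stats"><span class="count">Downloads: 1,234,567 total</span></div>
</body></html>"""


def extract_number(markup, xpath, pattern):
    """Select one element with an XPath expression, then pull a number out of its text with a regex."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None or node.text is None:
        return None
    match = re.search(pattern, node.text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


value = extract_number(PAGE, ".//div[@id='stats']/span", r"([\d,]+(?:\.\d+)?)")

# "Keep the history": append a timestamped observation on each (e.g. hourly) run.
history = []
if value is not None:
    history.append((datetime.now(timezone.utc), value))

print(value)  # 1234567.0
```

Splitting the selection into an XPath step and a regex step keeps each part simple: the XPath finds the element, and the regex isolates the number inside its text.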
webNumbr.com: Export Embed code Graphs Search & browse RSS
Economics of Data Distribution Joint work with Ravi Kumar and Andrew Tomkins
Network Effect in Two-Sided Markets
Two-sided market = every product serves consumers of two types, A and B
Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa
Examples: operating systems, credit cards, e-commerce marketplaces
"Two-sided network effects: A theory of information product design", G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne
Basic model
Distributors D1, …, Dk
Each producer or consumer joins only one distributor
Initial shares (p1,c1), …, (pk,ck)
A new consumer selects distributor i with probability proportional to pi
A new producer selects distributor i with probability proportional to ci
Basic model (figure: diagram with nodes a1, a2, a3, a4)
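The basic model can be simulated directly. The sketch below assumes that shares are tracked as integer counts and that one consumer and one producer arrive per step, both drawn from the state at the start of the step. Those details, the starting numbers, and the function names are illustrative choices, not taken from the slides.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Grow a two-sided market one consumer and one producer at a time.

    producers[i], consumers[i]: current counts at distributor i.
    A new consumer picks distributor i with probability proportional to producers[i];
    a new producer picks distributor i with probability proportional to consumers[i].
    """
    rng = random.Random(seed)
    p, c = list(producers), list(consumers)
    k = len(p)
    for _ in range(steps):
        # Both draws use the counts as they were at the start of the step.
        i = rng.choices(range(k), weights=p)[0]  # new consumer follows producers
        j = rng.choices(range(k), weights=c)[0]  # new producer follows consumers
        c[i] += 1
        p[j] += 1
    return p, c


def shares(counts):
    """Convert counts to market shares that sum to 1."""
    total = sum(counts)
    return [x / total for x in counts]


p, c = simulate([5, 3, 2], [4, 4, 2], steps=10_000)
print("producer shares:", shares(p))
print("consumer shares:", shares(c))
```

Rerunning with different seeds gives a feel for how much the final shares depend on early random arrivals.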
Market Shares Dynamics
Theorem 1: Market shares will stabilize
Theorem 2: With a super-linear preference rule, one of the distributors will tip
Theorem 3: With a sub-linear preference rule, market shares will flatten
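The direction of Theorems 2 and 3 can be illustrated with a deterministic, expected-value version of the dynamics. This is a simplification, not the proof and not the stochastic process the theorems are about. It assumes a symmetric market (producer and consumer counts equal at each distributor) and a preference weight of the form count ** alpha, which is one common reading of "super-linear" (alpha > 1) and "sub-linear" (alpha < 1).

```python
def mean_field_shares(initial, alpha, steps):
    """Expected-value dynamics for a symmetric market.

    initial: starting counts per distributor (same on both sides, for simplicity).
    alpha:   exponent of the assumed preference rule, weight_i = count_i ** alpha.
    """
    counts = [float(x) for x in initial]
    for _ in range(steps):
        weights = [x ** alpha for x in counts]
        total = sum(weights)
        # Add one unit of expected mass, split according to the preference rule.
        counts = [x + w / total for x, w in zip(counts, weights)]
    grand_total = sum(counts)
    return [x / grand_total for x in counts]


linear = mean_field_shares([6, 4], alpha=1.0, steps=10_000)      # shares stay at 0.6 / 0.4
tipping = mean_field_shares([6, 4], alpha=2.0, steps=10_000)     # leader's share grows toward 1
flattening = mean_field_shares([6, 4], alpha=0.5, steps=10_000)  # shares drift toward 0.5 / 0.5

print(linear, tipping, flattening)
```

In this approximation the linear rule preserves the initial split exactly, while any exponent above or below 1 pushes the leader's share up or down.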
External Factor
Preference rule with external factor: ei + ci/(c1 + … + ck)
Theorem 4: Market shares will stabilize at e1 : e2 : … : ek
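Theorem 4 can be checked numerically in the same expected-value approximation. The sketch assumes the external-factor rule is applied on both sides of the market (the slide writes it only in terms of c) and normalizes the rule into probabilities. The values of `e` and the starting counts are made up.

```python
def preference(e, counts):
    """Rule from the slide, e_i + c_i / (c_1 + ... + c_k), normalized to probabilities."""
    total = sum(counts)
    raw = [ei + ci / total for ei, ci in zip(e, counts)]
    norm = sum(raw)
    return [r / norm for r in raw]


def mean_field(e, producers, consumers, steps):
    """Expected-value dynamics: each step adds one unit of consumers and one of producers."""
    p = [float(x) for x in producers]
    c = [float(x) for x in consumers]
    for _ in range(steps):
        new_c = preference(e, p)  # consumers look at producer shares
        new_p = preference(e, c)  # producers look at consumer shares
        c = [x + d for x, d in zip(c, new_c)]
        p = [x + d for x, d in zip(p, new_p)]
    return [x / sum(p) for x in p], [x / sum(c) for x in c]


e = [3.0, 1.0]  # hypothetical external factors, ratio 3 : 1
p_share, c_share = mean_field(e, [1, 9], [1, 9], steps=10_000)
print(p_share, c_share)
```

Even though distributor 1 starts with only 10% of each side, its share moves toward 3/4, the fraction implied by the ratio e1 : e2. A fixed point s of the rule satisfies s_i = (e_i + s_i)/(E + 1) with E = e_1 + … + e_k, which gives s_i = e_i/E.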
Coalition Data Cloud
Coalitions
Theorem 5: If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors
Corollary: Coalitions are not monotone
Example: 5 : 4 : 1 : 1
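The premise of Theorem 5 is easy to test for a given market. The helper below checks only that stated premise (every share below 1/sqrt(k)). It says nothing about sub-coalitions or about why coalitions are not monotone. Reading the ratio 5 : 4 : 1 : 1 as relative distributor sizes is an assumption, and the function name is hypothetical.

```python
import math


def coalition_condition_holds(sizes):
    """True if every distributor's market share is below 1/sqrt(k), where k = number of distributors."""
    k = len(sizes)
    total = sum(sizes)
    threshold = 1 / math.sqrt(k)
    return all(s / total < threshold for s in sizes)


print(coalition_condition_holds([5, 4, 1, 1]))  # True: largest share 5/11 is below 1/sqrt(4) = 0.5
print(coalition_condition_holds([9, 1]))        # False: 0.9 is above 1/sqrt(2)
```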
Model Variations
Same-side network effect
Different p-to-c and c-to-p rules
Multi-homing (overlapping audiences)
n^2 vs. n log n revenue models
Mature market: newcomer rate = departure rate
Diverse market (many types of producers and consumers)
Arriving and departing distributors
Directed coalitions
Challenges
Marketing Data demand? Data offerings? Requirements for distribution technology?
Incentive design Incentives for data sharing? Centralized or distributed? –For profit or non-profit? Data licensing and ownership? Monetizing data cloud?
More Challenges Prototyping: Data marketplace: open data & data demand Search plugins: related objects, glossaries, object timelines Publishing tools for structured data Data client: structured news, bookmarking, notifications Tech design: Access management Namespace design User interface: Structured search UI Discovery UI
Thanks! Follow my research: