1 Yahoo! Research Overview Marcus Fontoura Prabhakar Raghavan, Head.

Presentation transcript:

1 Yahoo! Research Overview Marcus Fontoura Prabhakar Raghavan, Head

2 Mission & Vision Vision: Where the Internet’s future is invented –with innovative economic models for advertisers, publishers and consumers. Mission: Invent the Next generation Internet by defining the future media to Engage consumers and eXtend the economics for advertisers and publishers through new sciences that establish the Technical leadership of Yahoo!

3 How we get there Scientific excellence –World-recognized leadership through publications, keynotes, … Business impact –Tactical results from strategic behavior

4 Business needs vs. Disciplines Text Retrieval Machine Learning Human Computer Interaction Dist Computing  Economics Advertising Search + info Social media User experience

6 Where LA Silicon valley Berkeley New York Barcelona, Spain Santiago, Chile

7 At Y!R, prediction market theory/science since 2002. Yahoo! and O’Reilly launched the Buzz Game: Buy “stock” in hundreds of technologies. Earn dividends based on actual search “buzz”. The exchange mechanism is a new invention.

8 Technology forecasts: iPod phone [chart of search buzz and price over time]. 8/28: buzz gamers begin bidding up iPod phone. 8/29: Apple invites press to “secret” unveiling. 9/7: Apple announces Rokr. 9/8-9/18: searches for iPod phone soar; early buyers profit. What’s next? Another Apple unveiling: iPod Video? 10/5: maybe. 10/6: maybe not.

9 Efficient Indexing of Shared Content in IR Systems Andrei Broder, Nadav Eiron, Marcus Fontoura, Michael Herscovici, Ronny Lempel, John McPherson, Eugene Shekita, Runping Qi

10 Motivation IR systems typically use inverted indices to facilitate efficient retrieval. Web, email, news, and other data contain significant amounts of duplicated or shared content. Indexing duplicate content is expensive.

11 Scope of Work We assume duplicate or common content is already identified in the corpus We concern ourselves only with the efficient indexing of such content

12 Types of Shared Content Web duplicates: –Very common – on the order of 40% of all pages Email/news threads: –Whole messages are often quoted –Attachments are duplicated –Identical messages in multiple mailboxes

13 Some Statistics IBM Intranet has about 40% duplicate content. Internet crawls reveal similar statistics. In the Enron dataset, 61% of messages are in threads. 31% quote other messages verbatim.

14 Naïve Solution 1 : Index Everything Pros: –Simple to implement –Semantics are preserved Cons: –Index size blows up –Performance penalty (big index + post filtering)

15 Naïve Solution 2: Index Just One Copy Pros: –Best performance –Not too difficult to implement Cons: –Only applies to the duplicates scenario –Semantics are changed, and relevant results may not be returned for a query

16 The Web Duplicate Case: Meta Data vs. Content Removal of web duplicates changes the semantics of the query. Example: the same text at almaden.ibm.com /... and at watson.ibm.com /... Query: text url:watson

17 Our Solution Content is split into shared and private parts. Shared content is indexed only once. Private content (such as metadata in the Web duplicates case) is indexed for each document. The index provides virtual cursors that simulate having all content indexed.

18 Advantages Index size, build time, and query efficiency Precise semantics No need for post-filtering

19 Inverted Indices The index is sorted by term. For each term, a sorted list of the documents in which it appears is maintained (postings list). Each occurrence (posting) contains additional payload. [Figure: postings lists for terms T1 and T2; the posting entries are not preserved in this transcript]
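
A minimal sketch of the structure described on this slide, in Python. The function name `build_inverted_index` and the choice of term positions as the payload are illustrative assumptions; the slides do not say what the payload holds.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps docid -> text. Returns {term: [(docid, payload), ...]}.

    Terms are kept in sorted order and each postings list is sorted by docid.
    The payload here is a list of term positions (an assumption for this sketch).
    """
    index = defaultdict(list)
    for docid in sorted(docs):                      # ascending docid keeps postings sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[docid].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((docid, pos_list))   # one posting per (term, document)
    return dict(sorted(index.items()))              # index sorted by term

index = build_inverted_index({1: "did you read it", 2: "did you", 3: "not yet"})
```

Here `index["did"]` holds one posting for document 1 and one for document 2, each carrying the position of the term in that document.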

20 Document Sharing Model Each document is partitioned into private and shared content. The two types are differentiated by posting payload. Documents exist in a tree – shared content is shared with all descendants. Document IDs (and hence index order) are dictated by a DFS traversal of document trees.
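
The DFS numbering can be sketched as follows. This assumes a pre-order traversal, so that a parent always receives a smaller ID than its descendants; the slides say only "DFS". The names `assign_docids`, `children`, and `roots` are hypothetical.

```python
def assign_docids(children, roots):
    """Assign consecutive docids by pre-order DFS over a forest.

    children maps a document name to the list of its child documents;
    roots lists the tree roots in the order they should be numbered.
    """
    docid = {}
    next_id = 1
    stack = list(reversed(roots))          # reversed so the first root is popped first
    while stack:
        node = stack.pop()
        docid[node] = next_id              # a parent is numbered before its subtree
        next_id += 1
        stack.extend(reversed(children.get(node, [])))
    return docid

ids = assign_docids({"a": ["b", "c"], "b": ["d"]}, roots=["a", "e"])
```

With this numbering, every subtree occupies a contiguous range of docids, which is what makes range-style skipping over a tree possible.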

21 The Document Tree Content is shared from ancestor to descendants:

22 Example: Documents: docid = 1: From: andrei To: ronny, marcus “did you read it?”; docid = 2: From: ronny To: marcus “did you, marcus?”; docid = 3: From: marcus To: ronny “not yet!”. Inverted index posting lists are kept for the terms andrei, did, it, marcus, not, read, ronny, yet, you. [The posting entries are not preserved in this transcript]

23 Querying Inverted Indexes Queries contain mandatory terms, forbidden terms, and optional terms (such as +term1 -term2). Typically a zigzag algorithm is used. It uses cursors on postings lists. Cursors support two operations: –next() – Moves to the next posting –fwdBeyond(d) – Moves to the first posting for a document with id >= d
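
The two cursor operations and a zigzag join over them might look like the sketch below. It is one plausible reading of "zigzag", not the authors' implementation: it handles mandatory and forbidden terms only, and the names `Cursor`, `fwd_beyond`, `zigzag`, and `END` are assumptions of this example.

```python
import bisect

END = float("inf")  # sentinel docid meaning "past the end of the postings list"

class Cursor:
    """Physical cursor over a sorted list of docids."""
    def __init__(self, docids):
        self.docids = docids
        self.i = 0

    def doc(self):
        return self.docids[self.i] if self.i < len(self.docids) else END

    def next(self):
        """Move to the next posting."""
        if self.i < len(self.docids):
            self.i += 1
        return self.doc()

    def fwd_beyond(self, d):
        """Move to the first posting for a document with id >= d."""
        self.i = bisect.bisect_left(self.docids, d, lo=self.i)
        return self.doc()

def zigzag(mandatory, forbidden=()):
    """Return docids present in every mandatory list and in no forbidden list."""
    results = []
    target = max(c.doc() for c in mandatory)
    while target != END:
        # Pull every cursor up to the target; an overshoot raises the target.
        reached = max(c.fwd_beyond(target) for c in mandatory)
        if reached != target:
            target = reached
            continue
        if all(f.fwd_beyond(target) != target for f in forbidden):
            results.append(target)
        mandatory[0].next()
        target = max(c.doc() for c in mandatory)
    return results

hits = zigzag([Cursor([1, 3, 5, 7, 9]), Cursor([3, 4, 7, 10])],
              forbidden=[Cursor([7])])
```

In this run, documents 3 and 7 contain both mandatory terms, and 7 is dropped because it also appears in the forbidden list.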

24 Top Level Query Algorithm: while (more results required) { invoke zigzag algorithm; forward optional term cursors; score document; advance required/forbidden cursors } In our solution, this algorithm uses virtual cursors.

25 Additional Information In The Index Tree information is encoded by two attributes for each document: –root(d) – The docid for the document at the root of the tree containing d –lastDescendent(d) – The highest-numbered document that is a descendant of d
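
A sketch of how the two attributes could be derived from a parent map. It assumes docids follow a pre-order DFS numbering (parents before descendants) and, matching the values shown on the Web Duplicates Application slide, that a leaf is its own last descendant. The function name `tree_attributes` is hypothetical.

```python
def tree_attributes(parent):
    """parent maps docid -> parent docid (None for a root).

    Returns (root, last_descendent) dictionaries keyed by docid.
    """
    root, last = {}, {}
    for d in sorted(parent):                      # ascending: parents come first
        p = parent[d]
        root[d] = d if p is None else root[p]
        last[d] = d                               # a leaf is its own last descendant
    for d in sorted(parent, reverse=True):        # descending: children come first
        p = parent[d]
        if p is not None:
            last[p] = max(last[p], last[d])       # propagate the subtree maximum upward
    return root, last

root, last = tree_attributes({1: None, 2: 1, 3: 1, 4: 1, 5: None, 6: 5})
```

For the two flat trees used here, document 1 gets root 1 and last descendant 4, and document 6 gets root 5 and last descendant 6.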

26 fwdShared(d) example [figure]: a postings list T containing private (p) and shared (s) postings, annotated with the calls fwdShared(10), fwdBeyond(root(10)), next(), and fwdBeyond(lastDescendent(6)+1).

27 Virtual Cursors Two types of cursors: –Regular (positive) virtual cursors. These behave as if all shared content were indexed for all documents that contain it –Negated virtual cursors, which represent the complement of the postings list (used for forbidden terms) Implemented on top of a physical cursor with the additional fwdShared method
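
The semantics that the virtual cursors simulate can be written down directly. The sketch below materializes the logical postings list, which is exactly what the real cursors avoid doing, so it illustrates the intended result rather than the efficient mechanism. It assumes pre-order DFS docids, so the descendants of d are the IDs d+1 through lastDescendent(d); the function names and the 'p'/'s' payload encoding are assumptions.

```python
def logical_docids(postings, last_descendent):
    """Expand a physical postings list into the docids a positive virtual cursor visits.

    postings: sorted list of (docid, kind) with kind 'p' (private) or 's' (shared).
    A shared posting at d counts for d and for every descendant of d.
    """
    docs = set()
    for d, kind in postings:
        if kind == "s":
            docs.update(range(d, last_descendent[d] + 1))
        else:
            docs.add(d)
    return sorted(docs)

def negated_docids(postings, last_descendent, all_docids):
    """Docids a negated virtual cursor visits: the complement of the logical list."""
    present = set(logical_docids(postings, last_descendent))
    return [d for d in all_docids if d not in present]

last = {1: 4, 2: 2, 3: 3, 4: 4, 5: 6, 6: 6}
postings = [(1, "s"), (6, "p")]
positive = logical_docids(postings, last)
negative = negated_docids(postings, last, range(1, 7))
```

A shared posting on document 1 makes the term visible in documents 1 through 4, while the private posting on document 6 applies to that document alone; only document 5 remains for the negated cursor.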

28 Virtual Positive Cursors Maintain both a physical and a logical position. Support next() and fwdBeyond(d). [Figure: postings marked p p p s s, illustrating next() and fwdBeyond(10)]

29 Virtual Negative Cursors Support next() and fwdBeyond(d). The physical cursor is ahead of the logical cursor. [Figure: postings marked p and s, illustrating next() and fwdBeyond(7)]

30 Web Duplicates Application Trees are flat, with the masters at the root. Leaves only have private content. First tree (content labels S1, P1, P2, P3, P4): docid = 1, root = 1, lastDescendant = 4; docid = 2, root = 1, lastDescendant = 2; docid = 3, root = 1, lastDescendant = 3; docid = 4, root = 1, lastDescendant = 4. Second tree (content labels S5, P5, P6): docid = 6, root = 5, lastDescendant = 6.

31 Build Performance Evaluation Subsets of IBM Intranet (36-44% dups). Table columns: # docs, IS1 (GB), IS2 (GB), Space saved, IT1 (s), IT2 (s), Speedup. Rows: 500K, 1000K, 1500K, 2000K, 2500K documents. [The numeric values are not preserved in this transcript]

32 Runtime Performance: Single Terms Queries

33 Runtime Performance: Two Term Queries