Beyond Pages: Supporting Efficient, Scalable Entity Search with Dual-Inversion Index
Tao Cheng and Kevin Chang

Presentation transcript:

1 Beyond Pages: Supporting Efficient, Scalable Entity Search with Dual-Inversion Index Tao Cheng and Kevin Chang Computer Science Department University of Illinois at Urbana-Champaign

2 Users in Frustration: Customer service phone number of Amazon? Search on Amazon? Search on a search engine?

3 Even More Frustration: Professors in the area of data mining? cs.uiuc.edu, cs.uiuc.edu/research, cs.uiuc.edu/research/data, …; cs.stanford.edu, cs.stanford.edu/research, cs.stanford.edu/research/faculty, …

4 Many, many such cases:
• The email of Kevin Chang?
• The papers and presentations of ICDE 2010?
• Conferences on databases and their due dates in 2010?
• Sale price of “Canon PowerShot A400”?
Oftentimes, we are looking for data entities (e.g., emails, dates, prices), not pages. Indeed, according to a recent survey, 52.9% of queries directly target structured entities [DE Bulletin’09].
[DE Bulletin’09]: R. Kumar and A. Tomkins, “A Characterization of Online Search Behavior”

5 Recent Trends: WQA
Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002)
Who is CEO of Dell? → Keywords: “CEO Dell” → Parse top-k results → Michael Dell

6 Recent Trends: WIE
Web Information Extraction (WIE) (Marius 2006, Cafarella 2005, Etzioni 2004)
Specialized information extractors. Pattern: “X is CEO of Y”
Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
…          …

7 Recent Trends: TAS
Typed-Annotated Search (TAS) (Cheng 2007, Cafarella 2007, Chakrabarti 2006)
Inventor of television? → Finding person names near keywords “invent” and “television” → Ranked entity list

8 From Pages to Data Entities
Traditional Search | Entity Search
Keywords | Keywords & Entity Type
Results | Support

9 Concretely, what do we mean by Entity Search? Online demo: 3TB corpus of 150M pages, 16-machine cluster, 24 entity types.

10 Entity Search Problem Abstraction
• Given: an entity collection over a document collection D
• Input: a query q built from a tuple pattern over entity types and keywords, e.g. ow(David DeWitt #phone #email)
• Output: a ranked list of entity tuples t, sorted by Score(q(t)), the query score of t
In short:
• Input: keywords & entity type (optionally with a pattern), e.g. Amazon Customer Service #phone
• Output: ranked entity instances, ordered by Score(e), where e is an entity instance

11 Unanimous Requirements across the Trends
Context Matching (in document)
• Match the target type (say, #location) by keywords (e.g., “louvre museum”) that appear in its surrounding context, in certain desired patterns
Global Aggregation (across documents)
• Match an entity (say, #location = Paris) as many times as it appears across numerous pages
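The two requirements above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy corpus, the function names `context_matches` and `global_aggregate`, the fixed token window, and the use of a plain occurrence count in place of the real scoring function, which the slides do not specify.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of (token, entity_type)
# pairs; entity_type is None for plain keywords.
docs = [
    [("louvre", None), ("museum", None), ("in", None), ("Paris", "#location")],
    [("the", None), ("louvre", None), ("museum", None), ("Paris", "#location")],
    [("Lyon", "#location"), ("has", None), ("a", None), ("museum", None)],
]

def context_matches(doc, keywords, entity_type, window=5):
    """Context matching: yield instances of entity_type in one document
    that have every keyword within `window` tokens."""
    positions = {k: [i for i, (tok, _) in enumerate(doc) if tok == k]
                 for k in keywords}
    for i, (tok, etype) in enumerate(doc):
        if etype != entity_type:
            continue
        if all(any(abs(i - p) <= window for p in positions[k]) for k in keywords):
            yield tok

def global_aggregate(docs, keywords, entity_type, window=5):
    """Global aggregation: count matches per entity across all documents.
    A raw count stands in for the real score (assumption)."""
    counts = Counter()
    for doc in docs:
        counts.update(context_matches(doc, keywords, entity_type, window))
    return counts.most_common()

ranking = global_aggregate(docs, ["louvre", "museum"], "#location")
```

In this toy corpus only Paris has both keywords in its context, and it is matched once in each of the first two documents.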

12 Computation Challenges
Expensive Context Matching (Join)
• Need to perform proximity matching in documents, beyond simple containment checking
Extensive Global Aggregation (G)
• Need to perform corpus-scale aggregation, a layer that is non-existent in online page retrieval

13 Traditional Page-Retrieval-Based Approach
Who is the CEO of Dell? → Keywords: “CEO Dell” → Analyze top-k results → Michael Dell
Limitations: only top-k documents; many random seeks

14 Our Proposal: Entity-aware Indexing
Inspired by the success of the inverted index in enabling efficient IR for searching documents. However, the traditional inverted index is only aware of keywords and documents.
• How can we make the index entity-aware?
Our proposal: Dual-Inversion Index
• Principle I: Document-inverted Index
• Principle II: Entity-inverted Index

15 Entity-as-keyword: Document-inverted Index
[Figure: posting lists. Fields shown: keyword, pos, doc id]
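A minimal sketch of the entity-as-keyword idea, assuming a hypothetical posting layout: plain keywords map to (doc id, pos) postings, and an entity type such as #phone is indexed like a keyword, with the extracted instance carried in the posting. The function name `build_d_inverted`, the tuple layout, and the sample data are illustrative, not taken from the slides.

```python
from collections import defaultdict

def build_d_inverted(docs):
    """Map each keyword or entity type to postings ordered by (doc id, pos)."""
    index = defaultdict(list)
    for doc_id, tokens in sorted(docs.items()):
        for pos, (text, entity_type) in enumerate(tokens):
            if entity_type is None:
                # Ordinary keyword posting.
                index[text.lower()].append((doc_id, pos))
            else:
                # Entity-as-keyword: file the occurrence under its type name
                # and keep the instance (assumed layout).
                index[entity_type].append((doc_id, pos, text))
    return dict(index)

# Hypothetical documents as lists of (token, entity_type) pairs.
docs = {
    1: [("Amazon", None), ("customer", None), ("service", None), ("555-0100", "#phone")],
    2: [("call", None), ("Amazon", None), ("at", None), ("555-0100", "#phone")],
}
index = build_d_inverted(docs)
```

Because the postings stay ordered by document, a query such as Amazon #phone can be answered by merging the two lists document by document and checking positions.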

16 Document Space Partitioning
[Figure: documents partitioned across Node 1 … Node 10]

17 Distributed Query Processing over D-inverted Index
[Figure: Local (Node 1 … Node 10): Join; each node sends results and scores. Global: Aggregation, Ranking]
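A hedged sketch of this division of labor, assuming documents are partitioned across nodes: each node can join only within its own documents, so the per-entity aggregation has to happen globally. The index layout, the names `local_join` and `global_aggregate`, the token window, and count-based scoring are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def local_join(index, keywords, entity_type, window=5):
    """On one node: return entity instances that have every keyword within
    `window` positions in the same document."""
    keyword_positions = []
    for k in keywords:
        by_doc = defaultdict(list)
        for doc_id, pos in index.get(k, []):
            by_doc[doc_id].append(pos)
        keyword_positions.append(by_doc)
    matches = []
    for doc_id, pos, instance in index.get(entity_type, []):
        if all(any(abs(pos - p) <= window for p in by_doc.get(doc_id, []))
               for by_doc in keyword_positions):
            matches.append(instance)
    return matches

def global_aggregate(per_node_matches, k=10):
    """At the coordinator: the same entity can be matched on many nodes,
    so counts are summed here before ranking."""
    counts = Counter()
    for matches in per_node_matches:
        counts.update(matches)
    return counts.most_common(k)

# Hypothetical per-node D-inverted fragments (documents 1 vs. 7 and 8).
node1 = {"amazon": [(1, 0)], "service": [(1, 2)], "#phone": [(1, 3, "555-0100")]}
node2 = {"amazon": [(7, 1), (8, 0)], "service": [(7, 2)],
         "#phone": [(7, 4, "555-0100"), (8, 1, "555-0199")]}

results = global_aggregate(
    local_join(node, ["amazon", "service"], "#phone") for node in (node1, node2)
)
```

Note that every local match travels to the coordinator in this scheme, which is the transfer and global-processing cost that the later slides measure.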

18 Entity-as-document: Entity-inverted Index
[Figure: posting lists. Fields shown: keyword, pos, entity id, entity pos]
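One way to picture the entity-as-document idea is the sketch below. It assumes a hypothetical layout in which each (keyword, entity type) pair maps to postings of the form (entity id, doc id, keyword pos, entity pos), ordered by entity id, and it only records keyword occurrences within a fixed window of an entity. The name `build_e_inverted`, the window, and the tuple layout are assumptions, not the authors' exact design.

```python
from collections import defaultdict

def build_e_inverted(docs, window=5):
    """Map (keyword, entity_type) to postings sorted by entity id."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        entities = [(i, text, etype) for i, (text, etype) in enumerate(tokens)
                    if etype is not None]
        for kpos, (text, etype) in enumerate(tokens):
            if etype is not None:
                continue  # only plain keywords are indexed against entities
            for epos, instance, entity_type in entities:
                if abs(kpos - epos) <= window:
                    index[(text.lower(), entity_type)].append(
                        (instance, doc_id, kpos, epos))
    # Ordering by entity id lets all evidence for one entity sit together.
    return {key: sorted(postings) for key, postings in index.items()}

# Hypothetical documents as lists of (token, entity_type) pairs.
docs = {
    1: [("Amazon", None), ("service", None), ("555-0100", "#phone")],
    2: [("Amazon", None), ("555-0199", "#phone")],
}
index = build_e_inverted(docs)
```

The trade-off visible even in this toy version is space: a keyword occurrence is stored once per nearby entity, which is why the later configuration slide restricts where this index is built.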

19 Entity Space Partitioning
[Figure: entities partitioned across Node 1 … Node 9]

20 Distributed Query Processing over E-inverted Index
[Figure: Local (Node 1 … Node 9): Join, Aggregation; each node sends results and scores. Global: Ranking]
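A minimal sketch of why entity-space partitioning moves aggregation to the nodes: if each node owns a disjoint set of entities, it sees all evidence for those entities and can finish their counts locally, leaving the coordinator to merge already-final scores. The posting layout (entity id, doc id, keyword pos, entity pos), the names `local_topk` and `global_rank`, and count-based scoring are assumptions for illustration.

```python
import heapq
from collections import Counter

def local_topk(index, keywords, entity_type, k=10):
    """On one node: join keyword postings on the entity occurrence, then
    aggregate per entity. Both steps are local under entity partitioning."""
    occurrence_sets = []
    for kw in keywords:
        postings = index.get((kw, entity_type), [])
        occurrence_sets.append({(ent, doc, epos) for ent, doc, _kpos, epos in postings})
    joined = set.intersection(*occurrence_sets) if occurrence_sets else set()
    counts = Counter(ent for ent, _doc, _epos in joined)
    return counts.most_common(k)

def global_rank(per_node_topk, k=10):
    """At the coordinator: scores are final, so only a top-k merge is needed."""
    merged = [pair for topk in per_node_topk for pair in topk]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])

# Hypothetical per-node E-inverted fragments; each entity lives on one node.
node1 = {
    ("amazon", "#phone"): [("555-0100", 1, 0, 3), ("555-0100", 7, 1, 4)],
    ("service", "#phone"): [("555-0100", 1, 2, 3), ("555-0100", 7, 2, 4)],
}
node2 = {
    ("amazon", "#phone"): [("555-0199", 8, 0, 1)],
    ("service", "#phone"): [("555-0199", 8, 3, 1)],
}

results = global_rank(
    local_topk(node, ["amazon", "service"], "#phone") for node in (node1, node2)
)
```

Each node ships at most k (entity, score) pairs rather than every match, which is the intuition behind comparing transfer and global processing time between the two schemes.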

21 Experiment Setup
Corpus: general crawl of the Web (Aug. 2007), around 3TB with 150M pages.
Entities: 24 diverse entity types
Concrete applications (benchmark queries):
• Yellowpage: #email, #phone, #state, #location, #zipcode
• CSAcademia: #university, #professor, #research, #email, #phone

22 Metrics Used for Evaluation to Measure Throughput & Response Time
Local Processing Time
• Overall local processing time
• Max local processing time
Transfer Time
• Overall transfer time
• Max transfer time
Global Processing Time
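As a small illustration of the overall versus max distinction, the sketch below sums per-node times (total work, which bears on throughput) and takes the slowest node (which bounds latency). Reading the metrics this way, and estimating response time as max local + max transfer + global time, are assumptions made here; the slides only name the metrics. The timings are made-up values.

```python
def summarize(per_node_seconds):
    """Return the overall (sum) and max time across nodes."""
    return {"overall": sum(per_node_seconds), "max": max(per_node_seconds)}

# Hypothetical per-node measurements in seconds for one query.
local = summarize([0.5, 0.25, 1.0])
transfer = summarize([0.125, 0.25, 0.125])
global_seconds = 0.5

# Assumed model: nodes work in parallel, so latency follows the slowest node.
response_estimate = local["max"] + transfer["max"] + global_seconds
# Assumed model: total work consumed by the query across the cluster.
total_work = local["overall"] + transfer["overall"] + global_seconds
```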

23 Local Processing Time Comparison

24 Network Transfer Comparison

25 Global Processing Time Comparison

26 Overall Time/Space Summary: Generally, ~2 to 4 orders of magnitude speedup, with reasonable space overhead

27 Dual-Inversion Index: The two types of indexes can co-exist and complement each other

28 Indexing Configuration
Entity Type Level Configuration:
• Create the E-Inverted Index only for popular, space-efficient entities
• Use the D-Inverted Index for less popular, space-expensive entities
Keyword Level Configuration:
• Only create the E-Inverted Index for (keyword, entity type) pairs when they are related, e.g., queried together often according to the query log
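The two configuration levels could be expressed roughly as follows. This is a sketch under stated assumptions: the thresholds `min_queries`, `max_bytes`, and `min_count`, the function names, and the query-log format (each query as a list of (keyword, entity type) pairs) are all hypothetical; the slides give the policy but no concrete criteria.

```python
from collections import Counter

def choose_entity_type_index(query_count, e_index_bytes,
                             min_queries=100, max_bytes=10**9):
    """Entity-type level: E-inverted only for popular, space-efficient types;
    otherwise fall back to the D-inverted index."""
    popular = query_count >= min_queries
    space_efficient = e_index_bytes <= max_bytes
    return "E-inverted" if popular and space_efficient else "D-inverted"

def pairs_for_e_index(query_log, min_count=2):
    """Keyword level: keep only (keyword, entity_type) pairs that are
    queried together at least `min_count` times."""
    counts = Counter(pair for query in query_log for pair in query)
    return {pair for pair, count in counts.items() if count >= min_count}

# Hypothetical query log.
query_log = [
    [("amazon", "#phone"), ("service", "#phone")],
    [("amazon", "#phone")],
    [("dell", "#person")],
]
selected = pairs_for_e_index(query_log)
```

Pairs that are not selected would still be answerable through the D-inverted index, which is how the two indexes complement each other.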

29 Conclusion
• Identify essential computation requirements for entity search
• Dual-inversion indexing and partition schemes for efficient and scalable query processing: Document-inverted index, Entity-inverted index
• Verify over a large-scale corpus with real applications

30 Thanks much for coming! Questions?

31 TopK Convergence

32 References of Related Work
Index Design
• Junghoo Cho and Sridhar Rajagopalan. A fast regular expression indexing engine. In ICDE.
• Hugh E. Williams, Justin Zobel, and Dirk Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4):573–594.
• Xiaohui Long and Torsten Suel. Three-level caching for efficient query processing in large web search engines. In WWW.
• Michael Cafarella and Oren Etzioni. A search engine for large-corpus language applications. In WWW.
Question Answering
• S. Abney, M. Collins, and A. Singhal. Answer extraction. In ANLP.
• E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In EMNLP.
• Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. In WWW.
• Jimmy J. Lin and Boris Katz. Question answering from the web using knowledge annotation and knowledge mining techniques. In CIKM.

33 Search Interface

34 Query I: Amazon Customer Service Phone
[Columns shown: Results, # of Supporting Pages, Representative Supporting Pages]

35 Query II: Professors in Data Mining

36 Query III: University of California Locations