EntityRank: Searching Entities Directly and Holistically Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang Computer Science Department, University of Illinois.

Presentation transcript:

EntityRank: Searching Entities Directly and Holistically
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang
Computer Science Department, University of Illinois at Urbana-Champaign
VLDB '07, September 23-28, 2007, Vienna, Austria
Presented by Sangkeun Lee, IDS Lab., Seoul National University

Copyright © 2007 by CEBT
Motivating Scenario
Customer service phone number of Amazon?

Search on Amazon?

Search on Google?

Many, many similar cases:
- The email of Luis Gravano?
- What profs are doing databases at UIUC?
- The papers and presentations of ICDE 2007?
- Due date of SIGMOD 2008?
- Sale price of "Canon PowerShot A400"?
- "Hamlet" books available at bookstores?
Often, we are looking for data entities (e.g., emails, dates, prices), not pages.

What you search is not what you want

Traditional Search vs. Entity Search
[Figure: comparison of traditional search and entity search. Labels: Keywords, Entities, Results, Support. Caption: Entity Search Problem]

Entity Search Problem

Challenge
How to rank entities? Why a novel problem?

Core Challenges
- Contextual: pattern (phrase, uw, ow) and proximity
- Holistic: aggregated occurrences
- Uncertainty: extraction confidence probability
- Associative: distinguish true associations from accidental ones
- Discriminative: entity instances matched on more popular pages should receive higher scores than entity instances from less popular pages
- A novel problem: solve all of the above together, probabilistically

Impression Model

Recognition Layer: Local Assessment
- Given a document $d$, how do we assess whether a particular tuple $t = (e_1, \dots, e_m)$ matches the query $q = \alpha(E_1, \dots, E_m, k_1, \dots, k_l) = \alpha(\gamma)$?
- Two orthogonal factors:
  - Extraction uncertainty
  - Association context
    - Boolean pattern qualification: doc, phrase, uw, ow
    - Probabilistic proximity quantification
- $s$: the span length, i.e., the shortest window that covers the entire occurrence
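
The local assessment above can be illustrated with a minimal Python sketch. It assumes token offsets are already available for each matched term, treats the Boolean pattern as an unordered-window check, and uses an exponential decay over the span length as the proximity probability. The function names, the decay form, and the default window are hypothetical choices for illustration; the slide does not specify them.

```python
import math
from itertools import product


def span_length(positions):
    """Shortest window (in tokens) covering one occurrence of every term.

    positions: one list of token offsets per term (entity or keyword).
    Returns None if some term does not occur at all.
    """
    widths = (max(combo) - min(combo) + 1 for combo in product(*positions))
    return min(widths, default=None)


def qualifies_uw(positions, window):
    """Boolean pattern qualification: all terms fit in an unordered window."""
    span = span_length(positions)
    return span is not None and span <= window


def proximity_probability(span, decay=0.2):
    """Hypothetical proximity model: probability decays as the span grows."""
    return math.exp(-decay * (span - 1))


def local_score(positions, extraction_conf, window=50):
    """Combine the two orthogonal factors: extraction uncertainty and context."""
    if not qualifies_uw(positions, window):
        return 0.0
    return extraction_conf * proximity_probability(span_length(positions))


# Keywords at offsets 3 and 4, entity instance at offsets 7 and 20.
occurrence = [[3], [4], [7, 20]]
print(span_length(occurrence))
print(local_score(occurrence, extraction_conf=0.3))
```

Multiplying the two factors reflects the slide's statement that they are orthogonal; a real system would likely estimate the proximity distribution from data rather than fix a decay constant.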

Recognition Layer: Local Assessment
[Figure: worked example. Input: two local occurrences, L1 (extraction confidence 1.0) and L2 (extraction confidence 0.3). Output: not recoverable. Requirement checklist: Contextual, Uncertain, Holistic, Discriminative, Associative]

Access Layer: Global Aggregation
[Figure: worked example with input and output. Requirement checklist: Contextual, Uncertain, Holistic, Discriminative, Associative; this layer addresses Holistic and Discriminative]
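
One way to read the holistic and discriminative requirements of this layer is as a popularity-weighted sum of local scores over all documents in which the tuple occurs. The sketch below assumes each document has a normalized popularity weight (for example, a PageRank-style probability) and a single local score per tuple; both the function name and the data layout are hypothetical.

```python
def global_score(local_scores, popularity):
    """Aggregate local scores over documents, weighted by page popularity.

    local_scores: dict mapping doc id -> local score of the tuple in that doc.
    popularity:   dict mapping doc id -> popularity weight (assumed to sum to 1).
    """
    total = 0.0
    for doc_id, score in local_scores.items():
        # Discriminative: the same local evidence counts more on a popular page.
        total += popularity.get(doc_id, 0.0) * score
    return total


popularity = {"a": 0.5, "b": 0.25, "c": 0.25}
print(global_score({"a": 0.8, "b": 0.4}, popularity))
```

Summing over documents captures the holistic requirement, since evidence accumulates across pages, while the weight makes an occurrence on a popular page contribute more than the same occurrence on an obscure one.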

Validation Layer: Hypothesis Testing
- Accidental association: e.g., [...] appears very frequently with the keywords "Luis" and "Gravano". However, such an association is only accidental, as [...]org appears on many [...]org
- Validate that the association is not accidental
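
The slide does not say which statistical test is applied. As a rough sketch, assume the observed association probability of a tuple is compared against the probability expected if entities and keywords co-occurred purely by chance, using a G-test-style log-likelihood ratio term. The function and parameter names are hypothetical.

```python
import math


def validated_score(p_observed, p_random):
    """G-test-style comparison of observed vs. chance association (assumed form).

    p_observed: aggregated association probability seen in the corpus.
    p_random:   probability expected if the association were accidental.
    """
    if p_random <= 0.0:
        raise ValueError("p_random must be positive")
    if p_observed <= 0.0:
        return 0.0
    # Zero when observed equals chance; grows as the observation exceeds chance.
    return p_observed * math.log(p_observed / p_random)


# An entity that appears everywhere has a high chance probability,
# so the same observed probability earns a lower validated score.
print(validated_score(0.02, 0.001) > validated_score(0.02, 0.01))
```

Under this assumption, an entity instance that co-occurs with the keywords only because it appears on very many pages gets a score near zero, which is the behavior the accidental-association example calls for.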

EntityRank: The Scoring Function
[Equation not recoverable from the transcript; its parts are labeled Local Recognition, Global Aggregation, and Validation]

Comparison
[Figure: % satisfied queries at each rank, comparing EntityRank with the baselines below]
- Naïve approach
- Local only
- Global only
- Combine L by simple summation
- L+G without hypothesis testing
Query Type I: Phone for Top-30 Fortune 500 companies
Query Type II: Email for 51 of 88 SIGMOD07 PC
Corpus: general crawl of the Web (Aug. 2006), around 2 TB with 93M pages
Entities: Phone (8.8M distinct instances), Email (4.6M distinct instances)
System: a cluster of 34 machines

Conclusions
- Formulate the entity search problem
- Study and define the characteristics and requirements of entity search
- Propose Impression Model and EntityRank framework for ranking entities
- Implement a prototype with real Web