Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.

Presentation transcript:

Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling Koh 1

2 Outline
• Introduction
• Entity Complement
  • Entity Consistency and Expansion
  • Schema Similarity
• Schema Complement
  • Coverage of entity set
  • Benefits of additional attributes
• Experiment
• Conclusion

3 Introduction - Motivation
• Web tables
  • The Web consists of a huge number of documents.
  • Applying an HTML parser to a page crawl yields billions of instances of the <table> tag.
  • A large number of these HTML tables contain relational-style data and schema-like descriptions.

4 Introduction - Motivation
• Structured data comes in many forms; this paper focuses on HTML tables.
• Searching for related tables (identifying "good" relations)
• What does "related" really mean?
• What is the similarity function? (Note: tables are semi-structured.)

5 Introduction - Motivation
A: Top 100 Tennis Players in 2010
B: Players Ranked in 2010
• We can union Tables A and B.
• Their schemas are identical, and they provide complementary sets of entities.

6 Introduction - Motivation
A: Top 100 Tennis Players in 2010
C: Top 100 in 2010: ATP Ranking
• We can join Table C with Table A.
• The two tables describe the same entities, but Table A adds more information (Tournaments Played).

7 Introduction - Motivation
C: Top 100 in 2010: ATP Ranking
D: Top 100 in 2009: ATP Ranking
• Table D cannot be directly joined or unioned with Table C.
• Both can be seen as the result of a selection and projection on a larger table that contains the rankings of tennis players in different years.
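The union and join cases from slides 5 and 6 can be illustrated with a small sketch. The rows, player names, and column names below are hypothetical placeholders, not data from the slides, and the two helper functions are illustrative only.

```python
# Hypothetical rows; player names and column names are placeholders.
table_a = [
    {"rank": 1, "player": "P1", "tournaments_played": 17},
    {"rank": 2, "player": "P2", "tournaments_played": 19},
]
table_b = [
    {"rank": 101, "player": "P101", "tournaments_played": 25},
]
table_c = [
    {"rank": 1, "player": "P1", "points": 12000},
    {"rank": 2, "player": "P2", "points": 9000},
]


def union(t1, t2):
    """Union: both tables share one schema, so their rows are concatenated."""
    if set(t1[0]) != set(t2[0]):
        raise ValueError("schemas differ; tables are not unionable")
    return t1 + t2


def join(t1, t2, key):
    """Join: rows describing the same entity are merged, which adds columns."""
    index = {row[key]: row for row in t2}
    return [{**row, **index[row[key]]} for row in t1 if row[key] in index]


unioned = union(table_a, table_b)            # more entities, same columns
joined = join(table_c, table_a, "player")    # same entities, more columns
```

The union grows the entity set while the join grows the schema, which is the distinction the later slides formalize as entity complement and schema complement.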

8 Introduction - Goal
• The paper describes a framework for defining relatedness of tables and algorithms for finding related tables.
• Given a table T, return a ranked list of tables that are related to T.
• If a table T' is related to a table T that ranks high in the search results, T' could also be good and relevant.
• Tables can be related in various ways: joinable, unionable, or fragments of one table.

10 Entity Complement
• Entity Consistency (coherence)
• Entity Expansion (substantially adds new entities)
• Schema Consistency (similar schema)

11 Entity Complement - Sources of signal
• The paper infers entity groups from several signals that can be mined from the Web:
  • WebIsA database
  • Freebase types
  • Table co-occurrence
• Constructing a single measure for the relatedness of the entity sets of two tables T1 and T2
  • E(T1) = { India, Korea, Malaysia }
  • E(T2) = { Japan, Taiwan }
• Example: E(T2) = { Taiwan }
  • L(Taiwan) = { Asian countries, developed countries, ... }

13 Entity Consistency and Expansion
• Entity sets
  • E(T1) = { India, Korea, Malaysia }
  • E(T2) = { Japan, Taiwan }
• Relatedness between entity sets
  • Entity set E(T1)
  • Entity set E(T2)
• Relatedness between a pair of entities
  • L(India) = { Asian countries, developing country, location, cricket, ... }
  • L(Japan) = { Asian countries, developed countries, location, baseball, ... }

14 Relatedness between a pair of entities
• L(Japan) = { Asian countries:1, developed countries:1, location:1, baseball:1, ... }
• L(India) = { Asian countries:1, developing country:1, location:1, cricket:1, ... }
• L(Korea) = { Asian countries:1, developed countries:1, location:1, baseball:1, ... }
• E(T1) = { India, Korea, Malaysia }
• E(T2) = { Japan, Taiwan }
• re_u(India, Japan) = 2
• re_u(Korea, Japan) = 4
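The two values on this slide are consistent with treating each label set as a weight vector and taking the dot product. The sketch below assumes that reading; the function name is hypothetical, and only the labels listed on the slide are used (the slide's label sets are truncated).

```python
def pair_relatedness(labels_a, labels_b):
    """Dot product of two label-weight dictionaries (assumed form of re_u)."""
    return sum(w * labels_b[label] for label, w in labels_a.items() if label in labels_b)


# Label sets as listed on the slide; every weight is 1.
L = {
    "Japan": {"Asian countries": 1, "developed countries": 1, "location": 1, "baseball": 1},
    "India": {"Asian countries": 1, "developing country": 1, "location": 1, "cricket": 1},
    "Korea": {"Asian countries": 1, "developed countries": 1, "location": 1, "baseball": 1},
}

print(pair_relatedness(L["India"], L["Japan"]))  # 2 (Asian countries, location)
print(pair_relatedness(L["Korea"], L["Japan"]))  # 4 (all four labels shared)
```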

15 Relatedness between entity sets - aggregate relatedness
• E(T1) = { India, Korea, Malaysia }
• E(T2) = { Japan, Taiwan }
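The formula for this slide did not survive in the transcript. The experiment slides name two pairwise aggregates, SumPair and AvgPair; a minimal sketch follows, assuming SumPair sums the pairwise relatedness over all pairs in E(T1) x E(T2) and AvgPair divides that sum by the number of pairs. Both definitions are assumptions, and all scores except the two taken from slide 14 are hypothetical.

```python
from itertools import product


def sum_pair(e1, e2, re):
    """Assumed SumPair: total pairwise relatedness across the two entity sets."""
    return sum(re(a, b) for a, b in product(e1, e2))


def avg_pair(e1, e2, re):
    """Assumed AvgPair: SumPair normalized by the number of entity pairs."""
    return sum_pair(e1, e2, re) / (len(e1) * len(e2))


scores = {
    ("India", "Japan"): 2,      # from slide 14
    ("Korea", "Japan"): 4,      # from slide 14
    ("Malaysia", "Japan"): 2,   # hypothetical
    ("India", "Taiwan"): 1,     # hypothetical
    ("Korea", "Taiwan"): 2,     # hypothetical
    ("Malaysia", "Taiwan"): 1,  # hypothetical
}


def re_u(a, b):
    return scores[(a, b)]


e_t1 = ["India", "Korea", "Malaysia"]
e_t2 = ["Japan", "Taiwan"]
print(sum_pair(e_t1, e_t2, re_u))  # 12
print(avg_pair(e_t1, e_t2, re_u))  # 2.0
```

Under these assumptions the sum grows with the size of the candidate entity set, while the average does not, which matches the experiment slide's remark that AvgPair rewards consistency over entity-set expansion.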

16 Relatedness between entity sets - sets of entities
• E(T1) = { India, Korea, Malaysia }
• E(T2) = { Japan, Taiwan }
• L(E(T1)) = { Asian countries:3, developed countries:2, developing country:1, baseball:1, cricket:1, soccer:1, ... }
• L(E(T2)) = { Asian countries:2, developed countries:2, soccer:1, baseball:2, ... }
• The parameters are set to m1 = n1 = m2 = n2 = 2.0.

18 Schema Similarity
• Computing the schema similarity between the query table T1 and a candidate related table T2, denoted S_SS(T1, T2).
• The labels and attribute names of each table are compared using string similarity (Jaccard similarity).
• (Figure: T1 with attributes A1-1 to A1-5 matched against T2 with attributes A2-1 to A2-5)
• n_i = number of attributes of T_i
• N = number of attributes in the matching
• W = weight of the matching
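Only the Jaccard building block is recoverable from this slide; how W, N, and n_i combine into S_SS is not in the transcript, so it is left out here. The sketch below assumes attribute names are compared as sets of lowercase word tokens. The tokenizer and the example attribute names are hypothetical.

```python
def tokens(name):
    """Split an attribute name into lowercase word tokens (assumed tokenization)."""
    return set(name.lower().replace("_", " ").split())


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets; 1.0 if both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


same = jaccard(tokens("Player Name"), tokens("player_name"))             # identical token sets
partial = jaccard(tokens("Tournaments Played"), tokens("Matches Played"))  # one of three tokens shared
disjoint = jaccard(tokens("Rank"), tokens("Country"))                    # nothing shared
```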

20 Schema Complement
• Given a table T1, we are interested in finding tables T2 that provide the best set of additional columns.
• The goal is to add as many properties as possible to the entities in T1 while preserving the "consistency" of its schema.
• The paper considers the following factors:
  • Coverage of entity set
  • Benefits of additional attributes

22 Schema Complement - Coverage of entity set
• T2 should contain most of the entities in T1, if not all of them.
• Compute the coverage of T2's entity set with respect to T1.
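The coverage formula itself is missing from the transcript. A minimal sketch, assuming coverage is the fraction of T1's entities that also appear in T2, that is |E(T1) ∩ E(T2)| / |E(T1)|. The candidate entity set below is hypothetical.

```python
def coverage(e_t1, e_t2):
    """Fraction of T1's entities that also occur in T2 (assumed definition)."""
    e_t1, e_t2 = set(e_t1), set(e_t2)
    if not e_t1:
        return 0.0
    return len(e_t1 & e_t2) / len(e_t1)


e_t1 = {"India", "Korea", "Malaysia"}    # query table entities, from the slides
e_candidate = {"India", "Korea", "Japan"}  # hypothetical candidate table

cov = coverage(e_t1, e_candidate)  # two of T1's three entities are covered
```

The measure is asymmetric on purpose: extra entities in T2 do not lower the score, because only the entities of the query table need the new columns.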

24 Schema Complement - Benefits of additional attributes
• T2 should contain additional attributes that are not described by T1's schema.
• S(Ti) denotes the set of attributes in Ti, and S(T2) \ S(T1) represents the additional attributes in T2's schema.
• A baseline measure for the benefit of T2 simply counts the number of additional attributes.
• (Figure: T1 with attributes A1-1 to A1-5, T2 with attributes A2-1 to A2-5; baseline benefit = 1)
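The baseline count is a set difference. The sketch below uses the id/name/age schemas that appear on a later slide; the function names are hypothetical.

```python
def additional_attributes(s_t1, s_t2):
    """S(T2) \\ S(T1): attributes of T2 that T1's schema does not have."""
    return set(s_t2) - set(s_t1)


def baseline_benefit(s_t1, s_t2):
    """Baseline benefit: the number of additional attributes."""
    return len(additional_attributes(s_t1, s_t2))


s_t1 = {"id", "name"}
s_t2 = {"id", "name", "age"}
print(additional_attributes(s_t1, s_t2))  # {'age'}
print(baseline_benefit(s_t1, s_t2))       # 1
```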

25 Schema Complement - Benefits of additional attributes
• The AcsDB is a data structure that summarizes the frequencies of all possible schemas in the Web table corpus.
• Given a set of attributes S, the AcsDB provides the frequency, freq(S), of S in the table corpus.
• Assume:
  • freq(S(T1)) = freq({id, name}) = 20
  • freq({age} ∪ S(T1)) = freq({age} ∪ {id, name}) = 8
• freq(S) in the AcsDB is less meaningful for large schemas because they appear very few times in the Web table corpus.
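The slide gives the two frequencies but not the formula that uses them. A minimal sketch, assuming the consistency of adding an attribute a to a schema S is the ratio freq(S ∪ {a}) / freq(S). The dictionary standing in for the AcsDB and the function names are hypothetical.

```python
# Toy stand-in for the AcsDB: schema (as a frozenset) -> frequency in the corpus.
acsdb = {
    frozenset({"id", "name"}): 20,        # from the slide
    frozenset({"id", "name", "age"}): 8,  # from the slide
}


def freq(schema, db):
    """Frequency of an exact schema; 0 if the schema was never observed."""
    return db.get(frozenset(schema), 0)


def schema_consistency(schema, new_attr, db):
    """Assumed measure: freq(S U {a}) / freq(S)."""
    base = freq(schema, db)
    if base == 0:
        return 0.0
    return freq(set(schema) | {new_attr}, db) / base


print(schema_consistency({"id", "name"}, "age", acsdb))  # 0.4
```

The zero-frequency branch shows the weakness noted on the slide: a large schema is rarely seen verbatim, so the ratio is often undefined or based on very few observations.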

26 Schema Complement - Benefits of additional attributes
• A more effective approach derives the benefit measure from the co-occurrence of pairs of attributes rather than entire schemas.
• The consistency of a new attribute a2 with an existing attribute a1, denoted cs(a1, a2), is determined using the AcsDB.
• (Figure: T1 with schema id, name; T2 with schema id, name, age)
• Assume:
  • freq(id) = 10
  • freq(id, age) = 4

27 Schema Complement - Benefits of additional attributes
• Three kinds of aggregation are considered:
  • sum (giving importance to the amount of extension)
  • average (normalizing the total extension by the number of new attributes)
  • max (considering the most consistent attribute as representative)
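The pairwise consistency and the three aggregations can be sketched together. This assumes cs(a1, a2) = freq({a1, a2}) / freq({a1}), which the transcript does not state explicitly; the function names are hypothetical, and the second consistency score is an invented value for illustration.

```python
# Toy attribute co-occurrence frequencies, using the numbers from slide 26.
pair_freq = {
    frozenset({"id"}): 10,
    frozenset({"id", "age"}): 4,
}


def cs(a1, a2, db):
    """Assumed pairwise consistency: freq({a1, a2}) / freq({a1})."""
    base = db.get(frozenset({a1}), 0)
    if base == 0:
        return 0.0
    return db.get(frozenset({a1, a2}), 0) / base


def benefit(scores, how):
    """Aggregate the consistency scores of the new attributes."""
    if not scores:
        return 0.0
    if how == "sum":
        return sum(scores)
    if how == "average":
        return sum(scores) / len(scores)
    if how == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation: {how}")


id_age = cs("id", "age", pair_freq)  # 4 / 10
scores = [id_age, 0.1]               # second score is hypothetical
total = benefit(scores, "sum")
mean = benefit(scores, "average")
best = benefit(scores, "max")
```

Sum grows with every attribute added, average is insensitive to how many are added, and max depends only on the best single attribute.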

29 Experiment
• Experimental setup:
  • The relatedness algorithms were evaluated on 18 queries.
  • The top-5 related tables are generated for each query table.
  • Results are based on aggregating the feedback of 8 users.
• Metric:
  • For each query table, the top-k (k = 1 to 5) related tables are generated.
  • Each user provides a score from 0 to 5 indicating how closely T2 is related to T1 with respect to R (0 being not related and 5 being most related).

30 Experiment - Entity Complement Results
• Freebase labels are most effective if we consider only one source of labels.
• Interestingly, adding WebIsA and WebTable labels does not have an apparent impact.

31 Experiment - Entity Complement Results
• The AvgPair measure, which rewards consistency over entity-set expansion, is significantly better than SumPair (up to 25% improvement).

32 Experiment
• The best result combines set-based relatedness with labels from WebTable.

33 Experiment - Schema Complement Results
• Sum achieves the best results, while max and count also perform reasonably well.
• That sum aggregation is the most effective indicates that, when considering schema complement, users indeed focus on the number of additional attributes.

35 Conclusion
• The paper introduced the problem of finding tables in a large heterogeneous corpus that are related to an input table.
• It presented a framework that captures a multitude of relatedness types, and described algorithms for ranking tables based on entity complement and schema complement.
• It described user studies that evaluated the quality of the related-table detection algorithms and showed how related-table discovery may enhance table search results.