Collectively Representing Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen and Jamie Callan Language Technologies Institute Carnegie Mellon.

Presentation transcript:

Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Paper ID: 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA C-7058.

Motivation
- Entities on the Web can appear in multiple datasets, e.g. HTML tables and text documents.
- Traditional systems represent an entity as a sparse vector of the IDs of the documents in which it occurs.
- We propose a low-dimensional representation for such entities.
- This representation makes it possible to perform different tasks efficiently with a small number of primitive operations:
  Semi-Supervised Learning (SSL)
  Set Expansion (SE)
  Automatic Set Instance Acquisition (ASIA)

Entities in HTML tables
Example table (columns: Country, Sports):
  India     Hockey
  UK        Cricket
  USA       Tennis
Example table (columns: Country, Capital City):
  India     Delhi
  USA       Washington DC
  Canada    Ottawa
  France    Paris
Column labels shown on the tables: TC-2, TC-3
Entity-Column Bi-partite Graph
  Entity nodes: USA, India, Hockey, Cricket, Tennis
  Table-column nodes: TC-1, TC-2, TC-3, TC-4
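The entity-column graph on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_entity_column_graph` is hypothetical, and numbering the column ids TC-1, TC-2, ... in reading order is an assumption, since the slide does not say which column carries which id.

```python
from collections import defaultdict

def build_entity_column_graph(tables):
    """Map each entity (cell value) to the set of table-column ids it occurs in.

    `tables` is a list of tables; each table is a list of rows whose first
    row is the header. Column ids are assigned in reading order, which is
    an assumption made for this sketch.
    """
    graph = defaultdict(set)
    next_id = 1
    for table in tables:
        header, rows = table[0], table[1:]
        column_ids = []
        for _ in header:
            column_ids.append(f"TC-{next_id}")
            next_id += 1
        for row in rows:
            for column_id, cell in zip(column_ids, row):
                graph[cell].add(column_id)  # one edge per (entity, column)
    return dict(graph)

tables = [
    [["Country", "Sports"], ["India", "Hockey"], ["UK", "Cricket"], ["USA", "Tennis"]],
    [["Country", "Capital City"], ["India", "Delhi"], ["USA", "Washington DC"],
     ["Canada", "Ottawa"], ["France", "Paris"]],
]
graph = build_entity_column_graph(tables)
print(sorted(graph["India"]))
```

Entities that occur in the same columns (here India and USA) end up with the same neighbour set, which is the co-occurrence signal the later embedding step relies on.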

Entities in unstructured text
Example sentences:
  Countries such as India are developing rapidly in terms of infrastructure.
  Outdoor sports include Tennis and Cricket.
Entity “Such as” Bi-partite Graph
  Entity nodes: USA, India, Hockey, Cricket, Tennis
  “Such as” concept nodes: Country, Location, Sports
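As a rough illustration of how "such as" edges might be pulled from text, the sketch below uses a deliberately naive regular expression. The pattern and the function name `extract_such_as_edges` are assumptions for this example; the slides do not describe the extractor that was actually used, and a practical one would handle multi-word phrases, lists, and plural normalisation.

```python
import re

# Naive matcher for "<concept> such as <entity>" with single-word arguments.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def extract_such_as_edges(text):
    """Return (entity, concept) pairs found by the naive pattern."""
    return [(entity, concept) for concept, entity in SUCH_AS.findall(text)]

text = ("Countries such as India are developing rapidly in terms of "
        "infrastructure. Outdoor sports include Tennis and Cricket.")
edges = extract_such_as_edges(text)
print(edges)
```

Only the first sentence matches; the second uses an "include" construction, which this simple pattern does not cover.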

Resultant Tri-partite Graph
  Entity nodes: USA, India, Hockey, Cricket, Tennis
  Table-column nodes: TC-1, TC-2, TC-3, TC-4 (Entity-Column Bi-partite Graph)
  “Such as” concept nodes: Country, Location, Sports (Entity “Such as” Bi-partite Graph)

Encoding the graph: “Entity-Column” Bi-partite Graph
  Entity nodes: USA, India, Hockey, Cricket, Tennis
  Table-column nodes: TC-1, TC-2, TC-3, TC-4
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)
Embedding table: one row per entity (USA, India, Hockey, Cricket, Tennis) with columns X1 and X2
Entities with similar X1/X2 values should be ontologically similar: the values summarize tabular co-occurrence.

Encoding the graph: Entity “Such as” Bi-partite Graph
  Entity nodes: USA, India, Hockey, Cricket, Tennis
  “Such as” concept nodes: Country, Location, Sports
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)
Embedding table: one row per entity (USA, India, Hockey, Cricket, Tennis) with columns Y1 and Y2
Entities with similar Y1/Y2 values should be ontologically similar: the values summarize “such as pattern” co-occurrence.
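The embedding step can be sketched as a truncated power iteration on the bipartite graph. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the entity-entity affinity is taken to be A = B Bᵀ (B being the entity-by-neighbour incidence matrix) and is never materialised, each of the m dimensions comes from a different random start vector, and the name `pic_embedding`, the iteration count, and the toy edges are all illustrative. Every entity is assumed to have at least one edge.

```python
import random

def pic_embedding(edges, m=2, iters=5, seed=0):
    """Embed entities in m dimensions by a few steps of power iteration.

    `edges` maps each entity to the set of neighbour nodes (table columns
    or "such as" concepts) it is connected to.
    """
    entities = sorted(edges)
    rng = random.Random(seed)
    # Invert the adjacency: neighbour node -> entities attached to it.
    members = {}
    for e in entities:
        for node in edges[e]:
            members.setdefault(node, []).append(e)
    # Row sums of the implicit affinity A = B B^T.
    degree = {e: sum(len(members[node]) for node in edges[e]) for e in entities}
    embedding = {e: [] for e in entities}
    for _ in range(m):
        v = {e: rng.random() for e in entities}  # random start vector
        for _ in range(iters):  # stop early; do not run to convergence
            # B^T v: push entity values onto neighbour nodes.
            pushed = {node: sum(v[e] for e in es) for node, es in members.items()}
            # D^-1 B (B^T v): pull back to entities and row-normalise.
            v = {e: sum(pushed[node] for node in edges[e]) / degree[e]
                 for e in entities}
            total = sum(abs(x) for x in v.values())
            v = {e: x / total for e, x in v.items()}  # L1 normalisation
        for e in entities:
            embedding[e].append(v[e])
    return embedding

# Hypothetical edges: countries share one column, sports share another.
edges = {
    "USA": {"TC-1"}, "India": {"TC-1"},
    "Hockey": {"TC-2"}, "Cricket": {"TC-2"}, "Tennis": {"TC-2"},
}
emb = pic_embedding(edges)
print(emb["USA"], emb["Hockey"])
```

In this toy graph the two groups are disconnected, so entities within a group receive identical coordinates while the groups differ, which is the behaviour the slide describes for ontologically similar entities.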

Low-dimensional PIC3 embedding
  n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t)
  n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s)
  Concatenate the two PIC embeddings → n * 2m PIC3 embedding
Embedding table: one row per entity (USA, India, Hockey, Cricket, Tennis) with columns X1, X2, Y1, Y2
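The concatenation step is simple enough to show directly. In this sketch the function name `concatenate_embeddings` and all numeric values are made up for illustration, and filling a missing view with zeros is an assumption; the slides do not say how entities absent from one graph are handled.

```python
def concatenate_embeddings(table_emb, suchas_emb):
    """Join per-entity vectors from the two views into one 2m-dimensional vector.

    Entities missing from a view get zeros for that view (an assumption).
    """
    m_table = len(next(iter(table_emb.values())))
    m_suchas = len(next(iter(suchas_emb.values())))
    entities = sorted(set(table_emb) | set(suchas_emb))
    return {
        e: table_emb.get(e, [0.0] * m_table) + suchas_emb.get(e, [0.0] * m_suchas)
        for e in entities
    }

# Illustrative values only; real X1/X2 and Y1/Y2 come from the PIC step.
table_emb = {"USA": [0.1, 0.2], "India": [0.1, 0.3]}
suchas_emb = {"USA": [0.5, 0.6], "Tennis": [0.9, 0.8]}
pic3 = concatenate_embeddings(table_emb, suchas_emb)
print(pic3["USA"])
```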

Using PIC3 Representation
Semi-Supervised Learning: Given a few seed examples for each class, predict class labels for unlabeled data points.
Set Expansion: Given a set of seed entities, find more entities similar to the seeds.
Automatic Set Instance Acquisition (ASIA): Given a concept name, automatically find instances of that concept.

Quantitative Evaluation: Datasets
Dataset                              Toy_Apple   Delicious_Sports
#entities                            14,
#table-columns
#entity-table column edges           176,598     9,192
#suchas concepts                     2,348       1,649
#entity-suchas edges                 7,683       4,799
#general entity classes (NELL KB)    11          3
#entities in general classes
#hand-coded column types             31          30
#columns in labeled types
Link to dataset:

SSL using PIC3
Input: A few seed examples for each class label
Output: Class labels for unlabeled data points
Task: Semi-Supervised Learning
Training: PIC3 + train SVM classifier
Testing: Predict using the learnt SVM model
PIC clusters similar entities together, which yields a better SVM classifier on unlabeled data (use of background data).
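To make the train-then-predict flow concrete, here is a small sketch that fits a linear SVM on a handful of labeled seed vectors and labels the rest. It is a stand-in under stated assumptions: the slides do not name the SVM package or kernel, so a bias-free linear SVM trained by sub-gradient descent on the hinge loss is used here, restricted to two classes, and the feature values are invented rather than real PIC3 coordinates.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss (labels are +1/-1).

    No bias term is used, to keep the sketch short.
    """
    w = [0.0] * len(points[0])
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation shrink
            if margin < 1.0:  # margin violated: step towards this example
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Hypothetical low-dimensional vectors: +1 = Country, -1 = Sport.
seed_points = [[1.0, 0.9], [0.9, 1.1], [-1.0, -0.8], [-1.1, -1.0]]
seed_labels = [1, 1, -1, -1]
w = train_linear_svm(seed_points, seed_labels)

unlabeled = {"Canada": [0.8, 1.0], "Golf": [-0.9, -1.1]}
predicted = {name: predict(w, vec) for name, vec in unlabeled.items()}
print(predicted)
```

Because the classifier only sees a few dense coordinates per entity, training stays cheap even when the original graph has thousands of columns.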

SSL Task - I
# dimensions: 2504 → 10

SSL Task - II
# dimensions: 2574 → 10

Set Expansion using PIC3
Input: A few seed entities, e.g. Football, Hockey, Tennis
Output: More entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, ...
Task: Set Expansion
Training: PIC3
Testing: Centroid(entity set) + K-NN(centroid)
The K-NN operation is extremely efficient using KD-trees.
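The centroid-plus-nearest-neighbour recipe can be sketched as follows. Note the swap: the slide uses a KD-tree for the K-NN step, while this example does a brute-force scan to stay dependency-free. The function name `expand_set` and the embedding values are illustrative assumptions.

```python
import math

def expand_set(embedding, seeds, k=3):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    dims = len(embedding[seeds[0]])
    centroid = [sum(embedding[s][d] for s in seeds) / len(seeds)
                for d in range(dims)]
    candidates = [e for e in embedding if e not in seeds]
    # Brute-force scan; a KD-tree would answer the same query faster.
    candidates.sort(key=lambda e: math.dist(embedding[e], centroid))
    return candidates[:k]

# Made-up 2-dimensional vectors standing in for PIC3 coordinates.
embedding = {
    "Football": [0.90, 0.10], "Hockey": [0.80, 0.20], "Tennis": [0.85, 0.15],
    "Cricket": [0.88, 0.12], "Golf": [0.80, 0.22],
    "India": [0.10, 0.90], "USA": [0.15, 0.85],
}
result = expand_set(embedding, ["Football", "Hockey", "Tennis"], k=2)
print(result)
```

With these values the two sports nearest the seed centroid are returned ahead of the country entities.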

Query Times
PIC3 preprocessing: 0.02 sec
# SE queries = 881
Method            Total Query Time (s)
K-NN + PIC3       12.7
K-NN-Baseline     80.1
MAD               38.2
MAD = Modified Adsorption, a graph-based label propagation algorithm.
Precision-Recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/5 query classes, at the expense of larger query time.

Automatic Set Instance Acquisition (ASIA) using PIC3
Input: Class label, e.g. Country
Output: Entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, ...
Task: Automatic Set Instance Acquisition
Training: PIC3 + inverted index (suchasConcept → entities)
Testing: seeds = top-k entities (look up the concept in the index), then Set Expansion (seeds)
The previously described Set Expansion algorithm is used as a subroutine here.
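A minimal sketch of the ASIA flow: look the concept up in an inverted index to get seeds, then expand the seeds by nearest neighbours around their centroid. The function name `asia`, the ranking of entities inside the index, the brute-force neighbour search, and all values are assumptions for illustration.

```python
import math

def asia(concept, concept_index, embedding, top_k=2, k=2):
    """Return seed entities for `concept` followed by their k nearest neighbours."""
    # Inverted index: concept -> entities, assumed to be ranked best-first.
    seeds = concept_index[concept][:top_k]
    dims = len(embedding[seeds[0]])
    centroid = [sum(embedding[s][d] for s in seeds) / len(seeds)
                for d in range(dims)]
    others = sorted((e for e in embedding if e not in seeds),
                    key=lambda e: math.dist(embedding[e], centroid))
    return seeds + others[:k]

# Hypothetical index and embedding values.
concept_index = {"Country": ["India", "China"], "Sports": ["Tennis", "Hockey"]}
embedding = {
    "India": [0.10, 0.90], "China": [0.12, 0.88], "USA": [0.15, 0.85],
    "Canada": [0.20, 0.80], "Tennis": [0.85, 0.15], "Hockey": [0.80, 0.20],
}
instances = asia("Country", concept_index, embedding)
print(instances)
```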

Query Times
PIC3 preprocessing: 0.02 sec
# ASIA queries = 25
Method            Total Query Time (s)
K-NN + PIC3       0.5
K-NN-Baseline     1.4
MAD               150.0
Precision-Recall curve: K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/4 query classes, at the expense of much larger query time.

Conclusions & Future Work
- Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
- Simple primitive operations on PIC3 suffice to perform the following tasks:
  Semi-Supervised Learning
  Set Expansion
  Automatic Set Instance Acquisition
- Future work: use the PIC3 representation for named entity disambiguation and unsupervised class-instance pair acquisition.

Thank You !!
Please visit our poster, ID: 02

Examples: Set Expansion

Examples: ASIA

Set Expansion

ASIA Task