Very Fast Similarity Queries on Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen Language Technologies Institute, Carnegie Mellon University.

Presentation transcript:

Poster panels: Contributions; Preprocessing to create PIC-D; PIC-D Representation for Entities on the Web; Query Runtime Speedup vs. Results Quality.

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA C

Conclusions
- PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen (ICML).
- Number of dimensions in PIC-D = √(total number of dimensions).
- Time to create PIC-D is linear in the total number of dimensions.
- Information extraction tasks are posed as similarity queries on PIC-D.
- Precision and recall are comparable to the high-dimensional baseline.
- Up to 2 orders of magnitude improvement in query run-time, at the cost of a small amount of pre-processing time to create PIC-D.

Preprocessing to create PIC-D
- Each of the D views of the data is a bipartite graph between entities and features, stored as an |X| * n_d matrix (d = 1, ..., D). E.g. entities in HTML tables, entities with Hearst patterns, entities in Subj-Verb-Obj triples.
- PIC turns each |X| * n_d bipartite graph into an |X| * m PIC embedding, with m << n_d.
- The D embeddings are concatenated into a single |X| * (D * m) PIC-D embedding.

Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.
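The construction above (one PIC embedding per view, then concatenation) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It makes several assumptions the poster does not state: the entity-entity affinity for a view is taken to be F Fᵀ, where F is the |X| * n_d bipartite matrix; the power iteration runs for a fixed number of steps rather than using an early-stopping rule; and the m dimensions come from m random starting vectors. The function names `pic_embedding` and `pic_d_embedding` and the toy matrices are hypothetical.

```python
import numpy as np


def pic_embedding(F, m, n_iter=10, seed=0):
    """Map an |X| x n bipartite matrix F to an |X| x m embedding.

    Assumption: affinity A = F F^T, row-normalized to W = D^-1 A, and a
    fixed number of power-iteration steps from m random start vectors.
    """
    rng = np.random.default_rng(seed)
    F = np.asarray(F, dtype=float)
    # Row sums of A = F F^T, computed without materializing A.
    degree = F @ (F.T @ np.ones(F.shape[0]))
    degree[degree == 0] = 1.0  # isolated entities: avoid division by zero
    V = rng.random((F.shape[0], m))
    V /= np.abs(V).sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # One step of v <- W v, done as D^-1 (F (F^T v)).
        V = (F @ (F.T @ V)) / degree[:, None]
        norm = np.abs(V).sum(axis=0, keepdims=True)
        norm[norm == 0] = 1.0
        V /= norm  # L1-normalize each column
    return V


def pic_d_embedding(views, m, n_iter=10, seed=0):
    """Concatenate the per-view embeddings into an |X| x (D * m) matrix."""
    parts = [pic_embedding(F, m, n_iter, seed + i) for i, F in enumerate(views)]
    return np.hstack(parts)


# Hypothetical toy views for 5 entities (rows): 4 table columns, 3 concepts.
tables = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 0]], dtype=float)
hearst = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 1, 1]], dtype=float)

embedding = pic_d_embedding([tables, hearst], m=2)
print(embedding.shape)  # (5, 4): D = 2 views, m = 2 dimensions each
```

Multiplying by F and Fᵀ in turn, instead of forming F Fᵀ, keeps each iteration linear in the number of non-zero entries of F. That is one plausible reading of the poster's claim that creation time is linear in the total number of dimensions.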
[Figure: example PIC-3 embedding, m = 2, for the entities USA, India, Football, Hockey and Baseball. Inputs: entity occurrences in text with Hearst patterns and entity occurrences in HTML table columns. Labels shown: Country, Location, Sports; TC-1, TC-2, TC-3, TC-4.]

Datasets
Values are listed in the order Toy_Apple, Delicious_Sports, ASIA_INT, Clueweb_Sports.
Property     Description                          Values
             #HTML pages                          57421K121K918K
|X|          # Entities                           15K43815K30K
|C|          # table columns                      K78K
|(x,c)|      # (x, c) edges                       70.5K, 5.5K, 91K, 566K
|Ys|         # suchas concepts                    2.3K, 1.6K, 3.8K, 21.4K
|(x, Ys)|    # (x, Ys) edges                      7.7K, 4.8K, 18.3K, 107.8K
|Yn|         # NELL classes                       11323
|(x, Yn)|    # (x, Yn) edges
|Yc|         # manual column labels
|(c, Yc)|    # (c, Yc) pairs
             #PIC-D dimensions
             Total time to create PIC-D (msec)

Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class.

IE Tasks as Similarity Queries
[Plots: Set Expansion task on Clueweb_Sports; ASIA task on Clueweb_Sports.]
- Similarity queries on PIC-D are up to 2 orders of magnitude faster.
- PIC-D gives precision/recall comparable to the high-dimensional baseline.
- Label propagation achieves better performance at the cost of huge query runtimes.

Contributions
- We present a single, efficiently constructible representation for entities on the Web, named PIC-D.
- IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition (ASIA) and Column Classification.
- PIC-D yields large savings in query run-time with comparable quality of results.
- Future work: using the PIC-D representation with many more views of the data, e.g. SVO triples and properties derived from KBs, for unsupervised class-instance pair acquisition.

Aggregate results over:
- Set expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
- ASIA: 25 queries (Delicious_Sports)
- COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)
[Plot panels: ASIA; Column Classification; Set Expansion.]

How many PIC-D dimensions are enough? How much time does it take to create PIC-D?
m = √n and time = O(n), where n is the total number of dimensions.
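To make "IE tasks as similarity queries" concrete, here is a minimal sketch of set expansion over a low-dimensional entity embedding. The poster does not say which similarity measure or ranking scheme was used, so this sketch assumes cosine similarity to the centroid of the seed entities. The function name `expand_set` is hypothetical, and the embedding values are made up for illustration; they are not results from the poster.

```python
import numpy as np


def expand_set(embedding, names, seeds, k=2):
    """Return the k non-seed entities most similar to the seed centroid.

    Assumption: similarity is cosine similarity in the embedding space.
    """
    index = {name: i for i, name in enumerate(names)}
    centroid = embedding[[index[s] for s in seeds]].mean(axis=0)
    norms = np.linalg.norm(embedding, axis=1) * np.linalg.norm(centroid)
    norms[norms == 0] = 1.0  # guard against all-zero rows
    scores = (embedding @ centroid) / norms
    order = np.argsort(-scores)  # highest similarity first
    return [names[i] for i in order if names[i] not in seeds][:k]


# Made-up 4-dimensional embedding (e.g. D = 2 views, m = 2) for 5 entities.
names = ["USA", "India", "Football", "Hockey", "Baseball"]
toy_embedding = np.array([
    [0.9, 0.1, 0.8, 0.1],   # USA
    [0.8, 0.2, 0.9, 0.1],   # India
    [0.1, 0.9, 0.1, 0.8],   # Football
    [0.2, 0.8, 0.1, 0.9],   # Hockey
    [0.1, 0.8, 0.2, 0.9],   # Baseball
])

print(expand_set(toy_embedding, names, seeds=["Football", "Hockey"], k=1))
# ['Baseball']
```

Each query costs one pass over an |X| * (D * m) matrix, so with this kind of ranking the query time depends on the embedding width and not on the original number of dimensions.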