Arnd Christian König, Venkatesh Ganti, Rares Vernica (Microsoft Research). Entity Categorization Over Large Document Collections.

Presentation transcript:

Arnd Christian König, Venkatesh Ganti, Rares Vernica (Microsoft Research). Entity Categorization Over Large Document Collections

Relationship Extraction from Text
Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
…Donald Knuth works in research… → is-a-researcher(Donald_Knuth)
…Yao Ming plays for the Houston Rockets… → works-for(Yao_Ming, Houston_Rockets)
Motivation: going from unstructured data to structured data; applications in search, business intelligence, etc.
Focus: targeted extraction (vs. open relationship extraction) over large document collections (> 10^7 documents).

Using Aggregate Context
Single-context extraction: ([Entity], is-a-researcher), using extraction logic such as "[E] works … research".
Multi-context extraction: …[Entity] works in research… / …[Entity] published… / …[Entity]'s paper… / …[Entity] gave a talk… fed to a multi-feature relation extractor, yielding ([Entity], is-a-researcher).
Aggregate context features: ([Entity], paper), ([Entity], talk), ([Entity], published).
We track an entity across contexts, allowing us to combine less predictive features.

Using Co-occurrence Features
Leverage co-occurrence of entity classes (e.g., directors likely co-occur with actors) for extraction.
Example: extraction of the is-a-director relation. Actor list: Alan Alda, Richard Gere, Julia Roberts, …
…Julia Roberts starred in a Robert Altman film in 1994… → aggregate context feature: Robert_Altman co-occurs with an actor name.
Co-occurrence features can be between entities of different classes or entities of one class; combination with text features is also possible, e.g., [Entity] plays for [Team_Name].
Two questions: (a) What difference do the aggregate contexts make for extraction accuracy? (b) This means keeping track of contexts across documents; can we make this efficient?

Relationship Extraction from Text
Task: extraction of structured relations from text.
…Donald Knuth works in research… → is-a-researcher(Donald_Knuth)
…Yao Ming plays for the Houston Rockets… → works-for(Yao_Ming, Houston_Rockets)
Applications: bridging the chasm from unstructured to structured data; advanced search; business intelligence.
Focus: targeted extraction (vs. open relation extraction); unary relations (vs. n-ary relations).

Aggregate Features

Using Aggregate Context
Each context is not in itself sufficient to infer the category "researcher".
Single-context extraction: ([Entity], is-a-researcher), via an extraction rule such as "[E] works in research".
Multi-context extraction: …[Entity] works in research… / …[Entity] published… / …[Entity]'s paper… / …[Entity] gave a talk at… fed to a multi-feature classifier, yielding ([Entity], is-a-researcher).
We can track an entity across pages, allowing us to combine less predictive features, as sketched below.
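The pooling step can be illustrated with a minimal Python sketch (not the system described in this deck): contexts for the same entity are accumulated across documents into one binary feature set, so that individually weak cues such as "published" or "talk" can later be combined by a classifier. The tokenizer, window size, and the extract_entities helper are hypothetical placeholders.

```python
from collections import defaultdict

def aggregate_features(documents, extract_entities, window=3):
    """Pool the context tokens seen around each entity across all documents.

    `extract_entities(tokens)` is a hypothetical recognizer yielding
    (token_position, entity_string) pairs; it stands in for the deck's
    entity-recognition logic.
    """
    features = defaultdict(set)                      # entity -> set of binary features
    for doc in documents:
        tokens = doc.split()                         # naive whitespace tokenizer (illustration only)
        for position, entity in extract_entities(tokens):
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            for tok in left + right:
                features[entity].add("ctx:" + tok.lower())
    return features

# Usage idea: a multi-feature classifier then sees *all* contexts of an entity
# at once, e.g. predictions = {e: clf.predict(f) for e, f in aggregate_features(...).items()}
```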

Using Co-occurrence
In targeted extraction, we can leverage co-occurrences of entities.
Example: extraction of the is-a-movie relation. Actor list: Alan Alda, Richard Gere, Julia Roberts, …
…Julia Roberts starred in Pretty Woman in 1988… → multi-feature classifier feature: co-occurrence between the entity and an actor name in its context.
Co-occurrence features can be between entities of different classes or entities of one class (e.g., actors); combination with text features is possible, e.g., [E] plays for [Team_Name]. A sketch of such a co-occurrence feature follows.
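A minimal sketch of what such a co-occurrence feature computes, assuming a plain token window and an externally supplied member list; the actual system detects list members with filters over large list corpora (see the processing slides below), so this is only meant to illustrate the feature itself.

```python
def cooccurrence_feature(tokens, entity_pos, member_list, window=8):
    """Return True if any member of `member_list` (e.g. an actor list) occurs
    within `window` tokens of the candidate entity at `entity_pos`.
    The window size and list are illustrative assumptions."""
    members = {m.lower() for m in member_list}
    lo = max(0, entity_pos - window)
    hi = entity_pos + window + 1
    context = [t.lower() for t in tokens[lo:hi]]
    # Check unigrams and bigrams, since list members may be multi-token names.
    for i in range(len(context)):
        if context[i] in members:
            return True
        if i + 1 < len(context) and f"{context[i]} {context[i + 1]}" in members:
            return True
    return False
```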

Processing Large Document Collections

Architecture
Problem setting: |D| > available memory |M|; co-occurrence lists |L| > |M|.
Pipeline: Document Corpus D → feature extraction and, via the list corpus L, aggregate feature extraction → entity-feature pairs → classification; in parallel, rule-based extraction → entity-relation pairs → aggregation, keeping pairs with COUNT(entity, relation) > Δ. A sketch of the rule-based path with the threshold follows.
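A minimal sketch of the rule-based path under simplifying assumptions: regular expressions stand in for the deck's extraction rules, and the corpus is streamed once while (entity, relation) counts are aggregated and thresholded at Δ. The rule format and the example pattern are illustrative, not the paper's.

```python
from collections import Counter
import re

def extract_and_aggregate(documents, rules, delta):
    """Apply per-document extraction rules, then keep (entity, relation) pairs
    whose corpus-wide support exceeds the threshold delta (Δ)."""
    counts = Counter()
    for doc in documents:                          # one streaming pass over D
        for pattern, relation in rules:            # regex stands in for a rule
            for entity in re.findall(pattern, doc):
                counts[(entity, relation)] += 1
    return {pair for pair, support in counts.items() if support > delta}

# Usage (illustrative rule and threshold):
# rules = [(r"(\w+ \w+) works in research", "is-a-researcher")]
# relations = extract_and_aggregate(corpus, rules, delta=2)
```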

Architecture
Running single-context extraction, context feature extraction, aggregate feature extraction, and co-occurrence detection (against the co-occurrence list corpus L) as separate passes over the document corpus D duplicates overhead from document scanning, document processing, and entity extraction. This motivates a new architecture.

Challenges: 1. Fast & accurate co- occurrence detection using the synopsis. 2. Pruning of redundant output. Context Feature Extraction New Architecture Document Corpus D Aggregation Rule-based Extraction Classification Agg. Feature Extraction Synopsis of L Delete false Positives Co-Occurrence List corpus L Aggregation List-Member Extraction Co-Occurrence Detection Entity – Candidate Context Pairs Entity-List Pairs Entity-Feature Pairs Fast identification of candidate matches through 2-stage filtering. Use of Bloom-Filters to trade off memory footprint with false positive rate. Fast identification of candidate matches through 2-stage filtering. Use of Bloom-Filters to trade off memory footprint with false positive rate. Frequency-distribution of entities very skewed. Pruning based on retaining most frequent entities and list members in memory. Challenge: Determining frequencies online. => Compact hash-synopses of frequencies (CM-Sketch) perform well. Frequency-distribution of entities very skewed. Pruning based on retaining most frequent entities and list members in memory. Challenge: Determining frequencies online. => Compact hash-synopses of frequencies (CM-Sketch) perform well. Potentially very large output: Duplication via very many co-occurrences, e.g. actor- actor. Potentially very large output: Duplication, e.g. Entity: George Bush Feature: President

Fast Document Processing
Replace each co-occurrence list L_i with an approximate representation Filter_i.
Fast detection of candidate contexts via a separate token filter that detects sequences of hit tokens.
A straightforward implementation requires testing all sub-sequences up to the length of the longest member in L_i; token filters identify only the sub-sequences whose individual tokens occur in L, reducing overhead.
Example: in …Julia Roberts starred in Pretty Woman in 1988…, test only {starred, pretty, woman, pretty woman}.
Each Filter_i, as well as the token filters, is realized as a Bloom filter, allowing false-positive rate to be traded against memory footprint. A sketch appears below.
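A minimal sketch of this two-stage filtering, assuming a simple MD5-based Bloom filter (the deck does not specify hash functions or sizes): a token filter marks which tokens occur anywhere in the list at all, and only runs of such "hit" tokens are expanded into sub-sequences that are probed against the member filter. Matches are still only candidates and must be verified against the list corpus L afterwards.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: false positives possible, no false negatives.
    MD5-based hashing and default sizes are illustrative assumptions."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def candidate_members(tokens, token_filter, member_filter, max_len):
    """Two-stage check: the token filter narrows work to runs of 'hit' tokens,
    and only sub-sequences of those runs are probed against the member filter."""
    lowered = [t.lower() for t in tokens]
    hits = [t in token_filter for t in lowered]
    for start in range(len(tokens)):
        if not hits[start]:
            continue
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if not hits[end - 1]:
                break                          # the run of hit tokens ended
            phrase = " ".join(lowered[start:end])
            if phrase in member_filter:
                yield start, phrase            # candidate only; verify against L later

# Construction (illustrative): every whole member goes into member_filter,
# every individual token of every member goes into token_filter.
# for member in actor_list:
#     member_filter.add(member.lower())
#     for tok in member.split():
#         token_filter.add(tok.lower())
```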

Pruning Redundant Output

Pruning Candidate Contexts: retain the most frequent list members in memory; list membership is known, and member frequency can be estimated off-line.
Pruning Entity-List Pairs: retain the most frequent entities, and the list IDs they co-occur with, in memory. Challenge: dynamically determining entity frequencies. The distributions of entity frequencies and list-member frequencies are very skewed, so a compact hash synopsis of frequencies (CM-Sketch) works well; a sketch follows.
Pruning Entity-Feature Pairs: similar to online caching problems, but with very small items and a very large space of entities x features, hence few repetitions. A simple algorithm, Write-On-Full, performs within 10% of the best caching approaches.
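A minimal Count-Min sketch implementation, as one way to realize the compact hash synopsis of frequencies mentioned above; the width, depth, and MD5-based hashing are illustrative choices, not the paper's parameters.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: estimates may overcount, never undercount."""
    def __init__(self, width=2 ** 16, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

# Usage idea: buffer contexts only for entities whose estimated frequency
# exceeds a threshold; infrequent entities are pruned from memory.
# sketch = CountMinSketch(); sketch.add(entity)
# keep_in_memory = sketch.estimate(entity) >= threshold
```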

Architecture
Problem setting: |D| > available memory |M|; co-occurrence lists |L| > |M|.
Document Corpus D → feature extraction and aggregate feature extraction, with list-member detection against the list corpus L → entity-feature pairs → classification; in parallel, rule-based extraction → entity-relation pairs → aggregation, keeping pairs with COUNT(entity, relation) > Δ.

Architecture (diagram): stream of documents D; synopsis of L; list-member detection; verification against the list corpus L; list-member extraction; feature extraction; Edges(G_{E,L}); Edges(G_{E,F}); aggregation; classifiers C; Edges(G_{E,C}).

Experiments

Experimental Evaluation
Task: categorization of entities into professions (actor, writer, painter, etc.).
Document corpus: 3.2 million Wikipedia pages.
Training data generated using Wikipedia lists of famous painters, writers, etc.
Aggregate-context classifier: linear SVM using text n-gram and co-occurrence features (binary); an illustrative setup follows.
Single-context classifier: 100K extraction rules (incl. gaps) derived from the training data (algorithm of [König and Brill, KDD 2006]).
Co-occurrence list: contains 10% of the entity strings in the training data.
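An illustrative setup for the aggregate-context classifier, using scikit-learn purely for exposition (the deck does not say which SVM implementation was used): binary n-gram and co-occurrence features per entity are vectorized and fed to a linear SVM. The feature names, training examples, and labels are placeholders.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Placeholder per-entity feature dicts: binary text n-gram and co-occurrence features.
train_features = [
    {"ctx:works in research": 1, "ctx:published": 1, "cooc:researcher_list": 1},
    {"ctx:starred in": 1, "cooc:actor_list": 1},
]
train_labels = ["researcher", "actor"]

vectorizer = DictVectorizer()                    # feature dicts -> sparse vectors
X = vectorizer.fit_transform(train_features)
model = LinearSVC().fit(X, train_labels)         # linear SVM over binary features

test = vectorizer.transform([{"ctx:gave a talk": 1, "ctx:published": 1}])
print(model.predict(test))
```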

Experimental Evaluation: Accuracy

Reducing the correlation required for extraction rules trades off recall and precision.

Experimental Evaluation: Accuracy

Experimental Evaluation: Overhead
Skew in co-occurrence-list member frequency enables efficient pruning. As we scale up |D|, pruning efficiency increases.

Experimental Evaluation: Overhead
The main remaining overhead is writing out entity-feature pairs. A simple caching strategy reduces this overhead by an order of magnitude; a sketch of one such buffer follows.
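A minimal sketch of one plausible reading of the Write-On-Full strategy named earlier: (entity, feature) occurrences are merged in a bounded in-memory buffer, and the whole buffer is written out once it fills, so repeated pairs cost only one write per flush. The exact policy in the paper may differ; the buffer capacity and the sink callable are assumptions.

```python
from collections import Counter

class WriteOnFullBuffer:
    """Bounded buffer for (entity, feature) pairs; flushed in full when it
    reaches capacity, batching disk writes. Details are assumptions."""
    def __init__(self, capacity, sink):
        self.capacity = capacity
        self.sink = sink                 # callable taking ((entity, feature), count)
        self.buffer = Counter()

    def add(self, entity, feature):
        self.buffer[(entity, feature)] += 1      # repeats merge instead of re-writing
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        for key, count in self.buffer.items():
            self.sink(key, count)
        self.buffer.clear()
```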

Conclusions
Studied the effect of aggregate context in relation extraction and proposed efficient processing techniques for large text corpora.
Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.

Questions?