Mining Reference Tables for Automatic Text Segmentation
Eugene Agichtein, Columbia University
Venkatesh Ganti, Microsoft Research

Scenarios
Importing unformatted strings into a target structured database
– Data warehousing
– Data integration
Requires each string to be segmented into the target relation schema
Input strings are prone to errors (e.g., data warehousing, data exchange)

Current Approaches
Rule-based
– Hard to develop, maintain, and deploy comprehensive sets of rules for every domain
Supervised
– e.g., [BSD01]
– Hard to obtain comprehensive datasets needed to train robust models

Our Approach
Exploit large reference tables
– Learn domain-specific dictionaries
– Learn structure within attribute values
Challenges
– Order of attribute concatenation in future test input is unknown
– Robustness to errors in test input after training on clean and standardized reference tables

Problem Statement
Target schema: R[A_1, …, A_n]
For a given string s (a sequence of tokens):
– segment s into substrings s_1, …, s_n at token boundaries
– map s_1, …, s_n to attributes A_{i1}, …, A_{in}
– maximize P(A_{i1}|s_1) * … * P(A_{in}|s_n) among all possible segmentations of s
The product combination function handles an arbitrary concatenation order of attribute values
P(A_i|x), the probability that a string x belongs to A_i, is estimated by an Attribute Recognition Model ARM_i
ARMs are learned from a reference relation r[A_1, …, A_n]
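To make the objective concrete, here is a minimal Python sketch (not the authors' implementation) that scores one candidate segmentation under the product combination function; `arm_prob` is a hypothetical stand-in for the per-attribute ARM estimate P(A_i | x).

```python
import math

def arm_prob(attribute, tokens):
    """Hypothetical stand-in for the ARM estimate P(attribute | tokens).
    In the paper this comes from a per-attribute HMM learned over the
    reference relation; here it is only a placeholder."""
    raise NotImplementedError

def score_segmentation(segments, attribute_order):
    """Score one candidate segmentation: the product of per-attribute
    ARM probabilities, computed in log space for numerical stability.

    segments        -- list of token lists, one per attribute
    attribute_order -- list of attribute names, same length as segments
    """
    log_score = 0.0
    for attr, tokens in zip(attribute_order, segments):
        p = arm_prob(attr, tokens)
        if p == 0.0:
            return float("-inf")
        log_score += math.log(p)
    return log_score
```

The segmentation returned by the system is the one maximizing this score over all candidate splits and attribute mappings (a dynamic-programming search is sketched later, under attribute order determination).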

Segmentation Architecture

ARMs: Design Goals
– Accurately distinguish an attribute's values from those of other attributes
– Generalize to unobserved/new attribute values
– Robust to input errors
– Able to learn over large reference tables

ARM: Instantiation of HMMs
Purpose: estimate the probability that a token sequence belongs to an attribute
An ARM is an instantiation of an HMM (a sequential model)
Acceptance probability: the product of emission and transition probabilities
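As a reminder of how an HMM scores a token sequence, the sketch below computes the probability of one state path as a product of transition and emission probabilities; the plain dictionaries are illustrative, not the paper's actual model.

```python
def path_probability(tokens, states, start_prob, trans_prob, emit_prob):
    """Probability that an HMM accepts `tokens` along the state path `states`
    (same length as `tokens`): the product of the start probability, the
    transition probabilities between consecutive states, and the emission
    probability of each token in its state.  The probability tables are
    ordinary nested dicts used only for illustration."""
    prob = start_prob.get(states[0], 0.0) * emit_prob.get(states[0], {}).get(tokens[0], 0.0)
    for prev, cur, tok in zip(states, states[1:], tokens[1:]):
        prob *= trans_prob.get(prev, {}).get(cur, 0.0)   # transition prev -> cur
        prob *= emit_prob.get(cur, {}).get(tok, 0.0)     # emission of tok in cur
    return prob
```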

Instantiating HMMs
An instantiation has to define
– Topology: states & transitions
– Emission & transition probabilities
Current automatic approaches search for a topology among a pre-defined class of topologies using cross-validation [FC00, BSD01]
– Expensive
– The number of states in the ARM is kept small to keep the search space tractable

Intuition behind ARM Design
Street address examples: [nw 57th St], [Redmond Woodinville Rd]
Album name examples: [The best of eagles], [The fury of aquabats], [Colors Soundtrack]
– Large dictionaries to exploit (e.g., aquabats, soundtrack, st, …)
– Begin and end tokens are very important for distinguishing the values of an attribute (nw, st, the, …)
– Patterns on tokens can be learned (e.g., 57th generalizes to *th)
– Robustness to input errors is needed: [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]

Large Number of States
Associate one state per token: each state emits only a single base token
– More accurate transition probabilities
Model sizes for many large reference tables are still within a few megabytes
– Not a problem with current main-memory sizes
Prune the number of states (e.g., remove low-frequency tokens) to limit the ARM size
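A minimal sketch of the "one state per token" idea: count token frequencies in one reference-table column and keep only tokens above a frequency threshold, so the ARM's state set (and model size) stays bounded. The threshold and data structures are assumptions, not the paper's exact pruning rule.

```python
from collections import Counter

def build_token_states(column_values, min_freq=2):
    """Associate one candidate state with each distinct token observed in a
    reference-table column, pruning low-frequency tokens to bound model size.
    Returns the surviving tokens with their (unsmoothed) relative frequencies."""
    counts = Counter(tok for value in column_values for tok in value.lower().split())
    kept = {tok: c for tok, c in counts.items() if c >= min_freq}
    total = sum(kept.values()) or 1
    return {tok: c / total for tok, c in kept.items()}

# Example: candidate states for a (tiny) street-address column
states = build_token_states(["nw 57th st", "redmond woodinville rd", "main st"], min_freq=1)
```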

BMT Topology: Relax Positional Specificity
A single state per distinct symbol within a category (Begin, Middle, Trailing); the emission probability of a symbol is the same anywhere within a category
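A sketch of the BMT idea: each attribute value contributes its first token to the Begin category, its last token to Trailing, and everything else to Middle, and emission counts are gathered per category so a symbol is scored the same anywhere within that category. The handling of one-token values and the category names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def learn_bmt_emissions(column_values):
    """Gather per-category emission counts for a BMT topology:
    BEGIN gets the first token of each value, TRAILING the last,
    MIDDLE everything in between, so positional specificity is
    relaxed within each category (assumed split for one-token values)."""
    counts = defaultdict(Counter)
    for value in column_values:
        toks = value.lower().split()
        if not toks:
            continue
        counts["BEGIN"][toks[0]] += 1
        if len(toks) > 1:
            counts["TRAILING"][toks[-1]] += 1
        for tok in toks[1:-1]:
            counts["MIDDLE"][tok] += 1
    return counts
```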

Feature Hierarchy: Relax Token Specificity [BSD01]
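The feature hierarchy of [BSD01] lets a token back off to progressively coarser classes when the literal token was not observed during training. The sketch below uses an assumed three-level hierarchy (literal token, digit-masked pattern, coarse class) purely to illustrate the relaxation; it is not the exact hierarchy of [BSD01].

```python
import re

def feature_hierarchy(token):
    """Return progressively less specific representations of a token,
    e.g. '57th' -> ['57th', '##th', 'ALNUM'].  The concrete levels are
    an illustrative assumption."""
    literal = token.lower()
    masked = re.sub(r"\d", "#", literal)   # mask digits: 57th -> ##th
    if literal.isdigit():
        coarse = "NUM"
    elif literal.isalpha():
        coarse = "ALPHA"
    else:
        coarse = "ALNUM"
    return [literal, masked, coarse]

def emission_with_backoff(token, emit_prob):
    """Look the token up at each hierarchy level and use the first level
    with an observed probability (a simple back-off, not smoothing)."""
    for level in feature_hierarchy(token):
        if level in emit_prob:
            return emit_prob[level]
    return 0.0
```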

Example ARM for Address

Robustness Operations: Relax Sequential Specificity
Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors
Common types of errors [HS98]:
– Token deletions
– Token insertions
– Missing values
Intuition: simulate the effects of such erroneous values over each ARM

Robustness Operations
Simulating the effect of token insertions: tokens and the corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state
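A sketch of the insertion-robustness operation described above: emission mass for tokens seen in the BEGIN category is copied into the MIDDLE category (transition probabilities would be treated analogously), so an extra token inserted before the true begin token does not zero out the value's acceptance probability. The copy weight is an assumption.

```python
def add_insertion_robustness(emit_prob, copy_weight=0.1):
    """Simulate token insertions: copy BEGIN-state emissions into the MIDDLE
    state with a small weight, then renormalize MIDDLE so it remains a
    distribution.  `copy_weight` is an illustrative parameter."""
    begin, middle = emit_prob["BEGIN"], emit_prob["MIDDLE"]
    for tok, p in begin.items():
        middle[tok] = middle.get(tok, 0.0) + copy_weight * p
    total = sum(middle.values())
    if total:
        for tok in middle:
            middle[tok] /= total
    return emit_prob
```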

Transition Probabilities
Transitions B → M, B → T, M → M, and M → T are allowed
Learned from examples in the reference table
Transition probabilities are also weighted by their ability to distinguish an attribute
– A transition "*" → "*" that is common across many attributes gets a low weight
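One way to realize this weighting (the exact formula is an assumption, not taken from the paper) is an IDF-style factor: a transition that occurs in many attributes' ARMs receives a low weight, while one specific to a single attribute receives a high weight.

```python
import math

def transition_weights(transitions_per_attribute):
    """IDF-style weights for transitions, mapping (from_state, to_state) -> weight.
    `transitions_per_attribute` maps attribute name -> set of observed state
    pairs.  A pair seen in many attributes, such as ('*', '*'), gets a low
    weight; a pair unique to one attribute gets a high weight (assumed formula)."""
    n_attrs = len(transitions_per_attribute)
    all_pairs = set().union(*transitions_per_attribute.values())
    weights = {}
    for pair in all_pairs:
        df = sum(1 for pairs in transitions_per_attribute.values() if pair in pairs)
        weights[pair] = math.log(1 + n_attrs / df)
    return weights
```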

Summary of ARM Instantiation
– BMT topology
– Token hierarchy to generalize observed patterns
– Robustness operations on HMMs to address input errors
– One state per token in the reference table to exploit large dictionaries

Attribute Order Determination
If the attribute order is known:
– A dynamic programming algorithm can segment the string [Rabiner89]
If the attribute order is unknown:
– Ask the user to provide the attribute order, or
– Discover the attribute order
Naïve, expensive strategy: evaluate all concatenation orders and segmentations for each input string
Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
– Several datasets on the web satisfy this assumption
– Allows us to efficiently determine the attribute order over a batch of tuples and then segment the input strings using dynamic programming, as sketched below
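Given a fixed attribute order (supplied by the user or determined over a batch), the best segmentation can be found by dynamic programming over token boundaries. The sketch below again uses the hypothetical `arm_prob` estimator; for simplicity it assigns at least one token to every attribute, so missing values are not handled here.

```python
import math

def segment(tokens, attribute_order, arm_prob):
    """Dynamic program over token boundaries: best[i][j] is the best log-score
    of assigning the first j tokens to the first i attributes.  Returns one
    token list per attribute maximizing the product of ARM probabilities, or
    None if no valid segmentation exists.  `arm_prob(attr, toks)` is assumed."""
    n, m = len(tokens), len(attribute_order)
    NEG = float("-inf")
    best = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i, attr in enumerate(attribute_order, start=1):
        # leave at least one token for each remaining attribute
        for j in range(i, n - (m - i) + 1):
            for k in range(i - 1, j):            # previous boundary
                if best[i - 1][k] == NEG:
                    continue
                p = arm_prob(attr, tokens[k:j])
                if p <= 0.0:
                    continue
                cand = best[i - 1][k] + math.log(p)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, k
    if best[m][n] == NEG:
        return None
    # Recover the segment boundaries by walking the back-pointers
    cuts, j = [], n
    for i in range(m, 0, -1):
        cuts.append((back[i][j], j))
        j = back[i][j]
    cuts.reverse()
    return [tokens[a:b] for a, b in cuts]
```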

Segmentation Algorithm (runtime)

Experimental Evaluation
Reference relations from several domains:
– Addresses: 1,000,000 tuples [Name, #1, #2, Street Address, City, State, Zip]
– Media: 280,000 tuples [ArtistName, AlbumName, TrackName]
– Bibliography: 100,000 tuples [Title, Author, Journal, Volume, Month, Year]
Compare CRAM (our system) with DataMold [BSD01]

Test Datasets
Naturally erroneous datasets: unformatted input strings seen in operational databases
– Media
– Customer addresses
Controlled error injection:
– Clean reference table tuples → [inject errors] → concatenate to generate input strings
– Evaluate whether a segmentation algorithm recovered the original tuple
Accuracy measure: % of attribute values correctly recognized
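A sketch of the controlled error-injection setup: starting from a clean reference tuple, randomly delete a token, insert a spurious token, or drop an attribute value entirely, then concatenate the attribute values into one unsegmented input string. The error rates and the spurious-token vocabulary are assumptions, not the settings used in the paper.

```python
import random

def inject_errors(tuple_values, p_delete=0.1, p_insert=0.1, p_missing=0.05,
                  noise_vocab=("misc", "attn", "x"), rng=random):
    """Inject token deletions, token insertions, and missing attribute values
    into a clean reference tuple, then concatenate the values into one
    unsegmented input string.  Error rates and noise tokens are illustrative."""
    noisy_values = []
    for value in tuple_values:
        if rng.random() < p_missing:
            continue                                   # missing value: drop the whole attribute
        toks = value.split()
        toks = [t for t in toks if rng.random() >= p_delete or len(toks) == 1]
        if rng.random() < p_insert:
            toks.insert(rng.randrange(len(toks) + 1), rng.choice(noise_vocab))
        noisy_values.append(" ".join(toks))
    return " ".join(noisy_values)

# Example: generate one noisy input string from a clean address tuple
print(inject_errors(["john smith", "nw 57th st", "seattle", "wa", "98101"]))
```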

Overall Accuracy (results charts for the Addresses and DBLP datasets)

Topology & Robustness Operations (results chart for the Addresses dataset)

Training on Hypothetical Error Models

Exploiting Dictionaries (chart: accuracy vs. reference table size)

Conclusions
Reference tables can be leveraged for segmentation
Combining ARMs based on independence allows segmenting input strings with an unknown attribute order
ARM models learned over clean reference relations can accurately segment erroneous input strings
– BMT topology
– Robustness operations
– Exploiting large dictionaries

Model Sizes & Pruning (charts: accuracy, number of states & transitions, model size in MB)

Order Determination Accuracy

Topology (results chart for the Media dataset)

Specificities of HMM Models
A model's "specificity" restricts the token sequences it accepts
Positional specificity
– A number ending in 'th' or 'st' can only be the 2nd token in an address value
Token specificity
– The last state only accepts "st, rd, wy, blvd"
Sequential specificity
– "st, rd, wy, blvd" have to follow a number ending in 'st' or 'th'

Robustness Operations (figure: token insertion, token deletion, missing values)