Mining Reference Tables for Automatic Text Segmentation E. Agichtein V


Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research)
KDD'04
Shui-Lung Chuang, Oct 27, 2004

Text Segmentation
A (short) text string → N attributes
Example: Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD → [ Authors, Title, Conference, Year ], with Null for a missing value
Conventional approaches
  Rule-based: a human creates rules
  Supervised model-based: a human labels data

The Approach
Utilize the existing (large, clean) reference data
E.g., DBLP → Papers, US Addresses, …

Author | Title | Conference | Year
Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for | SIGKDD | 2004
Lotlikar, S. Roy | A Hierarchical Document Clustering | WWW |
Cimiano, S. Handschuh | Towards the Self-Annotating Web | … | 2003
…… | ……. | …. |

ARM1 ARM2 ARM3 ARM3
ARM: Attribute Recognition Model
s: a sub-string → prob. s is generated

Segmentation Model
Input: Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD
To find: s1 s2 s3 s4
ARM1 ARM2 ARM3 ARM3
ARM: Attribute Recognition Model
s: a sub-string → prob. s is generated
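A minimal sketch of the search this slide describes, assuming the goal is to split the input into contiguous sub-strings, one per ARM, so that the product of the ARM probabilities is largest. The attribute order is taken as given here. The function names, the toy vocabulary-based ARMs, the constants 0.9 / 0.001 / 0.01, and the address example are all hypothetical; the slides do not specify them.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple

# An ARM is modeled as a function: sub-string (tuple of tokens) -> probability.
Arm = Callable[[Tuple[str, ...]], float]


def segment(tokens: Sequence[str], arms: Sequence[Arm]) -> Tuple[float, List[Tuple[str, ...]]]:
    """Split tokens into len(arms) contiguous (possibly empty) sub-strings
    maximizing the product of arms[i](s_i). Dynamic programming over
    (start position, index of next ARM)."""
    toks = tuple(tokens)
    n = len(toks)

    @lru_cache(maxsize=None)
    def best(start: int, k: int):
        if k == len(arms):
            # All ARMs used: valid only if every token was consumed.
            return (1.0, ()) if start == n else (0.0, ())
        top = (0.0, ())
        for end in range(start, n + 1):
            p = arms[k](toks[start:end])
            if p == 0.0:
                continue
            rest_p, rest_segs = best(end, k + 1)
            if p * rest_p > top[0]:
                top = (p * rest_p, (toks[start:end],) + rest_segs)
        return top

    score, segs = best(0, 0)
    return score, list(segs)


def toy_arm(vocab: set, null_prob: float = 0.01) -> Arm:
    """Hypothetical stand-in for a trained ARM: favors tokens seen in the
    attribute's reference column and gives a small probability to a null."""
    def arm(seg: Tuple[str, ...]) -> float:
        if not seg:
            return null_prob
        p = 1.0
        for t in seg:
            p *= 0.9 if t in vocab else 0.001
        return p
    return arm


street_arm = toy_arm({"201", "n", "goodwin", "ave"})
city_arm = toy_arm({"urbana"})
state_arm = toy_arm({"il"})
score, segs = segment("201 n goodwin ave urbana il".split(),
                      [street_arm, city_arm, state_arm])
```

Because each ARM scores its sub-string independently, the best split of a suffix can be reused across prefixes, which is what makes the dynamic program possible.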

Challenges
Robust to input error
  The ref. data may be clean, but the input may contain various errors: missing values, spelling errors, extraneous or unknown tokens, etc.
  → Engineer features; adjust model topology
Adaptive to varied attribute orders
  Reference data don't contain info about the attribute order in the input
  → Determine attribute order from early input strings
Efficient in training
  Reference data is large
  → Fix model topology; don't use advanced learning (e.g., EM)

Feature Hierarchy
High-level features considered: token classes (words, numbers, mixed, delimiters) + token length
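As a rough illustration of how a token could be generalized along such a hierarchy, the sketch below maps a token to a chain of increasingly general features, using the pattern notation that appears on the Model Training slide (e.g. [a-z0-9]{1,4}). The function names, the lowercasing step, and the length cap `max_len` are assumptions made for this example, not details from the slides.

```python
import re

# Token classes named on the slide; the character sets are an assumption.
CLASSES = [
    ("numbers", "[0-9]", re.compile(r"[0-9]+")),
    ("words", "[a-z]", re.compile(r"[a-z]+")),
    ("mixed", "[a-z0-9]", re.compile(r"[a-z0-9]+")),
]


def token_class(token):
    """Return (class name, character set); anything else is a delimiter."""
    t = token.lower()
    for name, charset, pattern in CLASSES:
        if pattern.fullmatch(t):
            return name, charset
    return "delimiters", None


def generalizations(token, max_len=5):
    """Chain from the token itself up to its token class,
    passing through length-bounded patterns of that class."""
    t = token.lower()
    name, charset = token_class(t)
    chain = [t]
    if charset is not None:
        for n in range(len(t), max_len + 1):
            chain.append(f"{charset}{{1,{n}}}")   # bounded length
        chain.append(f"{charset}{{1,-}}")         # unbounded length
    chain.append(name)
    return chain


chain = generalizations("57th")
```

Each step up the chain matches more tokens, so a model can fall back to a more general feature when the exact token was never seen in the reference data.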

Attribute Recognition Model
Example attribute values:
  57th n sixth st
  1010 s fifth st
  201 n goodwin ave

Model Training
Example attribute values:
  57th n sixth st
  1010 s fifth st
  201 n goodwin ave
Transition:
  B → { M, T, END }
  M → { M, T, END }
  T → { T, END }
Emission: p(x|e) = (x = e) ? 1 : 0
Feature hierarchy for a token: 57th → [a-z0-9]{1,4} → [a-z0-9]{1,5} → … → [a-z0-9]{1,-} → Mixed
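A simplified sketch of training on the three example values, under these assumptions: the first token of a value is labeled B, the last T, and the ones in between M; each state is a (label, token) pair; and emission is the exact-match indicator from the slide, so a string follows exactly one state path. The function names and the START state are hypothetical, and no feature hierarchy or relaxation is applied here.

```python
from collections import Counter, defaultdict

# Transitions allowed by the slide.
ALLOWED = {"B": {"M", "T", "END"}, "M": {"M", "T", "END"}, "T": {"T", "END"}}


def label_positions(tokens):
    """Assumed labeling: first token B, last token T, the rest M."""
    if len(tokens) == 1:
        return ["B"]
    return ["B"] + ["M"] * (len(tokens) - 2) + ["T"]


def state_path(value):
    tokens = value.split()
    return list(zip(label_positions(tokens), tokens)) + [("END", None)]


def train(values):
    """Estimate transition probabilities between (label, token) states
    by counting over the reference column."""
    counts = defaultdict(Counter)
    for value in values:
        prev = ("START", None)
        for state in state_path(value):
            if prev[0] in ALLOWED:
                assert state[0] in ALLOWED[prev[0]]  # topology check
            counts[prev][state] += 1
            prev = state
    return {s: {t: c / sum(nxt.values()) for t, c in nxt.items()}
            for s, nxt in counts.items()}


def probability(model, value):
    """Probability that the model generates value; with indicator
    emissions this is the product of transitions along its one path."""
    if not value.split():
        return 0.0
    prob, prev = 1.0, ("START", None)
    for state in state_path(value):
        prob *= model.get(prev, {}).get(state, 0.0)
        prev = state
    return prob


model = train(["57th n sixth st", "1010 s fifth st", "201 n goodwin ave"])
p_seen = probability(model, "1010 s fifth st")
p_partial = probability(model, "n sixth")
```

In this strict form a value with a missing leading token such as "n sixth" gets probability zero, which is the kind of brittleness the relaxation on the next slide addresses.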

Sequential Specificity Relaxation
Token insertion, e.g., 57th 57th n sixth st
Token deletion, e.g., n sixth
Missing attribute value, e.g., <null>

Determining Attribute Value Order
Attribute order is usually preserved in the same batch of input strings

Determining Attribute Value Order
s = walmart 20205 s. randall ave madison 53715 wi.
pos: 1 2 3 4 5 6 7 8
v(s,Ai):
  [ 0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07 ] → city attr.
  [ 0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1 ] → street attr.
(partial order) → (total order)
Search all permutations for the best total order
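One way to turn the per-position vectors v(s,Ai) into an order can be sketched as follows. The precedence score used here (the normalized mass of attribute A falling at an earlier position than attribute B) and the product over attribute pairs are assumptions chosen for illustration; the slides only say that a partial order is derived and that all permutations are searched for the best total order. The function names are hypothetical.

```python
from itertools import permutations


def precedence(va, vb):
    """Assumed score that attribute A occurs before attribute B:
    normalized sum of va[p] * vb[q] over all positions p < q."""
    total = sum(va) * sum(vb)
    if total == 0:
        return 0.0
    mass, before = 0.0, 0.0
    for q in range(len(vb)):
        mass += vb[q] * before   # before = sum of va[0:q]
        before += va[q]
    return mass / total


def best_total_order(vectors):
    """Search all permutations of the attributes and keep the one with
    the highest product of pairwise precedence scores."""
    names = list(vectors)
    best_order, best_score = None, -1.0
    for order in permutations(names):
        score = 1.0
        for i in range(len(order)):
            for j in range(i + 1, len(order)):
                score *= precedence(vectors[order[i]], vectors[order[j]])
        if score > best_score:
            best_order, best_score = order, score
    return best_order


# Vectors from the slide for the string "walmart 20205 s. randall ave madison 53715 wi."
vectors = {
    "city": [0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07],
    "street": [0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1],
}
order = best_total_order(vectors)
```

Enumerating permutations is only practical because the number of attributes in a schema is small; in a real run the vectors would be aggregated over a batch of input strings rather than taken from one.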

Experiment Data
Reference relations
  Addresses: 1,000,000 tuples
    Schema: [ Name, Number1, Number2, Address, City, State, Zip ]
  Media: 280,000 music tracks
    Schema: [ ArtistName, AlbumName, TrackName ]
  Bibliography: 100,000 records from DBLP
    Schema: [ Title, Author, Journal, Volume, Month, Year ]
Test datasets: naturally concatenated test sets
  Addresses: from RISE repository
  Media: from Microsoft
  Papers: 100 most cited papers from Citeseer

Experiment Data (cont.)
Test datasets: controlled test data sets
  Randomly chosen order
  Error injection
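How such a controlled test string might be generated can be sketched as below. The slides do not say which errors were injected or at what rate, so the two error types (token deletion and swapping adjacent characters), the `error_rate` parameter, the function name, and the sample record are all assumptions for illustration.

```python
import random


def make_test_string(record, order, rng, error_rate=0.2):
    """Concatenate a reference record's attribute values in the given
    order, injecting token-level errors with total rate error_rate."""
    out = []
    for attr in order:
        for token in record[attr].split():
            r = rng.random()
            if r < error_rate / 2:
                continue  # token deletion
            if r < error_rate and len(token) > 1:
                # spelling error: swap two adjacent characters
                i = rng.randrange(len(token) - 1)
                token = token[:i] + token[i + 1] + token[i] + token[i + 2:]
            out.append(token)
    return " ".join(out)


# Hypothetical record, loosely based on the address example in the slides.
record = {"Name": "walmart", "Address": "20205 s. randall ave",
          "City": "madison", "State": "wi.", "Zip": "53715"}

rng = random.Random(0)
# One randomly chosen attribute order, to be reused for a whole batch.
order = rng.sample(list(record), len(record))
noisy = make_test_string(record, order, rng, error_rate=0.3)
clean = make_test_string(record, ["Name", "Address", "City", "Zip", "State"],
                         rng, error_rate=0.0)
```

Reusing one sampled order for every string in a batch matches the earlier observation that attribute order is usually preserved within a batch.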

Experiment Results

Experiment Results
1-Pos vs BMT vs BMT-robust

Comments
The idea of using reference tables is good
The approach is well engineered to deal with issues of robustness and efficiency
The experiments are thorough
The approach is still somewhat ad hoc, and every component seems replaceable