Slide 1: Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research). KDD'04.
Presented by Shui-Lung Chuang, Oct 27, 2004.
Slide 2: Text Segmentation
Task: map a (short) text string to N attribute values.
Example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD" → [Authors, Title, Conference, Year], with Year = null.
Conventional approaches:
- Rule-based: a human creates the rules.
- Supervised model-based: a human labels training data.
Slide 3: The Approach
Utilize existing (large, clean) reference data, e.g., DBLP papers, US addresses, ...

Author                        | Title                                      | Conference | Year
Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for ...  | SIGKDD     | 2004
... Lotlikar, S. Roy          | A Hierarchical Document Clustering ...     | WWW        |
... Cimiano, S. Handschuh     | Towards the Self-Annotating Web            | ...        | 2003
...

Each attribute column trains an Attribute Recognition Model (ARM1 ... ARM4); for a sub-string s, ARMi gives the probability that s is generated by attribute i.
Slide 4: Segmentation Model
Given an input string, find the segmentation s1 s2 s3 s4 in which each sub-string si is scored by the corresponding Attribute Recognition Model (ARM1 ... ARM4), where ARMi gives the probability that si is generated by attribute i.
Running example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD".
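The search over segmentations can be sketched as a small dynamic program. This is a toy illustration, not the paper's implementation: `arm_scores` is a hypothetical list of scoring functions standing in for the trained ARMs, and each attribute is forced to take at least one token (the paper also handles missing attributes).

```python
from functools import lru_cache
import math

def best_segmentation(tokens, arm_scores):
    """Split `tokens` into len(arm_scores) consecutive chunks s1..sk,
    maximizing the product of per-attribute generation probabilities.

    arm_scores[i](chunk) returns the probability that attribute i's
    ARM generates `chunk` (hypothetical interface).
    Returns (log_probability, [chunk, ...]).
    """
    k = len(arm_scores)

    @lru_cache(maxsize=None)
    def solve(start, attr):
        # Base case: all attributes assigned; valid only if input consumed.
        if attr == k:
            return (0.0, ()) if start == len(tokens) else (-math.inf, ())
        best = (-math.inf, ())
        for end in range(start + 1, len(tokens) + 1):
            p = arm_scores[attr](tuple(tokens[start:end]))
            if p <= 0:
                continue  # this chunk cannot be attribute `attr`
            rest, chunks = solve(end, attr + 1)
            cand = math.log(p) + rest
            if cand > best[0]:
                best = (cand, (tuple(tokens[start:end]),) + chunks)
        return best

    score, chunks = solve(0, 0)
    return score, [list(c) for c in chunks]
```

With three toy scorers (authors, conference, year), the program recovers the intended split of a citation-like string.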
Slide 5: Challenges
- Robust to input errors: the reference data may be clean, but the input may contain missing values, spelling errors, extraneous or unknown tokens, etc.
  → Engineer features; adjust the model topology.
- Adaptive to varied attribute orders: the reference data carries no information about the attribute order in the input.
  → Determine the attribute order from early input strings.
- Efficient in training: the reference data is large.
  → Fix the model topology; avoid advanced learning procedures (e.g., EM).
Slide 6: Feature Hierarchy
High-level features considered: token classes (words, numbers, mixed alphanumerics, delimiters) plus token length.
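These features can be computed by a small classifier. The sketch below is illustrative only: the class names and the length-tagging format are assumptions, not the paper's exact feature set.

```python
import re

def token_class(token: str) -> str:
    """Map a token to a coarse feature class tagged with its length.

    Classes follow the slide (words, numbers, mixed, delimiters);
    the labels and the "class:length" encoding are our own choice.
    """
    if re.fullmatch(r"[a-z]+", token, re.IGNORECASE):
        cls = "word"
    elif re.fullmatch(r"[0-9]+", token):
        cls = "number"
    elif re.fullmatch(r"[a-z0-9]+", token, re.IGNORECASE):
        cls = "mixed"
    else:
        cls = "delimiter"
    return f"{cls}:{len(token)}"

# e.g. token_class("57th") -> "mixed:4", token_class("2004") -> "number:4"
```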
Slide 7: Attribute Recognition Model
Example address-attribute values: "57th", "n sixth st", "s fifth st", "n goodwin ave".
Slide 8: Model Training
Training values (street attribute): "57th", "n sixth st", "1010 s fifth st", "201 n goodwin ave".
Topology (Begin / Middle / Trailing states):
- Transitions: B → {M, T, END}; M → {M, T, END}; T → {T, END}.
- Emissions: exact tokens, p(x|e) = 1 if x = e, else 0.
Each emitted token sits in the feature hierarchy: the exact token "57th" generalizes to length-bounded classes such as [a-z0-9]{1,4} and [a-z0-9]{1,5}, up to the unbounded Mixed class.
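A toy version of this training step can be written directly from a reference column. This is a simplification made for illustration: the first token of each value feeds the B state, the last the T state, and the rest the M state, with exact-token counts as emissions (the paper's ARMs are richer than this).

```python
from collections import Counter

def train_arm(column):
    """Train a toy Begin/Middle/Trailing ARM from one reference column.

    Each attribute value is whitespace-tokenized; emissions are exact
    token counts per state. Single-token values feed only the B state
    (an arbitrary simplifying choice).
    """
    states = {"B": Counter(), "M": Counter(), "T": Counter()}
    for value in column:
        toks = value.split()
        if not toks:
            continue
        states["B"][toks[0]] += 1
        if len(toks) > 1:
            states["T"][toks[-1]] += 1
        for t in toks[1:-1]:
            states["M"][t] += 1
    return states
```

Trained on the slide's four street values, the T state learns that street values tend to end in "st" or "ave".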
Slide 9: Sequential Specificity Relaxation
Relax the trained model, in order of increasing generality, to accept:
- Token insertion, e.g., an extraneous token inside "57th n sixth st".
- Token deletion, e.g., "n sixth" (with "st" dropped).
- Missing attribute value, e.g., <null>.
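Relaxation can also apply to emissions, backing a token off along the feature hierarchy from slide 8. The levels below are a minimal sketch (exact token, then a length-bounded class, then the unbounded class); the paper's hierarchy has more levels.

```python
import re

def relax(token: str):
    """Yield feature levels for a token, most specific first:
    the exact token, a length-bounded character class, and the
    unbounded class (illustrative levels only).
    """
    yield token                               # e.g. "57th"
    if re.fullmatch(r"[a-z0-9]+", token):
        yield f"[a-z0-9]{{1,{len(token)}}}"   # e.g. "[a-z0-9]{1,4}"
        yield "[a-z0-9]+"                     # any alphanumeric run
```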
Slide 10: Determining Attribute Value Order
Observation: the attribute order is usually preserved within the same batch of input strings.
Slide 11: Determining Attribute Value Order (cont.)
Example: s = "walmart s. randall ave madison wi."
Per-position scores v(s, Ai), one vector per attribute (blanks are missing entries):
- city attribute:   [0.05,  , 0.02, 0.1,  , 0.8,  ,  ]
- street attribute: [0.1,  , 0.8, 0.7, 0.9,  ,  ,  ]
From these, derive a partial order over attributes, then search all permutations for the best total order.
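The permutation search on this slide can be sketched as an exhaustive scan. The scoring scheme is an assumption for illustration: `pos_scores[a][p]` stands in for a v(s, Ai)-style score of attribute `a` occupying position bucket `p`, with missing entries taken as 0.

```python
from itertools import permutations

def best_attribute_order(pos_scores):
    """Return the attribute order maximizing the summed position scores.

    pos_scores[a][p]: hypothetical score of attribute a at position p.
    Exhaustive over all n! permutations, as the slide suggests; fine
    for the handful of attributes in a typical schema.
    """
    n = len(pos_scores)
    best_order, best_score = None, float("-inf")
    for order in permutations(range(n)):
        # order[p] = which attribute occupies position p
        score = sum(pos_scores[attr][p] for p, attr in enumerate(order))
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score
```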
Slide 12: Experiment Data
Reference relations:
- Addresses: 1,000,000 tuples. Schema: [Name, Number1, Number2, Address, City, State, Zip].
- Media: 280,000 music tracks. Schema: [ArtistName, AlbumName, TrackName].
- Bibliography: 100,000 records from DBLP. Schema: [Title, Author, Journal, Volume, Month, Year].
Test datasets (naturally concatenated):
- Addresses: from the RISE repository.
- Media: from Microsoft.
- Papers: the 100 most cited papers from CiteSeer.
Slide 13: Experiment Data (cont.)
Test datasets (controlled):
- Randomly chosen attribute order.
- Error injection.
Slide 14: Experiment Results
Slide 15: Experiment Results: 1-Pos vs. BMT vs. BMT-robust
Slide 16: Comments
- The idea of using reference tables is good.
- The approach is well engineered to deal with robustness and efficiency.
- The experiments are thorough.
- However, the approach is still somewhat ad hoc, and every component seems replaceable.