Slide 1: Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research). KDD'04.
Presented by Shui-Lung Chuang, Oct 27, 2004.
Slide 2: Text Segmentation
Task: map a (short) text string to N attribute values.
Example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD" → [Authors, Title, Conference, Year], with Year = null.
Conventional approaches:
- Rule-based: a human creates the rules.
- Supervised model-based: a human labels training data.
Slide 3: The Approach
Utilize existing (large, clean) reference data, e.g., DBLP papers, US addresses, ...

Author                        | Title                                      | Conference | Year
Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for ...  | SIGKDD     | 2004
... Lotlikar, S. Roy          | A Hierarchical Document Clustering ...     | WWW        |
... Cimiano, S. Handschuh     | Towards the Self-Annotating Web            | ...        | 2003
...

Each attribute column trains an Attribute Recognition Model (ARM1 ... ARM4); for a sub-string s, ARMi gives the probability that s is generated by attribute i.
Slide 4: Segmentation Model
Given an input string, find the segmentation s1 s2 s3 s4 in which each sub-string si is scored by the corresponding Attribute Recognition Model (ARM1 ... ARM4), where ARMi gives the probability that si is generated by attribute i.
Running example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD".
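The search over segmentations can be sketched as a small dynamic program. This is a toy illustration, not the paper's implementation: `arm_scores` is a hypothetical list of scoring functions standing in for the trained ARMs, and each attribute is forced to take at least one token (the paper also handles missing attributes).

```python
from functools import lru_cache
import math

def best_segmentation(tokens, arm_scores):
    """Split `tokens` into len(arm_scores) consecutive chunks s1..sk,
    maximizing the product of per-attribute generation probabilities.

    arm_scores[i](chunk) returns the probability that attribute i's
    ARM generates `chunk` (hypothetical interface).
    Returns (log_probability, [chunk, ...]).
    """
    k = len(arm_scores)

    @lru_cache(maxsize=None)
    def solve(start, attr):
        # Base case: all attributes assigned; valid only if input consumed.
        if attr == k:
            return (0.0, ()) if start == len(tokens) else (-math.inf, ())
        best = (-math.inf, ())
        for end in range(start + 1, len(tokens) + 1):
            p = arm_scores[attr](tuple(tokens[start:end]))
            if p <= 0:
                continue  # this chunk cannot be attribute `attr`
            rest, chunks = solve(end, attr + 1)
            cand = math.log(p) + rest
            if cand > best[0]:
                best = (cand, (tuple(tokens[start:end]),) + chunks)
        return best

    score, chunks = solve(0, 0)
    return score, [list(c) for c in chunks]
```

With three toy scorers (authors, conference, year), the program recovers the intended split of a citation-like string.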
Slide 5: Challenges
- Robust to input errors: the reference data may be clean, but the input may contain missing values, spelling errors, extraneous or unknown tokens, etc.
  → Engineer features; adjust the model topology.
- Adaptive to varied attribute orders: the reference data carries no information about the attribute order in the input.
  → Determine the attribute order from early input strings.
- Efficient in training: the reference data is large.
  → Fix the model topology; avoid advanced learning procedures (e.g., EM).
Slide 6: Feature Hierarchy
High-level features considered: token classes (words, numbers, mixed alphanumerics, delimiters) plus token length.
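These features can be computed by a small classifier. The sketch below is illustrative only: the class names and the length-tagging format are assumptions, not the paper's exact feature set.

```python
import re

def token_class(token: str) -> str:
    """Map a token to a coarse feature class tagged with its length.

    Classes follow the slide (words, numbers, mixed, delimiters);
    the labels and the "class:length" encoding are our own choice.
    """
    if re.fullmatch(r"[a-z]+", token, re.IGNORECASE):
        cls = "word"
    elif re.fullmatch(r"[0-9]+", token):
        cls = "number"
    elif re.fullmatch(r"[a-z0-9]+", token, re.IGNORECASE):
        cls = "mixed"
    else:
        cls = "delimiter"
    return f"{cls}:{len(token)}"

# e.g. token_class("57th") -> "mixed:4", token_class("2004") -> "number:4"
```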
Slide 7: Attribute Recognition Model
Example address-attribute values: "57th", "n sixth st", "s fifth st", "n goodwin ave".
Slide 8: Model Training
Training values (street attribute): "57th", "n sixth st", "1010 s fifth st", "201 n goodwin ave".
Topology (Begin / Middle / Trailing states):
- Transitions: B → {M, T, END}; M → {M, T, END}; T → {T, END}.
- Emissions: exact tokens, p(x|e) = 1 if x = e, else 0.
Each emitted token sits in the feature hierarchy: the exact token "57th" generalizes to length-bounded classes such as [a-z0-9]{1,4} and [a-z0-9]{1,5}, up to the unbounded Mixed class.
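A toy version of this training step can be written directly from a reference column. This is a simplification made for illustration: the first token of each value feeds the B state, the last the T state, and the rest the M state, with exact-token counts as emissions (the paper's ARMs are richer than this).

```python
from collections import Counter

def train_arm(column):
    """Train a toy Begin/Middle/Trailing ARM from one reference column.

    Each attribute value is whitespace-tokenized; emissions are exact
    token counts per state. Single-token values feed only the B state
    (an arbitrary simplifying choice).
    """
    states = {"B": Counter(), "M": Counter(), "T": Counter()}
    for value in column:
        toks = value.split()
        if not toks:
            continue
        states["B"][toks[0]] += 1
        if len(toks) > 1:
            states["T"][toks[-1]] += 1
        for t in toks[1:-1]:
            states["M"][t] += 1
    return states
```

Trained on the slide's four street values, the T state learns that street values tend to end in "st" or "ave".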
Slide 9: Sequential Specificity Relaxation
Relax the trained model, in order of increasing generality, to accept:
- Token insertion, e.g., an extraneous token inside "57th n sixth st".
- Token deletion, e.g., "n sixth" (with "st" dropped).
- Missing attribute value, e.g., <null>.
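Relaxation can also apply to emissions, backing a token off along the feature hierarchy from slide 8. The levels below are a minimal sketch (exact token, then a length-bounded class, then the unbounded class); the paper's hierarchy has more levels.

```python
import re

def relax(token: str):
    """Yield feature levels for a token, most specific first:
    the exact token, a length-bounded character class, and the
    unbounded class (illustrative levels only).
    """
    yield token                               # e.g. "57th"
    if re.fullmatch(r"[a-z0-9]+", token):
        yield f"[a-z0-9]{{1,{len(token)}}}"   # e.g. "[a-z0-9]{1,4}"
        yield "[a-z0-9]+"                     # any alphanumeric run
```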
Slide 10: Determining Attribute Value Order
Observation: the attribute order is usually preserved within the same batch of input strings.
Slide 11: Determining Attribute Value Order (cont.)
Example: s = "walmart s. randall ave madison wi."
Per-position scores v(s, Ai), one vector per attribute (blanks are missing entries):
- city attribute:   [0.05,  , 0.02, 0.1,  , 0.8,  ,  ]
- street attribute: [0.1,  , 0.8, 0.7, 0.9,  ,  ,  ]
From these, derive a partial order over attributes, then search all permutations for the best total order.
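The permutation search on this slide can be sketched as an exhaustive scan. The scoring scheme is an assumption for illustration: `pos_scores[a][p]` stands in for a v(s, Ai)-style score of attribute `a` occupying position bucket `p`, with missing entries taken as 0.

```python
from itertools import permutations

def best_attribute_order(pos_scores):
    """Return the attribute order maximizing the summed position scores.

    pos_scores[a][p]: hypothetical score of attribute a at position p.
    Exhaustive over all n! permutations, as the slide suggests; fine
    for the handful of attributes in a typical schema.
    """
    n = len(pos_scores)
    best_order, best_score = None, float("-inf")
    for order in permutations(range(n)):
        # order[p] = which attribute occupies position p
        score = sum(pos_scores[attr][p] for p, attr in enumerate(order))
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score
```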
Slide 12: Experiment Data
Reference relations:
- Addresses: 1,000,000 tuples. Schema: [Name, Number1, Number2, Address, City, State, Zip].
- Media: 280,000 music tracks. Schema: [ArtistName, AlbumName, TrackName].
- Bibliography: 100,000 records from DBLP. Schema: [Title, Author, Journal, Volume, Month, Year].
Test datasets (naturally concatenated):
- Addresses: from the RISE repository.
- Media: from Microsoft.
- Papers: the 100 most cited papers from CiteSeer.
Slide 13: Experiment Data (cont.)
Test datasets (controlled):
- Randomly chosen attribute order.
- Error injection.
Slide 14: Experiment Results
Slide 15: Experiment Results: 1-Pos vs. BMT vs. BMT-robust
Slide 16: Comments
- The idea of using reference tables is good.
- The approach is well engineered to deal with robustness and efficiency.
- The experiments are thorough.
- However, the approach is still somewhat ad hoc, and every component seems replaceable.