Mining Reference Tables for Automatic Text Segmentation, E. Agichtein, V. Ganti

Presentation transcript:

1 Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research), KDD'04
Presented by Shui-Lung Chuang, Oct 27, 2004

2 Text Segmentation
Task: segment a (short) text string into N attribute values.
Running example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD" segmented into [Authors, Title, Conference, Year], with the missing Year mapped to Null.
Conventional approaches:
Rule-based: a human creates segmentation rules
Supervised model-based: a human labels training data

3 The Approach
Utilize existing (large, clean) reference data, e.g., DBLP for papers, US address databases, ...

Example reference table (DBLP):
  Author                         | Title                                      | Conference | Year
  Mark Steyvers, Padhraic Smyth  | Probabilistic Author-Topic Models for ... | SIGKDD     | 2004
  Lotlikar, S. Roy               | A Hierarchical Document Clustering ...    | WWW        | 2003
  Cimiano, S. Handschuh          | Towards the Self-Annotating Web           | ...        | ...

From each attribute column, train an Attribute Recognition Model (ARM): ARM_i gives the probability that a sub-string s is generated by attribute i.

4 Segmentation Model
Given an input string, find sub-strings s1, s2, s3, s4 and match each to an ARM (ARM1, ..., ARM4) so that the combined probability of each sub-string being generated by its ARM is maximized. (ARM: Attribute Recognition Model; ARM_i(s) is the probability that sub-string s is generated by attribute i.)
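The slide's segmentation step can be sketched as a small dynamic program over split points. This is a minimal illustration, not the paper's implementation: the toy `make_arm` scorers below are hypothetical stand-ins for real ARMs, and the 0.01 floor score is an assumed smoothing constant.

```python
from functools import lru_cache

def segment(tokens, arms):
    """Split `tokens` into len(arms) consecutive sub-strings s1..sN,
    maximizing the product of ARM scores via dynamic programming.
    `arms` is a list of functions: arms[i](sub_tokens) -> probability."""
    n, k = len(tokens), len(arms)

    @lru_cache(maxsize=None)
    def best(start, attr):
        # All tokens must be consumed exactly when all attributes are used.
        if attr == k:
            return (1.0, []) if start == n else (0.0, [])
        best_score, best_split = 0.0, []
        # Leave at least one token for each remaining attribute.
        for end in range(start + 1, n - (k - attr - 1) + 1):
            sub = tuple(tokens[start:end])
            rest_score, rest_split = best(end, attr + 1)
            score = arms[attr](sub) * rest_score
            if score > best_score:
                best_score, best_split = score, [sub] + rest_split
        return best_score, best_split

    return best(0, 0)

# Hypothetical toy ARMs: high score if every token is in the attribute's vocabulary.
def make_arm(vocab):
    return lambda sub: 1.0 if all(t in vocab for t in sub) else 0.01

arms = [make_arm({"mark", "steyvers"}),          # Authors
        make_arm({"probabilistic", "models"}),   # Title
        make_arm({"sigkdd"}),                    # Conference
        make_arm({"2004"})]                      # Year
score, split = segment(("mark", "steyvers", "probabilistic", "models",
                        "sigkdd", "2004"), arms)
# split -> [("mark", "steyvers"), ("probabilistic", "models"), ("sigkdd",), ("2004",)]
```

The exponential number of segmentations collapses to O(N * tokens^2) subproblems because the best split of a suffix does not depend on how the prefix was segmented.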

5 Challenges
Robust to input errors: the reference data may be clean, but the input may contain missing values, spelling errors, extraneous or unknown tokens, etc. Design response: engineer robust features; adjust the model topology.
Adaptive to varied attribute orders: reference data carries no information about the attribute order in the input. Design response: determine the attribute order from the first few input strings.
Efficient in training: the reference data is large. Design response: fix the model topology; avoid expensive learning procedures (e.g., EM).

6 Feature Hierarchy
High-level features considered: token classes (words, numbers, mixed, delimiters) combined with token length, arranged in a hierarchy from specific tokens up to general classes.

7 Attribute Recognition Model
[Figure: an ARM built from street-attribute examples such as "57th", "n sixth st", "s fifth st", "n goodwin ave"]

8 Model Training
Training values (street attribute): "57th", "n sixth st", "1010 s fifth st", "201 n goodwin ave"
Topology (fixed): Begin (B), Middle (M), and Trailing (T) token positions
Transitions: B -> {M, T, END}; M -> {M, T, END}; T -> {T, END}
Emission: p(x|e) = 1 if x = e, else 0
Each state's emissions are generalized along the feature hierarchy, e.g., from the specific token "57th" up through length-bounded classes such as [a-z0-9]{1,4} and [a-z0-9]{1,5} to the Mixed token class.
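Because the B/M/T topology is fixed and emissions are deterministic at the most specific level, training reduces to counting, with no EM. A minimal sketch under those assumptions (the `train_arm` helper and its output format are illustrative, not the paper's API):

```python
from collections import Counter

def train_arm(values):
    """Train a simple Begin/Middle/Trailing ARM from one attribute column.
    The first token of each value trains B, the last trains T, the rest
    train M. Emissions are relative frequencies, so one counting pass suffices."""
    emit = {"B": Counter(), "M": Counter(), "T": Counter()}
    trans = Counter()
    for value in values:
        toks = value.split()
        states = (["B"] + ["M"] * max(0, len(toks) - 2) +
                  (["T"] if len(toks) > 1 else []))
        for tok, st in zip(toks, states):
            emit[st][tok] += 1
        for a, b in zip(states, states[1:] + ["END"]):
            trans[(a, b)] += 1
    # Normalize emission counts to probabilities per state.
    probs = {st: {t: c / sum(cnt.values()) for t, c in cnt.items()}
             for st, cnt in emit.items() if cnt}
    return probs, trans

arm_probs, arm_trans = train_arm(["57th", "n sixth st",
                                  "1010 s fifth st", "201 n goodwin ave"])
# arm_probs["B"]["57th"] -> 0.25 (one of four observed Begin tokens)
```

A fuller version would also count emissions at each level of the feature hierarchy (token class, class + length) rather than only the literal tokens.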

9 Sequential Specificity Relaxation
Relax the model step by step so it tolerates common input errors:
Token insertion, e.g., an extra token inserted into "57th n sixth st"
Token deletion, e.g., "n sixth" (trailing "st" dropped)
Missing attribute value, e.g., <null>
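One way to realize insertion/deletion tolerance is a small edit-distance-style DP that scores an observed token sequence against an expected one, multiplying in a penalty per inserted or deleted token. This is a sketch of the idea only; the penalty values and the alignment formulation are assumptions, not the paper's mechanism:

```python
def relaxed_score(tokens, pattern, p_ins=0.1, p_del=0.1):
    """Score `tokens` against an expected token `pattern`, tolerating
    inserted and deleted tokens. Exact token matches cost nothing; each
    insertion/deletion multiplies in a penalty, so more faithful inputs
    always score higher."""
    n, m = len(tokens), len(pattern)
    # dp[i][j]: best score aligning tokens[:i] with pattern[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:   # extraneous input token (insertion)
                dp[i + 1][j] = max(dp[i + 1][j], dp[i][j] * p_ins)
            if j < m:   # expected token missing (deletion)
                dp[i][j + 1] = max(dp[i][j + 1], dp[i][j] * p_del)
            if i < n and j < m and tokens[i] == pattern[j]:
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1], dp[i][j])
    return dp[n][m]

relaxed_score(["n", "sixth"], ["n", "sixth", "st"])          # deletion: 0.1
relaxed_score(["57th", "n", "sixth", "st"], ["n", "sixth", "st"])  # insertion: 0.1
```

A fully missing attribute value (the `<null>` case) corresponds to scoring an empty token list, which pays a deletion penalty per expected token.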

10 Determining Attribute Value Order
Attribute order is usually preserved in the same batch of input strings

11 Determining Attribute Value Order
Example: s = "walmart s. randall ave madison wi"
For each attribute Ai, compute per-position scores v(s, Ai) (some values are illegible on the slide, shown as "-"):
  city attribute:   [0.05, -, 0.02, 0.1, -, 0.8, -, -]
  street attribute: [0.1, -, 0.8, 0.7, 0.9, -, -, -]
These scores induce a partial order over attributes; search all permutations for the best total order.
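With only a handful of attributes, the permutation search is cheap. A minimal sketch, assuming (hypothetically) that each attribute's strongest-scoring position indicates where it sits in the string, and that an order is good when attribute pairs agree with those positions:

```python
from itertools import permutations

def best_attribute_order(v):
    """v maps attribute -> per-token scores for one input string.
    Score each permutation by how many attribute pairs appear in the
    same order as their peak positions; return the best total order."""
    peak = {a: scores.index(max(scores)) for a, scores in v.items()}

    def concordant(order):
        return sum(peak[a] <= peak[b]
                   for i, a in enumerate(order)
                   for b in order[i + 1:])

    return max(permutations(v), key=concordant)

# Illustrative scores for "walmart s. randall ave madison wi"
# (the numbers are made up for the example):
v = {"name":   [0.9, 0.1, 0.1, 0.1, 0.2, 0.1],
     "street": [0.1, 0.8, 0.7, 0.9, 0.1, 0.1],
     "city":   [0.05, 0.02, 0.1, 0.1, 0.8, 0.1],
     "state":  [0.1, 0.1, 0.1, 0.1, 0.2, 0.9]}
best_attribute_order(v)   # ('name', 'street', 'city', 'state')
```

Since the order is determined once from the first few input strings of a batch (slide 10), the k! search runs rarely, not per string.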

12 Experiment Data
Reference relations:
Addresses: 1,000,000 tuples; schema: [Name, Number1, Number2, Address, City, State, Zip]
Media: 280,000 music tracks; schema: [ArtistName, AlbumName, TrackName]
Bibliography: 100,000 records from DBLP; schema: [Title, Author, Journal, Volume, Month, Year]
Test datasets (naturally concatenated):
Addresses: from the RISE repository
Media: from Microsoft
Papers: 100 most cited papers from Citeseer

13 Experiment Data (cont.)
Test datasets (controlled):
Randomly chosen attribute order
Error injection

14 Experiment Results

15 Experiment Results: 1-Pos vs. BMT vs. BMT-robust

16 Comments
The idea of using reference tables is good.
The approach is well engineered to handle robustness and efficiency issues.
The experiments are thorough.
Still, the approach is somewhat ad hoc, and every component seems replaceable.
