Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research). KDD'04
Presented by Shui-Lung Chuang, Oct 27, 2004
Text Segmentation
A (short) text string is split into N attributes, e.g., a citation string into [Authors, Title, Conference, Year] (Null for a missing field)
Conventional approaches:
- Rule-based: a human creates rules
- Supervised model-based: a human labels training data
The Approach
Utilize existing (large, clean) reference data, e.g., DBLP papers, US addresses, ...
Example reference table [Author | Title | Conference | Year]:
- Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for ... | SIGKDD | 2004
- Lotlikar, S. Roy | A Hierarchical Document Clustering ... | WWW | ...
- Cimiano, S. Handschuh | Towards the Self-Annotating Web | ... | 2003
Each attribute column trains an Attribute Recognition Model (ARM); ARM_i gives the probability that a sub-string s is generated as a value of attribute i
Segmentation Model
Given an input string, find the split into sub-strings s1 s2 s3 s4 such that each sub-string s_i is most likely generated by the corresponding ARM (Attribute Recognition Model), i.e., maximize the combined probability that each s_i is a value of attribute i
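The split-finding step above can be sketched as a small dynamic program. This is a minimal sketch, not the paper's implementation: the `arms` argument is assumed to be a list of per-attribute scoring functions standing in for trained ARMs, and every attribute is forced to receive at least one token (the paper's relaxed models also allow missing attributes).

```python
def best_segmentation(tokens, arms):
    """Split `tokens` into len(arms) consecutive sub-strings, maximizing the
    product of ARM scores. arms[i](sub) -> probability that `sub` is a
    value of attribute i."""
    n, k = len(tokens), len(arms)
    # best[(i, a)] = (score, segments) after consuming i tokens with a attributes
    best = {(0, 0): (1.0, [])}
    for i in range(n):
        for a in range(k):
            if (i, a) not in best:
                continue
            score, segs = best[(i, a)]
            for j in range(i + 1, n + 1):       # try every next sub-string
                p = arms[a](tokens[i:j])
                cand = (score * p, segs + [tokens[i:j]])
                if cand[0] > best.get((j, a + 1), (0.0, None))[0]:
                    best[(j, a + 1)] = cand
    return best.get((n, k), (0.0, []))
```

For instance, with a toy "number" ARM and a toy "street" ARM, the address tokens `["201", "n", "goodwin", "ave"]` split into `["201"]` and `["n", "goodwin", "ave"]`.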
Challenges
- Robust to input errors: the reference data may be clean, but input may contain missing values, spelling errors, extraneous or unknown tokens, etc. → engineer features; adjust model topology
- Adaptive to varied attribute orders: reference data contain no information about the attribute order in the input → determine attribute order from early input strings
- Efficient in training: reference data is large → fix the model topology; avoid expensive learning procedures (e.g., EM)
Feature Hierarchy High-level features considered: Token classes (words, numbers, mixed, delimiters) + Token length
Attribute Recognition Model
Example attribute values (street addresses): 57th n sixth st; 1010 s fifth st; 201 n goodwin ave
Model Training
Example attribute values: 57th n sixth st; 1010 s fifth st; 201 n goodwin ave
Topology: Begin (B), Middle (M), Trailing (T) states
Transitions: B → {M, T, END}; M → {M, T, END}; T → {T, END}
Base emissions are exact-match: p(x|e) = 1 if x = e, else 0
Emissions then generalize along the feature hierarchy, e.g., the specific token 57th relaxes to length-restricted classes ([a-z0-9]{1,4}, [a-z0-9]{1,5}, ...) and finally to the unrestricted token class
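The base (pre-relaxation) training idea above can be sketched as one Begin/Middle/Trailing token model per reference column, with exact-match emissions as on this slide. This is an illustrative sketch only; the function names and the epsilon smoothing for unseen tokens are assumptions, not the paper's exact formulation.

```python
def train_arm(values):
    """Collect Begin/Middle/Trailing token sets from one reference column."""
    B, M, T = set(), set(), set()
    for value in values:
        toks = value.split()
        if not toks:
            continue
        B.add(toks[0])              # first token -> Begin state
        if len(toks) > 1:
            T.add(toks[-1])         # last token -> Trailing state
        M.update(toks[1:-1])        # interior tokens -> Middle state
    return B, M, T

def arm_score(arm, toks, eps=1e-4):
    """Exact-match emission score: 1 per token seen in that position, eps otherwise."""
    B, M, T = arm
    if not toks:
        return eps
    p = 1.0 if toks[0] in B else eps
    for t in toks[1:-1]:
        p *= 1.0 if t in M else eps
    if len(toks) > 1:
        p *= 1.0 if toks[-1] in T else eps
    return p
```

Trained on the three addresses above, the model scores the unseen combination "1010 n sixth ave" highly, since every token appears in the right positional set.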
Sequential Specificity Relaxation
Relax the model to tolerate:
- Token insertion, e.g., 57th → 57th n sixth st
- Token deletion, e.g., n sixth
- Missing attribute value, e.g., <null>
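The feature hierarchy underlying this relaxation can be sketched as follows, using the token classes from the earlier Feature Hierarchy slide (words, numbers, mixed, delimiters). The concrete representation of each level is an assumption for illustration.

```python
import re

def token_class(token):
    # Token classes from the feature hierarchy slide.
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    if re.fullmatch(r"[a-z0-9]+", token):
        return "mixed"
    return "delimiter"

def relax(token):
    # Enumerate increasingly general representations of a token:
    # the token itself -> its class with a length bound -> its bare class.
    return [token, (token_class(token), len(token)), (token_class(token), None)]
```

For example, `relax("57th")` yields the literal token, then the length-4 mixed class, then the unrestricted mixed class, mirroring the 57th → [a-z0-9]{1,4} → ... relaxation on the training slide.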
Determining Attribute Value Order Attribute order is usually preserved in the same batch of input strings
Determining Attribute Value Order
s = walmart 20205 s. randall ave madison 53715 wi.
positions:     1       2     3    4      5       6      7    8
v(s, A_i) — per-position scores for each attribute:
  city:   [0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07]
  street: [0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1]
From these partial orders, search all attribute permutations for the best total order
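The permutation search above can be sketched as follows: for each candidate attribute permutation, choose strictly increasing positions that maximize the product of the per-position scores v(s, A_i), and keep the best permutation. This is a toy sketch; the paper's actual scoring of a total order may differ.

```python
from itertools import permutations

def _best_assignment(rows):
    """rows: per-attribute score lists in a fixed attribute order; pick
    strictly increasing positions maximizing the product of chosen scores."""
    n = len(rows[0])
    dp = [1.0] * (n + 1)            # dp[j]: best product using positions < j
    for row in rows:
        new = [0.0] * (n + 1)
        for j in range(1, n + 1):
            # either skip position j, or place this attribute there
            new[j] = max(new[j - 1], dp[j - 1] * row[j - 1])
        dp = new
    return dp[n]

def best_order(v):
    """v: attribute name -> per-position scores; search all permutations."""
    best_score, best_perm = 0.0, None
    for perm in permutations(v):
        score = _best_assignment([v[a] for a in perm])
        if score > best_score:
            best_score, best_perm = score, list(perm)
    return best_perm, best_score
```

On the slide's numbers, street-before-city wins (street at position 5, city at position 6), matching the address string shown.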
Experiment Data
Reference relations:
- Addresses: 1,000,000 tuples; schema [Name, Number1, Number2, Address, City, State, Zip]
- Media: 280,000 music tracks; schema [ArtistName, AlbumName, TrackName]
- Bibliography: 100,000 records from DBLP; schema [Title, Author, Journal, Volume, Month, Year]
Test datasets — naturally concatenated:
- Addresses: from the RISE repository
- Media: from Microsoft
- Papers: 100 most cited papers from Citeseer
Experiment Data (cont.)
Test datasets — controlled:
- Randomly chosen attribute order
- Error injection
Experiment Results
Experiment Results: 1-Pos vs. BMT vs. BMT-robust
Comments
- The idea of using reference tables is good
- The approach is well engineered to deal with robustness and efficiency issues
- The experiments are thorough
- However, the approach is still somewhat ad hoc, and every component seems replaceable