Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.

Outline Information Extraction by Text Segmentation (IETS) ◦ Scenario and Problem ◦ Challenges and Motivation ◦ Related Work ONDUX ◦ Preliminary Experiments Next Steps

Information Extraction by Text Segmentation Text documents containing implicit semi-structured data records ▪ Addresses ▪ Bibliographic References ▪ Classified Ads ▪ Product Descriptions

Information Extraction by Text Segmentation
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214
Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
Attributes of the classified ad: Neighborhood, Price, Number, Street,..., Phone

Information Extraction by Text Segmentation
Why extract information? ▪ Database Storage, Query… ▪ Data Mining ▪ Record Linkage
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Extracted values: Regent Square | $228,900 | 1028 | Mifflin Ave. | 6 Bedrooms | 2 Bathrooms | 412-638-7273

Information Extraction by Text Segmentation Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of: 1. Segmenting I into substrings 2. Assigning to each segment a label corresponding to an attribute
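
The input/output contract of the task can be sketched in Python as follows. The `Segment` type, the helper `covers_in_order`, and the attribute names `Bedrooms` and `Bathrooms` are assumptions made for illustration; the record is the classified ad used in the slides, and the segmentation is written by hand to show the target of the task, not produced by any method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str       # a contiguous piece of the input string
    attribute: str  # the label assigned to that piece


# Input string I: one implicit textual record (the classified ad from the slides).
record = "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273"

# Desired output, written by hand: segments plus one attribute label each.
expected = [
    Segment("Regent Square", "Neighborhood"),
    Segment("$228,900", "Price"),
    Segment("1028", "Number"),
    Segment("Mifflin Ave.", "Street"),
    Segment("6 Bedrooms", "Bedrooms"),
    Segment("2 Bathrooms", "Bathrooms"),
    Segment("412-638-7273", "Phone"),
]


def covers_in_order(record: str, segments: list[Segment]) -> bool:
    """Check that the segments occur in the record left to right, without overlap."""
    pos = 0
    for seg in segments:
        idx = record.find(seg.text, pos)
        if idx < 0:
            return False
        pos = idx + len(seg.text)
    return True
```

A valid extraction result must at least pass `covers_in_order(record, expected)`; whether the labels are right is what the evaluation metrics measure.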

IETS – Challenges (I) Information Extraction by Text Segmentation (IETS) ◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09. Diversity of templates and styles ▪ Attribute Ordering ▪ Capitalization ▪ Abbreviations. Different applications share similar domains ▪ E.g., Address and Ads ▪ Records from both domains contain address information

IETS – Challenges (II) Diversity of templates and styles ▪ Attribute Ordering; Capitalization; Abbreviations.
HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006
DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)
ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

IETS – Challenges (III) Existing approaches to this problem use Machine Learning techniques ▪ Hidden Markov Models (HMM) ▪ Conditional Random Fields (CRF) ▪ Support Vector Machines (SVM, SSVM). Supervised approaches require a hand-labeled training set created by an expert. Each generated model is particular to a given application. High computational cost.

Related Work (Semi) Supervised Approaches [Borkar et al. @ SIGMOD 2001] ◦ Supervised extraction method based on Hidden Markov Models (HMM) [McCallum et al. @ ICML 2001] ◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF) [Mansuri et al. @ ICDE 2006] ◦ Semi-supervised approach based on CRF models. All of these approaches require an expert to create a hand-labeled training set for each application.

Related Work (Semi) Supervised Approaches
Hand-labeled example: Regent Square | $228,900 | 1028 | Mifflin Ave, | 6 Bedrooms | 2 Bathrooms | 412-638-7273
Original record: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
CRF and HMM learn lexical, style, positioning and sequencing features from the given examples. Examples are source-dependent. Scalability problem. Reusing pre-existing models?

Related Work Unsupervised Approaches [Figure: sources of pre-existing data, labeled Semi-structured Records, Wikipedia Infobox, DBpedia, FreeBase, Knowledge Bases, Structured Records]

Related Work Unsupervised Approaches: Supervised vs. Unsupervised
Supervised: Hand-labeled examples; Source Dependent; Scalability Problem; Reusability
Unsupervised: Pre-existing information; Domain Representation; Easily adaptable

Related Work Unsupervised Approaches [Agichtein et al. @ SIGKDD 2004] ◦ Uses reference tables to create an unsupervised model based on Hidden Markov Models (HMM) [Zhao et al. @ SIAM ICDM 2008] ◦ Uses reference tables to create unsupervised CRF models (U-CRF). Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?) [Cortez et al. @ JASIST 2009] ◦ Unsupervised method to extract bibliographic information. Domain-specific heuristics, not generally applicable.

Basic Concepts(I1) Knowledge Base ◦ Set of pairs KB = {(a_1, O_1), ..., (a_n, O_n)}, where each a_i is an attribute and O_i is a set of known occurrences of a_i ◦ The building process is trivial ◦ Web Databases (Freebase, Google Base)
KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)}
O_Neigh. = {“Regent Square”, “Milenight Park”}
O_Street = {“Regent St.”, “Morewood Ave.”, “Square Ave. Park”}
O_Phone = {“323 462-6252”, “(171) 289-7527”}
KB: Domain representation. Hand-labeled examples: Source representation.
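
As a minimal sketch, the knowledge base on this slide maps naturally onto a Python dictionary from attribute name to a set of occurrences. The helper `attributes_containing` is a hypothetical name introduced here; it only illustrates that a single term can point to more than one attribute.

```python
from __future__ import annotations

# Knowledge base: attribute name -> set of known occurrences (values from the slide).
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(term: str, kb: dict[str, set[str]]) -> set[str]:
    """Return the attributes that have at least one occurrence containing `term` as a token."""
    return {
        attr
        for attr, occurrences in kb.items()
        if any(term in occ.split() for occ in occurrences)
    }
```

For instance, `attributes_containing("Regent", KB)` returns both `Neighborhood` and `Street`, which is why term lookup alone cannot always decide a label.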

Proposed Method ONDUX [Cortez et al. @ SIGMOD 2010] ◦ Blocking ◦ Matching ◦ Reinforcement

ONDUX (II) Overview [Figure: overview of ONDUX with its steps numbered 1, 2 and 3]

ONDUX (III) Blocking ◦ Split the input text into substrings called blocks ◦ Consider the co-occurrence of consecutive terms in the KB. Example input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
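
One way to read the Blocking step is sketched below, assuming that two consecutive terms stay in the same block when they appear together in at least one KB occurrence. The slides do not give the exact rule, so the grouping criterion, the `normalize` helper, and the short input string (assembled from KB values) are assumptions of this sketch.

```python
# Toy knowledge base with the values shown on the Knowledge Base slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def normalize(term):
    """Drop surrounding punctuation and case so 'Ave.;' and 'Ave.' compare equal."""
    return term.strip(".,;").lower()


def occurrence_token_sets(kb):
    """One set of normalized tokens per KB occurrence, across all attributes."""
    return [
        {normalize(tok) for tok in occ.split()}
        for occurrences in kb.values()
        for occ in occurrences
    ]


def build_blocks(text, kb):
    """Group consecutive terms that co-occur in some KB occurrence."""
    token_sets = occurrence_token_sets(kb)
    blocks = []
    for term in text.split():
        if blocks:
            prev = normalize(blocks[-1][-1])
            cur = normalize(term)
            if any(prev in s and cur in s for s in token_sets):
                blocks[-1].append(term)  # extend the current block
                continue
        blocks.append([term])  # start a new block
    return [" ".join(block) for block in blocks]


blocks = build_blocks("Regent Square 1028 Morewood Ave.", KB)
```

With this toy KB, terms that never appear in it (such as `Mifflin` in the classified ad) end up as single-term blocks, so the quality of the blocks depends directly on KB coverage.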

ONDUX (IV) Matching ◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base ◦ We use distinct matching functions: ▪ Textual Values: FF Function (Field Frequency) ▪ Numeric Values: NM Function (Numeric Matching)
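
The slides name the FF and NM functions but do not define them, so the sketch below uses simplified stand-ins rather than the authors' formulas: `ff_score` averages, over the terms of a block, the share of KB occurrences containing the term that belong to the attribute, and `nm_score` is a Gaussian similarity to the attribute's known numeric values. The numeric KB values and all function names are made up for illustration.

```python
import math
import statistics

# Text values come from the slides; the numeric values are hypothetical.
TEXT_KB = {
    "Neighborhood": ["Regent Square", "Milenight Park"],
    "Street": ["Regent St.", "Morewood Ave.", "Square Ave. Park"],
}
NUMERIC_KB = {
    "Price": [200000.0, 250000.0, 230000.0],
    "Number": [1028.0, 8109.0, 500.0],
}


def normalize(term):
    return term.strip(".,;").lower()


def parse_number(block):
    """Return the block as a float, or None when it is not a plain number."""
    cleaned = block.strip(".,;").replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def ff_score(block, attr, kb=TEXT_KB):
    """Stand-in for FF: average share of a term's KB occurrences found under `attr`."""
    terms = [normalize(t) for t in block.split()]
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        hits = {
            a: sum(term in {normalize(t) for t in occ.split()} for occ in occs)
            for a, occs in kb.items()
        }
        seen = sum(hits.values())
        if seen:
            total += hits[attr] / seen
    return total / len(terms)


def nm_score(value, known_values):
    """Stand-in for NM: Gaussian similarity to the known values of an attribute."""
    mean = statistics.mean(known_values)
    std = statistics.pstdev(known_values)
    if std == 0:
        return 1.0 if value == mean else 0.0
    return math.exp(-((value - mean) ** 2) / (2 * std ** 2))


def match_block(block):
    """Pick the best-scoring attribute, or None when nothing in the KB supports a label."""
    value = parse_number(block)
    if value is not None:
        scores = {a: nm_score(value, vals) for a, vals in NUMERIC_KB.items()}
    else:
        scores = {a: ff_score(block, a) for a in TEXT_KB}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In this toy KB the block `Regent Square` scores the same for `Neighborhood` and `Street`, and `Bedrooms` matches nothing; these are the mislabeled and unmatched cases that the later Reinforcement step is meant to review.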

ONDUX (V) Matching
Input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels assigned to the blocks: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VI) How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?
Input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels assigned to the blocks: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VII) Reinforcement ◦ Review the labeling performed in the Matching step ▪ Unmatched blocks must receive the label of some attribute ▪ Mismatched blocks must be correctly labeled ◦ How to handle these cases? ▪ Using positioning and sequencing information that is obtained on demand.

ONDUX (VIII) Reinforcement ◦ Given the extraction output of the Matching step ▪ ONDUX automatically builds a graphical structure, the PSM. ▪ PSM: Positioning and Sequencing Model.
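
A rough sketch of how such a model could be derived from the Matching output is given below. It estimates, from the label sequences themselves, how often one attribute follows another and how often an attribute appears at each block position, then uses both to label unmatched blocks. The normalization, the noisy-OR combination, and the names `build_psm` and `fill_unmatched` are assumptions of this sketch; it also leaves mismatched blocks untouched, which the Reinforcement step described in the slides is meant to correct as well.

```python
from collections import Counter, defaultdict


def build_psm(sequences):
    """Estimate transition and position probabilities from label sequences.

    Each sequence lists the labels of one record's blocks; None marks an unmatched block.
    """
    trans_counts = defaultdict(Counter)
    pos_counts = defaultdict(Counter)
    for labels in sequences:
        for k, label in enumerate(labels):
            if label is None:
                continue
            pos_counts[label][k] += 1
            nxt = labels[k + 1] if k + 1 < len(labels) else None
            if nxt is not None:
                trans_counts[label][nxt] += 1
    trans = {
        a: {b: c / sum(cnt.values()) for b, c in cnt.items()}
        for a, cnt in trans_counts.items()
    }
    pos = {
        a: {k: c / sum(cnt.values()) for k, c in cnt.items()}
        for a, cnt in pos_counts.items()
    }
    return trans, pos


def fill_unmatched(labels, trans, pos):
    """Label each unmatched block using the previous label and the block position."""
    result = list(labels)
    for k, label in enumerate(result):
        if label is not None:
            continue
        prev = result[k - 1] if k > 0 else None
        best, best_score = None, 0.0
        for attr in pos:
            t = trans.get(prev, {}).get(attr, 0.0)
            p = pos[attr].get(k, 0.0)
            score = 1.0 - (1.0 - t) * (1.0 - p)  # noisy-OR of the two evidences
            if score > best_score:
                best, best_score = attr, score
        result[k] = best
    return result


# Hypothetical Matching output for three records; the third has one unmatched block.
full = ["Neighborhood", "Price", "Number", "Street", "Bedrooms", "Bathrooms", "Phone"]
matched = [list(full), list(full), full[:3] + [None] + full[4:]]
trans, pos = build_psm(matched)
repaired = fill_unmatched(matched[2], trans, pos)
```

Because the model is built from the records being extracted, no ordering of attributes has to be fixed in advance, which is the sense of "on demand" in the slides.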

ONDUX (IX) Reinforcement ◦ Extraction Result
Input: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
[Figure: block labels Price, No., Bed., Bath., Phone, Street, ???, Neighborhood, Street]

Experiments (I) Setup ◦ We tested our proposed approach on: ▪ Bibliographic Data (CORA, PersonalBib) ▪ Collections are available on the Web
Test Set                         | KB, Reference Table, …
Dataset | #Attributes | #Records | Source      | #Attributes | #Records
CORA    | 1..13       | 150      | Cora        | 1..13       | 350
CORA    | 1..13       | 150      | PersonalBib | 7           | 395

Experiments (II) Evaluation ◦ Metrics ▪ Precision, Recall and F-Measure ▪ t-test for the statistical validation of the results ◦ Baseline ▪ Conditional Random Fields (CRF) ▪ U-CRF (Unsupervised method) ▪ S-CRF (Classical supervised method)
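
For reference, the three metrics can be computed as in the sketch below. The slides do not say at which granularity extractions are counted, so representing each extraction as a `(record_id, attribute, value)` triple is an assumption, and the sample data is invented purely to exercise the function.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F-measure over sets of extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 5 gold values, 4 predicted, 3 of them correct.
gold = {
    (1, "Neighborhood", "Regent Square"),
    (1, "Price", "$228,900"),
    (1, "Number", "1028"),
    (1, "Street", "Mifflin Ave."),
    (1, "Phone", "412-638-7273"),
}
predicted = {
    (1, "Street", "Regent Square"),  # wrong label
    (1, "Price", "$228,900"),
    (1, "Number", "1028"),
    (1, "Phone", "412-638-7273"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here precision is 3/4 and recall is 3/5; the F-measure is their harmonic mean.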

Experiments (III) Extraction Quality ◦ S-CRF achieves better results than U-CRF thanks to the hand-labeled training data ◦ CORA includes a variety of styles and information (journal, conference, books) ◦ In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models

Experiments (IV) Extraction Quality ◦ As discussed earlier, U-CRF is able to deal with different attribute orderings ◦ Due to the Matching and Reinforcement strategies, ONDUX outperforms the CRF models

Conclusions and Future Work (I) Partial results of our research on unsupervised strategies for information extraction. ONDUX ◦ Flexible: does not assume any particular style ◦ Unsupervised: does not require any human effort to create a training set ◦ On-Demand: ordering and positioning information is learned through the Matching phase

Conclusions and Future Work (II) The proposed strategy achieves good precision and recall ◦ Comparison with the state of the art. Future work ◦ Investigate different matching functions; ◦ Multi-Record Extraction; ◦ Active Learning and Feedback; ◦ Error Detection; ◦ Nested structures?

Questions?
