Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web Mathew Michelson and Craig A. Knoblock.

Abstract
- Extracting unstructured data is difficult
- Traditional methods do not apply
- Solution: unsupervised extraction
- Results are competitive with supervised methods

Introduction
- Web data (e.g., Craigslist posts) could be useful if extracted

Introduction
- Posts are not structured
- The Phoebus method works on this data
- But it requires much user input (it is supervised)
- The paper presents an optional unsupervised method
- This work extends unsupervised semantic annotation

Introduction
- This approach does not use structural assumptions
- This approach relies on similarity, not redundancy
- This approach creates relational data
- Current work on unsupervised information extraction (UIE) relies on redundancy

Unsupervised Extraction
Steps of the algorithm:
- Automatically choosing the reference set
- Matching posts to the reference set
- Unsupervised extraction

Automatically Choosing the Reference Sets
- They choose a reference set based on similarity
- They calculate a similarity score and sort the sets
- They use percent difference and average score
- The algorithm scales linearly with size
- They use multiple metrics as similarity score
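The selection step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of Jaccard as the similarity score, the `threshold` value, and the exact stopping rule (keep ranked sets that score at least the average, and stop after the first large percent drop) are all assumptions made for the example.

```python
def tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (stand-in metric, an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_reference_sets(posts, reference_sets, threshold=0.6):
    """Rank reference sets by similarity to the posts and keep the leading ones.

    `reference_sets` maps a (hypothetical) set name to a list of record strings.
    `threshold` is an assumed cut-off on the percent difference between
    consecutive scores.
    """
    if not reference_sets:
        return []
    post_tokens = set().union(*(tokens(p) for p in posts))
    scored = []
    for name, records in reference_sets.items():
        ref_tokens = set().union(*(tokens(r) for r in records))
        scored.append((name, jaccard(post_tokens, ref_tokens)))

    # Sort the sets by similarity score, best first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    average = sum(score for _, score in scored) / len(scored)

    chosen = []
    for i, (name, score) in enumerate(scored):
        if score == 0.0 or score < average:
            break  # below-average sets are never selected
        chosen.append(name)
        next_score = scored[i + 1][1] if i + 1 < len(scored) else 0.0
        percent_difference = (score - next_score) / score
        if percent_difference > threshold:
            break  # large drop: everything after this is treated as unrelated
    return chosen


posts = [
    "93 civic 5speed runs great",
    "honda accord 2001 low miles",
    "toyota camry clean title",
]
reference_sets = {
    "cars": ["honda civic", "honda accord", "toyota camry"],
    "hotels": ["hyatt regency downtown", "hilton garden inn"],
}
selected = choose_reference_sets(posts, reference_sets)
```

Each reference set is scored once against the posts, which is consistent with the linear scaling noted on the slide; any of the other metrics mentioned in the deck could be swapped in for `jaccard`.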

Unsupervised Extraction: Matching Posts to the Reference Set
- A vector-space model is used to match posts
- The Jaro-Winkler metric is used to match tokens
- Attributes that do not agree are removed
- Now we can query the posts (Yay!)
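A rough sketch of the matching step follows. The Jaro-Winkler function is the standard formulation; everything else is an assumption for illustration: the Dice-style score, the 0.9 token cut-off, the record layout (a dict of attribute name to value), and the reading of "attributes that do not agree are removed" as "when several records tie for the best score, keep only the attributes on which they all agree".

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def soft_dice(post_tokens, record_tokens, cutoff=0.9):
    """Dice-style score where tokens match if Jaro-Winkler >= cutoff (assumed)."""
    hits = sum(
        1 for p in post_tokens
        if any(jaro_winkler(p, r) >= cutoff for r in record_tokens)
    )
    total = len(post_tokens) + len(record_tokens)
    return 2 * hits / total if total else 0.0


def match_post(post, records, cutoff=0.9):
    """Return the attributes agreed on by the best-scoring reference records."""
    post_tokens = post.lower().split()
    scored = [
        (soft_dice(post_tokens, " ".join(r.values()).lower().split(), cutoff), r)
        for r in records
    ]
    best = max((score for score, _ in scored), default=0.0)
    if best == 0.0:
        return {}
    tied = [r for score, r in scored if score == best]
    # Drop attributes on which the tied best matches disagree.
    return {k: v for k, v in tied[0].items() if all(r.get(k) == v for r in tied)}


records = [
    {"make": "Honda", "model": "Accord"},
    {"make": "Honda", "model": "Civic"},
    {"make": "Toyota", "model": "Camry"},
]
misspelled = match_post("93 hnda civc 5speed", records)
ambiguous = match_post("honda 5speed", records)
```

In this toy example the misspelled post still resolves to the Civic record, while the ambiguous post keeps only the make, since the two Honda records disagree on the model.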

Unsupervised Extraction
- A baseline is created between extracted field and reference set field
- We remove tokens based on the baseline
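One way to read the two bullets above is as a greedy clean-up loop: score the candidate tokens against the matched reference value to get a baseline, then drop any token whose removal raises the score. The sketch below follows that reading. The greedy strategy, the Jaccard similarity, and all function names are assumptions for illustration, not details taken from the slides.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity over lowercased tokens (stand-in metric, an assumption)."""
    a = {t.lower() for t in tokens_a}
    b = {t.lower() for t in tokens_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clean_extraction(candidate_tokens, reference_value, similarity=jaccard):
    """Greedily remove tokens whose removal improves on the baseline score."""
    reference_tokens = reference_value.split()
    tokens = list(candidate_tokens)
    baseline = similarity(tokens, reference_tokens)
    improved = True
    while improved and len(tokens) > 1:
        improved = False
        for i in range(len(tokens)):
            trial = tokens[:i] + tokens[i + 1:]
            score = similarity(trial, reference_tokens)
            if score > baseline:
                # Removing this token helps: drop it and raise the baseline.
                tokens, baseline, improved = trial, score, True
                break
    return tokens


model_tokens = clean_extraction(["civic", "5speed", "great"], "Civic")
```

Here the noise tokens are dropped one at a time because each removal moves the score closer to the reference value, leaving only the token that belongs to the attribute.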

Experimental Results Reference Sets Post Sets

Experimental Results Jensen-Shannon similarity

Experimental Results TF/IDF similarity

Experimental Results Jaccard similarity

Experimental Results Jaro-Winkler TF/IDF similarity

Experimental Results Results

Experimental Results Dice similarity

Experimental Results Jaccard similarity

Experimental Results TF/IDF similarity

Experimental Results Dice vs Phoebus

Experimental Results Jaro-Winkler vs Smith-Waterman

Experimental Results Comparison with other methods

Related Work
- SemTag is a similar system, but it uses a crafted taxonomy
- In contrast to this approach, SemTag focuses on disambiguation
- CRAM is also similar, but it requires labeling

Conclusion
- This paper introduces an unsupervised information extraction technique
- The Jensen-Shannon distance metric is better
- Using text acronyms would be beneficial
- Entity extraction could be a good idea

Questions
