Slide 1
Sunita Sarawagi
Slide 2
- Enables richer forms of queries
- Facilitates source integration and queries spanning sources
- "Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources."
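To make the definition concrete, a minimal rule-based sketch in Python is shown below. The sentence, the patterns, and the field names are all invented for illustration; real extractors combine many such clues, usually with statistical models rather than a handful of regular expressions.

```python
import re

# Hypothetical input sentence and hand-written patterns, standing in for a
# rule-based extractor of two entities, a relationship, and two attributes.
text = "Acme Corp. acquired Widget Inc. for $250 million in 2023."

acquisition = re.search(
    r"(?P<buyer>[A-Z][\w.]*(?: [A-Z][\w.]*)*) acquired "
    r"(?P<target>[A-Z][\w.]*(?: [A-Z][\w.]*)*)",
    text,
)
price = re.search(r"\$\d+(?:\.\d+)? (?:million|billion)", text)
year = re.search(r"\b(?:19|20)\d{2}\b", text)

if acquisition:
    record = {
        "relationship": "acquired",
        "buyer": acquisition.group("buyer"),
        "target": acquisition.group("target"),
        "price": price.group(0) if price else None,
        "year": year.group(0) if year else None,
    }
    print(record)
    # {'relationship': 'acquired', 'buyer': 'Acme Corp.',
    #  'target': 'Widget Inc.', 'price': '$250 million', 'year': '2023'}
```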
Slide 3
- Roots in NLP
- Now many communities:
  - Machine learning
  - Information retrieval
  - Databases
  - Web (web science)
  - Document analysis
- Sarawagi's categorization of methods:
  - Rule-based
  - Statistical
  - Hybrid models
Slide 4
- News Tracking
- Customer Care (e.g., unstructured data from insurance claim forms)
- Data Cleaning (e.g., converting address strings into structured records; see the sketch after this list)
- Classified Ads
- Personal Information Management
- Scientific (e.g., bio-informatics)
- Citation Databases
- Opinion Databases (e.g., enhanced if organized along structured fields)
- Community Websites (e.g., conferences, projects, events)
- Comparison Shopping
- Ad Placement (e.g., product ads next to text mentioning the product)
- Structured Web Search
- Grand Challenge: allow structured search queries involving entities and their relationships over the WWW
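As one illustration of the data-cleaning application, the sketch below segments a free-text, US-style address string into structured fields with a single hand-written regular expression. The address, the pattern, and the field names are hypothetical; practical data-cleaning systems usually learn the segmentation (for example with sequence-labeling models) rather than hard-coding one regex.

```python
import re

# One hypothetical pattern for "street, city, STATE zip" style addresses.
ADDRESS = re.compile(
    r"^(?P<street>.+?),\s*"       # everything up to the first comma
    r"(?P<city>[A-Za-z .]+),\s*"  # city name
    r"(?P<state>[A-Z]{2})\s+"     # two-letter state code
    r"(?P<zip>\d{5})$"            # five-digit ZIP code
)

def segment_address(raw: str):
    """Return structured address fields, or None if the pattern fails."""
    match = ADDRESS.match(raw.strip())
    return match.groupdict() if match else None

print(segment_address("221B Baker St., Springfield, IL 62704"))
# {'street': '221B Baker St.', 'city': 'Springfield',
#  'state': 'IL', 'zip': '62704'}
```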
Slide 5
- Entities
- Relationships
- Adjective Descriptors
- Structures
  - Aggregates
  - Lists
  - Tables
  - Hierarchies
Slide 6
- Granularity
  - Record or Sentence
  - Paragraphs
  - Documents
- Heterogeneity
  - Machine-Generated Pages
  - Partially Structured
  - Domain-Specific
  - Open-Ended
Slide 7
- Structured Databases: "In many applications unstructured data needs to be integrated with structured databases."
- Labeled Unstructured Text
  - Labeling for machine learning
  - Labeling to establish ground truth
- Preprocessor Libraries (NLP tools; see the sketch after this list)
  - Sentence analyzer to identify sentence boundaries
  - Part-of-speech tagger
  - Parser to group tagged text into phrases
  - Dependency analyzer (subject/object)
  - Formatted text (table & list structures)
- Lexical Resources (e.g., WordNet)
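The preprocessing steps listed above map directly onto off-the-shelf NLP toolkits. The sketch below uses spaCy as one possible choice (it assumes spaCy and its small English model are installed via `pip install spacy` and `python -m spacy download en_core_web_sm`); NLTK or CoreNLP would serve the same role.

```python
import spacy

# Load a small English pipeline that bundles a sentence segmenter,
# part-of-speech tagger, dependency parser, and noun-phrase chunker.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Sunita Sarawagi wrote a survey on information extraction. "
          "It categorizes rule-based and statistical methods.")

# Sentence analyzer: identify sentence boundaries.
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Part-of-speech tagger and dependency analyzer (subject/object relations).
for token in doc:
    print(f"{token.text:12} pos={token.pos_:6} dep={token.dep_:10} head={token.head.text}")

# Parser-derived phrase grouping: noun chunks.
for chunk in doc.noun_chunks:
    print("NOUN PHRASE:", chunk.text)
```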
Slide 8
- Identify all instances in the unstructured text
- Populate a database
- For both, the core extraction work remains the same
Slide 9
- Accuracy (the foremost challenge)
- Diversity of clues required to be successful
  - Inherent complexity demands combining evidence
  - Optimally combining the evidence is non-trivial
  - The problem is far from solved
- Difficulty of detecting missed extractions (see the sketch after this list)
  - Recall: the percentage of actual entities extracted correctly; without ground truth, the full set of actual entities is unknown
  - Precision: the percentage of extracted entities that are correct; easier to tune, since extractions can usually be judged correct or incorrect
- Increased complexity of the structures extracted (e.g., parts of a blog that assert an opinion)
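To make the recall/precision distinction concrete, here is a small worked example with made-up entity sets. Precision needs only the extractions plus correctness judgments, while recall also needs a gold-standard list of every true entity, which is exactly what is usually missing.

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Precision = |correct| / |extracted|; Recall = |correct| / |gold|."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical run: 4 extractions, 3 of them correct, 5 true entities overall.
extracted = {"Acme Corp.", "Widget Inc.", "2023", "Bob"}
gold = {"Acme Corp.", "Widget Inc.", "2023", "Alice", "Eve"}

print(precision_recall(extracted, gold))  # (0.75, 0.6)
```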
Slide 10
- Running Time
  - Lots of documents; just finding the set from which to extract is challenging
  - Expensive processing steps applied to many documents
- Other System Issues
  - Dynamically changing sources
  - Data integration (when extracting the same objects from different sites)
  - Extraction errors
  - Attaching confidence to extractions, but computing the confidence is non-trivial