BYU A Synergistic Semantic Annotation Model December 2007 Yihong Ding,
BYU 12/7/2007
Grand challenge: new generation World Wide Web
- The current Web
  - Enormous amount of content
  - Feasible for humans to read/write
  - But … the content is simply too much to read
- The future Web
  - Even more content, but machine-processable
  - Feasible for humans and machines to read/write
- Key issue
  - Converting non-machine-processable content to machine-processable content, i.e., semantic annotation
Semantic annotation, the general picture
[Figure: Data Extraction/Instance Recognition Engine; AptRental Ontology]
Semantic annotation, the general picture
[Figure: AptRental Ontology]
Ontology
- Definition: explicit, formal specifications of conceptualizations
  - Unique identity of each concept
  - Unique identity of each relationship among concepts
  - Logic derivation rules underneath every declared relationship
- Annotation:
  - [phone number] is-a AptRental:ContactPhone
  - $1250 is-a AptRental:MonthlyRate
  - [phone number] is-about AptRentalAd-instance-1
  - $1250 is-about AptRentalAd-instance-1
- Ontology:
  - AptRentalAd hasContactPhone
  - AptRentalAd hasMonthlyRate
- Logic derivation (machine understanding):
  - To rent the apartment that costs $1250 monthly, please call [phone number]
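The annotation and derivation steps on this slide can be sketched as a small set of triples. This is only an illustration, not the authors' implementation: the `Triple` type, the `values_for` and `describe` helpers, and the `<contact-phone>` placeholder (the actual number does not survive in the slide text) are all assumptions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Placeholder only: the phone number itself is missing from the slide text.
PHONE = "<contact-phone>"

# Annotation statements from the slide, written as (subject, predicate, object).
annotations = [
    Triple(PHONE, "is-a", "AptRental:ContactPhone"),
    Triple("$1250", "is-a", "AptRental:MonthlyRate"),
    Triple(PHONE, "is-about", "AptRentalAd-instance-1"),
    Triple("$1250", "is-about", "AptRentalAd-instance-1"),
]


def values_for(instance, concept, triples):
    """Return the values that are typed as `concept` and are about `instance`."""
    about = {t.subject for t in triples
             if t.predicate == "is-about" and t.obj == instance}
    typed = {t.subject for t in triples
             if t.predicate == "is-a" and t.obj == concept}
    return sorted(about & typed)


def describe(instance, triples):
    """Combine the two annotated facts into the sentence shown on the slide."""
    rate = values_for(instance, "AptRental:MonthlyRate", triples)[0]
    phone = values_for(instance, "AptRental:ContactPhone", triples)[0]
    return f"To rent the apartment that costs {rate} monthly please call {phone}"


sentence = describe("AptRentalAd-instance-1", annotations)
```

Because both values are tied to the same ad instance and each is typed by an ontology concept, a program can join them without knowing anything about the page layout.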
Automated semantic annotation, methods
- Layout-driven method (e.g., [Mukherjee et al. 03])
- Machine-learning-based method (e.g., [Handschuh et al. 02])
- Rule-based method (e.g., [Dill et al. 03])
- NLP-based method (e.g., [Popov et al. 03])
- Ontology-based method (e.g., [Ding et al. 06])
Ontology-based annotation
Data extraction ontology
- Standard Ontology: BedroomNr
- Epistemological extension (instance recognizer)
- Sample ad: CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, 1700 sq ft. $1250 mo. Call
- BedroomNr: External representation, Context Phrase, Exception Phrase X
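One way to picture an instance recognizer with an external representation, a context phrase, and an exception phrase is as three regular expressions. The sketch below is an assumption-laden illustration: the `InstanceRecognizer` class, the specific patterns, and the furniture-ad exception are hypothetical and are not taken from the authors' system.

```python
import re
from dataclasses import dataclass


@dataclass
class InstanceRecognizer:
    concept: str
    external_representation: str  # regex for the value itself
    context_phrase: str           # regex that must directly follow the value
    exception_phrase: str = ""    # regex that vetoes a hit when it follows the context

    def find(self, text):
        """Return the values found in `text` that pass the context and exception checks."""
        pattern = re.compile(
            "(" + self.external_representation + r")\s*(?:" + self.context_phrase + ")",
            re.IGNORECASE,
        )
        hits = []
        for match in pattern.finditer(text):
            tail = text[match.end():]
            if self.exception_phrase and re.match(self.exception_phrase, tail, re.IGNORECASE):
                continue  # the exception phrase follows, so reject this candidate
            hits.append(match.group(1))
        return hits


# Hypothetical patterns for the BedroomNr concept.
bedroom_nr = InstanceRecognizer(
    concept="BedroomNr",
    external_representation=r"\d+",
    context_phrase=r"(?:bdrms?|bedrooms?|br)\b",
    exception_phrase=r"\s*(?:set|furniture)\b",
)

AD = "CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, 1700 sq ft. $1250 mo. Call"
found = bedroom_nr.find(AD)
rejected = bedroom_nr.find("Oak 5 bedroom set, like new")
```

The context phrase keeps the recognizer from picking up the other numbers in the ad (baths, garages, square feet), while the exception phrase filters out text that merely looks like a bedroom count.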
Ontology-based annotation
- BedroomNr: External representation, Context Phrase
- BathNr: External representation, Context Phrase
- Feature: External representation
- MonthRate: External representation, Context Phrase
- ContactPhone: External representation
- Sample ad: CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, 1700 sq ft. $1250 mo. Call
- Context Keyword
Ontology-based annotation: strengths and weakness
- Strengths
  - Ignores layout differences
  - Ignores layout changes
  - Less maintenance once built
- Weakness
  - Expensive to build instance recognizers
Layout-driven annotation
Layout-driven annotation: strengths and weakness
- Strengths
  - Accurate
  - Simple and straightforward
  - Requires less domain knowledge
- Weakness
  - Expensive layout-pattern maintenance
Problem
- How can we overcome the weaknesses of both methods while retaining their strengths?
Observation
- Conceptual Annotator (ontology-based annotation)
  - Inputs: Extraction Domain ontology, A Document
  - Output: Annotated Document
  - Resilient
- Structural Annotator (layout-driven annotation)
  - Inputs: Layout Patterns, Domain ontology, A Document
  - Output: Annotated Document
  - Accurate
Synergistic model
- Conceptual Annotator (ontology-based annotation)
  - Inputs: Extraction Domain ontology, A Document
  - Output: Annotated Document
- Pattern Generation
  - Output: Layout Patterns
- Structural Annotator (layout-driven annotation)
  - Output: Annotated Document
- Instance Recognizer Enrichment
Pattern Generation
- Get the annotated outputs from the ontology-based annotator
- Apply HTML-structure analysis and produce a typical layout pattern for each extracted field
- If applicable, produce a sequential dependency between the generated layouts
- If applicable, produce simple heuristic rules such as “if A then B” between the generated layouts
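The first two steps of pattern generation can be sketched with the standard-library HTML parser: locate each annotated value in the page, record the tag path that leads to it, and keep the most frequent path per field as its typical layout pattern. Everything here is an assumption for illustration: the `PathIndexer` and `generate_patterns` names, the tag-path notation, and the two sample pages are hypothetical, and the sequential-dependency and heuristic-rule steps are not covered.

```python
from collections import Counter
from html.parser import HTMLParser


class PathIndexer(HTMLParser):
    """Record the tag path (e.g. 'div.ad/span.loc') of every non-empty text node.

    Assumes well-nested HTML in which every opened tag is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = []  # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_paths.append(("/".join(self.stack), text))


def generate_patterns(annotated_pages):
    """Map each field to the tag path where its annotated values occur most often."""
    votes = {}
    for html, fields in annotated_pages:
        indexer = PathIndexer()
        indexer.feed(html)
        for field, value in fields.items():
            for path, text in indexer.text_paths:
                if value in text:
                    votes.setdefault(field, Counter())[path] += 1
    return {field: counts.most_common(1)[0][0] for field, counts in votes.items()}


# Hypothetical pages paired with the output of an ontology-based annotator.
pages = [
    ('<div class="ad"><span class="loc">CAPITOL HILL</span>'
     '<p>Luxury 2 bdrm 2 bath</p></div>',
     {"Location": "CAPITOL HILL", "BedroomNr": "2"}),
    ('<div class="ad"><span class="loc">DOWNTOWN</span>'
     '<p>Studio, $900 mo.</p></div>',
     {"Location": "DOWNTOWN"}),
]
patterns = generate_patterns(pages)
```

In this toy input the Location value sits alone in its own element, so its path is a clean pattern, whereas BedroomNr is buried inside free text and the path only narrows the search to a paragraph.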
Instance recognizer enrichment
- Get the annotated outputs from the layout-driven annotator
- Apply the results to the corresponding current instance recognizers
- If a value is recognized, continue
- Otherwise:
  - For a dictionary-type recognizer, insert the value
  - For a regular-expression-type recognizer, try to generate a new regular expression and alert the user to check it
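The enrichment procedure can be sketched as a loop over the values returned by the layout-driven annotator. This is a minimal sketch under stated assumptions: the dictionary-based recognizer layout, the `generalize` heuristic (digits become `\d`, letters become `[A-Za-z]`), and the returned review list standing in for "alert the user" are all hypothetical choices, not the authors' method.

```python
import re


def generalize(value):
    """Build a candidate regex from one example by abstracting digits and letters."""
    parts = []  # list of [token, repeat_count]
    for ch in value:
        if ch.isdigit():
            token = r"\d"
        elif ch.isalpha():
            token = "[A-Za-z]"
        else:
            token = re.escape(ch)
        if parts and parts[-1][0] == token:
            parts[-1][1] += 1
        else:
            parts.append([token, 1])
    return "".join(tok + (f"{{{n}}}" if n > 1 else "") for tok, n in parts)


def enrich(recognizer, observed_values):
    """Apply layout-driven results to a recognizer.

    Dictionary recognizers get unseen values inserted directly. Regex
    recognizers are left unchanged; candidate patterns are returned so a
    person can review them before they are adopted.
    """
    pending_review = []
    for value in observed_values:
        if recognizer["kind"] == "dictionary":
            if value not in recognizer["entries"]:
                recognizer["entries"].add(value)
        else:
            if any(re.fullmatch(p, value) for p in recognizer["patterns"]):
                continue  # already recognized
            candidate = generalize(value)
            if candidate not in pending_review:
                pending_review.append(candidate)
    return pending_review


# Hypothetical recognizers and layout-driven outputs.
location = {"kind": "dictionary", "entries": {"CAPITOL HILL"}}
monthly_rate = {"kind": "regex", "patterns": [r"\$\d{4}"]}

enrich(location, ["CAPITOL HILL", "DOWNTOWN"])
to_review = enrich(monthly_rate, ["$1250", "$950"])
```

Keeping generated regular expressions out of the recognizer until someone has checked them mirrors the slide's distinction: dictionary insertions are low risk, while an over-general pattern could silently degrade precision.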
Preliminary results (Apartment Rental domain)
- Ontology-based annotation
  - 90% accuracy on average for both precision and recall on nearly all fields
  - Exceptions: Location and Contact Name
- Layout-driven annotation
  - Nearly 100% accuracy for both precision and recall on Location and Contact Name
  - Lower recall on fields such as BedroomNr
- Pattern generation
  - Works well on well-structured fields such as Location
  - Less successful on semi-structured fields such as BedroomNr
- Instance recognizer enrichment
  - Good results even with poorly constructed initial instance recognizers
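For readers unfamiliar with the two measures, precision and recall for an annotator can be computed by comparing its output with a hand-labeled gold standard. The sketch below uses made-up toy data purely to show the arithmetic; it does not reproduce the experiment or its numbers, and the `precision_recall` helper is an assumption.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): each item is (document, field, value).
gold = {
    ("ad1", "BedroomNr", "2"),
    ("ad1", "BathNr", "2"),
    ("ad1", "MonthlyRate", "$1250"),
    ("ad1", "Location", "CAPITOL HILL"),
}
extracted = {
    ("ad1", "BedroomNr", "2"),
    ("ad1", "BathNr", "3"),          # wrong value: hurts precision
    ("ad1", "MonthlyRate", "$1250"),
}                                    # Location missed: hurts recall

precision, recall = precision_recall(extracted, gold)
```

Here two of the three extracted items are correct and two of the four gold items are found, so a missed field lowers recall while a wrong value lowers precision.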
Summary
- Automatically produce layout patterns using outputs of ontology-based annotation
- Automatically enrich domain-specific instance recognizers using outputs of layout-driven annotation
- A new synergistic annotation model that retains original strengths and minimizes original weaknesses
- An annotation system that self-improves its performance during its execution
Future work
- Dynamically tuning annotation based on user perspectives
- Ensemble of various annotators
- Collaborative annotation
Thank you
Yihong Ding
(801) TMCB, Brigham Young University
Provo, UT
Data Extraction Research Lab at Brigham Young University
Homepage, my virtual home on Web
Thinking Space, my virtual home on Web 2.0