Ontology-based Annotation
Sergey Sosnovsky
Outline
- O-based Annotation
- Conclusion
- Questions
Why Do We Need Annotation
- Annotation-based Services
  - Integration of dispersed information (knowledge-based linking)
  - Better indexing and retrieval (based on the document semantics)
  - Content-based adaptation (modeling document content in terms of a domain model)
- Knowledge Management
  - Organizations' repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
- Collaboration Support
  - Knowledge sharing and communication

What Is Added by O-based Annotation
- Ontology-driven processing (effective formal reasoning)
- Connecting other O-based services (O-mapping, O-visualization, …)
- Unified vocabulary
- Connecting to the rest of SW knowledge
Definition
O-based Annotation is the process of creating a mark-up of Web documents using a pre-existing ontology and/or populating knowledge bases with marked-up documents.

[Figure: RDF graph for the sentence “Michael Jordan plays basketball”. The classes our:Athlete and our:Sports are linked by our:plays; the instances Michael Jordan and Basketball are linked by our:plays and attached to their classes by rdf:type.]
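The “Michael Jordan plays basketball” example can be written down as subject-predicate-object triples. The following is a minimal sketch in plain Python (no RDF library), assuming the usual reading of the diagram; the instance identifiers `our:MichaelJordan` and `our:Basketball` and the two helper functions are illustrative names, not part of the original slides.

```python
# Triples for "Michael Jordan plays basketball", stored as plain tuples.
# "our:" stands for the example ontology's namespace; the instance
# identifiers below are assumptions made for this illustration.
RDF_TYPE = "rdf:type"

triples = {
    ("our:MichaelJordan", RDF_TYPE, "our:Athlete"),
    ("our:Basketball", RDF_TYPE, "our:Sports"),
    ("our:MichaelJordan", "our:plays", "our:Basketball"),
}


def objects(subject, predicate):
    """Return every object o for which (subject, predicate, o) is asserted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def instances_of(concept):
    """Return every subject that is typed with the given concept."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == concept)


print(objects("our:MichaelJordan", "our:plays"))
print(instances_of("our:Athlete"))
```

Once the sentence is stored this way, questions such as “which athletes are mentioned?” become simple lookups over the triples rather than text searches.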
List of Tools
- AeroDAML / AeroSWARM
- Annotea / Annozilla
- Armadillo
- AktiveDoc
- COHSE
- GOA
- KIM Semantic Annotation Platform
- MagPie
- Melita
- MnM
- OntoAnnotate
- Ontobroker
- OntoGloss
- ONTO-H
- Ont-O-Mat / S-CREAM / CREAM
- Ontoseek
- Pankow
- SHOE Knowledge Annotator
- Seeker
- Semantik
- SemTag
- SMORE
- Yawas
- …

Information Extraction Tools:
- Alembic
- Amilcare / T-REX
- Annie
- Fastus
- Lasie
- Proteus
- SIFT
- …
Important Characteristics
- Automation of annotation (manual / semiautomatic / automatic / editable)
- Ontology-related issues:
  - pluggable ontology (yes / no)
  - ontology language (RDFS / DAML+OIL / OWL / …)
  - local / anywhere access
  - ontology elements available for annotation (concepts / instances / relations / triples)
  - where annotations are stored (in the annotated document / on a dedicated server / where specified)
  - annotation format (XML / RDF / OWL / …)
- Annotated documents:
  - document kinds (text / multimedia)
  - document formats (plain text / HTML / PDF / …)
  - document access (local / web)
- Architecture / Interface / Interoperability
  - standalone tool / web interface / web component / API / …
- Annotation scale (large: the size of the WWW / small: a hundred)
- Existing documentation / tutorial
- Availability
SMORE
- Manual annotation
- OWL-based markup
- Simultaneous O modification (if necessary)
- ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
- Post-annotation O-based inference

[Figure: the “Michael Jordan plays basketball” RDF graph (our:Athlete, our:plays, our:Sports, rdf:type).]
Problems of Manual Annotation
- Expensive / time-consuming
- Difficult / error-prone
- Subjective (two people annotating the same documents annotate them differently in 15–30% of cases)
- Never-ending
  - new documents
  - new versions of ontologies
- Annotation storage problem
  - where?
- Trust in the owner's annotation
  - incompetence
  - spam (Google does not use info)

Solution: dedicated automatic annotation services (“search engine”-like)
Automatic O-based Annotation
- Supervised
  - MnM
  - S-CREAM
  - Melita & AktiveDoc
- Unsupervised
  - SemTag - Seeker
  - Armadillo
  - AeroSWARM
MnM
Ontology-based annotation interface:
- Ontology browser (rich navigation capabilities)
- Document browser (usually a Web browser)
- Annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
- A built-in or external ML component classifies the main corpus of documents

Activity flow:
- Markup (a human user manually annotates a training set of documents with ontology elements)
- Learn (a learning algorithm is run over the marked-up corpus to learn the extraction rules)
- Extract (an IE mechanism is selected and run over a set of documents)
- Review (a human user inspects the results and corrects them if necessary)
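The Markup / Learn / Extract / Review flow can be illustrated with a toy pipeline. This is only a sketch of the control flow: the "learner" below simply memorizes annotated phrases, which is a deliberate stand-in and not how MnM or its IE components actually learn rules. All function names and the sample data are hypothetical.

```python
import re


def learn(marked_up_docs):
    """Learn step: derive 'rules' from the manually marked-up training set.
    Toy stand-in: a rule is a memorized phrase -> concept mapping."""
    rules = {}
    for _text, annotations in marked_up_docs:
        for phrase, concept in annotations:
            rules[phrase.lower()] = concept
    return rules


def extract(rules, text):
    """Extract step: apply the learned rules to an unseen document.
    Returns (position, surface text, concept) tuples in document order."""
    found = []
    for phrase, concept in rules.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((m.start(), m.group(0), concept))
    return sorted(found)


def review(proposed, reject):
    """Review step: the human drops wrong suggestions. 'reject' is a set of
    (surface text, concept) pairs simulating that decision."""
    return [(pos, s, c) for pos, s, c in proposed if (s, c) not in reject]


# Markup step: a hand-annotated training document (illustrative data).
training = [
    ("Michael Jordan plays basketball",
     [("Michael Jordan", "our:Athlete"), ("basketball", "our:Sports")]),
]

rules = learn(training)
proposed = extract(rules, "Fans say Michael Jordan made basketball popular.")
accepted = review(proposed, reject={("basketball", "our:Sports")})
print(proposed)
print(accepted)
```

The point of the separation is that the expensive human work is confined to the first and last steps, while the middle two run unattended over the main corpus.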
Amilcare and T-REX
Amilcare:
- Automatic IE component
- Used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, Ont-O-Mat, SemantiK)
- Released to about 50 industrial and academic sites
- Java API
- Recently succeeded by T-REX
Pankow
Input: a web page.
- Step 1: The web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger is used to find candidate proper nouns).
  Result 1: a set of candidate proper nouns.
- Step 2: The system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns.
  Result 2: a set of hypothesis phrases.
- Step 3: Google is queried for the hypothesis phrases.
  Result 3: the number of hits for each hypothesis phrase.
- Step 4: The system sums up the query results to a total for each instance-concept pair, then categorizes the candidate proper nouns into their highest-ranked concepts.
  Result 4: an ontologically annotated web page.
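Steps 2 to 4 of this procedure can be sketched in a few lines. The sketch below assumes the candidate proper nouns from Step 1 are already available, uses three illustrative patterns (the slide only says "preset linguistic patterns"), and replaces the search engine with a local lookup table of made-up hit counts. `PATTERNS`, `FAKE_HITS`, and all function names are hypothetical.

```python
# Illustrative patterns only; the real pattern set is not given on the slide.
PATTERNS = [
    "{instance} is a {concept}",
    "{concept}s such as {instance}",
    "the {instance} {concept}",
]

# Made-up hit counts standing in for search-engine results.
FAKE_HITS = {
    "Nile is a river": 120,
    "rivers such as Nile": 40,
    "the Nile river": 900,
    "the Nile hotel": 15,
    "Hilton is a hotel": 500,
    "hotels such as Hilton": 60,
    "the Hilton hotel": 800,
    "the Hilton river": 1,
}


def fake_hit_count(phrase):
    """Stand-in for a web query: number of hits for an exact phrase."""
    return FAKE_HITS.get(phrase, 0)


def hypothesis_phrases(instance, concept):
    """Step 2: instantiate every pattern for one instance-concept pair."""
    return [p.format(instance=instance, concept=concept) for p in PATTERNS]


def categorize(candidates, concepts, hit_count):
    """Steps 3 and 4: sum hits per instance-concept pair and keep the
    highest-ranked concept for each candidate proper noun."""
    result = {}
    for instance in candidates:
        totals = {
            c: sum(hit_count(ph) for ph in hypothesis_phrases(instance, c))
            for c in concepts
        }
        best = max(totals, key=totals.get)
        if totals[best] > 0:  # leave unannotated if no pattern ever matched
            result[instance] = best
    return result


annotations = categorize(["Nile", "Hilton"], ["river", "hotel"], fake_hit_count)
print(annotations)
```

Passing the hit-count function as a parameter keeps the ranking logic independent of whichever search service supplies the counts.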
SemTag - Seeker
- IBM-developed
- ~264 million web pages
- ~72 thousand concepts (TAP taxonomy)
- 434 million automatically disambiguated semantic tags

Spotting pass
- Documents are retrieved from the Seeker store and tokenized.
- Tokens are matched against the TAP concepts.
- Each resulting label is saved with ten words to either side as a “window” of context around the particular candidate object.

Learning pass
- A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy.
- The TBD (taxonomy-based disambiguation) algorithm is used.

Tagging pass
- “Windows” are scanned once more to disambiguate each reference and determine a TAP object.
- A record is entered into a database of final results containing the URL, the reference, and any other associated metadata.
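The spotting pass, with its window of context around each match, might look roughly like the following. This is a simplified sketch: it matches single-token labels only, uses whitespace tokenization, and does not implement the TBD disambiguation of the later passes. The label table and the `tap:` identifiers are invented for illustration.

```python
def spot(tokens, labels, window=10):
    """Spotting pass sketch: find tokens that match a known label and keep
    up to `window` tokens of context on each side of the match."""
    spots = []
    for i, tok in enumerate(tokens):
        key = tok.lower()
        if key in labels:
            spots.append({
                "label": tok,
                "candidates": labels[key],            # concepts still ambiguous here
                "left": tokens[max(0, i - window):i],
                "right": tokens[i + 1:i + 1 + window],
            })
    return spots


# Hypothetical label table: one surface form, two candidate concepts.
LABELS = {"jaguar": ["tap:JaguarAnimal", "tap:JaguarCar"]}

text = "The new Jaguar was parked outside the dealership all week"
spots = spot(text.split(), LABELS, window=3)
print(spots)
```

The saved window is what a later disambiguation step would inspect to decide between the candidate concepts, which is why it is stored alongside the label rather than recomputed.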
Conclusions
- Web-document annotation is a necessary thing
- O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
- Manual annotation is a bad thing
- Automatic annotation is a good thing:
  - Supervised O-based annotation:
    - Useful O-based interface for annotating the training set
    - Traditional IE tools for textual classification
  - Unsupervised O-based annotation:
    - COHSE: matches concept names from the ontology and a thesaurus against tokens from the text
    - Pankow: uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
    - SemTag: uses concept names to match tokens, and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment
Questions?