Ontology-based Annotation
Sergey Sosnovsky

Outline
- O-based Annotation
- Conclusion
- Questions

Why Do We Need Annotation
- Annotation-based Services
  - Integration of dispersed information (knowledge-based linking)
  - Better indexing and retrieval (based on the document semantics)
  - Content-based adaptation (modeling document content in terms of a domain model)
- Knowledge Management
  - Organizations' repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
- Collaboration Support
  - Knowledge sharing and communication

What is Added by O-based Annotation
- Ontology-driven processing (effective formal reasoning)
- Connecting to other O-based services (O-mapping, O-visualization, …)
- Unified vocabulary
- Connecting to the rest of SW knowledge

Definition
O-based Annotation is a process of creating a mark-up of Web documents using a pre-existing ontology and/or populating knowledge bases with marked-up documents.

Figure: the sentence “Michael Jordan plays basketball” annotated against an ontology. Classes: our:Athlete, our:Sports (related by our:plays). Instances: Michael Jordan, Basketball (related by our:plays, linked to their classes by rdf:type).
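The annotation in the definition can be written down as a handful of triples. The sketch below is only an illustration: the namespace URI behind the "our:" prefix and the helper names annotate and to_ntriples are assumptions, not part of the slides.

```python
# Hypothetical namespace for the slide's "our:" prefix (an assumption).
OUR = "http://example.org/our#"
# The standard rdf:type property URI.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotate(subject, subject_class, predicate, obj, object_class):
    """Return the triples that annotate one 'subject predicate object' statement."""
    return [
        (OUR + subject, RDF_TYPE, OUR + subject_class),  # instance -> class
        (OUR + obj, RDF_TYPE, OUR + object_class),       # instance -> class
        (OUR + subject, OUR + predicate, OUR + obj),     # instance-level relation
    ]


def to_ntriples(triples):
    """Serialize (s, p, o) URI triples in an N-Triples-like form, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# "Michael Jordan plays basketball"
triples = annotate("Michael_Jordan", "Athlete", "plays", "Basketball", "Sports")
print(to_ntriples(triples))
```

Populating a knowledge base then amounts to storing these triples; marking up the document amounts to attaching them to the text fragment they were derived from.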

List of Tools
Annotation tools: AeroDAML / AeroSWARM, Annotea / Annozilla, Armadillo, AktiveDoc, COHSE, GOA, KIM Semantic Annotation Platform, MagPie, Melita, MnM, OntoAnnotate, Ontobroker, OntoGloss, ONTO-H, Ont-O-Mat / S-CREAM / CREAM, Ontoseek, Pankow, SHOE Knowledge Annotator, Seeker, Semantik, SemTag, SMORE, Yawas, …

Information Extraction tools: Alembic, Amilcare / T-REX, Annie, Fastus, Lasie, Proteus, SIFT, …

Important Characteristics
- Automation of annotation (manual / semiautomatic / automatic / editable)
- Ontology-related issues:
  - pluggable ontology (yes / no)
  - ontology language (RDFS / DAML+OIL / OWL / …)
  - ontology access (local / anywhere)
  - ontology elements available for annotation (concepts / instances / relations / triples)
  - where annotations are stored (in the annotated document / on a dedicated server / where specified)
  - annotation format (XML / RDF / OWL / …)
- Annotated documents:
  - document kinds (text / multimedia)
  - document formats (plain text / HTML / PDF / …)
  - document access (local / web)
- Architecture / Interface / Interoperability
  - standalone tool / web interface / web component / API / …
- Annotation scale (large: the size of the WWW / small: a hundred documents)
- Existing documentation / tutorial availability

SMORE
- Manual annotation
- OWL-based markup
- Simultaneous ontology modification (if necessary)
- ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
- Post-annotation O-based inference

Figure: the “Michael Jordan plays basketball” example (our:Athlete, our:plays, our:Sports; instances Michael Jordan and Basketball linked by our:plays and typed with rdf:type).

Problems of Manual Annotation
- Expensive / Time-consuming
- Difficult / Error-prone
- Subjective (two people annotating the same documents annotate them differently in 15–30% of cases)
- Never-ending
  - new documents
  - new versions of ontologies
- Annotation storage problem
  - where?
- Trust in the owner's annotation
  - incompetence
  - spam (Google does not use info)

Solution: Dedicated Automatic Annotation Services (“Search Engine”-like)
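To make the subjectivity figure concrete, the share of items on which two annotators differ can be computed as below. This is a minimal sketch with invented labels; the function name disagreement_rate and the sample data are assumptions, and the slides do not say how the 15–30% figure was measured.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of items that two annotators labelled differently."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return differing / len(labels_a)


# Invented example: two people annotate the same five text fragments.
annotator_1 = ["Athlete", "Sports", "Athlete", "City", "Sports"]
annotator_2 = ["Athlete", "Sports", "Person", "City", "Sports"]

rate = disagreement_rate(annotator_1, annotator_2)
print(f"{rate:.0%}")  # prints "20%"
```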

Automatic O-based Annotation
- Supervised
  - MnM
  - S-CREAM
  - Melita & AktiveDoc
- Unsupervised
  - SemTag - Seeker
  - Armadillo
  - AeroSWARM

MnM
- Ontology-based annotation interface:
  - Ontology browser (rich navigation capabilities)
  - Document browser (usually a Web browser)
  - Annotation is mainly based on select-and-drag-and-drop association of text fragments with ontology elements
- A built-in or external ML component classifies the main corpus of documents
- Activity flow:
  - Markup (a human user manually annotates a training set of documents with ontology elements)
  - Learn (a learning algorithm is run over the marked-up corpus to learn the extraction rules)
  - Extract (an IE mechanism is selected and run over a set of documents)
  - Review (a human user inspects the results and corrects them if necessary)
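The Markup / Learn / Extract / Review flow can be sketched in a few lines. The learner here is a deliberately naive stand-in (it learns which word tends to precede an annotated token); it is not the rule induction that MnM delegates to its IE component, and all function names and sample sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn(marked_up):
    """Learn: map the word preceding each annotated token to its most frequent concept."""
    votes = defaultdict(Counter)
    for tokens, annotations in marked_up:
        for index, concept in annotations.items():
            if index > 0:
                votes[tokens[index - 1].lower()][concept] += 1
    return {trigger: counts.most_common(1)[0][0] for trigger, counts in votes.items()}


def extract(rules, tokens):
    """Extract: tag every token that directly follows a learned trigger word."""
    return {i + 1: rules[t.lower()]
            for i, t in enumerate(tokens[:-1]) if t.lower() in rules}


def review(proposed, corrections):
    """Review: apply human corrections (token index -> concept, or None to reject)."""
    final = dict(proposed)
    for index, concept in corrections.items():
        if concept is None:
            final.pop(index, None)
        else:
            final[index] = concept
    return final


# Markup: a human annotates token positions with ontology concepts.
training = [
    ("Yesterday Professor Smith visited".split(), {2: "Person"}),
    ("We met Professor Jones in Pittsburgh".split(), {3: "Person", 5: "City"}),
]

rules = learn(training)
tokens = "Professor Brown lives in Boston".split()
proposed = extract(rules, tokens)
final = review(proposed, {})  # the reviewer accepts everything here
print({tokens[i]: concept for i, concept in final.items()})
```

The point of the sketch is the division of labour: the human effort is concentrated in the markup and review steps, while learning and extraction run unattended over the main corpus.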

Amilcare and T-REX
- Amilcare: automatic IE component
- Used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, Ont-O-Mat, SemantiK)
- Released to about 50 industrial and academic sites
- Java API
- Recently succeeded by T-REX

Pankow
- Input: a web page.
- Step 1: The web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger is used to find candidate proper nouns).
  Result 1: a set of candidate proper nouns.
- Step 2: The system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns.
  Result 2: a set of hypothesis phrases.
- Step 3: Google is queried for the hypothesis phrases.
  Result 3: the number of hits for each hypothesis phrase.
- Step 4: The system sums up the query results to a total for each instance-concept pair, then categorizes the candidate proper nouns into their highest-ranked concepts.
  Result 4: an ontologically annotated web page.
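The four steps can be sketched as follows. Several simplifications are assumed: a capitalized-word regular expression stands in for the part-of-speech tagger, the two patterns are illustrative rather than the tool's actual pattern set, and the hit counts come from an invented lookup table passed in as hit_count instead of a real search engine.

```python
import re

# Illustrative linguistic patterns (assumed, not the tool's real pattern set).
PATTERNS = ["{instance} is a {concept}", "the {instance} {concept}"]


def candidate_proper_nouns(text):
    """Step 1: crude stand-in for a POS tagger; runs of capitalized words."""
    found = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)
    return list(dict.fromkeys(found))  # de-duplicate, keep order


def hypothesis_phrases(instance, concept):
    """Step 2: instantiate every pattern for one instance-concept pair."""
    return [p.format(instance=instance, concept=concept) for p in PATTERNS]


def pankow_annotate(text, concepts, hit_count):
    """Steps 3-4: sum hits per instance-concept pair and keep the best concept."""
    annotations = {}
    for instance in candidate_proper_nouns(text):
        totals = {c: sum(hit_count(ph) for ph in hypothesis_phrases(instance, c))
                  for c in concepts}
        best = max(totals, key=totals.get)
        if totals[best] > 0:  # skip candidates with no supporting evidence
            annotations[instance] = best
    return annotations


# Invented hit counts standing in for search-engine results.
FAKE_HITS = {
    "Niger is a country": 120,
    "Niger is a river": 90,
    "the Niger river": 200,
    "Hilton is a hotel": 40,
    "the Hilton hotel": 300,
}

result = pankow_annotate(
    "We flew to Niger and stayed near the Hilton.",
    ["country", "river", "hotel"],
    lambda phrase: FAKE_HITS.get(phrase, 0),
)
print(result)  # {'Niger': 'river', 'Hilton': 'hotel'}
```

With these made-up counts "Niger" lands on "river" rather than "country", which shows how the outcome depends entirely on what the queried corpus happens to say; "We" is picked up as a candidate but dropped because no hypothesis phrase gets any hits.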

SemTag - Seeker
- IBM-developed
- ~264 million web pages
- ~72 thousand concepts (TAP taxonomy)
- 434 million automatically disambiguated semantic tags
- Spotting pass
  - Documents are retrieved from the Seeker store and tokenized.
  - Tokens are matched against the TAP concepts.
  - Each resulting label is saved with ten words to either side as a “window” of context around the particular candidate object.
- Learning pass
  - A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy.
  - The TBD (taxonomy-based disambiguation) algorithm is used.
- Tagging pass
  - “Windows” are scanned once more to disambiguate each reference and determine a TAP object.
  - A record is entered into a database of final results containing the URL, the reference, and any other associated metadata.
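The spotting pass is the easiest part to illustrate: match tokens against concept labels and keep a window of context on each side. The sketch below assumes whitespace tokenization and case-insensitive exact matching, and the label list is invented; the learning and tagging passes (including TBD) are not reproduced.

```python
def spot(tokens, labels, window=10):
    """Return (label, position, left_context, right_context) for each label match."""
    lowered = [t.lower() for t in tokens]
    spots = []
    for label in labels:
        label_tokens = label.lower().split()
        n = len(label_tokens)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == label_tokens:
                left = tokens[max(0, i - window):i]    # up to `window` words before
                right = tokens[i + n:i + n + window]   # up to `window` words after
                spots.append((label, i, left, right))
    return spots


# Invented example; a three-word window keeps the output short.
tokens = "the jaguar is a large cat native to the Americas".split()
spots = spot(tokens, ["Jaguar"], window=3)
print(spots)  # [('Jaguar', 1, ['the'], ['is', 'a', 'large'])]
```

The saved windows are what a later pass would use to decide whether this "jaguar" is the animal, the car, or something else.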

Conclusions
- Web-document annotation is a necessary thing
- O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
- Manual annotation is a bad thing
- Automatic annotation is a good thing:
  - Supervised O-based annotation:
    - Useful O-based interface for annotating the training set
    - Traditional IE tools for textual classification
  - Unsupervised O-based annotation:
    - COHSE: matches concept names from the ontology and a thesaurus against tokens from the text
    - Pankow: uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
    - SemTag: uses concept names to match tokens, and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment

Questions?