1
Populating Ontologies for the Semantic Web
Alexiei Dingli
2
What’s the problem?
3
Towards a solution … (1)
Ask intelligent agents to do the job for us! But they don’t understand the WWW!
4
Towards a solution … (2)
But there is another way to achieve this: supplying the missing semantic information.
"For the Web to reach its full potential, it must evolve into a Semantic Web, providing a universally accessible platform that allows data to be shared and processed by automated tools as well as by people." (W3C Web Guru)
Creating the Semantic Web!
5
Towards a solution … (3)
Why do many believe this solution will fail?
- It requires lots of time and effort
- It needs lots of people willing to do it
- Not everyone can do it
6
Our approaches
- Active learning to reduce the annotation burden: supervised learning, Adaptive IE, the Melita methodology
- Automatic annotation of large repositories: largely unsupervised, Armadillo
7
Adaptive IE
What is AIE?
- Performs the tasks of traditional IE
- Exploits the power of Machine Learning to adapt to complex domains with large amounts of domain-dependent data, different sub-languages, and different text genres
- Treats the usability and accessibility of the system as important
8
Amilcare
- Tool for adaptive IE from Web-related texts
- Specifically designed for document annotation
- Based on the (LP)2 algorithm, a covering algorithm based on lazy NLP
- Trains with a limited number of examples
- Effective on different text types: free texts, semi-structured texts, structured texts
- Uses GATE and ANNIE for preprocessing
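To make the learning step concrete, here is a minimal sketch of training a tagger from a handful of annotated examples. It uses a CRF from the sklearn-crfsuite package as an illustrative stand-in, not Amilcare’s (LP)2 rule induction, and the seminar-announcement snippets are invented for the example.

```python
# Minimal sketch of adaptive-IE-style training from a few annotated examples.
# NOTE: Amilcare uses the (LP)2 rule-induction algorithm; here a CRF from the
# sklearn-crfsuite package stands in as an illustrative learner, not Amilcare's API.
import sklearn_crfsuite

def features(tokens, i):
    # Shallow features of the kind (LP)2 rules test: word form, case, neighbours.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Two hand-annotated seminar-announcement snippets (hypothetical examples).
train = [
    (["Talk", "at", "3", "pm", "in", "Room", "101"],
     ["O", "O", "B-stime", "I-stime", "O", "B-location", "I-location"]),
    (["Seminar", "starts", "at", "4", "pm"],
     ["O", "O", "O", "B-stime", "I-stime"]),
]

X = [[features(toks, i) for i in range(len(toks))] for toks, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)

# Suggest annotations for an unseen sentence, as Melita does later on.
test = ["Lecture", "at", "5", "pm", "in", "Room", "202"]
print(crf.predict([[features(test, i) for i in range(len(test))]]))
```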
9
CMU: detailed results
1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%
10
GATE
General Architecture for Text Engineering: provides a software infrastructure for researchers and developers working in NLP (http://www.gate.ac.uk)
Contains:
- Tokeniser
- Gazetteers
- Sentence Splitter
- POS Tagger
- Semantic Tagger (ANNIE)
- Orthographic Coreference
- Pronominal Coreference
- Multilingual support
- Protégé
- WEKA
Many more exist and can be added. A sketch of this kind of pipeline follows below.
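As an illustration of the kind of processing resources GATE/ANNIE provides (tokeniser, sentence splitter, POS tagger, semantic tagger), here is a sketch using spaCy as a stand-in; GATE itself is a Java framework with its own API, so this is not GATE code.

```python
# Illustrative only: a spaCy pipeline standing in for the GATE/ANNIE components
# named on the slide. This is not GATE's API.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Alexiei Dingli gave a talk on the Semantic Web in Sheffield at 3 pm.")

for sent in doc.sents:                      # sentence splitter
    for token in sent:                      # tokeniser + POS tagger
        print(token.text, token.pos_, sep="\t")

for ent in doc.ents:                        # semantic tagger (named entities)
    print(ent.text, ent.label_)
```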
11
Annotation
Current practice of annotation for knowledge identification and extraction:
- is time consuming
- needs annotation by experts
- is complex
Goal: reduce the burden of text annotation for Knowledge Management.
12
Different Annotation Systems
SGML, TeX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Visual Text, Alembic, Annotea, CritLink, the GATE Annotation Tool, iMarkup, MnM, S-CREAM, Yawas
13
Melita
Tool for assisted automatic annotation. Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system).
- Users annotate document samples
- The IE system trains while users annotate, generalizes over seen cases, provides preliminary annotation for new documents, and performs smart ordering of documents
Advantages:
- Annotates trivial or previously seen cases
- Focuses slow/expensive user activity on unseen cases
- The user mainly validates extracted information
- Simpler and less error prone; speeds up corpus annotation
- The system learns how to improve its capabilities
14
Methodology: Melita, Bootstrap Phase
Bare text: the user annotates; Amilcare learns in the background.
15
Methodology: Melita, Checking Phase
Bare text: the user annotates and Amilcare also annotates, learning in the background from missing tags and mistakes.
16
Methodology: Melita, Support Phase
Bare text: Amilcare annotates and the user corrects; the corrections are used to retrain.
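The three phases can be read as one annotate/learn/suggest/correct loop. The sketch below is a toy version: the learner only memorises seen strings (Melita delegates real learning to Amilcare), the ask_user callback stands for the annotation GUI, and the checking phase, where suggestions are compared against the user’s annotations, is folded into the same cycle.

```python
# Toy sketch of Melita's annotate/learn/suggest/correct cycle (hypothetical
# Learner and ask_user callback; not Melita's or Amilcare's actual API).
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[int, int, str]          # (start offset, end offset, tag)

class MemorisingLearner:
    """Toy learner: remembers the exact strings annotated with each tag."""
    def __init__(self) -> None:
        self.lexicon: Dict[str, str] = {}

    def train(self, text: str, annotations: List[Annotation]) -> None:
        for start, end, tag in annotations:
            self.lexicon[text[start:end]] = tag

    def suggest(self, text: str) -> List[Annotation]:
        hits = []
        for surface, tag in self.lexicon.items():
            pos = text.find(surface)
            if pos >= 0:
                hits.append((pos, pos + len(surface), tag))
        return sorted(hits)

def melita_loop(documents: List[str],
                ask_user: Callable[[str, List[Annotation]], List[Annotation]],
                support_after: int = 2) -> MemorisingLearner:
    learner = MemorisingLearner()
    for i, doc in enumerate(documents):
        # Bootstrap phase: no suggestions are shown, the learner trains silently.
        # Support phase: the learner pre-annotates and the user only corrects.
        suggestions = learner.suggest(doc) if i >= support_after else []
        gold = ask_user(doc, suggestions)
        learner.train(doc, gold)           # corrections retrain the learner
    return learner

if __name__ == "__main__":
    docs = ["The seminar is in Room 101.", "We meet in Room 101 again."]

    def fake_user(doc: str, suggestions: List[Annotation]) -> List[Annotation]:
        start = doc.find("Room 101")       # hypothetical user tagging a location
        return [(start, start + len("Room 101"), "location")]

    print(melita_loop(docs, fake_user, support_after=1).lexicon)
```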
17
Intrusivity
An evolving system is difficult to control.
Goal: avoid unwelcome/unreliable suggestions by adapting proactivity to the user’s needs.
Method: allow users to tune proactivity; monitor user reactions to suggestions.
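The slide does not say how Melita implements this, but one simple mechanism is a user-tunable confidence threshold adjusted from accept/reject feedback, roughly as follows.

```python
# Hypothetical proactivity tuning (not Melita's documented mechanism): only show
# suggestions whose confidence exceeds a user-controlled threshold, and adapt the
# threshold from how the user reacts to past suggestions.
def filter_suggestions(suggestions, threshold):
    """suggestions: list of (annotation, confidence) pairs."""
    return [ann for ann, conf in suggestions if conf >= threshold]

def adapt_threshold(threshold, accepted, rejected, step=0.05):
    """Raise the bar when suggestions are mostly rejected, lower it when accepted."""
    if rejected > accepted:
        return min(1.0, threshold + step)
    if accepted > rejected:
        return max(0.0, threshold - step)
    return threshold

print(filter_suggestions([("stime@3pm", 0.9), ("speaker@?", 0.4)], threshold=0.6))
print(adapt_threshold(0.6, accepted=1, rejected=3))
```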
18
Smart ordering of documents
The system tries to annotate all the remaining documents and selects one with partial annotations; the user annotates it, and the system learns from the annotations (see the sketch below).
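One plausible reading of this selection step, reusing the toy learner interface from the sketch above (the slide does not give Melita’s exact criterion): prefer a document the learner can already partially annotate, since trivial documents teach nothing and completely unknown ones give the user no head start.

```python
# Hypothetical document-ordering heuristic, not Melita's documented criterion.
def pick_next(unannotated_docs, learner):
    # Score each document by how many annotations the current learner can place.
    scored = [(len(learner.suggest(d)), d) for d in unannotated_docs]
    partially_covered = [(n, d) for n, d in scored if n > 0]
    # Fall back to any document if nothing can be partially annotated yet.
    return min(partially_covered or scored)[1]
```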
19
Methodology: Melita
[Screenshot of the Melita interface: the ontology defining concepts, the control panel, and the document panel.]
20
Results

Tag        Texts needed for training   Prec   Rec
stime      20                          84     63
etime      20                          96     72
location   30                          82     61
speaker    100                         75     70
21
Future Work
- Research better ways of annotating concepts in documents
- Optimise document ordering to maximise the discovery of new tags
- Allow users to edit the rules
- Learn to discover relationships
- Not only suggest, but also correct user annotations
22
Annotation for the Semantic Web
The Semantic Web requires document annotation.
Current approaches are manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita).
BUT: manual/semi-automatic annotation of large, diverse repositories containing different and sparse information is unfeasible, e.g. a Web site (So: 1,600 pages).
23
Redundancy
Information on the Web (or in large repositories) is redundant: the same information is repeated in different superficial formats:
- Databases/ontologies
- Structured pages (e.g. produced by databases)
- Largely structured pages (bibliography pages)
- Unstructured pages (free texts)
24
Our Proposal
Largely unsupervised annotation of documents:
- Based on Adaptive Information Extraction
- Bootstrapped using the redundancy of information
Method: use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract), as sketched below.
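A minimal sketch of the bootstrapping idea: strings already known from a structured source (a hard-coded seed list below, standing in for a database or Citeseer) are projected onto free text as annotations, which then become training data for the adaptive IE learner.

```python
# Hedged sketch of redundancy-based bootstrapping: project seed strings from a
# structured source onto free text as annotations. The seed list and page text
# below are invented; a learner (e.g. the CRF sketch earlier) would then train
# on these projected annotations.
import re

seed_names = ["Fabio Ciravegna", "Yorick Wilks"]   # from the structured source

def project_seeds(text, seeds, tag="person"):
    annotations = []
    for seed in seeds:
        for match in re.finditer(re.escape(seed), text):
            annotations.append((match.start(), match.end(), tag))
    return sorted(annotations)

page = "Seminar by Fabio Ciravegna, chaired by Yorick Wilks and a guest speaker."
print(project_seeds(page, seed_names))
# The unmatched "guest speaker" span is exactly what the trained learner must recover.
```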
25
Example: Extracting Bibliographies
Mines web sites to extract bibliographies from personal pages.
Tasks:
- Finding people’s names
- Finding home pages
- Finding personal bibliography pages
- Extracting bibliography references
Sources:
- NE recognition (GATE’s ANNIE)
- Citeseer/Unitrier (largely incomplete bibliographies)
- Google
- Homepagesearch
26
AKT Reference Ontology
- Developed by the AKT partners
- Represents the knowledge used in the CS AKTive Portal testbed
- Consists of several sub-ontologies
- Available in several flavours: DAML+OIL, OWL
- Has 9,000,000 RDF triples
Available at:
- Ontology: http://www.aktors.org/publications/ontology/
- RDF triples: http://triplestore.aktors.org/
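For illustration, an OWL flavour of the ontology can be inspected with rdflib (not an AKT-provided tool); the local file name below is an assumption, since the slide does not name the exact download.

```python
# Sketch: load an RDF/XML (OWL) copy of the ontology with rdflib and query it.
# "portal.owl" is a hypothetical local file name.
from rdflib import Graph

g = Graph()
g.parse("portal.owl", format="xml")
print(len(g), "triples loaded")

# List a few classes defined in the ontology.
results = g.query("""
    SELECT ?cls WHERE { ?cls a <http://www.w3.org/2002/07/owl#Class> } LIMIT 5
""")
for row in results:
    print(row.cls)
```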
27
Mining Web sites (1)
Mines the site looking for people’s names:
- Uses generic patterns (NER) and Citeseer for likely bigrams
- Looks for structured lists of names
- Annotates known names
- Trains on the annotations to discover the HTML structure of the page
- Recovers all names and hyperlinks (see the sketch below)
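A hedged sketch of the "train on annotations to discover the HTML structure" step: find which HTML element pattern wraps the names that are already known, then harvest every element matching that pattern. It uses BeautifulSoup and a made-up page; Armadillo trains an adaptive IE learner instead, so this only illustrates the idea.

```python
# Illustration only: induce the wrapping HTML pattern from known names, then
# extract every element with the same pattern (including unseen names).
from bs4 import BeautifulSoup
from collections import Counter

html = """
<ul>
  <li><a href="/fabio">Fabio Ciravegna</a></li>
  <li><a href="/yorick">Yorick Wilks</a></li>
  <li><a href="/david">David Guthrie</a></li>
</ul>
"""
known_names = {"Fabio Ciravegna", "Yorick Wilks"}   # seeds from Citeseer/NER

soup = BeautifulSoup(html, "html.parser")

# 1. Find the (tag, parent) pattern that wraps the known names.
patterns = Counter()
for el in soup.find_all(True):
    if el.get_text(strip=True) in known_names and not el.find(True):  # leaf tags only
        patterns[(el.name, el.parent.name)] += 1
tag, parent = patterns.most_common(1)[0][0]

# 2. Extract every element matching that pattern, recovering names and hyperlinks.
for el in soup.find_all(tag):
    if el.parent.name == parent:
        print(el.get_text(strip=True), el.get("href"))
```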
28
Experimental Results (1): People
Discovering who works in the department using Information Integration.
Total present in site: 129
Using generic patterns + online repositories: 48 correct, 3 wrong
Precision: 48 / 51 = 94 %
Recall: 48 / 129 = 37 %
F-measure: 51 %
Errors: A. Schriffin, Eugenio Moggi, Peter Gray
29
Experimental Results (2): People
Using Information Extraction.
Total present in site: 129
96 correct, 9 wrong
Precision: 96 / 105 = 91 %
Recall: 96 / 129 = 74 %
F-measure: 87 %
Errors (non-names extracted as people): Speech and Hearing, European Network, Department Of, Position Paper, The Network To, System
30
Mining Web sites (2)
- Annotates known papers
- Trains on the annotations to discover the HTML structure
- Recovers co-authoring information
31
Experimental Results (1): Papers
Discovering publications in the department using Information Integration.
Total present in site: 320
Using generic patterns + online repositories: 151 correct, 1 wrong
Precision: 151 / 152 = 99 %
Recall: 151 / 320 = 47 %
F-measure: 64 %
Errors: garbage in the database!

@misc{ computer-mining,
  author = "Department Of Computer",
  title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
  url = "citeseer.nj.nec.com/582939.html"
}
32
Experimental Results (2): Papers
Using Information Extraction.
Total present in site: 320
214 correct, 3 wrong
Precision: 214 / 217 = 99 %
Recall: 214 / 320 = 67 %
F-measure: 80 %
Errors: wrong boundaries in the detection of paper names; names of workshops mistaken for paper names.
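These figures follow from the standard definitions; the quick check below uses the slide’s own counts and takes the F-measure as the balanced F1 = 2PR / (P + R).

```python
# Quick check of the reported paper-extraction figures (slide's own numbers,
# rounded to whole percentages).
correct, wrong, total_in_site = 214, 3, 320

precision = correct / (correct + wrong)              # 214 / 217 ~ 0.99
recall = correct / total_in_site                     # 214 / 320 ~ 0.67
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.80

print(f"P={precision:.0%}  R={recall:.0%}  F={f1:.0%}")
```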
33
User Role
Providing …
- A URL
- A list of services: either already wrapped (e.g. Google is in the default library) or wrappers trained using examples
- Examples of fillers (e.g. project names)
In case …
- Correcting intermediate results
- Reactivating Armadillo when paused
34
Armadillo
- Library of known services (e.g. Google, Citeseer)
- Tools for training learners for other structured sources
- Tools for bootstrapping learning from un/structured sources, with no user annotation
- Multi-strategy acquisition of information using redundancy
- User-driven revision of results, with re-learning after user correction
35
Rationale
Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources.
Uses:
- Ontology population (see the sketch below)
- Information highlighting
- Document enrichment
- Enhancing the user experience
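A sketch of the ontology-population use: turning extracted facts into RDF triples with rdflib. The namespace URI is made up for illustration; the author and title come from the earlier bibliography example.

```python
# Illustration of ontology population: extracted facts become RDF triples.
# The http://example.org/dept# namespace is hypothetical, not the AKT ontology.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/dept#")
g = Graph()

person = EX["Fabio_Ciravegna"]
paper = EX["mining-web-sites"]
g.add((person, RDF.type, EX.Person))
g.add((paper, RDF.type, EX.Paper))
g.add((paper, EX.hasAuthor, person))
g.add((paper, EX.title, Literal("Mining Web Sites Using Adaptive Information Extraction")))

print(g.serialize(format="turtle"))
```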
36
Data Navigation (1)
37
Data Navigation (2)
38
Data Navigation (3)
39
What’s so new about Armadillo?
In other systems …
- User-defined examples are used
- Generic patterns are used that work independently of the site
In our system …
- We also make use of generic patterns and some user-defined examples
- We learn page-specific patterns
- And we integrate information from different sources
40
IE for SW: The Vision
Automatic annotation services:
- For a specific ontology
- Constantly re-indexing/re-annotating documents
- A semantic search engine
Effects:
- No annotation stored in the document, just as today’s indexes are not stored in the documents
- No legacy with the past: annotation with the latest version of the ontology
- Multiple annotations for a single document
- Simplified maintenance (e.g. a page changed but not yet re-annotated)
41
Links
- Melita: http://nlp.shef.ac.uk/melita/
- Armadillo: http://nlp.shef.ac.uk/armadillo/
- Amilcare: http://nlp.shef.ac.uk/amilcare/
- GATE: http://www.gate.ac.uk
- AKT Reference Ontology: http://www.aktors.org/publications/ontology/
- AKT 3Store: http://triplestore.aktors.org/
- More than 40 semantic web technologies: http://www.aktors.org/technologies/
Most of them can be freely downloaded; they range from IE tools, semantic portals and annotation tools to semantic web services, dialogue systems, etc.