After OWL: de facto standards for semantic technologies (or: what do you get for €40m EU research money?)


After OWL: de facto standards for semantic technologies (or: what do you get for €40m EU research money?)
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Wim Peters, Niraj Aswani, Milena Yankova, Yaoyong Li, Akshay Java, Michael Dowman
ILASH workshop, March 2004

2(24) Structure of the talk
– Context:
  – increasing use of “semantic” technology in IT
  – the role(s) of human language technology
  – substantial investment in the next phase of semantic web research
– Semantic Web: moving on from formal standards
– Acronym soup: GATE: HLT API 4 SDK SW & KT
– An application: Ontology-Based IE in KIM
– Issues in API design, next steps

3(24) The Knowledge Economy and Human Language
– Gartner, December 2002:
  – taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  – through 2012 more than 95% of human-to-computer information input will involve textual language
– A contradiction:
  – to deal with the information deluge we need formal knowledge in semantics-based systems
  – our information spaces are in informal and ambiguous natural language
– The challenge: to reconcile these two phenomena

4(24) HLT: Closing the Loop
[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; CLIE; OIE; (M)NLG; Controlled Language; Semantic Web; Semantic Grid; Semantic Web Services]
KEY
– MNLG: Multilingual Natural Language Generation
– OIE: Ontology-aware Information Extraction
– AIE: Adaptive IE
– CLIE: Controlled Language IE

5(24) SEKT: Semantic Knowledge Technology
– 6th framework IP project
– Duration: 36 months from 1/1/2004, €12.5m
– Improve automation of ontology and metadata generation
– Develop highly-scalable solutions
– Research sound inferencing despite inconsistent models
– Develop semantic knowledge access tools
– Develop methodology for deployment

6(24) PrestoSpace (20th Century Rot)
– 20th Century audio-visual media is rapidly disappearing
– Preservation and restoration are high cost
– The costs must be justified by increased access
– “Metadata”: descriptive information about content
– PrestoSpace (€9m IP, 40 months from 02/04):
  – rich metadata and semantic access
  – cross-lingual access
  – syndicated delivery
  – repurposeable content

7(24) The “SDK” research cluster
– “Building the European Research Area” in KM through collaboration with related IP and NoE projects in this area for a coordinated impact strategy
– SEKT, DIP, KnowledgeWeb – SDK cluster:
– Other related projects:
  – AceMedia IP (semantic knowledge systems)
  – PrestoSpace IP (cultural heritage / digital libraries)
  – BRICKS IP (cultural heritage / digital libraries)
– Total EU/6FP investment in semantic tech. research €40m: potential to influence the emergence of de facto standards

8(24) Next step for Semantics tech: from formal to de facto standards?
– Computer scientists love standards, so we have many
– For any given problem there are usually 3 “standards”
– OWL is no exception: Lite, DL, Full
– There are good reasons, but cf. RDF(S) implementation history: applications will of necessity mix and match
– If we can achieve standard practice and libraries in applications we will have made a next step and will promote takeup
– (Pathological) example: TCP/IP vs. OSI

9(24) HLT API 4 SDK SW & KT
– What sorts of software do we need?
  – Ontology and metadata management: storage; versioning; caching; inferencing; etc. (below)
  – Human language technology components and services (not monolithic systems, not unproven research prototypes)
– The role of measurement in scaling and robustness: in HLT this means MUC, TREC, ACE, TIDES, ...
– Here’s one we baked earlier....

10(24) GATE (the Volkswagen Beetle of Language Processing) is:
– Eight years old, with the largest user constituency of its type
– An architecture: a macro-level organisational picture for LE software systems.
– A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
– A development environment: for language engineers, computational linguists et al, a graphical development environment.
– Some free components... and wrappers for other people's components
– Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
– Free software (LGPL). Download at

11(24) Critical mass: thousands of people, hundreds of sites
GATE team projects. Past:
– Conceptual indexing: MUMIS: automatic semantic indices for sports video
– MUSE, cross-genre entity finder
– HSL, Health-and-safety IE
– Old Bailey: collaboration with HRI on 17th century court reports
– Multiflora: plant taxonomy text analysis for biodiversity research e-science
– EMILLE: S. Asian language corpus
– ACE / TIDES: Arabic, Chinese NE
– JHU summer w/s on semtagging
Present:
– Advanced Knowledge Technologies: €12m UK five site collaborative project
– ETCSL: Sumerian digital library
– MiAKT: medical informatics / AKT
– SEKT: Semantic Knowledge Tech
– PrestoSpace: AV Preservation
– KnowledgeWeb; h-TechSight
GATE users = significant proportion of community. A small sample:
– the American National Corpus project
– the Perseus Digital Library project, Tufts University, US
– Longman Pearson publishing, UK
– Merck KGaA, Germany
– Canon Europe, UK
– Knight Ridder, US
– BBN (leading HLT research lab), US
– SMEs: Melandra, SG-MediaStyle, ...
– Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
– UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, CubReporter, Poesia...

12(24) Architectural principles
– Non-prescriptive, theory neutral (strength and weakness)
– Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka, interoperation with SCHUG in MUMIS)
– (Almost) everything is a component, and component sets are user-extendable
– (Almost) all operations are available both from API and GUI
Why does this matter? It means that GATE works well with other tools, embeds easily, and achieves robustness through focus (API requirements)

13(24) All the world’s a Java Bean....
– CREOLE: a Collection of REusable Objects for Language Engineering
– GATE components: modified Java Beans with XML configuration
– The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
– Why bother? Allows the system to load arbitrary language processing components
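
The idea of loading arbitrary components from a small declarative descriptor can be sketched as follows. This is a toy in Python, not GATE's Java API: the descriptor element names (COMPONENTS, RESOURCE, NAME, CLASS), the `WhitespaceTokeniser` class and the `load_components` helper are all hypothetical and chosen only for illustration.

```python
import importlib
import xml.etree.ElementTree as ET


class WhitespaceTokeniser:
    """Toy processing component: a plain class with a no-argument constructor."""

    def execute(self, text):
        return text.split()


# Hypothetical descriptor; the element names are illustrative, not a real schema.
DESCRIPTOR = """
<COMPONENTS>
  <RESOURCE>
    <NAME>Toy tokeniser</NAME>
    <CLASS>WhitespaceTokeniser</CLASS>
  </RESOURCE>
</COMPONENTS>
"""


def load_components(xml_text, namespace=None):
    """Return {name: instance} for every RESOURCE listed in the descriptor."""
    namespace = globals() if namespace is None else namespace
    components = {}
    for res in ET.fromstring(xml_text).iter("RESOURCE"):
        name = res.findtext("NAME").strip()
        class_path = res.findtext("CLASS").strip()
        if "." in class_path:
            # Dotted path: import the module, then look the class up in it.
            module_name, _, class_name = class_path.rpartition(".")
            cls = getattr(importlib.import_module(module_name), class_name)
        else:
            # Bare name: resolve it among the classes already defined here.
            cls = namespace[class_path]
        components[name] = cls()
    return components


registry = load_components(DESCRIPTOR)
tokens = registry["Toy tokeniser"].execute("GATE loads components by name")
```

The host never names the component class in its own code; adding a component means adding a class and a descriptor entry, which is the property the slide is pointing at.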

14(24) [Architecture diagram; recoverable labels listed by layer]
NOTES
– everything is a replaceable bean
– all communication via fixed APIs
– low coupling, high modularity, high extensibility
NOTES (2)
– e.g. Protégé LR & VR both wrapped in Res. (bean) API
– ontology repositories and inference are the same: KAON + Sesame + Orenge + ?
Layers
– Input documents: HTML docs, RTF docs, XML docs, PDF docs, …
– Document Format Layer (LRs): XML Document Format, HTML Document Format, PDF Document Format, …
– DataStore Layer: XML, Oracle, PostgreSQL, .ser
– Corpus Layer (LRs): Corpus, Document, Document Content, Annotation Set, Annotation, Feature Map
– GATE APIs
– Processing Layer (PRs): NE, Co-ref, TEs, TRs, POS, …
– Language Resource Layer (LRs): Ontology, Protégé Ontology, WordNet, Gazetteers, ...
– Application Layer: ANNIE, OBIE, …
– IDE GUI Layer (VRs): ADiff, OntolVR, DocVR, ...
– Web Services

15(24) Issues (1): a common HLT API
– OGSA, WSMO in the web services layer?
– Eclipse: less code for us, more services for users? (A free OWL/UML drawing tool, for example)
– ISO TC37/SC4: JNLE special; LIRICS consortium

16(24) API Application: Ontology-based IE
Example text: “XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”
[Diagram: Ontology & KB. Labels: Company, type, HQ, establOn, City, Country, Location, partOf, “03/11/1978”, XYZ, London, UK, Bulgaria]
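
One way to picture the output of ontology-based IE for the XYZ example is as a set of (subject, predicate, object) triples. The sketch below is an assumed reading of the diagram's labels: the exact edges were lost in extraction, so which label connects to which is a guess, and the helper names `objects` and `part_of_chain` are hypothetical.

```python
# Assumed reading of the slide's diagram; the edges are a guess, not the original figure.
TRIPLES = {
    ("XYZ", "type", "Company"),
    ("XYZ", "HQ", "London"),
    ("XYZ", "establOn", "03/11/1978"),
    ("London", "type", "City"),
    ("London", "partOf", "UK"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
}


def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is in the store."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def part_of_chain(place, triples=TRIPLES):
    """Follow partOf links upwards from a place, stopping on a cycle or dead end."""
    chain, seen, current = [], {place}, place
    while True:
        parents = objects(current, "partOf", triples)
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


headquarters = objects("XYZ", "HQ")        # where XYZ is based
containing = part_of_chain("London")       # regions that contain London
```

The point of the representation is that a question like "which country is XYZ based in?" becomes a short walk over the graph instead of a search over text.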

17(24) Classes, instances & metadata
Example text: “Gordon Brown met George Bush during his two day visit.”
[Diagram residue, in extraction order]
– Ontology: Entity; Person; …; Job-title: president, chancellor, minister, …; instance G.Brown
– Panel labels: Classes+instances before; Classes+instances after
– Metadata: Bush; 1.html; 0 12; Gordon Brown; …#Person; …#Person; George Bush; …#Person; …#Person67890

18(24) OBIE in KIM
Popov et al. KIM. ISWC’03
– An ontology (KIMO) and 200K instances KB
– High ambiguity of instances with the same label – uses disambiguation step
– Lookup phase marks mentions from the ontology
– Combined with GATE-based IE system to recognise new instances of concepts and relations
– KB enrichment stage where some of these new instances are added to the KB
– Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
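
A minimal sketch of priority ordering by corpus statistics, assuming the statistic is a simple mention count per instance. This is not KIM's actual algorithm, which the slide does not spell out; the instance identifiers and counts below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical knowledge base rows: (instance id, label, corpus mention count).
# The counts are made up; they are not real corpus statistics.
KB = [
    ("Paris_France", "Paris", 9500),
    ("Paris_Texas", "Paris", 120),
    ("Cambridge_UK", "Cambridge", 3100),
    ("Cambridge_MA", "Cambridge", 2700),
]


def build_ranking(kb):
    """Map each label to its candidate instances, most frequent first."""
    by_label = defaultdict(list)
    for instance_id, label, count in kb:
        by_label[label].append((count, instance_id))
    return {
        label: [iid for _, iid in sorted(cands, key=lambda c: (-c[0], c[1]))]
        for label, cands in by_label.items()
    }


def disambiguate(label, ranking):
    """Pick the top-ranked instance for a label, or None if the label is unknown."""
    candidates = ranking.get(label, [])
    return candidates[0] if candidates else None


ranking = build_ranking(KB)
best_paris = disambiguate("Paris", ranking)
```

A ranking like this only gives a default choice; context from the surrounding text would still be needed to override it when the less frequent reading is the right one.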

19(24) OBIE in KIM (2) Popov et al. KIM. ISWC’03

20(24) Next steps in OBIE
– KIM demo...
– Continue to exploit the pluggability and community effects of GATE (and Sesame, Lucene, ...)
– SWAN: Semantic Web Annotator at DERI/Galway
– Syndication
– Social networking
– Evaluation (below)

21(24) (The “P” in OLP) Challenge: Evaluating Richer NE Tagging
– Need for new metrics when evaluating hierarchy/ontology-based NE tagging
– Need to take into account distance in the hierarchy
– Tagging a company as a charity is less wrong than tagging it as a person
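
One possible way to make "less wrong" measurable is to score a predicted class by its path distance from the correct class in the hierarchy. The sketch below is an assumption, not the metric the talk goes on to propose: the five-class hierarchy and the `1 / (1 + distance)` scoring function are both invented for illustration.

```python
# Hypothetical class hierarchy: child -> parent (None marks the root).
PARENT = {
    "Entity": None,
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Path from a class up to the root, starting with the class itself."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def distance(a, b):
    """Number of edges between two classes via their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_up_a, node in enumerate(path_a):
        if node in path_b:
            return steps_up_a + path_b.index(node)
    raise ValueError("classes do not share a root")


def score(gold, predicted):
    """1.0 for an exact match, decaying as the predicted class gets further away."""
    return 1.0 / (1.0 + distance(gold, predicted))


charity_error = score("Company", "Charity")   # sibling class under Organization
person_error = score("Company", "Person")     # different branch of the hierarchy
```

Under this scoring a sibling-class error keeps partial credit, whereas a flat exact-match metric would count both errors as equally wrong.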

22(24) SW IE Evaluation tasks
– Detection of entities and events, given a target ontology of the domain.
– Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated “Cambridge” in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
– Decision when a new instance needs to be added to the ontology, because the text contains a new instance that does not already exist in the ontology.

23(24) Issues (2): a common OMM API
Two design approaches:
A. the “richest set of features” approach: pool experience, cover all the bases, be relevant to very many users (“top-down”)
B. the “highest common factors” approach: analyse software, pick common features, create pluggability layer (“bottom-up”)
– Both useful; can be combined
– Approach B has some key advantages:
  – leads to quicker version 1.0
  – minimises arguments (the criterion is that a feature exists in several systems, not that it is “good”)
– Problems:
  – features present several places but not all – “operation not supported”?
  – new work not prefigured in version 1.0 – roadmaps, placeholders
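
Approach B and its first problem can be sketched in a few lines. The system names, feature names and the `min_systems=2` threshold below are all hypothetical; the sketch only shows how a "feature exists in several systems" criterion yields an API in which some back ends must answer "operation not supported".

```python
from collections import Counter

# Hypothetical back ends and the features each one offers.
SYSTEMS = {
    "system_a": {"store", "query", "versioning", "caching"},
    "system_b": {"store", "query", "inference"},
    "system_c": {"store", "versioning", "inference"},
}


class OperationNotSupported(Exception):
    """Raised when the common API has a feature that this back end lacks."""


def common_features(systems, min_systems=2):
    """Features offered by at least `min_systems` back ends."""
    counts = Counter(f for feats in systems.values() for f in feats)
    return {f for f, n in counts.items() if n >= min_systems}


def call(system, feature, systems=SYSTEMS):
    """Dispatch a feature through the common layer to one back end."""
    if feature not in common_features(systems):
        raise KeyError(f"{feature!r} is not part of the common API")
    if feature not in systems[system]:
        raise OperationNotSupported(f"{system} does not implement {feature!r}")
    return f"{system}:{feature}"


api = common_features(SYSTEMS)
result = call("system_b", "query")
```

Here "caching" stays out of the common API because only one system has it, while "inference" is in the API even though `system_a` cannot provide it, which is exactly the case the slide flags.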

24(24) The end Tutorial on HLT for the Semantic Web at European Semantic Web Symposium: These slides: More information:

25(24) What’s the difference between Tony Blair and Mother Teresa?
– There’s good news and bad news...
– The good news: the Semantic Web is now a major focus of some of the world leaders in AI research
– The bad news: AI always fails (Or: what succeeds doesn’t get called AI any more)
– How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (It doesn’t: it has no sense of irony!)
– Needed: clever applications of simple semantics (contrast the success of RSS or DC with more complex schemes)
– De facto standards when we do the simple stuff robustly and in the large