Experiences with UIMA from a User’s Perspective
Dietmar Rösner, Manuela Kunze, Hany Mahgoub
University of Magdeburg, Knowledge Based Systems and Document Processing



Overview
- Introduction
- GATE
- UIMA
- Conclusion

Introduction
"IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies."
November 2005; Version of UIMA is available

Introduction
really?

Introduction
similarity/comparison of GATE and UIMA
  - frameworks
  - results are documents + annotations
  - pipeline
processing steps:
  - task definition
  - one corpus

Evaluation Topics/Points
ease of getting acquainted with the system?
  - quality of documentation: completeness, clarity, up-to-dateness, …?
  - tutorials, use cases, …?
processing and linguistic resources?
  - lexica, gazetteer lists, tools
tools for resource maintenance and extension?
  - quality: self-explanatory, robust, comfortable
speed of processing?
single docs vs. large corpora?
limitations, suggestions for improvement?
support for im-/export of a variety of document formats?

Task of the Experiment
process a corpus of websites
  - to detect and extract information relevant for tourists: opening times of museums, prices of hotels, …
corpus:
  - 30 tourism web sites of Egypt
  - additional 20 web sites of Washington, New York, London
output:
  - Prolog facts for a reasoner
  - Questions: Which museum is now open? …

Excerpts from the Corpus
The Egyptian Museum is open the hours: 9am-5pm daily
The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm
Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)
10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri
…
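
Once opening hours like those in the excerpts are extracted, the question "Which museum is now open?" reduces to an interval check. The following is only a minimal sketch in Python, not the authors' Prolog reasoner; the `OPENING` table and the function `open_at` are hypothetical names, and the Military Museum entry assumes its summer hours.

```python
from datetime import time

# Opening hours read off the corpus excerpts (assumption: summer hours
# are used for the Military Museum).
OPENING = {
    "The Egyptian Museum": (time(9, 0), time(17, 0)),
    "The Military Museum": (time(8, 0), time(17, 30)),
}

def open_at(now, opening=OPENING):
    """Return the names of museums whose opening interval contains `now`."""
    return sorted(
        name for name, (start, end) in opening.items() if start <= now < end
    )

print(open_at(time(8, 30)))   # only the Military Museum has opened by 8:30
print(open_at(time(12, 0)))   # both museums are open at noon
```

A reasoner working on Prolog facts would express the same comparison as a rule over the fact arguments; the Python form is used here only to keep the illustration runnable.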

Overview
- Introduction
- GATE
- UIMA
- Conclusion

GATE: General Architecture for Text Engineering
- a suite of tools for language processing and information extraction
- rule-based modular IE system (ANNIE)
- language and domain-independent processing resources
- open and extensible architecture
- aims to provide uniform access to various linguistic and ontological resources

GATE: General Architecture for Text Engineering
a software infrastructure for NLP researchers, based on three main elements:
  - an architecture describing the components composing a language processing system
  - a framework that can be used as a basis for building such systems
  - a graphical development environment
a set of tools and components for language engineers

GATE: General Architecture for Text Engineering
GATE is distributed with an IE system called ANNIE
  - relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language
  - comprises a set of core Processing Resources (PRs):
    - Tokeniser
    - Gazetteers
    - POS tagger
    - Sentence Splitter
    - Semantic Tagger (JAPE transducer)
    - Orthomatcher (orthographic coreference)
    - …

GATE: ANNIE
[Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]

GATE Application
several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended Gazetteer lists), JAPE Transducer
pipeline: ANNIE English Tokenizer, Gazetteer lists, JAPE Transducer, ...
  - Gazetteer lists: names of museums, fragments of times and restrictions
  - JAPE rules: to annotate intervals of times and restrictions, museum
example input: *The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm …
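
The gazetteer step on this slide can be pictured as a phrase lookup that marks spans of text with a type. The sketch below is plain Python and not GATE's Hash Gazetteer; `gazetteer_lookup` is a hypothetical helper, and the single entry "Museum" with major type `org_base` is an assumption borrowed from the JAPE rule shown in this deck.

```python
def gazetteer_lookup(text, entries):
    """Return (start, end, major_type) for every gazetteer phrase found in text.

    `entries` maps a phrase to its major type; matching ignores case.
    """
    hits = []
    lowered = text.lower()
    for phrase, major_type in entries.items():
        needle = phrase.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append((start, start + len(needle), major_type))
            start = lowered.find(needle, start + 1)
    return sorted(hits)

# Hypothetical one-entry gazetteer list.
entries = {"Museum": "org_base"}
text = "The Military Museum Summer: 8am-5:30pm"
print(gazetteer_lookup(text, entries))  # [(13, 19, 'org_base')]
```

A later rule can then combine such lookup spans with token and time annotations, which is the role the JAPE Transducer plays in the pipeline above.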

Museum information in JAPE

    Rule: egyptmuseums
    (
      ({SpaceToken})
      ({Token.kind == word})
      ({SpaceToken})
      {Lookup.majorType == org_base}   // from gazetteer lists
      ({SpaceToken})?
      (({Token.kind == punctuation}) | ({Token.kind == word}) | ({SpaceToken}))*
      ({timeinfo})                     // annotation by JAPE transducer
    ):museum
    -->
    :museum.sight = {rule = "egyptmuseums"}

timeinfo is defined by JAPE rules and detects patterns like:
  - 9am-5pm, 6pm-9pm
  - 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
  - 5:00PM-7:00PM, 10:00am-5:00pm
  - …
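
The timeinfo patterns listed on this slide can also be approximated with one regular expression. This is a hedged sketch in Python, not the authors' JAPE rules; the names `TIME`, `INTERVAL` and `find_intervals` are made up for the illustration.

```python
import re

# One clock time: hour, optional ":minutes", then am/pm in either case.
TIME = r"\d{1,2}(?::\d{2})?\s?[ap]m"
# An interval is two clock times joined by a hyphen, with optional spaces.
INTERVAL = re.compile(rf"({TIME})\s?-\s?({TIME})", re.IGNORECASE)

def find_intervals(text):
    """Return a (start, end) pair for every time interval found in text."""
    return [(m.group(1), m.group(2)) for m in INTERVAL.finditer(text)]

sample = "Summer: 8am-5:30pm; winter: 8am-4:30pm"
print(find_intervals(sample))  # [('8am', '5:30pm'), ('8am', '4:30pm')]
```

Unlike a JAPE rule, this works on raw characters rather than on Token annotations, so it cannot be combined directly with gazetteer lookups.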

GATE: Presentation of Results
- type and location of every extracted annotation on the document
- Annotations
- Museums Information

GATE: Results
information annotated in the documents:
  - names of museums, hotels
  - names of tourist places in Egypt
  - times, time intervals
  - time restrictions
  - prices, intervals of prices (hotel prices and museum prices)
  - names of pharaohs, queens

GATE: Evaluation (documentation)
- good
- illustrative examples (tutorial), but not enough, especially about JAPE rules
- can be used without knowledge of Java programming
- but experience with Java programming is an advantage when using Java in JAPE rules

GATE: Evaluation (processing and linguistic resources)
- many processing resources available (ANNIE):
  - tokenisers
  - POS taggers
  - parsers
  - gazetteers
  - sentence splitter
  - …
- additional PRs:
  - gazetteer collector
  - PRs for Machine Learning
  - various exporters
  - annotation set transfer etc.

GATE: Evaluation (tools for resource maintenance and extension)
- editor for gazetteer lists
- corpus manager
- text editor and debugger for JAPE rules

GATE: Evaluation (speed of processing)
- there is no measurement of processing time in the GATE tool

GATE: Evaluation (single docs vs. large corpora)
- corpus pipeline vs. document pipeline

GATE: Evaluation (limitations, suggestions for improvement)
- no limitations:
  - everything is possible, but you need not implement it all yourself
- for a start:
  - processing and linguistic resources are available within the distribution

GATE: Evaluation (im-/export of document formats)
- import:
  - supports a variety of document formats: HTML, rtf, , SGML and plain text
  - in all cases the format is analysed and converted into a single unified model of annotation
- export:
  - documents, corpora and annotations in databases of various sorts
  - required: Java application (CREOLE)

Overview
- Introduction
- GATE
- UIMA
- Conclusion

UIMA: Unstructured Information Management Architecture
a software architecture for developing and deploying unstructured information management (UIM) applications
UIM application: a software system
  - analyses large volumes of unstructured information to discover, organize, and deliver relevant knowledge to the end user
software architecture which specifies
  - component interfaces, data representations, …

UIMA: Unstructured Information Management Architecture
- Collection Reader: … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
- CAS Initializer: … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from tags in the original HTML) into the CAS.
- Analysis Engine: … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
- CAS Consumers: … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure
CPM: Collection Processing Manager
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
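
The roles described on this slide form a data flow: a reader produces CASes, engines enrich them, a consumer turns them into an application structure. The following toy model in plain Python only illustrates that flow; it does not use the UIMA SDK, and `Cas`, `collection_reader`, `time_engine`, `consumer` and `run_pipeline` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Cas:
    """Toy stand-in for a CAS: text plus (type, begin, end) annotations."""
    text: str
    annotations: list = field(default_factory=list)

def collection_reader(raw_docs):
    """Yield one Cas per document; de-tagging plays the CAS Initializer role."""
    for raw in raw_docs:
        yield Cas(text=re.sub(r"<[^>]+>", "", raw))

def time_engine(cas):
    """Analysis Engine role: add a 'Time' annotation for each clock time."""
    for m in re.finditer(r"\d{1,2}(?::\d{2})?[ap]m", cas.text):
        cas.annotations.append(("Time", m.start(), m.end()))
    return cas

def consumer(cases):
    """CAS Consumer role: collect the annotated strings of each document."""
    return [[c.text[b:e] for _, b, e in c.annotations] for c in cases]

def run_pipeline(raw_docs, engines):
    """Pass every Cas through the engines in order, then hand all to the consumer."""
    enriched = []
    for cas in collection_reader(raw_docs):
        for engine in engines:
            cas = engine(cas)
        enriched.append(cas)
    return consumer(enriched)

print(run_pipeline(["<p>Open 9am-5pm daily</p>"], [time_engine]))  # [['9am', '5pm']]
```

In UIMA itself the components are Java classes wired together through XML descriptors; the list of engines passed to `run_pipeline` stands in for an aggregate here.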

UIMA: Unstructured Information Management Architecture
Analysis Engine (AE):
  - a component that analyzes artifacts (e.g. documents) and infers information about them
  - consists of two parts: Java classes (typically packaged as one or more JAR files) and AE descriptors (one or more XML files)
the descriptors contain:
  - the configuration settings for the Analysis Engine as well as
  - a description of the AE’s input and output requirements.
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]

UIMA Application
several annotators (like a pipeline): museum pattern, time pattern, interval of times, restrictions, museum information, ...
techniques used by the annotators:
  - regular expressions
  - window covering two time intervals and a restriction
  - window covering a museum and opening hours
example input: *Fraunces Tavern Museum* 54 Pearl St Tuesday-Friday, 12pm-5pm; …
Prolog facts:
  museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
  museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
  museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
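
The last step of this pipeline, turning a detected interval into a Prolog fact, can be sketched as follows. This is an illustration in Python under stated assumptions, not the authors' CAS Consumer: the helper names `to_24h` and `museum_fact` are hypothetical, and the date part that precedes `T` in the facts is left out because it is missing in the transcript.

```python
import re

CLOCK = re.compile(r"(\d{1,2})(?::(\d{2}))?\s?([ap])m", re.IGNORECASE)

def to_24h(clock):
    """Convert a clock string such as '12pm' or '5:30pm' to 'HH:MM:SS'."""
    m = CLOCK.fullmatch(clock.strip())
    if m is None:
        raise ValueError(f"not a clock time: {clock!r}")
    hour = int(m.group(1)) % 12          # 12am and 12pm both start from 0
    if m.group(3).lower() == "p":
        hour += 12                       # afternoon hours: 12pm -> 12, 5pm -> 17
    minute = int(m.group(2) or 0)
    return f"{hour:02d}:{minute:02d}:00"

def museum_fact(name, start, end):
    """Build one Prolog fact; the predicate name museumopen follows the slide."""
    atom = name.replace("'", "''")       # a doubled quote escapes ' in Prolog atoms
    return f"museumopen('{atom}', '{to_24h(start)}', '{to_24h(end)}')."

print(museum_fact("Fraunces Tavern Museum", "12pm", "5pm"))
# museumopen('Fraunces Tavern Museum', '12:00:00', '17:00:00').
```

A restriction such as Tuesday-Friday would presumably yield one fact per day once dates are attached, which may be why the slide lists several facts for the same museum.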

UIMA: Results
information annotated in the documents:
  - names of museums, hotels
  - times, time intervals
  - time restrictions
  - prices, intervals of prices (hotel prices)
  - keywords for museum category
  - names of pharaohs (annotated with a correction of misspellings)
hotel and museum information is exported into Prolog facts and into a short textual summary
  - templates filled with the detected information
hotels: Price information about Cosmopolitan Hotel : $157
museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday

UIMA: Evaluation (documentation)
- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior knowledge of Java and Eclipse is helpful

UIMA: Evaluation (processing and linguistic resources)
- annotators only from the tutorial:
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources
  - implementation of an interface is necessary

UIMA: Evaluation (tools for resource maintenance and extension)
- specific Eclipse component editors or
- simple text editors

UIMA: Evaluation (speed of processing)
- faster than GATE?
- in CPE: detailed information about processing time for each module

UIMA: Evaluation (single docs vs. large corpora)
- Collection Reader:
  - document(s) from a directory
- adapt extensions into preprocessing (CAS Initializer)
  - e.g., extraction of text fragments from an HTML document

UIMA: Evaluation (limitations, suggestions for improvement)
no limitations:
  - everything is possible, but the user has to implement or interface it
wish:
  - more processing and linguistic resources within the distribution

UIMA: Evaluation (im-/export of document formats)
- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of
    - document + annotations
    - only annotations
- required: Java application

Overview
- Introduction
- GATE
- UIMA
- Conclusion

Conclusion
intended use
  - GATE:
    - academic/scientific application
    - tools available
    - comfortable GUI
  - UIMA:
    - more commercial
    - plain framework
    - simplified definition of (complex) result structures
    - simplified pre- and postprocessing of annotations
in sum: incommensurable

Conclusion
both are extensible
no final judgement about whether to use GATE or UIMA
  - depends on your task
    - task description
    - expected results
    - which processing resources are necessary
  - depends on your preferences for the interface
    - prefer the Eclipse environment (or other Java editors)
    - prefer a comfortable GUI
or use both

Conclusion
found in the UIMA Forum:
I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other. GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document). UIMA is more targetted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing.
We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on for details.
Ian Roberts (GATE developer)