
1 Experiences with UIMA from a User’s Perspective
Dietmar Rösner, Manuela Kunze, Hany Mahgoub
University of Magdeburg, Knowledge Based Systems and Document Processing

2 Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective
Overview: Introduction, GATE, UIMA, Conclusion

3 Introduction
"IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies."
November 2005: version 1.2.3 of UIMA is available

4 Introduction
really?

5 Introduction
similarity/comparison of GATE and UIMA
– frameworks
– results are documents + annotations
– pipeline processing steps:
– task definition
– one corpus

6 Evaluation Topics
ease of getting acquainted with the system?
– quality of documentation: completeness, clarity, up-to-dateness, …?
– tutorials, use cases, …?
processing and linguistic resources?
– lexica, gazetteer lists, tools
tools for resource maintenance and extension?
– quality: self-explanatory, robust, comfortable
speed of processing?
single docs vs. large corpora?
limitations, suggestions for improvement?
support for im-/export of a variety of document formats?

7 Task of the Experiment
process a corpus of web sites
– to detect and extract information relevant for tourists: opening times of museums, prices of hotels, …
corpus:
– 30 tourism web sites about Egypt
– an additional 20 web sites about Washington, New York, London
output:
– Prolog facts for a reasoner
– questions: Which museum is open now? …

8 Excerpts from the Corpus
The Egyptian Museum is open the hours: 9am-5pm daily
The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm
Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)
10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri
…

9 Overview: Introduction, GATE, UIMA, Conclusion

10 GATE: General Architecture for Text Engineering
a suite of tools for language processing and information extraction
rule-based modular IE system (ANNIE)
language- and domain-independent processing resources
open and extensible architecture
aims to provide uniform access to various linguistic and ontological resources
http://gate.ac.uk/

11 GATE: General Architecture for Text Engineering
a software infrastructure for NLP researchers, based on three main elements:
– an architecture describing the components composing a language processing system
– a framework that can be used as a basis for building such systems
– a graphical development environment
a set of tools and components for language engineers

12 GATE: General Architecture for Text Engineering
GATE is distributed with an IE system called ANNIE
– relies on finite state algorithms and the Java Annotation Patterns Engine (JAPE) language
– comprises a set of core Processing Resources (PRs):
  Tokeniser, Gazetteers, POS tagger, Sentence Splitter, Semantic Tagger (JAPE transducer), Orthomatcher (orthographic coreference), …

13 GATE: ANNIE
[Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]

14 GATE Application
several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended gazetteer lists), JAPE Transducer
sample text: *The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm …
pipeline shown in the diagram:
– ANNIE English Tokenizer
– Gazetteer lists: names of museums, fragments of times and restrictions
– JAPE Transducer: JAPE rules to annotate intervals of times and restrictions; museum
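
The gazetteer step on this slide can be illustrated outside GATE. The following is only a minimal Python sketch, not GATE's Hash Gazetteer: the list entries, the `lookup` function, and the tuple format are assumptions made for illustration; only the `org_base` major type is taken from the JAPE rule on the next slide.

```python
import re

# Hypothetical gazetteer entries (phrase -> majorType); illustrative only.
GAZETTEER = {
    "egyptian museum": "org_base",
    "military museum": "org_base",
    "palace museum": "org_base",
}


def tokenize(text):
    """Split text into word and punctuation tokens with character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def lookup(text, gazetteer=GAZETTEER, max_len=3):
    """Return (start, end, majorType) for the longest match at each position."""
    tokens = tokenize(text)
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if phrase in gazetteer:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                annotations.append((start, end, gazetteer[phrase]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations
```

Calling `lookup("The Military Museum is open the hours: Summer: 8am-5:30pm")` yields one annotation whose offsets cover "Military Museum". Later steps, such as the JAPE rules, would then build on such lookup annotations.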

15 Museum information in JAPE
Rule: egyptmuseums
(
  ({SpaceToken})
  ({Token.kind == word})
  ({SpaceToken})
  {Lookup.majorType == org_base}   // from gazetteer lists
  ({SpaceToken})?
  (({Token.kind == punctuation})|({Token.kind == word})|({SpaceToken}))*
  ({timeinfo})                     // annotation by JAPE transducer
):museum
-->
:museum.sight = {rule = "egyptmuseums"}

timeinfo is defined by JAPE rules and detects patterns like:
9am-5pm, 6pm-9pm
8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
5:00PM-7:00PM, 10:00am-5:00pm
…
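
The timeinfo patterns listed on this slide can be approximated with a regular expression. This is a hedged Python sketch, not the authors' JAPE rules: the names `TIME`, `INTERVAL`, `to_24h` and `find_intervals` are assumptions, and the pattern covers only the `9am-5pm` style shown above.

```python
import re

# One clock time such as "9am", "4:30pm" or "5:00PM".
TIME = r"\d{1,2}(?::\d{2})?\s*[ap]m"
# Two clock times joined by a hyphen, e.g. "8am-4:30pm".
INTERVAL = re.compile(
    rf"\b(?P<start>{TIME})\s*-\s*(?P<end>{TIME})", re.IGNORECASE
)


def to_24h(t):
    """Convert a 12-hour time like '5:30pm' to 'HH:MM' in 24-hour notation."""
    m = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*([ap])m", t.strip(), re.IGNORECASE)
    hour, minute = int(m.group(1)), int(m.group(2) or 0)
    half = m.group(3).lower()
    if half == "p" and hour != 12:
        hour += 12
    if half == "a" and hour == 12:
        hour = 0
    return f"{hour:02d}:{minute:02d}"


def find_intervals(text):
    """Return (start, end) pairs in 24-hour notation for each interval found."""
    return [(to_24h(m.group("start")), to_24h(m.group("end")))
            for m in INTERVAL.finditer(text)]
```

Seasonal labels and weekday restrictions ("summer", "Sat-Wed") are deliberately left out; in the experiment these were handled by separate rules.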

16 GATE: Presentation of Results
[Screenshot: type and location of every extracted annotation in the document; annotations; museum information]

17 GATE: Results
information annotated in the documents:
– names of museums, hotels
– names of tourist places in Egypt
– times, time intervals
– time restrictions
– prices, price ranges (hotel prices and museum prices)
– names of pharaohs, queens

18 GATE: Evaluation
documentation?
– good
– illustrative examples (tutorial), but not enough, especially about JAPE rules
– can be used without knowledge of Java programming
– but experience with Java programming is an advantage when using Java in JAPE rules

19 GATE: Evaluation
processing and linguistic resources?
– many processing resources available (ANNIE):
  tokenisers, POS taggers, parsers, gazetteers, sentence splitter, …
– additional PRs:
  gazetteer collector, PRs for Machine Learning, various exporters, annotation set transfer, etc.

20 GATE: Evaluation
tools for resource maintenance and extension?
– editor for gazetteer lists
– corpus manager
– text editor and debugger for JAPE rules

21 GATE: Evaluation
speed of processing?
– there is no measurement of processing time in the GATE tool

22 GATE: Evaluation
single docs vs. large corpora?
– corpus pipeline vs. document pipeline

23 GATE: Evaluation
limitations, suggestions for improvement?
– no limitations:
  all is possible, but it is not necessary to implement everything yourself
– for getting started:
  processing and linguistic resources are available within the distribution

24 GATE: Evaluation
im-/export of document formats?
– import:
  supports a variety of document formats: HTML, RTF, email, SGML and plain text
  in all cases the format is analysed and converted into a single unified model of annotation
– export:
  documents, corpora and annotations in databases of various sorts
  required: Java application (CREOLE)

25 Overview: Introduction, GATE, UIMA, Conclusion

26 UIMA: Unstructured Information Management Architecture
a software architecture for developing and deploying unstructured information management (UIM) applications
UIM application: a software system that
– analyses large volumes of unstructured information to discover, organize, and deliver relevant knowledge to the end user
software architecture which specifies
– component interfaces, data representations, …
http://www.research.ibm.com/UIMA/

27 UIMA: Unstructured Information Management Architecture
callouts from the architecture diagram:
– … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
– … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
– … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from tags in the original HTML) into the CAS.
– … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure
CPM: Collection Processing Manager
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
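
The data flow described in these callouts (reader, initializer, analysis engines, consumer) can be mimicked in a few lines. The sketch below is plain Python and does not use the UIMA SDK: the `Cas` class and all function names are toy stand-ins invented for illustration, and the tag stripping is a crude regular expression, not an HTML parser.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: document text, metadata and annotations."""
    text: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)  # (type, begin, end)


def strip_tags(html):
    """Toy 'CAS Initializer' step: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", html)


def collection_reader(documents):
    """Yield one Cas per (name, html) pair, with the source name as metadata."""
    for name, html in documents:
        yield Cas(text=strip_tags(html), metadata={"source": name})


def time_annotator(cas):
    """Toy analysis engine: add a 'Time' annotation for each clock time."""
    for m in re.finditer(r"\b\d{1,2}(?::\d{2})?[ap]m\b", cas.text):
        cas.annotations.append(("Time", m.start(), m.end()))
    return cas


def aggregate(*engines):
    """Compose engines into one engine that runs them in order."""
    def run(cas):
        for engine in engines:
            cas = engine(cas)
        return cas
    return run


def cas_consumer(cases):
    """Toy 'CAS Consumer': build an index of annotated strings per source."""
    return {cas.metadata["source"]: [cas.text[b:e] for _, b, e in cas.annotations]
            for cas in cases}
```

A run over one document could look like `cas_consumer(aggregate(time_annotator)(c) for c in collection_reader(docs))`. In UIMA itself the components are Java classes plus XML descriptors, and the Collection Processing Manager drives the flow.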

28 UIMA: Unstructured Information Management Architecture
Analysis Engine (AE):
– a component that analyzes artifacts (e.g. documents) and infers information about them
– consists of two parts: Java classes (typically packaged as one or more JAR files) and AE descriptors (one or more XML files)
the AE descriptors contain:
– the configuration settings for the Analysis Engine
– a description of the AE’s input and output requirements
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]

29 UIMA Application
several annotators (like a pipeline): museum pattern, time pattern, interval of times, restrictions, museum information, ...
sample text: *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm; …
notes in the diagram:
– regular expressions
– window covering two time intervals and a restriction
– window covering a museum and opening hours
Prolog facts:
museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00','2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00','2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00','2005-12-03T17:00:00').
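
How opening hours plus a weekday restriction might be expanded into dated `museumopen/3` facts can be sketched as follows. This is an assumption about the export step, not the authors' CAS Consumer: the function names, the weekday encoding (Python's Monday = 0) and the chosen date range are illustrative, and the spacing of the generated facts differs slightly from the slide.

```python
from datetime import date, datetime, time, timedelta


def prolog_atom(s):
    """Quote a string as a Prolog atom, escaping backslashes and quotes."""
    return "'" + s.replace("\\", "\\\\").replace("'", "''") + "'"


def museum_open_facts(name, opens, closes, weekdays, first_day, last_day):
    """Emit one museumopen/3 fact per day in the range that is an open weekday."""
    facts = []
    day = first_day
    while day <= last_day:
        if day.weekday() in weekdays:  # Monday is 0, Sunday is 6
            start = datetime.combine(day, opens).isoformat()
            end = datetime.combine(day, closes).isoformat()
            facts.append(
                f"museumopen({prolog_atom(name)}, "
                f"{prolog_atom(start)}, {prolog_atom(end)})."
            )
        day += timedelta(days=1)
    return facts
```

For example, `museum_open_facts("Fraunces Tavern Museum", time(12), time(17), {1, 2, 3, 4}, date(2005, 12, 1), date(2005, 12, 3))` produces facts for Thursday 1 and Friday 2 December only, since Saturday falls outside the Tuesday-Friday restriction. A reasoner can then answer "Which museum is open now?" by comparing the current timestamp with the two bounds.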

30 UIMA: Results
information annotated in the documents:
– names of museums, hotels
– times, time intervals
– time restrictions
– prices, price ranges (hotel prices)
– keywords for museum category
– names of pharaohs (annotated with a correction of misspellings)
hotel and museum information is exported into Prolog facts and into a short textual summary
– templates filled with the detected information
hotels: Price information about Cosmopolitan Hotel : $157
museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday

31 UIMA: Evaluation
documentation?
– good
– illustrative examples (tutorial)
– completeness: some topics are described only very briefly
– prior knowledge of Java and Eclipse is helpful

32 UIMA: Evaluation
processing and linguistic resources?
– annotators only from the tutorial:
  sentence annotation, word annotation, date/time annotators, examples for using regular expressions, etc.
– external resources can be integrated:
  lexical resources as external resources (text files)
  existing processing resources
  implementation of an interface is necessary

33 UIMA: Evaluation
tools for resource maintenance and extension?
– specific Eclipse component editors, or
– simple text editors

34 UIMA: Evaluation
speed of processing?
– faster than GATE?
– the CPE gives detailed information about the processing time of each module

35 UIMA: Evaluation
single docs vs. large corpora?
– Collection Reader:
  document(s) from a directory
– adapt extensions in preprocessing (CAS Initializer):
  e.g., extraction of text fragments from an HTML document

36 UIMA: Evaluation
limitations, suggestions for improvement?
no limitations:
– all is possible, but implementation or interfacing is left to the user
wish:
– more processing and linguistic resources within the distribution

37 UIMA: Evaluation
im-/export of document formats?
– import: CAS Initializer
– export: CAS Consumer
  transform annotations into any other format
  export of document + annotations, or of annotations only
– required: Java application

38 Overview: Introduction, GATE, UIMA, Conclusion

39 Conclusion
intended use
– GATE: academic/scientific application
  tools available
  comfortable GUI
– UIMA: more commercial
  plain framework
  simplified definition of (complex) result structures
  simplified pre- and postprocessing of annotations
in sum: incommensurable

40 Conclusion
both are extensible
no final judgement about whether to use GATE or UIMA
– depends on your task:
  task description
  expected results
  which processing resources are necessary
– depends on your preferences for the interface:
  prefer the Eclipse environment (or other Java editors)
  prefer a comfortable GUI
or use both

41 Conclusion
found in the UIMA Forum:
"I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other. GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document). UIMA is more targetted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing. We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on http://gate.ac.uk/ for details."
Ian Roberts (GATE developer)

