A web based Workbench for Interactive Semantic Text Analysis: Design and Prototypical Implementation
Tobias Waltl
Software Engineering for Business Information Systems (sebis), Department of Informatics, Technische Universität München, Germany
wwwmatthes.in.tum.de
Overview
1. Introduction
2. Research Questions
3. Requirements
4. Existing Architectures for NLP Applications
5. Architecture & Implementation
6. Live Demo
7. Conclusion & Outlook

© sebis Tobias Waltl - Master's Thesis
Introduction: Motivation

Problem
- Very large and fast-growing amount of legal literature [2,4]
  - 2005 – 2009: 616 passed laws
  - 2009 – 2013: 553 passed laws
  - Judgements
  - Commentaries
  - …

Progress in NLP technologies
- IBM Watson
- UIMA (Unstructured Information Management Architecture)
- GATE (General Architecture for Text Engineering)
- …

A system that
- deals with legal literature,
- semantically analyzes and annotates it,
- provides a workbench for exploring and filtering the literature and its annotations
Introduction: Related work – GATE Developer [1]
Introduction: Related work – Argo [3]
Research Questions
- What are the requirements for a software architecture to support semantic analysis of legal literature?
- What are common software architectures, based on these requirements, that support semantic analysis in web applications?
- What does a prototypical integration of an architecture enabling semantic analysis of legal literature look like?
Requirements (excerpt)

Technical requirements from literature review
- The workbench should be a web application
- The system’s architecture should foster reuse of components
- It should be easy to integrate and to interchange foreign components
- The system’s text mining engine should support parallel processing of NLP tasks

Functional requirements from expert interviews
- The workbench shall annotate legal definitions
- The workbench shall annotate exceptions of legal norms
- The workbench shall provide linguistic information
Existing Architectures for NLP Applications

Other approaches
- Whiteboard architecture
- Talisman
- TalLab
- Heart of Gold

TIPSTER-based
- TIPSTER
- Ellogon
- LIMA

GATE

UIMA
UIMA

Modular architecture
- A combination of analysis engines (AEs) forms a pipeline
- AEs communicate via the CAS
- AEs specify their inputs and outputs

Annotations
- Strongly typed annotations
- Standoff annotations in a Common Analysis Structure (CAS)

Implementations
- Frameworks for Java and C++
- Rule engine for regex-like pattern matching over annotations
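The pipeline idea can be illustrated with a minimal Python sketch. It is not the UIMA API: `Cas`, `Annotation`, `sentence_engine`, `token_engine`, and `run_pipeline` are hypothetical names chosen for illustration, and the regex-based splitting is a deliberately naive stand-in for real NLP components. The sketch only shows how engines can exchange results through one shared structure that holds the unmodified text plus standoff annotations (character offsets).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    """Standoff annotation: points into the text by offsets, never copies or edits it."""
    type_name: str
    begin: int
    end: int


@dataclass
class Cas:
    """Hypothetical stand-in for a Common Analysis Structure (not the UIMA class)."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, type_name: str) -> List[Annotation]:
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


def sentence_engine(cas: Cas) -> None:
    """Input: raw text. Output: 'Sentence' annotations (naive punctuation split)."""
    for match in re.finditer(r"[^.!?]+[.!?]", cas.text):
        raw = match.group()
        # Skip leading whitespace so the span starts at the first visible character.
        begin = match.start() + (len(raw) - len(raw.lstrip()))
        cas.annotations.append(Annotation("Sentence", begin, match.end()))


def token_engine(cas: Cas) -> None:
    """Input: 'Sentence' annotations. Output: 'Token' annotations."""
    for sentence in cas.select("Sentence"):
        for match in re.finditer(r"\w+", cas.covered_text(sentence)):
            cas.annotations.append(
                Annotation("Token", sentence.begin + match.start(), sentence.begin + match.end())
            )


def run_pipeline(cas: Cas, engines: List[Callable[[Cas], None]]) -> Cas:
    """Run the engines in order; each one reads from and writes to the same CAS."""
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline(Cas("A contract binds. It may end."), [sentence_engine, token_engine])
sentences = [cas.covered_text(a) for a in cas.select("Sentence")]
tokens = [cas.covered_text(a) for a in cas.select("Token")]
```

Because the token engine depends only on `Sentence` annotations in the CAS, the sentence engine could be swapped for another implementation without touching it, which is the interchangeability argument made on the slide.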
UIMA - Ruta
- Rule-based language for pattern matching over annotations
- Powerful tool for functional requirements
- Example: [figure]
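To give a feel for pattern matching over annotations, here is a small Python sketch. It is not Ruta syntax and does not use any UIMA library: `apply_rule`, `tokenize`, the `DefinitionCue` type, and the sample sentence are all assumptions made up for illustration. A rule is modelled as a sequence of predicates that must match consecutive annotations; each match produces a new annotation spanning the matched range.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    type_name: str
    begin: int
    end: int
    text: str


def tokenize(text: str) -> List[Annotation]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return [
        Annotation("Token", m.start(), m.end(), m.group())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


Predicate = Callable[[Annotation], bool]


def apply_rule(
    annotations: List[Annotation], pattern: List[Predicate], new_type: str, text: str
) -> List[Annotation]:
    """Slide the pattern over consecutive annotations; emit one annotation per match."""
    created = []
    width = len(pattern)
    for i in range(len(annotations) - width + 1):
        window = annotations[i:i + width]
        if all(pred(a) for pred, a in zip(pattern, window)):
            begin, end = window[0].begin, window[-1].end
            created.append(Annotation(new_type, begin, end, text[begin:end]))
    return created


# Invented sample sentence; the rule below is a hypothetical cue for a legal definition:
# quotation mark, a word, quotation mark, followed by the word "means".
text = 'In this Act, "consumer" means a natural person.'
pattern = [
    lambda a: a.text == '"',
    lambda a: a.text.isalpha(),
    lambda a: a.text == '"',
    lambda a: a.text == "means",
]
cues = apply_rule(tokenize(text), pattern, "DefinitionCue", text)
```

A real rule language adds quantifiers, conditions, and actions on top of this idea; the sketch only covers fixed-length sequences.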
Architecture & Implementation
Live Demo
Conclusion & Outlook

Conclusion
- The current implementation serves as a foundation for further features and can easily be extended
- All nonfunctional requirements are fulfilled
- Some functional requirements are also fulfilled

Outlook
- Development of further patterns
- Editable texts
- Different kinds of literature:
  - Judgements
  - Commentaries
  - Contracts
  - …
Thank you for your attention!

Tobias Waltl B.Sc.
Technische Universität München, Department of Informatics, Chair of Software Engineering for Business Information Systems
Boltzmannstraße Garching bei München
wwwmatthes.in.tum.de
Screenshot of the app used for interviews
Nonfunctional requirements
- The workbench should be a web application
- The system’s architecture should foster reuse of components
- The system’s text mining engine should incorporate a common type system for the created annotations
- The system’s text mining engine should comply with a standardized data format for data exchange between its components
- It should be easy to integrate and to interchange foreign components
- The system’s text mining engine should support parallel processing of NLP tasks
Functional requirements
- The workbench should support adding, removing, and editing annotations
- The workbench should support the persistence of annotations
- It shall be possible to fold sections of the displayed text
- It shall be possible to add one's own comments to the documents
- It shall be possible to set bookmarks in the documents
- It shall be possible to edit the texts
- It shall be possible for multiple users to work on the same document and track their changes
- The workbench shall feature a comparison of documents and their different versions
- The workbench shall be able to import documents in different formats
- The workbench shall allow for exporting documents in different formats
- The workbench shall provide information about incoming and outgoing references
- The workbench shall annotate legal definitions
- The workbench shall annotate exceptions of legal norms
- The workbench shall annotate legal consequences
- The workbench shall provide linguistic information
Assessment of architectures against nonfunctional requirements

Architectures (columns): TIPSTER, Ellogon, LIMA, Whiteboard Architecture, TALISMAN, TalLab, Heart of Gold, GATE, UIMA
Requirements (rows): Web application, Reuse of components, Type system, Common data format, Integration and interchangeability, Parallel processing, Framework
Typed annotations
- Why typed annotations?
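One common argument for typed annotations (not necessarily the one made on the original slide) is that downstream components can select exactly the annotation kinds they need and rely on the features those kinds declare. The Python sketch below is an assumption-laden illustration: `Token`, `LegalDefinition`, their features `pos` and `term`, and the helper `select` are hypothetical names, not part of any UIMA type system.

```python
from dataclasses import dataclass
from typing import List, Type, TypeVar


@dataclass(frozen=True)
class Annotation:
    """Base type: every annotation has a span given by character offsets."""
    begin: int
    end: int


@dataclass(frozen=True)
class Token(Annotation):
    pos: str  # part-of-speech tag, a feature declared by this type


@dataclass(frozen=True)
class LegalDefinition(Annotation):
    term: str  # the defined term, a feature declared by this type


A = TypeVar("A", bound=Annotation)


def select(annotations: List[Annotation], kind: Type[A]) -> List[A]:
    """Return all annotations of the given type, including its subtypes."""
    return [a for a in annotations if isinstance(a, kind)]


annotations: List[Annotation] = [
    Token(0, 8, "NOUN"),
    Token(9, 14, "VERB"),
    LegalDefinition(0, 30, term="consumer"),
]

definitions = select(annotations, LegalDefinition)
everything = select(annotations, Annotation)

# A type with a required feature cannot be created without it.
try:
    LegalDefinition(0, 5)
    missing_feature_rejected = False
except TypeError:
    missing_feature_rejected = True
```

With untyped annotations (for example, plain dictionaries), a misspelled type name or a missing feature would only surface later, when some component tries to read it.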
Architecture & Implementation
Architecture & Implementation
Example pipeline
References
[1] Cunningham, H., Tablan, V., Roberts, A., & Bontcheva, K. (2013). Getting More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics. PLoS Comput Biol, 9(2), e doi: /journal.pcbi
[2] Deutscher Bundestag (2014). Deutscher Bundestag – Neue Ausgabe des Datenhandbuchs zur Geschichte des Deutschen Bundestages. Retrieved from (last access on )
[3] Rak, R., Rowley, A., Black, W., & Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database, doi: /database/bas010
[4] Walter, S. (2010). Definitionsextraktion aus Urteilstexten (Dissertation). Universität des Saarlandes, Saarbrücken, Germany. Retrieved from (last access on )