Institute of Informatics & Telecommunications – NCSR “Demokritos” Ellogon and the challenge of threads Georgios Petasis Software and Knowledge Engineering.

Slides:



Advertisements
Similar presentations
Memory.
Advertisements

Chapter 10: Designing Databases
Prentice Hall, Database Systems Week 1 Introduction By Zekrullah Popal.
File Management Chapter 12. File Management A file is a named entity used to save results from a program or provide data to a program. Access control.
PHP (2) – Functions, Arrays, Databases, and sessions.
Memory Management (II)
©TheMcGraw-Hill Companies, Inc. Permission required for reproduction or display. COMPSCI 125 Introduction to Computer Science I.
©TheMcGraw-Hill Companies, Inc. Permission required for reproduction or display. COMPSCI 125 Introduction to Computer Science I.
© Copyright Eliyahu Brutman Programming Techniques Course.
About the Presentations The presentations cover the objectives found in the opening of each chapter. All chapter objectives are listed in the beginning.
CS 255: Database System Principles slides: Variable length data and record By:- Arunesh Joshi( 107) Id: Cs257_107_ch13_13.7.
Development of mobile applications using PhoneGap and HTML 5
The POSTGRES Next - Generation Database Management System Michael Stonebraker Greg Kemnitz Presented by: Nirav S. Sheth.
.NET, and Service Gateways Group members: Andre Tran, Priyanka Gangishetty, Irena Mao, Wileen Chiu.
SEC(R) 2008 Intel® Concurrent Collections for C++ - a model for parallel programming Nikolay Kurtov Software and Services.
DHTML. What is DHTML?  DHTML is the combination of several built-in browser features in fourth generation browsers that enable a web page to be more.
Overview of SQL Server Alka Arora.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
SWE 316: Software Design and Architecture – Dr. Khalid Aljasser Objectives Lecture 11 : Frameworks SWE 316: Software Design and Architecture  To understand.
M i SMob i S Mob i Store - Mobile i nternet File Storage Platform Chetna Kaur.
Levels of Architecture & Language CHAPTER 1 © copyright Bobby Hoggard / material may not be redistributed without permission.
Institute of Informatics & Telecommunications – NCSR “Demokritos” TkRibbon: Windows Ribbons for Tk Georgios Petasis Software and Knowledge Engineering.
Organizing Data and Information AD660 – Databases, Security, and Web Technologies Marcus Goncalves Spring 2013.
Threading Models in Visual Basic Language Student Name: Danyu Xu Student ID:98044.
Software Requirements Engineering CSE 305 Lecture-2.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
21 st International Unicode Conference Dublin, Ireland, May Folded Trie: Efficient Data Structure for All of Unicode Vladimir Weinstein
BLU-ICE and the Distributed Control System Constraints for Software Development Strategies Timothy M. McPhillips Stanford Synchrotron Radiation Laboratory.
Institute of Informatics & Telecommunications – NCSR “Demokritos” TileQt and TileGtk: current status Georgios Petasis Software and Knowledge Engineering.
Chapter 10: File-System Interface 10.1 Silberschatz, Galvin and Gagne ©2011 Operating System Concepts – 8 th Edition 2014.
Operating Systems (CS 340 D) Dr. Abeer Mahmoud Princess Nora University Faculty of Computer & Information Systems Computer science Department.
ECE200 – Computer Organization Chapter 9 – Multiprocessors.
Introduction to GATE Developer Ian Roberts. University of Sheffield NLP Overview The GATE component model (CREOLE) Documents, annotations and corpora.
Ihr Logo Operating Systems Internals & Design Principles Fifth Edition William Stallings Chapter 2 (Part II) Operating System Overview.
1 CS 430 Database Theory Winter 2005 Lecture 2: General Concepts.
1 Memory Management (b). 2 Paging  Logical address space of a process can be noncontiguous; process is allocated physical memory whenever the latter.
SQL Fundamentals  SQL: Structured Query Language is a simple and powerful language used to create, access, and manipulate data and structure in the database.
Institute of Informatics & Telecommunications – NCSR “Demokritos” iTcl and TclOO From the perspective of a simple user Georgios Petasis Software and Knowledge.
Institute of Informatics & Telecommunications – NCSR “Demokritos” TkDND: a cross-platform drag’n’drop package Georgios Petasis Software and Knowledge Engineering.
Polytechnic University of Tirana Faculty of Information Technology Computer Engineering Department A MULTITHREADED SEARCH ENGINE AND TESTING OF MULTITHREADED.
A Remote Collaboration Environment for Protein Crystallography HEPiX-HEPNT Conference, 8 Oct 1999 Nicholas Sauter, Stanford Synchrotron Radiation Laboratory.
It consists of two parts: collection of files – stores related data directory structure – organizes & provides information Some file systems may have.
Lesson 1 1 LESSON 1 l Background information l Introduction to Java Introduction and a Taste of Java.
San Jose, California September 2002 What is ICU? Roadmap and Myths Helena Shih Chapman ICU Development Manager IBM Globalization Center of Competency.
6 Systems Analysis and Design in a Changing World, Fourth Edition.
Design and implementation Chapter 7 – Lecture 1. Design and implementation Software design and implementation is the stage in the software engineering.
CSE333 CSE333 DDS.1 A System for Creating Specialized DDS Architectures Research and Work By: Tom Puzak Nathan Viniconis Solomon Berhe Jeffrey Peck Project.
Some of the utilities associated with the development of programs. These program development tools allow users to write and construct programs that the.
Computer Science: A Structured Programming Approach Using C1 Objectives ❏ To understand the structure of a C-language program. ❏ To write your first C.
Multi-threading and other parallelism options J. Apostolakis Summary of parallel session. Original title was “Technical aspects of proposed multi-threading.
Database Systems: Design, Implementation, and Management Tenth Edition
DHTML.
CHP - 9 File Structures.
Introduction to the C Language
Institute of Informatics & Telecommunications NCSR “Demokritos”
Developing Ellogon Components…
Institute of Informatics & Telecommunications
Appendix D: Network Model
The Improvement of PaaS Platform ZENG Shu-Qing, Xu Jie-Bin 2010 First International Conference on Networking and Distributed Computing SQUARE.
Un</br>able’s MySecretSecrets
Paging and Segmentation
iTcl and TclOO From the perspective of a simple user
Chapter 27 WWW and HTTP.
Chapter 7 –Implementation Issues
Middleware, Services, etc.
Prof. Leonardo Mostarda University of Camerino
Introduction to Computer Systems
Appendix D: Network Model
Software Architecture & Design
Presentation transcript:

Institute of Informatics & Telecommunications – NCSR “Demokritos” Ellogon and the challenge of threads Georgios Petasis Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, Athens, Greece

Overview  The Ellogon NLP platform  Ellogon architecture and data model –Collections and documents –Attributes and annotations  The object cache  Thread safety and multiple threads  Conclusions 2 14 Oct 2010Ellogon and the challenge of threads

The Ellogon NLP platform (1)  Ellogon is an infrastructure for natural language processing –Provides facilities for managing corpora –Provides facilities for manually annotating corpora –Provides facilities for loading processing components, and applying them on corpora  Development started in 1998 –I think with Tcl/Tk 8.1 (beta?) –~ lines of C/C++/Tcl code –A lot of legacy code, especially in the GUI No widespread use of tile/ttk No OO (i.e. iTcl) in most parts of the code 14 Oct 2010Ellogon and the challenge of threads 3

The Ellogon NLP platform (2)  Ellogon was amongst the first platforms to offer complete multi-lingual support –Of course, it as using Tcl 8.1  The first prototype was written entirely in Tcl/Tk –Performance was not good, but memory consumption was excellent! 14 Oct 2010Ellogon and the challenge of threads 4

The Ellogon NLP platform (4)  Too many Tcl objects required (> 10K)  A solution from observing the data: –Objects tend to contain the same information  Why not build a cache of objects? –Objects can be reused as appropriate  Was it a good solution? –Yes, this approach worked well for many years  But recent hardware brings a new challenge: –How can this data model meet multiple threads? 14 Oct 2010Ellogon and the challenge of threads 5

Ellogon Architecture Language Processing Components Graphical Interface ServicesInternet (HTTP, FTP, SOAP) Operating System Services (ActiveX, COM, DDE) Database Connectivity (ODBC)… Operating System ??? XMLEllogonDatabases Collection – Document Manager Storage Format Abstraction Layer C++ API C API 14 Oct 2010Ellogon and the challenge of threads 6

Ellogon Data Model 14 Oct 2010Ellogon and the challenge of threads 7... Document Document Document Attributes language = Hellenic (string) Collection Textual Data Information about Textual Data Document Annotationsco-reference type = person entity = 132 token pos = noun lemma = abc Attributes language = Hellenic (string) bgImage = (image)

Annotations 14 Oct 2010Ellogon and the challenge of threads 8 Annotation ID 0 Type token Span Set [0 4] Attribute Set type = EFW pos = PN This is a simple sentence Annotations Annotation Span Set Denotes ranges of annotated textual data Annotation Span Set Denotes ranges of annotated textual data Annotation ID Unambiguously identifies the annotation within a document Annotation ID Unambiguously identifies the annotation within a document Annotation Type Classifies annotations into categories Annotation Type Classifies annotations into categories Annotation Attribute Set Contains linguistic information in the form of named values Annotation Attribute Set Contains linguistic information in the form of named values

The Collection  A C structure, containing (among other elements): –A Tcl list object, containing the documents to be deleted (if any) –A Tcl command token, holding the Tcl command that represents the collection at the Tcl level –A Tcl Hash table that contains the attributes of the collection. Each attribute is a Tcl list object –Two Tcl objects that can hold arbitrary information, such as notes and associated information 14 Oct 2010Ellogon and the challenge of threads 9

The Document  A C structure, containing (among other elements): –A Tcl command token, holding the Tcl command that represents the document at the Tcl level –A Tcl Hash table that contains the attributes of the document. Each attribute is a Tcl list object –A Tcl Hash table that contains the annotations of the document. Each annotation is either a Tcl list object, or an object of custom type 14 Oct 2010Ellogon and the challenge of threads 10

Attributes  Each attribute is a Tcl list object, containing three elements: –The attribute name: the name can be an arbitrary string –The type of the attribute value: this can be an item from a predefined set of value types –The value of the attribute, which can be an arbitrary (even binary) string 14 Oct 2010Ellogon and the challenge of threads 11

Annotations  An annotation is a Tcl object of custom type  It can be roughly seen as a list of four elements: –The annotation id: an integer, which uniquely identifies the annotation inside a document –The annotation type: an arbitrary string that classifies the annotation into a category –A list of spans: each span is a Tcl list object, holding two integers, the start/end character offsets of the text annotated by the span –A list of attributes: a Tcl list object, whose elements are attributes 14 Oct 2010Ellogon and the challenge of threads 12

The object cache  Ellogon implements a global memory cache for Tcl objects –Containing information from all opened collections and documents  The cache is used when: –Creating an element (i.e. attribute, span, annotation, etc.) –An annotation/attribute is put in a document –A collection/document is loaded 14 Oct 2010Ellogon and the challenge of threads 13

Why is cache important?  Linguistic information tents to repeat a lot  Example: annotating a word document with a part-of-speech tagger – “token” annotations –Containing “pos” attributes  Assume a tag set of 10 part-of-speech categories –Each “pos” value has a potential repetition in the thousands  Caching “token’ and “pos” makes sense  Caching larger clusters/constructs of objects makes even more sense  Sharing objects across document reduces memory consumption further 14 Oct 2010Ellogon and the challenge of threads 14

Thread safety (1)  The object cache is thread “unfriendly” –Tcl objects cannot be shared among threads  Parallel processing of documents is a highly desirable feature –But thread-safety is an open question for the Ellogon platform 14 Oct 2010Ellogon and the challenge of threads 15

Thread safety (2)  The CDM implementing the data model (and the object cache) is already thread-safe: –The global variables/objects are few, and their access is protected by mutexes –The object cache is global, and protected again with a mutex –Ellogon plug-in components use thread-specific storage for storing their “global” variables Through special pre-processor definitions for C/C++ components  But thread-safety does not necessarily allow the usage of threads inside Ellogon 14 Oct 2010Ellogon and the challenge of threads 16

14 Oct 2010Ellogon and the challenge of threads 17

Can Ellogon become multi-threaded?  Difficult to be answered  Requirements are: –The graphical user interface must not block during component execution It should be running in its own thread? –Each execution chain must run on its own thread  The documents of a collections should be distributed into N threads –And processed in parallel –This is a highly desired feature 14 Oct 2010Ellogon and the challenge of threads 18

Obstacles for multiple threads  The object cache –Splitting it in multiple threads increases memory consumption  The GUI is also a viewer for linguistic data –If running in a separate thread, deep copy of objects is required  Plug-in components in Tcl –They frequently short-circuit the “API”, and tread API elements as Tcl lists It is easier 14 Oct 2010Ellogon and the challenge of threads 19

Conclusions  Ellogon has been in active development and usage for more than an decade now  Enhancements are required in order to exploit contemporary hardware better  However, it is unclear whether threads can be introduced –Without a major re-organisation of the platform –Without breaking compatibility with plug-in components  Any suggestions/ideas? 14 Oct 2010Ellogon and the challenge of threads 20

Thank you!