Beyond Text Representation: Building on Unicode to Implement a Multilingual Text Analysis Framework
Thomas Hampp – IBM Germany, Content Management Development

Thomas Hampp, IBM, 18th International Unicode Conference

Slide 2: Basic Text Analysis Tasks
- Code page conversion and text representation
- Segmentation (tokens, sentences, paragraphs)
- Morphological analysis / dictionary lookup
- Compound word decomposition
- Spell checking / spell aid
- …

Slide 3: Advanced Text Analysis Tasks
- Summarization
- Categorization/clustering
- Extraction of names, terms or relations
- Information extraction
- Parsing
All tasks should be provided for all languages.

Slide 4: A Library for Text Analysis
- The same text analysis tasks are needed in different multilingual contexts/systems
- The same software library should be used in all contexts/systems to perform the analysis
- The library should work in a language-neutral way
- The text analysis tasks required for a given context/system should be an input parameter for the library

Slide 5: Two Problems and One Solution
The realization of such a library faces two kinds of challenges:
A. Implementing the actual language-specific analysis tasks
B. Encapsulating the language-specific processing by representing input and output in a language-neutral fashion
Unicode plays a major role in solving problem B.

Slide 6: A Software Design for a Text Analysis Library
- Single API towards the application
- Separated but combinable language-specific processing modules
- Central representation system for linguistic information
- Centralized flow of control driven by linguistic analysis targets
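The design points above can be sketched in a few lines. This is only an illustration under stated assumptions: the real library is a C++ DLL, and every name here (AnalysisResult, Analyzer, register, analyze, the "tokens" and "token_count" targets) is hypothetical and not taken from the talk. The sketch shows one entry point, pluggable per-language modules, a shared result store, and control flow derived from the requested analysis targets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass
class AnalysisResult:
    """Central, language-neutral store that every module reads and extends."""
    text: str
    language: str
    annotations: Dict[str, object] = field(default_factory=dict)


Module = Callable[[AnalysisResult], None]


class Analyzer:
    """Single entry point; picks and orders modules from the requested targets."""

    def __init__(self) -> None:
        # (language, target) -> (module, targets it depends on).
        # The language "*" marks a module that works for any language.
        self._modules: Dict[Tuple[str, str], Tuple[Module, List[str]]] = {}

    def register(self, language: str, target: str, module: Module,
                 requires: Sequence[str] = ()) -> None:
        self._modules[(language, target)] = (module, list(requires))

    def analyze(self, text: str, language: str,
                targets: Sequence[str]) -> AnalysisResult:
        result = AnalysisResult(text, language)
        done: Set[str] = set()

        def run(target: str) -> None:
            if target in done:
                return
            # Prefer a language-specific module, fall back to a generic one.
            entry = (self._modules.get((language, target))
                     or self._modules.get(("*", target)))
            if entry is None:
                raise KeyError(f"no module for {target!r} in {language!r}")
            module, requires = entry
            for dependency in requires:  # prerequisites run first
                run(dependency)
            module(result)
            done.add(target)

        for target in targets:
            run(target)
        return result


def whitespace_tokens(result: AnalysisResult) -> None:
    # Placeholder tokenizer, good enough for the demonstration only.
    result.annotations["tokens"] = result.text.split()


def token_count(result: AnalysisResult) -> None:
    result.annotations["token_count"] = len(result.annotations["tokens"])


analyzer = Analyzer()
analyzer.register("*", "tokens", whitespace_tokens)
analyzer.register("*", "token_count", token_count, requires=["tokens"])
result = analyzer.analyze("Beyond text representation", "en", ["token_count"])
print(result.annotations)
```

The application asks only for "token_count"; the analyzer works out that tokenization has to run first. The sketch does not detect cyclic dependencies between targets.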

Slide 10: Implementation
- Implemented as a C++ DLL/shared library
- Provides an extensive object-oriented API for applications and plugins
- Uses Unicode (ICU based) for all text content
- Ported to 9 platforms (therefore no platform-dependent solutions are acceptable)
- Strong focus on performance because of its use in search/indexing
- Supports 30+ languages and 90+ code pages

Slide 11: Enter Unicode
- Used as internal character representation format (character set)
- Converters from/to over 90 external code pages had to be written/integrated
- A decision had to be made on the Unicode encoding format: we chose UTF-16
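As a small illustration of why the converters matter, the sketch below decodes the same byte under three code pages and shows the resulting UTF-16 code units. It uses Python's built-in codecs purely for demonstration; the library itself is C++ and ICU based, and the helper name to_utf16_units is an assumption made for this example.

```python
from typing import List


def to_utf16_units(raw: bytes, code_page: str) -> List[int]:
    """Decode bytes from an external code page and return UTF-16 code units."""
    text = raw.decode(code_page)
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# The single byte 0xE4 means a different character in each code page.
for code_page in ("cp1252", "cp1251", "iso8859_7"):
    units = to_utf16_units(b"\xe4", code_page)
    print(code_page, [f"U+{unit:04X}" for unit in units])
```

Once text is inside the application as UTF-16, the code page it arrived in no longer matters to the analysis modules.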

Slide 12: The Pros of UTF-16
- We started out without knowledge of surrogate issues
- False assumption: fixed-length encoding
- Good balance between size and straightforward representation
- Efficient interoperability with Windows, Java, XML4C APIs etc.

Slide 13: The Cons of UTF-16
- Not a fixed-length encoding because of surrogates
- Cannot be passed to legacy functions (C library, OS APIs)
- Character classification functions have to take pointers into the text, since a surrogate pair spans two code units
- Wastes some space with western languages
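The surrogate mechanism behind the first point can be shown with plain arithmetic. The following is a minimal sketch in Python rather than the library's C++; the function names encode_utf16 and decode_utf16 are chosen for this example. Code points above U+FFFF take two 16-bit code units, so the number of code units and the number of characters can differ.

```python
from typing import Iterable, List


def encode_utf16(code_point: int) -> List[int]:
    """Return the one or two UTF-16 code units for a Unicode scalar value."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def decode_utf16(units: Iterable[int]) -> List[int]:
    """Combine surrogate pairs into code points; unpaired ones pass through."""
    result: List[int] = []
    pending_high = None
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                result.append(0x10000
                              + ((pending_high - 0xD800) << 10)
                              + (unit - 0xDC00))
                pending_high = None
                continue
            result.append(pending_high)  # high surrogate without a partner
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        else:
            result.append(unit)
    if pending_high is not None:
        result.append(pending_high)
    return result


text = "A\U0001D11E"  # LATIN CAPITAL LETTER A, MUSICAL SYMBOL G CLEF
units = [unit for ch in text for unit in encode_utf16(ord(ch))]
print(len(text), "code points,", len(units), "UTF-16 code units")
```

Any routine that classifies "the character at position i" has to look at the neighbouring code unit as well, which is why a single 16-bit value is not a sufficient argument.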

Slide 14: ANSI C/C++ Compatibility
- ANSI C++ does define a type wchar_t for wide character representation (and a matching wide string class wstring)
- Unfortunately the size and encoding of wchar_t are not standardized
- So we combined the ANSI C++ basic_string template class with the Unicode character data type from ICU to create a C++ and Unicode conformant string class

Slide 15: Impact Beyond Character Representation
- Tokenization
- Finite state processing
- Dictionary formats
- Environmental issues
- Development tools support

Slide 16: Impact: Tokenization
- Tokenization needs access to character properties
- Most, but not all, of the relevant properties are provided by the Unicode Character Database
- For application-defined properties, the fast and simple 256-entry character property lookup is no longer available
- Approach limited to western scripts
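One common way to replace a flat 256-entry table is a two-stage lookup that allocates 256-entry blocks only where properties are actually set. The sketch below is an assumption about how such a lookup could look, not the structure used in the library; PropertyTable, JOINER and tokenize are hypothetical names. Standard properties come from Python's unicodedata module, which exposes the Unicode Character Database.

```python
import unicodedata
from typing import Dict, List

BLOCK_BITS = 8
BLOCK_SIZE = 1 << BLOCK_BITS
JOINER = 0x01  # hypothetical application-defined property: "joins word parts"


class PropertyTable:
    """Two-stage table: block index -> 256 flags, allocated on demand."""

    def __init__(self) -> None:
        self._empty = [0] * BLOCK_SIZE  # shared by all untouched blocks
        self._blocks: Dict[int, List[int]] = {}

    def add(self, code_point: int, flags: int) -> None:
        index = code_point >> BLOCK_BITS
        if index not in self._blocks:
            self._blocks[index] = [0] * BLOCK_SIZE
        self._blocks[index][code_point & (BLOCK_SIZE - 1)] |= flags

    def get(self, code_point: int) -> int:
        block = self._blocks.get(code_point >> BLOCK_BITS, self._empty)
        return block[code_point & (BLOCK_SIZE - 1)]


def tokenize(text: str, table: PropertyTable) -> List[str]:
    """Split text into runs of letters, marks, digits and joiner characters."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        is_word = (unicodedata.category(ch)[0] in "LMN"
                   or table.get(ord(ch)) & JOINER)
        if is_word:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


table = PropertyTable()
table.add(ord("'"), JOINER)
table.add(ord("-"), JOINER)
print(tokenize("Don't panic, Ωmega-3!", table))
```

The tokenizer is deliberately naive: a joiner at the start or end of a word stays attached, and scripts written without spaces would need a different segmentation strategy.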

Slide 17: Impact: Finite State Processing
- Finite-state character processing in C usually works with transition tables encoded as arrays
- This is easy to implement and very fast in execution
- To cover the full range of all Unicode characters, more sophisticated transition tables are required
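One way to keep array-based transition tables while covering all of Unicode is to map each character to a small class number first and index the table by class instead of by character. The sketch below illustrates that idea only; the talk does not say which table design the library uses, and the names char_class, TRANSITIONS and is_identifier are made up for this example. The automaton accepts a letter followed by letters or decimal digits.

```python
import unicodedata

# Character classes: the table has one column per class, not per character.
LETTER, DIGIT, OTHER = 0, 1, 2

# States of the automaton.
START, IDENT, REJECT = 0, 1, 2

# TRANSITIONS[state][character class] -> next state
TRANSITIONS = [
    [IDENT, REJECT, REJECT],   # START: must begin with a letter
    [IDENT, IDENT, REJECT],    # IDENT: letters and digits may follow
    [REJECT, REJECT, REJECT],  # REJECT: dead state
]


def char_class(ch: str) -> int:
    """Reduce any Unicode character to one of three classes."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return LETTER
    if category == "Nd":
        return DIGIT
    return OTHER


def is_identifier(text: str) -> bool:
    state = START
    for ch in text:
        state = TRANSITIONS[state][char_class(ch)]
    return state == IDENT


for sample in ("größe2", "переменная1", "2fast", "a b"):
    print(sample, is_identifier(sample))
```

The table stays a 3 by 3 array regardless of how many characters exist; the cost moves into the class lookup, which can itself be a compact table.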

Slide 18: Impact: Dictionaries
- Dictionaries tend to be large
- As much of them as possible has to be loaded in memory for performance reasons
- For multilingual (server) applications multiple dictionaries will be in memory
- Therefore dictionary size matters much
- Doubling dictionary size might not be a viable option
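The "doubling" concern follows directly from the encoding arithmetic for western text. The sketch below only counts the bytes of raw entry strings in three encodings; the word list and the function name encoded_sizes are made up for this example, and real dictionary formats are usually more elaborate than a list of strings, so this shows the order of magnitude rather than a measurement of any actual dictionary.

```python
from typing import Dict, Iterable


def encoded_sizes(entries: Iterable[str]) -> Dict[str, int]:
    """Total bytes needed to store the entries in each encoding."""
    entries = list(entries)
    return {
        encoding: sum(len(entry.encode(encoding)) for entry in entries)
        for encoding in ("latin-1", "utf-8", "utf-16-le")
    }


sizes = encoded_sizes(["Wörterbuch", "Zerlegung", "Größe"])
for encoding, size in sizes.items():
    print(f"{encoding:10} {size} bytes")
```

For text that fits a single-byte code page, UTF-16 needs exactly twice the bytes, while a variable-length form such as UTF-8 stays close to the single-byte size.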

Slide 19: Impact: Environmental Issues
- There is always a residue of single-byte string data (from message catalog, command line, library calls etc.) which sometimes has to be mixed with Unicode string data
- Interfaces for console, messages, logs etc. are mostly single byte
- Configuration files should be platform-neutral, easily editable and support the full Unicode character set

Slide 20: Impact: Development Tools Support
- Only specialized editors can handle Unicode text
- Most debuggers don't display Unicode
- Source code string constants are hard to maintain
- Message catalog compilers on some platforms are not Unicode enabled

Slide 21: A Word About Unicode Normalization Forms
- For reasons of efficient interoperability a fixed Unicode normalization had to be specified
- Early normalization is performance critical
- Since round-trip convertibility was not a design goal, Unicode Normalization Form KC (compatibility composition, NFKC) has been chosen
- Normalization and code page conversion can and should be done in one step
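The effect of NFKC, and why it gives up round-trip convertibility, can be seen with Python's unicodedata module. This is a sketch for illustration only: the helper decode_and_normalize is a hypothetical name, and it merely wraps two passes (decode, then normalize) in one call, whereas the slide argues for a converter that fuses both into a single pass.

```python
import unicodedata


def decode_and_normalize(raw: bytes, code_page: str) -> str:
    """Convert external bytes to Unicode text in normalization form KC."""
    return unicodedata.normalize("NFKC", raw.decode(code_page))


# Composition: base letter + combining accent becomes one precomposed letter.
composed = unicodedata.normalize("NFKC", "e\u0301")

# Compatibility mapping: ligatures and full-width forms lose their variant.
ligature = unicodedata.normalize("NFKC", "\ufb01")    # LATIN SMALL LIGATURE FI
full_width = unicodedata.normalize("NFKC", "\uff21")  # FULLWIDTH LATIN CAPITAL A

# Byte 0xB2 is SUPERSCRIPT TWO in Latin-1; after NFKC it is a plain digit,
# so the original bytes can no longer be reconstructed from the result.
area = decode_and_normalize(b"m\xb2", "latin-1")

print(repr(composed), repr(ligature), repr(full_width), repr(area))
```

After normalization, variants that matter little for text analysis compare equal, which simplifies dictionary lookup and matching at the price of losing the original form.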

Slide 22: Benefits of Unicode Use
- No more code page troubles within the boundaries of the application
- Very often algorithms can be established for groups of languages
- Multi-language document collections and even mixed-language documents are no problem to represent
- Easy and efficient Java (JNI) integration

Slide 23: Summing Up: Building on Unicode…
- …solves only the basic character representation problem for multilingual text analysis
- …sets a solid foundation for a multilingual system
- …enables algorithms to be reused for groups of languages
- …can have impact on the system far beyond the character representation level
- …has been worth the trouble