Language resources and their commercial applications
Kara Warburton


ISO/TC 37 Terminology and other language and content resources

My aim
- Demonstrate the value of language resources for commercial applications
- Discuss why standards for language resources are important
- Present TC37 as a standards-developing organization
- Warning: slight terminology bias!

Managing language resources
- A language resource is:
  - Information expressed in a natural language
  - Information that supports the interpretation of natural language
- Language resources can enhance business processes:
  - If properly deployed
  - Requires interoperability, which in turn requires standards

Why me?
- Implemented terminological resources, lexical resources, and standards for content interoperability in business environments: terminologist for IBM, LISA contributor, business consultant
- Developed standards and best practices for language resources: ISO TC37, LISA
- Practical experience as a technical writer and translator, using language resources in increasingly technical environments

The cold reality
- The computer age has generated exponential growth of information and knowledge.
- Even with the aid of computers, we can’t manage this volume of information. Why?
  - Computers can’t understand “natural” language. They only understand “1” and “0”.
  - Natural language is largely unstructured; even many structured language resources are “unpredictably” structured.
- This environment demands increasing volumes of structured language resources to enable next-generation computing.

Business scenarios for managing language resources
- Translation memories
- Terminologies and lexical resources for enhancing NLP applications
- Content management and retrieval
- Content repurposing
- Content classification
- Normalized language
- Keyword management
- Example: term extraction tool, which uses “layered” lexical resources, grammatical rules, and ranking algorithms
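
The term extraction example can be illustrated with a toy sketch. It is not the tool referred to on the slide: the word lists, the sample scoring, and the function name `extract_candidates` are all assumptions made for illustration, and the "grammatical rule" is reduced to a crude stand-in (a candidate is a run of words containing no general-language word).

```python
import re
from collections import Counter

# Hypothetical "layered" lexical resources (illustrative data only).
GENERAL_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it",
                 "can", "be", "from", "open", "started"}
KNOWN_TERMS = {"garbage collection"}  # terms already recorded in the termbase


def extract_candidates(text, max_len=3):
    """Return new term candidates, best-ranked first."""
    counts = Counter()
    # Rule layer (crude stand-in for grammatical rules): a candidate is a run
    # of 2..max_len consecutive words, within one sentence, that contains no
    # general-language word.
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in GENERAL_WORDS for w in gram):
                    counts[" ".join(gram)] += 1
    # Lexical layer: drop candidates the termbase already contains.
    for known in KNOWN_TERMS:
        counts.pop(known, None)
    # Ranking layer: score = frequency x number of words; ties sorted by name.
    return sorted(counts, key=lambda t: (-counts[t] * len(t.split()), t))


SAMPLE = ("Open the administration console. The administration console is open. "
          "Garbage collection can be started from the administration console.")
print(extract_candidates(SAMPLE))
```

A production tool would replace the stop-word test with part-of-speech patterns and a statistically grounded ranking; the point here is only how the three layers feed one another.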

Managing terminology supports both social and commercial interests
Economic/commercial:
- Control terminology to ensure quality and minimize production costs.
- Build terminological resources that are repurposable across the content management chain.
- Increase competitiveness in local and global markets.
Social/geopolitical:
- Strengthen and protect minority languages.
- Support cultural diversity.
- Increase global presence and visibility.

Is “managed” terminology really important for a business?
- In the automotive industry, almost 50% of translation errors are “wrong term” (Woyde)
- 40% of time required for text production is terminology work (Stellbrink)
- Between 30% and 70% of errors in technical documentation are terminology errors (Schutz, and MULTIDOC)
- Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)
- Return on investment: 10% ($100 investment yields $110 return) (Champagne)
- Outsourced translations may be 50% more expensive if source terminology is inconsistent (Kjeldgaard)

Need more proof?
- Terminology tools increase productivity by approx. 20% (Champagne)
- Without a central reference, each needless search can take 20 to 30 minutes (Champagne)
- It costs 10 times more to fix a term at the end of the production cycle than at the beginning (Xerox, JDEdwards)
- Inconsistent or inaccurate terminology raises service costs
- Terminology mistakes can lead to lawsuits for copyright or trademark infringement, or for damages due to defective products or incorrect user documentation.

IBM scenario…
- “Terminology work is necessary for between 4% and 6% of all words in a text” (Champagne)
- 429 million words are translated per year in IBM. Thus over 21 million words require attention.
- In 2009, IBM “processed” over 160,000 terms as part of the “content conveyor belt”, in nearly 3,000 specialized “dictionaries”.
- Very small staff
- High degree of automation
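
The arithmetic behind the "over 21 million" figure can be checked with a short sketch. The 429 million and the 4% to 6% range come from the slide; reading the 21 million as the 5% midpoint is an assumption, and the function name is hypothetical.

```python
WORDS_TRANSLATED_PER_YEAR = 429_000_000  # figure quoted on the slide


def words_needing_terminology_work(total_words, percent):
    """Words that need terminology work when `percent` of all words do."""
    # Integer arithmetic avoids floating-point rounding noise.
    return total_words * percent // 100


low = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 4)
mid = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 5)
high = words_needing_terminology_work(WORDS_TRANSLATED_PER_YEAR, 6)
print(f"{low:,} to {high:,} words per year (midpoint: {mid:,})")
```

Under that reading, the range is roughly 17 to 26 million words per year, with the 5% midpoint landing just above 21 million.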

What “measures” need to be taken?
- Deploy a terminology database that serves multiple purposes
- Integrate the database into all content environments to ensure a “push” mechanism
- Respect data management principles, such as data granularity, elementarity, etc.
- Adopt best practices for terminology, such as term autonomy and concept orientation
- Allow for extensibility for features such as morphology as needed for future applications
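
A minimal sketch of what these measures could look like as a data model, assuming a Python representation. The class names, field names, status values, and the choice of which term is "preferred" are all hypothetical. Concept orientation is modeled as one entry per concept holding all of its terms; term autonomy as every term carrying its own full set of fields; extensibility as an open `extra` mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One term with its own complete description (term autonomy)."""
    text: str
    language: str
    status: str = "admitted"  # assumed values: preferred, admitted, deprecated
    part_of_speech: str = "noun"
    extra: dict = field(default_factory=dict)  # room for morphology etc.


@dataclass
class ConceptEntry:
    """One entry per concept, grouping every term for it (concept orientation)."""
    concept_id: str
    definition: str
    terms: list = field(default_factory=list)

    def terms_with_status(self, language, status):
        return [t.text for t in self.terms
                if t.language == language and t.status == status]


entry = ConceptEntry(
    concept_id="C0001",  # hypothetical identifier
    definition="(definition text would go here)",
    terms=[
        Term("administration console", "en", "preferred"),
        Term("admin console", "en", "deprecated"),
        Term("console d'administration", "fr", "preferred",
             extra={"gender": "feminine"}),
        Term("pupitre d'administration", "fr", "deprecated",
             extra={"gender": "masculine"}),
    ],
)

# The same entry serves several purposes:
glossary_en = entry.terms_with_status("en", "preferred")            # published glossary
authoring_blocklist_en = entry.terms_with_status("en", "deprecated")  # authoring checker
```

Because each field holds a single, elementary piece of data, the same record can feed a glossary, a controlled-authoring checker, and a translation tool without restructuring.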

Basic example – repurpose information

Controlled authoring

Controlled translation

Search – Query expansion
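
The slide content itself did not survive in this transcript, so the following is only a sketch of the general idea of query expansion: a search query is widened with the synonyms recorded in a terminology resource. The synonym sets reuse English terms that appear elsewhere in this deck; the data structure and the function name `expand_query` are assumptions.

```python
# Hypothetical synonym sets, as they might be exported from a termbase.
SYNONYM_SETS = [
    {"administration console", "admin console", "administrative console"},
    {"garbage collection", "automatic memory reclamation",
     "automatic storage reclamation"},
]


def expand_query(query):
    """Return the query plus every recorded synonym, sorted for stable output."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for synonyms in SYNONYM_SETS:
        if normalized in synonyms:
            expanded |= synonyms
    return sorted(expanded)


print(expand_query("Admin Console"))
```

A query with no recorded synonyms is returned unchanged, so the expansion step can sit in front of any search engine without altering results for unknown terms.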

Source data…

Search – Query correction

Synonyms/inconsistencies multiply in the target language – this is bad for business
automatic memory reclamation
  - remise en état automatique du mémoire
  - récupération automatique de mémoire
automatic storage reclamation
  - remise en état automatique de l’archivage
  - remise en état automatique du stockage
  - récupération automatique de l’archivage
  - récupération automatique du stockage
garbage collection
  - récupération de place
  - vidage de la corbeille
  - récupération de place en mémoire
  - récupération de positions inutilisées
  - récupération de l’espace mémoire

“Cosmetic” differences can become more than cosmetic in the target language
English:
  1. administration console
  2. admin console
  3. administrative console
French:
  1. pupitre d’administration
  2. console d’administration
  3. pupitre admin
  4. console admin
  5. pupitre administratif
  6. console administrative

Explosion of affected compounds…
- administrative console application / administration console application
- administrative console button / administration console button
- administrative console login page / administration console login page
- core administrative console / core administration console
- ....
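
How quickly the variants multiply can be sketched by combining each compound pattern with each variant of the head term. The variant lists and compound patterns are taken from the slides; the assumption that every compound can inherit every variant, and the helper name `compound_variants`, are illustrative.

```python
from itertools import product

ENGLISH_VARIANTS = ["administration console", "admin console",
                    "administrative console"]
FRENCH_VARIANTS = ["pupitre d'administration", "console d'administration",
                   "pupitre admin", "console admin",
                   "pupitre administratif", "console administrative"]
# "{}" marks where the head term goes in each compound.
COMPOUND_PATTERNS = ["{} application", "{} button", "{} login page", "core {}"]


def compound_variants(variants, patterns):
    """Every compound that results when each pattern takes each variant."""
    return [pattern.format(variant)
            for pattern, variant in product(patterns, variants)]


english_compounds = compound_variants(ENGLISH_VARIANTS, COMPOUND_PATTERNS)
print(len(english_compounds), "possible English compound forms")
# Upper bound for French if each compound may be built on any of the six variants:
print(len(COMPOUND_PATTERNS) * len(FRENCH_VARIANTS), "possible French renderings")
```

Four compounds and three source variants already yield twelve English forms; with six French variants, the same four compounds could be rendered in up to twenty-four ways.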

Fixing the problem isn’t easy…
Change “pupitre” to “console”…
- Before: Le pupitre administratif est ouvert. Vous devez le fermer.
- After: La console administrative est ouverte. Vous devez la fermer.
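
Why a global find-and-replace does not fix this can be shown with a short sketch. "Pupitre" is masculine and "console" is feminine, so the article, adjective, participle, and pronoun must all change with the noun. The two sentences come from the slide; the helper names are hypothetical.

```python
ORIGINAL = "Le pupitre administratif est ouvert. Vous devez le fermer."
CORRECT = "La console administrative est ouverte. Vous devez la fermer."


def naive_replace(text):
    """Swap the noun only, as a global find-and-replace would."""
    return text.replace("pupitre", "console")


def words_changed(before, after):
    """Count word positions that differ (both texts have the same word count)."""
    return sum(1 for old, new in zip(before.split(), after.split()) if old != new)


naive = naive_replace(ORIGINAL)
print(naive)                             # gender agreement is now broken
print(words_changed(ORIGINAL, naive))    # words touched by the naive fix
print(words_changed(ORIGINAL, CORRECT))  # words a correct fix must touch
```

The naive replacement touches one word, while the correct sentence differs in five, which is why late terminology changes are costly in gendered target languages.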

Development of terminology resources is also key for language planning
- Prescriptive terminology approach, just like in enterprise environments
- The Canadian experience: Termium, the BTQ
- Other examples: Danterm, Korterm, Eurotermbank
- Termbases feed into widely distributed bulletins and other distribution media to support adoption and language reinforcement:
  - As an educational resource
  - For social and political policies
  - As an aid to commerce

Effective management of language resources requires adherence to standards and best practices


Interoperability requires adherence to standards
- Interoperability between tools and applications: CAT tools vs controlled authoring, Web interfaces, GMS, ECM, search engines…
- Interoperability between users: writers, translators, publicists
- For delivering derivative products: glossaries, Web sites, etc.
- For different purposes: learning, commercialization, government, social services, language planning, tourism, etc.
- For different media: online vs paper, hand-helds, transport interfaces, broadcasting media, marketing collateral, etc.

Standards at various levels
- Data transfer
- File format
- File structure (data model)
- Encoding
- Markup
- Syntax
- Semantics

ISO TC37 – Terminology and other language and content resources
- Standardization of principles, methods and applications relating to terminology and other language and content resources in the contexts of multilingual communication and cultural diversity.
- Web site…

TC37 Current focus areas
- Word segmentation
- Language annotations to facilitate machine processing
- Terminology policies
- Translation quality
- Simultaneous interpretation
- Data categories - XML representation and exchange formats
- Persistent identifiers in multilingual environments

Key standards and best practices – ISO TC37
- ISO – TBX
- ISO – TMF
- ISO – new ISO TC37 Data Category Registry
- ISO Concept Database
- ISO 704 – Terminology work: Principles and methods
- ISO – Translation-oriented terminography
- ISO – Design, implementation and maintenance of terminology management systems
- Annotation schemes and frameworks (SC4)

Training professionals in language resource management – an opportunity!
- Lack of university training programs
- Lack of competency in existing fragmented university courses
- Increasing demand for qualified professionals
  - For example, LISA offered 6 workshops (there was a demand for more); 73 companies attended.
  - TermNet summer school: attendance grows each year

Thank you!