Correcting Errors in Digital Lexicographic Resources Using a Dictionary Manipulation Language (DML) David Zajic*†, Michael Maxwell†, David Doermann*, Paul.

Presentation transcript:

Correcting Errors in Digital Lexicographic Resources Using a Dictionary Manipulation Language (DML)

David Zajic*†, Michael Maxwell†, David Doermann*, Paul Rodrigues†, Michael Bloodgood†
†University of Maryland Center for Advanced Study of Language (CASL)
*University of Maryland Institute for Advanced Computer Studies (UMIACS)
College Park, Maryland, USA

LANGUAGE RESEARCH IN SERVICE TO THE NATION

Introduction

Discovering and correcting errors in lexicographic data is a common task for teams dealing with digital lexicographic resources. We propose a paradigm for correcting errors and discuss its advantages for the task of editing digital bilingual dictionaries (Figure 1).

For example, in Qureshi (1971), an Urdu to English dictionary, the translation of Urdu word “بیجو” is “goal in a children’s game called باؤری”. In the source digitization this translation was split into “goal in children’s game called” and a separate lexical entry “باؤری”. The underlying text was split incorrectly into two distinct text spans.

Database and Version Control Solutions

A straightforward method of editing a structured digital resource is to store the information in a shared repository and allow experts to edit the repository contents.

Issues

Suppose a team wishes to undo a local change to the resource. There are a few issues:

1. It is straightforward to restore the resource to its state at a specific time, but all changes made subsequent to that time are lost.
2. It is difficult to produce a coherent record of changes from initial to final form.
3. If the editing process is a lossy transformation of the source into a standard format, meaning that it is not possible to reconstruct the original data from the transformed data, it is desirable to preserve a copy of the original source.
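Issue 1 can be illustrated with a minimal Python sketch. It assumes a hypothetical repository that stores a full snapshot of the lexicon after every edit; the headwords, glosses and edit order are invented for illustration and do not come from the dictionaries discussed here.

```python
import copy

# Hypothetical lexicon: headword -> gloss (invented entries, illustration only).
lexicon = {"w1": "gloss one", "w2": "gloss two"}
snapshots = [copy.deepcopy(lexicon)]      # state 0: as digitized

lexicon["w1"] = "mistaken gloss"          # edit A, later found to be wrong
snapshots.append(copy.deepcopy(lexicon))  # state 1

lexicon["w2"] = "improved gloss two"      # edit B, a good edit made after A
snapshots.append(copy.deepcopy(lexicon))  # state 2

# Undoing edit A by restoring the state before it also discards edit B.
restored = copy.deepcopy(snapshots[0])
print(restored)
```

In this sketch the only way to get rid of edit A is to go back to state 0, and the later, unrelated edit B has to be redone by hand.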
Dictionary Manipulation Language (DML) Paradigm

The key intuition of our paradigm for editing digital lexicographic resources is that the edits take the form of commands in DML rather than direct modifications to a shared resource. DML commands can be written manually by language experts or generated automatically by computer systems. The end-to-end process consists of reading the original source lexicon, applying a sequence of DML command sets to it, and writing the result to a destination resource. The original source file is never edited directly.

Figure 3 shows an excerpt from an XML document containing a structural error, some DML commands to correct the error, and the result of applying the DML commands to the source. DML commands operate on XML documents in which each element of the XML document has a unique identifier. Figure 4 shows a list of DML commands.

Advantages of the DML Paradigm

Preservation of source data
Non-chronological rollback
Feeding and bleeding
DML as documentation
DML as data
Support for collaboration between language experts and computer scientists

Figure 4: A list of DML commands.

    create textElement tag text relation anchor
    create element tag relation anchor
    create clone source relation anchor
    remove element target
    remove text target
    remove tail target
    remove attribute target attribute
    retag target tag
    move element target relation anchor
    move children target relation anchor
    move text/tail source text/tail target
    set attribute target attribute value
    set text/tail target text
    sub text/tail target pattern text
    combine element1 element2

Applications of DML

CASL’s lexicon repair team has used the DML paradigm to perform structural repair on digital sources for three bilingual dictionaries: Iraqi Arabic to English (Woodhead and Beene 2003), Yemeni Arabic to English (Qafisheh 2000) and Urdu to English (Qureshi 1971). Table 1 shows the rough scope of these projects.
These projects included restructuring the data to be compatible with a resource- and language-independent schema for bilingual lexicons, and conversion of non-Latin text from legacy encodings to Unicode.

References

Qafisheh, H. (2000). NTC’s Yemeni Arabic – English Dictionary. NTC Publishing Group, Chicago.
Qureshi, B.A. (1971). Kitabistan’s 20th Century Standard Dictionary Urdu into English. Kitabistan Publishing Co., Lahore.
Woodhead, D.R. and Beene, W. (Eds.) (2003). A Dictionary of Iraqi Arabic: English – Arabic, Arabic – English. Georgetown University Press, Washington, D.C.

Table 1: Numbers of lexical entries, manually written DML commands and automatically generated DML commands for three lexicon repair projects.

    Lexicon   Entries   Manual DML Commands   Automatic DML Commands
    Iraqi     13,719    4,759                 1,594,688
    Yemeni    16, ,685
    Urdu      44,237    5,                    ,612

Future Work

XML editor with DML support (Figure 2)
    Edit XML dictionary graphically
    Generate DML from user interactions
    Display result of DML commands visually
Generation of ground truth and training data for automatic error detection and automatic error correction
DML as annotation language for semi-supervised active machine learning

Figure 3: An XML excerpt from Urdu dictionary, DML commands to correct a structural error, and the result of applying the DML commands to the source XML.

    Source XML:
        <ENTRY ID="351782"> طرفہ tūr'fah... rare...

    DML commands:
        # ABC 5/27/2011 sense tagged as usage, retag
        CREATE element TRANS under T
        RETAG 351795 TR
        REMOVE attribute 351795 TIME
        MOVE element 351795 under T

    Result:
        طرفہ tūr'fah... rare... </ENTRY>

Figure 2: Screenshot of XML editor with DML support.

Figure 1: A paradigm for correcting errors using a DML interpreter.
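The paradigm of Figure 1 (read the source, apply a sequence of DML command sets, write the result) can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative sketch under stated assumptions, not the CASL interpreter: commands are encoded as Python tuples instead of DML text, only four of the Figure 4 commands are covered, the explicit identifier given to a newly created element is an assumption, and the sample entry with its tags and identifiers is hypothetical, loosely modeled on the repair in Figure 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical source entry; every element carries a unique ID attribute.
SOURCE = """<ENTRY ID="1">
  <FORM ID="2">word</FORM>
  <SENSE ID="3"/>
  <USG ID="4" TIME="rare">a gloss wrongly tagged as usage</USG>
</ENTRY>"""

def find(root, ident):
    """Return the element whose ID attribute equals ident."""
    for el in root.iter():
        if el.get("ID") == ident:
            return el
    raise KeyError(ident)

def parent_of(root, child):
    """ElementTree keeps no parent pointers, so search for the parent."""
    for el in root.iter():
        if any(c is child for c in el):
            return el
    raise KeyError("element has no parent")

def place(root, el, relation, anchor_id):
    """Attach el relative to the anchor: under, before or after it."""
    anchor = find(root, anchor_id)
    if relation == "under":
        anchor.append(el)
    elif relation in ("before", "after"):
        parent = parent_of(root, anchor)
        offset = 1 if relation == "after" else 0
        parent.insert(list(parent).index(anchor) + offset, el)
    else:
        raise ValueError(relation)

def apply_command(root, cmd):
    """Apply one command tuple (assumed encoding, not real DML syntax)."""
    op, *args = cmd
    if op == "create_element":
        tag, new_id, relation, anchor = args  # new_id is an assumption
        place(root, ET.Element(tag, {"ID": new_id}), relation, anchor)
    elif op == "retag":
        target, tag = args
        find(root, target).tag = tag
    elif op == "remove_attribute":
        target, attr = args
        del find(root, target).attrib[attr]
    elif op == "move_element":
        target, relation, anchor = args
        el = find(root, target)
        parent_of(root, el).remove(el)
        place(root, el, relation, anchor)
    else:
        raise ValueError(op)

def run(source_xml, command_sets, skip=()):
    """Parse a fresh copy of the source and replay the command sets.
    Sets named in skip are left out, whatever their position."""
    root = ET.fromstring(source_xml)
    for name, commands in command_sets:
        if name not in skip:
            for cmd in commands:
                apply_command(root, cmd)
    return root

COMMAND_SETS = [
    ("fix-usage-tag", [
        ("create_element", "TRANS", "5", "under", "3"),
        ("retag", "4", "TR"),
        ("remove_attribute", "4", "TIME"),
        ("move_element", "4", "under", "5"),
    ]),
    ("rename-form", [("retag", "2", "HEADWORD")]),
]

repaired = run(SOURCE, COMMAND_SETS)
rolled_back = run(SOURCE, COMMAND_SETS, skip={"fix-usage-tag"})
print(ET.tostring(repaired, encoding="unicode"))
```

Because every run starts from a fresh parse of the source, the source text itself is never modified, and leaving out an earlier command set while keeping a later one gives a simple form of non-chronological rollback. A later set that depends on elements created by a skipped set would fail in this sketch; how the real interpreter handles such dependencies is not described here.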