2XML
Marko Tadić
Department of Linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr)
Tübingen, 2000-11-08

Human language technology
language resources
  – corpora
  – dictionaries
language tools
  – language resource organizing and retrieval tools
  – morphology
  – syntax
  – semantics
  – ...

Text availability for building corpora
written language
  – flood of text in digital form
  – “cheap” sources
spoken language
  – difficulties in data collection
      problems of recording
      problems of transcription
      problems of spontaneity of speakers
      “expensive” source (typing)
both language varieties
  – corpus as text in digital form

WWW as a text source
estimate of the number of words accessible through AltaVista (source: Greg Grefenstette, XRCE, )
automated conversion of texts to a standardized format is needed

Corpus encoding standards
pre-mark-up encoding
SGML (1980s to mid-1990s)
  – Text Encoding Initiative (TEI)
  – Corpus Encoding Standard (CES)
      Ide et al. (1996)
XML (last couple of years)
  – XCES (XML version of CES)
      Ide, Bonhomme & Romary (2000)

Conversion to XML
2XML
  – tool for conversion
  – input formats
      HTML
      RTF
  – output format
      XML

2XML 1
producer
  – Institute of Linguistics, Faculty of Philosophy, University of Zagreb
programming
  – Softleks d.o.o., Zagreb
platforms
  – Windows 9x/ME/NT/2000
requirements
  – Internet Explorer 5.* to run

2XML 2
principle: two-step conversion
1st step
  – input: HTML or RTF
  – output: intermediate “dirty” XML
2nd step
  – input: “dirty” XML
  – user-defined script applied to it
  – output: XML document
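
The two-step principle above can be illustrated with a minimal Python sketch. This is not the 2XML implementation, whose internals and script language are not described in the slides. It assumes HTML input only, a hypothetical wrapper element `doc` around the "dirty" XML, and a "user-defined script" reduced to two hypothetical rule sets: `rename` (tag to new tag) and `drop` (tags removed together with their content). All attributes are discarded in step 2.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class DirtyXMLBuilder(HTMLParser):
    """Step 1: turn HTML into well-formed but still 'dirty' XML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.builder.start("doc", {})  # hypothetical wrapper root

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # close unclosed inner elements first, then the tag itself
            while self.open_tags:
                top = self.open_tags.pop()
                self.builder.end(top)
                if top == tag:
                    break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("doc")
        return self.builder.close()


def html_to_dirty_xml(html):
    parser = DirtyXMLBuilder()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.finish()


def clean(elem, rename, drop):
    """Step 2: apply the user-defined rules; returns a new tree."""
    new = ET.Element(rename.get(elem.tag, elem.tag))  # attributes discarded
    new.text = elem.text or ""
    for child in elem:
        if child.tag in drop:
            # keep the text that followed the dropped element
            kept_tail = child.tail or ""
            if len(new):
                new[-1].tail = (new[-1].tail or "") + kept_tail
            else:
                new.text += kept_tail
            continue
        new_child = clean(child, rename, drop)
        new_child.tail = child.tail or ""
        new.append(new_child)
    return new


html = '<h1 class="t">Title</h1><p>One <b>bold</b> word<script>x()</script> end</p>'
dirty = html_to_dirty_xml(html)
result = clean(dirty, rename={"doc": "text", "h1": "head", "b": "hi"}, drop={"script"})
print(ET.tostring(result, encoding="unicode"))
```

Keeping the intermediate tree separate from the rules means the same step 1 output can be cleaned in different ways for different target encodings.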

2XML Conversion: step 1

2XML Conversion: step 2

2XML user-defined script

2XML Goodies
goodies
  – XML tree labeling
  – XML text editing
  – execute script on load
  – batch processing: whole directory
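
Batch processing of a whole directory might look roughly like the following Python sketch. The function name `convert_directory`, the file patterns, and the `convert` callback (any function mapping input text to XML text) are assumptions made for illustration; the slides do not say how 2XML selects or names files.

```python
from pathlib import Path


def convert_directory(src_dir, dst_dir, convert,
                      patterns=("*.html", "*.htm", "*.rtf")):
    """Apply `convert` to every matching file and write <name>.xml."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            out = dst / (path.stem + ".xml")
            text = path.read_text(encoding="utf-8")
            out.write_text(convert(text), encoding="utf-8")
            written.append(out)
    return written
```

Note that in this sketch two inputs with the same base name (for example `a.html` and `a.rtf`) would write to the same output file, so a real tool would need a naming rule for that case.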

2XML: tree labeling & editing

2XML Tokenizer
program which tokenizes XML files
output in two formats
  – tokenized XML file
  – tabbed file
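
A minimal Python sketch of a tokenizer with the two output formats named above. The actual token rules, element names, and column layout of the 2XML Tokenizer are not given in the slides, so everything specific here is an assumption: tokens are runs of word characters or single punctuation marks, each token is wrapped in a hypothetical `w` element, and the tabbed output has three hypothetical columns (running number, token, enclosing element).

```python
import re
import xml.etree.ElementTree as ET

# assumed token rule: a run of word characters, or one punctuation mark
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_xml(root):
    """Return (tokenized XML tree, tab-separated token listing)."""
    rows = []  # (running number, token, enclosing element tag)

    def add_tokens(dst, text, tag):
        for tok in TOKEN.findall(text or ""):
            ET.SubElement(dst, "w").text = tok
            rows.append((len(rows) + 1, tok, tag))

    def walk(src, dst):
        add_tokens(dst, src.text, src.tag)
        for child in src:
            walk(child, ET.SubElement(dst, child.tag, dict(child.attrib)))
            # text after a child still belongs to the parent element
            add_tokens(dst, child.tail, src.tag)

    out = ET.Element(root.tag, dict(root.attrib))
    walk(root, out)
    tabbed = "\n".join(f"{n}\t{tok}\t{tag}" for n, tok, tag in rows)
    return out, tabbed


tree, table = tokenize_xml(ET.fromstring("<text><p>One <hi>bold</hi> word.</p></text>"))
print(ET.tostring(tree, encoding="unicode"))
print(table)
```

Because `\w` is Unicode-aware in Python 3, letters with diacritics such as č, ć, š and ž stay inside one token.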

2XML Tokenizer 2

Tokenizer output: dic file

Tokenizer output: XML file
