The NITE XML Toolkit Jean Carletta University of Edinburgh HCRC Language Technology Group.

NITE XML Toolkit
• Edinburgh, Stuttgart, DFKI
• NOT the NITE Workbench for Windows from the University of Southern Denmark

The NITE XML Toolkit
• integrated support for creating and searching different kinds of annotation on the same speech and video data
• data format that allows for distributed data production
• some standard GUIs, data utilities
• support for writing high quality hand-annotation tools for new tasks quickly

NXT corpus design
• data model is multi-rooted tree with arbitrary graph structure over the top
  – each node has one set of children, multiple parents
• annotations often naturally map to a tree
  – design task is deciding where trees intersect
• NXT can represent arbitrary graphs but the more the data has this character, the less useful the search is
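The data model described above can be pictured with a minimal Python sketch. This is not NXT's Java API: the class `Node`, the method `add_child`, and the two example layers are hypothetical names chosen for illustration. Each node keeps one ordered set of children but may be reached from several parents, so separate annotation trees can intersect on shared nodes.

```python
class Node:
    """One annotation element: a single ordered set of children, any number of parents."""

    def __init__(self, name):
        self.name = name
        self.children = []  # the node's one set of children, in order
        self.parents = []   # several trees may point at the same node

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)


# Hypothetical example: two trees that intersect at the word level.
words = [Node(w) for w in ["the", "red", "door"]]

syntax_root = Node("np")              # root of a syntax tree
dialogue_root = Node("dialogue-act")  # root of a separate dialogue-act tree

for w in words:
    syntax_root.add_child(w)
    dialogue_root.add_child(w)

roots = [syntax_root, dialogue_root]  # the corpus is multi-rooted
parent_names = [p.name for p in words[0].parents]
```

Here the design decision the slide mentions, where the trees intersect, is made at the word layer: both roots share the same word nodes rather than holding copies.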

Only configuration needed to:
• search/index data in NXT format
• display data in a standardized (ugly) way (NXT 1.3.0)
• do an increasing number of "usual" annotation tasks
  – dialogue act
  – named entity
  – time-stamped labelling like The Observer

Programming tailored interfaces
• development time is 1.5 days - 2 weeks depending on
  – how clear the spec is
  – complexity of the interface
  – familiarity with Swing
• NXT will include middleware reducing this and making typical program ~200 lines of code

GUI Demos

Recommended Data Paths (1)
• Transcribe data outside NXT
  – Transcriber or multi-channel version of it
• Create timestamped base layers either in NXT or in your favourite other tool
  – The Observer, Anvil, TASX, EventEditor

Recommended Data Paths (2)
• Use NXT as a reference storage format for shared data
  – everyone contributes data to a CVS repository from which different versions of the corpus can be built
• Work in NXT natively when sensible
  – to create annotations structured over base layers
  – search/index
• Use NXT's generic utilities (or roll your own) to export data, run it through some machine process, and re-import the result
  – POS, morphology, automatic annotation based on statistical model

Up-translation into NXT format
• existing translations for several common tools
• take 0.5-4 days to write, depending on
  – documentation of input format
  – complexity of mapping
• complete lattice output of SR takes thought

Why NXT?
• best support for distributed creation of hand-annotations structured over transcription
• best search facility for integrated data set
• any other approach takes more dedicated development time; main task here is corpus design and up-translation

Reported Problems at Installation
• won't run
  – zip file truncated during download
  – forgot to set classpath
  – don't have Java
• can't get signal to play
  – video codec not installed/not registered in JMF
  – format not supported by JMF
• no one thing to run

Reserves

Stand-off XML
[Figure: extract from Bdb001.A.words.xml and extract from Bdb001.A.speech-quality.xml, shown with a timeline]
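Since the XML extracts on this slide did not survive, the idea of stand-off markup can be sketched in Python under loud assumptions: the element names, the `href` attribute, and the `file#id` link form below are a simplified, hypothetical format, not NXT's actual link syntax. Words live in one file with their timings; a speech-quality annotation in another file points at them instead of wrapping them.

```python
import xml.etree.ElementTree as ET

# Hypothetical base layer: timed words, each with an id.
WORDS_XML = """
<words>
  <w id="w1" start="0.0" end="0.4">that</w>
  <w id="w2" start="0.4" end="0.9">thing</w>
</words>
"""

# Hypothetical stand-off layer: points into the words file by id.
QUALITY_XML = """
<qualities>
  <speechquality id="q1" type="emphasis">
    <child href="words.xml#w2"/>
  </speechquality>
</qualities>
"""

# Index every identified element of the base layer by (file name, id).
files = {"words.xml": ET.fromstring(WORDS_XML)}
index = {
    (fname, el.get("id")): el
    for fname, root in files.items()
    for el in root.iter()
    if el.get("id") is not None
}


def resolve(href):
    """Follow a 'file#id' link to the element it names."""
    fname, _, element_id = href.partition("#")
    return index[(fname, element_id)]


quality = ET.fromstring(QUALITY_XML).find("speechquality")
targets = [resolve(c.get("href")) for c in quality.findall("child")]
emphasised = [(t.text, float(t.get("start")), float(t.get("end"))) for t in targets]
```

Because the annotation only references the words, several annotation files can be produced independently over the same base layer, which is what makes distributed data production workable.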

GUI support (low level)
• a central clock keeps data displays/signal in synch
• pre-defined display widgets for text areas, trees, grids
• interfaces that displays can implement
  – in order to stay synchronized with clock
  – to allow search results to be highlighted
• predefined GUIs for displaying a dialogue, searching a corpus that work for anything

Metadata file
• Equivalent to set of DTDs for the XML files plus:
  – connections between the files
  – list of "observations" (coded dialogues/group discussions/texts)
  – catalog for finding signals and data on disk

Data Handling API
• Load corpus or meaningful subparts of a corpus (down to individual XML file)
• Data access, traversal, and manipulation with most important validation done on-line
• Serialization with choice of standoff syntax
• Off-line procedure for full validation
• All data is held in memory; "dump-n-reload" memory management planned

Query/search

Simple example query
($w word)($r reference): = “NN”) && ($r ^ $w)
Match pairs of words and referring expressions where the word’s part of speech is NN and the word is in the referring expression.
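What this query computes can be sketched in plain Python over toy data. This is not the NXT search engine, and everything here is an assumption for illustration: the dictionary layout, the attribute key `pos`, and the helper `contains`. The sketch enumerates every (word, reference) pair and keeps those that pass both tests named in the description.

```python
from itertools import product

# Toy data: words carry a part-of-speech attribute (the key "pos" is assumed);
# referring expressions list the ids of the words they contain.
words = [
    {"id": "w1", "text": "the", "pos": "DT"},
    {"id": "w2", "text": "door", "pos": "NN"},
    {"id": "w3", "text": "opens", "pos": "VBZ"},
    {"id": "w4", "text": "key", "pos": "NN"},
]
references = [
    {"id": "r1", "words": ["w1", "w2"]},
]


def contains(reference, word):
    """Stand-in for the 'word is in the referring expression' test."""
    return word["id"] in reference["words"]


# Each result is a pair, one binding per declared variable.
matches = [
    (w["id"], r["id"])
    for w, r in product(words, references)
    if w["pos"] == "NN" and contains(r, w)
]
```

The word `key` is a noun but sits outside every referring expression, so it produces no pair; only `door` inside `r1` matches.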

General features of the language
• Match variable by no type, single type, or disjunctive type
• The usual boolean operators plus some syntactic sugar, like ->
• Quantifiers forall and exists (which do not contribute to the n-tuple returned)

Attribute and content tests
• Existence
• Ordering and equality against numbers and strings
• Match to regexp

Temporal tests
• Whether data object is timed
• Start or end time before, after, same as given time
• Same temporal extent, inclusion, abutment, overlap, temporal precedence
• Start and end times treated as special attributes, for finer comparisons
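The interval relations listed above can be made concrete with a short Python sketch. The definitions below are one plausible reading, not the query language's specification: the type `Span`, the function names, and the choice of strict versus non-strict comparisons are all assumptions.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A timed data object, reduced to its start and end times in seconds."""
    start: float
    end: float


def same_extent(a, b):
    return a.start == b.start and a.end == b.end


def includes(a, b):
    """a covers b entirely."""
    return a.start <= b.start and b.end <= a.end


def abuts(a, b):
    """a ends exactly where b starts."""
    return a.end == b.start


def overlaps(a, b):
    """a and b share some stretch of time of non-zero length."""
    return a.start < b.end and b.start < a.end


def precedes(a, b):
    """a is over by the time b starts."""
    return a.end <= b.start


word = Span(1.0, 1.5)
turn = Span(0.5, 2.0)
next_turn = Span(2.0, 3.0)
```

With these values, `turn` includes `word`, `turn` abuts `next_turn` without overlapping it, and `word` precedes `next_turn`.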

Structural tests
• Identity
• Dominance (traceable through 0 to n children)
• Precedence (before in some tree ordering)
• Relationship via a role, which must be named
• Some distance/tree-limited functionality
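Dominance and precedence can be illustrated with a small Python sketch over a toy tree. The tree, the node names, and the choice of pre-order as the "tree ordering" are assumptions made for illustration; the query language's own ordering rules may differ.

```python
# Toy tree as a parent -> ordered children map (hypothetical node names).
children = {
    "turn": ["np", "vp"],
    "np": ["w1", "w2"],
    "vp": ["w3"],
}


def dominates(ancestor, node):
    """True if node is reachable from ancestor through 0 to n child links."""
    if ancestor == node:  # the zero-step case
        return True
    return any(dominates(c, node) for c in children.get(ancestor, []))


def preorder(root):
    """Visit a node, then each of its subtrees from left to right."""
    yield root
    for c in children.get(root, []):
        yield from preorder(c)


def precedes(a, b, root="turn"):
    """a comes before b in pre-order and does not dominate it (assumed definition)."""
    order = list(preorder(root))
    return order.index(a) < order.index(b) and not dominates(a, b)
```

Under these definitions `dominates("turn", "w3")` holds through two child links, while `precedes("np", "w1")` is false because `np` dominates `w1` rather than preceding it.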

Complex queries
• Evaluate first query, and carry over resulting bindings when evaluating second
• Result is a tree
• Any n-tuples from the first query that have no matches for the second are removed
• Faster to run, more intuitive to write, easier to perform frequency counts

Example complex query
($a w):(TEXT($a) ~ /th.*/):: ($s speechquality):($s ^ $a) &&
• Find instances of words starting with “th”
• For each find instances of speech quality tags of type "emphasis" that dominate the word
• Discard words that are not dominated by at least one such tag
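The two-stage evaluation described in these three steps can be sketched in Python. This mimics the behaviour only; it is not how NXT implements it. The data layout, the attribute key `type`, and the tag value `laugh` are hypothetical, while `emphasis` and the `th.*` pattern come from the slide.

```python
import re

words = [
    {"id": "w1", "text": "that"},
    {"id": "w2", "text": "thing"},
    {"id": "w3", "text": "works"},
]
# Speech-quality tags, each listing the ids of the words it dominates.
tags = [
    {"id": "q1", "type": "emphasis", "words": ["w2", "w3"]},
    {"id": "q2", "type": "laugh", "words": ["w1"]},
]

# First query: words whose text starts with "th".
first = [w for w in words if re.match(r"th.*", w["text"])]

# Second query: evaluated once per binding carried over from the first.
result = []
for w in first:
    second = [
        t["id"]
        for t in tags
        if t["type"] == "emphasis" and w["id"] in t["words"]
    ]
    if second:  # bindings with no match for the second query are removed
        result.append((w["id"], second))
```

The result is nested rather than flat: each surviving word carries its own list of matching tags, which is why counting matches per word is straightforward.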

Uses for queries
• Exploring the data
• Basic frequency counts
• Verifying data quality
• Indexing complexes for further use
• Finding things for screen rendering in GUI

Warts
• Currently builds in-memory representation of complete data set being loaded
  – work-arounds: process one dialogue at a time, don't load the annotations you don't need
  – lazy loading and better memory management under development
• In large, distributed corpora, pain to assemble the subcorpus you want
  – build mechanism under development
• Some useful things missing from query language
  – arithmetic
  – distance-limited precedence