Language data and XML: archiving and interoperability
Simon Musgrave
Linguistics Program
Monash University
DRH Cheltenham 2/9/03
Language documentation
  Language documentation produces large quantities of text
    –Transcribed language events
    –associated annotations
    –lexica / dictionaries
    –analyses
    –ethnographic notes
    –…
  There is no standard software tool used by linguists
  Use of proprietary software results in file formats with limited portability
Advantages of XML: Archiving
  UNICODE compatibility assured
    –Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists
  Explicit coding of data model
  Generic file format assures better portability and lifespan
Building an archive
  Addition of data to an XML archive should be automated
  This implies the existence of transformation scripts to move data between formats
  Creating these scripts is work which has to be done
  This work can have a second benefit
Advantages of XML: Interoperability
  Members of a research team may use different software running on different platforms
  Problems can arise in sharing data
  An important use of XML is as an interchange format
  Transformation scripts created for archiving can also be used for sharing data
Data structures - 1
  Researchers may not agree on common data structures
    –They are used to working with one tool in one particular way
    –Their interests are different
  Even if they agree on a data structure for current work, heritage data may have to be imported to the archive
Data structures - 2
  Archive files must be able to hold all the information coded in all the possible input formats - there should be no loss of data
  We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
  Where possible, correspondences will be made between the information in different input files
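The unification idea can be sketched in a few lines of Python. This is only an illustration, not the project's code: attribute-value matrices are modelled as nested dictionaries, and the function name `unify` and the sample entries (including the placeholder headword `form1`) are assumptions made for the example.

```python
def unify(a, b):
    """Unify two attribute-value structures held as nested dicts.

    Attributes present in only one input are copied over; nested
    structures are unified recursively; two different atomic values
    for the same attribute are a clash and raise ValueError.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = unify(result[key], b_val)
        elif result[key] != b_val:
            raise ValueError(f"clash on {key!r}: {result[key]!r} vs {b_val!r}")
    return result


# Hypothetical records for the same headword from two input files.
entry_a = {"headword": "form1", "gloss": {"english": "water"}}
entry_b = {"headword": "form1", "gloss": {"indonesian": "air"}, "pos": "n"}

merged = unify(entry_a, entry_b)
```

Here `merged` holds everything from both inputs, which is the "no loss of data" requirement; a clash signals records that need manual attention.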
Example: Dictionary files
  The prototype implementation of the process uses a simple type of information: dictionary files
  Source 1 is a FilemakerPro database of lexical material from the language Nusalaut
  Source 2 is a table in an Access database containing data from several languages
Source 1
Source 2
Process overview
Stage 1 – txt to xml
  Data exported from database as delimited text file
  A document type definition (DTD) is created for each source file
    –This replicates the existing data structure, possibly with additions
  A Perl script reads data from the txt file and adds tags based on the DTD
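A minimal sketch of this stage, written in Python rather than the Perl used in the project. The field list, the element names (`dictionary`, `entry`) and the sample rows are all assumptions; in the real process the tag names would come from the source-specific DTD.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column order of the exported text file.
FIELDS = ["headword", "pos", "gloss"]


def delimited_to_xml(text, fields, delimiter="\t"):
    """Wrap each row of a delimited export in elements named after its columns."""
    root = ET.Element("dictionary")
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:  # skip blank lines in the export
            continue
        entry = ET.SubElement(root, "entry")
        for name, value in zip(fields, row):
            ET.SubElement(entry, name).text = value
    return root


# Placeholder data standing in for a tab-delimited database export.
sample = "form1\tn\twater\nform2\tv\tto go\n"
root = delimited_to_xml(sample, FIELDS)
xml_text = ET.tostring(root, encoding="unicode")
```

The output uses elements only and mirrors the flat structure of the source, as described for Stage 1.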
Sample: specific XML
Stage 1 – Why?
  Newer versions of commercial software offer an export to XML facility
  Importing data from a normalized database often means having access to data from more than one table
    –XSLT takes a single input file
    –Perl (or an equivalent) does not have this limitation
  Type conversion can be done using Perl
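The multi-table point can be illustrated with a short Python sketch (Python standing in for "Perl or an equivalent"). The two tab-delimited exports, their column layout and the function names are hypothetical; the example joins them on a shared id and converts that id from text to an integer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical exports from two tables of a normalized database.
entries_txt = "1\tform1\n2\tform2\n"          # id, headword
glosses_txt = "1\twater\n1\triver\n2\tto go\n"  # entry id, gloss


def read_rows(text):
    """Return the non-empty rows of a tab-delimited export."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def join_tables(entries_text, glosses_text):
    """Attach every gloss to its entry, converting ids from str to int."""
    glosses = defaultdict(list)
    for entry_id, gloss in read_rows(glosses_text):
        glosses[int(entry_id)].append(gloss)
    joined = []
    for entry_id, headword in read_rows(entries_text):
        key = int(entry_id)
        joined.append(
            {"id": key, "headword": headword, "glosses": glosses.get(key, [])}
        )
    return joined


records = join_tables(entries_txt, glosses_txt)
```

Each joined record can then be written out as one XML entry, which a single-input transformation could not assemble from the separate exports.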
Stage 2 – XML1 to XML2
  DTD for archive file has a place for all information in all input files
  More structure imposed at this level
    –Stage 1 used only elements
    –Stage 2 uses attributes, mainly for metadata
    –“Pseudo-normalization”: recurring data substructures treated as optionally recurring elements – the archive data structure is actually more general than ANY of the inputs
  Date stamping done at this stage
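A rough Python sketch of what this stage might look like. It is not the project's transformation: the element and attribute names (`archive`, `source`, `added`) are invented for the example, and it assumes that a source file packs several glosses into one field separated by semicolons, which then become repeated `gloss` elements.

```python
import datetime
import xml.etree.ElementTree as ET


def to_archive(specific_root, source_name, today=None):
    """Convert source-specific XML (elements only) to a more general form.

    Metadata goes into attributes, each entry is date stamped, and a
    multi-valued field is split into optionally recurring elements.
    """
    stamp = (today or datetime.date.today()).isoformat()
    archive = ET.Element("archive")
    for entry in specific_root.findall("entry"):
        out = ET.SubElement(
            archive, "entry", {"source": source_name, "added": stamp}
        )
        ET.SubElement(out, "headword").text = entry.findtext("headword")
        # Assumption: glosses are separated by ";" in the source field.
        for gloss in (entry.findtext("gloss") or "").split(";"):
            gloss = gloss.strip()
            if gloss:
                ET.SubElement(out, "gloss").text = gloss
    return archive


specific = ET.fromstring(
    "<dictionary><entry><headword>form1</headword>"
    "<gloss>water; river</gloss></entry></dictionary>"
)
archive = to_archive(specific, "source1", today=datetime.date(2003, 9, 2))
print(ET.tostring(archive, encoding="unicode"))
```

Passing `today` explicitly keeps the example reproducible; in normal use the current date would be taken.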
Sample: General XML 1
Sample: General XML 2
Exporting Data
  XSLT with
  The only complication is undoing “pseudo-normalization”
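Undoing the pseudo-normalization amounts to collapsing repeated elements back into a single field. The project does this with XSLT; the sketch below shows the same idea in Python, with an invented archive fragment and an assumed "; " separator for the target format.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive fragment with optionally recurring <gloss> elements.
ARCHIVE_XML = """<archive>
  <entry source="source1"><headword>form1</headword><gloss>water</gloss><gloss>river</gloss></entry>
  <entry source="source2"><headword>form2</headword><gloss>to go</gloss></entry>
</archive>"""


def export_rows(archive_root, separator="; "):
    """Produce one tab-delimited line per entry, joining repeated glosses."""
    rows = []
    for entry in archive_root.findall("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")]
        headword = entry.findtext("headword", default="")
        rows.append("\t".join([headword, separator.join(glosses)]))
    return "\n".join(rows)


exported = export_rows(ET.fromstring(ARCHIVE_XML))
```

The resulting text can be imported into a flat database table that expects one gloss field per record.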
A more complex problem: aligned interlinear text
  Important way of presenting data for linguists
  Various lines of annotation; different levels have different alignment patterns
The Bird, Bow & Hughes Model
  Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.
  A general data model for representing this type of information
  Four levels:
    –Text
    –Phrase
    –Word
    –Morpheme
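The four levels can be pictured as nested containers. The Python sketch below is only an illustration of that nesting, not the model as published: the class fields, the free-form `annotations` dictionaries and the English example are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Morpheme:
    form: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


def gloss_line(phrase):
    """Build a morpheme-by-morpheme gloss line, one token per word."""
    return " ".join(
        "-".join(m.annotations.get("gloss", "") for m in word.morphemes)
        for word in phrase.words
    )


# English stands in for the object language to avoid inventing data.
dogs = Word("dogs", [Morpheme("dog", {"gloss": "dog"}), Morpheme("-s", {"gloss": "PL"})])
bark = Word("bark", [Morpheme("bark", {"gloss": "bark"})])
phrase = Phrase([dogs, bark], {"translation": "Dogs bark."})
text = Text("example", [phrase])

line = gloss_line(phrase)
```

Because annotations attach at each level, a free translation sits on the phrase while glosses sit on morphemes, which is what gives the different levels their different alignment patterns.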
XML model for aligned text
Aligned text: Problems
  Various types of input:
    –Text strings with spaces and/or tabs (Shoebox)
    –Formatted text (e.g. Word tables)
    –Structured data (e.g. Spinoza database)
  Type of processing varies
    –Text strings need a lot of parsing
    –Structured data needs access to multiple tables
  Ideally, time alignment to AV source should also be included
What is gained
  Interoperability within the project
    –Data can be imported to the archive file from one format and exported to another format
  Interoperability outside the project
    –People who wish to share data with a group will define transformations from their data formats
    –A bottom-up approach to developing standards
  Improved data modeling
    –Encourages members of the project to revise their data formats
    –Gives us help in developing high-level models for linguistic data
Future work
  Processing aligned text formats
  Using schemas rather than DTDs: data validation
  Improved version control, especially checking for duplicate or conflicting records
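One possible shape for the duplicate and conflict check, sketched in Python. This is speculative, since the slides list it as future work: the record layout, the choice of `headword` as the matching key and the function name are all assumptions.

```python
from collections import defaultdict


def check_records(records, key="headword"):
    """Classify keys that occur more than once.

    A key is a duplicate if all its records are identical, and a
    conflict if at least one record differs from the first.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[key]].append(rec)
    duplicates, conflicts = [], []
    for value, group in grouped.items():
        if len(group) < 2:
            continue
        if all(rec == group[0] for rec in group[1:]):
            duplicates.append(value)
        else:
            conflicts.append(value)
    return duplicates, conflicts


# Placeholder records as they might arrive from two imports.
records = [
    {"headword": "form1", "gloss": "water"},
    {"headword": "form1", "gloss": "water"},
    {"headword": "form2", "gloss": "to go"},
    {"headword": "form2", "gloss": "to walk"},
    {"headword": "form3", "gloss": "house"},
]
duplicates, conflicts = check_records(records)
```

Duplicates could be dropped automatically, while conflicts would be flagged for a researcher to resolve.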
Some details
  This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
  Funding:
    –Hans Rausing Endangered Languages Project
    –Australian Research Council
    –Faculty of Arts, Monash University
  Contacts:
    –