Published by David Arnold. Modified over 9 years ago.
1
What’s needed for lexical databases? Experiences with Kirrkirr
Christopher Manning and Kristen Parton
Depts of Computer Science and Linguistics, Stanford University
http://www.sultry.arts.usyd.edu.au/kirrkirrr/
2
Overview
Background on the Kirrkirr project
What’s needed for dictionary databases
Kirrkirr data structure and data access
3
Background: Kirrkirr
A dictionary browser/visualization tool
In use with a dictionary of Warlpiri, an Indigenous Australian language (large for such a dictionary: 10 MB, with examples, cross-references, etc.)
The dictionary is maintained by linguists as text files, with a text editor, in an ad hoc format
We convert it automatically into validated XML (stack-based error-correcting Perl parser)
Kirrkirr software is written in Java (JDK 1.1, any platform) and uses an XML text file as its “database”
4
[Figure: Warlpiri, Warumungu, Alawa]
5
Kirrkirr: Objectives
Exploit the power of a computer interface in mediating between users and dictionary data
Present a dictionary in a way that is flexible, interactive, customizable, and fun
Do visualization: networks of words, domains, activities, dictionary reversal (Warlpiri-English, English-Warlpiri)
Suitable for diverse users with widely varying literacy levels: inter alia linguists, elementary school children, teachers, and native speakers
Aid linguistic science: for subtle linguistic judgments, one needs speaker involvement
6
Usability
We’ve been doing paper and electronic dictionary usability testing (Corris, Manning, Poetsch, and Simpson 1999, 2001)
10/6/00: Steve Patrick Jampijinpa, Jessie Patrick Nangala and Samara Napangardi
Steve started to look at it with the children, … taking them through the exercises in the dictionary worksheet, and getting them to do the typing and mousing. JP was keen to look up words, Samara, being younger, was more interested in flashing things and banging keys, but was also keen to be involved. They were keen to look up words which had pictures…. They were disappointed not to find puluku in the dictionary – Samara tried to look it up under cow as well. JP was a slow careful speller, and so could type in words she wanted to know without having them written in front of her. We used the rhyme sort to find rhymes. While rhyme is not a feature of Warlpiri songs, it is useful for teaching phonics. Steve asked whether the dictionary would be at the school, and was pleased to hear that when Carmel got some more RAM it would be.
7
Overview
Background on the Kirrkirr project
What’s needed for dictionary databases
Kirrkirr data structure and data access
8
The many aspects of databases
Three levels: a logical level specifying query semantics, sitting between the physical data level and external views of/interfaces to the data
Data model; data integrity and consistency
Query language
Concurrency control, transaction management, and data recovery
  We’re not doing this – like most XML work? (Abiteboul et al. 2000) – but some people need this
Storage and query optimization; indices
9
Choices for dictionary representation
A relational database (Nathan and Austin 1992, …)
  The flexible, hierarchical, ordered text structure of dictionaries makes this painful; retrieving dictionary entries may involve innumerable joins
A text file (“the document culture”)
  Common in practice. No data integrity, etc. But portable and tangible. Authors like it.
As semi-structured data
  Matches the variable, non-rigid, and extensible hierarchical structure found in dictionaries
10
But semi-structured data is a continuum…
From highly structured data that could easily be represented in a relational or OO database (but isn’t, for interchange or trendiness reasons)
To very unstructured text data, with occasional limited markup of basic structure
Linguistic databases tend to be at the unstructured end of the continuum
But (unfortunately for linguists) most work on semi-structured databases has focused on the quite structured end
  … with only very limited work aimed at text databases
11
Crucial observation for dictionary databases
In fairly unstructured databases, the contents of fields are also likely to be quite free-form
Desired querying is likely to involve flexible content-based queries
Current XML query language proposals don’t adequately support this style of usage
Even standard techniques for text, like word-based inverted file indices, often contain restrictions, such as allowing wildcards only at the end of words, which greatly limit their usefulness in text applications (e.g., PAT (Salminen and Tompa 1994) can’t search for ‘-isms’)
12
Ramifications for indexing
Pre-indexing is often not particularly useful or effective over text databases
Regular expressions are often more suitable
  Linguists often want to ask pattern questions (words with a high vowel after a velar)
  We can do “fuzzy spelling” correction without Soundex-style precomputation
  In Kirrkirr, we’re working on online morphological analysis, which is again usefully viewed as a finite-state transduction
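A pattern question such as "words with a high vowel after a velar" can be written directly as a regular expression over headwords. The following is a minimal sketch in Python, not Kirrkirr's own code: the orthographic classes (k and ng as velars, i and u as high vowels) and the sample word list are assumptions chosen for illustration.

```python
import re

# Assumed orthographic classes, for illustration only.
VELAR = r"(?:ng|k)"
HIGH_VOWEL = r"[iu]"
PATTERN = re.compile(VELAR + HIGH_VOWEL)

def matching_words(words):
    """Return the words that contain a velar immediately followed by a high vowel."""
    return [w for w in words if PATTERN.search(w)]

if __name__ == "__main__":
    sample = ["kurdu", "ngapa", "puluku", "wati"]  # illustrative sample list
    print(matching_words(sample))  # ['kurdu', 'puluku']
```

No index has to be built in advance: the pattern is simply run over the word list at query time.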
13
Indexing
Indexing is not particularly needed: you can grep 10 MB in 2–3 seconds on a standard PC (users are happy to wait)
XML indexing research has concentrated on the structured end of the problem:
  Regular expressions over path structures are not of much use for textbases
  We mainly need queries over textual content within XML entities
  There are no complex join conditions, just simple use of intersection or alternation
Realistic search needs do not add excessive combinatoric complexity: a linear search of the text is sufficient
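The kind of query described here, a linear scan of the text with patterns combined by intersection or alternation, might look roughly like the sketch below. The element names ENTRY, HW, and GL and the sample data are hypothetical; they are not taken from the Kirrkirr DTD.

```python
import re

# Hypothetical element name; entries are assumed not to nest.
ENTRY_RE = re.compile(r"<ENTRY>.*?</ENTRY>", re.DOTALL)

def search_entries(xml_text, patterns, mode="all"):
    """Scan every entry once; keep those matching all patterns
    (intersection) or, with mode="any", at least one (alternation)."""
    compiled = [re.compile(p) for p in patterns]
    combine = all if mode == "all" else any
    hits = []
    for m in ENTRY_RE.finditer(xml_text):
        entry = m.group(0)
        if combine(rx.search(entry) for rx in compiled):
            hits.append(entry)
    return hits

if __name__ == "__main__":
    data = (
        "<DICTIONARY>"
        "<ENTRY><HW>kurdu</HW><GL>child</GL></ENTRY>"
        "<ENTRY><HW>ngapa</HW><GL>water, rain</GL></ENTRY>"
        "</DICTIONARY>"
    )
    print(len(search_entries(data, [r"<HW>ng", r"rain"])))          # 1
    print(len(search_entries(data, [r"child", r"rain"], mode="any")))  # 2
```

The cost is one pass over the file per query, which matches the claim that a linear search is sufficient at this data size.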
14
Data models/schemas
Data consistency and correctness are vitally important
  Even if authors like text editors, they are a license to make errors and inconsistencies
  Every kind of validation available has been useful (DTD, id/idref-style constraints)
One dictionary data model doesn’t fit all
  E.g., the Warlpiri dictionary has an unusual organization via paradigm examples
  I feel that exploring mediators will be more profitable than complex standards
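An id/idref-style constraint amounts to checking that every cross-reference points at an entry that exists. A validating XML parser does this from the DTD; the sketch below does the same check by hand in Python, assuming hypothetical names (an `id` attribute on ENTRY and a `ref` attribute on XREF) that are not from the Kirrkirr schema.

```python
import xml.etree.ElementTree as ET

def dangling_refs(xml_text):
    """Return the targets of cross-references that match no entry id."""
    root = ET.fromstring(xml_text)
    ids = {entry.get("id") for entry in root.iter("ENTRY")}
    return [x.get("ref") for x in root.iter("XREF") if x.get("ref") not in ids]

if __name__ == "__main__":
    # Illustrative data; the cross-references are invented.
    sample = """<DICTIONARY>
      <ENTRY id="e1"><HW>kurdu</HW><XREF ref="e2"/></ENTRY>
      <ENTRY id="e2"><HW>ngapa</HW><XREF ref="e9"/></ENTRY>
    </DICTIONARY>"""
    print(dangling_refs(sample))  # ['e9']
```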
15
Overview
Background on the Kirrkirr project
What’s needed for dictionary databases
Kirrkirr data structure and data access
16
Data structures and data access in Kirrkirr
Data maintained by lexicographers in text files
  Backslash codes, but with end tags and nesting
Converted to XML via Perl parser
  Result is guaranteed to be valid XML (though the heuristic parser can make semantic errors)
  This has involved a lot of work and revealed many inconsistencies in the data. Painful!
Automatic data consistency and integrity maintenance is really useful, I’d argue!
But text gives freedom, ease of use, tangibility (UI issues win: cf. Excel vs. Access)
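The idea of a stack-based, error-correcting conversion can be sketched as follows. This is written in Python rather than Perl and is not the project's parser. The input notation (`\tag` to open, `\/tag` to close) is invented for the example, since the real ad hoc format is not shown here, and the output is only guaranteed to be well-formed, not valid against any DTD.

```python
import re
from xml.sax.saxutils import escape

# Invented notation: "\tag" opens an element, "\/tag" closes it.
TOKEN_RE = re.compile(r"\\(/?)([A-Za-z]\w*)")

def to_xml(text, root="DICTIONARY"):
    """Convert backslash-coded text to well-formed XML, repairing nesting."""
    out, stack, pos = ["<%s>" % root], [], 0
    for m in TOKEN_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            out.append(escape(chunk))
        pos = m.end()
        closing, tag = m.group(1) == "/", m.group(2).upper()
        if not closing:
            stack.append(tag)
            out.append("<%s>" % tag)
        elif tag in stack:
            # Error correction: close any inner elements left open.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == tag:
                    break
        # A stray end code with no matching open element is dropped.
    tail = text[pos:].strip()
    if tail:
        out.append(escape(tail))
    while stack:  # close whatever is still open at end of input
        out.append("</%s>" % stack.pop())
    out.append("</%s>" % root)
    return "".join(out)

if __name__ == "__main__":
    # The gloss element is never closed; the converter closes it.
    print(to_xml(r"\entry \hw kurdu \/hw \gl child \/entry"))
```

Because the repair is heuristic, an unclosed element can end up wrapping material that did not belong to it, which is the kind of semantic error the slide warns about.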
17
Indices/tables
Kirrkirr builds and stores on disk two custom indices/tables over the XML
  One indexes Warlpiri headwords to XML file positions, and holds a few extra bits of information (about pictures, subentry status, etc.) so the scroll list can be displayed quickly
  The other indexes English glosses to Warlpiri words
Maintained in memory at runtime (not that large; allows easy regexp-based fuzzy spelling matching)
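A headword index of this kind, plus regexp-based fuzzy matching over its keys, could be sketched as below. This is a Python illustration under stated assumptions, not Kirrkirr's Java code: the ENTRY and HW element names are hypothetical, the index omits the extra information bits, and the confusable-spelling classes are invented.

```python
import re

# Hypothetical element names.
ENTRY_START = re.compile(rb"<ENTRY>\s*<HW>(.*?)</HW>", re.DOTALL)
# Invented classes of spellings a user might mix up.
CONFUSABLE = [("rd", "rt", "d", "t"), ("k", "g"), ("p", "b")]

def build_headword_index(xml_bytes):
    """Map each headword to the byte offset where its entry starts."""
    index = {}
    for m in ENTRY_START.finditer(xml_bytes):
        index.setdefault(m.group(1).decode("utf-8"), m.start())
    return index

def fuzzy_regex(typed):
    """Turn typed text into a regex that accepts any confusable spelling."""
    alts = {s: "(?:" + "|".join(g) + ")" for g in CONFUSABLE for s in g}
    units = sorted(alts, key=len, reverse=True)  # try digraphs first
    splitter = re.compile("|".join(units) + "|.", re.DOTALL)
    parts = [alts.get(u, re.escape(u)) for u in splitter.findall(typed)]
    return re.compile("".join(parts))

def fuzzy_lookup(index, typed):
    """Return the indexed headwords that the fuzzy pattern fully matches."""
    rx = fuzzy_regex(typed)
    return [hw for hw in index if rx.fullmatch(hw)]

if __name__ == "__main__":
    data = (b"<DICTIONARY><ENTRY><HW>kurdu</HW></ENTRY>"
            b"<ENTRY><HW>ngapa</HW></ENTRY></DICTIONARY>")
    idx = build_headword_index(data)
    print(fuzzy_lookup(idx, "kudu"))  # ['kurdu']
```

Since the index is held in memory, the fuzzy pattern is just matched against every key; no Soundex-style codes have to be precomputed.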
18
Kirrkirr data access
[Diagram labels: Kirrkirr Dictionary Browser; English/Warlpiri dictionary interface; indices in memory (word, position, bits); grep (Jakarta-ORO); XML parser; XML Document Object Model; XML Warlpiri dictionary file]
Our “logical level” is Java code with hardwired methods for each query – though we have also experimented with XQL (for parts of it)
19
Data access
Scroll list display, simple lookups, and searches over headwords and glosses are done purely from in-memory indices
Getting cross-references for network display, semantic domains, pictures, HTML, etc. is done by using the index to jump into the XML file and then parsing it (with SAX, until the end of the entry)
Complex searches are done as an entity-sensitive regexp search over either the whole dictionary file or the entries that the search is restricted to (found via the headword index)
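Jumping into the file at an indexed position and parsing a single entry might look like the sketch below. For brevity it slices out the entry and parses it with ElementTree rather than stopping a SAX parser at the end of the entry, so it illustrates the access pattern, not Kirrkirr's actual code. The ENTRY, HW, and XREF names are hypothetical, and entries are assumed not to nest.

```python
import io
import xml.etree.ElementTree as ET

END_TAG = b"</ENTRY>"  # hypothetical element name

def read_entry(f, offset, chunk_size=4096):
    """Seek to an indexed byte offset and parse only the entry found there."""
    f.seek(offset)
    buf = b""
    while END_TAG not in buf:
        chunk = f.read(chunk_size)
        if not chunk:
            raise ValueError("entry is not terminated")
        buf += chunk
    end = buf.index(END_TAG) + len(END_TAG)
    return ET.fromstring(buf[:end])

if __name__ == "__main__":
    # Illustrative data; the cross-reference is invented.
    data = (b'<DICTIONARY><ENTRY><HW>kurdu</HW></ENTRY>'
            b'<ENTRY><HW>ngapa</HW><XREF ref="kurdu"/></ENTRY></DICTIONARY>')
    offset = data.index(b"<ENTRY><HW>ngapa")  # would come from the headword index
    entry = read_entry(io.BytesIO(data), offset)
    print(entry.findtext("HW"), [x.get("ref") for x in entry.iter("XREF")])
```

Only the bytes of one entry are read and parsed, so the cost of a lookup does not grow with the size of the dictionary file.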
20
Customizing Format with XSLT
XSLT stylesheets format dictionary entries in ways suited to the needs of different users
  E.g., simple formats for low-literacy users
The resulting HTML pages show typed cross-references in the dictionary as colored hyperlinks between different words
Since the XML is parsed at run-time, we can add extra information by “parameter passing” from the program to the XSLT
  E.g., file locations for pictures, search titles
21
English-Warlpiri Dictionary
Source dictionary is only Warlpiri-English, but a bidirectional dictionary is needed by users
An English index was built from glosses so that glosses link to equivalent Warlpiri entries
  Basis for English wordlist and fast search
  Multiword glosses are indexed under every word except stopwords, giving easy lookup
  One underlying dictionary: data consistency
The XML entries of all Warlpiri equivalents to an English word are merged and passed to an XSLT stylesheet, which produces merged HTML
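Building an English index from glosses, with multiword glosses indexed under each non-stopword, can be sketched as follows. The stopword list, the input shape (a mapping from headword to gloss strings), and the sample data, including the placeholder headword "hw3", are assumptions made for this Python illustration.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "to"}  # illustrative list

def build_english_index(glosses):
    """glosses: dict mapping a Warlpiri headword to its English gloss strings.
    Returns a dict mapping each English word to a sorted list of headwords."""
    index = defaultdict(set)
    for headword, gloss_list in glosses.items():
        for gloss in gloss_list:
            for word in re.findall(r"[a-z]+", gloss.lower()):
                if word not in STOPWORDS:
                    index[word].add(headword)
    return {word: sorted(hws) for word, hws in index.items()}

if __name__ == "__main__":
    sample = {
        "kurdu": ["child"],
        "ngapa": ["water", "rain"],
        "hw3": ["drop of water"],  # placeholder headword, not a real entry
    }
    idx = build_english_index(sample)
    print(idx["water"])  # ['hw3', 'ngapa']
    print("of" in idx)   # False
```

Because the index is derived from the glosses, the English side never has to be edited separately from the Warlpiri-English source.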
22
Warlpiri Morphological Parsing
Warlpiri is an agglutinating language:
  nyangulparnangku
  nya  -ngu   -lpa   =rna       =ngku
  see  -PAST  -IPFV  =1SG.SUBJ  =2SG.OBJ
  ‘I was looking at you.’
For lookup/linking, users or the program have to know the root/citation form
  This is difficult for people with limited literacy
We have been developing a morphological analyzer so we can look up any form, and link words in examples, etc. (finite-state methods)
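As a toy illustration of the finite-state view, the example word can be analyzed with a regular expression built from a root list and an ordered sequence of optional affix slots. This is not the analyzer under development: the lexicon contains only the morphemes from the example above, and the slot order is an assumption inferred from that single example.

```python
import re

ROOTS = {"nya": "see"}
# Slot order is assumed from the one example word, nothing more.
SLOTS = [
    {"ngu": "-PAST"},
    {"lpa": "-IPFV"},
    {"rna": "=1SG.SUBJ"},
    {"ngku": "=2SG.OBJ"},
]

def _alt(forms):
    """Build a capturing alternation, longest forms first."""
    ordered = sorted(map(re.escape, forms), key=len, reverse=True)
    return "(" + "|".join(ordered) + ")"

ANALYZER = re.compile(_alt(ROOTS) + "".join(_alt(s) + "?" for s in SLOTS))

def analyze(word):
    """Return (root, gloss line) for a recognized word, else None."""
    m = ANALYZER.fullmatch(word)
    if m is None:
        return None
    root, *affixes = m.groups()
    glosses = [ROOTS[root]]
    glosses += [slot[a] for slot, a in zip(SLOTS, affixes) if a]
    return root, " ".join(glosses)

if __name__ == "__main__":
    print(analyze("nyangulparnangku"))
    # ('nya', 'see -PAST -IPFV =1SG.SUBJ =2SG.OBJ')
```

The returned root is what a lookup would use as the citation form, so an inflected word typed by a user or found in an example sentence can still be linked to its entry.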
23
Conclusions
The data structuring and data integrity of a semi-structured database are great for dictionaries
A query language that supported textual content-based queries well would be great too
At present, though, we do not have many good options, and Kirrkirr gets by with limited ad hoc indices and text searches, done via a dictionary abstraction layer in the code
This hasn’t troubled us too much; UI issues have normally been much bigger challenges
24
Acknowledgements
Ken Hale, Mary Laughren, Robert Hoogenraad
Jane Simpson, David Nash
Nic Gambold, Kay Ross
Kevin Jansz, Nitin Indurkhya, Kevin Lim
Miriam Corris, Susan Poetsch
and many others….