Published by Alaina Horton. Modified over 6 years ago.
1
Kirrkirr: A Java-based visualisation tool for XML dictionaries of Australian Languages
Kevin Jansz, Department of Computer Science, University of Sydney, Australia
Christopher Manning, Computer Science and Linguistics, Stanford University, USA
Nitin Indurkhya, School of Applied Science, Nanyang Technological University, Singapore
2
Project Objectives
Providing innovative ways of representing a dictionary through creative use of the computer medium
Providing practical, educationally useful programs as a result (at low labor cost)
Examining the richness of lexical structure
Initial target: the Warlpiri dictionary
3
Talk Outline
The research agendas
Kirrkirr: A Warlpiri dictionary browser
The lexical database
  exploiting the strengths of XML
  indexing XML data
User interface and visualization
User studies
4
Research Program: Lexicon
A language is more than individual words with definitions: it is a vast network of associations between words, and within and across the concepts represented by words
The aim of this work is to give people a better understanding of this conceptual map
Traditional paper dictionaries offer very limited ways of making such networks visible
On a computer, one can imagine all sorts of ways of bringing out such relationships
5
Research: Computational Lexicography
Dictionaries on computers are now commonplace
But there has been little attempt to utilize the potential of the new medium
Goal: fun dictionary tools that are effective for language learning, browsing, and research
Special interest: dictionaries for minority languages. Here economic, motivational, and user-support reasons all point to an important role for computers
6
MRD Structure
The internal structures of current Machine Readable Dictionaries (MRDs) usually just mimic the structure of the printed form (Boguraev 1990)
Some work, notably WordNet (Miller 1995), has involved a fundamental rethinking of dictionary content and organization (in WordNet, organization is via “synsets”, which are related by links such as part, subkind, and opposite)
But there has been little software that makes such research truly usable by different communities of users
7
Initial focus Kirrkirr: a Warlpiri browser
Warlpiri is an Australian Aboriginal language spoken in the Tanami Desert (north-west of Alice Springs)
Rich lexical materials have been collected by linguists over decades (Ken Hale, MIT, from the 1950s), resulting in one of the most comprehensive lexical databases for any Australian language
There is a relatively large community of people interested in learning their traditional language
Until now, the results have not been produced in a format usable by the community (only raw printouts)
Kirrkirr aims to build a computer interface for browsing the Warlpiri dictionary
8
Educational goals
Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second-language learners) are not met. Our aim is to avoid this
Where literacy levels are low, an e-dictionary is potentially more useful than a paper edition, as it depends less on good knowledge of spelling and alphabetical order
Making it fun and easy to use, and providing multimedia content and the pronunciations of words, is a considerable help as well
9
Target user community
11
Kirrkirr: A Warlpiri dictionary browser
(Jansz 1998; Jansz, Manning and Indurkhya 1999)
An environment for the interactive exploration of dictionaries. Although our current work has been with Warlpiri only, the design is general (Arrernte coming soon!)
Attempts to make fuller use of graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information
Written in Java, it can either be run over the web [high bandwidth] or run locally (here Java’s main advantage is cross-platform support)
12
Specific goals
An interactive environment that encourages exploration: easy and fun to use
Reduced dependence on alphabetical order
Catering to the needs of different user groups (kids, teachers, professionals)
Flexible enough to display appropriate information in appropriate ways depending on user level
13
Overview
Kirrkirr provides various modules:
  Graph layout of word relationships
  Formatted dictionary entries
  Semantic domain browsing
  A notes facility for ‘jotting in the margin’
  Multimedia: audio, pictures
  Advanced searching interfaces
  Others in planning: formatting (XSL) editing, figuration patterns
These attempt to cater to users with different interests and competence levels
14
(Kirrkirr screen shot)
16
The lexical database
The original materials are stored in an ad hoc markup format that uses backslash codes, with some (rather odd) nesting of structural tags
These were converted to XML using an error-correcting, stack-based parser (written in Perl)
The inconsistency and flexibility of dictionary entries made this a surprisingly difficult task, but the parser tries to impose data integrity
Use of XML gives a clear structure to the data, and makes many (free) tools available
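To make the idea of an error-correcting, stack-based conversion concrete, here is a minimal Java sketch. It is not the project's Perl parser, and the input conventions are invented for illustration, since the slides do not give the real backslash codes: here "\name" is assumed to open an element and "\/name" to close it. The stack lets the converter close elements the input forgot to close and drop stray closing codes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BackslashToXml {
    /** Escapes the characters that are special in XML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Converts a toy backslash-coded entry to XML.
     * Hypothetical conventions: "\name" opens an element, "\/name" closes it.
     * Codes are used as element names without validation.
     */
    static String convert(String input) {
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        StringBuilder out = new StringBuilder();
        boolean lastWasText = false;
        for (String token : input.trim().split("\\s+")) {
            if (token.startsWith("\\/")) {
                String name = token.substring(2);
                lastWasText = false;
                if (!open.contains(name)) continue; // stray close code: drop it
                // Error correction: close any inner elements left open.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append(">");
                }
                out.append("</").append(open.pop()).append(">");
            } else if (token.startsWith("\\")) {
                String name = token.substring(1);
                lastWasText = false;
                if (name.isEmpty()) continue; // a lone backslash carries no code
                open.push(name);
                out.append("<").append(name).append(">");
            } else {
                if (lastWasText) out.append(' ');
                out.append(escape(token));
                lastWasText = true;
            }
        }
        // Error correction: close whatever is still open at the end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append(">");
        }
        return out.toString();
    }
}
```

For example, convert("\\entry \\hw word1 \\/hw \\def some gloss \\/entry") closes the unterminated def element before closing entry.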
17
XML
XML separates the structure of the data from its presentation
Much of the recent enthusiasm for XML has centered on representing simple and rigid structures such as database records
The rich, hierarchical, and variable structure of dictionary entries is really more what something like XML excels at!
The result remains a portable, tangible text file
18
Alternative: a standard database
The obvious choice for storing a lot of data
Has clear advantages: structure, indexing, query language, relationships, integrity
Many people have suggested using a database for lexical data, and some have actually done it (IITLEX, Austin and Nathan)
But in general lexicographers oppose the rigidity and, in practice, standard relational databases are quite ill-suited to dictionaries
19
Problems with using a Relational Database
Dictionary entries vary enormously in structure
Data is fragmented
Dictionaries are only loosely structured
The same element can appear at many levels (dialect, cross-reference, …)
The database model makes it hard to extend the dictionary structure
Portability is reduced
20
Alternative: Object Databases
A dictionary can be viewed as a set of entries (objects)
Object-oriented databases for storage
Problem: retrieval via customized query languages
Problem: off-the-shelf products not widely accepted
Proprietary storage formats reduce portability
ObjectStore, Versant, and Objectivity are the main big vendors
Restricted API places limits on extensibility
Generic object browsers not suitable for dictionaries
21
XML database
The Document Object Model is widely accepted
An XML document can be searched and accessed
XQL: a recent (and evolving) W3C proposal for querying XML documents
22
XQL - Potential
An alternative to investigate for the future is using a standard query language, such as XQL, to get material out of the XML dictionary, rather than using our ad hoc index
At the moment this is not a huge issue, since most retrieval is focused on components of a particular word
The XQL standard is not stable yet
Very preliminary implementations from vendors
23
Extracting information from an XML document
Build an index of its contents
The index records what is where in the XML document
It facilitates quick access to the contents
Two steps for extracting information: look up the index, then look up the XML document
A good index can considerably speed up the second lookup
24
XML indexing - challenges
Despite the various XML parsers available, surprisingly little consideration has been given to making single entries retrievable from the file
Present XML parsers tend to load the entire XML document (or its parsed tree form) into memory before data extraction begins
This is not practical when parsing sizeable XML databases (e.g., the Warlpiri dictionary is approx. 10 MB)
25
XML Indexing - solutions
The hierarchical structure of XML lends itself to indexing, as each entry in the XML file can be considered a separate entity
To make the Warlpiri dictionary usable for Kirrkirr, an ad hoc indexing system was developed
It uses a slightly modified Ælfred XML parser
Entries are indexed by headword in a separate index file
The system returns an XML document object containing the single dictionary entry, facilitating:
  processing for related words (graph layout)
  XSL processing to HTML
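The two-step lookup (index first, then the XML file) can be sketched in Java as follows. This is an illustration under stated assumptions, not Kirrkirr's code: the element names ENTRY and HW are hypothetical, the index is built in memory by a one-off scan rather than stored in a separate index file, and the JDK's built-in DOM parser stands in for the modified Ælfred parser. Only the bytes of the requested entry are read and parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class HeadwordIndex {
    // Hypothetical element names; the real dictionary's tags are not given in the slides.
    private static final String OPEN = "<ENTRY>", CLOSE = "</ENTRY>";
    private final Path xmlFile;
    // headword -> {byte offset of the entry, length of the entry in bytes}
    private final Map<String, long[]> spans = new HashMap<>();

    /** Scans the file once and records where each entry starts and how long it is. */
    HeadwordIndex(Path xmlFile) throws IOException {
        this.xmlFile = xmlFile;
        // Decoding as ISO-8859-1 maps one byte to one char, so string
        // positions are byte offsets even if the file is UTF-8.
        String raw = new String(Files.readAllBytes(xmlFile), StandardCharsets.ISO_8859_1);
        int from = 0;
        while (true) {
            int start = raw.indexOf(OPEN, from);
            if (start < 0) break;
            int end = raw.indexOf(CLOSE, start);
            if (end < 0) break;
            end += CLOSE.length();
            int hwStart = raw.indexOf("<HW>", start);
            int hwEnd = raw.indexOf("</HW>", start);
            if (hwStart >= 0 && hwStart < hwEnd && hwEnd < end) {
                byte[] hwBytes = raw.substring(hwStart + 4, hwEnd)
                        .getBytes(StandardCharsets.ISO_8859_1);
                String headword = new String(hwBytes, StandardCharsets.UTF_8).trim();
                spans.put(headword, new long[] {start, end - start});
            }
            from = end;
        }
    }

    /** Step 1: consult the index. Step 2: read and parse only that entry. */
    Document lookup(String headword) throws Exception {
        long[] span = spans.get(headword);
        if (span == null) return null; // headword not in the dictionary
        byte[] buf = new byte[(int) span[1]];
        try (RandomAccessFile file = new RandomAccessFile(xmlFile.toFile(), "r")) {
            file.seek(span[0]);
            file.readFully(buf);
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(buf));
    }
}
```

The returned Document contains a single entry, so later processing (following related words, or applying a stylesheet) never has to hold the whole dictionary tree in memory.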
26
XML Indexing - solutions (2)
The XML indexing process considerably improves efficiency, as only requested entries are parsed, conserving time and bandwidth
Once whole entries are parsed, they are kept temporarily in a cache
Thus the system uses indexed XML as a middle ground between the structure and indexing of a relational database and the freedom and functionality of XML
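The caching of parsed entries could look like the Java sketch below. The slides say only that entries are kept temporarily; the least-recently-used eviction policy, the capacity parameter, and the class name EntryCache are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** A small least-recently-used cache of parsed entries, keyed by headword. */
class EntryCache<V> {
    private final Function<String, V> loader; // e.g. an index lookup plus parse
    private final Map<String, V> cache;
    int loads = 0; // number of times the loader actually ran

    EntryCache(int capacity, Function<String, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first,
        // so the "eldest" entry is the one to evict.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached entry, loading (and caching) it on a miss. */
    V get(String headword) {
        V value = cache.get(headword);
        if (value == null) {
            value = loader.apply(headword);
            loads++;
            cache.put(headword, value);
        }
        return value;
    }
}
```

With a browsing interface, users tend to move back and forth among a few related words, which is the access pattern where a small cache like this avoids most repeated parsing.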
27
Kirrkirr’s XML Index Process
(Diagram; surviving labels: Index in Memory, Kirrkirr, XML document object)
29
Visualization of dictionary information
For dictionaries with simple textual content behind them, little can be done beyond an on-line reflection of the printed page
But we want more than just definitions of words: we want to know their relationships to other words, and the patterning in these relationships
In a computational approach, the program can mediate between the lexical data and the user
The interface can select what information to present, and how to present it, in many different ways (according to the user’s preferences)
30
Previous work
Current systems present the search-dominated interface of classic Information Retrieval systems: you type a word in a search box
Results try to mimic, but are generally inferior to, the printed version of the dictionary
Good feature: rapid searching
But these systems do little to utilize the captivating qualities of computers: interactivity, user control, and adaptability (Brown 1985)
31
Previous work (2)
Current systems are only effective when the user has a clearly specified information need. Even here, we are ignoring the distinction between information gained and knowledge sought (Sharpe 1995)
They lack browsing, and chances for incidental or curiosity-driven learning
They lack the tangibility and situatedness of paper: ineffective for getting an idea of a collection
We wish to exploit the essence of hypertext, which is “click to explore” browsing
32
Previous work (3)
Little research work (in corpus linguistics, visualization, etc.) on dictionary visualization
WordNet built a rich network of relationships, which fundamentally departed from the paper dictionary tradition, and has been used in many computational projects
However, very little has been done in the way of interfaces that make these relationships visible and intelligible to users
Graphical representations seem particularly important given our target users
33
Graph-based visualization
There is a little previous work on graphical representations of dictionaries
For instance, the Visual Thesaurus by plumbdesign, derived from WordNet
But it is also a good demonstration of how chaotic and confusing graphical interfaces can become
34
Perils of visualization
35
Graph-based visualization
(Jansz 1998; Jansz, Manning and Indurkhya 1999)
Classic graph layout problem
Adapts work by Eades et al. (1998) and Huang et al. (1998) on visualization and navigation of WWW document linkages
Uses the spring algorithm. Its big advantage is that it is an iterative updating algorithm, and so gives easy interactivity: it wiggles and people can play with it
Clarity and simplicity of graph: the software maintains a set of focus nodes to prevent overcrowding
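The iterative character of a spring layout can be seen in a small Java sketch. This is a generic spring-embedder step, not the adaptation of Eades et al. and Huang et al. used in Kirrkirr: the force constants, the class name SpringLayout, and the choice to move each node directly by its net force are all assumptions for illustration. Every pair of nodes repels, every edge acts as a spring with a preferred length, and calling step() repeatedly (for instance once per animation frame) lets the picture settle while remaining responsive to dragging.

```java
/** A generic spring-embedder sketch: call step() repeatedly to relax the layout. */
class SpringLayout {
    // Illustrative constants, chosen only so that the example converges.
    static final double REST = 1.0;    // preferred edge length
    static final double SPRING = 0.1;  // spring stiffness along edges
    static final double REPEL = 0.05;  // strength of pairwise repulsion

    final double[] x, y;   // node positions, updated in place
    final int[][] edges;   // each edge is {fromIndex, toIndex}

    SpringLayout(double[] x, double[] y, int[][] edges) {
        this.x = x;
        this.y = y;
        this.edges = edges;
    }

    /** Performs one iteration: accumulate forces, then move every node. */
    void step() {
        int n = x.length;
        double[] fx = new double[n], fy = new double[n];
        // Every pair of nodes pushes apart with an inverse-square force.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = x[j] - x[i], dy = y[j] - y[i];
                double d = Math.max(Math.hypot(dx, dy), 1e-6);
                double f = REPEL / (d * d);
                fx[i] -= f * dx / d; fy[i] -= f * dy / d;
                fx[j] += f * dx / d; fy[j] += f * dy / d;
            }
        }
        // Each edge pulls its endpoints toward the rest length (or pushes if too close).
        for (int[] e : edges) {
            int a = e[0], b = e[1];
            double dx = x[b] - x[a], dy = y[b] - y[a];
            double d = Math.max(Math.hypot(dx, dy), 1e-6);
            double f = SPRING * (d - REST);
            fx[a] += f * dx / d; fy[a] += f * dy / d;
            fx[b] -= f * dx / d; fy[b] -= f * dy / d;
        }
        for (int i = 0; i < n; i++) {
            x[i] += fx[i];
            y[i] += fy[i];
        }
    }
}
```

Because each step is cheap and only nudges positions, a user can drag a node between steps and the rest of the graph follows, which is the "wiggling" interactivity described above.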
36
Educational advantages
Alphabetical order is important, but a web of words offers other effective opportunities for learning
A student can opportunistically explore words that are related in various ways
Important semantic relationships can be understood
37
Kirrkirr network display
38
Kirrkirr network display
39
Formatted dictionary entries
Are produced automatically from the XML by using XSL (via James Clark’s XT)
XSL allows easy modeling of some user preferences. Most trivially, one can leave out information such as part of speech or detailed definitions, which we do by providing several stylesheets to choose from
This is useful, as many users find information overload quite confusing and demotivating
Can produce a bilingual or monolingual dictionary
Opportunities for various output styles, and formats such as RTF or TeX for printing
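Choosing between stylesheets to show or hide detail can be sketched in Java as below. This uses the JDK's standard javax.xml.transform API rather than XT, and the element names (ENTRY, HW, POS, GLOSS) and both stylesheets are hypothetical. The same entry is formatted twice: the brief stylesheet omits the part of speech, the full one includes it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EntryFormatter {
    /** Wraps a template body in a minimal XSLT 1.0 stylesheet producing HTML. */
    private static String sheet(String body) {
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='html' indent='no'/>"
             + "<xsl:template match='/ENTRY'><p>" + body + "</p></xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Hypothetical stylesheets: BRIEF leaves out the part of speech, FULL shows it.
    static final String BRIEF =
        sheet("<b><xsl:value-of select='HW'/></b>: <xsl:value-of select='GLOSS'/>");
    static final String FULL =
        sheet("<b><xsl:value-of select='HW'/></b> (<xsl:value-of select='POS'/>): "
            + "<xsl:value-of select='GLOSS'/>");

    /** Applies the chosen stylesheet to a single entry and returns the HTML. */
    static String format(String entryXml, String stylesheet) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(entryXml)),
                              new StreamResult(out));
        return out.toString();
    }
}
```

Switching the user's level then amounts to passing a different stylesheet to format(); the entry data itself is untouched.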
40
Formatted dictionary entries
41
Rich typology of link types
The semantically rich types of linkage present in a dictionary (synonym, antonym, hyponym, subheadword, variant, coverbs, …) solve one of the major problems of the web: here we have many link types, each with a clear semantic interpretation
Use consistent color-coded text and edges to show these link types
Gives a richer browsing experience
Unlike with plain HTML links, you can tell where you are going before clicking
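One way to keep the color coding consistent between hypertext and graph edges is to attach the color to the link type itself, as in this Java sketch. The particular colors, the subset of link types, and the render method are illustrative assumptions, not Kirrkirr's actual scheme.

```java
import java.util.Locale;

class LinkStyle {
    /** A few of the link types named above, each with a hypothetical color. */
    enum LinkType {
        SYNONYM("#1f77b4"), ANTONYM("#d62728"), HYPONYM("#2ca02c"), VARIANT("#9467bd");

        final String color;

        LinkType(String color) {
            this.color = color;
        }
    }

    /**
     * Renders a cross-reference as an HTML link whose color and tooltip
     * reveal the relationship before the user clicks.
     * The target is assumed to be a plain word needing no escaping.
     */
    static String render(LinkType type, String target) {
        String label = type.name().toLowerCase(Locale.ROOT);
        return "<a href=\"#" + target + "\" style=\"color:" + type.color
             + "\" title=\"" + label + "\">" + target + "</a>";
    }
}
```

A graph view can read the same color field when drawing an edge, so a given relation looks the same in both displays.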
42
Browsing
Work at PARC and elsewhere (Pirolli et al. 1996) has stressed the role of browsing as well as searching in information access
It provides a context for learning
We provide browsing in several ways:
  conventional hypertext, but with rich semantically interpreted links
  their color-coding matches the network edges
  network-based display of words
Other methods being investigated:
  browsing through semantic domains
  deriving terminology sets (words that are used together in culturally important activities) automatically from text corpora
43
Other components
Multimedia (currently pictures and audio)
  Can hear pronunciations / see objects
  I’m keen to put in videos of Warlpiri sign language …
Advanced search page
  search various fields, regular expressions, etc.
Notes: one can annotate dictionary entries (to correct or personalize)
45
User study
Mim Corris (Yuendumu, Willowra)
Jane Simpson (Lajamanu)
User testing with primary and (lower) secondary students
Observation of trainee Warlpiri literacy workers
Comments from teachers, other adults, etc.
Purely qualitative observational study of dictionary use. (Doing anything much else would be difficult.)
Initial reactions are very enthusiastic
Could be used as a basis for classroom activities (better with some further development: games and puzzles)
46
A positive anecdote “One of the introductory Warlpiri literacy students, who had not been very interested in the literacy class, spent nearly 3/4 hour looking at Kirrkirr apparently in absorbed concentration. She wasn’t especially interested in the sound and picture possibilities. She moved between words, scrolling along the list, typing in the search, clicking on the words in the network pane. She wasn’t even put off when the dictionary definitions stopped appearing – looking at the networks of words instead. This is quite unlike her attitude to the backslash coded electronic dictionary (where she lost interest quickly because of the difficulty for her of narrowing down searches). After the Kirrkirr demo she asked if she could have a printed dictionary to take away with her to use in camp to learn the words. I interpret this as a desire to learn words in her own time and place.”
47
Conclusions
Kirrkirr is just a prototype of what one can do to develop new ways to visualize lexicons
We have addressed the challenge of making dictionary information usable by creating an application that mediates between well-structured data and users’ needs for searching, browsing, and presentation
While we have focused our research on Warlpiri, the system can be easily applied to other languages
48
Conclusions (cont.)
“... The best future applications of MRDs in education will be those most able to respond to the insights and needs of their users” (Kegl 1995)
Kirrkirr can be seen as a step towards the future of e-dictionaries