What’s needed for lexical databases? Experiences with Kirrkirr Christopher Manning and Kristen Parton Depts of Computer Science and Linguistics Stanford.


What’s needed for lexical databases? Experiences with Kirrkirr Christopher Manning and Kristen Parton Depts of Computer Science and Linguistics Stanford University

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

Background: Kirrkirr
- A dictionary browser/visualization tool
- In use with a dictionary of Warlpiri, an Indigenous Australian language (large for such a dictionary, at 10 MB, with examples, cross-references, etc.)
- Dictionary is maintained by linguists as text files, with a text editor, in an ad hoc format
- We convert it automatically into validated XML (stack-based error-correcting Perl parser)
- Kirrkirr software is written in Java (JDK 1.1, any platform) and uses an XML text file “database”

Warlpiri Warumungu Alawa

Kirrkirr: Objectives
- Exploit the power of a computer interface in mediating between users and dictionary data
- Present a dictionary in a way which is flexible, interactive, customizable, and fun
- Do visualization: networks of words, domains, activities, dictionary reversal (W-E → E-W)
- Suitable for diverse users, with widely varying literacy levels: inter alia linguists, elementary school children, teachers, and native speakers
- Aid linguistic science: for subtle linguistic judgments, one needs speaker involvement

Usability
- We’ve been doing paper and electronic dictionary usability testing (Corris, Manning, Poetsch, and Simpson 1999, 2001)
- 10/6/00: Steve Patrick Jampijinpa, Jessie Patrick Nangala and Samara Napangardi

Steve started to look at it with the children, … taking them through the exercises in the dictionary worksheet, and getting them to do the typing and mousing. JP was keen to look up words, Samara, being younger, was more interested in flashing things and banging keys, but was also keen to be involved. They were keen to look up words which had pictures…. They were disappointed not to find puluku in the dictionary – Samara tried to look it up under cow as well. JP was a slow careful speller, and so could type in words she wanted to know without having them written in front of her. We used the rhyme sort to find rhymes. While rhyme is not a feature of Warlpiri songs, it is useful for teaching phonics. Steve asked whether the dictionary would be at the school, and was pleased to hear that when Carmel got some more RAM it would be.

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

The many aspects of databases
- Three levels: a logical level specifying query semantics between physical data level and external views of/interfaces to the data
- Data model; data integrity and consistency
- Query language
- Concurrency control, transaction management, and data recovery
  - We’re not doing this – like most XML work? (Abiteboul et al. 2000) – but some people need this
- Storage and query optimization; indices

Choices for dictionary representation
- A relational database (Nathan and Austin 1992, …)
  - The flexible, hierarchical, ordered text structure of dictionaries means that this is painful to do; retrieving dictionary entries may involve innumerable joins
- A text file (“the document culture”)
  - Common in practice. No data integrity, etc.
  - But portable and tangible. Authors like it.
- As semi-structured data
  - Matches variable, non-rigid, and extensible hierarchical structure found in dictionaries

But semi-structured data is a continuum…
- From highly structured data that could easily be represented in a relational or OO database (but isn’t for interchange or trendiness reasons)
- To very unstructured text data, with occasional limited markup of basic structure
- Linguistic databases tend to be at the unstructured end of the continuum
- But (unfortunately for linguists) most work on semi-structured databases has focused on the quite structured end
- … with only very limited work aimed at text databases

Crucial observation for dictionary databases
- In fairly unstructured databases, the contents of fields are also likely to be quite free-form
- Desired querying is likely to involve flexible content-based queries
- Current XML query language proposals don’t adequately support this style of usage
  - Even standard techniques for text, like word-based inverted file indices, often contain restrictions, such as allowing wildcards only at the end of words, which greatly limit their usefulness in text applications (e.g., PAT (Salminen and Tompa 1994) can’t search for ‘-isms’)

Ramifications for indexing
- Pre-indexing is often not particularly useful or effective over text databases
- Regular expressions are often more suitable
  - Linguists often want to ask pattern questions (words with a high vowel after a velar)
  - We can do “fuzzy spelling” correction without Soundex-style precomputation
  - In Kirrkirr, we’re working on doing online morphological analysis, which is again usefully viewed as a finite-state transduction
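To make the regular-expression point concrete, here is a minimal sketch in Python (Kirrkirr itself is Java). The word list reuses only strings that appear in these slides. The character classes are assumptions for illustration: velars are taken to be written `k` and `ng`, high vowels `i` and `u`, and the "fuzzy" equivalences (`p`/`b`, `k`/`g`, `t`/`d`, single or doubled `r`) are hypothetical, not Kirrkirr's actual rules.

```python
import re

# Toy headword list; a real run would use the full headword index.
HEADWORDS = ["kirrkirr", "nyangulparnangku", "puluku", "warlpiri"]

def search(pattern, words=HEADWORDS):
    """Return every word that the regular expression matches anywhere."""
    rx = re.compile(pattern)
    return [w for w in words if rx.search(w)]

# A linguist's pattern question: a velar followed by a high vowel
# (assumed spellings: velars k/ng, high vowels i/u).
VELAR_THEN_HIGH_VOWEL = r"(?:ng|k)[iu]"

# Hypothetical spelling confusions, expanded on the fly at query time.
FUZZY_CLASSES = {
    "p": "[pb]", "b": "[pb]",
    "k": "[kg]", "g": "[kg]",
    "t": "[td]", "d": "[td]",
    "r": "r+",
}

def fuzzy_pattern(typed):
    """Turn what the user typed into a tolerant whole-word regex."""
    collapsed = re.sub(r"r+", "r", typed.lower())
    body = "".join(FUZZY_CLASSES.get(ch, re.escape(ch)) for ch in collapsed)
    return "^" + body + "$"
```

Because the pattern is built at query time, nothing like a Soundex table has to be precomputed, and a suffix query such as `search(r"ku$")` needs no special index support. For example, `search(fuzzy_pattern("buluku"))` finds `puluku`.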

Indexing
- Indexing is not particularly needed: you can grep 10 MB in 2–3 seconds on a standard PC (users are happy to wait)
- XML indexing research has concentrated on the structured end of the problem:
  - Regular expressions over path structures are not of much use for textbases
  - We mainly need queries over textual content within XML entities
- There are no complex join conditions, just simple use of intersection or alternation
- Realistic search needs do not add excessive combinatoric complexity: a linear search of the text is sufficient
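The kind of search described above can be sketched as a single linear pass over the text. This is an illustration under stated assumptions, not Kirrkirr's code: the element names `ENTRY`, `HW`, `GL` and `EX` and the placeholder headwords `hw1` to `hw3` are invented for the example.

```python
import re

# Invented sample data; element names are hypothetical.
DICTIONARY = """<DICTIONARY>
<ENTRY><HW>hw1</HW><GL>cow</GL><EX>the cow is big</EX></ENTRY>
<ENTRY><HW>hw2</HW><GL>dog</GL><EX>the dog saw the cow</EX></ENTRY>
<ENTRY><HW>hw3</HW><GL>big</GL></ENTRY>
</DICTIONARY>
"""

ENTRY_RX = re.compile(r"<ENTRY>.*?</ENTRY>", re.DOTALL)

def field_matches(entry, tag, pattern):
    """True if any <tag> element inside the entry has text matching pattern."""
    for m in re.finditer(rf"<{tag}>(.*?)</{tag}>", entry, re.DOTALL):
        if re.search(pattern, m.group(1)):
            return True
    return False

def scan(text, conditions, combine=all):
    """Linear scan: keep headwords of entries satisfying the conditions.

    conditions is a list of (tag, regex) pairs; combine=all gives
    intersection, combine=any gives alternation.
    """
    hits = []
    for m in ENTRY_RX.finditer(text):
        entry = m.group(0)
        if combine(field_matches(entry, tag, pat) for tag, pat in conditions):
            hits.append(re.search(r"<HW>(.*?)</HW>", entry).group(1))
    return hits
```

For instance, `scan(DICTIONARY, [("GL", r"dog"), ("EX", r"\bcow\b")])` asks for entries glossed "dog" whose example mentions "cow", and passing `combine=any` turns the same list into an either/or query. No joins are involved.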

Data models/schemas
- Data consistency and correctness are vitally important
  - Even if authors like text editors, it’s a license to make errors and inconsistencies
  - Every kind of validation available has been useful (DTD, id/idref-style constraints)
- One dictionary data model doesn’t fit all
  - E.g., Warlpiri dictionary has unusual organization via paradigm examples
  - I feel that exploring mediators will be more profitable than complex standards
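As an illustration of what an id/idref-style constraint buys, the sketch below checks by hand that entry identifiers are unique and that every cross-reference points at an existing entry. A validating XML parser with a DTD would enforce this automatically; Python's standard library does not validate against DTDs, so this is a stand-in. The names `ENTRY`, `XREF`, `id` and `target` are assumptions.

```python
import xml.etree.ElementTree as ET

def check_references(xml_text):
    """Report duplicate entry ids and cross-references with no target."""
    root = ET.fromstring(xml_text)
    ids = set()
    problems = []
    for entry in root.iter("ENTRY"):
        entry_id = entry.get("id")
        if entry_id in ids:
            problems.append(f"duplicate id: {entry_id}")
        ids.add(entry_id)
    for xref in root.iter("XREF"):
        target = xref.get("target")
        if target not in ids:
            problems.append(f"dangling reference: {target}")
    return problems

# Invented sample: the second entry points at an id that does not exist.
SAMPLE = """<DICTIONARY>
<ENTRY id="e1"><HW>hw1</HW><XREF target="e2"/></ENTRY>
<ENTRY id="e2"><HW>hw2</HW><XREF target="e9"/></ENTRY>
</DICTIONARY>"""
```

Running `check_references(SAMPLE)` reports the one dangling reference, which is the sort of inconsistency that hand-edited text files accumulate unnoticed.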

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

Data structures and data access in Kirrkirr
- Data maintained by lexicographers in text files
- Backslash codes, but with end tags, nesting
- Converted to XML via Perl parser
  - Result is guaranteed to be valid XML (though heuristic parser can make semantic errors)
  - This has involved a lot of work and revealed many inconsistencies in the data. Painful!
  - Automatic data consistency and integrity maintenance is really useful, I’d argue!
  - But text gives freedom, ease-of-use, tangibility (UI issues win: cf. Excel vs. Access)
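The idea of a stack-based, error-correcting conversion can be sketched as follows. This is not the project's Perl parser and not the real source format: the convention used here (`\tag` opens, `\-tag` closes, a wrapping `dictionary` root) is hypothetical, and the sketch only guarantees well-formed output, not validity against a DTD.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical convention: "\tag" opens an element, "\-tag" closes it.
TOKEN = re.compile(r"\\(-?)([A-Za-z]\w*)")

def to_xml(source):
    """Convert backslash-coded text to well-formed XML, repairing nesting."""
    out = ["<dictionary>"]
    stack = []  # names of currently open elements
    pos = 0
    for m in TOKEN.finditer(source):
        text = source[pos:m.start()].strip()
        if text:
            out.append(escape(text))
        pos = m.end()
        closing, tag = m.group(1) == "-", m.group(2)
        if not closing:
            stack.append(tag)
            out.append(f"<{tag}>")
        elif tag in stack:
            # Error correction: close anything left open inside this element.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == tag:
                    break
        # A close code with no matching open element is silently dropped.
    tail = source[pos:].strip()
    if tail:
        out.append(escape(tail))
    while stack:
        out.append(f"</{stack.pop()}>")
    out.append("</dictionary>")
    return "".join(out)
```

The stack is what lets the converter recover from a forgotten end tag, but it recovers by guessing: in `to_xml(r"\entry \hw word \gl a < b \-entry")` the unclosed `hw` ends up containing `gl`, which is well-formed yet probably not what the author meant. That is the "semantic errors" caveat above.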

Indices/tables
- Kirrkirr builds and stores on disk two custom indices/tables over the XML
  - One indexes Warlpiri headwords to XML file positions, and holds a few extra bits of info (about pictures, subentry status, etc.) so the scroll list can be displayed quickly
  - The other indexes English glosses to Warlpiri words
- Maintained in memory at runtime
  - (not that large, allows easy regexp-based fuzzy spelling matching)
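A headword index of this shape might be built roughly as below. This is a sketch, not Kirrkirr's on-disk format: the element names `ENTRY`, `HW` and `IMAGE` are assumptions, and the only "extra bit" recorded is whether the entry has a picture.

```python
import re

ENTRY_RX = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HW_RX = re.compile(rb"<HW>(.*?)</HW>")

def build_headword_index(data: bytes):
    """Map each headword to (byte offset of its entry, has_picture flag)."""
    index = {}
    for m in ENTRY_RX.finditer(data):
        hw = HW_RX.search(m.group(0))
        if hw:
            has_picture = b"<IMAGE" in m.group(0)
            index[hw.group(1).decode("utf-8")] = (m.start(), has_picture)
    return index

# Invented sample file contents.
DATA = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>hw1</HW><GL>one</GL></ENTRY>\n"
    b"<ENTRY><HW>hw2</HW><IMAGE file='x.jpg'/></ENTRY>\n"
    b"</DICTIONARY>\n"
)
```

The scan works on bytes rather than decoded text so that the recorded positions can be used directly with a file seek. Sorting the keys of `build_headword_index(DATA)` gives the scroll list without touching the XML again.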

Kirrkirr data access
[Diagram labels: Kirrkirr Dictionary Browser; dictionary interface; indices in memory (English, Warlpiri: word, position, bits); XML parser; XML Document Object Model; grep (Jakarta-ORO); XML Warlpiri dictionary file]
Our “logical level” is Java code with hardwired methods for each query – though we have also experimented with XQL (for parts of it)

Data access
- Scroll list display, simple lookups and searches over headwords and glosses are done purely from in-memory indices
- Getting cross-references for network display, semantic domains, pictures, HTML, etc. is done by using the index to jump into the XML file, and then parsing it (with SAX until end of entry)
- Complex searches are done as entity-sensitive regexp search over either the whole dictionary file, or the entries that the search is restricted to (found via the headword index)
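The "jump in, then parse until the end of the entry" step can be sketched like this. Kirrkirr uses a SAX parser in Java; the sketch swaps in Python's `xml.etree.ElementTree.XMLPullParser` as an incremental stand-in. It assumes a hypothetical layout in which each `ENTRY` starts on its own line, and the element and attribute names are invented.

```python
import io
import xml.etree.ElementTree as ET

def read_entry(stream, offset):
    """Seek to a known entry offset and parse only that one entry."""
    stream.seek(offset)
    parser = ET.XMLPullParser(events=("end",))
    for line in stream:
        parser.feed(line)
        for _event, elem in parser.read_events():
            if elem.tag == "ENTRY":
                return elem  # stop as soon as the entry is complete
    raise ValueError("no complete entry found at this offset")

# Invented sample file contents.
DATA = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>hw1</HW><GL>one</GL></ENTRY>\n"
    b"<ENTRY><HW>hw2</HW>\n<XREF target='hw1'/></ENTRY>\n"
    b"<ENTRY><HW>hw3</HW></ENTRY>\n"
    b"</DICTIONARY>\n"
)
```

Given an offset from a headword index, for example `DATA.index(b"<ENTRY><HW>hw2")`, the call `read_entry(io.BytesIO(DATA), offset)` returns just that entry, so the cost of a lookup does not depend on the size of the dictionary file.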

Customizing Format with XSLT
- XSLT stylesheets format dictionary entries in ways suited to the needs of different users
  - E.g., simple formats for low literacy users
- The resulting HTML pages show typed cross-references in the dictionary as colored hyperlinks between different words
- Since the XML is parsed at run-time, we can add extra information by “parameter passing” from the program to the XSLT
  - E.g. file locations for pictures, search titles

English-Warlpiri Dictionary
- Source dictionary is only Warlpiri-English, but a bidirectional dictionary is needed by users
- An English index was built from glosses so that glosses link to equivalent Warlpiri entries
- Basis for English wordlist and fast search
  - Multiword glosses are indexed under every word except stopwords, giving easy lookup
- One underlying dictionary: data consistency
- The XML entries of all Warlpiri equivalents to an English word are merged, and passed to an XSLT stylesheet which produces the merged HTML
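Building the reverse index from glosses could look roughly like this. The stopword list, the tokenisation and the placeholder headwords are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "at", "of", "the", "to"}

def build_english_index(glosses):
    """Map each non-stopword in any gloss to the headwords it glosses.

    glosses maps a headword to its list of English gloss strings.
    """
    index = defaultdict(set)
    for headword, gloss_list in glosses.items():
        for gloss in gloss_list:
            for word in re.findall(r"[a-z]+", gloss.lower()):
                if word not in STOPWORDS:
                    index[word].add(headword)
    return {word: sorted(hws) for word, hws in index.items()}

# Placeholder headwords standing in for Warlpiri entries.
GLOSSES = {"hw1": ["to see", "look at"], "hw2": ["a look"]}
```

Here `build_english_index(GLOSSES)` files `hw1` under both `see` and `look`, and `hw2` under `look`. Because the index is derived from the one underlying dictionary, the English side cannot drift out of step with the Warlpiri entries.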

Warlpiri Morphological Parsing
- Warlpiri is an agglutinating language:
    nyangulparnangku
    nya  -ngu   -lpa   =rna       =ngku
    see  -PAST  -IPFV  =1SG.SUBJ  =2SG.OBJ
    ‘I was looking at you.’
- For lookup/linking, users or the program have to know the root/citation form
- This is difficult for people with limited literacy
- We have been developing a morphological analyzer so we can look up any form, and link words in examples, etc. (Finite state methods)
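To show what lookup by morphological analysis involves, here is a toy segmenter. It is a brute-force backtracking stand-in, not the finite-state analyzer the project is developing, and its lexicon contains only the five morphemes from the example above. Real Warlpiri morphology (ordering constraints, allomorphy, the suffix/clitic distinction) is deliberately ignored.

```python
# Toy lexicon taken only from the glossed example on this slide.
ROOTS = {"nya": "see"}
AFFIXES = {"ngu": "PAST", "lpa": "IPFV", "rna": "1SG.SUBJ", "ngku": "2SG.OBJ"}

def segment_affixes(rest):
    """Split the remainder into known affixes, or return None if impossible."""
    if not rest:
        return []
    for form, gloss in AFFIXES.items():
        if rest.startswith(form):
            tail = segment_affixes(rest[len(form):])
            if tail is not None:
                return [(form, gloss)] + tail
    return None

def analyze(word):
    """Return [(morpheme, gloss), ...] starting with the root, or None."""
    for root, gloss in ROOTS.items():
        if word.startswith(root):
            affixes = segment_affixes(word[len(root):])
            if affixes is not None:
                return [(root, gloss)] + affixes
    return None
```

With this, `analyze("nyangulparnangku")` recovers the root `nya`, which is what a lookup or a hyperlink from an example sentence needs, while a word outside the toy lexicon such as `puluku` yields `None`.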

Conclusions
- The data structuring and data integrity of a semi-structured database are great for dictionaries
- A query language that supported textual content-based queries well would be great too
- At present, though, we do not have many good options, and Kirrkirr gets by with limited ad hoc indices and text searches, done via a dictionary abstraction layer in the code
- This hasn’t troubled us too much; UI issues have normally been much bigger challenges

Acknowledgements
- Ken Hale, Mary Laughren, Robert Hoogenraad
- Jane Simpson, David Nash
- Nic Gambold, Kay Ross
- Kevin Jansz, Nitin Indurkhya, Kevin Lim
- Miriam Corris, Susan Poetsch
- and many others….