22 August 2003, CLEF 2003. The 2003 TIDES Surprise Language Exercise. Douglas W. Oard, University of Maryland.

Outline
Thinking out of the box
Some results
Lessons learned

Surprise Language Framework
Zero-resource start (treasure hunt)
Time constrained (10 or 29 days)
English users / documents in language X
Character-coded text
Research-oriented
Intensely collaborative (team-based)

Schedule
             Cebuano        Hindi
Announce:    Mar 5          Jun 1
Test Data:                  Jun 27
Stop Work:   Mar 14         Jun 30
Newsletter:  April          August
Talks:       May 30 (HLT)   Aug 5 (TIDES PI)
Papers:                     Aug 15 (TALIP)

16 Participating Teams
Cebuano and Hindi: ISI, Maryland, NYU, Johns Hopkins, Sheffield, LDC, CMU, UC Berkeley, MITRE
Hindi only: U Mass, Alias-i, BBN, IBM, CUNY, KAT, SPAWAR

Five evaluated tasks
  – Automatic CLIR (English queries)
  – Topic tracking (English examples, event-based)
  – Machine translation into English
  – English “headline” generation
  – Entity tagging (five MUC types)
Several useful components
  – POS tags, morphology, time expressions, parsing
Several demonstration systems
  – Interactive CLIR (two systems)
  – Cross-language QA (English Q, translated A)
  – Machine translation (+ translation elicitation)
  – Cross-document entity tracking

Hindi Participants
Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland
Areas: Resource Generation, Detection, Extraction, Summarization, Translation

[Diagram: Innovation Cycle and Coordination Strategy. Labels: Detection, Extraction, Summarization; Books, Web, People; Lexicons, Corpora; Time; Resource Harvesting; Systems; Research; Results; Capture Process Knowledge; Push, Organize, Talk]

The Synchronization Challenge

Cebuano MT Results [chart; labels: Bible, Cebuano book, Dict, Melamed, News]

Cebuano Interactive CLIR
Starting point: iCLEF 2002 system (German)
  – Interface: “synonyms”/examples (parallel)/MT
  – Back end: InQuery/Pirkola’s method
3-day porting effort
  – Cebuano indexing (no stemming)
  – One-best gloss translation (bilingual term list)
Informal evaluation
  – 2 Cebuano native speakers (at ISI)
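
The "one-best gloss translation" step on the slide above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: it assumes the bilingual term list maps each Cebuano token to an ordered list of English translations, and the function name `gloss_translate` and the toy entries are hypothetical.

```python
def gloss_translate(tokens, term_list):
    """Replace each token with its first listed translation.

    Tokens missing from the term list are passed through unchanged,
    which is one plausible choice for names and unknown words.
    """
    glossed = []
    for token in tokens:
        translations = term_list.get(token)
        glossed.append(translations[0] if translations else token)
    return glossed


# Hypothetical toy term list, for illustration only.
TERM_LIST = {
    "balay": ["house", "home"],
    "tubig": ["water"],
}

print(gloss_translate(["balay", "tubig", "xyz"], TERM_LIST))
```

Taking only the first translation keeps the gloss readable for a user skimming results, at the cost of discarding the alternatives that a structured query could still exploit on the retrieval side.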

Hindi syntax is generally very “regular”
Subject – Object – Verb is the preferred order
  – John saw Mary. = जॉन ने मेरी को देखा ।
Presence of (occasionally deleted) case markers often permits reordering
  – John saw Mary. = मेरी को जॉन ने देखा ।
English (or western) punctuation is pervasive in many modern texts
  – John said, “ I am here ” = जॉन ने कहा, “ मैं यहाँ हूँ ”
The subject may be omitted in some contexts
  – A: Where is John? B: [He] went home.
  – अ : जॉन कहाँ है ? ब : [ वह ] घर चला गया।

Hindi Encoding
Text encoding (for storage and transmission) and text rendering (for display and printing) are separated
Which syllable constituents get their own code points?
  – Several 8-bit encodings: after assigning a code point to each stand-alone vowel and full consonant, and to half-consonants and vowels within a syllable, spare code points are used for assorted/frequent CC clusters.
  – Unicode UTF-16: only stand-alone vowels, full consonants and vowels within syllables have their own code points. All half consonants are realized by a “full consonant + halant” sequence.
Choice of the “grammar” for syllable construction and rendering?
  – Several 8-bit encodings write the code points in display order, simplifying the rendering program
  – Unicode writes them in pronunciation order, making for a considerably more complex display program
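
The two Unicode points on the encoding slide can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only; the helper name `code_points` and the two sample strings are choices made here, not part of the original slides. Unicode names the halant "VIRAMA".

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) pairs in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# A conjunct is stored as full consonant + halant (virama) + full consonant.
conjunct = "\u0915\u094D\u0937"  # renders as one cluster: क्ष

# Vowel sign I is stored AFTER the consonant (pronunciation order),
# although it is drawn to the LEFT of the consonant on screen.
syllable = "\u0915\u093F"  # renders as: कि

for sample in (conjunct, syllable):
    for cp, name in code_points(sample):
        print(cp, name)
    print()

# Each Devanagari code point occupies three bytes in UTF-8.
print(len(conjunct.encode("utf-8")))
```

The stored order of the second sample is what makes rendering harder: the display program has to move the vowel sign in front of the consonant it follows in memory.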

Hindi Week 1: Porting
Monday
  – 2,973 BBC documents (UTF-8)
  – Batch CLIR (no stemming, 2/3 known items at rank 1)
Tuesday
  – MIRACLE (“ITRANS”, gloss)
  – Stemmer (implemented from a paper)
Wednesday
  – BBC CLIR collection (19 topics, known item)
Friday
  – Parallel text (Bible: 900k words, Web: 4k words)
  – Devanagari OCR system
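
A known-item evaluation like the one mentioned for Monday's batch run can be scored in a few lines. The sketch below is an assumption about how such a score might be computed, not the procedure the team used; the function name `known_item_scores`, the topic identifiers, and the document identifiers are all hypothetical.

```python
def known_item_scores(rankings, targets):
    """Score known-item retrieval.

    rankings: topic id -> ranked list of document ids
    targets:  topic id -> the single known relevant document id
    Returns (fraction of topics with the target at rank 1,
             mean reciprocal rank), counting a missing target as 0.
    """
    ranks = []
    for topic, target in targets.items():
        ranked = rankings.get(topic, [])
        ranks.append(ranked.index(target) + 1 if target in ranked else None)
    at_rank_one = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    return at_rank_one, mrr


# Hypothetical toy run: targets found at ranks 1, 1 and 3.
rankings = {
    "t1": ["d7", "d2", "d9"],
    "t2": ["d4", "d1", "d3"],
    "t3": ["d5", "d8", "d6"],
}
targets = {"t1": "d7", "t2": "d4", "t3": "d6"}

print(known_item_scores(rankings, targets))
```

Known-item topics are cheap to build because each needs only one relevant document, which suits a test collection assembled within days.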

Hindi Weeks 2/3/4: Exploration
N-grams (trigrams best for UTF-8)
Relative Average Term Frequency (Kwok)
Scanned bilingual dictionary (Oxford)
More topics for test collection (29)
Weighted structured queries (IBM lexicon)
Alternative stemmers (U Mass, Berkeley)
Blind relevance feedback
Transliteration
Noun phrase translation
MIRACLE integration (ISI MT, BBN headlines)
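
Character n-gram indexing, the first item on the slide above, can be sketched as below. The slide does not say whether the n-grams were taken over bytes or over characters; this example assumes overlapping n-grams over Unicode code points, and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the overlapping n-grams of a string, counted in code points."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("hindi"))

# The same function works on Devanagari; this word (Korea) has six code points.
korea = "\u0915\u094B\u0930\u093F\u092F\u093E"
print(len(char_ngrams(korea)))
```

N-grams need no stemmer or morphological analyzer, which is why they are a natural early baseline when a language arrives with no tools.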

Formative Evaluation

Transliteration
Importance: names, loan words
  – दक्षिण कोरिया (Dakshin Korea)
Pronunciation crosswalk English->Hindi
  – English pronunciation (Festival)
  – Overgenerate Hindi characters (hand-built rules)
      Doctor => d aa k t ax r OR d ao k t ax r
  – Rank n-best using bigrams (Hindi name list)
Treat as alternate translations for CLIR
  – Pirkola’s method
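
Pirkola's method, as it is usually described, treats all alternate translations of one query term as a single synonym set: the term frequency in a document is the sum over the alternates, and the document frequency counts documents containing any alternate. The sketch below illustrates only that counting step under those assumptions; the function name `synonym_set_stats` and the romanized toy tokens are hypothetical, and the resulting statistics would still need to be fed into a ranking function.

```python
def synonym_set_stats(alternates, docs):
    """Compute per-document tf and collection df for a synonym set.

    alternates: candidate translations (or transliterations) of one query term
    docs:       list of tokenized documents
    """
    alternate_set = set(alternates)
    tf = [sum(1 for token in doc if token in alternate_set) for doc in docs]
    df = sum(1 for count in tf if count > 0)
    return tf, df


# Hypothetical n-best transliterations and toy documents (romanized).
alternates = ["koriya", "koria"]
docs = [
    ["dakshin", "koriya", "koria"],
    ["bharat"],
    ["koria"],
]

print(synonym_set_stats(alternates, docs))
```

Pooling the counts this way keeps a term with many candidate spellings from dominating the query, since the set contributes once rather than once per alternate.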

Some Challenges
Formative evaluation
Synchronize variable-rate efforts
  – Soccer, not football
Integration
Capturing lessons learned
  – See the forest, not just the trees

For More Information
TIDES Newsletter
  – Cebuano: April
  – Hindi: August
Papers
  – NAACL/HLT short paper
  – MT Summit (late Sep)
  – ACM TALIP special issue
Demonstration systems
  – Contact individual sites