Mandarin-English Information (MEI), Johns Hopkins University Summer Workshop 2000. Presented at the TDT-3 Workshop, February 28, 2000. Helen Meng, The Chinese University of Hong Kong.

Presentation transcript:

Mandarin-English Information (MEI)
Johns Hopkins University Summer Workshop 2000
Presented at the TDT-3 Workshop, February 28, 2000
Helen Meng, The Chinese University of Hong Kong
Sanjeev Khudanpur, Johns Hopkins University
Douglas W. Oard, University of Maryland
Hsin-Min Wang, Academia Sinica, Taiwan

Outline
Background
The MEI Project
  – Multiscale Retrieval
  – Multiscale Translation
Using the TDT-3 collection
Schedule

Motivation
Emerging speech retrieval applications
  – E.g.,
Increasing need for translingual audio search
  – 1896 Internet-accessible radio & TV stations
  – 529 of these (28%) are not in English
source:

The Big Picture
[Diagram. Labels: Speech to Speech Translation; Translingual Audio Browsing; Translingual Audio Search; English Query; English Audio; Select; Examine; MEI]

Related Work
TREC Spoken Document Retrieval
  – Close coupling of recognition and retrieval
TREC Cross-Language Retrieval
  – Close coupling of translation and retrieval
TDT-3
  – Coupling recognition, translation and retrieval
  – Using baseline recognizer transcripts

The MEI Project
Closely coupling recognition and translation
  – For the purpose of retrieval
English text queries, Mandarin news audio
Specific research issues:
  – Multi-scale retrieval
  – Multi-scale translation

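Multi-scale Analysis of Mandarin
[Figure: decomposition of a Mandarin syllable at several scales. Scale labels: Initial/Final; Preme/Core Final; Preme/Toneme. Units shown: /iang/, /ji/, /j/, /ang/, /i/, /a/, /ng/, /j/]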

Multi-scale Retrieval
Subword-scale
  – Syllable lattice matching [Chen, Wang & Lee, 2000]
  – Overlapping syllable n-grams [Meng et al., 1999]
  – Skipped syllable pairs [Chen, Wang & Lee, 2000]
  – Syllable confusion matrix [Meng et al., 1999]
Word-scale
  – Structured queries [Pirkola, 1998]
Multi-scale
  – Unified retrieval using a merged feature set
  – Scale-optimized retrieval with result-set merging
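
To make two of the subword-scale indexing units concrete, here is a minimal Python sketch of overlapping syllable n-grams and skipped syllable pairs. The function names and the exact definition of a "skipped pair" (two syllables separated by a fixed number of intervening syllables) are assumptions made for illustration; the cited papers may define these units differently.

```python
from typing import List, Tuple


def syllable_ngrams(syllables: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every overlapping n-gram over a syllable sequence."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def skipped_pairs(syllables: List[str], skip: int = 1) -> List[Tuple[str, str]]:
    """Return pairs of syllables with exactly `skip` syllables between them.

    Assumed definition, for illustration only.
    """
    gap = skip + 1
    return [(syllables[i], syllables[i + gap]) for i in range(len(syllables) - gap)]


if __name__ == "__main__":
    # Syllable string taken from the Kosovo example in the slides.
    syllables = "ke1 sou3 wo4".split()
    print(syllable_ngrams(syllables, n=2))
    print(skipped_pairs(syllables, skip=1))
```

Indexing both kinds of unit gives a retrieval system some tolerance to a single misrecognized syllable, since a skipped pair can still match when the syllable in between is wrong.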

Why Multi-scale Retrieval?
Word-based retrieval exploits lexical knowledge
  – Enhances precision
Subword units achieve complete phonological coverage
  – Enhances recall
Combination of evidence may beat either alone
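
One simple way to combine word-scale and subword-scale evidence is to normalize each ranked list's scores and take a weighted sum. The sketch below uses min-max normalization with a linear combination, which is a generic fusion technique swapped in for illustration and not necessarily the merging method the MEI team used. The function names, the `word_weight` parameter, and the document IDs are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a list with identical scores maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_results(
    word_scores: Dict[str, float],
    subword_scores: Dict[str, float],
    word_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Merge two result sets by a weighted sum of normalized scores."""
    w = min_max_normalize(word_scores)
    s = min_max_normalize(subword_scores)
    merged = {
        doc: word_weight * w.get(doc, 0.0) + (1.0 - word_weight) * s.get(doc, 0.0)
        for doc in set(w) | set(s)
    }
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    word_run = {"story_a": 2.0, "story_b": 1.0}      # hypothetical scores
    subword_run = {"story_b": 5.0, "story_c": 3.0}   # hypothetical scores
    print(merge_results(word_run, subword_run))
```

A story found only by the subword run still enters the merged list, which is how the recall advantage of subword units can carry over to the combined ranking.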

Multi-scale Translation
Word-scale
  – Dictionary-based [Levow & Oard, 2000]
  – Parallel corpora [Nie, 1999]
  – Comparable corpora [Fung, 1998]
Subword-scale
  – Cross-language phonetic map [Knight & Graehl, 1997]
/bei2 ai4 er3 lan2/
Kosovo (/ke1-sou3-wo4/, /ke1-sou3-fo2/, /ke1-sou3-fu1/, /ke1-sou3-fu2/)
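
The Kosovo example shows why subword-scale matching helps: one English name surfaces as several Mandarin syllable strings that differ in a single syllable or only in tone. As a rough illustration, the sketch below strips tone digits and computes a syllable-level edit distance between variants. This is a generic approximate-matching technique chosen for illustration, not the cross-language phonetic map of Knight & Graehl, and the helper names are hypothetical.

```python
from typing import List


def to_base_syllables(pronunciation: str) -> List[str]:
    """Split a string like '/ke1-sou3-wo4/' into toneless syllables."""
    parts = pronunciation.strip("/").replace("-", " ").split()
    return [p.rstrip("12345") for p in parts]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance where each syllable counts as one symbol."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    variants = ["/ke1-sou3-wo4/", "/ke1-sou3-fo2/",
                "/ke1-sou3-fu1/", "/ke1-sou3-fu2/"]
    reference = to_base_syllables(variants[0])
    for v in variants[1:]:
        print(v, edit_distance(reference, to_base_syllables(v)))
```

All four variants end up within one syllable of each other, so a matcher with a small distance threshold would treat them as the same name.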

Using the TDT-3 Collection
English queries formed from topic descriptions
  – 2-4 words (simulated Web search)
  – Full topic description (simulated routing profile)
Mandarin broadcast news audio (121 hours)
  – Story-boundary-known condition (4624 stories)
  – Baseline recognizer transcripts provide words

Schedule
[Timeline figure. Axis labels: Dec, Feb, Apr, Jun, Aug. Milestone labels: Six Weeks: Summer Workshop Planning Meeting; First MEI Team Planning Meeting; Second MEI Team Planning Meeting]

Things We Need
Ideas
  – To sharpen our focus
Connections
  – To build a community of interest
Resources
  – To build on what others have done

Background: Chinese
Many dialects (e.g., Mandarin and Cantonese)
  – Differences in phonetics, vocabularies, syntax…
Syllable-based language
  – ~400 base syllables, 4 lexical tones + light tone
Syllable structure (CG)V(X)
  – (CG): onset, optional, consonant + medial glide
  – V: nuclear vowel
  – X: coda, glide / alveolar nasal / velar nasal
  – ~21 initials, 39 finals
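
The initial/final view of the syllable can be sketched in a few lines of Python. The code below splits a tone-numbered Pinyin syllable into initial, final, and tone, using the 21 standard Pinyin initials. The function name is hypothetical, and treating syllables that begin with y or w as having an empty initial is a simplifying assumption.

```python
from typing import Optional, Tuple

# The 21 Pinyin initials; two-letter initials come first so that
# "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable: str) -> Tuple[str, str, Optional[int]]:
    """Split e.g. 'xing2' into ('x', 'ing', 2).

    The tone is None when the syllable carries no trailing digit.
    """
    tone: Optional[int] = None
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    # No listed initial: the onset is empty (assumption for y-/w- spellings).
    return "", syllable, tone


if __name__ == "__main__":
    for s in ["xing2", "zhe4", "ai4", "chang2"]:
        print(s, split_syllable(s))
```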

Background: Chinese (cont.)
Characters (written) -> syllables (spoken)
Degenerate mapping
  – /hang2/, /hang4/, /heng2/ or /xing2/
  – /fu4 shu4/ (LDC's CALLHOME lexicon)
Tokenization / Segmentation
  – /zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/
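
The segmentation problem can be illustrated with forward maximum matching, a common baseline that greedily takes the longest lexicon entry at each position. The sketch below applies it to the syllable string from the slide. The toy lexicon is a made-up assumption for illustration (it is not drawn from the CALLHOME lexicon), and the slides do not say which segmenter the project used.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]


def forward_max_match(syllables: List[str], lexicon: Set[Word],
                      max_len: int = 4) -> List[Word]:
    """Greedy left-to-right segmentation, longest lexicon entry first.

    A single syllable is always accepted as a fallback word.
    """
    words: List[Word] = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


if __name__ == "__main__":
    # Hypothetical toy lexicon of syllable sequences.
    lexicon = {
        ("zhe4", "yi1"), ("yi1", "wan3"), ("wan3", "hui4"),
        ("ru2", "chang2"), ("ju3", "xing2"),
    }
    sentence = "zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2".split()
    print(forward_max_match(sentence, lexicon))
```

With this toy lexicon the greedy pass commits to ("zhe4", "yi1") and never considers ("yi1", "wan3"), which shows how a different lexicon or matching direction could yield a different tokenization of the same syllable string.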