Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval Johns Hopkins University Center of Language and Speech Processing Summer.

Presentation transcript:

Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
Johns Hopkins University, Center for Language and Speech Processing
Summer Workshop 2000
Progress Update
The MEI Team
August 2, 2000

Outline
Baseline (Pat, Gina, Wai-Kit)
Upper Bounds (Pat, Erika, Helen)
Climbing Upwards (Upcoming Research Problems)
– translation (Gina, Jian Qiang)
– word-subword fusion (Helen, Doug, Wai-Kit)
– named entities, numerals (Helen, Sanjeev, Wai-Kit, Karen)
– syllable lattice generation (Hsin-Min, Berlin)

The MEI Task
An example query (NYT, AP newswire)
– "A China Airlines A-310 jetliner returning from the Indonesian island of Bali with 197 passengers and crew crashed and burst into flames Monday night just short of Taipei's Chiang Kai-Shek Airport..."
– (full story used as query, typically words)
An example document (VOA)
– accompanied by raw anchor scripts

Our Baseline System
Query side
– Query -> Query Term Selection (1 to full document) -> Query Term Translation (dictionary-lookup) -> Translated, hexified Chinese query terms
Document side
– Audio documents -> Dragon Mandarin Speech Recognizer -> Tokenized, hexified Chinese word sequence
Both sides feed the InQuery Retrieval Engine
Evaluate retrieval outputs
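The query side of the baseline can be sketched roughly as follows. This is a minimal illustration, not the team's code: the frequency-based selection criterion, the toy dictionary `TOY_DICT`, the function names, and the GB2312 encoding used for "hexifying" are all assumptions introduced here.

```python
from collections import Counter

def select_terms(english_text, n_terms):
    """Pick the n_terms most frequent tokens (assumed selection criterion)."""
    tokens = [w.strip('.,;:!?"\'()').lower() for w in english_text.split()]
    counts = Counter(t for t in tokens if t)
    return [term for term, _ in counts.most_common(n_terms)]

def translate_terms(terms, bilingual_dict, max_alternatives=1):
    """Dictionary lookup; terms missing from the dictionary are dropped."""
    translated = []
    for term in terms:
        translated.extend(bilingual_dict.get(term, [])[:max_alternatives])
    return translated

def hexify(term, encoding="gb2312"):
    """Encode a Chinese term as an ASCII hex string (encoding is an assumption)."""
    return term.encode(encoding).hex()

# Hypothetical two-entry bilingual term list, for illustration only.
TOY_DICT = {"airport": ["机场", "航空站"], "crashed": ["坠毁"]}

query = "A jetliner crashed short of the airport; the airport closed."
terms = select_terms(query, n_terms=100)
hex_terms = [hexify(t) for t in translate_terms(terms, TOY_DICT)]
```

Hexifying gives the retrieval engine plain ASCII tokens, so the same transformation has to be applied to the recognizer output on the document side.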

Our First Retrieval Experiment...
Queries
– 17 exemplars
– 1 per topic in TDT2 corpus
Documents
– 2265 in all
– ~500 belong to at least 1 topic
– others are "off-topic" or "briefs"
– each topic has >=2 relevant documents

Our First Retrieval Experiment
No. of query terms selected = 100 (sweep)
No. of alternative translations per term = 1
Word-based retrieval
Average Precision = 16.91%
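The slides do not define "Average Precision". Assuming it is the usual non-interpolated average precision, averaged over queries, it could be computed as in this sketch (function names are illustrative):

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `d1, d2, d3, d4` with relevant documents `d1` and `d3`, the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is their mean.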

In Search of Upper Bounds...
Confounding factors on the query side
– term selection
– translation (no. of terms, definition of a term, named entities, dictionary / COTS system)
Confounding factors on the document side
– syllable recognition performance, OOV
– word tokenization
Confounding factors in retrieval
– word-based or subword-based (characters, syllables)
– subword n-grams (n=??)

Upper Bounds (Word)
Queries (ASR); Documents (ASR)
– isolates the confounding factors (term selection, translation, recognition performance, word tokenization)
– Ave Precision = 73.3%
Queries (Xinhua); Documents (ASR/TKN)
– isolates similar confounding factors
– resembles MEI TDT task (queries and documents come from different news sources)
– word tokenization (CETA / Dragon)
– Best Ave Precision = 53.5% (ASR), 58.7% (TKN)

Chinese Words and Subwords
Characters (written) -> syllables (spoken)
Degenerate mapping
– /hang2/, /hang4/, /heng2/ or /xing2/
– /fu4 shu4/ (LDC's CALLHOME lexicon)
Tokenization / Segmentation
– /zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/

Upper Bounds (Subword)
Queries (Xinhua); Documents (ASR/TKN)
– character-based retrieval
  – overlapping character n-grams (document, within-term for queries, bigrams fare best)
  – Best Ave Precision = 54.3% (ASR), 55.9% (TKN)
  – overlapping bigrams in queries
  – Best Ave Precision = 61.7% (cross-term overlap)
– syllable-based retrieval
  – word tokenization affects syllable lookup
  – syllable bigrams fare best
  – Best Ave Precision = 51.6% (ASR), 53.3% (TKN)
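The difference between within-term and cross-term overlapping n-grams can be illustrated with a short sketch. The function names are illustrative, and the segmentation of the syllable string into two-syllable terms is an assumption made only for this example.

```python
def overlapping_ngrams(units, n=2):
    """All overlapping n-grams over a sequence of characters or syllables."""
    return ["_".join(units[i:i + n]) for i in range(len(units) - n + 1)]

def within_term_ngrams(terms, n=2):
    """N-grams never span a term (word) boundary."""
    grams = []
    for term in terms:
        grams.extend(overlapping_ngrams(term, n))
    return grams

def cross_term_ngrams(terms, n=2):
    """N-grams are taken over the concatenated sequence, ignoring boundaries."""
    flat = [unit for term in terms for unit in term]
    return overlapping_ngrams(flat, n)

# Assumed segmentation of /zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/.
terms = [["zhe4", "yi1"], ["wan3", "hui4"], ["ru2", "chang2"], ["ju3", "xing2"]]
within = within_term_ngrams(terms)  # one bigram per two-syllable term
cross = cross_term_ngrams(terms)    # also includes boundary-spanning bigrams
```

Cross-term n-grams do not depend on where the tokenizer placed word boundaries, which is one possible reason they would be less sensitive to segmentation differences between queries and documents.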

Upper Bound (Translingual)
Putting back the translingual element
Selected English query terms --> translated Chinese query terms (Oracle -- Jian Qiang Wang)
Retrieval performance
– word-based (180 terms, no #syn, #sum): 50.6%
– subword-based retrieval (character bigrams): #sum 52.1%, #syn 52.3%
– TKN??

Thus Far...
Ave Precision
– ASR / ASR (73%)
– XH / VOA_ASR (low 50% range)
– TDT_English / ASR (???): "perfect" translation, "best" index term set
– Baseline (16.9%)
Trying to climb up

Better Translation
# translation alternatives per term
– Current best (120 query terms, 3 translations per term, word-based retrieval, ASR reseg with CETA, #sum 28.1%)
– (90 query terms, 2 translations per term, word-based retrieval, ASR orig, #sum 27.53%)
Phrase-based translation
– 2 types of phrases (named entities, dictionary-based phrases)
– term selection (consider both phrases and component words), higher # terms
– Current best (250 query terms, all translations, word-based retrieval, 43.3%)
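The settings above vary the number of translation alternatives per term and whether they are combined with INQUERY's #sum or #syn operators. A rough sketch of how such a query string might be assembled follows; the function name, the exact spacing, and the choice to wrap only multi-alternative terms in #syn are assumptions, not the team's actual query format.

```python
def build_inquery_query(translations_per_term, max_alternatives=3, group_with_syn=True):
    """Build a structured query string from per-term translation alternatives.

    translations_per_term: one list of (hexified) translations per English term.
    """
    parts = []
    for alternatives in translations_per_term:
        kept = alternatives[:max_alternatives]
        if not kept:
            continue  # untranslatable term: nothing to add
        if group_with_syn and len(kept) > 1:
            # Treat the alternatives of one term as synonyms of each other.
            parts.append("#syn( " + " ".join(kept) + " )")
        else:
            parts.extend(kept)
    return "#sum( " + " ".join(parts) + " )"

flat = build_inquery_query([["a1", "a2"], ["b1"]], group_with_syn=False)
grouped = build_inquery_query([["a1", "a2"], ["b1"]], group_with_syn=True)
```

With a flat #sum, a term with many translations contributes more query tokens than a term with one; grouping under #syn keeps each source term as a single unit of evidence.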

Word-Subword Fusion
Words incorporate lexical knowledge
Subwords are intended to handle the OOV problem
Combination of both may beat either alone
Ranked list of retrieved documents
– from word-based retrieval
– from subword-based retrieval

Merging: Loose Coupling
Types of Evidence
– Score
– Rank
Score Combination
– Max
– Linear combination
Rank Combination
– Round robin
– Source bias
– Query bias
[Figure: ranked lists of VOA document IDs, ranks 1 to 1000]
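Three of the merging strategies listed above (max, linear combination, round robin) can be sketched as follows. This is an illustration under assumptions: the function names are invented here, scores from the two runs are assumed to be already comparable, and a document missing from one run is given a score of 0.

```python
from itertools import zip_longest

def combine_scores(word_scores, subword_scores, method="linear", weight=0.5):
    """Merge two {doc_id: score} runs by max or by linear combination."""
    combined = {}
    for doc_id in set(word_scores) | set(subword_scores):
        w = word_scores.get(doc_id, 0.0)
        s = subword_scores.get(doc_id, 0.0)
        if method == "max":
            combined[doc_id] = max(w, s)
        else:
            combined[doc_id] = weight * w + (1.0 - weight) * s
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))

def round_robin(list_a, list_b):
    """Alternate between two ranked lists, skipping documents already taken."""
    merged, seen = [], set()
    for a, b in zip_longest(list_a, list_b):
        for doc_id in (a, b):
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Score combination needs scores on a common scale, whereas rank combination only uses the order of each list, which is a reason to consider both kinds of evidence.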

Tight Coupling: Words and Bigrams
Lattice (alternatives per syllable slot): jiang / qiang, zhe / ze, min / ming
Words: Jiang Zemin
Bigrams: jiang_zhe jiang_ze qiang_zhe qiang_ze zhe_min zhe_ming ze_min ze_ming
Combination: jiang_zhe zhe_min Jiang Zemin
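The bigram list on this slide can be reproduced by pairing every alternative in one lattice slot with every alternative in the next. The sketch below assumes the lattice is a simple sequence of slots with alternatives (a simplification of a general lattice), and `lattice_bigrams` is an illustrative name.

```python
from itertools import product

def lattice_bigrams(slots):
    """All syllable bigrams across adjacent slots of a slot-based lattice."""
    bigrams = []
    for left, right in zip(slots, slots[1:]):
        bigrams.extend(f"{a}_{b}" for a, b in product(left, right))
    return bigrams

slots = [["jiang", "qiang"], ["zhe", "ze"], ["min", "ming"]]
bigrams = lattice_bigrams(slots)
```

Indexing every bigram keeps the correct reading retrievable even when the recognizer's top hypothesis is wrong, at the cost of a larger index with more spurious entries.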

Word-Subword Fusion (weighted similarity)
Merging ranked lists
Each retrieved document is scored
– i denotes words, subword n-grams
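The scoring formula from this slide is not present in the transcript. One plausible reading, assuming a linear weighted combination of per-unit similarities (the symbols below are assumptions and not the slide's own notation):

```latex
\mathrm{score}(q, d) = \sum_{i} w_i \, \mathrm{sim}_i(q, d)
```

Here i ranges over the index-unit types (words, subword n-grams), sim_i(q, d) is the similarity between query q and document d computed with unit type i, and w_i is the weight given to that unit type.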

Numerals and Named Entities
Verbalize numerals
Named Entities
– BBN tags (names of locations, people, organization)
– Derive Bilingual Term List from TDT2
– English letter-to-phone generation
– Cross-lingual phonetic mapping (English phones to Chinese phones)
– Syllabification

Cross-Lingual Phonetic Mapping
Named entity, e.g. Jiang Zemin, Kosovo
Pinyin spelling
– Syllabify, e.g. jiang ze min
Otherwise
– English Pronunciation Lookup or Letter-to-Phone Generation -> English phones, e.g. k ao s ax v ow
– Cross-lingual Phonetic Mapping -> Chinese phones, e.g. k e s u o w o
– Syllabification -> Chinese syllables, e.g. ke suo wo
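The Kosovo example can be traced with a toy sketch of the last two steps. The mapping table and syllable inventory below are derived only from that single example and are assumptions for illustration; a real system would need a full phone mapping and the complete Mandarin syllable inventory. The greedy longest-match syllabifier is likewise an assumed technique, not necessarily the one the team used.

```python
# Toy English-phone -> Chinese-phone table, taken from the Kosovo example only.
PHONE_MAP = {"k": ["k"], "ao": ["e"], "s": ["s"], "ax": ["u", "o"], "v": ["w"], "ow": ["o"]}

# Toy syllable inventory keyed by phone sequence.
SYLLABLES = {("k", "e"): "ke", ("s", "u", "o"): "suo", ("w", "o"): "wo"}

def map_phones(english_phones, phone_map=PHONE_MAP):
    """Replace each English phone by its (assumed) Chinese phone sequence."""
    chinese = []
    for phone in english_phones:
        chinese.extend(phone_map[phone])
    return chinese

def syllabify(phones, syllables=SYLLABLES):
    """Greedy longest-match segmentation of a phone sequence into syllables."""
    max_len = max(len(key) for key in syllables)
    result, i = [], 0
    while i < len(phones):
        for n in range(min(max_len, len(phones) - i), 0, -1):
            key = tuple(phones[i:i + n])
            if key in syllables:
                result.append(syllables[key])
                i += n
                break
        else:
            raise ValueError(f"no syllable matches phones starting at index {i}")
    return result

chinese_phones = map_phones("k ao s ax v ow".split())
chinese_syllables = syllabify(chinese_phones)
```

Greedy matching can commit to a long syllable that leaves an unparseable remainder; a more careful version would backtrack or score alternative segmentations.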

Syllable Lattice for Document Representation
Address ASR errors and OOV
– Augment Dragon ASR output with alternate syllable hypotheses
Generate syllable n-grams for audio indexing
Include into word-subword fusion
[Figure labels: DRAGON LVCSR, Our ASR]

Lots to do still...

Mandarin-English Information: Investigating Translingual Speech Retrieval
Johns Hopkins University, Center for Language and Speech Processing, JHU/NSF Summer Workshop 2000
MEI Team: Helen MENG (CUHK), Berlin CHEN (National Taiwan University), Erika GRAMS (Advanced Analytic Tools), Sanjeev KHUDANPUR (JHU/CLSP), Gina-Anne LEVOW (University of Maryland), Wai-Kit LO (CUHK), Douglas OARD (University of Maryland), Patrick SCHONE (Department of Defense), Karen TANG (Princeton University), Hsin-Min WANG (Academia Sinica), Jianqiang WANG (University of Maryland)
As of Sunday July 9, 2000
[System diagram; component labels grouped by stage]
Query Processing
– Input English text query
– Named Entity Tagger, Phrase tagging
– Query Term Selection
– Term Translation (English to Chinese translation dictionary; unknown words and phrases)
– Translated words and phrases
– Segmented Chinese Text
– Character n-gram generation
– Word sequence, Character n-gram sequence
– Query to INQUERY
Document Processing
– Spoken Mandarin documents
– Dragon Mandarin ASR
– Word sequence, Character n-gram sequence
– Document to INQUERY
Retrieval and Scoring
– INQUERY
– Ranked List of Possibly Relevant Documents
– Relevance Assessments
– Figure of Merit Scoring