Re-organization of IR/CSC team

Presentation transcript:

Re-organization of IR/CSC team
- Hongchao He
  - Conf. follow up: TREC-10, NTCIR
  - Paper follow up: ICCLP, SIGIR paper
- Guihong Cao
  - MSKK-III: Clustering for technique transfer
- Yang Wen
  - MSKK-III: Distance word dependency
- Min Zhang
  - MSKK/CSC: Entropy based pruning for applications of (Pinyin/Hiragana) input system

Chinese Spelling Checking (or, the Big CSC)
Jianfeng Gao
NLC Group, MSRCN

Outline
- Introduction
- Chinese spelling checking
- Our approach
- Key techniques and experiments
- Milestones

Introduction
- Chinese spelling errors using the MS-Pinyin input system
- Chinese spelling error patterns
- English spelling checking
- Why is CSC difficult?
- Goal: automatically correct Chinese spelling errors made with the MS-Pinyin (MSPY) input system

Chinese spelling errors using MSPY
- Input pipeline: Text in the brain -> Syllable -> Key stroke (Typing) -> Converted text
- Error types: Pinyin (phonetic) errors, Typographic errors, System errors

Chinese spelling error patterns
- Substitution errors
  - Pinyin error
  - System error (includes Pinyin error in some systems)
- Non-substitution errors
  - Word segmentation errors
  - Typographic errors: insertion/deletion/transposition

English spelling checking
- Non-word error detection (the -> hte)
  - N-gram (letter) analysis
  - Dictionary lookup
- Real-word error detection (from -> form)
  - NLP: parser driven
  - Statistical approach: data/error driven
    - Local: n-gram language model, depends on a pre-defined confusion set
    - Global: Winnow, Bayesian, TBL, etc.
- Problem: lack of error detection
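
The difference between non-word and real-word errors can be illustrated with a minimal dictionary-lookup sketch. The lexicon and the function name `find_non_words` are hypothetical, chosen only for illustration; the slide does not describe a specific implementation.

```python
def find_non_words(tokens, lexicon):
    """Return (index, token) pairs for tokens that are absent from the lexicon."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() not in lexicon]


# Toy lexicon (an assumption made for this example only).
lexicon = {"the", "cat", "sat", "on", "mat", "from", "form", "a", "letter"}

# "hte" is a non-word error: dictionary lookup catches it.
print(find_non_words("hte cat sat on the mat".split(), lexicon))

# "form" typed for "from" is a real-word error: lookup finds nothing,
# so context is needed to detect it.
print(find_non_words("a letter form the cat".split(), lexicon))
```

Dictionary lookup flags "hte" but stays silent on "form", which is why real-word errors need the parser-driven or statistical methods listed above.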

Why is CSC difficult?
- Word segmentation
  - Ambiguous
  - OOV: proper noun detection (personal name, location, organization, etc.)
  - Segmentation error propagation
- Non-word errors (in the English sense) do not exist
- MSPY already makes good use of a word trigram language model

Chinese spelling checking
- CSC: related works
  - Template matching: long distance, e.g.
  - Pattern matching: long words (n>=3), e.g.,
  - N-gram models: substitution errors
- CSC: challenges
  - Long distance, coverage issue of template/pattern set
  - Frequently used confusion sets, e.g. { } { }
  - OOV, especially proper nouns
  - N-gram has already been fully used by MSPY

Chinese spelling error patterns in MSPY
- Proper noun
  - Personal name
  - Location
  - Organization
- Non-word errors: context independent
  - Insertion/deletion/transposition/substitution
  - E.g.,
- Real-word errors: context sensitive
  - E.g.,

Flowchart of our approach
- Text with errors
- Word segmentation: Proper noun detection
- Non-word error correction: Word fuzzy matching (trigger: single char string, low prob)
- Real-word error correction: Context sensitive disambiguation

Word segmentation and proper noun detection
- Language model based word segmentation
- Class-based language model
  - P(W) = P_outside(W) P_inside(W | )^a, a = ?
  - Outside probability: PN tagged training data
    - Using NLPWIN to tag the corpus
    - Filtering, rule base
    - EM?
  - Inside probability: PN list training data
- Using cache (or, dynamic dictionary)
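
Language model based segmentation can be sketched as a dynamic-programming search for the most probable word sequence. This is a simplified illustration, not the system's method: it uses a unigram model instead of a class-based or trigram model, and the probabilities, the `oov_prob` floor, and the function name `segment` are assumptions made for the example.

```python
import math


def segment(text, unigram_prob, max_word_len=4, oov_prob=1e-8):
    """Return the most probable segmentation of text under a unigram model."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            # Unknown single characters get a small floor probability,
            # so every input has at least one valid segmentation.
            p = unigram_prob.get(word, oov_prob if len(word) == 1 else 0.0)
            if p <= 0.0:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy probabilities over Latin letters standing in for characters (hypothetical).
probs = {"ab": 0.3, "c": 0.2, "a": 0.1, "b": 0.1, "bc": 0.05}
print(segment("abc", probs))
```

In a class-based variant, a proper-noun candidate would presumably be scored by combining an outside (context) probability with an inside (name-internal) probability rather than by a single lexicon lookup.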

Experiments and Findings
- Measure: precision/recall (definition)
- Training data: People's Daily
- Tag tool: NLPWIN
- Test data: spec.
- Results and Findings
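
The slide does not spell out its precision/recall definition, so the sketch below uses the standard one: precision is the share of proposed items that are correct, and recall is the share of gold items that were found. The item representation (start and end offsets of detected proper nouns) is an assumption for illustration.

```python
def precision_recall(proposed, gold):
    """Compute (precision, recall) for a set of proposed items against a gold set."""
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans (start, end) of detected vs. annotated proper nouns.
detected = {(0, 2), (5, 8), (10, 12), (15, 17)}
annotated = {(5, 8), (15, 17), (20, 23)}
print(precision_recall(detected, annotated))
```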

Long word fuzzy matching
- Definition of Distance(s1, s2)
  - Long word, n>=3
  - Sum of character deletions, insertions, and substitutions
- Fast fuzzy matching
  - Global: Lei Zhang's ACL
  - Local: trigger (single char, or low n-gram probability)
- Search: error detection/correction
  - Viterbi
  - Simplified version
  - Long word + local matching
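
A distance defined as the number of character deletions, insertions, and substitutions is the Levenshtein distance. The sketch below computes it and uses it for a brute-force lookup of long words in a lexicon. The linear scan, the thresholds, and the name `fuzzy_match` are assumptions for illustration; the fast matching method referenced on the slide is not reproduced here.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of character deletions,
    insertions, and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(candidate, lexicon, max_distance=1, min_len=3):
    """Return (distance, word) pairs for long lexicon words near the candidate."""
    matches = []
    for word in lexicon:
        if len(word) < min_len:
            continue  # only long words (n >= 3) are matched
        d = edit_distance(candidate, word)
        if d <= max_distance:
            matches.append((d, word))
    return sorted(matches)


# Toy lexicon (hypothetical); Python strings are sequences of Unicode
# characters, so the same code applies to Chinese text.
lexicon = {"information", "retrieval", "spelling", "spell"}
print(fuzzy_match("speling", lexicon))
```

Scanning the whole lexicon costs one distance computation per entry, which is why a trigger (single-character strings or low n-gram probability) that limits where matching is attempted matters in practice.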

Experiments and Findings
- Contact: 100 persons, characters/person
- Error analysis
- Algorithm ...
- Measure: precision/recall
- Large lexicon, acquisition
- Trigger/threshold?
- Results and Findings

Context sensitive disambiguation
- Building confusion set: specific to MSPY
- Feature selection: context vector
  - Collocation: contiguous POS or words/characters
  - Context words: words/characters within a K-size window
  - Triple?
- Weighting schema and classifier
  - Context vector, TFIDF
  - Winnow, Bayesian, TBL, etc.
- Scaling up
  - Enlarge confusion set
  - Feature pruning
  - Adaptation
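
One way to picture the context-word features with a Bayesian classifier is a naive Bayes model over the words inside a K-size window around each confusion-set member. This is a minimal sketch under stated assumptions: the class name, the add-one smoothing, the English confusion set {from, form}, and the training sentences are all hypothetical and not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def context_window(tokens, index, k):
    """Words within k positions to the left and right of tokens[index]."""
    return tokens[max(0, index - k):index] + tokens[index + 1:index + 1 + k]


class NaiveBayesDisambiguator:
    def __init__(self, confusion_set, k=2):
        self.confusion_set = set(confusion_set)
        self.k = k
        self.word_counts = Counter()                # occurrences of each member
        self.feature_counts = defaultdict(Counter)  # member -> context word counts
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.word_counts[tok] += 1
                    for feat in context_window(tokens, i, self.k):
                        self.feature_counts[tok][feat] += 1
                        self.vocab.add(feat)

    def predict(self, tokens, index):
        """Pick the confusion-set member that best fits the context at index."""
        feats = context_window(tokens, index, self.k)
        total = sum(self.word_counts.values())
        vocab_size = len(self.vocab) or 1

        def score(word):
            # Log prior and log likelihoods, both with add-one smoothing.
            s = math.log((self.word_counts[word] + 1) / (total + len(self.confusion_set)))
            denom = sum(self.feature_counts[word].values()) + vocab_size
            for feat in feats:
                s += math.log((self.feature_counts[word][feat] + 1) / denom)
            return s

        return max(sorted(self.confusion_set), key=score)


model = NaiveBayesDisambiguator({"from", "form"}, k=2)
model.train([
    "a letter from my friend".split(),
    "came from the north".split(),
    "fill in the form please".split(),
    "the form is long".split(),
])
# The typed word "from" sits in a context that fits "form" better.
print(model.predict("fill in the from please".split(), 3))
```

A correction would be proposed when the predicted member differs from the word actually typed. Collocation features and TF-IDF weighting from the slide are left out to keep the sketch short.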

Experiments and Findings
- Measure: precision/recall
- Training data
- Test data (XXX confusion set)
- Results and Findings

Experiments and Findings
- Current Work
  - Pseudo-training set based on MSPY IME
    - Preliminary data processing (400M PD)
    - Unigram error model (10,000 words useful)
      - /69484 /10289 /2394 ...
    - Trigram error pattern (980,000 useful)
      - [ ] => / [ ] =>
  - Experiments based on basic approaches
    - Pseudo-test set from
    - Continuous pair (Recall = 50%, Precision = 25%)
    - Pattern Matching (??)
- Future Work
  - Hybrid approaches
    - Pattern Clustering + Continuous pair
  - Functional words error detection

System evaluation: put it all together
- Evaluation toolset
- Measure: precision/recall
- Training data
- Test data
- Results and Findings

Prototype
- Demo ...
- Online & offline CSC
- Right click
  - Spelling error detection/correction
  - Proper noun detection/correction

Assignment
- Jianfeng Gao: overall, fuzzy matching
- Mu Li: context sensitive disambiguation
- Jian Sun: PN detection
- Yang Wen: system evaluation
- Yulin Kang: demo
- Lei Zhang: senior consultant

Milestones
- Oct. 2001, Ming says Yes (TAB demo)
- Dec. 2001, Dong says Yes (Transfer)
- Aug. 2002, HJ says Yes (Party)

Information
- Access at \\msrcn4p3\rootD\gaojf\spell
- Contact me if there are any problems
- Jianfeng Gao, Tel: