Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Presentation transcript:

Tsinghua University 1 Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation. Wei Qiao, Maosong Sun and Wolfgang Menzel. State Key Lab of Intelligent Tech. & Sys., Tsinghua University; Department of Informatics, Hamburg University

2 Part Ⅰ Background

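3 Introduction. Chinese word segmentation faces two kinds of ambiguity. Combination ambiguity: 火把 can be read as one word, 火把 (torch), or as two words, 火 (fire) 把 (make). Overlapping ambiguity: the string 其次要 is segmented differently in (a) 先解决其主要问题,再解决其次要问题 (first solve its main problem, then solve its subordinate problem), where it is 其 次要 (the subordinate), and in (b) 首先要关注整体,其次要注意细节 (first attend to the whole, secondly attend to the details), where it is 其次 要 (secondly we should).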
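The two kinds of ambiguity can be reproduced with a small sketch that enumerates every way of splitting a string into dictionary words. The function name `all_segmentations` and the toy `LEXICON` are assumptions made for illustration; they are not taken from the slides or from the Peking University word list.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched word.
            for tail in all_segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"其", "要", "其次", "次要", "火", "把", "火把"}

# Overlapping ambiguity: the middle character 次 can attach to either side.
print(all_segmentations("其次要", LEXICON))  # [['其', '次要'], ['其次', '要']]

# Combination ambiguity: the whole string is a word and so are its parts.
print(all_segmentations("火把", LEXICON))    # [['火', '把'], ['火把']]
```

Enumeration of this kind only exposes the competing readings; choosing between them still requires context or a stored decision.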

4 Related Terms. Overlapping ambiguity string (OAS), characterized by its length, order, intersection length and structure. Maximal overlapping ambiguity string (MOAS). A MOAS is either a true ambiguity (TM) or a pseudo ambiguity (PM): e.g. 其次要 (TM): 其次 要 & 其 次要; e.g. 部长篇小说 (PM): 部 (measure word) 长篇小说.
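One possible way to locate MOAS candidates in running text is sketched below, under a simplified reading of the definition: multi-character dictionary words that cross each other (overlap without one containing the other) are chained together, and the maximal region covered by such a chain is reported. The helper names, the `max_len` limit and the toy lexicon are assumptions; the slides do not describe the detection procedure actually used.

```python
def find_word_spans(text, lexicon, max_len=4):
    """All (start, end) spans of multi-character lexicon words in `text`."""
    spans = []
    for i in range(len(text)):
        for j in range(i + 2, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                spans.append((i, j))
    return spans


def crosses(a, b):
    """True if two spans overlap and neither contains the other."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s1 < s2 < e1 < e2


def find_moas(text, lexicon, max_len=4):
    """Return maximal substrings covered by chains of crossing words."""
    spans = find_word_spans(text, lexicon, max_len)
    parent = list(range(len(spans)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Put crossing words into the same group.
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            if crosses(spans[i], spans[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, span in enumerate(spans):
        groups.setdefault(find(i), []).append(span)

    regions = []
    for members in groups.values():
        if len(members) >= 2:  # at least one crossing pair
            start = min(s for s, _ in members)
            end = max(e for _, e in members)
            regions.append((start, end))
    return [text[s:e] for s, e in sorted(regions)]


# Toy lexicon (hypothetical, for illustration only).
TOY_LEXICON = {"其次", "次要", "问题", "部长", "长篇", "长篇小说", "小说"}

print(find_moas("再解决其次要问题", TOY_LEXICON))  # ['其次要']
print(find_moas("一部长篇小说", TOY_LEXICON))      # ['部长篇小说']
```

The sketch says nothing about whether a detected string is a true or a pseudo ambiguity; in the slides that judgment is made by hand.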

5 Previous Work. [Sun et al., 1999]: on a corpus of 100 million characters, a core set of MOASs was found. [Li et al., 2003]: on a corpus of 650 million characters, a similar method was used to improve the performance of a segmenter.

6 Motivation. Two basic issues remain unsolved in the previous work: (1) only news data were included, so the results need further validation; (2) the core of pseudo OA strings still has to be determined, both for general-purpose and for domain-specific text.

7 Part Ⅱ Statistical Properties of MOAS: From General Corpus; From Domain-specific Corpus

8 From General Corpus: Data Set. CBC: 929,963,468 characters, rich in content (from the 1920s on) and covering categories such as novels, essays and news. Chinese word list: Peking University, 74,191 entries. In total, 733,066 distinct MOAS types were found automatically in CBC.

9 From General Corpus: Detailed Distribution. Perspective 1: Length

10 From General Corpus: Perspective 2: Order

11 From General Corpus: Perspective 3: Intersection Length

12 From General Corpus: Perspective 4: Structure Distribution

13 From General Corpus: Top N Frequent MOAS (core candidates). 3,500 ~ 50.78%; 7,000 ~ 60.43%; … ~ 80.39%
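A coverage figure of this kind can be computed, roughly, as the share of all MOAS tokens that belong to the N most frequent MOAS types. The sketch below assumes that reading; the function name `top_n_coverage` and the toy token list are hypothetical and do not reproduce the CBC counts.

```python
from collections import Counter


def top_n_coverage(moas_tokens, n):
    """Fraction of MOAS tokens accounted for by the n most frequent types."""
    counts = Counter(moas_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Invented toy data: 10 MOAS tokens over 3 types.
tokens = ["其次要"] * 5 + ["出国门"] * 3 + ["部长篇小说"] * 2

for n in (1, 2, 3):
    print(n, top_n_coverage(tokens, n))
```

Plotting this value against N for a real corpus would give the kind of curve from which core candidates are picked.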

14 From General Corpus: Stability vs. Corpus Size. Number of MOAS vs. corpus size; number of top N MOAS vs. corpus size (top 7,000).

15 From General Corpus: Pseudo MOAS Detection. The definition of "pseudo" is relaxed. E.g. 出国门: 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) only with small probability. 5,507 PMs and 1,439 TMs were judged by hand. Token coverage of PM and TM over CBC.

16 From Domain-specific Corpora: Domain-Specific Corpora. Ency55: … million characters. Web55: … million characters. Common parts.

17 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=3,500)

18 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=7,000)

19 From Domain-specific Corpora: Frequent MOAS Coverage in Domain-Specific Corpora (N=40,000)

20 From Domain-specific Corpora: PM and TM Distribution over Domain Corpora. 42% of the overlapping ambiguities in any Chinese text can be resolved with 100% accuracy.

21 Part Ⅲ Disambiguation

22 Disambiguation Method: Current Performance on OA. Performance of ICTCLAS1.0 on OAs, e.g. 公安局 长 是 主管 这一 事故 的: "The police chief (公安 局长) is the person in charge of this accident." Performance of MSR-Seg1.0 on OAs, e.g. 核电站的特殊性 质: "the special properties (特殊 性质) of the nuclear power station".

23 Disambiguation Method. Performance of CRF-based [Lafferty 2001] CWS on OAs, e.g. 这一 现状 先 天地 决定 了 他们 的 使命: "This situation congenitally (先天 地) makes them take on the mission." About 2% of OASs are mistakenly segmented, so resolving them is a net gain.

24 Disambiguation Method: Individual-based method. Simple table lookup: record the PMs and their correct segmentations in a table. Advantages: satisfactory token coverage of MOASs; full correctness for the segmentation of pseudo MOASs; low cost in time and space.
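The table-lookup idea might be sketched as follows: pseudo MOASs are stored with their fixed segmentation, matched greedily (longest first) in the input, and everything outside a match is handed to some base segmenter. The names `PSEUDO_TABLE`, `segment_with_table` and `fallback` are hypothetical, and plain substring matching is a simplifying assumption; a real system would presumably apply the table only to strings first detected as MOASs.

```python
# Hypothetical table: pseudo MOAS -> its single correct segmentation.
PSEUDO_TABLE = {
    "部长篇小说": ["部", "长篇小说"],
    "出国门": ["出", "国门"],
}


def segment_with_table(text, table, fallback):
    """Segment `text`, using `table` for known pseudo MOASs.

    `fallback` is any callable that turns a string into a list of words;
    it stands in for a base segmenter and is an assumption of this sketch.
    """
    longest = max(map(len, table), default=0)
    words = []
    pending = ""  # text not covered by any table entry so far
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in table:
                match = candidate
                break
        if match is not None:
            if pending:
                words.extend(fallback(pending))
                pending = ""
            words.extend(table[match])
            i += len(match)
        else:
            pending += text[i]
            i += 1
    if pending:
        words.extend(fallback(pending))
    return words


# `list` splits into single characters and is only a placeholder segmenter.
print(segment_with_table("一部长篇小说", PSEUDO_TABLE, list))  # ['一', '部', '长篇小说']
```

Because each lookup is a dictionary access on a short substring, the cost stays low in both time and space, which matches the advantage claimed for the method.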

25 Conclusion. This work is an extension of [Sun et al., 1999]: it adjusts the existing results using large corpora and further verifies the properties on domain-specific corpora. A disambiguation strategy is proposed: over 42% of overlapping ambiguities can be resolved without any mistake. The strategy will be more effective when facing running text.

26 References
Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML).
Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6). (In Chinese)
Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN'2003, pages 1-7.
Sun M.S. and Z.P. Zuo. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language.
Sun M.S., C.N. Huang, and B.K.Y. T'sou. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5). (In Chinese)
Sun M.S., Z.P. Zuo and B.K.Y. T'sou. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1). (In Chinese)

27 Thank you! Any comments?