A Search-based Chinese Word Segmentation Method (WWW 2007). Xin-Jing Wang (IBM China), Wen Liu (Huazhong Univ., China), Yong Qin (IBM China).

Presentation transcript:

Introduction
- Challenges in Chinese word segmentation (CWS): ambiguity and unknown words
- Web and search technology offers an approach that is:
  - Free from the out-of-vocabulary (OOV) problem
  - Adaptive to different segmentation standards
  - Entirely unsupervised

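The proposed approach: collecting segments
- Split the query sentence into sub-sentences at punctuation
- Submit each sub-sentence to a search engine
- Collect the highlighted text from the returned snippets as candidate segments
- Example query: “我明天要去止锚湾玩” ("I am going to Zhimaowan tomorrow to have fun")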
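The collection step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the search engine call itself is left out, and the snippets are assumed to arrive as strings in which highlighted text is wrapped in `<em>...</em>` (real engines mark highlights differently). The names `PUNCTUATION`, `split_sub_sentences` and `collect_segments` are hypothetical.

```python
import re

# Illustrative set of punctuation marks used to cut a sentence (an assumption).
PUNCTUATION = "，。！？；：、,.!?;:"


def split_sub_sentences(sentence):
    """Split a sentence into non-empty sub-sentences at punctuation marks."""
    parts = re.split("[" + re.escape(PUNCTUATION) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]


def collect_segments(snippets, sub_sentence):
    """Collect highlighted strings that also occur in the sub-sentence.

    Assumes highlights are wrapped in <em>...</em> in each snippet string.
    Duplicates are kept, because later scoring counts occurrences.
    """
    segments = []
    for snippet in snippets:
        for hit in re.findall(r"<em>(.*?)</em>", snippet):
            if hit and hit in sub_sentence:
                segments.append(hit)
    return segments
```

In use, each sub-sentence would be sent to whatever search engine is available, and the returned snippet strings passed to `collect_segments`.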

The proposed approach: scoring segments
- Goal: select a subset of the segments as the final segmentation
- Frequency-based score: the term frequency of a segment, i.e. its total number of occurrences
- SVM-based score: an SVM classifier with an RBF kernel, whose outputs are mapped to probabilities and used as the scores
- Reconstruct the query using the segmentation with the highest score
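As a small illustration of the frequency-based score only (the SVM-based variant is not shown), the sketch below counts how often each collected segment occurs. The function name `frequency_scores` is hypothetical, and the input is assumed to be the list of segments gathered from the snippets, duplicates included.

```python
from collections import Counter


def frequency_scores(segments):
    """Score each distinct segment by its total number of occurrences."""
    return dict(Counter(segments))
```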

The proposed approach: selecting segments
- Valid subset: a subset whose member segments reconstruct the query exactly
- Score of a valid subset: the average score of its member segments
- Valid subsets are found by greedy search, for efficiency
- The valid subset with the highest score is selected as the final segmentation
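The slides do not spell out the greedy search, so the following is only one plausible reading, stated as an assumption: walk the query left to right and, at each position, take the highest-scoring candidate segment that matches there (ties broken in favor of the longer segment). Falling back to a single character with score 0 when nothing matches is also an assumption, added so the result always reconstructs the query. The name `greedy_segmentation` is hypothetical.

```python
def greedy_segmentation(query, scores):
    """Greedily build one valid segmentation of `query`.

    `scores` maps candidate segments to their scores. Returns the chosen
    segments and the average score of those segments.
    """
    position = 0
    chosen = []
    while position < len(query):
        # Candidate segments that match the query at the current position.
        candidates = [s for s in scores if s and query.startswith(s, position)]
        if candidates:
            best = max(candidates, key=lambda s: (scores[s], len(s)))
        else:
            # Assumed fallback: emit one character (scored 0 below).
            best = query[position]
        chosen.append(best)
        position += len(best)
    if not chosen:
        return chosen, 0.0
    average = sum(scores.get(s, 0.0) for s in chosen) / len(chosen)
    return chosen, average
```

Because this follows a single greedy path, it does not compare several valid subsets; a fuller version would generate a few valid subsets and keep the one with the highest average score.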

Evaluations: experiment setting
- SVM-based score
  - Training set: 3000 randomly selected sentences
  - Three-dimensional feature space: TF, DF, LEN
    - TF: term frequency
    - DF: number of documents indexed by a segment
    - LEN: number of characters in a segment
- The frequency-based score needs no training set
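The three features could be computed along these lines. This is a sketch under an assumption: each returned snippet is treated as one "document", represented by the list of segments highlighted in it. The function name `segment_features` is hypothetical.

```python
def segment_features(segment, snippet_segments):
    """Return the (TF, DF, LEN) feature triple for one segment.

    `snippet_segments` holds one list of highlighted segments per snippet.
    """
    # TF: total occurrences of the segment across all snippets.
    tf = sum(doc.count(segment) for doc in snippet_segments)
    # DF: number of snippets in which the segment appears at least once.
    df = sum(1 for doc in snippet_segments if segment in doc)
    # LEN: number of characters in the segment.
    return (tf, df, len(segment))
```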

Evaluations: comparison results (SIGHAN’05)

Evaluations: discussion
- The results are worse than the reported results
- Why is the SVM-based score worse? The feature space is too simple
- Advantages: needs only 3000 training sentences, or no training set at all; avoids the OOV problem
- Better performance can be achieved when more search results are provided (Google + Yahoo!)

Evaluations: comparison to the IBM full parser

Conclusion
- The approach is good at discovering new words (no OOV problem) and at adapting to different segmentation standards
- It is entirely unsupervised, which saves the labor of labeling training data
- Future work: finding more effective scoring methods
- Future work: combining the current approach with other types of segmentation methods to give better performance

My ongoing work: Discriminative Reranking (ACL 07 & 03)
1. Michael Collins and Terry Koo
2. Zhongqiang Huang (Purdue Univ.)

Background
- Reranking has been applied to many NLP applications: NER, parsing, sentence boundary detection
- It has not yet been tried on POS tagging

Motivation
1. Rerank the output of an existing probabilistic tagger.
2. The base tagger produces a set of candidate tag sequences for each sentence.
3. A second model attempts to improve upon this initial ranking using additional features.

Collins’ Reranking Algorithm: training the reranker
- n sentences, each with n_i candidates
- Each candidate comes with the log-probability produced by the HMM tagger
- “Goodness” score: measures the similarity between the candidate and the gold reference

Collins’ Reranking Algorithm: the training data consists of a set of examples, each with a “goodness” score and a log-probability.

Collins’ Reranking Algorithm: features
- A set of indicator functions extracts binary features from each example
- Each indicator function is associated with a real-valued weight parameter

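Collins’ Reranking Algorithm: ranking function and training objective
- The ranking function (equation not preserved in the transcript)
- The objective of training: set the weight parameters to minimize a loss (equation not preserved in the transcript)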
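Since the slide equations did not survive, the sketch below is a reconstruction under stated assumptions rather than the presenter's formulas. It follows the form commonly given for Collins and Koo's boosting approach: a linear ranking function F(x) = w0 * log_prob(x) + sum of the weights of the active indicator features, and an exponential loss that sums, over every non-best candidate, the goodness gap to the best candidate times exp(-margin). The names `ranking_score` and `exp_loss`, and the tuple layout of a candidate, are hypothetical.

```python
import math


def ranking_score(log_prob, features, weights, w0):
    """Linear ranking function over the log-probability and binary features.

    `features` is the set of indicator features active on the candidate;
    `weights` maps feature names to real-valued weights.
    """
    return w0 * log_prob + sum(weights.get(f, 0.0) for f in features)


def exp_loss(sentences, weights, w0):
    """Exponential ranking loss (assumed form, see the lead-in).

    `sentences` is a list of candidate lists. Each candidate is a tuple
    (goodness, log_prob, features), and the first candidate of each
    sentence is assumed to be the one with the highest goodness score.
    """
    loss = 0.0
    for candidates in sentences:
        best_goodness, best_lp, best_feats = candidates[0]
        best_score = ranking_score(best_lp, best_feats, weights, w0)
        for goodness, lp, feats in candidates[1:]:
            # Margin: how far the best candidate outranks this one.
            margin = best_score - ranking_score(lp, feats, weights, w0)
            loss += (best_goodness - goodness) * math.exp(-margin)
    return loss
```

Training would then search for weights that lower this loss, so that the best candidate of each sentence is ranked above the others by a larger margin.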

Experiments
- Base model: HMM
- Data set: the most recently released Penn Chinese Treebank 5.2 (denoted CTB, released by LDC)
  - 33 POS tags
  - 500K words, 800K characters, 18K sentences

Experiments: the data is divided into 20 chunks, and each chunk is N-best tagged by the HMM model trained on the combination of the other 19 chunks.
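The 20-chunk scheme can be sketched as below. How the sentences were assigned to chunks is not stated in the slides, so the round-robin split is an assumption, and the HMM training and N-best tagging steps are left out. The names `split_into_chunks` and `jackknife_folds` are hypothetical.

```python
def split_into_chunks(sentences, num_chunks=20):
    """Split sentences into `num_chunks` chunks, round-robin (an assumption)."""
    return [sentences[i::num_chunks] for i in range(num_chunks)]


def jackknife_folds(sentences, num_chunks=20):
    """Yield (training, held_out) pairs, one per chunk.

    Each held-out chunk is paired with the combination of all other chunks,
    which is what the base tagger would be trained on before tagging it.
    """
    chunks = split_into_chunks(sentences, num_chunks)
    for i, held_out in enumerate(chunks):
        training = [s for j, chunk in enumerate(chunks) if j != i for s in chunk]
        yield training, held_out
```

This way every sentence receives its N-best candidates from a tagger that never saw it during training, which keeps the reranker's training data realistic.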

Experiments: results of the reranking models
- N-gram features
- N-gram + morphological features

Conclusion
- The reranking method is efficient on the POS tagging task
- Extract additional reranking features that utilize more explicitly the characteristics of Mandarin
- Explore semi-supervised training methods for reranking