Document Centered Approach to Text Normalization Andrei Mikheev LTG University of Edinburgh SIGIR 2000.

Abstract
Three problems of text normalization:
- Sentence Boundary Disambiguation (SBD)
- Disambiguation of capitalization when words are used in positions where capitalization is expected
- Identification of abbreviations
The Document Centered Approach is applied to sentence boundary disambiguation and compared with methods that rely on resources pre-built from existing corpora (the Wall Street Journal and Brown corpora).

Introduction
- Text cleaning and normalization underpin text processing and Information Retrieval applications.
  - Text normalization begins with the disambiguation of capitalized words
    - Capitalization is expected for proper names: people, locations, organizations, etc.
    - Ambiguity arises because capitalization is also mandatory in special positions (e.g., at the start of a sentence)
- Disambiguation of capitalized words in ambiguous positions (also known as normalization) leads to the identification of proper names
  - A study by Church showed that a capitalized and an uncapitalized form can:
    - refer to the same thing/object (e.g., "hurricane" and "Hurricane")
    - refer to different things/objects (e.g., "apple" the fruit and "Apple" the computer company)

Introduction (cont.)
- Capitalized-word disambiguation supports sentence splitting / sentence boundary disambiguation (SBD)
  - Sentence splitting: locating sentence boundaries at punctuation marks such as "!", "?", and "."
  - A period can serve one or several roles at once:
    - marking the end of a sentence
    - denoting a decimal point
    - denoting an abbreviation
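The period's multiple roles are exactly what make naive splitting fail. A small illustration (my own, not code from the paper) shows a splitter that breaks on every period followed by whitespace:

```python
import re

# A naive splitter that treats every period followed by whitespace
# as a sentence boundary.
def naive_split(text):
    return re.split(r"(?<=\.)\s+", text)

text = "Mr. Smith paid $3.5 mln. for the house."
# The decimal point in "$3.5" is safe (no following whitespace), but the
# splitter wrongly breaks after the abbreviations "Mr." and "mln.",
# producing three fragments instead of one sentence:
print(naive_split(text))  # ['Mr.', 'Smith paid $3.5 mln.', 'for the house.']
```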

Our Approach to SBD (Sentence Boundary Disambiguation)
- The experiments used the Wall Street Journal corpus and the Brown corpus
  - Three tasks were carried out on both corpora:
    - Sentence Boundary Disambiguation
    - Capitalized Word Disambiguation
    - Abbreviation Identification
  - Labor-intensive, manually built rules were programmed to recognize:
    - abbreviations that are followed by proper names (e.g., Mr. White)
    - single-word abbreviations that are short and, in most cases, contain no vowels (e.g., kg., ft.)
    - abbreviations consisting of a series of capitalized single letters separated by periods (e.g., Y.M.C.A., U.C.L.A., A.L.A.)
  - Even with such rules, error rates remained too high on both corpora (15-16%)
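The three surface patterns above can be approximated with regular expressions. These are illustrative patterns of my own; the exact rules used in the paper may differ:

```python
import re

# Hypothetical regexes approximating the three abbreviation classes
# named on the slide.
honorific = re.compile(r"^(Mr|Mrs|Ms|Dr|Prof)\.$")            # followed by a proper name
short_no_vowel = re.compile(r"^[b-df-hj-np-tv-z]{1,4}\.$", re.I)  # e.g. kg., ft.
letter_seq = re.compile(r"^([A-Z]\.){2,}$")                    # e.g. Y.M.C.A., U.C.L.A.

for token in ["Mr.", "kg.", "ft.", "Y.M.C.A.", "table."]:
    kind = ("honorific" if honorific.match(token)
            else "short" if short_no_vowel.match(token)
            else "letter sequence" if letter_seq.match(token)
            else "not matched")
    print(token, "->", kind)  # "table." is not matched by any pattern
```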

Document-Centered Approach
- Document-Centered Approach (DCA): examine the entire document to gather evidence for disambiguating capitalized proper names and abbreviations
  - Generalized principles of the DCA method:
    - if a word has been capitalized in an unambiguous position, this increases the probability that it is a proper name (e.g., "The Riders [as in the Rider family] said...")
    - if a short word followed by a period (e.g., "in." standing for inches) also occurs elsewhere in the document without a period, it is likely not an abbreviation
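The two principles can be sketched as simple membership tests over evidence collected from the whole document. Function names and representation are mine, not the paper's:

```python
def likely_proper_name(word, unambiguous_caps):
    """unambiguous_caps: capitalized tokens seen in non-sentence-initial
    positions in THIS document. Capitalization there is strong evidence
    that the word is a proper name everywhere in the document."""
    return word in unambiguous_caps

def likely_abbreviation(word, period_free_tokens):
    """word appeared followed by a period. If the same word also occurs
    in the document WITHOUT a trailing period, the period was probably
    a sentence end rather than part of an abbreviation."""
    return word not in period_free_tokens

print(likely_proper_name("Riders", {"Riders", "Edinburgh"}))  # True
print(likely_abbreviation("in", {"the", "in", "deep"}))       # False
print(likely_abbreviation("mln", {"profit", "rose"}))         # True
```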

Getting Abbreviations
- Abbreviation recognition starts from existing abbreviation lists
- These lists (which may be incomplete for a given document) are enhanced by:
  - collecting unigram abbreviation candidates from the document itself (e.g., "Sun.", which can stand for Sunday or the newspaper)
  - collecting bigram abbreviation candidates, which are made up of two words but act as a single unit (e.g., "Vitamin C.")
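Candidate collection amounts to a single pass over the document's tokens. A hedged sketch, with illustrative length thresholds that are not taken from the paper:

```python
from collections import Counter

def abbreviation_candidates(tokens):
    """Collect unigram candidates (short tokens ending in a period,
    like "Sun.") and bigram candidates (a word followed by a single
    capital letter plus period, like "Vitamin C.")."""
    unigrams, bigrams = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok.endswith(".") and len(tok) <= 5:
            unigrams[tok] += 1
        if i > 0 and len(tok) == 2 and tok[0].isupper() and tok[1] == ".":
            bigrams[(tokens[i - 1], tok)] += 1
    return unigrams, bigrams

u, b = abbreviation_candidates(["Open", "on", "Sun.", "Take", "Vitamin", "C.", "daily"])
print(u)  # unigram candidates include "Sun." and "C."
print(b)  # bigram candidates include ("Vitamin", "C.")
```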

Getting Capitalized Words
- Capitalized words are disambiguated using four methods:
  - The Sequence Strategy: exploit sequences of proper nouns found in unambiguous positions
  - The Frequent List Lookup Strategy: a pre-compiled list of words that are frequently capitalized to denote proper names
  - Single Word Assignment: review the entire document to determine whether a capitalized word consistently acts as a proper name
  - The "After Abbr." Heuristic: when a capitalized word follows a capitalized abbreviation, it is in most cases a proper name
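Single Word Assignment can be sketched as follows: tally how each word is cased in unambiguous (non-sentence-initial) positions across the document, and assign a class only when the evidence is consistent. This is my own simplified reading of the strategy, not the paper's exact conditions:

```python
def single_word_assignment(sentences):
    """sentences: list of token lists. A word seen ONLY capitalized in
    unambiguous positions is labeled a proper name document-wide
    (including sentence-initially); only lowercase -> common word;
    mixed evidence -> left unassigned."""
    evidence = {}
    for sent in sentences:
        for pos, tok in enumerate(sent):
            if pos > 0:  # position 0 is ambiguous (mandatory capital)
                evidence.setdefault(tok.lower(), set()).add(tok[0].isupper())
    labels = {}
    for key, caps in evidence.items():
        if caps == {True}:
            labels[key] = "proper-name"
        elif caps == {False}:
            labels[key] = "common-word"
    return labels

doc = [["Riders", "win", "again"], ["The", "Riders", "celebrated"]]
labels = single_word_assignment(doc)
# "Riders" is capitalized mid-sentence, so the ambiguous sentence-initial
# occurrence can also be labeled: labels["riders"] == "proper-name"
```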

Getting Capitalized Words (cont.)
- Overall performance after applying the four methods above:
  - 9% of ambiguously capitalized words remained unclassified in the Brown corpus
  - 15% of ambiguously capitalized words remained unclassified in the Wall Street Journal corpus
- Ranking of the methods by contribution to classifying ambiguously capitalized words:
  - Single Word Assignment
  - "After Abbr." Strategy
  - Sequence Strategy

Assigning Sentence Breaks
- The task is to correctly recognize the end of an idea, thought or statement
- Rule of thumb: when a short, and in some cases vowel-less, word is followed by a period and then by a lower-cased word, we can assume the short word is an abbreviation and the period does not end a sentence
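This rule of thumb can be written as a small decision function. A hedged sketch under my own simplifications (the paper combines more evidence than this):

```python
VOWELS = set("aeiou")

def is_sentence_break(prev_word, next_word):
    """prev_word ended with a period; next_word follows it.
    A short or vowel-less prev_word followed by a lowercase word is
    treated as an abbreviation, so the period is not a sentence break."""
    looks_abbrev = len(prev_word) <= 4 or not (set(prev_word.lower()) & VOWELS)
    if next_word and next_word[0].islower() and looks_abbrev:
        return False  # e.g. "2 in. long": "in." is an abbreviation
    return True       # e.g. "...the house. The": genuine sentence break

print(is_sentence_break("in", "long"))    # False
print(is_sentence_break("house", "The"))  # True
```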

Related Research
- Two types of existing SBD systems:
  - Rule-based systems: manually built rules, encoded with lists of proper names, abbreviations, common words, etc., to recognize where sentences break within a document
  - Machine learning systems: systems employing features such as word spelling, capitalization, suffix, and word class to recognize potential sentence-breaking punctuation
    - Examples of machine learning systems were developed by:
      - Kuhn and de Mori
      - Clarkson and Robinson
      - Mani & MacMillan
      - Gale, Church and Yarowsky

Discussion
- Final assessment of the Document Centered Approach:
  - Performance comparable to, or better than, existing approaches
  - Requires no human intervention for training
  - Its simplicity yields high running speed
  - Does not rely on pre-compiled statistics
  - Easy to implement, without installing new software
  - Can be customized for use in a specific domain