Tokenizer and Sentence Splitter CSCI-GA.2591

Tokenizer and Sentence Splitter CSCI-GA.2591 Ralph Grishman NYU

The first stages
The first stages in most NLP pipelines involve segmentation of the input into tokens and sentences. The biggest challenge for English (and most European languages) is the proper classification of periods:
CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.
Errors at this stage can mess up most later stages.

Periods
Functions of periods:
- sentence boundary marker
- abbreviation marker
- initials
- in numbers
Can't ignore the problem: about 25% of periods in English (WSJ) are not sentence boundaries.

Using case information
For English, the case of the following text is quite helpful: a period marks a sentence boundary unless it is followed by lower case, a digit, or punctuation (, ; :).
This rule misclassifies 13-15% of periods in WSJ.
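
A minimal Python sketch of this case-based rule (the function name and the exact punctuation set are illustrative assumptions, not from the slides):

```python
def period_is_sentence_boundary(text: str, i: int) -> bool:
    """Case-based heuristic: the period at position i ends a sentence unless
    the next non-space character is lower case, a digit, or one of , ; :
    (roughly the rule above, with 13-15% error on WSJ)."""
    assert text[i] == "."
    rest = text[i + 1:].lstrip()
    if not rest:
        return True                     # period at the end of the input
    nxt = rest[0]
    return not (nxt.islower() or nxt.isdigit() or nxt in ",;:")
```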

Approaches to sentence boundary detection
- hand-coded rules
- supervised systems
- unsupervised systems
- combined token and sentence models

Hand-coded rules
Challenge is to identify abbreviations.
Can simply list them:
- good performance, but labor intensive
- WSJ system has 700+ items
Can capture many of them with a few patterns [Grefenstette & Tapanainen 1994] (see the sketch below):
- capital letter + period
- letter-period-letter-period-… (U.S., i.e.)
- capital letter + consonant* + period (Mr., Assn.)
Can improve further by excluding words which appear elsewhere in the corpus not followed by a period.
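
A rough Python rendering of these ideas; the regular expressions are my approximations of the three shapes listed above, not Grefenstette & Tapanainen's actual patterns:

```python
import re

# Approximations of the three abbreviation shapes on the slide
ABBREV_PATTERNS = [
    re.compile(r"^[A-Z]\.$"),                      # single capital letter + period
    re.compile(r"^(?:[A-Za-z]\.){2,}$"),           # letter-period repeated: U.S., i.e.
    re.compile(r"^[A-Z][bcdfghj-np-tvwxyz]+\.$"),  # capital + consonants + period: Mr., Assn.
]

def looks_like_abbreviation(token: str) -> bool:
    return any(p.match(token) for p in ABBREV_PATTERNS)

# The refinement in the last bullet: drop candidates whose bare form also
# occurs in the corpus without a trailing period.
def filtered_abbreviations(candidates, corpus_tokens):
    bare_forms = {t for t in corpus_tokens if not t.endswith(".")}
    return {c for c in candidates if c.rstrip(".") not in bare_forms}
```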

Supervised classifiers
We have corpora which have been split into sentences for other purposes (PTB WSJ, Brown corpus), so we might as well use them to train a sentence boundary classifier.
[Reynar and Ratnaparkhi 1997]: maxent system, 98.8% accuracy on WSJ.
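
A sketch of how such a classifier could be set up with scikit-learn, using logistic regression (a maximum-entropy model) and a hypothetical contextual feature function; the feature set is illustrative, not Reynar and Ratnaparkhi's actual one:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def period_features(tokens, i):
    """Contextual features for a candidate period at token position i
    (hypothetical feature set for illustration)."""
    prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return {
        "prev": prev_tok.lower(),
        "next": next_tok.lower(),
        "prev_is_single_capital": len(prev_tok) == 1 and prev_tok.isupper(),
        "next_capitalized": next_tok[:1].isupper(),
        "next_is_digit": next_tok[:1].isdigit(),
    }

# X: one feature dict per candidate period; y: 1 if it ends a sentence, else 0
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train); model.predict(X_test)
```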

Unsupervised classifiers
Punkt system [Kiss and Strunk]
Basic idea: abbreviations are a type of collocation, and like other types of collocations ("hot dog"), they can be identified from the frequency with which the components (the preceding word and the period) co-occur.
Classify a word as an abbreviation if P(period | word) is close to 1 (see the sketch below).
Secondary collocational criteria:
- look for collocations of the words surrounding a period, and collocations of a period and the following word
- enables detection of abbreviations at the end of sentences
Kiss and Strunk got the error rate down to 1.65% (F = 98.9) on WSJ, with F above 99 for most European languages.
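
A toy estimate of P(period | word) over a tokenized corpus, sketching only the primary criterion; the threshold and helper names are assumptions, and the real Punkt system uses a log-likelihood collocation test plus the secondary criteria above:

```python
from collections import Counter

def abbreviation_candidates(tokens, threshold=0.99, min_count=2):
    """Flag word types that are (almost) always followed by a period,
    i.e. P(period | word) close to 1 (illustrative sketch only)."""
    with_period, total = Counter(), Counter()
    for tok in tokens:
        word = tok.rstrip(".").lower()
        if not word:
            continue
        total[word] += 1
        if tok.endswith("."):
            with_period[word] += 1
    return {w for w, n in total.items()
            if n >= min_count and with_period[w] / n >= threshold}
```

NLTK's default sentence tokenizer is an implementation of the Punkt approach, so the full unsupervised system can be trained and applied from that library.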

Combining token and sentence
Because of the strong interaction of token and sentence segmentation, some groups have used a single character-level model for both tasks [Evang et al., EMNLP 2013]. Each character is tagged: S (first character of a sentence), T (first character of a token), I (inside a token), O (outside any token, e.g. whitespace):
It didn't matter if the faces were male,
SIOTIITIIOTIIIIIOTIOTIIOTIIIIOTIIIOTIIITO
female or those of children. Eighty-
TIIIIIOTIOTIIIIOTIOTIIIIIIITOSIIIIIIO
three percent of people in the 30-to-34
IIIIIOTIIIIIIOTIOTIIIIIOTIOTIIOTIIIIIIIO
year old age range gave correct responses.
TIIIOTIIOTIIOTIIIIOTIIIOTIIIIIIOTIIIIIIIIT

They use a character-level CRF with features = the character and its Unicode character category in a small window.
They supplement this with 10 binary character embedding features produced by a neural network which has been trained to generate the training character sequence.
They report very low error rates (0.27%), but it is not clear how these compare directly to other work.
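
A sketch of the per-character features (character identity and Unicode category in a small window); the window size and feature names are assumptions, and the neural character-embedding features are omitted. Feature dictionaries like these could be fed to a CRF toolkit such as sklearn-crfsuite, with the S/T/I/O tags above as labels:

```python
import unicodedata

def char_features(text, i, window=2):
    """Character identity and Unicode category in a window around position i
    (illustrative; Evang et al.'s exact feature templates may differ)."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(text):
            feats[f"char[{off}]"] = text[j]
            feats[f"cat[{off}]"] = unicodedata.category(text[j])
        else:
            feats[f"char[{off}]"] = "<PAD>"
    return feats
```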

WASTE system [Jurish and Würzner]
Intermediate approach: over-segment the text, then build an HMM model over these segments.
The HMM has 5 observable features [shape, case, length, stopword, and blanks] and 3 hidden features [token initial, sentence initial, sentence final].
They report a low error rate (0.4% on WSJ).

Issues for us
Should we use an adaptive system to handle the 6 genres of ACE?