Tokenizer and Sentence Splitter
CSCI-GA.2591, NYU
Ralph Grishman
The first stages
The first stages in most NLP pipelines segment the input into tokens and sentences. The biggest challenge for English (and most European languages) is the proper classification of periods:
  CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.
Errors at this stage can disrupt most later stages.
Periods
Functions of periods:
- sentence boundary marker
- abbreviation marker
- initials
- in numbers
Can't ignore the problem: about 25% of periods in English (WSJ) are not sentence boundaries.
Using case information
For English, the case of the following text is quite helpful: a period marks a sentence boundary unless it is followed by lower case, a digit, or punctuation (, ; :).
This rule misclassifies 13-15% of periods in WSJ.
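The case heuristic can be sketched in a few lines of Python (a simplification; the function name and exact punctuation set here are illustrative):

```python
def is_sentence_boundary(text, i):
    """Case heuristic (a simplification): the period at position i ends a
    sentence unless the next non-blank character is lower case, a digit,
    or one of , ; :"""
    assert text[i] == "."
    rest = text[i + 1:].lstrip()
    if not rest:
        return True                     # period at end of text
    nxt = rest[0]
    return not (nxt.islower() or nxt.isdigit() or nxt in ",;:")
```

On "Mr. Smith arrived." the rule wrongly calls the period after "Mr" a sentence boundary, since "Smith" is capitalized — an instance of the 13-15% error class.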
Approaches to sentence boundary detection
- hand-coded rules
- supervised systems
- unsupervised systems
- combined token and sentence models
Hand-coded rules
The challenge is to identify abbreviations.
Can simply list them:
- good performance but labor intensive
- one WSJ system has 700+ items
Can capture many of them with a few patterns [Grefenstette & Tapanainen 1994]:
- capital letter + period
- letter-period-letter-period-... (U.S., i.e.)
- capital letter + consonant* + period (Mr., Assn.)
Can improve further by excluding words which appear elsewhere in the corpus not followed by a period.
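The three patterns might be rendered as regular expressions roughly as follows (my renderings; the consonant class and anchoring are illustrative choices):

```python
import re

# Regex renderings of the three abbreviation patterns (illustrative).
PATTERNS = [
    re.compile(r"^[A-Z]\.$"),                    # capital letter + period: "A."
    re.compile(r"^([A-Za-z]\.){2,}$"),           # letter-period sequences: "U.S.", "i.e."
    re.compile(r"^[A-Z][b-df-hj-np-tv-z]+\.$"),  # capital + consonants + period: "Mr.", "Assn."
]

def looks_like_abbreviation(token):
    return any(p.match(token) for p in PATTERNS)
```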
Supervised classifiers
We have corpora which have been split into sentences for other purposes (PTB WSJ, Brown corpus), so we might as well use them to train a sentence boundary classifier.
[Reynar and Ratnaparkhi 1997]: maxent system, 98.8% accuracy on WSJ.
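A toy stand-in for such a classifier: logistic regression trained by SGD over a few contextual features of each candidate period (for binary classification logistic regression and maxent coincide; the feature names and training data here are invented for illustration, not the paper's):

```python
import math

def features(left, right):
    """Contextual features for a candidate period (names invented,
    loosely in the spirit of maxent feature templates)."""
    return {
        "left=" + left.lower(): 1.0,
        "left_short": 1.0 if len(left) <= 3 else 0.0,   # abbreviation-like
        "right_cap": 1.0 if right[:1].isupper() else 0.0,
        "right_lower": 1.0 if right[:1].islower() else 0.0,
        "bias": 1.0,
    }

def train(examples, epochs=50, lr=0.5):
    """Logistic regression by SGD; equivalent to a binary maxent model
    over the same feature functions."""
    w = {}
    for _ in range(epochs):
        for (left, right), y in examples:
            f = features(left, right)
            z = sum(w.get(k, 0.0) * v for k, v in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for k, v in f.items():
                w[k] = w.get(k, 0.0) + lr * (y - p) * v
    return w

def is_boundary(w, left, right):
    f = features(left, right)
    return sum(w.get(k, 0.0) * v for k, v in f.items()) > 0.0

# Toy training data: ((word before period, word after), 1 = boundary).
data = [(("Inc", "sold"), 0), (("Mr", "Smith"), 0), (("Co", "said"), 0),
        (("yesterday", "The"), 1), (("shares", "Analysts"), 1),
        (("rose", "Traders"), 1)]
w = train(data)
```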
Unsupervised classifiers
Punkt system [Kiss and Strunk 2006]
Basic idea: abbreviations are a type of collocation, and like other types of collocations ("hot dog") can be identified from the frequency with which the components (the preceding word and the period) co-occur. Classify as an abbreviation if P(period | word) is close to 1.
Secondary collocational criteria: look for collocations of the words surrounding the period, and collocation of the period with the following word; this enables detection of abbreviations at the end of sentences.
They were able to get the error rate down to 1.65% (F = 98.9) on WSJ, and F above 99 for most European languages.
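The primary criterion can be sketched by counting, for each word type, how often it occurs with a trailing period (a toy version; real Punkt scores candidates with a log-likelihood ratio plus word-length and internal-period factors, not a raw probability threshold):

```python
from collections import Counter

def abbreviation_candidates(text, threshold=0.99):
    """Punkt's primary criterion, toy version: a word type whose
    occurrences are (almost) always followed by a period is an
    abbreviation candidate."""
    with_period = Counter()
    total = Counter()
    for tok in text.split():
        word = tok.rstrip(".").lower()
        if not word:
            continue
        total[word] += 1
        if tok.endswith("."):
            with_period[word] += 1
    return {w for w in total
            if total[w] >= 2 and with_period[w] / total[w] >= threshold}

text = "Mr. Brown met Mr. Green. They left. Brown said so. Green agreed."
```

Here "Mr" is always followed by a period, while "Brown" and "Green" are not, so only "mr" survives the threshold.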
Combining token and sentence
Because of the strong interaction of token and sentence segmentation, some groups have used a single character-level model for both tasks [Evang et al EMNLP 2013].
Example (each character is tagged: S = sentence-initial character, T = token-initial character, I = inside a token, O = outside any token):
  It didn't matter if the faces were male,
  SIOTIITIIOTIIIIIOTIOTIIOTIIIIOTIIIOTIIITO
  female or those of children. Eighty-
  TIIIIIOTIOTIIIIOTIOTIIIIIIITOSIIIIIIO
  three percent of people in the 30-to-34
  IIIIIOTIIIIIIOTIOTIIIIIOTIOTIIOTIIIIIIIO
  year old age range gave correct responses.
  TIIIOTIIOTIIOTIIIIOTIIIOTIIIIIIOTIIIIIIIIT
They use a character-level CRF with features = the character and its Unicode character category in a small window.
They supplement this with 10 binary character-embedding features produced by a neural network trained to generate the training character sequence.
They report very low error rates (0.27%), but it is not clear how these directly compare to other work.
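The observation features for such a character-level model can be sketched with the standard library's unicodedata module (the feature naming is illustrative, not the paper's exact set):

```python
import unicodedata

def char_features(text, i, window=2):
    """Sketch of character-level CRF observations: the character itself
    and its Unicode general category, in a small window around i."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(text):
            c = text[j]
            feats[f"char[{off}]={c}"] = 1.0
            feats[f"cat[{off}]={unicodedata.category(c)}"] = 1.0
    return feats
```

For the period in "U.S.", for example, the features include the category Po (punctuation, other) at offset 0 and Lu (uppercase letter) at offset -1 — exactly the kind of evidence that distinguishes abbreviation periods from boundary periods.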
WASTE system [Jurish and Würzner]
An intermediate approach: over-segment the text, then build an HMM model over these segments.
The HMM has 5 observable features [shape, case, length, stopword, and blanks] and 3 hidden features [token-initial, sentence-initial, sentence-final].
They report a low error rate (0.4% on WSJ).
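The five observable features can be sketched as follows (the value classes and stopword list are my illustrative choices, not the system's actual inventory):

```python
def segment_features(seg, preceded_by_blank,
                     stopwords=frozenset({"the", "a", "of", "in"})):
    """Sketch of WASTE-style observables for one pre-segment:
    shape, case, length class, stopword flag, and preceding blank."""
    if seg.isdigit():
        shape = "digits"
    elif seg.isalpha():
        shape = "alpha"
    elif all(not c.isalnum() for c in seg):
        shape = "punct"
    else:
        shape = "mixed"
    case = ("upper" if seg.isupper() else
            "title" if seg.istitle() else
            "lower" if seg.islower() else "other")
    length = "1" if len(seg) == 1 else "short" if len(seg) <= 4 else "long"
    return (shape, case, length, seg.lower() in stopwords, preceded_by_blank)
```

The HMM then emits these feature tuples while the hidden states carry the token-initial / sentence-initial / sentence-final decisions.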
Issues for us
Should we use an adaptive system to handle the 6 genres of ACE?