Introduction to Computational Linguistics (March 2006)
CLINT: Tokenisation


Information Food Chain
Inference
↑ Knowledge Representation
↑ Meaning Extraction
↑ Semantic Relationships
↑ Chunking (noun phrases; verb phrases)
↑ Part of Speech Annotation
↑ Paragraph and sentence identification
↑ Tokenisation
↑ Raw Text

Start with a Corpus
A corpus is an organised body of language materials that is used as a basis for empirical studies.
Corpora are classified according to
–Representativeness
–Medium
–Language
–Information Content
–Structure

Examples of Corpora
Project Gutenberg: public domain text resources.
Brown Corpus: a tagged corpus of about 1M words put together at Brown University.
Penn Treebank: a corpus of parsed sentences based on text from the WSJ.
Canadian Hansards: bilingual (English/French) corpus of proceedings of the Canadian parliament.

Low Level Issues
Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.
Normalisation: deciding on standard character representations; adopting upper or lower case (or both).
Tokenisation

Tokenisation
Tokenisation is a process which divides input text into individual units called tokens.
Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.
An example of such information is the type of the token: word, punctuation, number.
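
The idea of tokens carrying a type can be sketched in a few lines of Python. This is only an illustration: the three patterns below are assumptions chosen to match the word / punctuation / number example, and the names `TOKEN_PATTERN` and `tokenise` are hypothetical.

```python
import re

# Hypothetical token classes; the slide only names word, punctuation, number.
TOKEN_PATTERN = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)      # integers and simple decimals
  | (?P<word>[^\W\d_]+)            # runs of letters
  | (?P<punctuation>[^\w\s])       # one non-space, non-alphanumeric character
""", re.VERBOSE)

def tokenise(text):
    """Return (token, type) pairs; whitespace is discarded."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenise("Pay 22.50 now!"))
```

Each match is labelled with the name of the alternative that produced it, so later stages can treat the token as a unit and still know what kind of unit it is.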

What counts as a word?
Words are quite tricky to define.
The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967).
It is easy to find exceptions.
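
One way to read the standard definition as a regular expression is sketched below. It assumes ASCII letters and digits, and it assumes hyphens and apostrophes may only appear between alphanumeric runs; neither restriction comes from the definition itself. `WORD` and `standard_words` are hypothetical names.

```python
import re

# Alphanumeric runs, optionally joined by internal hyphens or apostrophes.
WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def standard_words(text):
    """Split on whitespace and keep only chunks that fit the definition."""
    return [chunk for chunk in text.split() if WORD.fullmatch(chunk)]

print(standard_words("a deserved 2-1 victory on Wednesday."))
```

The final chunk, with its trailing period, is rejected, which is exactly the kind of exception that is easy to find.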

Problems Identifying Words
VfB Stuttgart scored twice in quick success -ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday. (example from Mary Dalrymple, University of London)
VfB Stuttgart, Manchester United
succession
2-1
Wednesday

Problems Identifying Words
Problems Involving Spaces
Lack of spaces between words
–Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
–Ix-Xemx
The presence of spaces may not indicate a word break
–Coca Cola;

Problems Involving Special Characters
Words often include non-alphanumeric characters which are actually part of the word.
–$22.50; BSc. IT; :-)
Words are often terminated by punctuation which is not part of the word.
Sometimes, terminating punctuation is part of the word.
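
A rough sketch of how ordered rules can keep such items together: more specific patterns are tried before the general ones. The currency and emoticon patterns are illustrative assumptions covering only the examples given, and `SPECIAL` and `tokenise_special` are hypothetical names.

```python
import re

# Ordered alternatives: earlier, more specific patterns win over later ones.
SPECIAL = re.compile(r"""
    \$\d+(?:\.\d{2})?     # currency amounts such as $22.50
  | [:;]-?[)(]            # a few simple emoticons such as :-)
  | \w+                   # ordinary alphanumeric words
  | [^\w\s]               # any other single punctuation character
""", re.VERBOSE)

def tokenise_special(text):
    """Return the tokens found by the first matching rule at each position."""
    return SPECIAL.findall(text)

print(tokenise_special("It cost $22.50 :-) today."))
```

Without the first two rules, the amount and the emoticon would each fall apart into several punctuation and number tokens.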

Periods
In general, punctuation marks attach to words, and can be removed. However there are special cases:
–Most periods mark the end of a sentence
–Others mark abbreviations, e.g. "e.g.", "Wash."
Note that when an abbreviation occurs at the end of a sentence there is only one period.
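
A minimal sketch of the usual workaround, an abbreviation list consulted before a final period is detached. The list here is a hypothetical stand-in; a working tokeniser would need a far larger one.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"e.g.", "i.e.", "Wash.", "Dr.", "etc."}

def split_final_period(chunk):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]

def tokenise_periods(text):
    """Split on whitespace, then apply the period rule to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_final_period(chunk))
    return tokens

print(tokenise_periods("He lives in Wash. now."))
```

Because an abbreviation at the end of a sentence carries only one period, this sketch leaves that period on the abbreviation and so misses the sentence boundary; resolving that case needs more context than a word list provides.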

Apostrophe
English contractions such as won't or I'll count as one word according to the classic definition.
However there are reasons for wanting two separate tokens, such as interaction with grammar rules (S → NP VP).
Penn Treebank splits such contractions into two words.

Apostrophe
This sometimes leaves odd words. For example, isn't yields is + n't.
's is ambiguous
–Abbreviation for is (he's strange)
–Possessive (John's car)
Word-final apostrophe is ambiguous
–end of quotation
–possessive of word ending in s
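
The splitting of contractions can be sketched as a single regular expression. This is an approximation in the spirit of the Penn Treebank convention, not its actual rule set; the clitic list is an assumption, only the straight apostrophe is handled, and `CLITIC` and `split_contraction` are hypothetical names.

```python
import re

# Split before n't, or before an apostrophe clitic such as 's, 'll, 're.
CLITIC = re.compile(r"^(.+?)(n't|'(?:s|ll|re|ve|d|m))$", re.IGNORECASE)

def split_contraction(word):
    """Return [stem, clitic] for a recognised contraction, else [word]."""
    m = CLITIC.match(word)
    return [m.group(1), m.group(2)] if m else [word]

for w in ["isn't", "won't", "I'll", "he's", "John's", "cat"]:
    print(w, split_contraction(w))
```

The sketch splits he's and John's identically, so the ambiguity of 's is passed on to later stages such as part of speech annotation; won't comes out as the odd word wo plus n't.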

Exercise
How is the apostrophe used in Maltese?
How should a Maltese tokeniser deal with it?

Hyphen
Issue: do sequences of words joined by hyphens count as one word or more?
Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.
Typesetting hyphens can be ambiguous.
Lexical hyphens are usually kept, e.g. hi-fi.
Hyphens – standing alone – are used as punctuation.
Texts are often inconsistent in their usage of hyphens.
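
One common heuristic for end-of-line hyphens is to rejoin the two halves only when the result is a known word. The sketch below assumes a tiny hypothetical lexicon; `LEXICON` and `rejoin_line_hyphens` are illustrative names.

```python
import re

# Hypothetical mini-lexicon used to decide whether the joined form is a word.
LEXICON = {"succession", "tokenisation"}

def rejoin_line_hyphens(text, lexicon=LEXICON):
    """Rejoin words broken across a line end when the joined form is known."""
    def join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in lexicon:
            return joined                               # typesetting hyphen
        return match.group(1) + "-" + match.group(2)    # keep as lexical hyphen
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

print(rejoin_line_hyphens("quick succes-\nsion early"))
print(rejoin_line_hyphens("a hi-\nfi system"))
```

The ambiguity remains whenever both the joined and the hyphenated forms are plausible words, so a lexicon lookup reduces the problem without removing it.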

Case
Types vs. Tokens
–How many tokens are in the following sentence: The cat chased the rat on the table
–How many types?
Tokenisation should correctly identify word types, i.e.
–Tokens of the same type should be identified
–Tokens of different type should be distinguished
Case representation of ordinary words must be standardised.
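
The token and type counts for the example sentence can be checked with a short sketch. It assumes whitespace splitting is adequate for this sentence; `count_tokens_and_types` is a hypothetical helper.

```python
def count_tokens_and_types(sentence, fold_case=True):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))

sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                   # (8, 6)
print(count_tokens_and_types(sentence, fold_case=False))  # (8, 7)
```

The difference between six and seven types comes entirely from whether The and the are treated as the same type, which is why case has to be standardised before counting.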

Case
Heuristics
–Map first character of a sentence to standard case
–Map all words in titles to lowercase
Problems
–Identification of sentence boundaries
–Identification of proper names
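
The first heuristic might look like the sketch below. It assumes sentence boundaries are already known and that a list of proper names is available, which are precisely the two problems listed; `PROPER_NAMES` and `normalise_sentence_initial` are hypothetical.

```python
# Hypothetical list of proper names, for illustration only.
PROPER_NAMES = {"Manchester", "Stuttgart", "Wednesday"}

def normalise_sentence_initial(tokens, proper_names=PROPER_NAMES):
    """Lowercase the first token of a sentence unless it is a known name."""
    if not tokens:
        return tokens
    first = tokens[0]
    if first not in proper_names:
        first = first[0].lower() + first[1:]
    return [first] + tokens[1:]

print(normalise_sentence_initial(["The", "cat", "sat"]))
print(normalise_sentence_initial(["Manchester", "won"]))
```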

Normalisation
Character representations.
Converting all letters to lower or upper case
Removing punctuation
Removing accent marks and other diacritics from letters
Expanding abbreviations
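
Two of these steps, case folding and the handling of diacritics, can be sketched with Python's standard `unicodedata` module. The sketch assumes the intent is to strip the accent marks while keeping the base letters; `strip_diacritics` and `normalise` are hypothetical names.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise(text):
    """Strip diacritics, then convert to lower case."""
    return strip_diacritics(text).lower()

print(normalise("Schütze"))
print(strip_diacritics("café"))
```

Whether this is desirable depends on the language: in some languages an accent distinguishes two different words, so the normalisation can merge types that should stay apart.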

Further Normalisation
Stemming: are eats and eating different words?
They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing.
Stemming versus full morphological analysis.
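
A deliberately naive suffix stripper shows both the idea and its limits. It knows only the two suffixes mentioned here and is not a real stemming algorithm; `SUFFIXES` and `naive_stem` are hypothetical names.

```python
SUFFIXES = ("ing", "s")  # only the two suffixes named in the text

def naive_stem(word):
    """Strip the first matching suffix, leaving at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "eat", "bus"]:
    print(w, "->", naive_stem(w))
```

The sketch maps eats and eating to eat, but it also reduces bus to bu, the kind of error that motivates full morphological analysis over blind suffix stripping.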

Summary
The tokenisation problem interacts with design decisions at different levels concerning
–Handling of non-alphanumeric characters
–Case
–Punctuation
Typically many of these problems are dealt with by hand-crafting special rules which match a particular case.
Such rules are often built out of regular expressions.

Sources
Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press 1999