Corpus-Based Work: Chapter 4, Foundations of Statistical Natural Language Processing

Presentation transcript:

1 Corpus-Based Work Chapter 4 Foundations of statistical natural language processing

2 Introduction Requirements of NLP work –Computers –Corpora –Application/Software This section covers some issues concerning the formats and problems encountered in dealing with raw data Low-level processing before actual work –Word/Sentence extraction

3 Getting Set Up Computers –Memory requirements for large corpora –Statistical NLP methods rely on counts that must be accessed quickly Corpora –“A corpus is a special collection of textual material collected according to a certain set of criteria” –Licensing –Most of the time, free sources are not linguistically marked up

4 Corpora –Representative sample What we find for the sample also holds for the general population –Balanced corpus Each subtype of text is included according to a predetermined criterion of importance Importance in statistical NLP –Representative corpus –Reported results should state the type/domain of the corpus

5 Software –Text editors TextPad, Emacs, BBEdit Regular expressions –Patterns as regular language –Programming language C/C++ widely used (efficient) Perl for text preparation and formatting Built-in database and easy handling of complicated structures make Prolog important Java, as a pure object-oriented language, gives automatic memory management

6 Looking at Text Either in raw format or marked-up –‘Markup’ is used for putting some codes into data file, giving some information about text Issues in automatic processing –Junk formatting/content (Corpus Cleaning) –Case sensitivity (All capitalize) 1.Proper Nouns? 2.Stress through capitalization Loss of contextual information

7 Tokenization –Text is divided into units called ‘tokens’ –Treatment of punctuation marks? What is a word? –Graphic word (Kucera and Francis 1967) A string of contiguous alphanumeric characters with white space on either side. This is not a practical definition, even for text in the Latin alphabet Especially in news corpora some odd entries can be present, e.g. Micro$oft, C|net Apart from these oddities there are some other issues
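
The graphic-word definition above can be tried out directly. The following is a minimal sketch, assuming ASCII letters and digits count as "alphanumeric"; the function names and the sample sentence are illustrative and not from the slides. It shows why the definition is impractical: any word touching punctuation or containing a symbol is lost.

```python
import re

# A graphic word: a run of alphanumeric characters with white space
# (or the start/end of the text) on either side.
GRAPHIC_WORD = re.compile(r"(?<!\S)[A-Za-z0-9]+(?!\S)")


def graphic_words(text):
    """Return only the tokens that satisfy the strict graphic-word definition."""
    return GRAPHIC_WORD.findall(text)


def whitespace_tokens(text):
    """Return every white-space-delimited chunk, punctuation included."""
    return text.split()


sample = "Micro$oft and C|net reported gains, analysts said."
print(graphic_words(sample))      # words next to punctuation are dropped
print(whitespace_tokens(sample))  # punctuation stays glued to the words
```

Neither output is a usable token list, which is why practical tokenizers treat punctuation explicitly rather than relying on white space alone.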

8 Periods –Words are not always bounded by white space (commas, semicolons and periods) –Periods occur at the end of sentences and also at the end of abbreviations –In abbreviations they should stay attached to the word (Wash. vs. wash) –When an abbreviation occurs at the end of a sentence there is only one period present, performing both functions Within morphology, this phenomenon is referred to as ‘haplology’

9 Single Apostrophes –Difficulties in dealing with constructions such as I’ll or isn’t –The count of graphic words is 1 according to the basic definition, but they should be counted as 2 words 1. S → NP VP 2. If we split, then some funny words may occur in the collection –End-of-quotation marks –Possessive forms of words ending with ‘s’ or ‘z’ Charles’ Law, Muaz’ book
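
One common way to count I’ll or isn’t as two words is to split off the clitic, in the style of the Penn Treebank. The sketch below is an assumption about how that could be done, not the method from the slides; the list of endings is illustrative and incomplete, and `split_clitic` is a hypothetical helper name.

```python
import re

# Clitic endings to split off; both straight and curly apostrophes are
# accepted. The list is an illustrative assumption, not exhaustive.
CLITIC = re.compile(r"^(.+?)(n['’]t|['’](?:ll|re|ve|s|d|m))$", re.IGNORECASE)


def split_clitic(token):
    """Split a contraction into [base, clitic]; return [token] otherwise."""
    match = CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for word in ["isn't", "I’ll", "can't", "Charles'"]:
    print(word, split_clitic(word))
```

The result for can't illustrates the "funny words" problem: the split produces ca, which is not a word on its own. A bare trailing apostrophe, as in Charles', is left untouched because it could equally be a closing quotation mark.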

10 Hyphenation –Does a sequence of letters with a hyphen in between count as one word or two? –Line-ending hyphens Remove the hyphen at the end of the line and join both parts together What if the hyphen at the end of the line is some other type of hyphen (haplology)? (text-based) –Electronic text mostly has no line-breaking hyphens, but there are some other issues…
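
The "remove the hyphen and join" rule for line-ending hyphens can be sketched as below. This is a naive illustration with a hypothetical function name; it deliberately shows the failure case where the hyphen at the line end was a real one.

```python
import re

# A hyphen directly before a line break, between two word characters.
LINE_END_HYPHEN = re.compile(r"(\w)-\n\s*(\w)")


def join_line_end_hyphens(text):
    """Delete line-ending hyphens and join the two word halves."""
    return LINE_END_HYPHEN.sub(r"\1\2", text)


print(join_line_end_hyphens("the tokeni-\nzation step"))  # correct join
print(join_line_end_hyphens("a text-\nbased medium"))     # hyphen wrongly lost
```

In the second call the hyphen belonged to the word, so the rule produces textbased. Telling the two cases apart needs extra evidence, for example a lexicon lookup of the joined form.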

11 Some things with hyphens are clearly treated as one word –A-1-plus and co-operate Other cases are arguable –Non-lawyer, pro-Arabs and so-called –The hyphens here are called lexical hyphens –Inserted before or after small word formatives, in some cases to split a vowel sequence A third class of hyphens is inserted to indicate correct grouping –A text-based medium –A final take-it-or-leave-it offer

12 Inconsistencies in hyphenation –Cooperate vs. Co-operate –So we can have multiple forms treated as either one word or two Lexemes –Single dictionary entry with a single meaning Homographs –Two lexemes with overlapping forms, e.g. saw

13 Word segmentation in other languages Opposite issue –White space but no word boundary –“the New York-New Haven railroad” –“I couldn’t work the answer out” In spite of, in order to, because of Variant coding of information of a certain semantic type –Phone numbers Problem in information extraction
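
To illustrate variant coding, the sketch below reduces differently formatted phone numbers to a single canonical form by keeping only the digits. The numbers are made-up examples and `digits_only` is a hypothetical helper; real extraction would also need to find the numbers and handle country codes.

```python
import re


def digits_only(phone):
    """Drop spaces, hyphens, brackets and any other non-digit characters."""
    return re.sub(r"\D", "", phone)


# Made-up variants of the same number, written three different ways.
variants = ["0171 378 0647", "0171-378-0647", "(0171) 378 0647"]
print({digits_only(v) for v in variants})
```

All three spellings collapse to one string, whereas a white-space tokenizer would split each into several unrelated tokens.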

14 Speech Corpora Issues –More contractions –Various phonetic representations –Pronunciation variants –Sentence fragments –Filler words Morphology –Keep various forms separately or collapse them? e.g. sit, sits, sat –Grouping them together and working with lexemes (Initially looks easier)

15 Stemming –Strips off affixes Lemmatization –To extract the lemma or lexeme from an inflected form Empirical research within IR shows that stemming does not help performance 1. Information loss (operating → operate) 2. Closely related tokens are grouped in chunks, which are more useful 3. Not good for morphologically rich languages
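
The difference between stripping affixes and finding a lemma shows up even in a toy stemmer. The sketch below is not the Porter stemmer or any published algorithm; the suffix list, the minimum stem length, and the name `strip_suffix` are assumptions chosen for illustration.

```python
# Suffixes tried in order; a deliberately tiny, illustrative list.
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ["operating", "sits", "sat", "sing"]:
    print(word, "->", strip_suffix(word))
```

The stem operat is not a dictionary word, and the irregular form sat is never grouped with sit. A lemmatizer would need a lexicon or morphological analysis to map both to their lemmas.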

16 Sentences –What is a sentence? –In English, something ending with ‘.’, ‘?’ or ‘!’ –Abbreviation issues Other issues –“You reminded me,” she remarked, “of your mother.” –Nested things are classified as ‘clauses’ –A quotation mark can follow the punctuation; the ‘.’ itself is not the sentence boundary in this case

17 Sentence boundary (SB) detection –Place a tentative SB after all occurrences of . ? ! –Move the boundary after a following quotation mark (if any) –Disqualify a period boundary if it is preceded by an abbreviation that does not normally end a sentence and is usually followed by a capitalized name (Prof., Dr.), or if it is preceded by an abbreviation and not followed by a capitalized word (etc., Jr.) –Disqualify a boundary with ? or ! if it is followed by a lower-case letter –Regard all other boundaries as correct SBs
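
The heuristic above can be sketched over white-space tokens as follows. The two abbreviation sets, the quote characters, and the function name are illustrative assumptions; a real system would need much larger abbreviation lists.

```python
# Abbreviations that rarely end a sentence and usually precede a name.
TITLE_ABBREVS = {"Prof.", "Dr.", "Mr.", "Mrs.", "vs."}
# Abbreviations that can occur sentence-medially or sentence-finally.
OTHER_ABBREVS = {"etc.", "Jr.", "Inc."}

CLOSERS = "\"'”’)"   # characters the boundary is moved past
OPENERS = "\"'“‘("   # ignored when checking the next word's case


def split_sentences(text):
    """Split text into sentences using the rule-based SB heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        core = token.rstrip(CLOSERS)  # boundary goes after closing quotes
        if not core or core[-1] not in ".?!":
            continue                  # no tentative boundary here
        nxt = tokens[i + 1].lstrip(OPENERS) if i + 1 < len(tokens) else ""
        if core[-1] == ".":
            if core in TITLE_ABBREVS:
                continue              # e.g. "Prof. Smith"
            if core in OTHER_ABBREVS and nxt and not nxt[0].isupper():
                continue              # e.g. "Jr. yesterday"
        elif nxt and nxt[0].islower():
            continue                  # "?" or "!" followed by lower case
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ('Prof. Smith met Dr. Jones, Jr. yesterday. "Is it done?" she asked. '
        'He bought pens, paper, etc. Then he left.')
for sentence in split_sentences(text):
    print(sentence)
```

The sample exercises each rule once: the titles and the medial Jr. do not end a sentence, the question mark inside the quotation is overridden by the lower-case she, and etc. followed by a capitalized word is accepted as a boundary.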

18 Riley (1989) used classification trees for SB detection –Features of the trees included the case and length of the words preceding and following a period, and the probabilities of words occurring before and after a sentence boundary –It required a large quantity of labeled data Palmer and Hearst used the POS of such words and implemented the approach with neural networks (98-99% accurate) In other languages?

19 Marked-up Data –Some sort of code is used to provide information (mostly SGML, XML) –It can be done automatically, manually or as a mixture of both (semi-automatic) –Some texts mark up just sentence and paragraph boundaries –Others mark up more than this basic information, e.g. Penn Treebank (full syntactic structure) –A common mark-up is POS tagging

20 Grammatical Tagging –Generally done with conventional POS tags like noun, verb, etc. –Also some information about the nature of the words, like plurality of nouns or superlative forms of adjectives Tag set –The most influential tag sets have been those used to tag the American Brown Corpus and the Lancaster-Oslo-Bergen corpus

21 Size of tag sets –Brown (Total tags) –Penn 45 –CLAWS1 132 The Penn tag set is widely used in computational work Tags differ between tag sets –Larger tag sets obviously make finer-grained distinctions –The level of detail depends on the domain of the corpora

22 The design of tag set –Grammatical class of word –Features to tell the behavior of the word Part of Speech –Semantic grounds –Syntactic distributional grounds –Morphological grounds Splitting tags in further categories gives improved information but makes classification harder There is not a simple relationship between tag set size and performance of taggers