CSCI 5832 Natural Language Processing


CSCI 5832 Natural Language Processing, Lecture 5. Jim Martin. 7/24/2019, CSCI 5832 Spring 2007

Today (1/31)
- Review: FSTs / composing cascades
- Administration
- FST examples
- Porter stemmer
- Soundex

FST Review
FSTs allow us to take an input and deliver a structure based on it; or take a structure and create a surface form; or take a structure and create another structure.

FST Review
What does the transition table for an FSA consist of? How does that change with an FST?

Review
In many applications it's convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next. We'll see this scheme again for deeper semantic processing.

English Spelling Changes
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes down to the surface tape.

Overall Plan

Final Scheme: Part 1

Final Scheme: Part 2

Composition
Create a set of new states that correspond to each pair of states from the original machines. (New states are named (x, y), where x is a state from M1 and y is a state from M2.) Then create a new FST transition table for the combined machine according to the following intuition…

Composition
There should be a transition between two related states in the new machine when the output symbol on a transition leaving a state of M1 is the same as the input symbol on a transition leaving a state of M2. More formally…

Composition
δ3((x_a, y_a), i:o) = (x_b, y_b) iff there exists a symbol c such that δ1(x_a, i:c) = x_b and δ2(y_a, c:o) = y_b.
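The pairwise construction above can be sketched directly in Python. This is an illustrative sketch, not code from the lecture: a transducer is represented as a dictionary mapping (state, input-symbol, output-symbol) triples to next states, and the two tiny machines M1 and M2 are made up for the demo.

```python
def compose(d1, d2):
    """Product construction: the composed machine has a transition
    ((x, y), i:o) -> (x2, y2) exactly when M1 has (x, i:c) -> x2 and
    M2 has (y, c:o) -> y2 for some shared intermediate symbol c."""
    d3 = {}
    for (x, i, c), x2 in d1.items():
        for (y, c2, o), y2 in d2.items():
            if c == c2:
                d3[((x, y), i, o)] = (x2, y2)
    return d3

# M1 rewrites a -> b; M2 rewrites b -> c.
m1 = {(0, "a", "b"): 1}
m2 = {(0, "b", "c"): 1}
m3 = compose(m1, m2)
# The composed machine rewrites a -> c in a single step,
# with paired states ((0, 0), (1, 1)).
```

Note that the intermediate symbol `b` disappears entirely: composition gives one machine that behaves like running the cascade, which is exactly why cascades of spelling-rule transducers can be collapsed.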

Extra HW Post-Mortem
What does the NY Times think a word is?

Sentence Segmentation
- Titles: Mr., Mrs., Ms. Just bigger lists? Or something better?
- Initials in names: H. W. Regular expressions?
- Numbers: $2.8 million
- Acronyms: P.O.W., I.R.A., … Lists? Regexes?
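A list-based rule for the "titles" case can be sketched in a few lines. The abbreviation set and function name here are mine, and the list is deliberately tiny; a real system needs a much larger list (or something better, as the slide says).

```python
# Hypothetical minimal splitter: end a sentence at a token that ends
# in sentence-final punctuation, unless that token is a known
# abbreviation (the list below is illustrative, not complete).
ABBREVS = {"Mr.", "Mrs.", "Ms.", "Dr."}

def split_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")) and tok not in ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final punctuation
        sentences.append(" ".join(current))
    return sentences

split_sentences("Mr. Smith arrived. He left.")
# -> ["Mr. Smith arrived.", "He left."]
```

Tokens like "$2.8" pass through untouched because they don't end in a period, but initials like "H." would still cause a false split, which is one reason such rule lists get brittle.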

Sentence Segmentation
This could get tedious: lots of rules, lots of exceptions, brittle. Another approach is to use machine learning: take a segmented corpus of text and learn to segment it. It's not clear whether that is better in this case, but it is the dominant approach these days.

Next HW
Part 1: Alter your code to take newswire text and segment it into paragraphs, sentences, and words, as in:
<p> <s> The quick brown fox. </s> <s> This is sentence two. </s> </p> …
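One minimal way to emit that markup, assuming sentence segmentation has already been done (the helper name is mine, not part of the assignment):

```python
def tag_paragraph(sentences):
    """Wrap a list of already-segmented sentences in the <s>...</s>
    and <p>...</p> markup the assignment asks for."""
    inner = " ".join(f"<s> {s} </s>" for s in sentences)
    return f"<p> {inner} </p>"

tag_paragraph(["The quick brown fox.", "This is sentence two."])
# -> "<p> <s> The quick brown fox. </s> <s> This is sentence two. </s> </p>"
```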

Projects
Read ahead in the book to get a feel for various areas of NLP. The goal is to produce a paper that could be sent to an NLP conference. To get a feel for what that means, you need to become familiar with such papers: acl.ldc.upenn.edu

ACL Repository
Lots of stuff: all relevant, not all terribly readable. Focus on recent ACL, NAACL, and HLT conference proceedings. Just start browsing titles that look interesting.

Readings/Quiz
The first quiz is 2/8 (a week from Thursday). It will cover Chapters 2, 3, 4, and part of 5. Lectures are based on the assumption that you've read the text before class. Quizzes are based on the contents of the lectures and the chapters.

Light-Weight Morphology
Sometimes you just need to know the stem of a word and you don't care about the structure. In fact, you may not even care whether you get the right stem, as long as you get a consistent string. This is stemming; it most often shows up in IR applications.

Stemming for Information Retrieval
Run a stemmer on the documents to be indexed, run a stemmer on users' queries, and match. This is basically a form of hashing in which you want collisions.
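That "hashing where you want collisions" idea can be made concrete with a toy inverted index. Everything here is illustrative: `stem` is a crude stand-in for a real stemmer (e.g. Porter), and the documents are invented.

```python
# Crude suffix stripper standing in for a real stemmer. Note that
# "running" comes out as "runn" -- not a real word, but that's fine
# for IR as long as documents and queries collide consistently.
def stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Index documents under their stems...
docs = {1: "running runs", 2: "walked walks"}
index = {}
for doc_id, text in docs.items():
    for w in text.split():
        index.setdefault(stem(w), set()).add(doc_id)

# ...then stem the query and look up the same table.
def search(query):
    return index.get(stem(query), set())

search("walking")  # -> {2}: collides with "walked"/"walks" via stem "walk"
```

The word "walking" never appears in any document, yet the query still finds document 2, because query and document forms hash to the same stem.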

Porter
- No lexicon needed
- Basically a series of staged sets of rewrite rules that strip suffixes
- Handles both inflectional and derivational suffixes
- Doesn't guarantee that the resulting stem is really a stem (see the first bullet)
- The lack of a guarantee doesn't matter for IR

Porter Example
computerization, via the rule -ization → -ize, becomes computerize.
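A single Porter-style stage can be sketched as an ordered list of suffix rewrites. This is one illustrative stage only, not the full Porter algorithm, and the rule list is a small made-up sample in its style.

```python
# One stage of Porter-style suffix rewriting: try each (suffix,
# replacement) rule in order and apply the first one that matches.
RULES = [
    ("ization", "ize"),   # computerization -> computerize
    ("ational", "ate"),   # relational -> relate
    ("fulness", "ful"),   # hopefulness -> hopeful
]

def apply_stage(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule fired; pass the word through unchanged

apply_stage("computerization")  # -> "computerize"
```

The real algorithm runs several such stages in sequence, with conditions on the length and shape of what remains after stripping; composing the stages is what makes the transducer view below natural.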

Porter
The original exposition of the Porter stemmer did not describe it as a transducer, but each stage is a separate transducer, and the stages can be composed to get one big transducer.

Soundex
Suppose you work as a telephone information operator for CU. Someone calls looking for our senior theory professor, Ehrenfeucht. What do you type as your query string?

Soundex
- Keep the first letter
- Drop non-initial occurrences of vowels, h, w, and y
- Replace the remaining letters with numbers according to their group (e.g., b, f, p, and v → 1)
- Replace strings of identical numbers with a single number (333 → 3)
- Drop any numbers beyond the third
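The five steps above translate almost line for line into code. This sketch implements the slide's simplified recipe (it does not pad short codes with zeros, a detail some Soundex variants add), using the standard letter groups.

```python
# Standard Soundex letter groups; vowels, h, w, and y are absent
# because step 2 drops them.
GROUPS = {
    "b": "1", "f": "1", "p": "1", "v": "1",
    "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
    "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}

def soundex(name):
    name = name.lower()
    first = name[0].upper()                      # step 1: keep first letter
    rest = [c for c in name[1:] if c in GROUPS]  # step 2: drop vowels, h, w, y
    digits = [GROUPS[c] for c in rest]           # step 3: map letters to digits
    collapsed = []                               # step 4: collapse runs (333 -> 3)
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    return first + "".join(collapsed[:3])        # step 5: at most three digits

soundex("Robert")  # -> "R163"
soundex("Rupert")  # -> "R163": similar-sounding names collide by design
```

Robert and Rupert hash to the same code, which is exactly the collision behavior the next slide describes.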

Soundex
The effect is to map (hash) all similar-sounding transcriptions to the same code. Structure your directory so that it can be accessed by code as well as by correct spelling. Used for census records, phone directories, author searches in libraries, etc.

Next Time
On to Chapter 4 for Thursday.