89-6801 עיבוד שפות טבעיות - שיעור רביעי Part of Speech Tagging עידו דגן המחלקה למדעי המחשב אוניברסיטת בר אילן.

Slides:

Advertisements

Similar presentations

Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.

Advertisements

Language and Grammar Unit

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING PoS-Tagging theory and terminology COMP3310 Natural Language Processing.

Three Basic Problems 1.Compute the probability of a text (observation) language modeling – evaluate alternative texts and models P m (W 1,N ) 2.Compute.

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)

Outline Why part of speech tagging? Word classes

Chapter 8. Word Classes and Part-of-Speech Tagging From: Chapter 8 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech.

BİL711 Natural Language Processing

Part-of-speech tagging. Parts of Speech Perhaps starting with Aristotle in the West (384–322 BCE) the idea of having parts of speech lexical categories,

Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.

Parts of speech & Lexical Categories

February 2007CSA3050: Tagging II1 CSA2050: Natural Language Processing Tagging 2 Rule-Based Tagging Stochastic Tagging Hidden Markov Models (HMMs) N-Grams.

Natural Language Processing Lecture 8—9/24/2013 Jim Martin.

LING 388 Language and Computers Lecture 22 11/25/03 Sandiway FONG.

1 Words and the Lexicon September 10th 2009 Lecture #3.

עיבוד שפות טבעיות – שיעור שמיני מתייגים יעל נצר מדעי המחשב אוניברסיטת בן גוריון.

 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Word Classes and English Grammar.

Part-of-speech Tagging cs224n Final project Spring, 2008 Tim Lai.

Ch 10 Part-of-Speech Tagging Edited from: L. Venkata Subramaniam February 28, 2002.

POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.

1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006.

NLP and Speech 2004 English Grammar

Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור שישי Viterbi Tagging Syntax עידו.

Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור חמישי POS Tagging Algorithms עידו.

Part of speech (POS) tagging

Part-of-Speech Tagging & Sequence Labeling

עיבוד שפות טבעיות – שיעור שישי Part of Speech taggers מדעי המחשב יעל נצר אוניברסיטת בן גוריון.

Parts of Speech (Lexical Categories). Parts of Speech Nouns, Verbs, Adjectives, Prepositions, Adverbs (etc.) The building blocks of sentences The [ N.

Part-of-Speech Tagging

8. Word Classes and Part-of-Speech Tagging 2007 년 5 월 26 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.287 ~ 303.

Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.

Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.

Parts of Speech Sudeshna Sarkar 7 Aug 2008.

Some Advances in Transformation-Based Part of Speech Tagging

Distributional Part-of-Speech Tagging Hinrich Schütze CSLI, Ventura Hall Stanford, CA , USA NLP Applications.

Albert Gatt Corpora and Statistical Methods Lecture 10.

Fall 2005 Lecture Notes #8 EECS 595 / LING 541 / SI 661 Natural Language Processing.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

CS : Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Constituent Parsing and Algorithms (with.

Parts of Speech Notes. Part of Speech: Nouns  A naming word  Names a person, place, thing, idea, living creature, quality, or idea Examples: cowboy,

10/30/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 7 Giuseppe Carenini.

Parts of Speech (Lexical Categories). Parts of Speech n Nouns, Verbs, Adjectives, Prepositions, Adverbs (etc.) n The building blocks of sentences n The.

Transformation-Based Learning Advanced Statistical Methods in NLP Ling 572 March 1, 2012.

13-1 Chapter 13 Part-of-Speech Tagging POS Tagging + HMMs Part of Speech Tagging –What and Why? What Information is Available? Visible Markov Models.

Word classes and part of speech tagging Chapter 5.

Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור שבע Partial Parsing אורן גליקמן.

Speech and Language Processing Ch8. WORD CLASSES AND PART-OF- SPEECH TAGGING.

Tokenization & POS-Tagging

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.

Word classes and part of speech tagging 09/28/2004 Reading: Chap 8, Jurafsky & Martin Instructor: Rada Mihalcea Note: Some of the material in this slide.

Natural Language Processing

CSA3202 Human Language Technology HMMs for POS Tagging.

Parts of Speech Major source: Wikipedia. Adjectives An adjective is a word that modifies a noun or a pronoun, usually by describing it or making its meaning.

Automatic Grammar Induction and Parsing Free Text - Eric Brill Thur. POSTECH Dept. of Computer Science 심 준 혁.

Part-of-speech tagging

Parts of Speech חלקי הדיבור

Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.

Basic Syntactic Structures of English CSCI-GA.2590 – Lecture 2B Ralph Grishman NYU.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

Part-of-Speech Tagging & Sequence Labeling Hongning Wang

Word classes and part of speech tagging Chapter 5.

Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.

1 Natural Language Processing Vasile Rus

Lecture 9: Part of Speech

LANGUAGE How can any language be divided? What are language parts?

עיבוד שפות טבעיות מבוא פרופ' עידו דגן המחלקה למדעי המחשב

CSCI 5832 Natural Language Processing

Lecture 6: Part of Speech Tagging (II): October 14, 2004 Neal Snider

Natural Language Processing

Presentation transcript:

עיבוד שפות טבעיות - שיעור רביעי Part of Speech Tagging עידו דגן המחלקה למדעי המחשב אוניברסיטת בר אילן

חלקי - הדיבור מקובל למנות 9 ~ קבוצות מילים המכונות " חלקי - דיבור ": שם עצם (noun), שם תואר (adjective), כינוי (pronoun), שם מספר (numeral), פועל (verb), תואר הפועל (adverb), מלת יחס (preposition), מלת חיבור (conjunction), מלת קריאה (interjection). אך זו רק חלוקה אחת

למה זה טוב ? בסיס לניתוח parsing יצירת קול (TTS) – אופן הביטוי של המילה : – רכבת / רכבת –CONtent/conTENT, OBJect/objECT, DIScount/disCOUNT Chunking/partial parsing/identifing terms N-gram models for speech IR,MT

איך מגדירים חלקי דיבר ? באופן מסורתי, ההגדרה של חלקי הדיבר מבוססת על תכונות מורפולוגיות של המילה או על המילים שמופיעות לידן בסמיכות distributional properties. באופן עקרוני, יש למילים מאותו חלק דיבר דמיון סמנטי, כלומר, הן מתארות איברים מאותן קבוצות למשל – שמות עצם –nouns אנשים, מקומות, דברים – thought, table, sister – שמות תואר – adjectives תכונות, כמויות big, lazy – לואי פעולה – adverbs – מתארים אופן, מקום, זמן, איכות quickly – פעלים – אירועים, התרחשויות או מצבי קיום – eat, is, write – ויש גם מילות יחס, מילות איחוי ועוד...

דוגמא The yinkish dripner blorked quastofically into the nindin with the pidibs. yinkish -adj nindin -noun dripner -noun pidibs -noun blorked -verb quastofically -adverb We determine the P.O.S of a word by the affixes that are attached to it and by the syntactic context (where in the sentence) it appears in.

Open class vs. Closed class types Closed class – הקבוצה שחבריה קבועים בדרך כלל, כמו מילות יחס. Open class – למשל, שמות עצם ופעלים : מילים חדשות מתווספות לקבוצה to fax, לפקסס בקורפוסים שונים ייצפו מילים שונות מהקבוצה הפתוחה, אבל אם הקורפוס גדול מספיק, סביר להניח שימצאו בהם אותם מילים השייכות לקבוצה הסגורה. מילים מהקבוצה הסגורה הן בדרך כלל function words – מילים השייכות לדקדוק כמו of, את – מילים קצרות בדרך כלל המופיעות בתדירות גבוהה, ולהן תפקיד תחבירי חשוב.

שמות עצם Nouns –take -s, 's, -ness, -ment, -er, affixes –Occur with determiners (a,the,this,some … ) –can be a subject of a sentence. Semantically: can be concrete – chair, train, or abstract – relationship. או שמות פעולה, למשל : אכילה, לאכול, eating

Types of Nouns Proper Nouns: –David, Israel, Microsoft –Aren’t preceded by articles –Capitalized (In English) Common Nouns: –Count Nouns: allow grammatical enumeration (book, books) can be counted (one apple, 50 thoughts) –Mass Nouns: snow, salt, communism, …

Verbs מילים המתייחסות לפעולות או תהליכים –Main verbs – draw, provide, differ –Auxiliaries (referred to as closed-class) מערכת הטיה מורפולוגית –eat, eats, eating, eaten

Adjectives מבחינה סמנטית, קבוצה הכוללת ביטויים המתארים תכונות או איכויות, משהו כמו פרדיקט חד - מקומי. שפות רבות כוללות : – צבעים (yellow, green) – גילאים (young, old) – וערכים. (good, bad) יש שפות בלי שמות תואר.

Adverbs קבוצה מעורבת למדי... Unfortunately, John walked home extremely slowly yesterday Directional: sideways, downhill Locative: home, here Degree: extremely, somewhat Manner: slowly, delicately Temporal: yesterday, Monday

Closed class Prepositions – on, under, over, near, by, at, from, to, with Determiners – a, an, the Pronouns – it, she I Conjunctions – and, but, or, as, if, when Auxiliary verbs – can, may, should, are Particles – up, down, on, off, in, at, by Numerals – one, two, second, third

Prepositions and particles. Prepositions on top, by then, with him... מילות יחס המופיעות לפני שם עצם מצינות יחסי זמן / מקום, אבל לא רק. Particles go on, look up, turn down מופיעים אחרי פועל, ובפעלים טרנזיטיביים, גם אחרי המושא –The horse went off its truck/throw off sleep –*The horse went its track off/throw sleep off

Articles a, an, the מופיעים בתחילה צירוף שמני noun phrase גם : this chapter, that page שכיחים מאוד בטקסטים

Conjunctions מאחים שני phrases, צירופים, משפטים, וכו. Or, and, but מאחים צירופים מאותו סטטוס Subordinating conjunctions משמשים לאיחוי צירופים מקוננים I thought that you might like some milk. –I thought – main clause –That you might… - subordinating clause

ויש עוד...

Tagsets Tagset The set of possible tags for parts of speech. (size is changing in applications, languages...) A tagset should include the information that is needed for the next steps in the process, and that people can annotate well Brown corpus – 87 tags Penn Treebank – 45 Large: 146-tag C7 tagset of used to tag the British National Corpus BNC.

Part-Of-Speech Tagging תיוג הוא התהליך של השמת חלקי דיבר או סימון לקסיקלי אחר לכל מילה בקורפוס. (tokenization) תיוג מתבצע בדרך כלל גם על סימני פיסוק הקלט הוא רצף מילים ו -tagset מהסוג שראינו. הפלט הוא התיוג הטוב ביותר עבור כל אחת מן המילים. והבעייה המרכזית, היא – ambiguity: –Time flies like an arrow/ fruit flies like –I can can my can/ אישה נעלה נעלה נעלה נעלה את הדלת...

The Distribution of Tags Tags follow all the usual frequency-based distributional behavior. Most word types have only one part of speech. Of the rest, most have two. Things go pretty much as we'd expect from there on. Of course, as usual, the most frequently occurring word types tend to have multiple tags. (As we'll see later in the semester, they also tend to have more meanings). Therefore while its easy to determine the correct tag for most word types, it isn't necessarily so easy to tag most texts.

Word Types in the Brown Corpus 35340Unambiguous (1 tag) 4100Ambiguous (2-7 tags) tags 2643 tags 614 tags 125 tags 26 tags 1 (“still”)7 tags

State of the Art A dumb tagger that simply assigns the most common tag to each word achieves ~90% Best approaches give ~96/97% This still means that there will be on average one tagging error per sentence Life is much more difficult if we do not have a lexicon and/or training corpus or if we use a tagger across domains and genres.

מתייגים - מבוססי חוקים – קידוד ידני –Transformation-based tagging (learning) Stochastic Tagging - הסתברותיים –HMM –Bayesian networks –Maximum entropy

Supervised Learning Scheme Classification Model “Labeled” Examples New Examples Classifications Training Algorithm Classification Algorithm

איך זה עובד ? P(NN|race) = 0.98 P(VB|race)=0.02 בצעד הראשון, יתוייג המשפט לפי התג הסביר יותר : is/VBZ expected/VBN to/TO race/NN tomorrow the/DT race/NN for/IN outer/JJ space/NN אחרי הבחירה הראשונית של התג, המתייג מבצעה את הטרנפורמציות שלמד מהקורפוס – לדוגמא : Change NN to VN when the previous tag is TO החוק הזה יחליף את race/NN to/TO ב - to/TO race/VB

Transformational Based Learning (TBL) for Tagging Introduced by Brill (1995) Can exploit a wider range of lexical and syntactic regularities via transformation rules – triggering environment and rewrite rule Tagger: –Construct initial tag sequence for input – most frequent tag for each word –Iteratively refine tag sequence by applying “transformation rules” in rank order Learner: –Construct initial tag sequence for the training corpus –Loop until done: Try all possible rules and compare to known tags, apply the best rule r* to the sequence and add it to the rule ranking

Some examples 1. Change NN to VB if previous is TO –to/TO conflict/NN with  VB 2. Change VBP to VB if MD in previous three –might/MD vanish/VBP  VB 3. Change NN to VB if MD in previous two –might/MD reply/NN  VB 4. Change VB to NN if DT in previous two –the/DT reply/VB  NN

Transformation Templates Specify which transformations are possible For example: change tag A to tag B when: 1.The preceding (following) tag is Z 2.The tag two before (after) is Z 3.One of the two previous (following) tags is Z 4.One of the three previous (following) tags is Z 5.The preceding tag is Z and the following is W 6.The preceding (following) tag is Z and the tag two before (after) is W

Lexicalization New templates to include dependency on surrounding words (not just tags): Change tag A to tag B when: 1.The preceding (following) word is w 2.The word two before (after) is w 3.One of the two preceding (following) words is w 4.The current word is w 5.The current word is w and the preceding (following) word is v 6.The current word is w and the preceding (following) tag is X (Notice: word-tag combination) 7.etc…

Initializing Unseen Words How to choose most likely tag for unseen words? Transformation based approach: –Start with NP for capitalized words, NN for others –Learn “morphological” transformations from: Change tag from X to Y if: 1.Deleting prefix (suffix) x results in a known word 2.The first (last) characters of the word are x 3.Adding x as a prefix (suffix) results in a known word 4.Word W ever appears immediately before (after) the word 5.Character Z appears in the word

Unannotated Input Text Annotated Text Ground Truth for Input Text Rules Learning Algorithm TBL Learning Scheme Setting Initial State

Greedy Learning Algorithm Initial tagging of training corpus – most frequent tag per word At each iteration: –Identify rules that fix errors and compute “error reduction” for each transformation rule: #errors fixed - #errors introduced –Find best rule; If error reduction greater than a threshold (to avoid overfitting): Apply best rule to training corpus Append best rule to ordered list of transformations

Next Week … HMMs לא לשכוח שעורי בית...