Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.


Decision Trees vs. Linear Models
- Decision trees are an intuitive way to learn classifiers from data
- They fit the training data well
- With heavy pruning, you can control overfitting
- NLP practitioners often use linear models instead
- Please read Sarawagi Chapter 3 (Entity Extraction: Statistical Methods) for next time
- The models discussed in Chapter 3 are linear models, as I will discuss here

Decision Trees for NER
So far we have seen:
- How to learn rules for NER
- A basic idea of how to formulate NER as a classification problem
- Decision trees, including the basic idea of overfitting the training data

Rule Sets as Decision Trees
- Decision trees are quite powerful
- It is easy to see that complex rules can be encoded as decision trees
- For instance, let's go back to border detection in CMU seminars...

A Path in the Decision Tree
- The tree will check if the token to the left of the possible start position has "at" as a lemma
- Then check if the token after the possible start position is a Digit
- Then check if the second token after the start position is a timeid ("am", "pm", etc.)
- If you follow this path at a particular location in the text, then the decision should be to insert a <stime> tag
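A minimal sketch (in Python, not the lecture's actual code) of how this particular path could be written as nested if-checks; the token/lemma lookups and the timeid list are assumptions for illustration.

```python
# Sketch of the decision path above as nested if-checks. `pos` is the gap where
# the <stime> tag might be inserted: tokens[pos - 1] is the token to the left,
# tokens[pos] the token to the right. The timeid set is an assumption.

TIMEIDS = {"am", "pm"}

def should_insert_stime(lemmas, tokens, pos):
    if lemmas[pos - 1] == "at":                     # left token has lemma "at"
        if tokens[pos].isdigit():                   # next token is a Digit
            if tokens[pos + 1].lower() in TIMEIDS:  # second token after is a timeid
                return True
    return False

# Example: "... the Seminar at 4 pm will ..."
tokens = ["the", "Seminar", "at", "4", "pm", "will"]
lemmas = ["the", "seminar", "at", "4", "pm", "will"]
print(should_insert_stime(lemmas, tokens, 3))       # True -> insert <stime> before "4"
```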

Linear Models
- However, in practice decision trees are not used so often in NLP
- Instead, linear models are used
- Let me first present linear models
- Then I will compare linear models and decision trees

Binary Classification
- I'm going to first discuss linear models for binary classification, using binary features
- We'll take the same scenario as before
- Our classifier is trying to decide whether we have a <stime> tag or not at the current position (between two words in an email)
- The first thing we will do is encode the context at this position into a feature vector

Feature Vector
- Each feature is true or false, and has a position in the feature vector
- The feature vector is typically sparse, meaning it is mostly zeros (meaning false)
- It will represent the full feature space. For instance, consider...

- Our features represent this table using binary variables
- For instance, consider the lemma column
- Most features will be false (false = off = 0; these terms are used interchangeably)
- The features that will be on (true = on = 1) are:
  -3_lemma_the
  -2_lemma_Seminar
  -1_lemma_at
  +1_lemma_4
  +2_lemma_pm
  +3_lemma_will
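A minimal sketch (Python, not from the lecture) of this sparse encoding: rather than storing a long vector of mostly 0s, we keep only the names of the features that are on.

```python
# Sparse binary encoding: only the "on" features are stored explicitly.
# Feature names follow the offset_lemma_value pattern from the slide.

on_features = {
    "-3_lemma_the", "-2_lemma_Seminar", "-1_lemma_at",
    "+1_lemma_4", "+2_lemma_pm", "+3_lemma_will",
}

def feature_value(name):
    """Any feature not in the set is implicitly off (0)."""
    return 1 if name in on_features else 0

print(feature_value("-1_lemma_at"))   # 1 (on)
print(feature_value("-1_lemma_the"))  # 0 (off)
```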

Classification
- To classify, we will take the dot product of the feature vector with a learned weight vector
- We will say that the class is true (i.e., we should insert a <stime> here) if the dot product is > 0, and false otherwise
- Because we might want to shift the values, we add a *bias* term, which is always true
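A minimal Python sketch of this decision rule, assuming the sparse set-of-on-features representation from the previous slide; the weights dictionary and the bias handling are illustrative, not a prescribed implementation.

```python
# Linear classification over binary features: the weight vector is a dict from
# feature name to weight, and the bias is treated as a feature that is always on.

def classify(on_features, weights):
    """Dot product of a sparse binary feature vector with the weight vector."""
    score = weights.get("bias", 0.0)          # bias feature always has value 1
    for f in on_features:
        score += weights.get(f, 0.0)          # features that are off contribute 0
    return score > 0                          # True -> insert the tag here
```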

Feature Vector
- We might use a feature vector like this (this example is simplified; really we'd have all features for all positions):
  Bias term
  -3_lemma_the
  -2_lemma_Seminar
  -1_lemma_at
  +1_lemma_4
  +1_Digit
  +2_timeid

Weight Vector
- Now we'd like the dot product to be > 0 if we should insert a tag
- To encode the rule we looked at before, we have three features that we want to have a positive weight:
  -1_lemma_at
  +1_Digit
  +2_timeid
- We can give them weights of 1
- Their sum will be three
- To make sure that we only classify as positive if all three features are on, let's set the weight on the bias term to -2

Dot Product - I
- Features: Bias term, -3_lemma_the, -2_lemma_Seminar, -1_lemma_at, +1_lemma_4, +1_Digit, +2_timeid
- To compute the dot product, take the product of each row (feature value times weight), and sum these products

Dot Product - II
- Features: Bias term, -3_lemma_the, -2_lemma_Seminar, -1_lemma_at, +1_lemma_4, +1_Digit, +2_timeid
- Each feature value is multiplied by its weight and the products are summed
- At the example position all of these features are on; with the weights from the previous slide only the bias (-2), -1_lemma_at (1), +1_Digit (1), and +2_timeid (1) contribute, so the dot product is (-2) + 1 + 1 + 1 = 1 > 0
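The same computation as a tiny Python sketch, assuming all listed features are on (value 1) at this position and using the weights chosen on the previous slide.

```python
# Worked example: all listed features fire, weights as chosen above.

on_features = ["bias", "-3_lemma_the", "-2_lemma_Seminar", "-1_lemma_at",
               "+1_lemma_4", "+1_Digit", "+2_timeid"]
weights = {"bias": -2, "-1_lemma_at": 1, "+1_Digit": 1, "+2_timeid": 1}

score = sum(weights.get(f, 0) for f in on_features)   # (-2) + 1 + 1 + 1 = 1
print(score, score > 0)                                # 1 True -> insert <stime>
```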

Learning the Weight Vector
- The general learning task is simply to find a good weight vector!
- This is sometimes also called "training"
- Basic intuition: you can check weight vector candidates to see how well they classify the training data
- Better weight vectors get more of the training data right
- So we need some way to make (smart) changes to the weight vector, such that we annotate more and more of the training data correctly
- I will talk about this next time
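A sketch of the intuition only: how a candidate weight vector could be scored on labeled training data. How to actually search for better weights is deferred to the next lecture; nothing here prescribes that algorithm.

```python
# Score a candidate weight vector by the fraction of labeled training positions
# it classifies correctly (the "check how well it does" intuition above).

def accuracy(weights, training_data):
    """training_data: list of (on_features, gold_is_stime) pairs."""
    correct = 0
    for on_features, gold in training_data:
        score = weights.get("bias", 0.0) + sum(weights.get(f, 0.0) for f in on_features)
        if (score > 0) == gold:
            correct += 1
    return correct / len(training_data)
```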

Feature Extraction
- We run feature extraction to get the feature vectors for each position in the text
- We typically use a text representation to represent the true values (which are sparse)
- We usually define feature templates which describe the features to be extracted and give them names (e.g., -1_lemma_XXX)

Example training instances (features followed by the gold label):
  -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +1_Digit +2_timeid    STIME
  -3_lemma_Seminar -2_lemma_at -1_lemma_4 -1_Digit +1_timeid +2_lemma_will   NONE
  ...
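A hedged Python sketch of feature templates over a window of tokens; the offset convention, the window size, and the Digit/timeid tests are assumptions for illustration, not the lecture's exact setup.

```python
# Feature templates over a +/-3 token window. `pos` is the gap index between
# two tokens (lemmas[pos-1] is offset -1, lemmas[pos] is offset +1).

TIMEIDS = {"am", "pm"}

def extract_features(lemmas, pos, window=3):
    feats = []
    for off in list(range(-window, 0)) + list(range(1, window + 1)):
        i = pos + off if off < 0 else pos + off - 1
        if i < 0 or i >= len(lemmas):
            continue
        feats.append(f"{off:+d}_lemma_{lemmas[i]}")       # e.g. "-1_lemma_at"
        if lemmas[i].isdigit():
            feats.append(f"{off:+d}_Digit")
        if lemmas[i].lower() in TIMEIDS:
            feats.append(f"{off:+d}_timeid")
    return feats

print(extract_features(["the", "Seminar", "at", "4", "pm", "will"], 3))
# ['-3_lemma_the', '-2_lemma_Seminar', '-1_lemma_at',
#  '+1_lemma_4', '+1_Digit', '+2_lemma_pm', '+2_timeid', '+3_lemma_will']
```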

Training vs. Testing
- When training the system, we have gold standard labels (see previous slide)
- When testing the system on new data, we have no gold standard
- We run the same feature generation first
- Then we take the dot product to get the classification decision
- Finally, we usually have to go back to the original text to write the tags into the correct positions
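A short sketch of that test-time loop, reusing the illustrative extract_features and weight-vector pieces from the sketches above; positions and tag names are assumptions.

```python
# Test-time tagging: extract features at every position, score them with the
# learned weights, and remember where tags must be written back into the text.

def tag_positions(lemmas, weights, extract_features):
    insert_at = []
    for pos in range(1, len(lemmas)):                      # every gap between two tokens
        feats = extract_features(lemmas, pos)
        score = weights.get("bias", 0.0) + sum(weights.get(f, 0.0) for f in feats)
        if score > 0:
            insert_at.append(pos)                          # write <stime> before lemmas[pos]
    return insert_at
```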

Further reading (optional):
- Tom Mitchell, "Machine Learning" (textbook)
- research/training/machine-learning-tutorial/

Thank you for your attention!