CSE 573: Finite State Machines for Information Extraction. Topics: Administrivia, Background, DIRT, Finite State Machine Overview, HMMs, Conditional Random Fields, Inference and Learning.

Mini-Project Options 1. Write a program to solve the counterfeit coin problem on the midterm. 2. Build a DPLL and/or a WalkSAT satisfiability solver. 3. Build a spam filter using naïve Bayes, decision trees, or compare learners in the Weka ML package. 4. Write a program which learns Bayes nets.

What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. Example (October 14, 2002, 4:00 a.m. PT): For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft VP, Richard Stallman, founder, Free Software Foundation.
Associated into records:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation
Slides from Cohen & McCallum

Landscape: Our Focus
Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG
Slides from Cohen & McCallum

Landscape of IE Techniques: Models. Any of these models can be used to capture words, formatting, or both. Running example sentence: "Abraham Lincoln was born in Kentucky."
– Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming): is the candidate string a member of the list?
– Classify pre-segmented candidates: a classifier asks "which class?" for each candidate.
– Sliding window: a classifier asks "which class?" for each window; try alternate window sizes.
– Boundary models: classifiers predict BEGIN and END positions.
– Finite state machines: what is the most likely state sequence?
– Context-free grammars: what is the most likely parse?
Slides from Cohen & McCallum

Simple Extractor: a small finite-state pattern that matches the trigger phrase "cities such as" followed by city names (Boston, Seattle, …) up to EOP.
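To make this concrete, here is a minimal regex-plus-lexicon sketch; the city lexicon, the pattern, and the example sentence are illustrative assumptions, not taken from the slides.

```python
import re

# Hypothetical lexicon and trigger pattern; the slide's actual FSM is not fully shown.
CITY_LEXICON = {"Boston", "Seattle", "Chicago"}

# Match "cities such as X, Y, and Z" up to the end of the sentence/phrase.
PATTERN = re.compile(r"[Cc]ities such as ([^.]*)")

def extract_cities(text):
    """Return lexicon members mentioned after the trigger phrase."""
    cities = []
    for match in PATTERN.finditer(text):
        for token in re.split(r",| and ", match.group(1)):
            token = token.strip()
            if token in CITY_LEXICON:
                cities.append(token)
    return cities

print(extract_cities("Flights to cities such as Boston, Seattle, and Chicago."))
# ['Boston', 'Seattle', 'Chicago']
```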

DIRT How related to IE? Why unsupervised? Distributional Hypothesis?

DIRT Dependency Tree?

DIRT Path Similarity Path Database

DIRT Evaluation? X is author of Y

Overall Accept? Proud?

Finite State Models (generative vs. conditional, single label vs. sequence vs. general graphs):
Generative directed models: Naïve Bayes → HMMs (sequence) → general directed models (general graphs)
Conditional models: Logistic Regression → Linear-chain CRFs (sequence) → General CRFs (general graphs)

Graphical Models: a family of probability distributions that factorize in a certain way.
– Directed (Bayes nets): a node is independent of its non-descendants given its parents.
– Undirected (Markov random fields): a node is independent of all other nodes given its neighbors.
– Factor graphs.

Recap: Naïve Bayes. Assumption: features are independent given the label. A generative classifier:
– Models the joint distribution p(x, y) = p(y) ∏_i p(x_i | y)
– Inference: ŷ = argmax_y p(y) ∏_i p(x_i | y)
– Learning: counting
– Example: "The article appeared in the Seattle Times." City? Features: length, capitalization, suffix.
But we need to consider the sequence: the labels of neighboring words are dependent!
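As a concrete illustration of "learning = counting", here is a minimal naïve Bayes sketch; the class names, toy features, smoothing constant, and the hypothetical vocabulary size of 100 are my assumptions, not from the slides.

```python
from collections import Counter, defaultdict

# Tiny naive Bayes sketch: each example is a set of features; training just counts
# label frequencies and feature/label co-occurrences.
class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)   # label -> Counter of features

    def train(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)

    def predict(self, features):
        best_label, best_score = None, float("-inf")
        total = sum(self.label_counts.values())
        for label, count in self.label_counts.items():
            score = count / total                     # prior p(y)
            for f in features:
                # add-one smoothing over a hypothetical feature vocabulary of size 100
                score *= (self.feature_counts[label][f] + 1) / (count + 100)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train([({"capitalized", "suffix=attle"}, "city"),
          ({"capitalized", "suffix=ford"}, "person"),
          ({"lowercase"}, "other")])
print(nb.predict({"capitalized", "suffix=attle"}))   # 'city'
```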

Hidden Markov Models: a generative sequence model. Two assumptions make the joint distribution tractable: 1. Each state depends only on its immediate predecessor. 2. Each observation depends only on the current state. The finite state model / graphical model pairs a state sequence y1 … yT (other, person, person, …, location, …) with an observation sequence x1 … xT ("Yesterday Pedro …").

Hidden Markov Models: a generative sequence model. Model parameters: start state probabilities π, transition probabilities a, and observation (emission) probabilities b, over a state sequence y1 … yT and an observation sequence x1 … xT.
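For reference, the joint distribution these parameters define under the two assumptions above (standard formulation, using π for start, a for transition, and b for observation probabilities; the slide does not spell the formula out):

```latex
\begin{equation}
P(x, y) \;=\; \pi_{y_1}\, b_{y_1}(x_1)\; \prod_{t=2}^{T} a_{y_{t-1},\,y_t}\; b_{y_t}(x_t)
\end{equation}
```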

IE with Hidden Markov Models. Given a sequence of observations ("Yesterday Pedro Domingos spoke this example sentence.") and a trained HMM (with states such as person name, location name, background), find the most likely state sequence (Viterbi). Any words generated by the designated "person name" state are extracted as a person name: Pedro Domingos. Slide by Cohen & McCallum
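A minimal post-processing sketch, assuming the Viterbi state sequence has already been computed; the tag names and the token/state arrays are illustrative, not from the slides.

```python
# Once Viterbi has assigned a state to each token, any maximal run of tokens
# labeled with the target state is emitted as one extracted entity.
def extract_spans(tokens, states, target="person_name"):
    spans, current = [], []
    for token, state in zip(tokens, states):
        if state == target:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Yesterday", "Pedro", "Domingos", "spoke", "this", "example", "sentence", "."]
states = ["background", "person_name", "person_name",
          "background", "background", "background", "background", "background"]
print(extract_spans(tokens, states))  # ['Pedro Domingos']
```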

IE with Hidden Markov Models. For sparse extraction tasks: use a separate HMM for each type of target. Each HMM should model the entire document, consist of target and non-target states, and not necessarily be fully connected. Slide by Okan Basegmez

Information Extraction with HMMs. Example: research paper headers. Slide by Okan Basegmez

HMM Example: “Nymble” [Bikel et al. 1998], [BBN “IdentiFinder”]. Task: named entity extraction. Train on ~500k words of newswire text. States: person, org, (five other name classes), other, plus start-of-sentence and end-of-sentence; transition and observation probabilities with back-off. Results (F1):
Mixed case, English: 93%
Upper case, English: 91%
Mixed case, Spanish: 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]. Slide by Cohen & McCallum

A parse of a sequence. Given a sequence x = x1 … xN, a parse of x is a sequence of states y = y1, …, yN (e.g., person, other, location). Slide by Serafim Batzoglou

Question #1 – Evaluation. GIVEN: a sequence of observations x1 x2 x3 x4 … xN and a trained HMM θ = (π, a, b). QUESTION: how likely is this sequence given our HMM, i.e. P(x | θ)? Why do we care? We need it for learning, to choose among competing models!

Question #2 – Decoding. GIVEN: a sequence of observations x1 x2 x3 x4 … xN and a trained HMM θ = (π, a, b). QUESTION: how do we choose the corresponding parse (state sequence) y1 y2 y3 y4 … yN that “best” explains x1 x2 x3 x4 … xN? There are several reasonable optimality criteria: the single optimal sequence, average statistics for individual states, …

Question #3 – Learning. GIVEN: a sequence of observations x1 x2 x3 x4 … xN. QUESTION: how do we learn the model parameters θ = (π, a, b) to maximize P(x | θ)?

Solution to #1: Evaluation. Given observations x = x1 … xN and HMM θ, what is P(x | θ)? Naïve approach: enumerate every possible state sequence y = y1 … yN, compute the probability of that particular y and the probability of x given that y, and sum: P(x | θ) = Σ_y P(x | y) P(y). But with N states and a sequence of length T there are N^T state sequences, each needing ~2T multiplications; even for a small HMM with T = 10 and N = 10 that is 10 billion sequences!
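A brute-force sketch of this naïve enumeration, with toy parameters of my own; it makes the N^T blow-up concrete.

```python
import itertools
import numpy as np

# Sum P(x, y) over all N**T state sequences. Feasible only for tiny models --
# exactly the blow-up noted on the slide.
def brute_force_likelihood(pi, A, B, obs):
    N = len(pi)                      # number of states
    T = len(obs)                     # length of observation sequence
    total = 0.0
    for path in itertools.product(range(N), repeat=T):   # N**T paths
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(brute_force_likelihood(pi, A, B, obs=[0, 1, 0]))
```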

Solution to #1: Evaluation. Use dynamic programming: cache and reuse the inner sums. Define the forward variable α_t(i) = P(x1 … xt, y_t = S_i | θ): the probability that at time t the state is S_i and the partial observation sequence x = x1 … xt has been emitted.

The Forward Algorithm.
INITIALIZATION: α_1(i) = π_i b_i(x_1)
INDUCTION: α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(x_{t+1})
TERMINATION: P(x | θ) = Σ_i α_N(i)
Time: O(K² N), Space: O(K N), where K = |S| is the number of states and N the length of the sequence.
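A minimal Python sketch of these recurrences; the toy parameters are mine (the same ones used in the brute-force sketch above).

```python
import numpy as np

# Forward algorithm: O(K^2 N) time, O(K N) space.
def forward(pi, A, B, obs):
    K, N = len(pi), len(obs)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination: P(x | theta)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, likelihood = forward(pi, A, B, obs=[0, 1, 0])
print(likelihood)   # matches the brute-force sum above
```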

The Backward Algorithm

INITIALIZATION: β_N(i) = 1
INDUCTION: β_t(i) = Σ_j a_ij b_j(x_{t+1}) β_{t+1}(j)
TERMINATION: P(x | θ) = Σ_i π_i b_i(x_1) β_1(i)
Time: O(K² N), Space: O(K N)
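A matching backward-pass sketch under the same assumptions and notation as the forward sketch above.

```python
import numpy as np

# Backward algorithm: beta[t, i] = P(x_{t+1} ... x_N | y_t = S_i).
def backward(pi, A, B, obs):
    K, N = len(pi), len(obs)
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                        # initialization
    for t in range(N - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # induction
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])      # termination: P(x | theta)
    return beta, likelihood
```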

Solution to #2 – Decoding. Given x = x1 … xN and HMM θ, what is the “best” parse y1 … yN? There are several notions of optimality. 1. States which are individually most likely: the most likely state y*_t at time t is the one with the highest posterior probability given x. But the resulting sequence may use transitions that have probability 0!
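For reference, the posterior state marginals this criterion maximizes can be written in terms of the forward and backward variables (standard formulation; the slide leaves the formula implicit):

```latex
\begin{equation}
\gamma_t(i) \;=\; P(y_t = S_i \mid x, \theta)
           \;=\; \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j} \alpha_t(j)\,\beta_t(j)},
\qquad
y_t^{*} \;=\; \arg\max_{i}\; \gamma_t(i)
\end{equation}
```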

Solution to #2 – Decoding. Given x = x1 … xN and HMM θ, what is the “best” parse y1 … yN? 2. The single best state sequence: we want to find the sequence y1 … yN such that P(x, y) is maximized, y* = argmax_y P(x, y). Again, we can use dynamic programming!

The Viterbi Algorithm.
DEFINE: δ_t(i) = the probability of the most likely state sequence ending in state S_i at time t that explains x1 … xt.
INITIALIZATION: δ_1(i) = π_i b_i(x_1)
INDUCTION: δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(x_{t+1}), with backpointer ψ_{t+1}(j) = argmax_i δ_t(i) a_ij
TERMINATION: P* = max_i δ_N(i), y*_N = argmax_i δ_N(i)
Backtracking recovers the state sequence y*: y*_t = ψ_{t+1}(y*_{t+1}).
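A minimal Viterbi sketch following these recurrences; the toy parameters from the earlier sketches would plug in directly.

```python
import numpy as np

# Viterbi: returns the single best state sequence for the observations.
def viterbi(pi, A, B, obs):
    K, N = len(pi), len(obs)
    delta = np.zeros((N, K))
    psi = np.zeros((N, K), dtype=int)                 # backpointers
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, N):
        scores = delta[t - 1, :, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # induction
    path = [int(delta[-1].argmax())]                  # termination
    for t in range(N - 1, 0, -1):                     # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```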

The Viterbi Algorithm. Time: O(K² T), Space: O(K T), i.e. linear in the length of the sequence. Remember: δ_t(i) is the probability of the most likely state sequence ending with state S_i at time t; the trellis of δ values is filled column by column, taking a max over predecessors. Slides from Serafim Batzoglou

The Viterbi Algorithm: example run on the sentence containing “Pedro Domingos”.

Solution to #3 – Learning. Given x1 … xN, how do we learn θ = (π, a, b) to maximize P(x | θ)? Unfortunately, there is no known way to analytically find a global maximum θ* = argmax_θ P(x | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ′ such that P(x | θ′) ≥ P(x | θ).

Solution to #3 – Learning. Use hill climbing, called the forward-backward (or Baum-Welch) algorithm. Idea: start from an initial parameter instantiation, then loop: compute the forward and backward probabilities for the current model parameters and our observations; re-estimate the parameters; repeat until the estimates don’t change much.
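A sketch of one Baum-Welch re-estimation step for a single observation sequence, assuming the forward() and backward() sketches defined earlier; the update formulas are the standard ones, not transcribed from these slides.

```python
import numpy as np

# One E step (gamma, xi) and one M step (re-estimate pi, A, B).
def baum_welch_step(pi, A, B, obs):
    alpha, likelihood = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    N, K = alpha.shape

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # P(y_t = i | x)

    xi = np.zeros((N - 1, K, K))                       # P(y_t = i, y_{t+1} = j | x)
    for t in range(N - 1):
        xi[t] = alpha[t, :, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):
        new_B[:, o] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood
```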

Expectation Maximization The forward-backward algorithm is an instance of the more general EM algorithm –The E Step: Compute the forward and backward probabilities for given model parameters and our observations –The M Step: Re-estimate the model parameters

Chicken & Egg Problem. If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities; but we can’t observe the states, so we don’t! If we knew the transition and emission probabilities, it would be easy to estimate the sequence of states (Viterbi); but we don’t know them! Slide by Daniel S. Weld

Simplest Version: a mixture of two distributions. Know: the form of the distributions and their variance; just need the mean of each distribution. Slide by Daniel S. Weld

Input Looks Like: (figure: unlabeled data points drawn from the mixture). Slide by Daniel S. Weld

We Want to Predict: which distribution each point came from (the hidden “coloring”). Slide by Daniel S. Weld

Chicken & Egg: note that coloring the instances would be easy if we knew the Gaussians…. Slide by Daniel S. Weld

Chicken & Egg: and finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters: initialize randomly, e.g. set μ1 = ?, μ2 = ?. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld

ML Mean of a Single Gaussian: μ_ML = argmin_μ Σ_i (x_i − μ)² = (1/N) Σ_i x_i. Slide by Daniel S. Weld
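Putting the E and M steps together for the two-Gaussian example above, here is a minimal sketch assuming known, equal variances and equal mixture weights; the variance value, the random data, and the iteration count are illustrative assumptions.

```python
import numpy as np

# EM for a mixture of two 1-D Gaussians where only the two means are estimated.
def em_two_means(x, sigma=1.0, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)          # initialize randomly
    for _ in range(iters):
        # E step: responsibility of each component for each point
        resp = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: each point counts fractionally toward both means
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

data = np.concatenate([np.random.normal(-2, 1, 100), np.random.normal(3, 1, 100)])
print(em_two_means(data))   # approximately [-2, 3] (order may vary)
```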

The Problem with HMMs: we want more than an atomic view of words. We want many arbitrary, overlapping features of words: identity of the word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor text; the last person name was female; the next two words are “and Associates”. Example: the token “Wisniewski” is part of a noun phrase and ends in “-ski”. Slide by Cohen & McCallum
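A sketch of what such a feature function might look like for one token, as described above; every specific feature and the small city list are illustrative assumptions.

```python
# Overlapping, non-independent token features of the kind listed on the slide.
CITY_NAMES = {"Seattle", "Boston", "Pittsburgh"}

def token_features(tokens, t):
    word = tokens[t]
    return {
        "word=" + word.lower(): 1.0,
        "ends_in_ski": float(word.endswith("ski")),
        "is_capitalized": float(word[:1].isupper()),
        "in_city_list": float(word in CITY_NAMES),
        "prev_word=" + (tokens[t - 1].lower() if t > 0 else "<s>"): 1.0,
        "next_word=" + (tokens[t + 1].lower() if t + 1 < len(tokens) else "</s>"): 1.0,
    }

print(token_features(["Dr.", "Wisniewski", "visited", "Seattle"], 1))
```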

Finite State Models (revisited: which conditional model plays the role of the HMM for sequences?):
Generative directed models: Naïve Bayes → HMMs (sequence) → general directed models (general graphs)
Conditional models: Logistic Regression → ? (sequence) → General CRFs (general graphs)

Problems with a Richer Representation and a Joint Model: these arbitrary features are not independent. Multiple levels of granularity (characters, words, phrases); multiple dependent modalities (words, formatting, layout); past and future. Two choices: (1) Model the dependencies: each state would have its own Bayes net, but we are already starved for training data! (2) Ignore the dependencies: this causes “over-counting” of evidence (as in naïve Bayes), a big problem when combining evidence, as in Viterbi! Slide by Cohen & McCallum

Discriminative and Generative Models So far: all models generative Generative Models … model P(y,x) Discriminative Models … model P(y|x) P(y|x) does not include a model of P(x), so it does not need to model the dependencies between features!

Discriminative Models are often better. Eventually, what we care about is p(y|x)! A Bayes net describes a family of joint distributions over (y, x) whose conditionals take a certain form; but there are many other joint models whose conditionals also have that form. We want to make independence assumptions among y, but not among x.

Conditional Sequence Models. We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x). Such a model can examine features without being responsible for generating them, does not have to explicitly model their dependencies, and does not “waste modeling effort” trying to generate what we are given at test time anyway. Slide by Cohen & McCallum
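One standard conditional sequence model of this kind is the linear-chain CRF that appears in the model diagram; for reference, its conditional form (standard formulation, not transcribed from these slides):

```latex
\begin{equation}
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1},\, y_t,\, x,\, t) \Bigg),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1},\, y'_t,\, x,\, t) \Bigg)
\end{equation}
```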

Finite State Models:
Generative directed models: Naïve Bayes → HMMs (sequence) → general directed models (general graphs)
Conditional models: Logistic Regression → Linear-chain CRFs (sequence) → General CRFs (general graphs)

Key Ideas
Problem Spaces – Use of KR to Represent States, Compilation to SAT
Search – Dynamic Programming, Constraint Satisfaction, Heuristics
Learning – Decision Trees, Need for Bias, Ensembles
Probabilistic Inference – Bayes Nets, Variable Elimination, Decisions: MDPs
Probabilistic Learning – Naïve Bayes, Parameter & Structure Learning, EM, HMMs: Viterbi, Baum-Welch

Applications
SAT, CSP, Scheduling – everywhere
Planning – NASA, Xerox
Machine Learning – everywhere
Probabilistic Reasoning – spam filters, robot localization, etc.