CSE 573: Finite State Machines for Information Extraction
Topics
–Administrivia
–Background
–DIRT
–Finite State Machine Overview
–HMMs
–Conditional Random Fields
–Inference and Learning
Mini-Project Options
1. Write a program to solve the counterfeit coin problem on the midterm.
2. Build a DPLL and/or a WalkSAT satisfiability solver.
3. Build a spam filter using naïve Bayes or decision trees, or compare learners in the Weka ML package.
4. Write a program which learns Bayes nets.
What is "Information Extraction"?
Information Extraction = segmentation + classification + association + clustering
As a family of techniques, applied to a news story:

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Entity mentions found by segmentation and classification: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.

Association and clustering fill in a relation table:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slides from Cohen & McCallum
Landscape: Our Focus
Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG
Slides from Cohen & McCallum
Landscape of IE Techniques: Models
Any of these models can be used to capture words, formatting, or both.
–Lexicons: is "Kentucky" a member of the list (Alabama, Alaska, …, Wisconsin, Wyoming)?
–Classify pre-segmented candidates: given "Abraham Lincoln was born in Kentucky.", which class does a candidate segment belong to?
–Sliding window: run a classifier over windows of the sentence, trying alternate window sizes.
–Boundary models: classify positions in the sentence as BEGIN or END of a field.
–Finite state machines: find the most likely state sequence for the sentence.
–Context free grammars: find the most likely parse (NNP, VP, NP, PP, S, …).
Slides from Cohen & McCallum
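To make the first two of these concrete, here is a minimal Python sketch of lexicon lookup and a sliding-window extractor; the toy lexicon, window size, and scoring rule are assumptions for illustration, not part of the slides.

```python
# Minimal sketch of two simple IE models: lexicon lookup and a sliding-window
# classifier. The lexicon, window size, and scoring rule are toy stand-ins.

US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def lexicon_extract(tokens):
    """Tag any token that appears in the lexicon."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in US_STATES]

def sliding_window_extract(tokens, max_width=2, score=None):
    """Score every window of up to max_width tokens and keep the high-scoring ones."""
    if score is None:
        # Stand-in scorer: fraction of capitalized tokens in the window.
        score = lambda span: sum(tok[0].isupper() for tok in span) / len(span)
    hits = []
    for start in range(len(tokens)):
        for width in range(1, max_width + 1):
            span = tokens[start:start + width]
            if span and score(span) >= 1.0:
                hits.append((start, " ".join(span)))
    return hits

tokens = "Abraham Lincoln was born in Kentucky .".split()
print(lexicon_extract(tokens))          # [(5, 'Kentucky')]
print(sliding_window_extract(tokens))   # fully capitalized spans such as 'Abraham Lincoln'
```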
Simple Extractor
A hand-built finite state machine for extracting city names: a background state with a wildcard (*) self-loop, a transition on the trigger phrase "cities such as", then a state that emits city names (Boston, Seattle, …) until the end of the pattern (EOP).
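A minimal sketch of such a pattern, written here as a regular expression rather than an explicit state machine; the trigger phrase and the capitalization heuristic are assumptions for illustration.

```python
import re

# Hand-built pattern in the spirit of the "cities such as" extractor: match the
# trigger phrase, then capture a comma/"and"-separated list of capitalized tokens.
CITY_LIST = re.compile(
    r"cities\s+such\s+as\s+([A-Z][a-z]+(?:\s*(?:,|and)\s*[A-Z][a-z]+)*)"
)

def extract_cities(sentence):
    match = CITY_LIST.search(sentence)
    if not match:
        return []
    return [c.strip() for c in re.split(r",|\band\b", match.group(1)) if c.strip()]

print(extract_cities("He has lived in cities such as Boston, Seattle and Portland."))
# ['Boston', 'Seattle', 'Portland']
```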
DIRT
–How is DIRT related to IE?
–Why unsupervised?
–The Distributional Hypothesis?
DIRT
–Dependency trees?
DIRT
–Path similarity?
–Path database?
DIRT
–Evaluation? Example: "X is author of Y"
Overall
–Accept?
–Proud?
Finite State Models
Generative directed models: Naïve Bayes → HMMs → general generative directed models (single class → sequence → general graphs).
Their conditional counterparts: Logistic Regression → Linear-chain CRFs → General CRFs.
Graphical Models
Family of probability distributions that factorize in a certain way.
–Directed (Bayes Nets): a node is independent of its non-descendants given its parents
–Undirected (Markov Random Fields): a node is independent of all other nodes given its neighbors
–Factor Graphs
Recap: Naïve Bayes
Assumption: features are independent given the label.
Generative classifier:
–Models the joint distribution p(x, y) = p(y) ∏_i p(x_i | y)
–Inference: predict argmax_y p(y) ∏_i p(x_i | y)
–Learning: counting
–Example: "The article appeared in the Seattle Times." Is "Seattle" a city? Features: length, capitalization, suffix, …
But we need to consider the sequence: the labels of neighboring words are dependent!
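A minimal sketch of "learning by counting" and prediction for a naïve Bayes classifier; the toy training words, features, and smoothing are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: (word, label). Features are word-shape cues like those on
# the slide (capitalization, length, suffix).
def features(word):
    return {"cap": word[0].isupper(), "len>5": len(word) > 5, "suffix": word[-2:]}

train = [("Seattle", "city"), ("Chicago", "city"), ("Boston", "city"),
         ("article", "other"), ("appeared", "other"), ("Times", "other")]

label_counts = Counter()
feat_counts = defaultdict(Counter)          # feat_counts[label][(feature, value)]
for word, label in train:
    label_counts[label] += 1
    for f, v in features(word).items():
        feat_counts[label][(f, v)] += 1

def predict(word, alpha=1.0):
    """argmax_y p(y) * prod_i p(x_i | y), with simple add-alpha smoothing."""
    scores = {}
    for label, n in label_counts.items():
        score = n / sum(label_counts.values())
        for f, v in features(word).items():
            score *= (feat_counts[label][(f, v)] + alpha) / (n + 2 * alpha)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("Denver"))   # 'city', driven mostly by the capitalization feature
```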
Hidden Markov Models
Generative sequence model. Two assumptions make the joint distribution tractable:
1. Each state depends only on its immediate predecessor.
2. Each observation depends only on the current state.
So p(x, y) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t), drawn either as a finite state model or as a graphical model over the state sequence y_1 … y_T and observation sequence x_1 … x_T.
Example: observations such as "Yesterday", "Pedro", … generated by states such as other, person, location, …
Hidden Markov Models
Generative sequence model. Model parameters:
–Start state probabilities: π_i = P(y_1 = S_i)
–Transition probabilities: a_ij = P(y_t = S_j | y_{t-1} = S_i)
–Observation probabilities: b_i(x) = P(x_t = x | y_t = S_i)
The model can be drawn either as a finite state machine (states with transition arcs) or as a graphical model (state sequence y_1 … y_T generating observation sequence x_1 … x_T).
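A minimal sketch of these parameters and of scoring one (state, observation) pair of sequences under them; the two-state person/other HMM and all of its numbers are made up for illustration.

```python
import numpy as np

# Toy HMM: states 0 = "other", 1 = "person"; observations are word indices.
states = ["other", "person"]
vocab = {"Yesterday": 0, "Pedro": 1, "Domingos": 2, "spoke": 3}

pi = np.array([0.8, 0.2])                      # start state probabilities
A = np.array([[0.7, 0.3],                      # A[i, j] = P(y_t = j | y_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.4, 0.05, 0.05, 0.5],          # B[i, k] = P(x_t = k | y_t = i)
              [0.05, 0.5, 0.4, 0.05]])

def joint_prob(obs, state_seq):
    """p(x, y) = p(y_1) p(x_1|y_1) * prod_t p(y_t|y_{t-1}) p(x_t|y_t)."""
    p = pi[state_seq[0]] * B[state_seq[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[state_seq[t - 1], state_seq[t]] * B[state_seq[t], obs[t]]
    return p

obs = [vocab[w] for w in ["Yesterday", "Pedro", "Domingos", "spoke"]]
print(joint_prob(obs, [0, 1, 1, 0]))   # other, person, person, other
```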
IE with Hidden Markov Models
Given a sequence of observations ("Yesterday Pedro Domingos spoke this example sentence.") and a trained HMM with states such as person name, location name, and background:
–Find the most likely state sequence y* = argmax_y P(x, y) (Viterbi).
–Extract as a person name any words generated by the designated "person name" state. Here: Pedro Domingos.
Slide by Cohen & McCallum
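A minimal sketch of the extraction step, pulling out spans of words whose predicted state is the designated target state; the state names and the example tagging are assumptions for illustration.

```python
def extract_spans(words, tags, target="person name"):
    """Group maximal runs of words tagged with the target state."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == target:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

words = "Yesterday Pedro Domingos spoke this example sentence .".split()
tags = ["background", "person name", "person name",
        "background", "background", "background", "background", "background"]
print(extract_spans(words, tags))   # ['Pedro Domingos']
```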
IE with Hidden Markov Models
For sparse extraction tasks:
–Use a separate HMM for each type of target
–Each HMM should: model the entire document; consist of target and non-target states; not necessarily be fully connected
Slide by Okan Basegmez
Information Extraction with HMMs
Example: research paper headers
Slide by Okan Basegmez
HMM Example: "Nymble" [Bikel et al. 1998], [BBN "IdentiFinder"]
Task: named entity extraction. Train on ~500k words of newswire text.
States: start-of-sentence, end-of-sentence, Person, Org, (five other name classes), Other.
Transition and observation probabilities are conditioned on context, backing off to less specific distributions when counts are sparse.
Results:
Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
Slide by Cohen & McCallum
A parse of a sequence
Given a sequence x = x_1 … x_N, a parse of x is a sequence of states y = y_1, …, y_N (e.g., person, other, location), one state per observation: a path through a trellis of K states by N positions.
Slide by Serafim Batzoglou
Question #1 – Evaluation
GIVEN: a sequence of observations x_1 x_2 x_3 x_4 … x_N and a trained HMM θ = (A, B, π)
QUESTION: How likely is this sequence, given our HMM? That is, compute P(x | θ).
Why do we care? We need it for learning, to choose among competing models!
Question #2 – Decoding
GIVEN: a sequence of observations x_1 x_2 x_3 x_4 … x_N and a trained HMM θ = (A, B, π)
QUESTION: How do we choose the corresponding parse (state sequence) y_1 y_2 y_3 y_4 … y_N which "best" explains x_1 x_2 x_3 x_4 … x_N?
There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …
Question #3 – Learning
GIVEN: a sequence of observations x_1 x_2 x_3 x_4 … x_N
QUESTION: How do we learn the model parameters θ = (A, B, π) to maximize P(x | θ)?
Solution to #1: Evaluation
Given observations x = x_1 … x_N and HMM θ, what is P(x | θ)?
Naïve approach: enumerate every possible state sequence y = y_1 … y_N and sum,
P(x | θ) = Σ_y P(x, y | θ) = Σ_y P(x | y, θ) P(y | θ),
where P(y | θ) = π_{y_1} ∏_t a_{y_{t-1} y_t} is the probability of a particular y and P(x | y, θ) = ∏_t b_{y_t}(x_t) is the probability of x given that y.
There are K^N state sequences and about 2N multiplications per sequence, so even for a small HMM with K = 10 states and N = 10 observations there are 10 billion sequences!
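A minimal sketch of this brute-force sum over all K^N state sequences; the tiny HMM parameters are made up, and the approach is only feasible for very small K and N.

```python
import itertools
import numpy as np

# Tiny HMM for illustration (2 states, 2 observation symbols); values are made up.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

def brute_force_likelihood(obs):
    """P(x) = sum over all K^N state sequences of P(x, y)."""
    K, N = len(pi), len(obs)
    total = 0.0
    for y in itertools.product(range(K), repeat=N):   # K^N sequences
        p = pi[y[0]] * B[y[0], obs[0]]
        for t in range(1, N):
            p *= A[y[t - 1], y[t]] * B[y[t], obs[t]]
        total += p
    return total

print(brute_force_likelihood([0, 1, 0]))   # exact, but the cost grows as K^N
```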
Solution to #1: Evaluation
Use dynamic programming. Define the forward variable
α_t(i) = P(x_1 … x_t, y_t = S_i | θ),
the probability that at time t the state is S_i and the partial observation sequence x_1 … x_t has been emitted.
Solution to #1: Evaluation
Use dynamic programming: cache and reuse the inner sums. With the forward variable α_t(i) defined as above, the sum over state sequences is pushed inward, one step at a time:
α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(x_{t+1}).
The Forward Algorithm
INITIALIZATION: α_1(i) = π_i b_i(x_1)
INDUCTION: α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(x_{t+1})
TERMINATION: P(x | θ) = Σ_i α_N(i)
Time: O(K²N), Space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
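A minimal numpy sketch of the forward algorithm exactly as written above; the toy parameters are assumptions for illustration.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: returns alpha (N x K) and P(x | theta)."""
    N, K = len(obs), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, N):                             # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                     # termination

# Toy 2-state HMM (made-up numbers) and a short observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, likelihood = forward([0, 1, 0], pi, A, B)
print(likelihood)   # same value as brute-force enumeration, in O(K^2 N) time
```

In practice the α values are rescaled at each step (or the computation is done in log space) to avoid numerical underflow on long sequences.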
The Backward Algorithm
Define the backward variable β_t(i) = P(x_{t+1} … x_N | y_t = S_i, θ).
INITIALIZATION: β_N(i) = 1
INDUCTION: β_t(i) = Σ_j a_ij b_j(x_{t+1}) β_{t+1}(j)
TERMINATION: P(x | θ) = Σ_i π_i b_i(x_1) β_1(i)
Time: O(K²N), Space: O(KN)
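A matching numpy sketch of the backward pass, under the same assumed toy parameters as the forward sketch.

```python
import numpy as np

def backward(obs, pi, A, B):
    """Backward algorithm: returns beta (N x K) and P(x | theta)."""
    N, K = len(obs), len(pi)
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                    # initialization
    for t in range(N - 2, -1, -1):                    # induction, right to left
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, np.sum(pi * B[:, obs[0]] * beta[0])  # termination

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
beta, likelihood = backward([0, 1, 0], pi, A, B)
print(likelihood)   # agrees with the forward algorithm's P(x | theta)
```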
Solution to #2: Decoding
Given x = x_1 … x_N and HMM θ, what is the "best" parse y_1 … y_N? There are several reasonable optimality criteria.
1. States which are individually most likely: the most likely state at time t is
   y*_t = argmax_i P(y_t = S_i | x, θ) = argmax_i α_t(i) β_t(i) / P(x | θ).
But the resulting sequence may use transitions that have probability 0!
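A small sketch of this posterior ("individually most likely state") decoding, combining compact forward and backward passes; the toy parameters are again made up.

```python
import numpy as np

def forward(obs, pi, A, B):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, pi, A, B):
    beta = np.ones((len(obs), len(pi)))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def posterior_decode(obs, pi, A, B):
    """Pick, at each time step, the individually most likely state."""
    alpha, beta = forward(obs, pi, A, B), backward(obs, pi, A, B)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize by P(x | theta)
    return gamma.argmax(axis=1)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(posterior_decode([0, 1, 0], pi, A, B))    # e.g. [0 1 0]
```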
Solution to #2: Decoding
2. Single best state sequence: find the sequence y_1 … y_N such that P(x, y) is maximized,
   y* = argmax_y P(x, y).
Again, we can use dynamic programming: a search over paths through the trellis of K states by N positions.
The Viterbi Algorithm
DEFINE: δ_t(i) = max over y_1 … y_{t-1} of P(y_1 … y_{t-1}, y_t = S_i, x_1 … x_t | θ)
INITIALIZATION: δ_1(i) = π_i b_i(x_1)
INDUCTION: δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(x_{t+1}), recording the maximizing i in a backpointer ψ_{t+1}(j)
TERMINATION: P* = max_i δ_N(i)
Backtracking through the ψ pointers recovers the state sequence y*.
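A minimal numpy sketch of Viterbi with backpointers, matching the recursion above; the toy parameters are assumptions for illustration.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence y* = argmax_y P(x, y) via dynamic programming."""
    N, K = len(obs), len(pi)
    delta = np.zeros((N, K))            # delta[t, i] = best score of a path ending in state i at t
    psi = np.zeros((N, K), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, N):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 0], pi, A, B))   # most likely path and its joint probability
```

In practice the products are replaced by sums of log probabilities to avoid underflow on long sequences.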
The Viterbi Algorithm
Time: O(K²T), Space: O(KT), i.e. linear in the length of the sequence.
The trellis has K states by T positions; cell δ_t(i) holds the probability of the most likely state sequence ending in state S_i at position t, computed as a max over the previous column.
Slides from Serafim Batzoglou
The Viterbi Algorithm: worked example on the "Pedro Domingos" sentence.
Solution to #3: Learning
Given x = x_1 … x_N, how do we learn the model parameters θ = (A, B, π) to maximize P(x | θ)?
Unfortunately, there is no known way to analytically find a global maximum θ* = argmax_θ P(x | θ).
But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(x | θ') ≥ P(x | θ).
Solution to #3: Learning
Use hill climbing, called the forward-backward (or Baum-Welch) algorithm.
Idea:
–Start from an initial parameter instantiation
–Loop: compute the forward and backward probabilities for the current model parameters and our observations, then re-estimate the parameters
–Until the estimates don't change much
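A minimal sketch of one Baum-Welch iteration on a single observation sequence, built from the forward and backward quantities; the update formulas are the standard re-estimation equations, while the toy starting parameters and observation sequence are made up. A real implementation would also rescale α and β and handle multiple training sequences.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    N, K = len(obs), len(pi)
    alpha, beta = np.zeros((N, K)), np.ones((N, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(N - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_step(obs, pi, A, B):
    """One E step + M step of Baum-Welch on a single observation sequence."""
    N, K = len(obs), len(pi)
    alpha, beta, px = forward_backward(obs, pi, A, B)
    gamma = alpha * beta / px                         # gamma[t, i] = P(y_t = i | x)
    xi = np.zeros((N - 1, K, K))                      # xi[t, i, j] = P(y_t = i, y_{t+1} = j | x)
    for t in range(N - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / px
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, px

# Run a few iterations from a made-up starting point on a made-up sequence.
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.2, 0.8]])
obs = [0, 0, 1, 0, 1, 1, 1, 0]
for _ in range(10):
    pi, A, B, px = baum_welch_step(obs, pi, A, B)
print(px)   # P(x | theta) is non-decreasing across iterations
```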
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm.
–The E step: compute the forward and backward probabilities for the given model parameters and our observations
–The M step: re-estimate the model parameters
Chicken & Egg Problem
If we knew the actual sequence of states
–It would be easy to learn the transition and emission probabilities
–But we can't observe the states, so we don't!
If we knew the transition & emission probabilities
–Then it would be easy to estimate the sequence of states (Viterbi)
–But we don't know them!
Slide by Daniel S. Weld
Simplest Version
Mixture of two distributions. We know the form of each distribution and its variance; we just need the mean of each distribution.
[Figure: data points on a number line from .01 to .09]
Slide by Daniel S. Weld
Input Looks Like
[Figure: unlabeled points on the same number line]
Slide by Daniel S. Weld
We Want to Predict
[Figure: the same points, asking which distribution each point came from]
Slide by Daniel S. Weld
Chicken & Egg
Note that coloring the instances would be easy if we knew the Gaussians…
Slide by Daniel S. Weld
Chicken & Egg
…and finding the Gaussians would be easy if we knew the coloring.
Slide by Daniel S. Weld
Expectation Maximization (EM)
Pretend we do know the parameters.
–Initialize randomly: set μ1 and μ2 to random values
Slide by Daniel S. Weld
Expectation Maximization (EM)
Pretend we do know the parameters.
–Initialize randomly
[E step] Compute the probability of each instance having each possible value of the hidden variable
Slide by Daniel S. Weld
Expectation Maximization (EM)
Pretend we do know the parameters.
–Initialize randomly
[E step] Compute the probability of each instance having each possible value of the hidden variable
[M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld
ML Mean of a Single Gaussian
μ_ML = argmin_μ Σ_i (x_i - μ)², which is just the sample mean of the x_i.
Slide by Daniel S. Weld
Expectation Maximization (EM)
Iterate:
[E step] Compute the probability of each instance having each possible value of the hidden variable
[M step] Treating each instance as fractionally having both values, compute the new parameter values
The E and M steps repeat, with the means improving on each round, until the estimates stop changing.
Slide by Daniel S. Weld
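Putting the whole loop together for this simplest version, a two-component Gaussian mixture with known, shared variance where only the means are estimated; the data points, initial guesses, and variance below are made up for illustration.

```python
import numpy as np

def em_two_gaussians(x, init=(0.03, 0.05), sigma=0.01, iters=20):
    """EM for a mixture of two Gaussians with known variance; only the means are learned."""
    mu = np.array(init, dtype=float)                   # initial guesses (randomly chosen, in general)
    for _ in range(iters):
        # E step: probability of each instance having come from each component.
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)    # fractional "coloring" of instances
        # M step: new means are responsibility-weighted averages of the data.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# Made-up points clustered around two unknown means on the .01 to .09 number line.
x = np.array([0.012, 0.018, 0.022, 0.025, 0.071, 0.078, 0.082, 0.088])
print(em_two_gaussians(x))   # approximately [0.019, 0.080]
```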
The Problem with HMMs
We want more than an atomic view of words: we want many arbitrary, overlapping features of words, such as
–identity of the word (e.g., is "Wisniewski")
–ends in "-ski"
–is capitalized
–is part of a noun phrase
–is in a list of city names
–is under node X in WordNet
–is in bold font / is indented / is in a hyperlink anchor
–last person name was female
–next two words are "and Associates"
Slide by Cohen & McCallum
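A small sketch of what such overlapping, arbitrary token features might look like as input to a feature-based model; the particular features and the stand-in city list are assumptions for illustration.

```python
CITY_LIST = {"Seattle", "Boston", "Chicago"}   # stand-in for a real gazetteer

def token_features(tokens, t):
    """Arbitrary, overlapping features of the word at position t and its context."""
    w = tokens[t]
    feats = {
        "word=" + w.lower(): 1.0,
        "ends_in_ski": float(w.endswith("ski")),
        "is_capitalized": float(w[0].isupper()),
        "in_city_list": float(w in CITY_LIST),
        "prev_word=" + (tokens[t - 1].lower() if t > 0 else "<s>"): 1.0,
        "next_word=" + (tokens[t + 1].lower() if t + 1 < len(tokens) else "</s>"): 1.0,
    }
    return feats

tokens = "Yesterday Mr. Wisniewski flew to Seattle".split()
print(token_features(tokens, 2))   # features for "Wisniewski": ends_in_ski, is_capitalized, ...
```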
Finite State Models
Generative directed models: Naïve Bayes → HMMs → general generative directed models (single class → sequence → general graphs).
Their conditional counterparts: Logistic Regression → ? → General CRFs.
What is the conditional, sequence-shaped counterpart of the HMM? Linear-chain CRFs.
Problems with a Richer Representation and a Joint Model
These arbitrary features are not independent:
–Multiple levels of granularity (chars, words, phrases)
–Multiple dependent modalities (words, formatting, layout)
–Past & future
Two choices:
–Model the dependencies. Each state would need its own Bayes net, but we are already starved for training data!
–Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes), a big problem when combining evidence, as in Viterbi!
Slide by Cohen & McCallum
Discriminative and Generative Models
So far, all of our models have been generative.
–Generative models model P(y, x).
–Discriminative models model P(y | x).
P(y | x) does not include a model of P(x), so it does not need to model the dependencies between features!
Discriminative Models Are Often Better
Ultimately, what we care about is p(y|x)!
–A Bayes net describes a family of joint distributions over (y, x) whose conditionals take a certain form.
–But there are many other joint models whose conditionals also have that form.
We want to make independence assumptions among y, but not among x.
Conditional Sequence Models
We prefer a model trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).
–It can examine features, but is not responsible for generating them.
–It doesn't have to explicitly model their dependencies.
–It doesn't "waste modeling effort" trying to generate what we are given at test time anyway.
Slide by Cohen & McCallum
Finite State Models
Generative directed models: Naïve Bayes → HMMs → general generative directed models.
Their conditional counterparts: Logistic Regression → Linear-chain CRFs → General CRFs.
Key Ideas
Problem spaces
–Use of KR to represent states
–Compilation to SAT
Search
–Dynamic programming, constraint satisfaction, heuristics
Learning
–Decision trees, need for bias, ensembles
Probabilistic inference
–Bayes nets, variable elimination, decisions: MDPs
Probabilistic learning
–Naïve Bayes, parameter & structure learning
–EM
–HMMs: Viterbi, Baum-Welch
Applications
SAT, CSP, scheduling: everywhere
Planning: NASA, Xerox
Machine learning: everywhere
Probabilistic reasoning: spam filters, robot localization, etc.
http://www.cs.washington.edu/education/courses/cse473/06au/schedule/lect27.pdf