1
Information Extraction Lecture
9/21/2009
2
Wiki Pages - HowTo
Key points:
- Naming the pages - examples: [[Cohen ICML 1995]], [[Lin and Cohen ICML 2010]], [[Minkov et al IJCAI 2005]]
- Structured links: [[AddressesProblem::named entity recognition]], [[UsesMethod::absolute discounting]], [[RelatedPaper::Pang et al ACL 2002]], [[UsesDataset::Citeseer]]
- Categories: [[Category::Paper]], [[Category::Problem]], [[Category::Method]], [[Category::Dataset]]
- Rule of 2: don't create a page unless you expect 2 inlinks. A method from a paper that's not used anywhere else should be described in-line.
- No inverse links - but you can emulate these with queries.
3
Wiki Pages – HowTo, con’t
To turn in:
- Add them to the wiki
- Add links to them on your user page
- Send me an email with links to each page you want to get graded on [I may send back bug reports until people get the hang of this...]
WhenTo: three pages by 9/30 at midnight (actually 10/1 at dawn is fine).
Suggestion: think of your project and build pages for the dataset, the problem, and the (baseline) method you plan to use.
4
Projects
Some sample projects:
- Apply existing method to a new problem
- Apply new method to an existing dataset
- Build something that might help you in your research
- E.g., extract names of people (pundits, politicians, ...) from political blogs; classify folksonomy tags as person names, place names, ...
On Wed 9/29 - "turn in": one page, covering some subset of:
- What you plan to do with what data
- Why you think it's interesting
- Any relevant superpowers you might have
- How you plan to evaluate
- What techniques you plan to use
- What question you want to answer
- Who you might work with
These will be posted on the class web site.
On Friday 10/8: a similar abstract from each team. A team is (preferably) 2-3 people, but I'm flexible. Main new information: who's on what team.
5
Datasets
Dataset development is more work than you think.
- William can point you to a bunch of non-LDC and gene-entity related extraction datasets
- Tom Mitchell is making available data from his RtW project: a large graph of candidate entities and the contexts they appear in
- Other resources: FreeBase tuples, Del.icio.us tags, ...?
6
Information Extraction using HMMs
Pilfered from: Sunita Sarawagi, IIT Bombay
7
IE by text segmentation
Source: concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bib records.
Address example: House number: 4089; Building: Whispering Pines; Road: Nobel Drive; City: San Diego; State: CA; Zip: 92122
Citation example: Author: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick; Year: (1993); Title: Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media; Journal: J. Amer. Chem. Soc.; Volume: 115; Page: ...
8
Why is Text Segmentation Different?
Unbalanced vs balanced data: in entity extraction most tokens are not part of any entity, so the "NEG" class (aka "Outside") is more prevalent than any other class; in text segmentation the token classes are more balanced.
Constraints on the number of entity types: entities can occur any number of times in a document, but in an address each field (usually) appears once. See the Grenager et al optional paper.
9
Hidden Markov Models
Doubly stochastic models.
[Figure: a four-state HMM (S1-S4) with transition probabilities on the arcs (0.6, 0.4, 0.9, 0.1, 0.5, 0.8, 0.2, ...) and a per-state emission distribution over the symbols A and C.]
Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- The highest probability path P that maximizes Pr(S,P) (Viterbi)
- Training the model (Baum-Welch algorithm)
Notes: In previous models, Pr(a_i) depended just on symbols appearing within some distance, but not on the position of the symbol, i.e. not on i. To model drifting/evolving sequences we need something more powerful; hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transition, as before, but also emission. Calculating Pr(seq) is not easy, since all symbols can potentially be generated from all states: there is no single path that generates the sequence, but multiple paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols; we could enumerate all possible paths and sum the probabilities, but we can do much better by exploiting the Markov property.
10
IE with Hidden Markov Models
As models for IE - need to learn:
- Transition probabilities
- Emission probabilities
[Figure: an HMM with states Title, Author, Journal, Year; transition probabilities on the arcs (0.9, 0.5, 0.8, 0.2, 0.1, ...) and an emission distribution at each state (e.g., A/B/C: 0.6/0.3/0.1; X/Y/Z; digit patterns "dddd", "dd").]
Probabilistic transitions and outputs make the model more robust to errors and slight variations.
11
IE with Hidden Markov Models
Need to provide the structure of the HMM and the vocabulary.
[Figure: the same Title/Author/Journal/Year HMM as on the previous slide, with transition and emission probabilities.]
Probabilistic transitions and outputs make the model more robust to errors and slight variations.
12
HMM Structure
- Naïve model: one state per element
- Nested model: each element is another HMM
13
Comparing nested models
Naïve (single state per tag):
- Element length distribution: geometric (a, a^2, a^3, ...)
- Intra-tag sequencing not captured
Chain:
- Element length distribution: each length gets its own parameter
- Intra-tag sequencing captured
- Arbitrary mixing of dictionary entries possible, e.g. "California York": Pr(W|L) not modeled well
Parallel path:
- Element length distribution: each length gets a parameter
- Separates the vocabulary of different-length elements (a limited bigram model)
14
Structure choice: compare the Naïve model, the multiple-independent-HMM approach, and a search for the best variant of the nested model. Start with a maximal number of states (many parallel paths) and repeatedly merge paths as long as performance on the training set improves.
15
Another structure: Separate HMM per field
Special prefix and suffix states capture the start and end of a tag; the per-field predictions are combined somehow later.
[Figure: separate HMMs for "Road name" and "Building name", each with its own Prefix and Suffix states around internal states S1-S4.]
16
HMM Dictionary For each word (=feature), associate the probability of emitting that word Multinomial model Features of a word, Example: part of speech, capitalized or not type: number, letter, word etc Maximum entropy models (McCallum 2000), other exponential models Bikel: <word,feature> pairs and backoff Attached with each state is a dictionary that can be any probabilistic model on the content words attached with that element. The common easy case is a multinomial model. For each word, attach a probability value. Sum over all probabilities = 1. Intuitively know that particular words are less important than some top-level features of the words. These features may be overlapping. Need to train a joint probability model. Maximum entropy provides a viable approach to capture this.
17
Feature Hierarchy: search is used to find the best "cut" - a bottom-up search using a validation set to decide when to "move up". The feature hierarchy is also used for absolute discounting.
18
Learning model parameters
When the training data defines a unique path through the HMM:
- Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i)
- Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions out of i) (see the counting sketch below)
When the training data defines multiple paths: a more general EM-like algorithm (Baum-Welch).
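A minimal counting sketch of the fully supervised case (unique path), assuming labeled sequences are given as lists of (word, state) pairs; this illustrates the ratios above, and is not code from the lecture.

    from collections import defaultdict

    def train_hmm(labeled_seqs):
        """MLE estimates from labeled sequences; each sequence is a list of (word, state) pairs."""
        trans_counts = defaultdict(lambda: defaultdict(int))  # state i -> state j counts
        emit_counts = defaultdict(lambda: defaultdict(int))   # state -> word counts
        for seq in labeled_seqs:
            for word, state in seq:
                emit_counts[state][word] += 1
            for (_, s_i), (_, s_j) in zip(seq, seq[1:]):
                trans_counts[s_i][s_j] += 1
        trans = {i: {j: c / sum(js.values()) for j, c in js.items()}
                 for i, js in trans_counts.items()}
        emit = {s: {w: c / sum(ws.values()) for w, c in ws.items()}
                for s, ws in emit_counts.items()}
        return trans, emit

    # Hypothetical toy data: one segmented address
    data = [[("4089", "House"), ("Nobel", "Road"), ("Drive", "Road"),
             ("San", "City"), ("Diego", "City"), ("CA", "State"), ("92122", "Zip")]]
    transitions, emissions = train_hmm(data)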
19
Smoothing
[Figure: the Title/Journal/Author/Year HMM.]
Two kinds of missing symbols:
- Case 1: unknown over the entire dictionary
- Case 2: zero count in some state
Approaches (the first two are sketched below):
- Laplace smoothing (m-estimate): P(w | state) = (#word w in state + m*p) / (#any word in state + m)
- Absolute discounting: P(unknown) is proportional to the number of distinct tokens, i.e. P(unknown) = k' x (number of distinct symbols), and P(known) = (actual probability) - k', where k' is a small fixed constant (smaller for case 2 than for case 1)
- Data-driven
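A sketch of the first two approaches as written above; the parameter names and defaults (m, p, k_prime) are mine for illustration.

    def laplace_emission(word, state_counts, vocab_size, m=1.0):
        """m-estimate: (count(word in state) + m*p) / (total words in state + m), uniform prior p = 1/|V|."""
        p = 1.0 / vocab_size
        total = sum(state_counts.values())
        return (state_counts.get(word, 0) + m * p) / (total + m)

    def discounted_emission(word, state_counts, k_prime=1e-4):
        """Absolute discounting as stated on the slide: shave k' off each seen word's probability;
        unseen words get k' times the number of distinct seen symbols (how that mass is split
        among the specific unknown words is left aside here)."""
        total = sum(state_counts.values())
        if word in state_counts:
            return state_counts[word] / total - k_prime
        return k_prime * len(state_counts)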
20
Smoothing, con't
Data-driven, Good-Turing-like approach:
- Partition the training data into two parts
- Train on part 1
- Use part 2 to map all new tokens to UNK and treat it as a new word in the vocabulary
- OK for case 1, not good for case 2
- Bikel et al use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size)
21
Smoothing, con't
Observation: unknown symbols are more likely in some states than in others. They used absolute discounting, discounting by 1/((#distinct words in the state) + (#distinct words in any state)).
22
Using the HMM to segment
Find the highest-probability path through the HMM. Viterbi: a quadratic dynamic programming algorithm (a sketch follows below).
[Figure: the Viterbi lattice for the address "115 Grant street Mumbai ...": one column per token, with the candidate states House, Road, City, Pin in each column.]
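A compact Viterbi sketch over an address like the one in the figure; the state set and the toy start/transition/emission numbers below are made up for illustration (a real model's parameters would come from training).

    import math

    def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
        """Return the highest-probability state sequence for `words` (log-space Viterbi)."""
        V = [{s: math.log(start_p.get(s, unk)) + math.log(emit_p[s].get(words[0], unk))
              for s in states}]
        back = [{}]
        for i in range(1, len(words)):
            V.append({})
            back.append({})
            for s in states:
                best_prev, best_score = max(
                    ((p, V[i - 1][p] + math.log(trans_p[p].get(s, unk))) for p in states),
                    key=lambda t: t[1])
                V[i][s] = best_score + math.log(emit_p[s].get(words[i], unk))
                back[i][s] = best_prev
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))

    # Hypothetical toy parameters for the address "115 Grant street Mumbai"
    states = ["House", "Road", "City", "Pin"]
    start_p = {"House": 0.8, "Road": 0.1, "City": 0.05, "Pin": 0.05}
    trans_p = {"House": {"Road": 0.9, "City": 0.05, "Pin": 0.05},
               "Road":  {"Road": 0.5, "City": 0.4, "Pin": 0.1},
               "City":  {"City": 0.3, "Pin": 0.7},
               "Pin":   {"Pin": 1.0}}
    emit_p = {"House": {"115": 0.5}, "Road": {"Grant": 0.3, "street": 0.4},
              "City": {"Mumbai": 0.6}, "Pin": {}}
    print(viterbi("115 Grant street Mumbai".split(), states, start_p, trans_p, emit_p))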
23
Most Likely Path for a Given Sequence
The probability that the path is taken and the sequence is generated is a product of transition probabilities and emission probabilities.
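The equation itself was an image on the original slide; in standard HMM notation (transition probabilities a, emission probabilities e, path pi = pi_1 ... pi_L), it is usually written as

    P(x, \pi) = a_{0,\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i,\pi_{i+1}}

where a_{0,\pi_1} is the transition out of the begin state and a_{\pi_L,\pi_{L+1}} the transition into the end state.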
24
Example
[Figure: an example HMM with a begin state, emitting states 1-5, and an end state; each emitting state has an emission distribution over A, C, G, T (e.g., A 0.4, C 0.1, G 0.2, T 0.3) and the arcs carry transition probabilities (0.8, 0.6, 0.5, 0.9, 0.4, 0.2, 0.1, ...).]
25
Finding the most probable path: the Viterbi algorithm
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k. We want to compute v_end(L), the probability of the most probable path accounting for all of the sequence and ending in the end state. This can be defined recursively, and we can use dynamic programming to find it efficiently.
26
Finding the most probable path: the Viterbi algorithm
initialization: Note: this is wrong for delete states: they shouldn’t be initialized like this.
27
The Viterbi algorithm recursion for emitting states (i =1…L):
keep track of most probable path
28
The Viterbi algorithm termination:
to recover the most probable path, follow the pointers back, starting at the last position
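The equations on the last three slides were images; written out in the standard form (using the v_k(i) defined earlier), they are roughly:

    initialization:   v_{begin}(0) = 1, \qquad v_k(0) = 0 \;\text{ for } k \neq begin
    recursion (i = 1..L):   v_k(i) = e_k(x_i) \max_l \big[ v_l(i-1)\, a_{l,k} \big], \qquad ptr_i(k) = \arg\max_l v_l(i-1)\, a_{l,k}
    termination:   P(x, \pi^*) = \max_k \big[ v_k(L)\, a_{k,end} \big]
    traceback:   \pi^*_L = \arg\max_k v_k(L)\, a_{k,end}, \qquad \pi^*_{i-1} = ptr_i(\pi^*_i)

(The earlier note about delete states applies: non-emitting states need a different initialization and recursion.)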
29
Database Integration
- Augment the dictionary. Example: a list of cities.
- Assigning probabilities is a problem.
- Exploit functional dependencies. Example: Santa Barbara -> USA; Piskinov -> Georgia.
30
Information from an atlas would really help here.
[Figure: two candidate segmentations of "2001 University Avenue, Kendall Sq., Piskinov, Georgia": one with the fields House number / Road Name / Area / City / State, the other with House number / Road Name / Area / City / Country.]
31
Frequency constraints
Including constraints of the form: the same tag cannot appear in two disconnected segments. E.g., the Title in a citation cannot appear twice; a street name cannot appear twice. Not relevant for named-entity tagging kinds of problems.
32
Constrained Viterbi
Modified Viterbi: pick the best path that fits the constraints.
- Find v_k(i) for each k and i = 1, ..., n
- If position i populates field f, check: is f unfilled by 1..i-1?
  - Yes: no conflict
  - No: f was filled at some j < i; backtrack to j and spawn new Viterbi processes with v_k(j) constrained to not fill f
- Exponential in the number of fields, but OK in practice
33
Comparative Evaluation
- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule Learning Method: Rapier
- Nested Model: each state in the Naïve model replaced by an HMM
34
Results: Comparative Evaluation
Datasets: IITB student addresses (2388 instances, 17 elements); company addresses (769 instances, 6 elements); US addresses (740 instances).
The Nested model does best in all three cases (from Borkar 2001).
35
Results: Effect of Feature Hierarchy
Feature Selection showed at least a 3% increase in accuracy
36
Results: Effect of training data size
HMMs are fast learners: we get very close to the maximum accuracy with just 50 to 100 addresses.
37
HMM approach: summary
- Inter-element sequencing: outer HMM transitions
- Intra-element sequencing: inner HMM
- Element length: multi-state inner HMM
- Characteristic words: dictionary
- Non-overlapping tags: global optimization
38
Conditional Markov Models
39
HMM Learning
- Manually pick the HMM's graph (e.g. a simple model, or fully connected)
- Learn transition probabilities: Pr(s_i | s_j)
- Learn emission probabilities: Pr(w | s_i)
40
What is a "symbol"???
- Cohen => "Cohen", "cohen", "Xxxxx", "Xx", ... ?
- 4601 => "4601", "9999", "9+", "number", ... ?
Datamold: choose the best abstraction level using a holdout set (a small shape-abstraction sketch follows).
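A sketch of this kind of word-shape abstraction; the particular levels and their names are mine for illustration, not Datamold's actual feature set.

    import re

    def abstractions(token):
        """Return candidate symbols for a token, from most specific to most general."""
        shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
        short_shape = re.sub(r"(.)\1+", r"\1+", shape)   # collapse runs: "Xxxxx" -> "Xx+"
        if token.isdigit():
            coarse = "number"
        elif token.isalpha():
            coarse = "word"
        else:
            coarse = "other"
        return [token, token.lower(), shape, short_shape, coarse]

    print(abstractions("Cohen"))   # e.g. ['Cohen', 'cohen', 'Xxxxx', 'Xx+', 'word']
    print(abstractions("4601"))    # e.g. ['4601', '4601', '9999', '9+', 'number']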
41
What is a symbol? Bikel et al mix symbols from two abstraction levels
42
What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words:
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- ...
[Figure: a linear chain of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; the observation O_t is "Wisniewski", with features such as "part of noun phrase" and "ends in -ski".]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...
43
Stupid HMM tricks
[Figure: a trivial HMM: a start state with arcs Pr(red) and Pr(green) to two states that loop on themselves with Pr(red|red) = 1 and Pr(green|green) = 1.]
44
Stupid HMM tricks
[Figure: the same start/red/green HMM as on the previous slide.]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax_{y} Pr(y|x) = argmax_{y} Pr(x|y) * Pr(y)
                   = argmax_{y} Pr(y) * Pr(x1|y) * Pr(x2|y) * ... * Pr(xm|y)
Pr("I voted for Ralph Nader" | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)
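A sketch of the scoring this degenerate HMM reduces to - plain Naive Bayes over the whole sequence; the class priors and word probabilities below are made-up toy numbers.

    import math

    def nb_score(words, y, prior, word_prob, unk=1e-6):
        """log Pr(y) + sum_i log Pr(word_i | y): the 'one state per class' HMM above."""
        return math.log(prior[y]) + sum(math.log(word_prob[y].get(w, unk)) for w in words)

    prior = {"g": 0.5, "r": 0.5}   # made-up class priors (green vs red)
    word_prob = {"g": {"I": 0.01, "voted": 0.002, "for": 0.03, "Ralph": 0.001, "Nader": 0.002},
                 "r": {"I": 0.01, "voted": 0.002, "for": 0.03, "Ralph": 0.0005, "Nader": 0.0001}}
    words = "I voted for Ralph Nader".split()
    best = max(prior, key=lambda y: nb_score(words, y, prior, word_prob))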
45
HMMs = sequential NB
46
From NB to Maxent
47
From NB to Maxent
48
From NB to Maxent
Learning: set the alpha parameters to maximize the (conditional) likelihood of the data, given that we're using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions consistent with the feature constraints.
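The functional form being maximized was an image on the slide; the usual conditional maxent / logistic form, with features f_i and weights alpha_i, is

    p_\alpha(y \mid x) = \frac{1}{Z_\alpha(x)} \exp\Big( \sum_i \alpha_i f_i(x, y) \Big), \qquad
    Z_\alpha(x) = \sum_{y'} \exp\Big( \sum_i \alpha_i f_i(x, y') \Big)

and learning picks alpha to maximize \sum_{(x,y)} \log p_\alpha(y \mid x) over the training data.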
49
MaxEnt Comments
Implementation:
- All methods are iterative
- Numerical issues (underflow, rounding) are important
- For NLP-like problems with many features, modern gradient-based or Newton-like methods work well - sometimes better(?) and faster than GIS and IIS
Smoothing:
- Typically maxent will overfit the data if there are many infrequent features
- Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a penalty on the alphas)
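As a concrete (assumed) form of the Gaussian-prior option: with prior variance sigma^2, the penalized objective is usually

    \sum_{(x,y)} \log p_\alpha(y \mid x) \;-\; \sum_i \frac{\alpha_i^2}{2\sigma^2}

so large weights on rare features are discouraged.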
50
MaxEnt Comments
Performance:
- Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy
- Can't as easily extend to higher-order interactions (e.g. kernel SVMs, AdaBoost) - but see [Lafferty, Zhu, Liu ICML 2004]
- Training is relatively expensive
Embedding in a larger system:
- MaxEnt optimizes Pr(y|x), not error rate
51
What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words (identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ...).
[Figure: the same S/O linear-chain figure as before, with O_t = "Wisniewski".]
52
What is a symbol? (Same feature list and figure as above.)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
53
What is a symbol? (Same feature list and figure as above.)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
54
What is a symbol? (Same feature list and figure as above.)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
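Written out (my notation, following the idea above), the per-position maxent model conditions the current state on the observation and the previous state:

    P(s_t \mid s_{t-1}, o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\Big( \sum_i \alpha_i f_i(o_t, s_t, s_{t-1}) \Big)

with such classifiers replacing the HMM's emission and transition tables.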
55
Ratnaparkhi's MXPOST
- Sequential learning problem: predict the POS tags of words
- Uses the MaxEnt model described above
- Rich feature set
- To smooth, discard features occurring < 10 times (see the sketch below)
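A sketch of the rare-feature cutoff; the feature templates, the toy sentence and tags, and the function names are simplifications I chose for illustration, not MXPOST's actual code.

    from collections import Counter

    def extract_features(words, i, prev_tag):
        """A few MXPOST-style feature templates for position i (simplified)."""
        w = words[i]
        return [f"word={w}", f"suffix3={w[-3:]}", f"cap={w[0].isupper()}",
                f"prev_tag={prev_tag}",
                f"prev_word={words[i-1] if i > 0 else '<s>'}"]

    def filter_rare(feature_counts, cutoff=10):
        """Discard features occurring fewer than `cutoff` times."""
        return {f for f, c in feature_counts.items() if c >= cutoff}

    # Toy usage (the sentence and its tags are made up)
    sent = ["When", "will", "prof", "Cohen", "post"]
    tags = ["WRB", "MD", "NN", "NNP", "VB"]
    counts = Counter()
    for i in range(len(sent)):
        counts.update(extract_features(sent, i, tags[i - 1] if i else "<s>"))
    kept = filter_rare(counts, cutoff=1)   # toy cutoff; the slide's value is 10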
56
MXPOST
57
MXPOST: learning & inference
- GIS
- Feature selection
58
Alternative inference schemes
59
MXPost inference
60
Inference for MENE
[Figure: the sentence "When will prof Cohen post the notes ...", with a candidate tag B above each word.]
61
Inference for MXPOST
[Figure: the same sentence with candidate tags B, I, O above each word.]
(Approximate view): find the best path; the weights are now on the arcs from state to state.
62
Inference for MXPOST
[Figure: the same B/I/O lattice.]
More accurately: find the total flow to each node; the weights are now on the arcs from state to state.
63
Inference for MXPOST
[Figure: the same B/I/O lattice.]
Find the best path? tree? The weights are on hyperedges.
64
Inference for MxPOST
[Figure: the lattice over "When will prof Cohen post the notes ...", where each node now encodes the current and previous tag (iI, oI, iO, oO, ...).]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states (sketched below).
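A sketch of that beam search; local_score stands in for the per-position maxent score (e.g. log P(tag | word, history)), and the toy scorer below is made up.

    def beam_search(words, tags, local_score, beam_width=5):
        """Keep only the top-n partial tag sequences at each position."""
        beam = [((), 0.0)]                      # (partial tag sequence, cumulative score)
        for w in words:
            children = [(hist + (t,), score + local_score(w, t, hist))
                        for hist, score in beam
                        for t in tags]          # expand every hypothesis with every tag
            children.sort(key=lambda h: h[1], reverse=True)
            beam = children[:beam_width]        # discard all but the top n
        return beam[0][0]                       # best surviving tag sequence

    # Hypothetical toy scorer: prefer 'NNP' for capitalized words, 'O' otherwise.
    def local_score(word, tag, history):
        return 1.0 if (tag == "NNP") == word[0].isupper() else 0.0

    print(beam_search("When will prof Cohen post the notes".split(),
                      ["NNP", "O"], local_score, beam_width=3))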
65
MXPost results
- State-of-the-art accuracy (for 1996)
- The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art)
- The same (or similar) approaches were used for NER by Borthwick, Malouf, Manning, and others
66
MEMMs
Basic difference from ME tagging:
- ME tagging: the previous state is a feature of the MaxEnt classifier
- MEMM: build a separate MaxEnt classifier for each state
- Can build any HMM architecture you want, e.g. parallel nested HMMs, etc.
- Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun"
- Mostly a difference in viewpoint
- MEMM does allow the possibility of "hidden" states and Baum-Welch-like training
- Viterbi is the most natural inference scheme
67
MEMM task: FAQ parsing
68
MEMM features
69
MEMMs
70
Some interesting points to ponder
- "Easier to think of the observations as part of the arcs, rather than the states."
- FeatureHMM works surprisingly(?) well.
- Both approaches allow Pr(y_i | x, y_{i-1}, ...) to be determined by arbitrary features of the history.
- "Factored" MEMM