1
Information Extraction Lecture
2
Administrative issues (recap)
Some scary math: 23 sessions left (after this week), 28 people enrolled, 28 student "optional" paper presentations.
My plan: the next sessions will each have two 20-minute student presentations plus discussion/questions, and 50-60 minutes of lecture.
Vitor and I will construct a signup mechanism (Google spreadsheet?), email you the details, and post them on the web page. Procrastinate intelligently. If you don't get the email, contact Vitor (we have Andrew emails for the roster).
What you can present: any "optional" paper from the syllabus, or anything else you want that's related (check with William).
When you can present: any time after the topic has been covered in lecture, preferably soon after.
3
Projects (recap). Today, Tuesday 2/6: everyone submits an abstract (email to William, cc Vitor, plus a hardcopy). One page, covering some subset of:
- What you plan to do
- Why you think it's interesting
- Any relevant superpowers you might have
- How you plan to evaluate
- What techniques you plan to use
- What question you want to answer
- Who you might work with
These will be posted on the class web site.
The following Tuesday, 2/13: a similar abstract from each team. A team is (preferably) 2-3 people, but I'm flexible. The main new information: who's on what team.
4
Wrappers
6
[Chart: BWI results across domains (Seminars, Job Ads, Websites from DBs, BioEntities in Medline Abstracts). Annotations: same #rounds as SWI; secondary regularities?; BWI rule weights; no negative examples covered.]
7
Simple 1-step co-training for web pages
f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages.
- Feature construction: represent a page x in S as the bag of pages that link to x (its "bag of hubs").
- Learning: learn f2 from the bag-of-hubs examples, labeled with f1.
- Labeling: use f2(x) to label pages from S.
Idea: use one round of co-training to bootstrap the bag-of-words classifier into one that uses site-specific features x2/f2.
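A minimal sketch of this one-step bootstrap, assuming toy pages, a toy link structure, and scikit-learn's MultinomialNB as a stand-in for both classifiers (none of these specifics come from the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical site S: page id -> page text, plus the hub pages that link to each page.
pages = {"p1": "faculty homepage research", "p2": "course syllabus lecture",
         "p3": "faculty member publications", "p4": "course schedule homework"}
links_to = {"p1": ["hubA"], "p2": ["hubB"], "p3": ["hubA"], "p4": ["hubB"]}

# f1: a bag-of-words classifier trained elsewhere on (hypothetical) labeled pages.
bow = CountVectorizer()
X_lab = bow.fit_transform(["faculty research publications", "course lecture homework"])
f1 = MultinomialNB().fit(X_lab, ["person", "course"])

# Labeling with f1, then feature construction: represent each page x as its bag of hubs.
y_f1 = f1.predict(bow.transform(list(pages.values())))
boh = CountVectorizer(analyzer=lambda hubs: hubs)   # each "document" is already a list of hubs
X_boh = boh.fit_transform([links_to[p] for p in pages])

# Learning: f2 uses only the site-specific bag-of-hubs features, labeled by f1.
f2 = MultinomialNB().fit(X_boh, y_f1)
print(dict(zip(pages, f2.predict(X_boh))))          # f2(x) relabels the pages of S
```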
8
BOH (bag-of-hubs) representation: examples fed to the learner:
{ List1, List3, … }, PR
{ List1, List2, List3, … }, PR
{ List2, List3, … }, Other
{ List2, List3, … }, PR
…
9
Experimental results: in some cases co-training hurts; in others there is no improvement.
10
MUC-7: the last Message Understanding Conference (forerunner to ACE), around 1998.
- 200 articles in the development set (aircraft accidents)
- 200 articles in the final test (launch events)
- Targets: names of persons, organizations, locations, dates, times, currency & percentage
11
[Chart: MUC-7 named-entity results by system: LTG, NetOwl (commercial rule-based system), IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names).]
12
Borthwick et al: MENE system
MaxEnt token classifiers, with 4 tags per field: x_start, x_continue, x_end, x_unique.
Features:
- Section features
- Tokens in a window
- Lexical features of tokens in the window
- Dictionary features of tokens (is the token a firstName?)
- External system outputs for tokens (is this a NetOwl_company_start? a proteus_person_unique?)
Smooth by discarding low-count features.
No history: a Viterbi search is used to find the best consistent tag sequence (e.g. no continue without a start).
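As a concrete illustration of the four-tags-per-field scheme, here is a small hypothetical helper (not from the MENE paper) that converts per-token field labels into start/continue/end/unique tags:

```python
def mene_tags(tokens, fields):
    """Map each token's field label (or None) to a MENE-style tag."""
    tagged = []
    for i, (tok, field) in enumerate(zip(tokens, fields)):
        if field is None:
            tagged.append((tok, "other"))
            continue
        starts = i == 0 or fields[i - 1] != field          # field begins here
        ends = i == len(fields) - 1 or fields[i + 1] != field  # field ends here
        if starts and ends:
            tag = f"{field}_unique"
        elif starts:
            tag = f"{field}_start"
        elif ends:
            tag = f"{field}_end"
        else:
            tag = f"{field}_continue"
        tagged.append((tok, tag))
    return tagged

print(mene_tags(["When", "will", "prof", "Cohen", "post", "the", "notes"],
                [None, None, "person", "person", None, None, None]))
# [('When', 'other'), ('will', 'other'), ('prof', 'person_start'),
#  ('Cohen', 'person_end'), ('post', 'other'), ('the', 'other'), ('notes', 'other')]
```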
13
Viterbi in MENE
Example: "When will prof Cohen post the notes …"
[Lattice diagram: candidate tags (B B B B B B B I) for each token; Viterbi picks the best consistent tag sequence.]
14
V(t, i) = max over { j : transition j -> i is legal } of [ score(t, i) + V(t-1, j) ]
15
Dictionaries in MENE
16
MENE results (dry run)
17
MENE learning curves [figure: learning curves; the scores 92.2, 93.3, and 96.3 are marked on the chart]
18
Longer names vs. short names (annotated example):
Largest U.S. Cable Operator Makes Bid for Walt Disney. By ANDREW ROSS SORKIN. The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus. If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.
19
LTG system Another MUC-7 competitor
- Hand-coded rules for "easy" cases (amounts, etc.)
- A process of repeated tagging and "matching" for hard cases:
  1. Sure-fire (high-precision) rules for names where the type is clear ("Phillip Morris, Inc.", "The Walt Disney Company")
  2. Partial matches to the sure-fire rules are filtered with a maxent classifier (candidate filtering) using contextual information, etc.
  3. Higher-recall rules, avoiding conflicts with the partial-match output ("Phillip Morris announced today…", "Disney's …")
  4. A final partial-match & filter step on titles, with a different learned filter
- Exploits discourse/context information
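A toy sketch of the "sure-fire rule, then filtered partial match" idea; the regular expression, the example text, and the stubbed-out filter are hypothetical illustrations rather than the actual LTG rules:

```python
import re

text = ("The Walt Disney Company announced results today. "
        "Analysts expect Disney's parks division to keep growing.")

# Sure-fire rule: an explicit corporate designator makes the entity type unambiguous.
sure_fire = [m.group(0) for m in
             re.finditer(r"(?:[A-Z]\w+ )+(?:Company|Corporation|Inc\.)", text)]

def learned_filter(candidate, context):
    """Stand-in for the maxent candidate filter that would use contextual features."""
    return True  # placeholder: accept every partial match in this sketch

# Partial matches: shorter variants of names the sure-fire rules already typed.
partial = []
for name in sure_fire:
    for word in name.split():
        if word[0].isupper() and word not in {"The", "Company", "Corporation", "Inc."}:
            for m in re.finditer(word, text):
                context = text[max(0, m.start() - 30): m.end() + 30]
                if learned_filter(word, context):
                    partial.append((word, m.start()))

print(sure_fire)  # ['The Walt Disney Company']
print(partial)    # 'Walt' once, 'Disney' twice, including the later "Disney's"
```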
20
LTG Results
21
[Chart (repeated): MUC-7 named-entity results by system: LTG, NetOwl (commercial rule-based system), IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names).]
22
Information Extraction using HMMs
Pilfered from: Sunita Sarawagi, IIT Bombay
23
IE by text segmentation
Source: a concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.
Address example: "4089 Whispering Pines Nobel Drive San Diego CA 92122", segmented as House number = 4089, Building = Whispering Pines, Road = Nobel Drive, City = San Diego, State = CA, Zip = 92122.
Bibliographic example: "P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115", segmented into the fields Author, Year, Title, Journal, Volume, Page.
24
Why is Text Segmentation Different?
Unbalanced vs. balanced data: in entity extraction most tokens are not part of any entity, so the "NEG" class (aka "Outside") is more prevalent than any other class. In text segmentation the token classes are more balanced.
Constraints on the number of entity occurrences: entities can occur (m)any number of times in a document, whereas in an address each field (usually) appears once. See the Grenager et al. paper.
25
Hidden Markov Models
Doubly stochastic models. [Diagram: a four-state HMM (S1-S4) with transition probabilities and emission distributions over the symbols A and C.]
Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- Finding the highest-probability path P, i.e. the one maximizing Pr(S, P) (Viterbi)
- Training the model (the Baum-Welch algorithm)
Notes: In previous models, Pr(a_i) depended only on the symbols appearing within some distance before a_i, not on the position i of the symbol. To model drifting or evolving sequences we need something more powerful, and hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transitions, as before, but also emissions. Calculating Pr(seq) is not easy, since every symbol can potentially be generated from every state; there is no single path that generates the sequence, but rather multiple paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols, so we could enumerate all possible paths and sum their probabilities; we can do much better by exploiting the Markov property.
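As an illustration of finding Pr(S) by dynamic programming (the first algorithm listed above), here is a minimal forward-algorithm sketch; the two-state model and all of its probabilities are made up for the example rather than read off the slide's diagram:

```python
# Forward algorithm: Pr(sequence) = sum over all state paths, computed by
# dynamic programming instead of enumerating the paths.
# Hypothetical two-state HMM over symbols {A, C}; the numbers are illustrative only.
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"A": 0.9, "C": 0.1}, "S2": {"A": 0.3, "C": 0.7}}

def forward(seq):
    # alpha[s] = Pr(prefix emitted so far, current state = s)
    alpha = {s: start[s] * emit[s][seq[0]] for s in start}
    for sym in seq[1:]:
        prev = alpha
        alpha = {s: emit[s][sym] * sum(prev[r] * trans[r][s] for r in prev)
                 for s in trans}
    return sum(alpha.values())   # marginalize over the final state

print(forward("ACCA"))           # Pr("ACCA") under this toy model
```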
26
Input features
Content of the element:
- Specific keywords like street, zip, vol, pp
- Properties of words: capitalization, part of speech, is it a number?
- External database / dictionary words
- Semantic relationship between words
Sequencing:
- Inter-element sequencing
- Intra-element sequencing
- Element length
- Frequency constraints
27
IE with Hidden Markov Models
As models for IE, we need to learn the emission probabilities and the transition probabilities. [Diagram: an HMM with states such as Title, Author, Journal, Year; example emission distributions over words and number patterns (dddd, dd).] Probabilistic transitions and outputs make the model more robust to errors and slight variations.
28
IE with Hidden Markov Models
We also need to provide the structure of the HMM and its vocabulary. [Same diagram as the previous slide: states Title, Author, Journal, Year with emission and transition probabilities.] Probabilistic transitions and outputs make the model more robust to errors and slight variations.
29
HMM Structure
- Naïve model: one state per element
- Nested model: each element is itself an HMM
30
Comparing nested models
Naïve (single state per tag):
- Element length distribution: a, a², a³, …
- Intra-tag sequencing not captured
Chain:
- Element length distribution: each length gets its own parameter
- Intra-tag sequencing captured
- Arbitrary mixing of dictionary words, e.g. "California York"; Pr(W|L) not modeled well
Parallel path:
- Element length distribution: each length gets its own parameter
- Separates the vocabulary of different-length elements (a limited bigram model)
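In standard HMM terms, the contrast in length distributions can be written out (a sketch in standard notation, not taken from the slides): a single state with self-transition probability a implies a geometric length distribution, while chain and parallel-path structures give each length its own parameter.

```latex
P_{\text{single state}}(\ell) \;=\; a^{\ell-1}\,(1-a)
\qquad \text{vs.} \qquad
P_{\text{chain / parallel path}}(\ell) \;=\; p_\ell ,
\qquad \sum_{\ell} p_\ell = 1
```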
31
Structure choice: compare the Naïve model, the multiple-independent-HMM approach, and a search for the best variant of the nested model:
- Start with a maximal number of states (many parallel paths)
- Repeatedly merge paths as long as performance on the training set improves
32
Embedding an HMM in a state
33
Bigram model of Bikel et al.
Each inner model is a detailed bigram model:
- First word: conditioned on the state and the previous state
- Subsequent words: conditioned on the previous word and the state
- Special "start" and "end" symbols (which can be thought of as extra vocabulary items marking where a field begins and ends)
- Large number of parameters (training data on the order of ~60,000 words in the smallest experiment)
- Backing-off mechanism to simpler "parent" models (lambda parameters control the mixing)
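One common way to write this kind of backing-off is linear interpolation; the following is a generic sketch in that style, not necessarily Bikel et al.'s exact parameterization:

```latex
P(w_i \mid w_{i-1}, s) \;=\;
\lambda_1\,\hat{P}(w_i \mid w_{i-1}, s)
\;+\; \lambda_2\,\hat{P}(w_i \mid s)
\;+\; \lambda_3\,\hat{P}(w_i),
\qquad \lambda_1+\lambda_2+\lambda_3 = 1
```

Here the \hat{P} terms are count-based estimates, and the lambda parameters control how much probability mass goes to the simpler "parent" models.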
34
Another structure: Separate HMM per field
Special prefix and suffix states capture the start and end of a tag; the per-field predictions are combined somehow later. [Diagram: separate HMMs for "Road name" and "Building name", each with its own prefix and suffix states around internal states S1-S4.]
35
HMM Dictionary
- For each word (= feature), associate the probability of emitting that word (multinomial model)
- Features of a word: for example, part of speech, capitalized or not, type (number, letter, word, etc.)
- Maximum entropy models (McCallum 2000) and other exponential models
- Bikel: <word, feature> pairs and backoff
Notes: Attached to each state is a dictionary, which can be any probabilistic model over the content words of that element. The common, easy case is a multinomial model: attach a probability value to each word, with the probabilities summing to 1. Intuitively, particular words are less important than some top-level features of the words, and these features may overlap, so we need to train a joint probability model; maximum entropy provides a viable approach to capturing this.
36
Feature Hierarchy: search is used to find the best "cut", a bottom-up search using a validation set to decide when to "move up". The feature hierarchy is also used for absolute discounting.
37
Learning model parameters
When the training data defines a unique path through the HMM:
- Transition probabilities: P(state i -> state j) = (#transitions from i to j) / (total #transitions out of state i)
- Emission probabilities: P(state i emits symbol k) = (#times k is generated from i) / (total #emissions from state i)
When the training data defines multiple paths: use a more general EM-like algorithm (Baum-Welch).
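A minimal sketch of these counting estimates for the case where the training data defines a unique path; the toy field names and tokens are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical training sequences: (token, state) pairs defining a unique path.
training = [
    [("4089", "House"), ("Nobel", "Road"), ("Drive", "Road"), ("San", "City"), ("Diego", "City")],
    [("115", "House"), ("Grant", "Road"), ("street", "Road"), ("Mumbai", "City")],
]

trans_counts = defaultdict(Counter)   # state i -> Counter of next states j
emit_counts = defaultdict(Counter)    # state i -> Counter of emitted tokens

for seq in training:
    for (tok, state), (_, nxt) in zip(seq, seq[1:]):
        trans_counts[state][nxt] += 1
    for tok, state in seq:
        emit_counts[state][tok] += 1

# Maximum-likelihood estimates: normalize each row of counts.
trans_prob = {i: {j: c / sum(cnt.values()) for j, c in cnt.items()}
              for i, cnt in trans_counts.items()}
emit_prob = {i: {w: c / sum(cnt.values()) for w, c in cnt.items()}
             for i, cnt in emit_counts.items()}

print(trans_prob["Road"])   # {'Road': 0.5, 'City': 0.5}
print(emit_prob["City"])    # 'San', 'Diego', 'Mumbai' each get about 0.33
```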
38
Smoothing
[Diagram: citation HMM with states Title, Journal, Author, Year]
Two kinds of missing symbols:
- Case 1: the symbol is unknown over the entire dictionary
- Case 2: the symbol has a zero count in some state
Approaches:
- Laplace smoothing (m-estimate): P(w | state) = (#occurrences of w in the state + m·p) / (#any word in the state + m)
- Absolute discounting: P(unknown) is proportional to the number of distinct tokens, P(unknown) = k' × (number of distinct symbols) and P(known) = (actual probability) - k', where k' is a small fixed constant (smaller for case 2 than for case 1)
- Data-driven
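As a small illustration of the m-estimate, here is a hypothetical sketch with toy counts and a toy vocabulary; the choice of m and of a uniform prior p over the vocabulary are assumptions for the example:

```python
from collections import Counter

def m_estimate(emit_counter, vocab, m=1.0):
    """Smoothed P(w | state) = (count(w, state) + m*p) / (count(*, state) + m), p uniform."""
    total = sum(emit_counter.values())
    p = 1.0 / len(vocab)                 # uniform prior over the vocabulary
    return {w: (emit_counter[w] + m * p) / (total + m) for w in vocab}

# Toy usage (counts and vocabulary are illustrative):
city_counts = Counter({"San": 1, "Diego": 1, "Mumbai": 1})
vocab = {"San", "Diego", "Mumbai", "Grant", "street", "115"}
smoothed = m_estimate(city_counts, vocab, m=2.0)
print(round(smoothed["Mumbai"], 3), round(smoothed["Grant"], 3))  # seen vs. zero-count word
```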
39
Smoothing (continued)
Data-driven, Good-Turing-like approach:
- Partition the training data into two parts
- Train on part 1
- Use part 2 to map all new tokens to UNK and treat UNK as a new word in the vocabulary
This is OK for case 1 but not good for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size).
40
Smoothing (continued)
Observation: unknown symbols are more likely in some states than in others. Absolute discounting was used, discounting by 1 / ((#distinct words in the state) + (#distinct words in any state)).
41
Using the HMM to segment
Find the highest-probability path through the HMM. Viterbi: a quadratic dynamic-programming algorithm. [Lattice diagram: the tokens "115 Grant street Mumbai …" against candidate states House, Road, City, Pin.]
42
Most Likely Path for a Given Sequence
The probability that the path is taken and the sequence is generated is a product of transition probabilities and emission probabilities.
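In standard HMM notation, with transition probabilities a and emission probabilities e, this joint probability of a path π and a sequence x of length L is:

```latex
P(x, \pi) \;=\; a_{0\,\pi_1}\,\prod_{i=1}^{L} e_{\pi_i}(x_i)\; a_{\pi_i\,\pi_{i+1}}
```

where a_{0 π_1} is the transition out of the begin state and a_{π_L π_{L+1}} is the transition into the end state.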
43
Example: [Diagram of a small HMM with begin and end states, numbered states 1-5, transition probabilities, and emission distributions over the symbols A, C, G, T.]
44
Finding the most probable path: the Viterbi algorithm
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k. We want to compute v_end(L), the probability of the most probable path accounting for all of the sequence and ending in the end state. This quantity can be defined recursively, and dynamic programming lets us compute it efficiently.
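In the usual notation, the recursive definition and the target quantity are:

```latex
v_k(i) \;=\; e_k(x_i)\,\max_{j}\bigl[a_{jk}\,v_j(i-1)\bigr],
\qquad
v_{\text{end}}(L) \;=\; \max_{j}\bigl[a_{j,\text{end}}\,v_j(L)\bigr]
```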
45
Finding the most probable path: the Viterbi algorithm
Initialization: the begin state gets probability 1 at position 0, and every other state gets probability 0. (Note: this is wrong for delete states; they shouldn't be initialized like this.)
46
The Viterbi algorithm, recursion for emitting states (i = 1…L): extend the best path ending in each possible predecessor state j with the transition into state k and the emission of x_i, and keep back-pointers so the most probable path can be recovered.
47
The Viterbi algorithm, termination: take the best transition from any state into the end state. To recover the most probable path, follow the back-pointers starting from the final state.
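Putting the initialization, recursion, and traceback together, here is a minimal Viterbi sketch in the text-segmentation setting; the tiny address model, its states, and all of its probabilities are made up for illustration:

```python
import math

# Hypothetical segmentation HMM: states emit address tokens (numbers are illustrative).
states = ["House", "Road", "City"]
start = {"House": 0.8, "Road": 0.1, "City": 0.1}
trans = {"House": {"House": 0.1, "Road": 0.8, "City": 0.1},
         "Road": {"House": 0.05, "Road": 0.6, "City": 0.35},
         "City": {"House": 0.05, "Road": 0.05, "City": 0.9}}
emit = {"House": {"115": 0.9, "Grant": 0.05, "street": 0.03, "Mumbai": 0.02},
        "Road": {"115": 0.02, "Grant": 0.5, "street": 0.45, "Mumbai": 0.03},
        "City": {"115": 0.02, "Grant": 0.08, "street": 0.05, "Mumbai": 0.85}}

def viterbi(tokens):
    # v[k] = log-probability of the best path over the prefix ending in state k.
    v = {k: math.log(start[k] * emit[k][tokens[0]]) for k in states}
    back = []                                # one back-pointer dict per later token
    for tok in tokens[1:]:
        prev = v
        v, ptr = {}, {}
        for k in states:
            best_j = max(states, key=lambda j: prev[j] + math.log(trans[j][k]))
            v[k] = prev[best_j] + math.log(trans[best_j][k]) + math.log(emit[k][tok])
            ptr[k] = best_j
        back.append(ptr)
    # Termination: pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["115", "Grant", "street", "Mumbai"]))   # ['House', 'Road', 'Road', 'City']
```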
48
Database Integration
- Augment the dictionary, for example with a list of cities
- Assigning probabilities to the dictionary entries is a problem
- Exploit functional dependencies, e.g. Santa Barbara -> USA, Piskinov -> Georgia
49
Information from an atlas would really help here.
Address: "2001 University Avenue, Kendall Sq., Piskinov, Georgia"
- Segmentation 1: 2001 = House number, University Avenue = Road Name, Kendall Sq. = Area, Piskinov = City, Georgia = State
- Segmentation 2: 2001 = House number, University Avenue = Road Name, Kendall Sq. = Area, Piskinov = City, Georgia = Country
50
Frequency constraints
Include constraints of the form: the same tag cannot appear in two disconnected segments. E.g., the title of a citation cannot appear twice, and a street name cannot appear twice. This is not relevant for named-entity-tagging kinds of problems.
51
Constrained Viterbi: [slide contrasts the original Viterbi recurrence with a modified recurrence that enforces the non-repetition constraints]
52
Comparative Evaluation
- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule-learning method: Rapier
- Nested model: each state in the Naïve model replaced by an HMM
53
Results: Comparative Evaluation
Datasets: IITB student addresses (2388 instances, 17 elements); company addresses (769 instances, 6 elements); US addresses (740 instances). The Nested model does best in all three cases (from Borkar 2001).
54
Results: Effect of Feature Hierarchy
Feature Selection showed at least a 3% increase in accuracy
55
Results: Effect of training data size
HMMs are fast learners: we reach very close to the maximum accuracy with just 50 to 100 addresses.
56
HMM approach: summary
- Inter-element sequencing -> outer HMM transitions
- Intra-element sequencing -> inner HMM
- Element length -> multi-state inner HMM
- Characteristic words -> dictionary
- Non-overlapping tags -> global optimization