Information Extraction
Rayid Ghani, IR Seminar, 11/28/00


What is IE?

- Analyze unrestricted text in order to extract specific types of information
- Attempt to convert unstructured text documents into database entries
- Operate at many levels of the language

Task: Extract Speaker, Title, Location, Time, Date from a seminar announcement

  Dr. Gibbons is spending his sabbatical from Bell Labs with us. His work
  bridges databases, datamining and theory, with several patents and
  applications to commercial DBMSs.
  Christos

  Date: Monday, March 20, 2000
  Time: 3:30-5:00 (Refreshments provided)
  Place: 4623 Wean Hall

  Phil Gibbons, Carnegie Mellon University
  "The Aqua Approximate Query Answering System"

  In large data recording and warehousing environments, providing an exact
  answer to a complex query can take minutes, or even hours, due to the amount
  of computation and disk I/O required. Moreover, given the current trend
  towards data analysis over gigabytes, terabytes, and even petabytes of data,
  these query response times are increasing despite improvements in ...

Task: Extract question/answer pairs from a FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal
  programs to work properly. They are as far as I know the informal standard
  agreed upon by commercial comms software developers for the Arc.

  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This is
  to avoid the well known serial port chip bugs. The modem's DCD (Data Carrier
  Detect) signal has been re-routed to the Arc's RI (Ring Indicator): most
  modems broadcast a software RING signal anyway, and even then it's really
  necessary to detect it for the modem to answer the call.

  2.7) The sound from the speaker port seems quite muffled. How can I get
  unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove high
  frequency harmonics from the sound output. To bypass the filter, hook into
  the Unfiltered port. You need to have a capacitor. Look for LM324 (chip 39)
  and hook the capacitor like this:

Task: Extract Title, Author, Institution, and Abstract from a research paper.

Task: Extract Acquired and Acquiring Companies from a WSJ article

  Sara Lee to Buy 30% of DIM
  Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest
  in Paris-based DIM S.A., a subsidiary of BIC S.A., at a cost of about 20
  million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2
  million dollars. The investment includes the purchase of 5 million newly
  issued DIM shares valued at about 5 million dollars, and a loan of about 15
  million dollars, it said. The loan is convertible into an additional 16
  million DIM shares, it noted. The proposed agreement is subject to approval
  by the French government, it said.

Types of IE Systems

- Structured texts (such as web pages with tabular information)
- Semi-structured texts (such as online personals)
- Free text (such as news articles)

Problems with Manual IE

- Cannot adapt to domain changes
- Lots of human effort needed: 1500 human hours (Riloff 95)
- Solution: use machine learning

Why is IE difficult? There are many ways of expressing the same fact:

- BNC Holdings Inc named Ms G Torretta as its new chairman.
- Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
- Ms. Gina Torretta took the helm at BNC Holdings Inc.
- After a long boardroom struggle, Mr Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms Torretta.

Named Entity Extraction

Can be either a two-step or a single-step process:
- Extraction => Classification
- Extraction-Classification
- Classification (Collins & Singer 99)

Information Extraction with HMMs [Seymore & McCallum ‘99] [Freitag & McCallum ‘99]

Parameters = P(s|s') and P(o|s) for all states in S = {s_1, s_2, ...}
Emissions = words
Training = maximize the probability of the training observations (+ prior)
For IE, states indicate the "database field" being extracted.
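To make this concrete, here is a minimal sketch (mine, not from the talk) of estimating P(s|s') and P(o|s) from labeled IE data, where each token carries the database field it belongs to. The field names and the tiny training fragment are invented for illustration; in practice the counts would be smoothed with a prior.

```python
# Sketch: maximum-likelihood HMM parameter estimation for IE,
# where states are database fields and emissions are words.
from collections import defaultdict

def train_hmm(tagged_sequences):
    """tagged_sequences: list of [(word, field), ...]; returns P(s|s'), P(o|s)."""
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    for seq in tagged_sequences:
        prev = "<start>"
        for word, field in seq:
            trans_counts[prev][field] += 1        # count s' -> s transitions
            emit_counts[field][word.lower()] += 1  # count word emissions per field
            prev = field

    def normalize(counts):
        # turn raw counts into conditional probabilities
        # (a real system would add a prior / smoothing here)
        return {ctx: {x: c / sum(row.values()) for x, c in row.items()}
                for ctx, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

# toy labeled fragment of a seminar announcement (invented)
data = [[("Place", "other"), (":", "other"), ("4623", "location"),
         ("Wean", "location"), ("Hall", "location"),
         ("Phil", "speaker"), ("Gibbons", "speaker")]]
P_trans, P_emit = train_hmm(data)
print(P_trans["location"])   # e.g. P(location|location), P(speaker|location)
```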

Regrets with HMMs

1. Would prefer a richer representation of text: multiple overlapping features, whole chunks of text.
   Example line features: length of line, line is centered, percent of non-alphabetics, total amount of white space, line contains two verbs, line begins with a number, line is grammatically a question.
   Example word features: identity of word, word is in all caps, word ends in "-tion", word is part of a noun phrase, word is in bold font, word is on left hand side of page, word is under node X in WordNet.

2. HMMs are generative models of the text, modeling P({s...}, {o...}). Generative models do not easily handle overlapping, non-independent features. Would prefer a conditional model: P({s...} | {o...}).

Solution: a new probabilistic sequence model

Traditional HMM: separate transition and emission distributions, P(s|s') and P(o|s).
Maximum Entropy Markov Model (MEMM): a single next-state distribution P(s|o,s'), represented by an exponential model fit by maximum entropy. (For the time being, capture the dependency on s' with |S| independent functions P_s'(s|o), one per source state.)

Old graphical model (HMM): s_{t-1} -> s_t and s_t -> o_t, with parameters P(s|s') and P(o|s).
New graphical model (MEMM): both s_{t-1} and o_t point into s_t, with a single parameter function P(s|o,s').
Standard belief propagation still applies: the forward-backward procedure, Viterbi, and Baum-Welch follow naturally.
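As an illustration of what changes at decoding time, the sketch below (mine, not the paper's code) runs Viterbi where the per-state next-state distribution P_s'(s|o) replaces the separate transition and emission terms. The state names follow the FAQ line categories used later in the talk, and the classifier is a stub standing in for the trained maximum entropy models.

```python
# Sketch: Viterbi decoding for an MEMM, driven by P_s'(s|o) stubs.
import math

STATES = ["head", "question", "answer", "tail"]   # assumed line categories

def p_next(prev_state, obs):
    """Stub for P_s'(s|o): one conditional model per source state.
    A real system would plug in the trained maximum entropy model for
    prev_state; a uniform distribution stands in so the sketch runs."""
    return {s: 1.0 / len(STATES) for s in STATES}

def memm_viterbi(observations, start_state="head"):
    """Most likely state sequence under the per-state conditional models."""
    delta = [{s: math.log(p_next(start_state, observations[0])[s]) for s in STATES}]
    back = [{}]
    for t in range(1, len(observations)):
        delta.append({})
        back.append({})
        for s in STATES:
            prev, score = max(
                ((p, delta[t - 1][p] + math.log(p_next(p, observations[t])[s]))
                 for p in STATES),
                key=lambda x: x[1])
            delta[t][s], back[t][s] = score, prev
    last = max(STATES, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(memm_viterbi(["Archive-name: acorn/faq/part2",
                    "2.6) What configuration of serial cable should I use?",
                    "Here follows a diagram of the necessary connections ..."]))
```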

State Transition Probabilities Based on Overlapping Features

Model P_s'(s|o) in terms of multiple arbitrary, overlapping (binary) features. Example observation feature tests:
- o is the word "apple"
- o is capitalized
- o is on a left-justified line
The actual feature, f, depends on both a binary observation feature test, b, and a destination state, s.
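Written out (notation adapted from the underlying MEMM formulation; the equation itself is not on the slide), each feature pairs a binary observation test b with a destination state s:

```latex
f_{\langle b, s\rangle}(o_t, s_t) =
  \begin{cases}
    1 & \text{if } b(o_t) \text{ is true and } s_t = s,\\
    0 & \text{otherwise.}
  \end{cases}
```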

Maximum Entropy Constraints

Maximum entropy is based on the principle that the best model for the data is the one that is consistent with certain constraints derived from the training data, but otherwise makes the fewest possible assumptions.

Constraints: for every feature f_a, the data average equals the model expectation over the m training observations for the source state,

  (1/m) sum_k f_a(o_k, s_k)  =  (1/m) sum_k sum_s P_s'(s | o_k) f_a(o_k, s)

Maximum Entropy while Satisfying Constraints

When constraints are imposed in this way, the constraint-satisfying probability distribution that has maximum entropy is guaranteed to be:
(1) unique
(2) the same as the maximum likelihood solution for this model
(3) in exponential form:

  P_s'(s|o) = (1/Z(o, s')) exp( sum_a lambda_a f_a(o, s) )

[Della Pietra, Della Pietra & Lafferty '97]

Learn the parameters by an iterative procedure: Generalized Iterative Scaling (GIS).

Generalized Iterative Scaling [Darroch & Ratcliff '72]

A standard algorithm for finding the feature weights lambda_a of the exponential model. For each state, learn the parameters of its "black box" next-state function P_s'(s|o) as follows:

1. Calculate the relative frequency of each feature in the training data to get the expected value of each feature "according to the training data".
2. Start iteration 0 of GIS with arbitrary parameter values.
3. At iteration j, use the current parameter values to calculate the expected value of each feature "according to the model".
4. Update the parameters to move the expected value of each feature "according to the model" closer to the expected value "according to the training data"; repeat steps 3-4 until convergence.
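A toy implementation may make the loop concrete. The sketch below (my illustration, not the authors' code) trains one conditional maximum entropy model with GIS over binary features; the feature names and training examples are invented, and a slack feature is added so that every example activates the same total feature mass, as GIS requires.

```python
# Sketch: GIS training of a single conditional maxent model P(label | features).
import math
from collections import defaultdict

def train_gis(examples, labels, n_iter=100):
    """examples: list of (set_of_active_binary_features, label)."""
    C = max(len(fs) for fs, _ in examples) + 1   # +1 leaves room for the slack feature
    SLACK = "__slack__"

    # 1. expected value of each (feature, label) pair "according to the training data"
    emp = defaultdict(float)
    for fs, y in examples:
        for f in fs:
            emp[(f, y)] += 1.0
        emp[(SLACK, y)] += C - len(fs)           # slack value pads feature mass up to C
    lam = defaultdict(float)                      # 2. start with arbitrary (zero) weights

    def predict(fs):
        scores = {y: sum(lam[(f, y)] for f in fs) + (C - len(fs)) * lam[(SLACK, y)]
                  for y in labels}
        z = sum(math.exp(v) for v in scores.values())
        return {y: math.exp(v) / z for y, v in scores.items()}

    for _ in range(n_iter):
        # 3. expected value of each (feature, label) pair "according to the model"
        model = defaultdict(float)
        for fs, _ in examples:
            p = predict(fs)
            for y in labels:
                for f in fs:
                    model[(f, y)] += p[y]
                model[(SLACK, y)] += p[y] * (C - len(fs))
        # 4. nudge weights so model expectations move toward the data averages
        #    (features never seen with a label keep weight 0 in this toy version)
        for key, e in emp.items():
            if model[key] > 0:
                lam[key] += math.log(e / model[key]) / C
    return lam, predict

examples = [({"contains-question-mark", "begins-with-number"}, "question"),
            ({"indented", "contains-alphanum"}, "answer"),
            ({"blank"}, "answer")]
lam, predict = train_gis(examples, labels=["question", "answer"])
print(predict({"contains-question-mark"}))   # distribution over labels for one active feature
```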

Experimental Data

38 files belonging to 7 UseNet FAQs. Example (the acorn FAQ excerpt shown earlier):

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal
  programs to work properly. ...

Procedure: for each FAQ, train on one file, test on the others; average the results.

Features in Experiments (24 binary line features):

begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
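These tests are simple functions of a line's text. Here is a sketch (my own; the talk does not give the exact definitions, so these are plausible approximations) of how a few of them might be computed:

```python
# Sketch: computing a handful of the binary line features listed above.
import re

def line_features(line, prev_line=""):
    stripped = line.strip()
    return {
        "begins-with-number":         bool(re.match(r"\s*\d", line)),
        "begins-with-question-word":  bool(re.match(r"\s*(what|how|why|when|where|who)\b",
                                                    line, re.IGNORECASE)),
        "blank":                      stripped == "",
        "contains-question-mark":     "?" in line,
        "contains-http":              "http" in line,
        "ends-with-question-mark":    stripped.endswith("?"),
        "first-alpha-is-capitalized": next((c.isupper() for c in line if c.isalpha()), False),
        "indented":                   line.startswith((" ", "\t")),
        "indented-1-to-4":            1 <= len(line) - len(line.lstrip(" ")) <= 4,
        "only-punctuation":           stripped != "" and all(not c.isalnum() for c in stripped),
        "prev-is-blank":              prev_line.strip() == "",
        "shorter-than-30":            len(stripped) < 30,
    }

# print the features that fire on one FAQ line
print([f for f, v in
       line_features("2.6) What configuration of serial cable should I use?").items() if v])
```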

Models Tested

- ME-Stateless: a single maximum entropy classifier applied to each line independently.
- TokenHMM: a fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, except that the lines in a document are first converted to sequences of features.
- MEMM: the maximum entropy Markov model described in this talk.

Results

Conclusions

- Presented a new probabilistic sequence model based on maximum entropy: it handles arbitrary overlapping features and is a conditional model.
- Showed positive experimental results on FAQ segmentation.
- Showed variations for factored state, a reduced-complexity model, and reinforcement learning.