Sources of Success for Information Extraction Methods
Joseph Smarr
Seminar for Computational Learning and Adaptation, Stanford University, October 25, 2001
Based on research conducted at UC San Diego in Summer 2001 with Charles Elkan and David Kauchak

Overview and Themes: Identifying "Sources of Success"
- Brief overview of the Information Extraction (IE) paradigm and current methods
- Getting "under the hood" of current systems to understand the sources of their performance and limitations
- Identifying new sources of information to exploit for increased performance and usefulness

Motivation for Information Extraction
- Abundance of freely available text in digital form (WWW, MEDLINE, etc.)
- Information contained in un-annotated text is largely inaccessible to computers
- Much of this information appears "ripe for the plucking" without requiring full text understanding

Highly Structured Example: Amazon.com Book Info Pages
Desired info: title, author(s), price, availability, etc.

Partially Structured Example: SCLA Speaker Announcements
Desired info: title, speaker, date, abstract, etc.

Natural Text Example: MEDLINE Journal Abstracts
Desired info: subject size, study type, condition studied, etc.

BACKGROUND: The most challenging aspect of revision hip surgery is the management of bone loss. A reliable and valid measure of bone loss is important since it will aid in future studies of hip revisions and in preoperative planning. We developed a measure of femoral and acetabular bone loss associated with failed total hip arthroplasty. The purpose of the present study was to measure the reliability and the intraoperative validity of this measure and to determine how it may be useful in preoperative planning. METHODS: From July 1997 to December 1998, forty-five consecutive patients with a failed hip prosthesis in need of revision surgery were prospectively followed. Three general orthopaedic surgeons were taught the radiographic classification system, and two of them classified standardized preoperative anteroposterior and lateral hip radiographs with use of the system. Interobserver testing was carried out in a blinded fashion. These results were then compared with the intraoperative findings of the third surgeon, who was blinded to the preoperative ratings. Kappa statistics (unweighted and weighted) were used to assess correlation. Interobserver reliability was assessed by examining the agreement between the two preoperative raters. Prognostic validity was assessed by examining the agreement between the assessment by either Rater 1 or Rater 2 and the intraoperative assessment (reference standard). RESULTS: With regard to the assessments of both the femur and the acetabulum, there was significant agreement (p < …). There was also significant agreement (p < …). CONCLUSIONS: With use of the newly developed classification system, preoperative radiographs are reliable and valid for assessment of the severity of bone loss that will be found intraoperatively.

Current Types of IE Systems
- Hand-built systems
  – Often effective, but slow and expensive to build and adapt
- Stochastic generative models
  – HMMs, N-grams, PCFGs, etc.
  – Keep separate distributions for "content" and "filler" states
- Induced rule-based systems
  – Learn to identify local "landmarks" at the beginning and end of target information

Formalization of Information Extraction
- Performance task:
  – Extract specific tokens from a set of documents that contain the desired information
- Performance measure:
  – Precision: # correct returned / total # returned
  – Recall: # correct returned / total # correct
  – F1: harmonic mean of precision and recall
- Learning paradigm:
  – Supervised learning on a set of documents with target fields manually labeled
  – Usually train/test on one field at a time
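Written out explicitly (these are the standard definitions the slide names):

```latex
\mathrm{Precision} = \frac{\#\ \text{correct returned}}{\#\ \text{returned}}
\qquad
\mathrm{Recall} = \frac{\#\ \text{correct returned}}{\#\ \text{correct in the documents}}
\qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```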

IE as a Classification Task: Token Extraction as Boundary Detection
- Input: a linear sequence of tokens, e.g. "Date : Thursday , October 25 Time : 4 : 30 PM"
- Method: binary classification of inter-token boundaries as start of content, end of content, or unimportant
- Output: the tokens between an identified start/end boundary pair, e.g. "Thursday , October 25"
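A minimal sketch of this representation (illustrative Python, not the original system's code):

```python
# A document is a flat token sequence; boundary i sits between tokens[i-1]
# and tokens[i]. Extraction reduces to classifying each boundary as a field
# start, a field end, or unimportant, then returning the tokens in between.
tokens = ["Date", ":", "Thursday", ",", "October", "25"]

start, end = 2, 6          # boundaries predicted as field start / field end
field = tokens[start:end]
print(" ".join(field))     # -> Thursday , October 25
```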

Representation of Boundary Classifiers
- "Boundary detectors" are pairs of token sequences ⟨p, s⟩
  – A detector matches a boundary iff p matches the text before the boundary and s matches the text after it
  – Detectors can contain wildcards, e.g. "capitalized word", "number", etc.
- Example: a detector with prefix [Date, :] and suffix [⟨capitalized word⟩] matches the beginning of "Thursday" in "Date: Thursday, October 25"
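A sketch of detector matching under this representation; the wildcard encoding and helper names are illustrative assumptions, not the original implementation:

```python
from typing import Callable, List, Sequence, Union

Token = str
# A pattern element is either a literal token or a wildcard predicate
# (e.g. "capitalized word", "number").
Pattern = Sequence[Union[Token, Callable[[Token], bool]]]

def matches(pattern: Pattern, window: List[Token]) -> bool:
    """True iff every pattern element matches its corresponding token."""
    if len(window) != len(pattern):
        return False
    return all(tok == p if isinstance(p, str) else p(tok)
               for p, tok in zip(pattern, window))

def detector_fires(p: Pattern, s: Pattern, tokens: List[Token], i: int) -> bool:
    """A detector <p, s> matches boundary i iff p matches the tokens just
    before the boundary and s matches the tokens just after it."""
    return (i >= len(p) and
            matches(p, tokens[i - len(p):i]) and
            matches(s, tokens[i:i + len(s)]))

capitalized = lambda t: t[:1].isupper()   # "capitalized word" wildcard
tokens = ["Date", ":", "Thursday", ",", "October", "25"]
print(detector_fires(["Date", ":"], [capitalized], tokens, 2))  # True
```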

Boosted Wrapper Induction (BWI): Exemplar of Current Rule-Based Systems
- Wrapper induction is a high-precision, low-recall learner that performs well on highly structured tasks
- Boosting is a technique for combining multiple weak learners into a strong learner by re-weighting examples
- Boosted Wrapper Induction (BWI) was proposed by Freitag and Kushmerick in 2000 as the marriage of these two techniques

BWI Algorithm
- Given a set of documents with labeled fore and aft boundaries, induce:
  – F: a set of "fore detectors"
  – A: a set of "aft detectors"
  – H: a histogram of field lengths (for pairing fore and aft detectors)
- To learn each boundary detector:
  – Start with an empty rule
  – Exhaustively enumerate all extensions up to lookahead length L
  – Add the best-scoring token extension
  – Repeat until no extension improves the score
- After learning a new detector:
  – Re-weight documents according to AdaBoost (down-weight correctly covered docs, up-weight incorrectly covered docs, normalize all weights)
  – Repeat the process, learning a new rule and re-weighting each time
  – Stop after a predetermined number of iterations
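A minimal sketch of the outer boosting loop just described; the interfaces (`learn_detector`, `detector.predict`, `ex.label`) are illustrative assumptions, and the greedy lookahead search inside `learn_detector` is omitted:

```python
import math

def boost_detectors(examples, learn_detector, num_rounds):
    """AdaBoost-style accumulation of boundary detectors (a sketch).
    Each round learns one weak detector against the current weights, then
    re-weights: correctly covered examples go down, mistakes go up."""
    weights = [1.0 / len(examples)] * len(examples)
    ensemble = []
    for _ in range(num_rounds):
        detector = learn_detector(examples, weights)
        # Weighted error of this detector on the boundary-labeling task.
        err = sum(w for ex, w in zip(examples, weights)
                  if detector.predict(ex) != ex.label)
        alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
        # Down-weight correct, up-weight incorrect, then renormalize.
        weights = [w * math.exp(-alpha if detector.predict(ex) == ex.label
                                else alpha)
                   for ex, w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, detector))
    return ensemble   # detectors vote with confidence alpha at test time
```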

Summary of Original BWI Results
- BWI gives state-of-the-art performance on highly structured and partially structured tasks
- No systematic analysis of why BWI performs well
- BWI was proposed as a solution for natural text IE, but no tests were conducted

Goals of Our Research
- Understand specifically how boosting contributes to BWI's performance
- Investigate the relationship between performance and task regularity
- Identify new sources of information to improve performance, particularly for natural language tasks

Comparison Algorithm: Sequential Wrapper Induction (SWI)
- Same formulation as BWI, but uses set covering instead of boosting to learn multiple rules:
  – Find the highest-scoring rule
  – Remove all positive examples covered by the new rule
  – Stop when all positive examples have been removed
- Scoring function – two choices:
  – Greedy-SWI: most positive examples covered without covering any negative examples
  – Root-SWI: sqrt(W+) − sqrt(W−), where W+ and W− are the total weight of positive and negative examples covered
  – BWI uses root scoring, but many set covering methods use greedy scoring
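The two scoring functions plus the set-covering loop, as a sketch (rule objects with a `covers` method, and unit example weights, are assumed interfaces, not the original code):

```python
import math

def greedy_score(w_pos: float, w_neg: float) -> float:
    """Greedy-SWI: positives covered, but covering any negative disqualifies."""
    return w_pos if w_neg == 0 else float("-inf")

def root_score(w_pos: float, w_neg: float) -> float:
    """Root scoring (BWI and Root-SWI): sqrt(W+) - sqrt(W-)."""
    return math.sqrt(w_pos) - math.sqrt(w_neg)

def swi(positives, negatives, candidate_rules, score):
    """Set covering: pick the best rule, remove the positives it covers,
    stop when everything is covered. Contrast with boosting, which only
    re-weights examples and so can keep learning past full coverage."""
    uncovered, rules = set(positives), []
    while uncovered:
        best = max(candidate_rules,
                   key=lambda r: score(sum(1 for ex in uncovered if r.covers(ex)),
                                       sum(1 for ex in negatives if r.covers(ex))))
        newly_covered = {ex for ex in uncovered if best.covers(ex)}
        if not newly_covered:
            break  # no remaining rule covers any uncovered positive
        rules.append(best)
        uncovered -= newly_covered
    return rules
```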

Component Matrix of Algorithms

                                 Method for accumulating multiple detectors:
Method for scoring detectors:      Boosting      Set covering
  Root                             BWI           Root-SWI
  Greedy                           —             Greedy-SWI

Question #1: Does BWI Outperform the Greedy Approach of SWI?
- BWI has higher F1 than Greedy-SWI
  – Greedy-SWI tends to have slightly higher precision, but BWI has considerably higher recall
- Does this difference come from the scoring function or the accumulation method?
(Chart: average of 8 partially structured IE tasks)

Question #2: How Does Performance Differ by Choice of Scoring Function?
- Greedy-SWI and Root-SWI differ only in their scoring function
- Greedy-SWI has higher precision, Root-SWI has higher recall, and they have similar F1
- BWI still outperforms Root-SWI, even though they use identical scoring functions
  – Remaining differences: boosting vs. set covering, and the total number of rules learned
(Chart: average of 8 partially structured IE tasks)

Question #3: How Does the Number of Rules Learned Affect Performance?
- BWI learns a predetermined number of rules, but SWI stops when all examples are covered
  – Usually BWI learns many more rules than Root-SWI
- Stop BWI after it has learned as many rules as Root-SWI ("Fixed-BWI")
  – This reproduces the precision-recall tradeoff of Root-SWI
  – BWI outperforms Fixed-BWI
(Chart: average of 8 partially structured IE tasks)

Analysis of Experimental Results: Why Does BWI Outperform SWI?
Key insight: the source of BWI's success is the interaction of two complementary effects, both due to boosting:
- Re-weighting examples causes increasingly specific rules to be learned to cover exceptional cases (high precision)
- Re-weighting examples instead of removing them means rules can be learned even after all examples have been covered (high recall)

Performance vs. Task Regularity Reveals an Important Interaction
- All methods perform better on tasks with more structure
- The relative power of different algorithmic components varies with task regularity
(Chart: results grouped into highly structured, partially structured, and natural text tasks)

How Do We Quantify Task Regularity?
- Goal: measure the relationship between task regularity and performance
- Proposed solution: "SWI-Ratio" = (# of iterations Greedy-SWI takes to cover all positive examples) / (total # of positive examples)
  – Most regular case: 1 rule covers all N examples; ratio = 1/N, approaching 0 for large N
  – Least regular case: a separate rule for each example; ratio = N/N = 1
  – Since each new rule must cover at least one example, SWI learns at most N rules for N examples (and usually far fewer), so the SWI-Ratio is always between 0 and 1 (smaller = more regular)
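As a formula:

```latex
\text{SWI-Ratio} \;=\; \frac{R}{N},
\qquad R = \#\ \text{rules Greedy-SWI needs to cover all positives},
\quad N = \#\ \text{positive examples},
\qquad \tfrac{1}{N} \le \text{SWI-Ratio} \le 1 .
```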

Desirable Properties of SWI-Ratio
- Relative to the size of the document collection ⇒ suitable for comparisons across collections of different sizes
- General and objective
  – SWI is very simple and doesn't allow any negative examples ⇒ an unbiased account of how many non-overlapping rules are needed to perfectly cover all examples
- Quick and easy to run
  – No free parameters to set (except lookahead, which we kept fixed in all tests)

Performance of BWI and Greedy-SWI (F1) vs. Task Regularity (SWI-Ratio)
(Chart: dotted lines separate the highly structured, partially structured, and natural text domains)

Improving IE Performance on Natural Text Documents
- Goal: compensate for weak IE performance on natural language tasks
  – Need to look elsewhere for regularities to exploit
- Idea: consider grammatical structure
  – Run a shallow parser on each sentence
  – Flatten the output into a sequence of "typed phrase segments" (using XML tags to mark the text)
- Example of a tagged sentence:
  <NP_SEG>Uba2p</NP_SEG> <VP_SEG>is located</VP_SEG> <PP_SEG>largely in</PP_SEG> <NP_SEG>the nucleus</NP_SEG>.
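A hypothetical flattening step (the segment labels follow the slide's example; the actual shallow parser used in the study is not shown here):

```python
def flatten(chunks):
    """Turn shallow-parse chunks [(label, [tokens]), ...] into the flat,
    XML-tagged token sequence of typed phrase segments that the IE
    learner consumes."""
    return " ".join(f"<{label}_SEG>{' '.join(tokens)}</{label}_SEG>"
                    for label, tokens in chunks)

chunks = [("NP", ["Uba2p"]), ("VP", ["is", "located"]),
          ("PP", ["largely", "in"]), ("NP", ["the", "nucleus"])]
print(flatten(chunks))
# <NP_SEG>Uba2p</NP_SEG> <VP_SEG>is located</VP_SEG>
#   <PP_SEG>largely in</PP_SEG> <NP_SEG>the nucleus</NP_SEG>
```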

Typed Phrase Segments Improve BWI's Performance on Natural Text IE Tasks
(Chart: F1 increases of 21%, 65%, and 45% on the three natural text tasks)

Typed Phrase Segments Increase the Regularity of Natural Text IE Tasks
(Chart: SWI-Ratio decreases by 21% on average, i.e. the tagged tasks are more regular)

Encouraging Results Suggest Exploiting Other Sources of Regularity
- Key insight: we can improve performance on natural text while maintaining the simple IE framework if we expose the right regularities
- Suggests other linguistic abstractions may be useful
  – More grammatical information, semantic categories, lexical features, etc.

Conclusions and Summary
- Boosting is the key source of BWI's success
  – It learns specific rules, but learns many of them
- IE performance is sensitive to task regularity
  – SWI-Ratio is a quantitative, objective measure of regularity (vs. subjective document classes)
- Exploiting more regularities in text is key to IE's future, particularly in natural text
  – Canonical formatting and keywords are often sufficient in structured text documents
  – Exposing grammatical information boosts performance on natural text IE tasks

Acknowledgements
- Dayne Freitag, for making the BWI code available
- Mark Craven, for giving us natural text MEDLINE documents with annotated phrase segments
- MedExpert International, Inc., for financial support of this research
- Charles Elkan and David Kauchak, for hosting me at UCSD this summer
This work was conducted as part of the California Institute for Telecommunications and Information Technology, Cal-(IT)².