Joseph Smarr, "Sources of Success for Information Extraction Methods." Seminar for Computational Learning and Adaptation, Stanford University, October 25, 2001.

Presentation transcript:

Slide 1: Sources of Success for Information Extraction Methods
Joseph Smarr (jsmarr@stanford.edu)
Seminar for Computational Learning and Adaptation, Stanford University, October 25, 2001
Based on research conducted at UC San Diego in Summer 2001 with Charles Elkan and David Kauchak

Slide 2: Overview and Themes: Identifying "Sources of Success"
- Brief overview of the Information Extraction (IE) paradigm and current methods
- Getting "under the hood" of current systems to understand the source of their performance and limitations
- Identifying new sources of information to exploit for increased performance and usefulness

Slide 3: Motivation for Information Extraction
- Abundance of freely available text in digital form (WWW, MEDLINE, etc.)
- Information contained in un-annotated text is largely inaccessible to computers
- Much of this information appears "ripe for the plucking" without having to do full text understanding

Slide 4: Highly Structured Example: Amazon.com Book Info Pages
Desired info: title, author(s), price, availability, etc.

Slide 5: Partially Structured Example: SCLA Speaker Announcement Emails
Desired info: title, speaker, date, abstract, etc.

Slide 6: Natural Text Example: MEDLINE Journal Abstracts
Desired info: subject size, study type, condition studied, etc.

BACKGROUND: The most challenging aspect of revision hip surgery is the management of bone loss. A reliable and valid measure of bone loss is important since it will aid in future studies of hip revisions and in preoperative planning. We developed a measure of femoral and acetabular bone loss associated with failed total hip arthroplasty. The purpose of the present study was to measure the reliability and the intraoperative validity of this measure and to determine how it may be useful in preoperative planning.

METHODS: From July 1997 to December 1998, forty-five consecutive patients with a failed hip prosthesis in need of revision surgery were prospectively followed. Three general orthopaedic surgeons were taught the radiographic classification system, and two of them classified standardized preoperative anteroposterior and lateral hip radiographs with use of the system. Interobserver testing was carried out in a blinded fashion. These results were then compared with the intraoperative findings of the third surgeon, who was blinded to the preoperative ratings. Kappa statistics (unweighted and weighted) were used to assess correlation. Interobserver reliability was assessed by examining the agreement between the two preoperative raters. Prognostic validity was assessed by examining the agreement between the assessment by either Rater 1 or Rater 2 and the intraoperative assessment (reference standard).

RESULTS: With regard to the assessments of both the femur and the acetabulum, there was significant agreement between the two preoperative raters (kappa > 0.75). There was also significant agreement between the preoperative and intraoperative assessments (kappa > 0.75).

CONCLUSIONS: With use of the newly developed classification system, preoperative radiographs are reliable and valid for assessment of the severity of bone loss that will be found intraoperatively.

Slide 7: Current Types of IE Systems
- Hand-built systems
  - Often effective, but slow and expensive to build and adapt
- Stochastic generative models
  - HMMs, N-grams, PCFGs, etc.
  - Keep separate distributions for "content" and "filler" states
- Induced rule-based systems
  - Learn to identify local "landmarks" for the beginning and end of target information

Slide 8: Formalization of Information Extraction
- Performance task:
  - Extract specific tokens from a set of documents that contain the desired information
- Performance measures (computed in the sketch below):
  - Precision: # correct returned / total # returned
  - Recall: # correct returned / total # correct
  - F1: harmonic mean of precision and recall
- Learning paradigm:
  - Supervised learning on a set of documents with target fields manually labeled
  - Usually train/test on one field at a time
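
As a concrete illustration (not from the original slides), these measures reduce to a few lines of Python; the span-tuple representation here is a hypothetical choice:

```python
def precision_recall_f1(returned, correct):
    """Compute IE performance measures from extracted vs. gold spans.

    returned: set of (doc_id, start, end) spans the system extracted
    correct:  set of (doc_id, start, end) spans in the gold labeling
    """
    true_positives = len(returned & correct)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```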

Slide 9: IE as a Classification Task: Token Extraction as Boundary Detection
- Input: a linear sequence of tokens, e.g. "Date : Thursday , October 25 Time : 4 : 15 - 5 : 30 PM"
- Method: binary classification of inter-token boundaries (start/end of content vs. unimportant boundaries)
- Output: the tokens between an identified start boundary and end boundary (see the sketch below)
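
A minimal sketch of the output step, assuming a boundary classifier has already flagged a start and an end boundary (the indices here are illustrative):

```python
tokens = ["Date", ":", "Thursday", ",", "October", "25"]

# Boundary i sits between tokens[i-1] and tokens[i]; suppose the
# classifier flagged boundary 2 as a start and boundary 6 as an end.
start_boundary, end_boundary = 2, 6

extracted = tokens[start_boundary:end_boundary]
print(" ".join(extracted))  # -> Thursday , October 25
```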

Slide 10: Representation of Boundary Classifiers
- "Boundary detectors" are pairs of token sequences ⟨p, s⟩
  - A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
  - Detectors can contain wildcards, e.g. "capitalized word", "number", etc.
- Example: ⟨[Date, :], [<capitalized word>]⟩ matches the beginning of the field in "Date: Thursday, October 25" (see the matching sketch below)
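
A sketch of how such a detector could be matched, assuming each pattern element is either a literal token or a wildcard predicate (this representation is an assumption, not the BWI implementation):

```python
CAP_WORD = lambda tok: tok[:1].isupper()  # wildcard: capitalized word
NUMBER = lambda tok: tok.isdigit()        # wildcard: number

def pattern_matches(pattern, tokens):
    """True iff each pattern element accepts its aligned token."""
    if len(pattern) > len(tokens):
        return False
    return all(p(t) if callable(p) else p == t
               for p, t in zip(pattern, tokens))

def detector_matches(prefix, suffix, tokens, boundary):
    """Boundary i lies between tokens[i-1] and tokens[i]; the prefix
    must match the tokens before it, the suffix the tokens after."""
    return (pattern_matches(prefix[::-1], tokens[:boundary][::-1])
            and pattern_matches(suffix, tokens[boundary:]))

tokens = ["Date", ":", "Thursday", ",", "October", "25"]
print(detector_matches(["Date", ":"], [CAP_WORD], tokens, 2))  # True
```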

Slide 11: Boosted Wrapper Induction (BWI): Exemplar of Current Rule-Based Systems
- Wrapper induction is a high-precision, low-recall learner that performs well for highly structured tasks
- Boosting is a technique for combining multiple weak learners into a strong learner by re-weighting examples
- Boosted Wrapper Induction (BWI) was proposed by Freitag and Kushmerick in 2000 as the marriage of these two techniques

Slide 12: BWI Algorithm
- Given a set of documents with labeled fore and aft boundaries, induce:
  - F: set of "fore detectors"
  - A: set of "aft detectors"
  - H: histogram of field lengths (for pairing fore and aft detectors)
- To learn each boundary detector:
  - Start with an empty rule
  - Exhaustively enumerate all extensions up to lookahead length L
  - Add the best-scoring token extension
  - Repeat until no extension improves the score
- After learning a new detector (see the boosting-loop sketch below):
  - Re-weight documents according to AdaBoost (down-weight correctly covered docs, up-weight incorrectly covered docs, normalize all weights)
  - Repeat the process, learning a new rule and re-weighting each time
  - Stop after a predetermined number of iterations
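
A schematic of the outer boosting loop, sketched under simplifying assumptions: learn_detector is the greedy rule learner described above, treated as a black box, and detectors make plain binary predictions rather than BWI's confidence-rated ones:

```python
import math

def boosted_wrapper_induction(examples, num_iterations, learn_detector):
    """AdaBoost-style loop for learning one set of boundary detectors.

    examples: list of (instance, label) pairs with label in {+1, -1}
    """
    weights = [1.0 / len(examples)] * len(examples)
    detectors = []
    for _ in range(num_iterations):  # predetermined number of rounds
        detector = learn_detector(examples, weights)
        # Weighted error of the new detector under the current weights
        error = sum(w for (x, y), w in zip(examples, weights)
                    if detector(x) != (y > 0))
        error = min(max(error, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - error) / error)
        detectors.append((detector, alpha))
        # Down-weight correctly covered examples, up-weight mistakes,
        # then normalize so the weights remain a distribution
        weights = [w * math.exp(-alpha if detector(x) == (y > 0) else alpha)
                   for (x, y), w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return detectors
```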

Slide 13: Summary of Original BWI Results
- BWI gives state-of-the-art performance on highly structured and partially structured tasks
- No systematic analysis of why BWI performs well
- BWI proposed as a solution for natural text IE, but no tests conducted

Slide 14: Goals of Our Research
- Understand specifically how boosting contributes to BWI's performance
- Investigate the relationship between performance and task regularity
- Identify new sources of information to improve performance, particularly for natural language tasks

Slide 15: Comparison Algorithm: Sequential Wrapper Induction (SWI)
- Same formulation as BWI, but uses set covering instead of boosting to learn multiple rules:
  - Find the highest-scoring rule
  - Remove all positive examples covered by the new rule
  - Stop when all positive examples have been removed
- Scoring function (two choices; sketched below):
  - Greedy-SWI: most positive examples covered without covering any negative examples
  - Root-SWI: sqrt(W+) - sqrt(W-), where W+ and W- are the total weights of the positive and negative examples covered
  - BWI uses root scoring, but many set-covering methods use greedy scoring
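
The two scoring functions and the set-covering loop, sketched with the same black-box assumption for the rule learner (find_best_rule is hypothetical):

```python
import math

def greedy_score(w_pos, w_neg):
    """Greedy-SWI: most positives covered, but no negatives tolerated."""
    return w_pos if w_neg == 0 else float("-inf")

def root_score(w_pos, w_neg):
    """Root scoring (Root-SWI and BWI): sqrt(W+) - sqrt(W-)."""
    return math.sqrt(w_pos) - math.sqrt(w_neg)

def swi(positives, negatives, find_best_rule, score):
    """SWI outer loop: set covering instead of boosting."""
    rules, uncovered = [], set(positives)
    while uncovered:
        rule = find_best_rule(uncovered, negatives, score)
        rules.append(rule)
        # Each new rule covers at least one example, so this terminates
        uncovered -= {p for p in uncovered if rule(p)}
    return rules
```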

Slide 16: Component Matrix of Algorithms

                        Method for accumulating multiple detectors:
  Method for scoring    Boosting        Set covering
  Root                  BWI             Root-SWI
  Greedy                -               Greedy-SWI

Slide 17: Question #1: Does BWI Outperform the Greedy Approach of SWI?
- BWI has higher F1 than Greedy-SWI
  - Greedy-SWI tends to have slightly higher precision, but BWI has considerably higher recall
- Does this difference come from the scoring function or the accumulation method?
(Results: average of 8 partially structured IE tasks)

Slide 18: Question #2: How Does Performance Differ by Choice of Scoring Function?
- Greedy-SWI and Root-SWI differ only in their scoring function
- Greedy-SWI has higher precision, Root-SWI has higher recall, and they have similar F1
- BWI still outperforms Root-SWI, even though they use identical scoring functions
  - Remaining differences: boosting vs. set covering, and the total number of rules learned
(Results: average of 8 partially structured IE tasks)

Slide 19: Question #3: How Does the Number of Rules Learned Affect Performance?
- BWI learns a predetermined number of rules, but SWI stops when all examples are covered
  - Usually BWI learns many more rules than Root-SWI
- Stopping BWI once it has learned as many rules as Root-SWI ("Fixed-BWI"):
  - Reproduces the precision-recall tradeoff of Root-SWI
  - BWI outperforms Fixed-BWI
(Results: average of 8 partially structured IE tasks)

Slide 20: Analysis of Experimental Results: Why Does BWI Outperform SWI?
Key insight: the source of BWI's success is the interaction of two complementary effects, both due to boosting:
- Re-weighting examples causes increasingly specific rules to be learned to cover exceptional cases (high precision)
- Re-weighting examples instead of removing them means rules can be learned even after all examples have been covered (high recall)

Slide 21: Performance vs. Task Regularity Reveals an Important Interaction
- All methods perform better on tasks with more structure
- The relative power of different algorithmic components varies with task regularity
(Chart: performance on highly structured, partially structured, and natural text tasks)

Slide 22: How Do We Quantify Task Regularity?
- Goal: measure the relationship between task regularity and performance
- Proposed solution: the "SWI-Ratio" (computed in the sketch below)

  SWI-Ratio = (# of iterations Greedy-SWI takes to cover all positive examples) / (total number of positive examples)

  - Most regular case: 1 rule covers all N examples; ratio = 1/N, approaching 0
  - Least regular case: a separate rule for each example; ratio = N/N = 1
  - Since each new rule must cover at least one example, SWI will learn at most N rules for N examples (and usually far fewer), so the SWI-Ratio is always between 0 and 1 (smaller = more regular)
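
Reusing the swi and greedy_score sketches from above, the measure is just a normalized rule count:

```python
def swi_ratio(positives, negatives, find_best_rule):
    """SWI-Ratio: (# rules Greedy-SWI needs to cover all positives)
    divided by (# positives); near 0 = regular, near 1 = irregular."""
    rules = swi(positives, negatives, find_best_rule, greedy_score)
    return len(rules) / len(positives)
```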

Slide 23: Desirable Properties of the SWI-Ratio
- Relative to the size of the document collection, so it is suitable for comparison across collections of different sizes
- General and objective
  - SWI is very simple and doesn't allow any negative examples, giving an unbiased account of how many non-overlapping rules are needed to perfectly cover all examples
- Quick and easy to run
  - No free parameters to set (except lookahead, which we kept fixed in all tests)

Slide 24: Performance of BWI and Greedy-SWI (F1) vs. Task Regularity (SWI-Ratio)
(Chart: F1 plotted against SWI-Ratio; dotted lines separate highly structured, partially structured, and natural text domains)

Slide 25: Improving IE Performance on Natural Text Documents
- Goal: compensate for weak IE performance on natural language tasks
  - Need to look elsewhere for regularities to exploit
- Idea: consider grammatical structure (see the flattening sketch below)
  - Run a shallow parser on each sentence
  - Flatten the output into a sequence of "typed phrase segments" (using XML tags to mark the text)
- Example of a tagged sentence: <NP_SEG>Uba2p</NP_SEG> <VP_SEG>is located largely</VP_SEG> <PP_SEG>in</PP_SEG> <NP_SEG>the nucleus</NP_SEG>
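
A sketch of the flattening step; the (phrase_type, tokens) chunk format here stands in for whatever the shallow parser actually emits:

```python
def flatten_to_typed_segments(chunks):
    """Flatten shallow-parse chunks into one tagged token sequence,
    so segment boundaries become ordinary tokens detectors can use."""
    out = []
    for phrase_type, tokens in chunks:
        out.append("<%s_SEG>" % phrase_type)
        out.extend(tokens)
        out.append("</%s_SEG>" % phrase_type)
    return out

chunks = [("NP", ["Uba2p"]), ("VP", ["is", "located", "largely"]),
          ("PP", ["in"]), ("NP", ["the", "nucleus"])]
print(" ".join(flatten_to_typed_segments(chunks)))
```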

Slide 26: Typed Phrase Segments Improve BWI's Performance on Natural Text IE Tasks
(Chart: performance increases of 21%, 65%, and 45% across the natural text tasks)

Slide 27: Typed Phrase Segments Increase the Regularity of Natural Text IE Tasks
(Chart: SWI-Ratio decreases by an average of 21%)

Slide 28: Encouraging Results Suggest Exploiting Other Sources of Regularity
- Key insight: we can improve performance on natural text while maintaining the simple IE framework if we expose the right regularities
- Suggests other linguistic abstractions may be useful
  - More grammatical information, semantic categories, lexical features, etc.

Slide 29: Conclusions and Summary
- Boosting is the key source of BWI's success
  - Learns specific rules, but learns many of them
- IE performance is sensitive to task regularity
  - The SWI-Ratio is a quantitative, objective measure of regularity (vs. subjective document classes)
- Exploiting more regularities in text is key to IE's future, particularly in natural text
  - Canonical formatting and keywords are often sufficient in structured text documents
  - Exposing grammatical information boosts performance on natural text IE tasks

Slide 30: Acknowledgements
- Dayne Freitag, for making the BWI code available
- Mark Craven, for giving us natural text MEDLINE documents with annotated phrase segments
- MedExpert International, Inc., for financial support of this research
- Charles Elkan and David Kauchak, for hosting me at UCSD this summer
- This work was conducted as part of the California Institute for Telecommunications and Information Technology, Cal-(IT)².




