1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006.

Slides:

Advertisements

Similar presentations

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.

Advertisements

1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.

NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.

Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.

CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Named Entity Recognition.

Event Extraction: Learning from Corpora Prepared by Ralph Grishman Based on research and slides by Roman Yangarber NYU.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.

J. Turmo, 2006 Adaptive Information Extraction Summary Information Extraction Systems Multilinguality Introduction Language guessers Machine Translators.

Extraction as Classification. What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Information Extraction.

Using Information Extraction for Question Answering Done by Rani Qumsiyeh.

Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.

Information Extraction from the World Wide Web CSE 454 Based on Slides by William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts.

1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 11, 2004.

I256 Applied Natural Language Processing Fall 2009 Lecture 13 Information Extraction (1) Barbara Rosario.

A Framework for Named Entity Recognition in the Open Domain Richard Evans Research Group in Computational Linguistics University of Wolverhampton UK

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/2010 Overview of NLP tasks (text pre-processing)

Information Extraction

Introduction to Text Mining

Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.

Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.

School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents.

Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.

India Research Lab Auto-grouping s for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM.

Information Extraction Junichi Tsujii Graduate School of Science University of Tokyo Japan Ronen Feldman Bar Ilan University Israel.

Information Extraction CSE 574 Dan Weld. What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14,

December 2005CSA3180: Information Extraction I1 CSA3180: Natural Language Processing Information Extraction 1 – Introduction Information Extraction Named.

Information Extraction Sources: Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377. Hobbs, J. R., & Riloff,

Logic Programming for Natural Language Processing Menyoung Lee TJHSST Computer Systems Lab Mentor: Matt Parker Analytic Services, Inc.

Information Extraction Yunyao Li EECS /SI /29/2006.

Search Engines and Information Retrieval Chapter 1.

Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.

Survey of Semantic Annotation Platforms

Authors: Ting Wang, Yaoyong Li, Kalina Bontcheva, Hamish Cunningham, Ji Wang Presented by: Khalifeh Al-Jadda Automatic Extraction of Hierarchical Relations.

December 2005CSA3180: Information Extraction I1 CSA2050: Natural Language Processing Information Extraction Named Entities IE Systems MUC Finite State.

1 Named Entity Recognition based on three different machine learning techniques Zornitsa Kozareva JRC Workshop September 27, 2005.

Using Text Mining and Natural Language Processing for Health Care Claims Processing Cihan ÜNAL

Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.

A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.

CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”

10/12/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 10 Giuseppe Carenini.

Finite State Parsing & Information Extraction CMSC Intro to NLP January 10, 2006.

Some Work on Information Extraction at IRL Ganesh Ramakrishnan IBM India Research Lab.

Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.

Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001.

India Research Lab © Copyright IBM Corporation 2006 Entity Annotation using operations on the Inverted Index Ganesh Ramakrishnan, with Sreeram Balakrishnan.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.

For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.

UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.

Classifying Semantic Relations in Bioscience Texts Barbara Rosario Marti Hearst SIMS, UC Berkeley Supported by NSF DBI

Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq

4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.

The Unreasonable Effectiveness of Data

Shallow Parsing for South Asian Languages -Himanshu Agrawal.

Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.

Overview of Statistical NLP IR Group Meeting March 7, 2006.

Data Acquisition. Get all data necessary for the analysis task at hand Some data comes from inside the company –Need to go and talk with various data.

Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.

CSC 594 Topics in AI – Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing

Introduction to Information Extraction

Social Knowledge Mining

CSC 594 Topics in AI – Natural Language Processing

Text Mining & Natural Language Processing

CS246: Information Retrieval

CSCI 5832 Natural Language Processing

Using Uneven Margins SVM and Perceptron for IE

Extracting Why Text Segment from Web Based on Grammar-gram

Presentation transcript:

1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006

2 Today Information Extraction What it is Historical roots: MUC Current state-of-art performance Various Techniques

3 Classifying at Different Granularies Text Categorization: Classify an entire document Information Extraction (IE): Identify and classify small units within documents Named Entity Extraction (NE): A subset of IE Identify and classify proper names –People, locations, organizations

4 Adapted from slide by William Cohen What is Information Extraction? Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION

5 Adapted from slide by William Cohen What is Information Extraction Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE

6 Adapted from slide by William Cohen What is Information Extraction? Information Extraction = segmentation + classification + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka “named entity extraction”

7 Adapted from slide by William Cohen What is Information Extraction Information Extraction = segmentation + classification + association A family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

8 Adapted from slide by William Cohen What is Information Extraction Information Extraction = segmentation + classification + association A family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

9 Adapted from slide by William Cohen IE in Context Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models

10 Adapted from slide by William Cohen Landscape of IE Tasks: Degree of Formatting Grammatical sentences and some formatting & links Text paragraphs without formatting Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. Non-grammatical snippets, rich formatting & links Tables

11 Adapted from slide by William Cohen Landscape of IE Tasks: Intended Breadth of Coverage Web site specificGenre specificWide, non-specific Amazon.com Book PagesResumesUniversity Names FormattingLayoutLanguage

12 Adapted from slide by William Cohen Landscape of IE Tasks: Complexity Closed set He was born in Alabama… The big Wyoming sky… U.S. states Regular set Phone: (413) The CALD main office can be reached at U.S. phone numbers Complex pattern University of Arkansas P.O. Box 140 Hope, AR U.S. postal addresses Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio …was among the six houses sold by Hope Feldman that year. Ambiguous patterns, needing context and many sources of evidence Person names Pawel Opalinski, Software Engineer at WhizBang Labs.

13 Adapted from slide by William Cohen Landscape of IE Tasks: Single Field/Record Single entity Person: Jack Welch Binary relationship Relation: Person-Title Person: Jack Welch Title: CEO N-ary record “Named entity” extraction Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Relation: Company-Location Company: General Electric Location: Connecticut Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt Location: Connecticut

14 Slide by Chris Manning, based on slides by several others MUC: the genesis of IE DARPA funded significant efforts in IE in the early to mid 1990’s. Message Understanding Conference (MUC) was an annual event/competition where results were presented. Focused on extracting information from news articles: Terrorist events Industrial joint ventures Company management changes Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)

15 Adapted from slide by Lucian Vlad Lita Message Understanding Conference (MUC) Named entity –Person, Organization, Location Co-reference –Clinton  President Bill Clinton Template element –Perpetrator, Target Template relation –Incident Multilingual

16 Adapted from slide by Lucian Vlad Lita MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

17 Adapted from slide by Lucian Vlad Lita MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

18 Adapted from slide by Lucian Vlad Lita MUC Templates Relationship –tie-up Entities: –Bridgestone Sports Co, a local concern, a Japanese trading house Joint venture company –Bridgestone Sports Taiwan Co Activity –ACTIVITY 1 Amount –NT$2,000,000

19 Adapted from slide by Lucian Vlad Lita MUC Templates ATIVITY 1 Activity –Production Company –Bridgestone Sports Taiwan Co Product –Iron and “metal wood” clubs Start Date –January 1990

20 Slide by Chris Manning, based on slides by several others Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE from FASTUS (1993) TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$

21 Slide by Chris Manning, based on slides by several others Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE: FASTUS(1993) TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$ ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

22 Slide by Chris Manning, based on slides by several others Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE: FASTUS(1993): Resolving anaphora TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$ ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

23 Slide by Chris Manning, based on slides by others Evaluating IE Accuracy Always evaluate performance on independent, manually- annotated test data not used during system development. Measure for each test document: Total number of correct extractions in the solution template: N Total number of slot/value pairs extracted by the system: E Number of extracted slot/value pairs that are correct (i.e. in the solution template): C Compute average value of metrics adapted from IR: Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision

24 Slide by Chris Manning, based on slides by others MUC Information Extraction: State of the Art c NE – named entity recognition CO – coreference resolution TE – template element construction TR – template relation construction ST – scenario template production

25 Adapted from slides by Cunningham & Bontcheva Two kinds of NE approaches Knowledge Engineering rule based developed by experienced language engineers make use of human intuition requires only small amount of training data development could be very time consuming some changes may be hard to accommodate Learning Systems use statistics or other machine learning developers do not need LE expertise requires large amounts of annotated training data some changes may require re- annotation of the entire training corpus annotators are cheap (but you get what you pay for!)

26 Slide by Chris Manning, based on slides by several others Three generations of IE systems Hand-Built Systems – Knowledge Engineering [1980s– ] Rules written by hand Require experts who understand both the systems and the domain Iterative guess-test-tweak-repeat cycle Automatic, Trainable Rule-Extraction Systems [1990s– ] Rules discovered automatically using predefined templates, using automated rule learners Require huge, labeled corpora (effort is just moved!) Statistical Models [1997 – ] Use machine learning to learn which features indicate boundaries and types of entities. Learning usually supervised; may be partially unsupervised

27 Slide by Chris Manning, based on slides by several others Trainable IE systems Pros Annotating text is simpler & faster than writing rules. Domain independent Domain experts don’t need to be linguists or programers. Learning algorithms ensure full coverage of examples. Cons Hand-crafted systems perform better, especially at hard tasks (but this is changing). Training data might be expensive to acquire. May need huge amount of training data. Hand-writing rules isn’t that hard!!

28 Adapted from slide by William Cohen Landscape of IE Techniques Any of these models can be used to capture words, formatting or both. Lexicons Alabama Alaska … Wisconsin Wyoming Abraham Lincoln was born in Kentucky. member? Classify Pre-segmented Candidates Abraham Lincoln was born in Kentucky. Classifier which class? Sliding Window Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Boundary Models Abraham Lincoln was born in Kentucky. Classifier which class? BEGINENDBEGINEND BEGIN Context Free Grammars Abraham Lincoln was born in Kentucky. NNPVPNPVNNP NP PP VP S Most likely parse? Finite State Machines Abraham Lincoln was born in Kentucky. Most likely state sequence?

29 Successors to MUC CoNNL: Conference on Computational Natural Language Learning Different topics each year 2002, 2003: Language-independent NER 2004: Semantic Role recognition 2001: Identify clauses in text 2000: Chunking boundaries – (also conll2004, conll2002…) –Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics. ACE: Automated Content Extraction Entity Detection and Tracking –Sponsored by NIST – Several others recently See

30 Adapted from slide by William Cohen State of the Art Performance: examples Named entity recognition from newswire text Person, Location, Organization, … F1 in high 80’s or low- to mid-90’s Binary relation extraction Contained-in (Location1, Location2) Member-of (Person1, Organization1) F1 in 60’s or 70’s or 80’s Web site structure recognition Extremely accurate performance obtainable Human effort (~10min?) required on each site

31 CoNNL-2003 Goal: identify boundaries and types of named entities People, Organizations, Locations, Misc. Experiment with incorporating external resources (Gazeteers) and unlabeled data Data: Using IOB notation 4 pieces of info for each term Word POS Chunk EntityType

32 Details on Training/Test Sets Reuters Newswire + European Corpus Initiative Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

33 Summary of Results 16 systems participated Machine Learning Techniques Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4) Others used once were Support Vector Machines, Conditional Random Fields, Transformation-Based learning, AdaBoost, and memory-based learning Combining techniques often worked well Features Choice of features is at least as important as ML method Top-scoring systems used many types No one feature stands out as essential (other than words) Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

34 Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

35 Use of External Information Improvement from using Gazeteers vs. unlabeled data nearly equal Gazeteers less useful for German than English (higher quality) Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

36 Precision, Recall, and F-Scores * * * * * * Not significantly different Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

37 Combining Results What happens if we combine the results of all of the systems? Used a majority-vote of 5 systems for each set English: F = (14% error reduction of best system) German: F = (6% error reduction of best system)

38 MUC Redux Task: fill slots of templates MUC-4 (1992) All systems hand-engineered One MUC-6 entry used learning; failed miserably

39

40 MUC Redux Fast forward 12 years … now use ML! Chieu et. al. show a machine learning approach that can do as well as most of the hand-engineered MUC- 4 systems Uses state-of-the-art: –Sentence segmenter –POS tagger –NER –Statistical Parser –Co-reference resolution Features look at syntactic context –Use subject-verb-object information –Use head-words of NPs Train classifiers for each slot type Chieu, Hai Leong, Ng, Hwee Tou, & Lee, Yoong Keok (2003). Closing the Gap: Learning-Based Information Extraction Rivaling Knowledge-Engineering Methods, In (ACL-03).

41 Best systems took 10.5 person-months of hand-coding!

42 IE Techniques: Summary Machine learning approaches are doing well, even without comprehensive word lists Can develop a pretty good starting list with a bit of web page scraping Features mainly have to do with the preceding and following tags, as well as syntax and word “shape” The latter is somewhat language dependent With enough training data, results are getting pretty decent on well-defined entities ML is the way of the future!

43 IE Tools Research tools Gate – MinorThird – Alembic (only NE tagging) – Commercial ?? I don’t know which ones work well