1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 11, 2004.

Slides:



Advertisements
Similar presentations
Finite-State Machines with No Output Ying Lu
Advertisements

4b Lexical analysis Finite Automata
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Pushdown Automata Chapter 12. Recognizing Context-Free Languages Two notions of recognition: (1) Say yes or no, just like with FSMs (2) Say yes or no,
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst.
Extraction as Classification. What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Information Extraction.
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
Information Extraction from the World Wide Web CSE 454 Based on Slides by William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
I256 Applied Natural Language Processing Fall 2009 Lecture 13 Information Extraction (1) Barbara Rosario.
Holistic Web Page Classification William W. Cohen Center for Automated Learning and Discovery (CALD) Carnegie-Mellon University.
1 Foundations of Software Design Lecture 22: Regular Expressions and Finite Automata Marti Hearst Fall 2002.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/2010 Overview of NLP tasks (text pre-processing)
Information Extraction
Introduction to Text Mining
School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Information Extraction Junichi Tsujii Graduate School of Science University of Tokyo Japan Ronen Feldman Bar Ilan University Israel.
December 2005CSA3180: Information Extraction I1 CSA3180: Natural Language Processing Information Extraction 1 – Introduction Information Extraction Named.
Logic Programming for Natural Language Processing Menyoung Lee TJHSST Computer Systems Lab Mentor: Matt Parker Analytic Services, Inc.
Information Extraction Yunyao Li EECS /SI /29/2006.
Search Engines and Information Retrieval Chapter 1.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
CMPS 3223 Theory of Computation Automata, Computability, & Complexity by Elaine Rich ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Slides provided.
December 2005CSA3180: Information Extraction I1 CSA2050: Natural Language Processing Information Extraction Named Entities IE Systems MUC Finite State.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
1 i206: Lecture 18: Regular Expressions Marti Hearst Spring 2012.
INTRODUCTION TO THE THEORY OF COMPUTATION INTRODUCTION MICHAEL SIPSER, SECOND EDITION 1.
Ling 570 Day 17: Named Entity Recognition Chunking.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Finite State Parsing & Information Extraction CMSC Intro to NLP January 10, 2006.
Some Work on Information Extraction at IRL Ganesh Ramakrishnan IBM India Research Lab.
10. Parsing with Context-free Grammars -Speech and Language Processing- 발표자 : 정영임 발표일 :
1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
Presenter: Shanshan Lu 03/04/2010
MA/CSSE 474 Theory of Computation Decision Problems DFSMs.
India Research Lab © Copyright IBM Corporation 2006 Entity Annotation using operations on the Inverted Index Ganesh Ramakrishnan, with Sreeram Balakrishnan.
Artificial Intelligence Research Center Pereslavl-Zalessky, Russia Program Systems Institute, RAS.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Classifying Semantic Relations in Bioscience Texts Barbara Rosario Marti Hearst SIMS, UC Berkeley Supported by NSF DBI
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Modeling Computation: Finite State Machines without Output
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Data Acquisition. Get all data necessary for the analysis task at hand Some data comes from inside the company –Need to go and talk with various data.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
CSC 594 Topics in AI – Natural Language Processing
CSC 594 Topics in AI – Natural Language Processing
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Introduction to Information Extraction
Social Knowledge Mining
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
4b Lexical analysis Finite Automata
Text Mining & Natural Language Processing
4b Lexical analysis Finite Automata
CS246: Information Retrieval
Teori Bahasa dan Automata Lecture 9: Contex-Free Grammars
Presentation transcript:

1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 11, 2004

2 Classifying at Different Granularies Text Categorization: Classify an entire document Information Extraction (IE): Identify and classify small units within documents Named Entity Extraction (NE): A subset of IE Identify and classify proper names –People, locations, organizations

3 Adapted from slide by William Cohen Example: The Problem: Looking for a Job Martin Baker, a person Genomics job Employers job posting form

4 Adapted from slide by William Cohen Example: A Solution: An aggregator Web Site

5 Adapted from slide by William Cohen Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: DateExtracted: January 8, 2001 Source: OtherCompanyJobs: foodscience.com-Job1

6 Adapted from slide by William Cohen Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

7 Adapted from slide by William Cohen Data Mining the Extracted Job Information

8 Adapted from slide by William Cohen IE from Research Papers: CiteSeer

9 Adapted from slide by William Cohen What is Information Extraction Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION

10 Adapted from slide by William Cohen What is Information Extraction Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE

11 Adapted from slide by William Cohen What is Information Extraction Information Extraction = segmentation + classification + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka “named entity extraction”

12 Adapted from slide by William Cohen What is Information Extraction Information Extraction = segmentation + classification + association A family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

13 Adapted from slide by William Cohen What is Information Extraction Information Extraction = segmentation + classification + association A family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

14 Adapted from slide by William Cohen IE in Context Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models

15 Adapted from slide by William Cohen Landscape of IE Tasks: Degree of Formatting Grammatical sentences and some formatting & links Text paragraphs without formatting Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. Non-grammatical snippets, rich formatting & links Tables

16 Adapted from slide by William Cohen Landscape of IE Tasks: Intended Breadth of Coverage Web site specificGenre specificWide, non-specific Amazon.com Book PagesResumesUniversity Names FormattingLayoutLanguage

Landscape of IE Tasks” Complexity Closed set He was born in Alabama… The big Wyoming sky… U.S. states Regular set Phone: (413) The CALD main office can be reached at U.S. phone numbers Complex pattern University of Arkansas P.O. Box 140 Hope, AR U.S. postal addresses Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio …was among the six houses sold by Hope Feldman that year. Ambiguous patterns, needing context and many sources of evidence Person names Pawel Opalinski, Software Engineer at WhizBang Labs.

18 Adapted from slide by William Cohen Landscape of IE Tasks: Single Field/Record Single entity Person: Jack Welch Binary relationship Relation: Person-Title Person: Jack Welch Title: CEO N-ary record “Named entity” extraction Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Relation: Company-Location Company: General Electric Location: Connecticut Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt Location: Connecticut

19 Adapted from slide by William Cohen State of the Art Performance: a sample Named entity recognition from newswire text Person, Location, Organization, … F1 in high 80’s or low- to mid-90’s Binary relation extraction Contained-in (Location1, Location2) Member-of (Person1, Organization1) F1 in 60’s or 70’s or 80’s Web site structure recognition Extremely accurate performance obtainable Human effort (~10min?) required on each site

20 Slide by Chris Manning, based on slides by several others Three generations of IE systems Hand-Built Systems – Knowledge Engineering [1980s– ] Rules written by hand Require experts who understand both the systems and the domain Iterative guess-test-tweak-repeat cycle Automatic, Trainable Rule-Extraction Systems [1990s– ] Rules discovered automatically using predefined templates, using automated rule learners Require huge, labeled corpora (effort is just moved!) Statistical Models [1997 – ] Use machine learning to learn which features indicate boundaries and types of entities. Learning usually supervised; may be partially unsupervised

21 Adapted from slide by William Cohen Landscape of IE Techniques Any of these models can be used to capture words, formatting or both. Lexicons Alabama Alaska … Wisconsin Wyoming Abraham Lincoln was born in Kentucky. member? Classify Pre-segmented Candidates Abraham Lincoln was born in Kentucky. Classifier which class? Sliding Window Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Boundary Models Abraham Lincoln was born in Kentucky. Classifier which class? BEGINENDBEGINEND BEGIN Context Free Grammars Abraham Lincoln was born in Kentucky. NNPVPNPVNNP NP PP VP S Most likely parse? Finite State Machines Abraham Lincoln was born in Kentucky. Most likely state sequence?

22 Slide by Chris Manning, based on slides by several others Trainable IE systems Pros Annotating text is simpler & faster than writing rules. Domain independent Domain experts don’t need to be linguists or programers. Learning algorithms ensure full coverage of examples. Cons Hand-crafted systems perform better, especially at hard tasks. (but this is changing) Training data might be expensive to acquire May need huge amount of training data Hand-writing rules isn’t that hard!!

23 Slide by Chris Manning, based on slides by several others MUC: the genesis of IE DARPA funded significant efforts in IE in the early to mid 1990’s. Message Understanding Conference (MUC) was an annual event/competition where results were presented. Focused on extracting information from news articles: Terrorist events Industrial joint ventures Company management changes Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)

24 Adapted from slide by Lucian Vlad Lita Message Understanding Conference (MUC) Named entity –Person, Organization, Location Co-reference –Clinton  President Bill Clinton Template element –Perpetrator, Target Template relation –Incident Multilingual

25 Adapted from slide by Lucian Vlad Lita MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

26 Adapted from slide by Lucian Vlad Lita MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

27 Adapted from slide by Lucian Vlad Lita MUC Templates Relationship –tie-up Entities: –Bridgestone Sports Co, a local concern, a Japanese trading house Joint venture company –Bridgestone Sports Taiwan Co Activity –ACTIVITY 1 Amount –NT$2,000,000

28 Adapted from slide by Lucian Vlad Lita MUC Templates ATIVITY 1 Activity –Production Company –Bridgestone Sports Taiwan Co Product –Iron and “metal wood” clubs Start Date –January 1990

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE from FASTUS (1993) TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE: FASTUS(1993) TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$ ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. Example of IE: FASTUS(1993): Resolving anaphora TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$ ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990

32 Slide by Chris Manning, based on slides by others Evaluating IE Accuracy Always evaluate performance on independent, manually-annotated test data not used during system development. Measure for each test document: Total number of correct extractions in the solution template: N Total number of slot/value pairs extracted by the system: E Number of extracted slot/value pairs that are correct (i.e. in the solution template): C Compute average value of metrics adapted from IR: Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision

33 Slide by Chris Manning, based on slides by others MUC Information Extraction: State of the Art c NE – named entity recognition CO – coreference resolution TE – template element construction TR – template relation construction ST – scenario template production

34 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Basic method for extracting relevant information IE systems generally use a collection of specialized FSTs –Company Name detection –Person Name detection –Relationship detection

35 Adapted from Jurafsky & Martin 2000 Three Equivalent Representations Finite automata Regular expressions Regular languages Each can describe the others Theorem: For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.

36 Finite-Automata State Graphs The start state An accepting state A transition a A state

37 Finite Automata A FA is similar to a compiler in that: A compiler recognizes legal programs in some (source) language. A finite-state machine recognizes legal strings in some language. Example: Programming Language Identifiers sequences of one or more letters or digits, starting with a letter: letter letter | digit S A

38 Finite Automata Transition s 1  a s 2 Is read In state s 1 on input “a” go to state s 2 If end of input If in accepting state => accept Otherwise => reject If no transition possible (got stuck) => reject FSA = Finite State Automata

39 Language defined by FSA The language defined by a FSA is the set of strings accepted by the FSA. in the language of the FSM shown below: –x, tmp2, XyZzy, position27. not in the language of the FSM shown below: –123, a?, 13apples. letter letter | digit S A

40 Example: Integer Literals FSA that accepts integer literals with an optional + or - sign: Note the two different edges from S to A (+|-)?[0-9]+ + digit S B A -

41 Example: FSA that accepts three letter English words that begin with p and end with d or t. Here I use the convenient notation of making the state name match the input that has to be on the edge leading to that state. p t a o u d i

42 Formal Definition A finite automaton is a 5-tuple ( , Q, , q, F) where: An input alphabet  A set of states Q A start state q A set of accepting states F  Q  is the state transition function: Q x   Q (i.e., encodes transitions state  input state)

43 How to Implement an FSA A table-driven approach: table: one row for each state in the machine, and one column for each possible character. Table[j][k] which state to go to from state j on character k, an empty entry corresponds to the machine getting stuck. Note: when you use the re package in python, it converts your regex’s into efficient FSMs

44 Terminology Fine Points FSA: Finite State Automaton FSM: Finite State Machine FSA, FSM used interchangibly FST: Finite State Transducer The same as FSA, except each transition produces output

45 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE FSTs are often compiled from regular expressions Probabilistic (weighted) FSTs FSTs mean different things to different IE approaches: Based on lexical items (words) Based on statistical language models Based on deep syntactic/semantic analysis Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)

46 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc. Text Analyzer: Frodo – Proper Name Baggins – Proper Name works – Verb for – Prep Hobbit – UnknownCap Factory – NounCap Inc– CompAbbr

47 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc. A regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator” CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr

48 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc. ProperName works for CompanyName.

49 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc word (CAP | PN) CAB comma CAB word CAP = SomeCap, CAB = CompAbbr, PN = ProperName,  = empty string Company Name Detection FSA

50 Adapted from slide by Lucian Vlad Lita Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc word  word (CAP | PN)   CAB  CN comma   CAB  CN word  word CAP = SomeCap, CAB = CompAbbr, PN = ProperName,  = empty string, CN = CompanyName Company Name Detection FST

51 Slide by Chris Manning, based on slides by others FSM’s for Pattern Recognition Use regex’s at a higher level of description for pattern recognition Determining which person holds what office in what organization [person], [office] of [org] –Vuk Draskovic, leader of the Serbian Renewal Movement [org] (named, appointed, etc.) [person] P [office] –NATO appointed Wesley Clark as Commander in Chief Determining where an organization is located [org] in [loc] –NATO headquarters in Brussels [org] [loc] (division, branch, headquarters, etc.) –KFOR Kosovo headquarters

FASTUS: A Successful Hand-Built System 1.Complex Words: Recognition of multi-words and proper names 2.Basic Phrases: Simple noun groups, verb groups and particles 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Based on finite state automata (FSA) transductions set up new Taiwan dollars a Japanese trading house had set up production of 20, 000 iron and metal wood clubs [company] [set up] [Joint-Venture] with [company]