School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents.


1 School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents

2 Example: The Problem
[Figure: Martin Baker, a person; a genomics job; an employer's job posting form]

3 Information extraction basics
Finding facts in documents without full understanding, e.g. who, when, where, what, to whom.
An example: terrorist reports
  Input: news articles in free text
  Predefined template: Where, When, What, No. of injured, No. of killed, Weapon, Who is responsible
MUC conferences in the 1990s: an annual competition for message understanding
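The predefined template above can be pictured as a record with one slot per question. The following is a minimal sketch, not taken from the slides: the class name, field names, and example values are hypothetical and chosen only to mirror the slot list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentTemplate:
    """One slot per entry of the slide's template; every slot starts unfilled."""
    where: Optional[str] = None
    when: Optional[str] = None
    what: Optional[str] = None
    num_injured: Optional[int] = None
    num_killed: Optional[int] = None
    weapon: Optional[str] = None
    responsible: Optional[str] = None


def unfilled_slots(template):
    """Return the names of the slots the extractor has not filled yet."""
    return [f.name for f in fields(template) if getattr(template, f.name) is None]


# Hypothetical values, invented for illustration only.
example = IncidentTemplate(where="city centre", weapon="bomb", num_injured=3)
print(unfilled_slots(example))
```

Extraction then amounts to filling as many slots as the text supports, without needing to understand the rest of the article.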

4 Christopher Manning MUC Information Extraction Example
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
JOINT-VENTURE-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Ent: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20 000 000
ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start date: DURING: January 1990

5 Approaches
Manually crafted patterns
  Regular expressions
  Automata
  Grammar rules, natural language processing based
Example extraction pattern (crime victim):
  Prefiller: [POS: V, Hypernym: KILL]
  Filler: [Phrase: NOUN-GROUP]

6 Rule-based, knowledge-based
Determining which person holds what office in what organization
  Pattern: [person], [office] of [org]
    e.g. Vuk Draskovic, leader of the Serbian Renewal Movement
  Pattern: [org] (named, appointed, etc.) [person] Prep [office]
    e.g. NATO appointed Wesley Clark as Commander in Chief
Determining where an organization is located
  Pattern: [org] in [loc]
    e.g. NATO headquarters in Brussels
  Pattern: [org] [loc] (division, branch, headquarters, etc.)
    e.g. KFOR Kosovo headquarters
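Three of these hand-written rules can be approximated with regular expressions. This is a rough sketch under a strong assumption: a [person], [org] or [loc] is taken to be any run of capitalised words, whereas a real system would get those slots from a named-entity tagger. The names NAME, PERSON_OFFICE_ORG, ORG_APPOINTS, ORG_IN_LOC and extract are hypothetical.

```python
import re

# Assumption: a "name" is a run of capitalised words (e.g. "Wesley Clark").
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"

# [person], [office] of [org]
PERSON_OFFICE_ORG = re.compile(
    rf"(?P<person>{NAME}), (?P<office>[a-z][a-z ]*?) of (?:the )?(?P<org>{NAME})"
)

# [org] (named, appointed, etc.) [person] Prep [office]
# The office is matched greedily up to the last word character, so this only
# behaves well on short phrases like the slide's example.
ORG_APPOINTS = re.compile(
    rf"(?P<org>{NAME}) (?:named|appointed) (?P<person>{NAME}) (?:as|to) "
    r"(?P<office>[A-Z][\w ]*\w)"
)

# [org] in [loc], allowing an optional facility word before "in"
ORG_IN_LOC = re.compile(
    rf"(?P<org>{NAME})(?: (?:division|branch|headquarters))? in (?P<loc>{NAME})"
)


def extract(pattern, text):
    """Return the named slots of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else None


print(extract(PERSON_OFFICE_ORG, "Vuk Draskovic, leader of the Serbian Renewal Movement"))
print(extract(ORG_APPOINTS, "NATO appointed Wesley Clark as Commander in Chief"))
print(extract(ORG_IN_LOC, "NATO headquarters in Brussels"))
```

The capitalisation heuristic is brittle (it breaks on sentence-initial words and lowercase organisation names), which is one reason such rules are usually layered on top of a separate entity recogniser.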

7 Move on to Machine learning
Machine learning approach
  Label documents
  Learn rules and patterns
  Successful in named entity identification
Currently most systems work with patterns and name lists
Web scraping, Web information extraction, text mining
  Structured, semi-structured

8 IE is different in different domains!
Example: on the web there is less grammar, but more formatting & linking. The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Newswire:
  Apple to Open Its First Retail Store in New York City
  MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Web:
  www.apple.com/retail
  www.apple.com/retail/soho
  www.apple.com/retail/soho/theatre.html

9 Christopher Manning Why is IE hard on the web?
[Figure: web page annotated with "Need this price", "Title", and "A book, not a toy"]

10 Semi-structured data extraction
Wrapper: extracting information from semi-structured data, typically tables and lists from the Web.
If extracting from automatically generated web pages, simple regex patterns usually work.
  Amazon page: (.*?)
For certain restricted, common types of entities in unstructured text, simple regex patterns also usually work.
  Finding (US) phone numbers: (?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}
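The phone-number pattern can be tried directly with Python's re module. The sketch below uses the slide's pattern unchanged and, as an assumption of this example rather than anything the slide states, adds a stricter variant: inside a character class, " -." is read as a range from space to ".", which also admits characters such as ",". The names SLIDE_PATTERN, STRICT_PATTERN and find_phone_numbers are hypothetical.

```python
import re

# Pattern copied from the slide. "[ -.]" is a range from space to ".",
# so separators such as "," or "+" are accepted as well.
SLIDE_PATTERN = re.compile(r"(?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}")

# Variant with the hyphen placed last, making it a literal:
# only space, dot, or hyphen may separate the digit groups.
STRICT_PATTERN = re.compile(r"(?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}")


def find_phone_numbers(text, pattern=STRICT_PATTERN):
    """Return every substring of text that matches the phone-number pattern."""
    return pattern.findall(text)


print(find_phone_numbers("Call (650) 555-1234 or 555.9876 today."))
print(find_phone_numbers("555,1234", SLIDE_PATTERN))   # accepted by the slide's range
print(find_phone_numbers("555,1234", STRICT_PATTERN))  # rejected by the stricter variant
```

Neither pattern anchors on word boundaries, so both will also fire inside longer digit strings; that is acceptable for a demonstration but would need tightening for real data.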

11 Machine learning approach
Wrapper induction
Labelled pages
Pattern induction
Unlabelled pages
Rule-based
Tree-based

12 Evaluation
Model solutions as labelled facts
Precision: correctFoundFacts / allFoundFacts
Recall: correctFoundFacts / allFacts
F-measure: typically the harmonic mean of the two, 2 * Precision * Recall / (Precision + Recall)
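These three measures can be computed by treating the model solution and the system output as sets of facts. A minimal sketch follows; the function name evaluate and the example facts are assumptions made for illustration (the facts are loosely based on names that appear elsewhere in the deck, and the two wrong ones are invented).

```python
def evaluate(found_facts, gold_facts):
    """Return (precision, recall, f_measure) for extracted facts vs. the model solution."""
    found, gold = set(found_facts), set(gold_facts)
    correct = len(found & gold)                        # correctFoundFacts
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Model solution: three labelled facts.
gold = {
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "founder", "Free Software Foundation"),
}
# Hypothetical system output: four facts, two of them correct.
found = {
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "CEO", "Microsoft"),
    ("Bill Gates", "founder", "Free Software Foundation"),
}

precision, recall, f_measure = evaluate(found, gold)
print(precision, recall, f_measure)
```

Here precision is 2/4, recall is 2/3, and the harmonic mean works out to 4/7, which sits closer to the lower of the two scores than an arithmetic mean would.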

13 Full task of “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

14 What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques, applied to the same news excerpt as on the previous slide (October 14, 2002, 4:00 a.m. PT):
Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
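The last two stages, association and clustering, can be illustrated with a toy heuristic over mentions that are assumed to be already segmented and classified by hand. Everything here is an assumption for illustration, not the slide's method: the sentence indices, the word-overlap rule for clustering, and the rule that a record needs a NAME, TITLE and ORG in the same sentence. The names MENTIONS, representative and associate are hypothetical.

```python
# Hand-labelled mentions in document order: (sentence index, label, text).
# Segmentation and classification are assumed to have happened already.
MENTIONS = [
    (0, "ORG", "Microsoft Corporation"),
    (0, "TITLE", "CEO"),
    (0, "NAME", "Bill Gates"),
    (1, "ORG", "Microsoft"),
    (2, "NAME", "Gates"),
    (2, "ORG", "Microsoft"),
    (3, "NAME", "Bill Veghte"),
    (3, "ORG", "Microsoft"),
    (3, "TITLE", "VP"),
    (4, "NAME", "Richard Stallman"),
    (4, "TITLE", "founder"),
    (4, "ORG", "Free Software Foundation"),
]


def representative(text, label, seen):
    """Clustering: map a mention to an earlier one whose words contain it (or vice versa)."""
    words = set(text.split())
    for earlier in seen[label]:
        earlier_words = set(earlier.split())
        if words <= earlier_words or earlier_words <= words:
            return earlier
    seen[label].append(text)
    return text


def associate(mentions):
    """Association: build one (name, title, org) record per sentence that has all three."""
    seen = {"NAME": [], "TITLE": [], "ORG": []}
    by_sentence = {}
    for sentence, label, text in mentions:
        by_sentence.setdefault(sentence, {})[label] = representative(text, label, seen)
    records = []
    for sentence in sorted(by_sentence):
        slots = by_sentence[sentence]
        if all(label in slots for label in ("NAME", "TITLE", "ORG")):
            record = (slots["NAME"], slots["TITLE"], slots["ORG"])
            if record not in records:
                records.append(record)
    return records


for record in associate(MENTIONS):
    print(record)
```

With this heuristic "Gates" is clustered with "Bill Gates" and "Microsoft" with "Microsoft Corporation", the first mention seen, so the organisation column shows the longer form rather than the short form used in the slide's table.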

15 An Even Broader View
[Diagram labels: Document collection; Spider; Filter by relevance; IE (Segment, Classify, Associate, Cluster); Load DB; Database; Query, Search; Data mine; Create ontology; Label training data; Train extraction models]
Slide by Andrew McCallum. Used with permission.

