Published by Hester Preston. Modified over 9 years ago.
1
December 2005
CSA3180: Information Extraction I
CSA2050: Natural Language Processing
Information Extraction
– Named Entities
– IE Systems
– MUC
– Finite State Machines
– Pattern Recognition
2
Classification at different granularities
Text Categorization:
– Classify an entire document
Information Extraction (IE):
– Identify and classify small units within documents
Named Entity Extraction (NE):
– A subset of IE
– Identify and classify proper names: people, locations, organizations
3
[Figure: example web pages labelled "Martin Baker, a person", "Genomics job", and "Employers job posting form"]
4
Aggregator Websites
5
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
6
Aggregator Websites
Read in many web pages from different sites
Extract information into a database
Screen scraping
Can then return data matching particular queries
Data mining can extract meaningful insight that might not have been obvious
8
Data Mining
9
IE from Research Papers
10
IE from Commercial Websites
12
What is Information Extraction?
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

After IE:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
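The "filling slots in a database" step can be sketched in a few lines of Python. This is only an illustration: the table name `people`, the helper names `build_db` and `people_at`, and the use of an in-memory SQLite database are assumptions, not part of the slides. The rows are the three records from the filled table, with the third organization spelled out as it appears in the passage.

```python
import sqlite3

# Rows taken from the slide's filled table (NAME, TITLE, ORGANIZATION).
ROWS = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "founder", "Free Software Foundation"),
]

def build_db(rows):
    """Load extracted (name, title, organization) records into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    return conn

def people_at(conn, organization):
    """Return (name, title) pairs for one organization, sorted by name."""
    cur = conn.execute(
        "SELECT name, title FROM people WHERE organization = ? ORDER BY name",
        (organization,),
    )
    return cur.fetchall()

conn = build_db(ROWS)
print(people_at(conn, "Microsoft"))
```

Once the slots are in a table, ordinary queries can answer questions that the free text could not answer directly.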
13
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association

Segments found in the October 14, 2002 news passage:
– Microsoft Corporation
– CEO
– Bill Gates
– Microsoft
– Gates
– Microsoft
– Bill Veghte
– Microsoft
– VP
– Richard Stallman
– founder
– Free Software Foundation
aka "named entity extraction"
16
IE in Context
[Figure: pipeline diagram]
Document collection → Spider → Filter by relevance → IE → Load DB → Database → Query, Search; Data mine
IE steps: Segment, Classify, Associate, Cluster
Supporting steps: Create ontology; Label training data; Train extraction models
17
IE in Context: Formatting
Text paragraphs without formatting
– Example: Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Grammatical sentences and some formatting & links
18
IE in Context: Formatting
Non-grammatical snippets, rich formatting & links
Tables
19
IE in Context: Coverage
Web site specific (relies on formatting)
– Amazon.com book pages
Genre specific (relies on layout)
– Resumes
20
IE in Context: Coverage
Wide, non-specific (relies on language)
– University names
21
IE in Context: Complexity
Closed set (U.S. states)
– He was born in Alabama…
– The big Wyoming sky…
Regular set (U.S. phone numbers)
– Phone: (413) 545-1323
– The CALD main office can be reached at 412-268-1299
Complex pattern (U.S. postal addresses)
– University of Arkansas P.O. Box 140 Hope, AR 71802
– Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence (person names)
– …was among the six houses sold by Hope Feldman that year.
– Pawel Opalinski, Software Engineer at WhizBang Labs.
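The "regular set" case is the one where a hand-written regular expression goes a long way. The sketch below is a minimal illustration, not a complete phone-number grammar: the pattern `US_PHONE` and the helper `extract_phones` are made up for this example, and they only cover the two surface forms shown on the slide (a parenthesised area code, or three digit groups joined by a hyphen, dot, or space).

```python
import re

# Assumed pattern: "(413) 545-1323" or "412-268-1299" style numbers only.
US_PHONE = re.compile(
    r"(?<!\d)(?:\((\d{3})\)\s*|(\d{3})[-.\s])(\d{3})[-.\s](\d{4})(?!\d)"
)

def extract_phones(text):
    """Return every matched phone number, normalised to AAA-EEE-NNNN."""
    phones = []
    for m in US_PHONE.finditer(text):
        area = m.group(1) or m.group(2)  # whichever area-code form matched
        phones.append(f"{area}-{m.group(3)}-{m.group(4)}")
    return phones

print(extract_phones("Phone: (413) 545-1323"))
print(extract_phones("The CALD main office can be reached at 412-268-1299"))
```

The same approach does not carry over to the ambiguous case: "Hope" is a town in one example and a first name in another, which no fixed pattern can tell apart without context.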
22
IE in Context: Single Field/Record
Example: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Single entity ("named entity" extraction)
– Person: Jack Welch
– Person: Jeffrey Immelt
– Location: Connecticut
Binary relationship
– Relation: Person-Title; Person: Jack Welch; Title: CEO
– Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record
– Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
23
State of the Art
Named entity recognition from newswire text
– Person, Location, Organization, …
– F1 in high 80's or low- to mid-90's
Binary relation extraction
– Contained-in (Location1, Location2)
– Member-of (Person1, Organization1)
– F1 in 60's or 70's or 80's
Web site structure recognition
– Extremely accurate performance obtainable
– Human effort (~10min?) required on each site
24
IE Generations
Hand-Built Systems: Knowledge Engineering [1980s– ]
– Rules written by hand
– Require experts who understand both the systems and the domain
– Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ]
– Rules discovered automatically from predefined templates by automated rule learners
– Require huge, labeled corpora (effort is just moved!)
Statistical Models [1997– ]
– Use machine learning to learn which features indicate boundaries and types of entities
– Learning usually supervised; may be partially unsupervised
25
IE Techniques
Running example: Abraham Lincoln was born in Kentucky.
Lexicons
– Is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
Classify pre-segmented candidates
– A classifier decides which class each candidate belongs to
Sliding window
– A classifier decides which class each window belongs to
– Try alternate window sizes
Boundary models
– A classifier labels BEGIN and END boundaries
Context-free grammars
– Most likely parse? (node labels on the slide include NNP, V, P, NP, PP, VP, S)
Finite state machines
– Most likely state sequence?
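The first and third techniques combine naturally: slide a window over the tokens and test each window for membership in a lexicon. The sketch below assumes a whitespace tokenizer and a deliberately truncated state lexicon; `US_STATES`, `sliding_windows`, and `lexicon_matches` are names invented for this example. A trained classifier would replace the simple `in` test in the other sliding-window variants.

```python
# Truncated, illustrative lexicon; a real one would list all U.S. states.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def sliding_windows(tokens, max_len=3):
    """Yield (start, end, text) for every window of 1..max_len tokens."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end <= len(tokens):
                yield start, end, " ".join(tokens[start:end])

def lexicon_matches(sentence, lexicon, max_len=3):
    """Return the windows whose text is a member of the lexicon."""
    tokens = sentence.rstrip(".").split()  # naive tokenizer (assumption)
    return [(s, e, text) for s, e, text in sliding_windows(tokens, max_len)
            if text in lexicon]

print(lexicon_matches("Abraham Lincoln was born in Kentucky.", US_STATES))
```

Trying several window sizes matters for multi-word entries such as "New York", which a single-token lookup would miss.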
26
Trainable IE Systems
Pros
– Annotating text is simpler & faster than writing rules
– Domain independent
– Domain experts don't need to be linguists or programmers
– Learning algorithms ensure full coverage of examples
Cons
– Hand-crafted systems perform better, especially at hard tasks (but this is changing)
– Training data might be expensive to acquire
– May need huge amount of training data
– Hand-writing rules isn't that hard!!
27
MUC: Genesis of IE
DARPA funded significant efforts in IE in the early to mid 1990s.
Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:
– Terrorist events
– Industrial joint ventures
– Company management changes
Information extraction was of particular interest to the intelligence community (CIA, NSA). (Note: early '90s)
28
MUC
Named entity
– Person, Organization, Location
Co-reference
– Clinton, President Bill Clinton
Template element
– Perpetrator, Target
Template relation
– Incident
Multilingual
29
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and "metal wood" clubs a month
31
MUC Templates
Relationship: tie-up
Entities: Bridgestone Sports Co., a local concern, a Japanese trading house
Joint venture company: Bridgestone Sports Taiwan Co.
Activity: ACTIVITY 1
Amount: NT$20,000,000
32
MUC Templates
ACTIVITY 1
– Activity: Production
– Company: Bridgestone Sports Taiwan Co.
– Product: Iron and "metal wood" clubs
– Start Date: January 1990
33
Example from FASTUS (1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
Joint Venture Company: "Bridgestone Sports Taiwan Co."
Activity: ACTIVITY-1
Amount: NT$20000000
34
TIE-UP-1
Relationship: TIE-UP
Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
Joint Venture Company: "Bridgestone Sports Taiwan Co."
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: "Bridgestone Sports Taiwan Co."
Product: "iron and 'metal wood' clubs"
Start Date: DURING: January 1990
36
1. Complex Words: recognition of multi-words and proper names
   – set up
   – new Taiwan dollars
2. Basic Phrases: simple noun groups, verb groups and particles
   – a Japanese trading house
   – had set up
3. Complex Phrases: complex noun groups and verb groups
   – production of 20,000 iron and metal wood clubs
4. Domain Events: patterns for events of interest to the application. Basic templates are to be built.
   – [company] [set up] [Joint-Venture] with [company]
5. Merging Structures: templates from different parts of the text are merged if they provide information about the same entity or event.
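Stage 4 can be pictured as matching a fixed sequence of phrase types over the output of the earlier stages. The sketch below is a loose illustration of that idea only, not how FASTUS itself is implemented: the chunk type names, the hand-made chunk list, and the function `match_tie_up` are all assumptions, and the input is simplified so that the phrases sit next to each other.

```python
# Phrase types expected by the domain pattern
# [company] [set up] [Joint-Venture] with [company].
TIE_UP_PATTERN = ["company", "set_up", "joint_venture", "with", "company"]

def match_tie_up(chunks):
    """chunks: list of (type, text) pairs produced by earlier stages.

    Returns a basic TIE-UP template for the first place where the chunk
    types line up with TIE_UP_PATTERN, or None if there is no match.
    """
    width = len(TIE_UP_PATTERN)
    for i in range(len(chunks) - width + 1):
        window = chunks[i:i + width]
        if [ctype for ctype, _ in window] == TIE_UP_PATTERN:
            return {
                "Relationship": "TIE-UP",
                "Entities": [window[0][1], window[-1][1]],
            }
    return None

# Simplified, hand-made chunk sequence (assumption, not real system output).
chunks = [
    ("company", "Bridgestone Sports Co."),
    ("set_up", "had set up"),
    ("joint_venture", "a joint venture"),
    ("with", "with"),
    ("company", "a local concern"),
]
print(match_tie_up(chunks))
```

Stage 5 would then merge this basic template with others that mention the same joint venture, such as the production activity.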
37
Evaluating IE Accuracy
Always evaluate performance on independent, manually annotated test data not used during system development.
Measure for each test document:
– Total number of correct extractions in the solution template: N
– Total number of slot/value pairs extracted by the system: E
– Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
Compute average value of metrics adapted from IR:
– Recall = C/N
– Precision = C/E
– F-Measure = harmonic mean of recall and precision
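For a single test document, the three measures follow directly from N, E, and C. The sketch below assumes that both the solution template and the system output are represented as sets of (slot, value) pairs and that a pair counts as correct only on an exact match; the function name `score` and the sample pairs are illustrative. Averaging over documents, as the slide asks, would be a loop around this function.

```python
def score(solution, extracted):
    """Return (precision, recall, f_measure) for one document.

    solution, extracted: sets of (slot, value) pairs.
    """
    n = len(solution)              # N: pairs in the solution template
    e = len(extracted)             # E: pairs the system extracted
    c = len(solution & extracted)  # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

solution = {
    ("Activity", "PRODUCTION"),
    ("Company", "Bridgestone Sports Taiwan Co."),
    ("Product", "iron and 'metal wood' clubs"),
    ("Start Date", "January 1990"),
}
extracted = {
    ("Activity", "PRODUCTION"),
    ("Company", "Bridgestone Sports Taiwan Co."),
    ("Start Date", "Friday"),  # a wrong extraction
}
print(score(solution, extracted))
```

Here N = 4, E = 3, and C = 2, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.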
38
Named Entities
Person Name: Colin Powell, Frodo
Location Name: Middle East, Aiur
Organization: UN, DARPA
Domain Specific vs. Open Domain
39
Nymble (BBN Corporation)
State of the art system
Near-human performance: ~90% accuracy
Statistical system
Approach: Hidden Markov Model (HMM)
40
Nymble (BBN Corporation)
Noisy channel paradigm
– Originally, entities were marked in the raw text
– Post noisy channel, annotation is lost
Probability of the most likely sequence of name classes (NC) given a sequence of words (W):
– Pr(NC|W) = Pr(W,NC) / Pr(W)
– Since the a priori probability of the word sequence can be considered constant for any given sentence, maximize just the numerator
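Written out as a derivation, the step from the conditional to the joint probability looks as follows. The hat notation for the chosen sequence is an addition for readability; the symbols NC and W are the ones used above.

```latex
\begin{align*}
\widehat{NC} &= \arg\max_{NC} \Pr(NC \mid W) \\
             &= \arg\max_{NC} \frac{\Pr(W, NC)}{\Pr(W)} \\
             &= \arg\max_{NC} \Pr(W, NC)
\end{align*}
```

The last step holds because Pr(W) does not depend on NC, so dividing by it cannot change which name-class sequence scores highest.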
41
Nymble (BBN Corporation)
[Figure: HMM state diagram]
States: Start of Sentence, Person, Organization, five other name classes, Not-A-Name, End of Sentence
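Finding the most likely state sequence in an HMM is usually done with the Viterbi algorithm. The sketch below is a plain first-order HMM with two states and invented toy probabilities. It is a simplification for illustration and not Nymble's actual model, whose probabilities are estimated from annotated text and depend on more context. All names and numbers here are assumptions.

```python
import math

def viterbi(words, states, start, trans, emit, unseen=1e-6):
    """Return the most likely state sequence for words (log-space Viterbi)."""
    # Each column maps state -> (best log score so far, previous state).
    column = {s: (math.log(start[s]) + math.log(emit[s].get(words[0], unseen)), None)
              for s in states}
    columns = [column]
    for word in words[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            # Best predecessor for state s.
            best, back = max((prev[p][0] + math.log(trans[p][s]), p) for p in states)
            column[s] = (best + math.log(emit[s].get(word, unseen)), back)
        columns.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy parameters, invented for this example.
STATES = ["PERSON", "NOT-A-NAME"]
START = {"PERSON": 0.3, "NOT-A-NAME": 0.7}
TRANS = {"PERSON": {"PERSON": 0.5, "NOT-A-NAME": 0.5},
         "NOT-A-NAME": {"PERSON": 0.2, "NOT-A-NAME": 0.8}}
EMIT = {"PERSON": {"Bill": 0.4, "Gates": 0.4, "said": 0.01},
        "NOT-A-NAME": {"Bill": 0.01, "Gates": 0.01, "said": 0.5}}

print(viterbi(["Bill", "Gates", "said"], STATES, START, TRANS, EMIT))
```

With these numbers the decoder labels "Bill Gates" as PERSON and "said" as NOT-A-NAME.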
42
Automatic Content Extraction
DARPA ACE Program
Identify entities
– Named: Bilbo, San Diego, UNICEF
– Nominal: the president, the hobbit
– Pronominal: she
Reference resolution
– Clinton, the president, he
43
Question Answering
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer
44
Question Answering
Feedback loops can be present for constraint relaxation purposes
Not all QA systems adhere to the pipeline architecture
Question answering flavors
– Factoid vs. complex
– Who invented paper? vs. Which of Mr. Bush's friends are Black Sabbath fans?
– Closed vs. open domain
45
Answer Extraction
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer
Focus on open domain, factoid question answering
46
Practical Issues
Web
Spell Checking
– Mispling
– nucular
Infrequent forms:
– Niagra vs. Niagara, Filenes vs. Filene's
Google QA
Genome, video, games
47
Practical Issues
Traditional Information Extraction
– Either expert built or statistical
Specific strategies for specific question types
– Person Bio vs. Location question types
Ability to generalize to new questions and new question types
48
Practical Issues
Who invented Blah?
– Blah was invented by PersonName
– Blah was Verb by PersonName, where Verb is a synonym of invented
– Blah VerbPhrase by PersonName
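The first two answer patterns can be sketched as a regular expression built from the question's topic. Everything specific here is an assumption made for illustration: the synonym list `INVENT_SYNONYMS`, the capitalised-word heuristic standing in for a real PersonName recogniser, and the sample sentences. A practical system would take its synonyms from a resource such as WordNet and its person names from an NE tagger.

```python
import re

# Crude stand-in for a PersonName recogniser: one or more capitalised words.
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"

# Hypothetical synonym list for "invented".
INVENT_SYNONYMS = ("invented", "devised", "created")

def answer_pattern(topic):
    """Build the pattern '<topic> was <Verb> by <PersonName>'."""
    verbs = "|".join(INVENT_SYNONYMS)
    return re.compile(rf"{re.escape(topic)} was (?:{verbs}) by {PERSON}")

def find_inventor(topic, text):
    """Return the matched person name, or None if no pattern applies."""
    match = answer_pattern(topic).search(text)
    return match.group("person") if match else None

print(find_inventor("Blah", "Blah was devised by Jane Doe last year."))
```

The third pattern, with an arbitrary verb phrase, is where fixed expressions stop being enough and a parser or learned extractor becomes useful.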
49
Popular Resources
Experts and/or learning algorithms
Gazetteers
NE taggers
Part-of-speech taggers
Parsers
WordNet
Stopword list
Stemmer