Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011.

Slides:



Advertisements
Similar presentations
ThemeInformation Extraction for World Wide Web PaperUnsupervised Learning of Soft Patterns for Generating Definitions from Online News Author Cui, H.,
Advertisements

Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester Dr. Alexander Fraser, CIS.
A Machine Learning Approach to Coreference Resolution of Noun Phrases By W.M.Soon, H.T.Ng, D.C.Y.Lim Presented by Iman Sen.
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: WSDM’12 Date: 1 November,
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 21 11/8/2011.
Sequence Classification: Chunking & NER Shallow Processing Techniques for NLP Ling570 November 23, 2011.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
Ang Sun Ralph Grishman Wei Xu Bonan Min November 15, 2011 TAC 2011 Workshop Gaithersburg, Maryland USA.
Named Entity Recognition LING 570 Fei Xia Week 10: 11/30/09.
CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.
Empirical Methods in Information Extraction - Claire Cardie 자연어처리연구실 한 경 수
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR,
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Automatically Constructing a Dictionary for Information Extraction Tasks Ellen Riloff Proceedings of the 11 th National Conference on Artificial Intelligence,
The classification problem (Recap from LING570) LING 572 Fei Xia, Dan Jinguji Week 1: 1/10/08 1.
Ontology Learning and Population from Text: Algorithms, Evaluation and Applications Chapters Presented by Sole.
Information Extraction
Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts G. Melli, M. Ester, A. Sarkar Dec. 6, 2007
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.
A Survey for Interspeech Xavier Anguera Information Retrieval-based Dynamic TimeWarping.
Illinois-Coref: The UI System in the CoNLL-2012 Shared Task Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Mark Sammons, and Dan Roth Supported by ARL,
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
Ling 570 Day 17: Named Entity Recognition Chunking.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Introduction  Information Extraction (IE)  A limited form of “complete text comprehension”  Document 로부터 entity, relationship 을 추출 
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
Lecture 13 Information Extraction Topics Name Entity Recognition Relation detection Temporal and Event Processing Template Filling Readings: Chapter 22.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
NTCIR /21 ASQA: Academia Sinica Question Answering System for CLQA (IASL) Cheng-Wei Lee, Cheng-Wei Shih, Min-Yuh Day, Tzong-Han Tsai, Tian-Jian Jiang,
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
NYU: Description of the Proteus/PET System as Used for MUC-7 ST Roman Yangarber & Ralph Grishman Presented by Jinying Chen 10/04/2002.
Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001.
Indirect Supervision Protocols for Learning in Natural Language Processing II. Learning by Inventing Binary Labels This work is supported by DARPA funding.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Ang Sun Director of Research, Principal Scientist, inome
TimeML compliant text analysis for Temporal Reasoning Branimir Boguraev and Rie Kubota Ando.
Template-Based Event Extraction Kevin Reschke – Aug 15 th 2013 Martin Jankowiak, Mihai Surdeanu, Dan Jurafsky, Christopher Manning.
Deep Questions without Deep Understanding
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Support Vector Machines and Kernel Methods for Co-Reference Resolution 2007 Summer Workshop on Human Language Technology Center for Language and Speech.
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Department of Computer Science The University of Texas at Austin USA Joint Entity and Relation Extraction using Card-Pyramid Parsing Rohit J. Kate Raymond.
Learning to Extract CSCI-GA.2590 Ralph Grishman NYU.
Relation Extraction (RE) via Supervised Classification See: Jurafsky & Martin SLP book, Chapter 22 Exploring Various Knowledge in Relation Extraction.
Network Management Lecture 13. MACHINE LEARNING TECHNIQUES 2 Dr. Atiq Ahmed Université de Balouchistan.
Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)
Automatically Labeled Data Generation for Large Scale Event Extraction
CSC 594 Topics in AI – Natural Language Processing
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
CSCE 590 Web Scraping – Information Retrieval
CSCI 5832 Natural Language Processing
Social Knowledge Mining
LING 388: Computers and Language
Introduction Task: extracting relational facts from text
Lecture 13 Information Extraction
Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.
Presentation transcript:

Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Roadmap Information extraction: Definition & Motivation General approach Relation extraction techniques: Fully supervised Lightly supervised

Information Extraction Motivation: Unstructured data hard to manage, assess, analyze

Information Extraction Motivation: Unstructured data hard to manage, assess, analyze Many tools for manipulating structured data: Databases: efficient search, agglomeration Statistical analysis packages Decision support systems

Information Extraction Motivation: Unstructured data hard to manage, assess, analyze Many tools for manipulating structured data: Databases: efficient search, agglomeration Statistical analysis packages Decision support systems Goal: Given unstructured information in texts Convert to structured data E.g., populate DBs

IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: Amount: Effective Date: Follower:

IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline:United Airlines Amount:$6 Effective Date: Follower:American Airlines

Other Applications Shared tasks: Message Understanding Conferences (MUC)

Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers:

Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents

Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents Incident type, Actor, Victim, Date, Location General:

Other Applications Shared tasks: Message Understanding Conferences (MUC) Joint ventures/mergers: CompanyA, CompanyB, CompanyC, amount, type Terrorist incidents Incident type, Actor, Victim, Date, Location General: Pricegrabber Stock market analysis Apartment finder

Information Extraction Components Incorporates range of techniques, tools

Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling

Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling Semi-, un-supervised learning: clustering, bootstrapping

Information Extraction Components Incorporates range of techniques, tools FSTs (pattern extraction) Supervised learning: Classification Sequence labeling Semi-, un-supervised learning: clustering, bootstrapping Part-of-Speech tagging Named Entity Recognition Chunking Parsing

Information Extraction Process Typically pipeline of major components

Information Extraction Process Typically pipeline of major components First step: Named Entity Recognition (NER) Often application-specific Generic type: Person, Organiztion, Location Specific: genes to college courses

Information Extraction Process Typically pipeline of major components First step: Named Entity Recognition (NER) Often application-specific Generic type: Person, Organiztion, Location Specific: genes to college courses Identify NE mentions: specific instances

Example: Pre-NER Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Information Extraction: Reference Resolution Builds on NER Clusters entity mentions referring to same things Creates entity equivalence classes

Information Extraction: Reference Resolution Builds on NER Clusters entity mentions referring to same things Creates entity equivalence classes Includes anaphora resolution E.g. pronoun resolution (it, he, she…; one; this, that)

Information Extraction: Reference Resolution Builds on NER Clusters entity mentions referring to same things Creates entity equivalence classes Includes anaphora resolution E.g. pronoun resolution (it, he, she…; one; this, that) Also non-anaphoric, general mentions

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text

Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations:

Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial Generic relations: United is part-of UAL Corp. American Airlines is part-of AMR Tim Wagner is an employee of American Airlines

Relation Detection & Classification Relation detection & classification: Find and classify relations among entities in text Mix of application-specific and generic relations Generic relations: Family, employment, part-whole, membership, geospatial Generic relations: United is part-of UAL Corp. American Airlines is part-of AMR Tim Wagner is an employee of American Airlines Domain-specific relations: United serves Chicago; United serves Denver, etc

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Event Detection & Classification Event detection and classification: Find key events

Event Detection & Classification Event detection and classification: Find key events Fare increase by United Matching fare increase by American Reporting of fare increases How cued?

Event Detection & Classification Event detection and classification: Find key events Fare increase by United Matching fare increase by American Reporting of fare increases How cued? Instances of “said” and ”cite”

Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text

Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text Others: Date expressions: Absolute: days of week, holidays Relative: two days from now, next year Clock times: noon, 3:30pm

Temporal Expressions: Recognition and Analysis Temporal expression recognition Finds temporal events in text E.g. Friday, Thursday in example text Others: Date expressions: Absolute: days of week, holidays Relative: two days from now, next year Clock times: noon, 3:30pm Temporal analysis: Map expressions to points/spans on timeline Anchor: e.g. Friday, Thursday: tied to example dateline

Template Filling Texts describe stereotypical situations in domain Identify texts evoking template situations Fill slots in templates based on text

Template Filling Texts describe stereotypical situations in domain Identify texts evoking template situations Fill slots in templates based on text In example: fare-raise is stereo-typical event

Information Extraction Steps Named Entity Recognition Reference Resolution Relation Detection & Classification Event Detection & Classification Temporal Expression Recognition & Analysis Template-filling

Relation Detection & Classification

Common Relations

Relations and Entities from Example

Supervised Approaches to Relation Analysis Two stage process:

Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities

Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples:

Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples: Drawn from labeled training data Negative examples:

Supervised Approaches to Relation Analysis Two stage process: Relation detection: Detect whether a relation holds between two entities Positive examples: Drawn from labeled training data Negative examples: Generated from within-sentence entity pairs w/no relation Relation classification: Label each related pair of entities by type Multi-class classification task

Feature Design Which features can we use to classify relations?

Example: post-NER Citing high fuel prices, [ ORG United Airlines] said [ Time Friday] it has increased fares by [ MONEY $6] per round trip to some cities also served by lower-cost carriers. [ ORG American Airlines], a unit of [ ORG AMR Corp.] immediately matched the move, spokesman [ PER Time Wagner] said. [ org United], a unit of [ ORG UAL Corp.], said the increase took effect [ TIME Thursday] an applies to most routes where it competes against discount carriers, such as [ LOC Chicago] to [ LOC Dallas] and [ LOC Denver] to [ LOC San Francisco].

Common Relations

Feature Design Which features can we use to classify relations?

Feature Design Which features can we use to classify relations? Named entity features: Named entity types of candidate arguments Concatenation of two entity types Head words of arguments Bag-of-words from arguments

Feature Design Which features can we use to classify relations? Named entity features: Named entity types of candidate arguments Concatenation of two entity types Head words of arguments Bag-of-words from arguments Context word features: Bag-of-words/bigrams between entities (+/- stemmed) Words and stems before/after entities Distance in words, # entities between arguments

Feature Design Features for relation classification Syntactic structure features can signal relations E.g. appositive signals part-of relation

Feature Design Features for relation classification Syntactic structure features can signal relations E.g. appositive signals part-of relation

Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse

Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax?

Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax? Binary features: Detection of construction patterns

Feature Design Syntactic structure features Derived from syntactic analysis From base-phrase chunking to full constituent parse Features: Presence of construction Chunk base-phrase paths Bag-of-chunk heads Dependency/constituency path Distance between arguments How can we represent syntax? Binary features: Detection of construction patterns Structure features: Encode paths through trees

Some Example Features

Lightly Supervised Approaches Supervised approaches highly effective, but

Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training

Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but

Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches:

Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches: Build on small seed sets of patterns that capture relations of interest E.g. manually constructed regular expressions

Lightly Supervised Approaches Supervised approaches highly effective, but Require large manually annotated corpus for training Unsupervised approaches interesting, but May not yield results consistent with desired categories Lightly supervised approaches: Build on small seed sets of patterns that capture relations of interest E.g. manually constructed regular expressions Iteratively generalize and refine patterns to optimize

Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern:

Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern: / * has a hub at * / In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport

Bootstrapping Relations Create seed patterns and instances Example: finding airline hub cities Initial pattern: / * has a hub at * / In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport Instances: (Delta, LaGuardia), (Midwest,KCI),(American,San Juan)

Bootstrapping Relations Issues:

Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix?

Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix?

Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns

Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns – see issue 1

Bootstrapping Relations Issues: False alarms: Some matches aren’t valid relations The catheter has a hub at the proximal end. A star topology often has a hub at the center. Fix? /[ORG] has hub at [LOC]/ Misses: Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport. Fix? Relax patterns – see issue 1 Add new high precision patterns

Bootstrapping How can we obtain more high quality patterns?

Bootstrapping How can we obtain more high quality patterns? Human labeling?

Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub

Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns:

Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns: /[ORG], which uses [LOC] as a hub /

Bootstrapping How can we obtain more high quality patterns? Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences E.g. tuple: Ryanair, Charleroi and hub Occurrences: Budget airline Ryanair, which uses Charleroi as a hub All flights in and out of Ryanair’s Belgian hub at Charleroi Patterns: /[ORG], which uses [LOC] as a hub / /[ORG]’s hub at [LOC] /

Bootstrapping Issues Issues: How do we represent patterns? How to we assess the patterns? How to we assess the tuples?

Pattern Representation General approaches:

Pattern Representation General approaches: Regular expressions Specific, high precision

Pattern Representation General approaches: Regular expressions Specific, high precision Machine learning style feature vectors More flexible, higher recall

Pattern Representation General approaches: Regular expressions Specific, high precision Machine learning style feature vectors More flexible, higher recall Components in representation: Context before first entity mention Context between entity mentions Context after last entity mention Order of arguments

Assessing Patterns & Tuples Issue:

Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns

Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns Balance two main factors: Performance on seed tuples Hits and misses Productivity in full collection Finds

Assessing Patterns & Tuples Issue: No gold standard! Can use seed patterns Balance two main factors: Performance on seed tuples Hits and misses Productivity in full collection Finds Use conservative strategy to minimize semantic drift E.g. Sydney has a ferry hub at Circular Key.

Relation Analysis Evaluation Approaches: Standardized, labeled data sets

Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus

Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus Evaluate retrieved tuples (e.g. DB contents) Intuition: Don’t care how many times we find same tuple

Relation Analysis Evaluation Approaches: Standardized, labeled data sets Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection No fully labeled corpus Evaluate retrieved tuples (e.g. DB contents) Intuition: Don’t care how many times we find same tuple Checked by human assessors