Open Information Extraction from the Web

Presentation transcript:

Open Information Extraction from the Web
Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni, IJCAI 2007
Presented by Nivedha Sivakumar (nivedha@seas.upenn.edu), Feb-11-2019

Problem & Motivation
- Standard Information Extraction (IE): given "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.", a pre-specified relation such as MergerBetween(company1, company2, date) is instantiated.
- This paper proposes a new self-supervised, domain-independent form of IE: Open IE.
- Information is extracted as tuples t = (e_i, r_i,j, e_j), e.g. (<proper noun>, works with, <proper noun>); see the small illustration below.
- The goal of most standard IE paradigms is to allow logical reasoning to draw inferences from the logical content of the input data.
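To make the contrast concrete, here is a tiny illustration of the two output formats (the entity and relation strings are invented for the example, not actual system output):

```python
# Standard IE fills slots of a pre-specified relation; Open IE keeps the
# relation as a text string discovered in the sentence itself.
standard_ie = ("MergerBetween", ("Foo Inc.", "Bar Corp.", "yesterday"))
open_ie = ("Foo Inc.", "announced their acquisition of", "Bar Corp.")  # t = (e_i, r_i,j, e_j)

e_i, r_ij, e_j = open_ie
print(r_ij)  # the relation string comes from the text, not from a predefined schema
```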

Problem & Motivation
- Existing systems are resource-hungry:
  - Not scalable
  - Need to pre-specify the relations of interest
  - Require multiple passes over the corpus to extract tuples
- The paper introduces TextRunner:
  - Scalable and domain-independent
  - Extracts relational tuples from text
  - Tuples are indexed to support extraction via queries
- The input to TextRunner is a corpus; the output is a set of extractions

Contents:
- Previous works
- Novel aspects of this paper
- TextRunner architecture
  - Self-supervised learner
  - Single-pass extractor
  - Redundancy-based assessor
  - Query processing
- Results
  - Comparison with SoTA
  - Global statistics on facts learned
- Conclusions, shortcomings and future work

Previous Works
- Entity chunking is common in previous work: trainable systems take hand-tagged instances as input
  - DIPRE [Brin, 1998], SNOWBALL [Agichtein & Gravano, 2000], Web-based Q&A [Ravichandran & Hovy, 2002]
- Relation extraction with kernel-based methods [Bunescu & Mooney, 2005], maximum-entropy models [Kambhatla, 2004], and graphical models [Rosario & Hearst, 2004]

Previous SoTA: KnowItAll
- Self-supervised learning: labels its own training data using domain-independent extraction patterns
- Handles corpus heterogeneity with a POS tagger rather than a parser (parsers are heavier-weight mechanisms, hence less scalable)
Shortcomings of KnowItAll:
- Needs pre-specified relation names as input
- Makes multiple passes over the corpus
- Can take weeks to extract relational data

Contributions of this work
- Proposes Open IE:
  - The system makes a single data-driven pass over its corpus
  - Extracts a large set of relational tuples without human intervention
- Introduces TextRunner:
  - A fully implemented, highly scalable Open IE system
  - Tuples are indexed to support efficient extraction via user queries
- Reports qualitative and quantitative results over TextRunner's 11 million most probable extractions

TextRunner Architecture
TextRunner consists of 3 key modules:
- Self-supervised learner: given a corpus sample, outputs a classifier that labels candidate extractions
- Single-pass extractor: makes a single pass over the entire corpus to extract tuples, which the learner's classifier labels
- Redundancy-based assessor: computes the probability that a tuple is a correct instance

TextRunner Architecture

TextRunner Module 1: Self-supervised Learner
The self-supervised learner operates in 2 steps: automatic labelling, then training a Naive Bayes (NB) classifier.
- Given candidate tuples, the learner labels each as positive or negative
- Tuples are mapped to a feature-vector representation
- The classifier uses no lexical features, which gives domain independence
- The learner checks whether:
  - the relation's word length is below a defined limit
  - the relation does not cross a sentence-like boundary
  - the entities are not only pronouns
- If the conditions are met, the tuple is classified as trustworthy (a sketch of these heuristics follows below)
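A minimal sketch of the labelling heuristics, assuming a simplified string representation of candidate tuples; the real learner parses a corpus sample, maps tuples to non-lexical features, and trains the NB classifier, none of which is shown here:

```python
MAX_RELATION_WORDS = 5                   # assumed limit on relation length
PRONOUNS = {"he", "she", "it", "they", "this", "that", "we", "you"}
BOUNDARY_TOKENS = {".", ";", "!", "?"}   # crude stand-in for sentence-like boundaries

def label_candidate(e1, relation, e2):
    """Return True if the candidate tuple passes all heuristics (trustworthy)."""
    rel_words = relation.split()
    if len(rel_words) > MAX_RELATION_WORDS:               # relation too long
        return False
    if any(tok in BOUNDARY_TOKENS for tok in rel_words):  # crosses a boundary
        return False
    if e1.lower() in PRONOUNS or e2.lower() in PRONOUNS:  # pronoun-only entity
        return False
    return True

print(label_candidate("Scientists", "are studying", "string theory"))  # True
print(label_candidate("They", "are studying", "string theory"))        # False
```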

TextRunner Module 2: Single-Pass Extractor
- Makes a single pass over the corpus, tagging each sentence with its most probable POS tags
- Entities: noun phrases found by a chunker
- Relations: the text between entities, simplified by removing over-specification
- Candidate tuples are classified by the learner module, then extracted and stored
Example (a rough code sketch follows below):
- Given: "Scientists from many universities are studying string theory, but Massachusetts Institute of Tech is one of the pioneers"
- POS: (noun, prep, adj, noun, verb, verb, noun, noun)
- Entities: (Scientists, universities, string theory)
- Relation simplification: "Scientists from many universities are studying" → "Scientists are studying"
- Output: (Scientists, are studying, string theory)
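A rough sketch of the tag-chunk-pair pipeline, using NLTK as a stand-in for TextRunner's own lightweight tagger and noun-phrase chunker (it assumes the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages are installed, and omits the relation-simplification step):

```python
import nltk

NP_GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"   # crude noun-phrase pattern
chunker = nltk.RegexpParser(NP_GRAMMAR)

def extract_candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    nps = [" ".join(word for word, _ in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"]
    # Pair adjacent noun phrases; the text between them is the candidate relation.
    candidates = []
    search_from = 0
    for left, right in zip(nps, nps[1:]):
        start = sentence.find(left, search_from) + len(left)
        end = sentence.find(right, start)
        relation = sentence[start:end].strip(" ,")
        search_from = end
        if relation:
            candidates.append((left, relation, right))
    return candidates

print(extract_candidates("Scientists are studying string theory."))
# [('Scientists', 'are studying', 'string theory')]
```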

More TextRunner Modules
Redundancy-based Assessor:
- Omits non-essential modifiers and merges identical tuples, e.g. "was developed by" = "was originally developed by"
- From the number of distinct sentences supporting a tuple t, it estimates the probability that t is a correct instance
- A sketch of the merging and counting step follows below
Query Processing:
- Inverted indexing over a pool of machines allows quick retrieval
- Each relation found during tuple extraction is assigned to a single machine in the pool
- Every machine computes an inverted index over the text of its locally stored tuples, so each machine is guaranteed to store all tuples containing a reference to any relation assigned to it
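A sketch of the assessor's merging and counting step; the modifier list is an assumption, and the paper's assessor feeds these distinct-sentence counts into a probabilistic redundancy model (Downey et al., 2005) to estimate correctness, which is not shown here:

```python
from collections import Counter

NON_ESSENTIAL_MODIFIERS = {"originally", "also", "recently", "first"}  # assumed list

def normalize_relation(relation):
    """Drop non-essential modifiers so near-duplicate relations merge."""
    return " ".join(w for w in relation.lower().split()
                    if w not in NON_ESSENTIAL_MODIFIERS)

def support_counts(extractions):
    """Count the distinct sentences supporting each normalized tuple."""
    counts = Counter()
    for _sentence_id, (e1, rel, e2) in set(extractions):
        counts[(e1, normalize_relation(rel), e2)] += 1
    return counts

extractions = [
    (1, ("Python", "was developed by", "Guido van Rossum")),
    (2, ("Python", "was originally developed by", "Guido van Rossum")),
]
print(support_counts(extractions))
# Counter({('Python', 'was developed by', 'Guido van Rossum'): 2})
```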

Results: Comparison with Traditional IE
Experimental setup:
- 9-million-web-page corpus
- 10 randomly chosen relations were pre-specified, each occurring in >1000 sentences
Performance:
- 33% lower error rate for TextRunner
- Higher precision, with matching recall
  - Precision = fraction of retrieved tuple extractions relevant to the user query
  - Recall = fraction of relevant tuple extractions that are retrieved
- Both systems find nearly identical correct extractions; the improvement comes from better entity identification
- Errors in both are due to poorly delimited noun phrases: truncated arguments or added stray words
- Error rate of 32% for TextRunner vs 47% for KnowItAll

Results: Tuple Selection before Global Analysis
- From the 9 million web pages, TextRunner extracted 60.5 million tuples
- Analysis was restricted to a subset of high-probability tuples (a filter sketch follows below) where:
  - the tuple was assigned a probability of at least 0.8
  - the relation is supported by at least 10 distinct sentences
  - the relation is not found in the top 0.1% of relations
- The filtered set has 11.3 million tuples with 278K distinct relations
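A hypothetical predicate mirroring the filter above; the parameter names and the representation of the "top 0.1% of relations" as a precomputed set are assumptions:

```python
def keep_tuple(probability, relation, sentence_support, top_relations):
    """Keep a tuple only if it satisfies all three subset criteria."""
    return (probability >= 0.8
            and sentence_support.get(relation, 0) >= 10
            and relation not in top_relations)

sentence_support = {"was developed by": 37}
print(keep_tuple(0.91, "was developed by", sentence_support, top_relations=set()))  # True
```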

Results: Global Statistics on Facts Learned
Estimating correctness of facts:
- Randomly sample 400 tuples from the filtered set
- Are the relations well formed? Are the entities reasonable arguments?
- Classify tuples as concrete or abstract, and judge each as true or false
Estimating the number of distinct facts:
- Merge relations that differ only by leading/trailing punctuation or minor leading words, e.g. "are consistent with" = "which is consistent with" (a small sketch follows below)
- A metric to evaluate synonyms is difficult, as relations can be used in several senses
- After clustering the remaining tuples, about a third fall into 'synonymy clusters'
Definitions:
- A relation r is well formed if there is some pair of entities X and Y such that (X, r, Y) is a relation between X and Y
- X and Y are well-formed arguments for relation r if X and Y are types of entities that can form the relation (X, r, Y)
- Concrete: the truth of the tuple is grounded in particular entities
- Abstract: the tuple is underspecified
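A hypothetical illustration of that merging heuristic: strip surrounding punctuation and a few leading function words (the word list is an assumption) before comparing relation strings:

```python
LEADING_FILLER = {"which", "that", "who", "is", "are", "was", "were"}  # assumed list

def canonical(relation):
    """Normalize a relation string for merging near-duplicates."""
    words = relation.strip(" .,;:!?\"'").lower().split()
    while words and words[0] in LEADING_FILLER:
        words.pop(0)
    return " ".join(words)

print(canonical("are consistent with") == canonical(", which is consistent with"))  # True
```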

Overview of Global Facts Learned

Conclusions
- A novel self-supervised, domain-independent paradigm
- Automatic relation extraction at enhanced speed
- TextRunner: an Open IE system for large-scale extraction
- Better computing speed and scalability than the SoTA
Significance:
- Textual representation of actual facts as triples
- Important, but not sufficient, for commonsense reasoning

Shortcomings of TextRunner
Example: "Scientists from many universities are studying string theory, but Massachusetts Institute of Tech is one of the pioneers"
- The use of a linguistic parser to label extractions is not scalable
  - If "Massachusetts Institute of Tech" is not covered by the parser, the learner may not classify the tuple as trustworthy
  - WOE (Wu and Weld): does not need lexicalized features for self-supervised learning; achieves higher precision and recall
- No definition of the form that entities may take
  - "Massachusetts Institute of Tech" may not be recognized as one entity
  - ReVerb (Fader et al.): improves precision and recall with a relation-phrase identifier that uses lexical constraints

Future Work & Discussion
- Integrate scalable methods to detect synonyms
- Learn the types of entities commonly taken by relations, to distinguish between different senses of a relation
- Unify tuples into a graph-like structure to support complex queries
- OpenIE visualization in AllenNLP: https://demo.allennlp.org/open-information-extraction