Understanding Text Meaning in Information Applications


Understanding Text Meaning in Information Applications Ido Dagan Bar-Ilan University, Israel

Outline – a Vision
Why do we need “text understanding”?
Capturing understanding via textual entailment: does one text entail another?
Major challenge: knowledge acquisition
Initial applications
Looking 5 years ahead

Text Understanding
A vision for improving information access.
Common search engines still mostly match query keywords during text processing.
Deeper understanding means considering the meanings of words and the relationships between them.
Relevant applications: question answering, information extraction, semantic search, summarization.
Applied morphology and syntax have clear frameworks; applied semantics does not yet.

Keyword search only matches the original query terms (after discarding stop words) and ignores the relationships between them.

Towards text understanding: Question Answering

Query expansion: “made”, “built”. Explaining errors: “Hyundai” matched as the object; wrong sense chosen for an expansion term.

Information Extraction (IE)
Identify information of pre-determined structure: automatic filling of “forms”.
Example – extract product information:

Company   Product Type   Product Name
Hyundai   Car            Accent, Elantra
Suzuki    Motorcycle     R-350
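The form-filling idea can be sketched with a toy pattern-based extractor. The sentence wording and the single regex pattern below are invented for illustration; real IE systems apply many learned patterns over parsed text rather than one surface regex.

```python
import re

# Hypothetical pattern filling a fixed "form" (company, product type,
# product name) from free text. Pattern and sentences are illustrative.
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+) launches its new (?P<type>\w+), the (?P<name>[A-Z][\w-]+)"
)

def extract_products(text):
    """Return one filled form (a dict) per pattern match."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

text = ("Hyundai launches its new car, the Accent. "
        "Suzuki launches its new motorcycle, the R-350.")
forms = extract_products(text)
# forms[0] == {'company': 'Hyundai', 'type': 'car', 'name': 'Accent'}
```

A single hand-written pattern is brittle, which is precisely why the talk later turns to automatic acquisition of extraction templates.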

How search may benefit from understanding
Query: AIDS treatment
Irrelevant document: Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS. (AP890118-0146, TIPSTER Vol. 1)
Many irrelevant documents mention AIDS and treatments for other diseases.

Relevant document
Query: AIDS treatment
Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No.1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms. (AP890616-0048, TIPSTER Vol. 1)
Relevant documents may mention specific types of treatments for AIDS.

Why is it difficult?
Variability: many different language expressions convey the same meaning.
Ambiguity: one expression can convey several meanings.
Mapping between language and meaning is text understanding in a nutshell, and it arises at all levels of representation.

Variability of Semantic Expression
The same event, expressed many ways:
The Dow Jones Industrial Average closed up 255
Dow ends up
Dow gains 255 points
Stock market hits a record high
Dow climbs 255
A first step towards a broad semantic model of language variation.

How to capture “understanding”?
Question: Who bought Overture?  Expected answer form: X bought Overture
Text: “Overture’s acquisition by Yahoo …”  Hypothesized answer: Yahoo bought Overture
The text entails the hypothesized answer, so the key task is recognizing that one text entails another.
IE – extract buying events: “Y’s acquisition by X” entails “X buy Y”
Search – find acquisitions by Yahoo
Summarization (multi-document) – identify redundancy
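A toy baseline makes the entailment task concrete: judge that a text entails a hypothesis when most hypothesis words appear in the text. The function and threshold below are illustrative only, not the speaker's system; note how it fails on the “acquisition” paraphrase, which is exactly the variability problem entailment knowledge must solve.

```python
def entails(text, hypothesis, threshold=0.8):
    """Toy lexical-coverage baseline for textual entailment: return True
    if at least `threshold` of the hypothesis words (case-insensitive)
    appear in the text. Real RTE systems add syntax, entailment rules,
    and world knowledge on top of such signals."""
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    if not h:
        return True
    return len(h & t) / len(h) >= threshold

entails("Yahoo bought Overture yesterday", "Yahoo bought Overture")   # True
entails("Overture's acquisition by Yahoo", "Yahoo bought Overture")   # False:
# the paraphrase shares almost no words, so lexical overlap misses it
```

The second call shows the baseline's blind spot: recognizing that “Overture's acquisition by Yahoo” entails “Yahoo bought Overture” requires an entailment rule, which motivates the knowledge-acquisition work below.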

Textual Entailment ≈ Human Reading Comprehension
From a children’s English learning book (Sela and Greenberg):
Reference text: “…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …”
Hypothesis (true/false?): The Bermuda Triangle is near the United States.
The common method for testing human reading comprehension is to test entailment capability, either directly or via QA.
The difficulty: variability between question and text.
Knowledge needed: Florida is in the US.

PASCAL Recognizing Textual Entailment (RTE) Challenges
Funded by the FP-6 PASCAL Network of Excellence, 2004–7.
Partners: Bar-Ilan University; ITC-irst and CELCT, Trento; MITRE; Microsoft Research.

Some Examples

1. Text: Reagan attended a ceremony in Washington to commemorate the landings in Normandy.
   Hypothesis: Washington is located in Normandy. (Task: IE; Entailment: False)
2. Text: Google files for its long awaited IPO.
   Hypothesis: Google goes public. (Task: IR; Entailment: True)
3. Text: …: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   Hypothesis: Cardinal Juan Jesus Posadas Ocampo died in 1993. (Task: QA)
4. Text: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   Hypothesis: The SPD is defeated by the opposition parties.

Participation and Impact
Very successful challenges, worldwide:
RTE-1 – 17 groups
RTE-2 – 23 groups (~150 downloads!)
RTE-3 – 25 groups
RTE-4 (2008) – moved to NIST (the TREC organizers)
High interest in the research community: papers, conference keywords, sessions and areas, PhDs, influence on funded projects, and a special issue of the Journal of Natural Language Engineering.

Results

Average Precision   Accuracy      First Author (Group)
80.8%               75.4%         Hickl (LCC)
71.3%               73.8%         Tatu (LCC)
64.4%               63.9%         Zanzotto (Milan & Rome)
62.8%               62.6%         Adams (Dallas)
66.9%               61.6%         Bos (Rome & Leeds)
                    58.1%–60.5%   11 groups
                    52.9%–55.6%   7 groups

Average accuracy: 60%; median: 59%.

What is the main obstacle?
System reports point at a lack of knowledge: rules, paraphrases, lexical relations, etc.
The systems that coped better with these issues appear to have performed best.

Research Directions at Bar-Ilan: Knowledge Acquisition, Inference, Applications
Oren Glickman, Idan Szpektor, Roy Bar-Haim, Maayan Geffet, Moshe Koppel – Bar-Ilan University, Israel
Shachar Mirkin – Hebrew University, Israel
Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano – ITC-irst, Italy
Bonaventura Coppola, Milen Kouylekov – University of Trento and ITC-irst, Italy
In which framework should we work to model semantic phenomena?

Distributional Word Similarity
“Similar words appear in similar contexts” (Harris, 1968)
Similar word meanings ⇒ similar contexts.
Distributional similarity model (the working assumption in reverse): similar context features ⇒ similar word meanings.

Measuring Context Similarity
Compare words by their context features:
country: neighboring (modifier), governor (modifier), visit (obj), parliament (genitive), population (genitive), industry (genitive), …
state: neighboring (modifier), governor (modifier), visit (obj), parliament (genitive), president (genitive), …
“country” and “state” share many context features, so they are judged similar.
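A minimal sketch of the distributional similarity idea: represent each word by counts of its syntactic context features and compare words by cosine similarity. The feature names and counts below are invented for illustration (echoing the slide's features, e.g. "visit_obj" for being the object of "visit"); real systems use large parsed corpora and weighted association measures.

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    if not dot:
        return 0.0
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

# Illustrative context-feature counts for three words.
contexts = {
    "country":  Counter({"neighboring_mod": 3, "governor_mod": 2,
                         "visit_obj": 4, "population_gen": 2}),
    "state":    Counter({"neighboring_mod": 2, "governor_mod": 3,
                         "visit_obj": 3, "parliament_gen": 1}),
    "industry": Counter({"leading_mod": 3, "growth_gen": 2}),
}

sim_cs = cosine(contexts["country"], contexts["state"])
sim_ci = cosine(contexts["country"], contexts["industry"])
# "country" and "state" share contexts, so sim_cs is high;
# "industry" shares none here, so sim_ci is 0.
```

The ranking (country closer to state than to industry) is what such a model uses to propose entailment candidates like the “company” list on the next slide.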

Incorporate Indicative Patterns

Acquisition Example
Top-ranked entailments for “company”: firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory …

Learning Entailment Rules
Q: What reduces the risk of heart attacks?
Hypothesis: Aspirin reduces the risk of heart attacks.
Text: Aspirin prevents heart attacks.
Entailment rule: X prevent Y ⇨ X reduce risk of Y
Each side of the rule is a template – a parsed text fragment with variables – and the rule has a direction.
A large collection of entailment rules, on the order of 100,000, is needed to provide good coverage of language variability.
⇒ We need a large knowledge base of entailment rules.
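Applying such a rule base can be sketched as follows. Templates here are flat strings with X/Y slots, a deliberate simplification of the talk's parsed dependency templates, and the rule list is a two-rule toy example.

```python
import re

# Toy directional rule base: each entry reads "LHS entails RHS".
RULES = [
    ("X prevent Y", "X reduce risk of Y"),
    ("X prevent Y", "X protect against Y"),
]

def template_to_regex(template):
    """Turn 'X <connector> Y' into a capturing regex for the two slots."""
    connector = template[2:-2]  # strip the leading 'X ' and trailing ' Y'
    return re.compile(r"(.+) " + re.escape(connector) + r" (.+)")

def apply_rules(sentence):
    """Return every hypothesis the rule base derives from the sentence."""
    derived = []
    for lhs, rhs in RULES:
        m = template_to_regex(lhs).fullmatch(sentence)
        if m:
            x, y = m.group(1), m.group(2)
            derived.append(rhs.replace("X", x, 1).replace("Y", y, 1))
    return derived

apply_rules("Aspirin prevent Heart Attacks")
# -> ["Aspirin reduce risk of Heart Attacks",
#     "Aspirin protect against Heart Attacks"]
```

With the first derived hypothesis in hand, matching the question “What reduces the risk of heart attacks?” becomes a direct comparison rather than a paraphrase problem.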

TEASE – Algorithm Flow
Input template: X-subj – accuse – obj-Y
1. Sample a web corpus for the input template: “Paula Jones accused Clinton…”, “Sanhedrin accused St. Paul…”, …
2. Anchor Set Extraction (ASE): extract anchor sets such as {Paula Jones-subj; Clinton-obj}, {Sanhedrin-subj; St. Paul-obj}, … ASE removes the supervision required by previous web-based methods.
3. Sample a corpus for the anchor sets: “Paula Jones called Clinton indictable…”, “St. Paul defended before the Sanhedrin…”
4. Template Extraction (TE): learn templates such as “X call Y indictable”, “Y defend before X”, …
5. Iterate.
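The two-stage loop can be sketched over a toy in-memory “corpus”; this is a drastic simplification of TEASE, which samples the web and works with parsed templates rather than raw word strings.

```python
# Toy corpus: each sentence is "X <connector> Y" in lowercase words.
CORPUS = [
    "aspirin prevent heart attacks",
    "aspirin reduce risk of heart attacks",
    "sunscreens prevent sunburn",
    "sunscreens protect against sunburn",
]

def anchor_sets(connector):
    """ASE step: collect (X, Y) anchor pairs instantiating the input template."""
    pairs = set()
    marker = f" {connector} "
    for sent in CORPUS:
        if marker in sent:
            x, y = sent.split(marker, 1)
            pairs.add((x, y))
    return pairs

def extract_templates(pairs):
    """TE step: find the connectors that link a known anchor pair."""
    templates = set()
    for sent in CORPUS:
        for x, y in pairs:
            if sent.startswith(x + " ") and sent.endswith(" " + y):
                templates.add(sent[len(x) + 1 : -len(y) - 1])
    return templates

pairs = anchor_sets("prevent")
learned = extract_templates(pairs) - {"prevent"}
# learned == {"reduce risk of", "protect against"}
```

Iterating, as TEASE does, would feed the newly learned templates back into the anchor-set step to widen coverage further.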

Sample of Extracted Anchor Sets for “X prevent Y”
X=‘sunscreens’, Y=‘sunburn’; X=‘sunscreens’, Y=‘skin cancer’; X=‘vitamin e’, Y=‘heart disease’; X=‘aspirin’, Y=‘heart attack’; X=‘vaccine candidate’, Y=‘infection’; X=‘universal precautions’, Y=‘HIV’; X=‘safety device’, Y=‘fatal injuries’; X=‘hepa filtration’, Y=‘contaminants’; X=‘low cloud cover’, Y=‘measurements’; X=‘gene therapy’, Y=‘blindness’; X=‘cooperation’, Y=‘terrorism’; X=‘safety valve’, Y=‘leakage’; X=‘safe sex’, Y=‘cervical cancer’; X=‘safety belts’, Y=‘fatalities’; X=‘security fencing’, Y=‘intruders’; X=‘soy protein’, Y=‘bone loss’; X=‘MWI’, Y=‘pollution’; X=‘vitamin C’, Y=‘colds’
The large number of good anchor sets enables learning many different entailment rules.

Sample of Extracted Templates for “X prevent Y”
X reduce Y; X protect against Y; X eliminate Y; X stop Y; X avoid Y; X for prevention of Y; X provide protection against Y; X combat Y; X ward Y; X lower risk of Y; X be barrier against Y; X fight Y; X reduce Y risk; X decrease the risk of Y; relationship between X and Y; X guard against Y; X be cure for Y; X treat Y; X in war on Y; X in the struggle against Y; X a day keeps Y away; X eliminate the possibility of Y; X cut risk Y; X inhibit Y
This richness of learned rules can directly benefit systems such as question answering.

Experiment and Evaluation
48 randomly chosen input verbs; 1392 templates extracted and judged by humans.
Encouraging results: average yield per verb – 29 correct templates; average precision per verb – 45.3%.
Yield is reported rather than recall because the full set of correct templates is unknown; the precision is comparable to other unsupervised systems.
Future work: improve precision.

Syntactic Variability Phenomena
Template: X activate Y

Example                                 Phenomenon
Y is activated by X                     Passive form
X activates its companion, Y            Apposition
X activates Z and Y                     Conjunction
X activates two proteins: Y and Z       Set
X, which activates Y                    Relative clause
X binds and activates Y                 Coordination
X activates a fragment of Y             Transparent head
X is a kinase, though it activates Y    Co-reference
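One of these phenomena can be illustrated with a toy normalizer that rewrites a passive clause back to the canonical template form. The regex and the crude verb morphology (strip "-ed", append "-e") are illustrative assumptions; real systems normalize over parse trees with proper lemmatization.

```python
import re

# Toy passive-to-active rewriter for clauses like "Y is <verb>ed by X".
PASSIVE = re.compile(r"(?P<Y>.+) is (?P<verb>\w+)ed by (?P<X>.+)")

def normalize_passive(sentence):
    """Rewrite 'Y is activated by X' to 'X activate Y'; otherwise
    return the sentence unchanged. Morphology here is deliberately
    naive: it only handles verbs whose base form ends in 'e'."""
    m = PASSIVE.fullmatch(sentence)
    if not m:
        return sentence
    return f"{m.group('X')} {m.group('verb')}e {m.group('Y')}"

normalize_passive("ERK is activated by RAF")  # -> "RAF activate ERK"
```

After normalization, the canonical "X activate Y" template matches directly, so one entailment rule covers both the active and passive surface forms.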

Takeaway
Promising potential for creating huge entailment knowledge bases – millions of rules.
Speculation: is a public, community effort for knowledge acquisition possible? (Human Genome Project analogy.)

Initial Applications: Relation Extraction, Semantic Search

Dataset
Recognizing interactions between annotated protein pairs (Bunescu 2005): 200 MEDLINE abstracts with a gold-standard dataset of protein pairs.
Input template: X interact with Y

Manual Analysis – Results
93% of interacting protein pairs can be identified with lexical-syntactic templates.
Number of templates vs. recall (within that 93%):

R (%)   # templates
10      2
20      4
30      6
40      11
50      21
60      39
70      73
80      107
90      141
100     175

TEASE Output for “X interact with Y”
A sample of correct templates learned:
X binding to Y; X bind to Y; X Y interaction; X activate Y; X attach to Y; X stimulate Y; X interaction with Y; X couple to Y; X trap Y; interaction between X and Y; X recruit Y; X become trapped in Y; X associate with Y; X Y complex; X be linked to Y; X recognize Y; X target Y; X block Y

TEASE Algorithm – Potential Recall on Training Set

Experiment                  Recall
input only                  39%
input + iterative           49%
input + iterative + morph   63%

Iterative: taking the top 5 ranked templates as additional input. Morph: recognizing morphological derivations.

Emphasize importance for small collections.

Integrating IE and Search (with IBM Research Haifa)

Optimistic Conclusions
Good prospects for better levels of text understanding, enabling more sophisticated information access.
Textual entailment is an appealing framework: it boosts research on text understanding and holds potential for vast knowledge acquisition.
Thank you!