1
Understanding Text Meaning in Information Applications
Ido Dagan Bar-Ilan University, Israel
2
Outline – a Vision

Why do we need "text understanding"?
Capturing understanding via textual entailment: does one text entail another?
Major challenge – knowledge acquisition
Initial applications
Looking 5 years ahead
3
Text Understanding Vision for improving information access
Common search engines still mostly match query keywords. Deeper understanding would consider the meanings of words and the relationships between them. This is relevant for applications such as question answering, information extraction, semantic search, and summarization. Applied morphology and syntax have relatively clear frameworks; semantics does not.
4
Only matches the original keywords (stop words are thrown away)
Ignores relationships
6
Towards text understanding: Question Answering
7
Query expansion: made, built. Typical errors: Hyundai matched as object; wrong sense chosen for the expansion.
8
Information Extraction (IE)
Identify information of a pre-determined structure – automatic filling of "forms". Example – extract product information:

Company   Product Type   Product Name
Hyundai   Car            Accent
Hyundai   Car            Elantra
Suzuki    Motorcycle     R-350
9
Search may benefit from understanding
Query: AIDS treatment

Irrelevant document: "Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS." (AP, TIPSTER Vol. 1)

Many irrelevant documents mention AIDS and treatments for other diseases.
10
Relevant document – query: AIDS treatment

"Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No. 1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms." (AP, TIPSTER Vol. 1)

Relevant documents may mention specific types of treatments for AIDS.
14
Why is it difficult? Mapping between language and meaning involves variability (many expressions for one meaning) and ambiguity (one expression with several meanings). Text understanding in a nutshell: levels of representation.
15
Variability of Semantic Expression
The Dow Jones Industrial Average closed up 255
Dow ends up
Dow gains 255 points
Stock market hits a record high
Dow climbs 255

A first step towards a broad semantic model of language variation.
16
How to capture “understanding”?
Question: Who bought Overture?  >>  Expected answer form (hypothesis): X bought Overture
Text: "Overture's acquisition by Yahoo …" entails the hypothesized answer "Yahoo bought Overture".
The key task is recognizing that one text entails another. The same need recurs across applications:
IE – extract buying events: "Y's acquisition by X" ⇨ "X buy Y"
Search – find acquisitions by Yahoo
Summarization (multi-document) – identify redundancy
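To make this framing concrete, here is a toy sketch of treating QA as entailment. All names and data are invented for illustration, and the entails() stub is a naive word-overlap stand-in, not a real entailment engine:

```python
# Toy illustration of QA as entailment: a "Who ...?" question becomes a
# hypothesis template with a variable X; a candidate answer X is accepted
# if some text entails the instantiated hypothesis.

def question_to_template(question):
    # "Who bought Overture?" -> "X bought Overture"
    assert question.lower().startswith("who ")
    return "X " + question[4:].rstrip("?").strip()

def entails(text, hypothesis):
    # Stub: treat entailment as containment of all hypothesis words.
    # A real system would use parsing, entailment rules, and knowledge.
    t = set(text.lower().split())
    return all(w.lower() in t for w in hypothesis.split())

template = question_to_template("Who bought Overture?")
text = "Yahoo bought Overture for 1.63 billion dollars."
for candidate in ["Yahoo", "Google"]:
    hyp = template.replace("X", candidate)
    print(candidate, "->", entails(text, hyp))
# Yahoo -> True, Google -> False
```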
17
Textual Entailment ≈ Human Reading Comprehension
From a children's English learning book (Sela and Greenberg):
Reference text: "…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …"
Hypothesis (true/false?): The Bermuda Triangle is near the United States
The common method for testing human reading comprehension is testing entailment capability – either directly or via QA. The difficulty: variability between question and text. Knowledge needed: Florida is in the US.
18
PASCAL Recognizing Textual Entailment (RTE) Challenges (FP-6 funded PASCAL NoE, 2004–7)
Bar-Ilan University ITC-irst and CELCT, Trento MITRE Microsoft Research
19
Some examples:

1. TEXT: Reagan attended a ceremony in Washington to commemorate the landings in Normandy.
   HYPOTHESIS: Washington is located in Normandy. TASK: IE. ENTAILMENT: False.
2. TEXT: Google files for its long awaited IPO.
   HYPOTHESIS: Google goes public. TASK: IR. ENTAILMENT: True.
3. TEXT: …a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993. TASK: QA.
4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   HYPOTHESIS: The SPD is defeated by the opposition parties.
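A word-overlap baseline of the kind often used as a reference point in these challenges (this sketch is illustrative, not any participant's system) shows on the first two pairs above why lexical matching alone cannot decide entailment:

```python
# Lexical-overlap RTE baseline: score = fraction of hypothesis content
# words that also appear in the text.
STOP = {"a", "an", "the", "is", "are", "by", "in", "for", "of", "to", "its"}

def overlap(text, hypothesis):
    h = [w.strip(".,") for w in hypothesis.lower().split()]
    h = [w for w in h if w not in STOP]
    t = {w.strip(".,") for w in text.lower().split()}
    return sum(w in t for w in h) / len(h)

pairs = [
    ("Reagan attended a ceremony in Washington to commemorate the "
     "landings in Normandy.",
     "Washington is located in Normandy.", False),
    ("Google files for its long awaited IPO.", "Google goes public.", True),
]
for text, hyp, gold in pairs:
    print(round(overlap(text, hyp), 2), "gold:", gold)
# 0.67 gold: False
# 0.33 gold: True
# Higher overlap on the non-entailing pair: word matching is not enough.
```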
20
Participation and Impact
Very successful challenges, worldwide: RTE-1 – 17 groups; RTE-2 – 23 groups (~150 dataset downloads); RTE-3 – 25 groups; RTE-4 (2008) – moved to NIST (the TREC organizers). High interest in the research community: papers, conference keywords, sessions and areas, PhDs, influence on funded projects, and a special issue of the Journal of Natural Language Engineering.
21
Results:

Average Precision   Accuracy       First Author (Group)
80.8%               75.4%          Hickl (LCC)
71.3%               73.8%          Tatu (LCC)
64.4%               63.9%          Zanzotto (Milan & Rome)
62.8%               62.6%          Adams (Dallas)
66.9%               61.6%          Bos (Rome & Leeds)
–                   58.1%–60.5%    11 groups
–                   52.9%–55.6%    7 groups

Average accuracy: 60%; median: 59%.
22
What is the main obstacle?
System reports point to a lack of knowledge – entailment rules, paraphrases, lexical relations, etc. – as the main obstacle. The systems that coped better with these issues performed best.
23
Research Directions at Bar-Ilan: Knowledge Acquisition, Inference, Applications
Oren Glickman, Idan Szpektor, Roy Bar Haim, Maayan Geffet, Moshe Koppel – Bar-Ilan University; Shachar Mirkin – Hebrew University, Israel; Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano – ITC-irst, Italy; Bonaventura Coppola, Milen Kouylekov – University of Trento and ITC-irst, Italy. In which framework should we work to model semantic phenomena?
24
Distributional Word Similarity
"Similar words appear in similar contexts" (Harris, 1968). Similar word meanings ↔ similar contexts. The distributional similarity model operationalizes this: similar word meanings ↔ similar context features.
25
Measuring Context Similarity
Compare the context-feature vectors of two words (features are syntactic relations to co-occurring words):

country: neighboring (modifier), governor (modifier), visit (object), parliament (genitive), industry (genitive), population (genitive), …
state: neighboring (modifier), governor (modifier), visit (object), parliament (genitive), president (genitive), …

The more features the two vectors share, the more similar the word meanings.
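A minimal sketch of how such context-feature vectors could be compared. The feature names echo the slide, but all counts are invented; real models weight features (e.g., by pointwise mutual information) and often use measures such as Lin's similarity rather than plain cosine:

```python
from math import sqrt

# Toy context-feature vectors (invented counts). A feature such as
# "neighboring_mod" means the word was modified by "neighboring".
vectors = {
    "country": {"neighboring_mod": 5, "governor_mod": 3, "visit_obj": 4,
                "parliament_gen": 2, "population_gen": 3},
    "state":   {"neighboring_mod": 4, "governor_mod": 5, "visit_obj": 3,
                "parliament_gen": 1, "president_gen": 2},
    "industry": {"heavy_mod": 4, "output_gen": 3, "neighboring_mod": 1},
}

def cosine(u, v):
    # Cosine over sparse count vectors: dot product of shared features
    # divided by the product of vector norms.
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm = sqrt(sum(x * x for x in u.values())) * \
           sqrt(sum(x * x for x in v.values()))
    return dot / norm

print(round(cosine(vectors["country"], vectors["state"]), 2))     # ~0.83, high
print(round(cosine(vectors["country"], vectors["industry"]), 2))  # ~0.12, low
```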
26
Incorporate Indicative Patterns
27
Acquisition Example Top-ranked entailments for “company”:
firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory …
28
Learning Entailment Rules
Q: What reduces the risk of heart attacks?
Hypothesis: Aspirin reduces the risk of heart attacks
Text: Aspirin prevents heart attacks
Entailment rule: X prevent Y ⇨ X reduce risk of Y

Each side of the rule is a template – a parsed text fragment with variables – and the rule has a direction. A large knowledge base of entailment rules, on the order of 100,000, is needed to give good coverage of language variability.
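A minimal sketch of applying such a rule. Real templates are matched against parse trees; here plain regular expressions stand in, and the rule structure is just an illustrative representation:

```python
import re

# An entailment rule is a directed pair of templates with shared
# variables: matching the left-hand side licenses inferring the
# instantiated right-hand side.
RULE = {"lhs": r"(?P<X>\w+) prevents? (?P<Y>[\w ]+)",
        "rhs": "{X} reduces the risk of {Y}"}

def apply_rule(text, rule):
    # Match the lhs template, bind X and Y, instantiate the rhs.
    m = re.search(rule["lhs"], text)
    if m:
        return rule["rhs"].format(**m.groupdict())
    return None

print(apply_rule("Aspirin prevents heart attacks", RULE))
# -> "Aspirin reduces the risk of heart attacks"
```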
29
TEASE – Algorithm Flow

TEASE learns entailment rules from the Web by alternating two steps, starting from a lexicon of input templates:

1. Sample a corpus for the input template, e.g. X(subj)-accuse-Y(obj): "Paula Jones accused Clinton…", "Sanhedrin accused St.Paul…", …
2. Anchor Set Extraction (ASE) extracts anchor sets: {Paula Jones(subj); Clinton(obj)}, {Sanhedrin(subj); St.Paul(obj)}, … ASE solves the supervision problem of earlier web-based methods.
3. Sample a corpus for the anchor sets: "Paula Jones called Clinton indictable…", "St.Paul defended before the Sanhedrin…", …
4. Template Extraction (TE) extracts new templates: "X call Y indictable", "Y defend before X", … – and the process iterates with these templates as input.
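The control flow, reduced to a toy: a four-sentence stub corpus (the slide's own examples) replaces Web search, and string matching replaces parse-based extraction, so this illustrates only the loop structure, not the real algorithm:

```python
# Schematic TEASE loop over a stub corpus; real TEASE samples the Web
# and extracts anchors and templates from parse trees.
CORPUS = [
    "Paula Jones accused Clinton",
    "Paula Jones called Clinton indictable",
    "Sanhedrin accused St.Paul",
    "St.Paul defended before the Sanhedrin",
]

def extract_anchor_sets(verb):
    # ASE: collect (X, Y) pairs instantiating the input template X-verb-Y.
    pairs = []
    for s in CORPUS:
        if f" {verb} " in s:
            x, y = s.split(f" {verb} ")
            pairs.append((x, y))
    return pairs

def extract_templates(anchor_sets, verb):
    # TE: in sentences containing both anchors (excluding the input
    # template itself), abstract the anchors into variables.
    templates = set()
    for x, y in anchor_sets:
        for s in CORPUS:
            if x in s and y in s and f" {verb} " not in s:
                templates.add(s.replace(x, "X").replace(y, "Y"))
    return templates

anchors = extract_anchor_sets("accused")
print(anchors)  # [('Paula Jones', 'Clinton'), ('Sanhedrin', 'St.Paul')]
print(extract_templates(anchors, "accused"))
# {'X called Y indictable', 'Y defended before the X'}
# A full run would feed the new templates back in and iterate.
```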
30
Sample of Extracted Anchor-Sets for X prevent Y
X=‘sunscreens’, Y=‘sunburn’ X=‘sunscreens’, Y=‘skin cancer’ X=‘vitamin e’, Y=‘heart disease’ X=‘aspirin’, Y=‘heart attack’ X=‘vaccine candidate’, Y=‘infection’ X=‘universal precautions’, Y=‘HIV’ X=‘safety device’, Y=‘fatal injuries’ X=‘hepa filtration’, Y=‘contaminants’ X=‘low cloud cover’, Y=‘measurements’ X=‘gene therapy’, Y=‘blindness’ X=‘cooperation’, Y=‘terrorism’ X=‘safety valve’, Y=‘leakage’ X=‘safe sex’, Y=‘cervical cancer’ X=‘safety belts’, Y=‘fatalities’ X=‘security fencing’, Y=‘intruders’ X=‘soy protein’, Y=‘bone loss’ X=‘MWI’, Y=‘pollution’ X=‘vitamin C’, Y=‘colds’

The large number of good anchor sets enables learning many different entailment rules.
31
Sample of Extracted Templates for X prevent Y
X reduce Y X protect against Y X eliminate Y X stop Y X avoid Y X for prevention of Y X provide protection against Y X combat Y X ward Y X lower risk of Y X be barrier against Y X fight Y X reduce Y risk X decrease the risk of Y relationship between X and Y X guard against Y X be cure for Y X treat Y X in war on Y X in the struggle against Y X a day keeps Y away X eliminate the possibility of Y X cut risk Y X inhibit Y

Note the richness of the learned templates and their potential impact on applications such as question answering.
32
Experiment and Evaluation
48 randomly chosen input verbs; 1,392 templates extracted and judged by human annotators. Encouraging results:

Average yield per verb: 29 correct templates
Average precision per verb: 45.3%

Yield is reported rather than recall because the full set of correct templates is unknown; the precision is comparable to other unsupervised systems. Future work: improve precision.
33
Syntactic Variability Phenomena
Template: X activate Y

Example                                  Phenomenon
Y is activated by X                      Passive form
X activates its companion, Y             Apposition
X activates Z and Y                      Conjunction
X activates two proteins: Y and Z        Set
X, which activates Y                     Relative clause
X binds and activates Y                  Coordination
X activates a fragment of Y              Transparent head
X is a kinase, though it activates Y     Co-reference
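To illustrate just the first variation in the table, here is a toy matcher recognizing the relation under active and passive surface forms. Real systems normalize these phenomena over dependency parses rather than with regexes, and the protein names are merely illustrative:

```python
import re

# Toy recognition of "X activate Y" under two surface variations.
PATTERNS = [
    re.compile(r"(?P<X>\w+) activates (?P<Y>\w+)"),        # active form
    re.compile(r"(?P<Y>\w+) is activated by (?P<X>\w+)"),  # passive form
]

def find_activation(sentence):
    # Return the normalized (activator, activated) pair, if any.
    for p in PATTERNS:
        m = p.search(sentence)
        if m:
            return m.group("X"), m.group("Y")
    return None

print(find_activation("RasGRP activates Ras"))        # ('RasGRP', 'Ras')
print(find_activation("Ras is activated by RasGRP"))  # ('RasGRP', 'Ras')
```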
34
Takeaway: promising potential for creating huge entailment knowledge bases – millions of rules. Speculation: could there be a public, community effort for knowledge acquisition, on the analogy of the Human Genome Project?
35
Initial Applications: Relation Extraction, Semantic Search
36
Dataset: recognizing interactions between annotated protein pairs (Bunescu 2005) – 200 Medline abstracts, with a gold-standard set of protein pairs. Input template: X interact with Y
37
Manual Analysis - Results
93% of the interacting protein pairs can be identified with lexical-syntactic templates. Number of templates vs. recall (within that 93%):

# templates   R (%)
2             10
4             20
6             30
11            40
21            50
39            60
73            70
107           80
141           90
175           100
38
TEASE Output for X interact with Y
A sample of correct templates learned: X binding to Y X bind to Y X Y interaction X activate Y X attach to Y X stimulate Y X interaction with Y X couple to Y X trap Y interaction between X and Y X recruit Y X become trapped in Y X associate with Y X Y complex X be linked to Y X recognize Y X target Y X block Y
39
TEASE algorithm - Potential Recall on Training Set
Experiment                      Potential recall
input only                      39%
input + iterative               49%
input + iterative + morph       63%

Iterative – feeding the top 5 ranked learned templates back as input. Morph – recognizing morphological derivations.
42
This is especially important for small collections.
43
Integrating IE and Search (with IBM Research Haifa)
44
Optimistic Conclusions
Good prospects for better levels of text understanding, enabling more sophisticated information access. Textual entailment is an appealing framework: it boosts research on text understanding and has the potential for vast knowledge acquisition. Thank you!