Information extraction from text Part 4

2 In this part
- IE from (semi-)structured text
- other applications
  - multilingual IE
  - question answering systems
  - news event tracking and detection
- closing of the course

3 WHISK
- Soderland: Learning Information Extraction Rules for Semi-structured and Free Text, Machine Learning, 1999

4 Semi-structured text (online rental ad)
Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. 3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. (206) (This ad last ran on 08/03/97.)

5 Two case frames extracted:
- Rental:
  - Neighborhood: Capitol Hill
  - Bedrooms: 1
  - Price: 675
- Rental:
  - Neighborhood: Capitol Hill
  - Bedrooms: 3
  - Price: 995

6 Semi-structured text
- The sample text (rental ad) is neither grammatical nor rigidly structured
  - we cannot use a natural language parser as we did before
  - simple rules that might work for structured text do not work here

7 Rule representation
- WHISK rules are based on a form of regular expression patterns that identify
  - the context of relevant phrases
  - the exact delimiters of the phrases

8 Rule for number of bedrooms and associated price
- ID:: 1
- Pattern:: *( Digit ) 'BR' * '$' ( Number )
- Output:: Rental {Bedrooms $1}{Price $2}
  - * : skip any number of characters until the next occurrence of the following term in the pattern (here e.g. the next digit)
  - single quotes: literal -> exact (case-insensitive) match
  - Digit: a single digit; Number: a possibly multi-digit number

9 Rule for number of bedrooms and associated price
- parentheses (unless within single quotes) indicate a phrase to be extracted
  - the phrase within the first set of parentheses (here: Digit) is bound to the variable $1 in the output portion of the rule
- if the entire pattern matches, a case frame is created with slots filled as labeled in the output portion
- if part of the input remains, the rule is re-applied, starting from the last character matched before

10 Two case frames extracted:
- Rental:
  - Bedrooms: 1
  - Price: 675
- Rental:
  - Bedrooms: 3
  - Price: 995
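
A minimal sketch (not WHISK itself) of how rule 1 behaves when applied and re-applied to the ad, approximating the pattern with a Python regular expression; the regex translation and the variable names are illustrative assumptions:

import re

# Rough regex approximation of rule 1: "*" becomes a non-greedy skip,
# Digit a single digit, Number a multi-digit number.
RULE_1 = re.compile(r"(\d)\s*br.*?\$\s*(\d+)", re.IGNORECASE | re.DOTALL)

ad = ("Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. "
      "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995.")

# finditer re-applies the rule after each match, giving one case frame per match
frames = [{"Bedrooms": m.group(1), "Price": m.group(2)} for m in RULE_1.finditer(ad)]
print(frames)  # [{'Bedrooms': '1', 'Price': '675'}, {'Bedrooms': '3', 'Price': '995'}]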

11 Disjunction
- The user may define a semantic class
  - a set of terms that are considered to be equivalent
  - Digit and Number are special semantic classes (built into WHISK)
  - e.g. Bdrm = (brs|br|bds|bdrm|bd|bedrooms|bedroom|bed)
  - a set does not have to be complete or perfectly correct: it may still help WHISK to generalize rules

12 Rule for neighborhood, number of bedrooms and associated price
- ID:: 2
- Pattern:: *( Nghbr ) *( Digit ) ' ' Bdrm * '$' ( Number )
- Output:: Rental {Neighborhood $1}{Bedrooms $2}{Price $3}
  - assuming the semantic classes Nghbr (neighborhood names for the city) and Bdrm
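
A companion sketch for rule 2, with the semantic classes rendered as regex alternations; the contents of Nghbr, the variable names and the regex rendering are illustrative assumptions, not part of WHISK:

import re

# Semantic classes as alternations; Bdrm is taken from slide 11, Nghbr is a
# made-up, deliberately incomplete list of neighborhood names.
BDRM = r"(?:brs|br|bds|bdrm|bd|bedrooms|bedroom|bed)"
NGHBR = r"(?:Capitol Hill|N\. Hill)"

# Rule 2:  *( Nghbr ) *( Digit ) ' ' Bdrm * '$' ( Number )
RULE_2 = re.compile(rf"({NGHBR}).*?(\d) {BDRM}.*?\$(\d+)", re.IGNORECASE | re.DOTALL)

ad = "Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. 3 BR ..."
m = RULE_2.search(ad)
print({"Neighborhood": m.group(1), "Bedrooms": m.group(2), "Price": m.group(3)})
# {'Neighborhood': 'Capitol Hill', 'Bedrooms': '1', 'Price': '675'}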

13 Algorithm for learning rules automatically
- A supervised learning algorithm: a set of hand-tagged training instances is needed
- the tagging process is interleaved with the learning stages
- in each iteration, WHISK
  - presents the user with a set of instances to tag
  - learns a set of rules from the expanded training set

14 Creating hand-tagged training instances
- What constitutes an instance, and what preprocessing is done, depends on the domain
  - an entire text may constitute an instance
  - a text may be broken into multiple instances based on HTML tags or other regular expressions
  - semantic tags may be added in preprocessing, etc.

15 Creating hand-tagged training instances
- The user adds a tag for each case frame to be extracted from the instance
  - if the case frame has multiple slots, the tag will be multi-slot
  - some of the "tagged" instances will have no tags, if the user has determined that the instance contains no relevant information

16 Tagged instance
Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. 3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. (206) (This ad last ran on 08/03/97.)
Rental {Neighborhood Capitol Hill}{Bedrooms 1}{Price 675}
Rental {Neighborhood Capitol Hill}{Bedrooms 3}{Price 995}

17 Creating a rule from a seed instance
- instances are either untagged instances (the Reservoir) or training instances
- if a rule is applied successfully to an instance, the instance is considered to be covered by the rule
- if the extracted phrases exactly match a tag associated with the instance, the extraction is considered correct, otherwise an error

18 Algorithm

WHISK(Reservoir)
  RuleSet = NULL
  Training = NULL
  Repeat at user's request
    Select a set of NewInst from Reservoir
    (User tags the NewInst)
    Add NewInst to Training
    Discard rules with errors on NewInst
    For each Inst in Training
      For each Tag of Inst
        If Tag is not covered by RuleSet
          Rule = GROW_RULE(Inst, Tag, Training)

19 Algorithm
- At each iteration, the user tags a set of instances from the Reservoir of untagged instances
- some of these new training instances may be counterexamples to existing rules
  - in that case the rule is discarded, so that a new rule may be grown

20 Algorithm
- WHISK then selects an instance-tag pair for which the slot fills of the tag are not extracted from the instance by any rule in RuleSet
  - this instance-tag pair becomes a seed, and a new rule is grown to cover the seed
- WHISK induces rules top-down
  - first finding the most general rule that covers the seed
  - then extending the rule by adding terms one at a time

21 Algorithm
- The metric used to select a new term is the Laplacian expected error of the rule:
  - Laplacian = (e+1)/(n+1)
- n: the number of extractions made on the training set
- e: the number of errors among those extractions

22 Algorithm
- This metric estimates the true error of a rule in a way that is sensitive to the amount of support the rule has in the training set
- for alternative rules with the same number of training errors, the metric is lowest for the rule with the highest coverage
- a rule that covers a single tag with no errors gets an expected error rate of 0.5
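
The formula from slide 21 as a one-liner, with the numbers from slide 22 worked out (the example counts are illustrative):

# Laplacian expected error, as defined on slide 21
def laplacian(n_extractions, n_errors):
    return (n_errors + 1) / (n_extractions + 1)

print(laplacian(1, 0))   # 0.5 -> a rule covering a single tag with no errors
print(laplacian(9, 0))   # 0.1 -> same error count, higher coverage -> lower score
print(laplacian(9, 1))   # 0.2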

23 Anchoring the extraction slots
- WHISK grows a rule from a seed by starting with an empty rule and anchoring the extraction boundaries one slot at a time
- to anchor an extraction, WHISK considers
  - a rule with terms added just within the extraction boundary (Base_1), and
  - a rule with terms added just outside the extraction (Base_2)

24 Tagged instance
Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. 3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. (206) (This ad last ran on 08/03/97.)
Rental {Neighborhood Capitol Hill}{Bedrooms 1}{Price 675}
Rental {Neighborhood Capitol Hill}{Bedrooms 3}{Price 995}

25 Anchoring the extraction slots
- Anchoring slot 1:
  - Base_1: * ( Nghbr )
  - Base_2: ( * ) ' -'
- the semantic class Nghbr matches the first and only term of slot 1 -> Base_1
- the terms just outside the first slot are the start of the sentence and the space and hyphen -> Base_2
- assume Base_1 applies correctly to more instances in the training set -> it becomes the anchor for slot 1

26 Two alternatives for extending the rule to cover slot 2
- Base_1: * ( Nghbr ) * ( Digit )
- Base_2: * ( Nghbr ) * '- ' ( * ) ' br'
- each operates correctly on the seed instance
- Base_1 looks for the first digit after the first neighborhood
- Base_2 looks for the first '- ' after the first neighborhood and extracts all characters up to the next ' br' as the number of bedrooms

27 Anchoring slot 3
- Base_1: * ( Nghbr ) * ( Digit ) * ( Number )
- Base_2: * ( Nghbr ) * ( Digit ) * '$' ( * ) '.'
- assume Base_1 was chosen for slot 2
- the process is continued for slot 3
- the final anchored rule operates correctly on the seed instance, but may make some extraction errors on other training instances
- WHISK therefore continues adding terms (see the sketch below)
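
To make this concrete: the fully anchored but otherwise unconstrained rule already works on the seed, although it is too loose in general. A minimal regex rendering (the regex translation and the Nghbr list are assumptions):

import re

NGHBR = r"(?:Capitol Hill|N\. Hill)"   # illustrative neighborhood class
# "* ( Nghbr ) * ( Digit ) * ( Number )" rendered as a regex
anchored = re.compile(rf"({NGHBR}).*?(\d).*?(\d+)")

seed = ("Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. "
        "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995.")
print(anchored.search(seed).groups())  # ('Capitol Hill', '1', '675') -- correct on the seed
# On other ads the unconstrained Number slot can grab the wrong token, which is
# why WHISK keeps adding terms (e.g. the '$' literal) until the rule is reliable.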

28 Adding terms to a proposed rule
- WHISK extends a rule
  - by considering each term that could be added
  - and testing the performance of each proposed extension on the hand-tagged training set
- the new rule must apply to the seed instance -> only terms from this instance need to be considered in growing the rule

29 Adding terms to a proposed rule
- If a term from the instance belongs to a user-defined semantic class, WHISK tries adding either the term itself or its semantic class to the rule
- each word, number, punctuation mark, or HTML tag in the instance is considered a term

30 GROW_RULE

GROW_RULE(Inst, Tag, Training)
  Rule = empty rule (terms replaced by wildcards)
  For i = 1 to number of slots in Tag
    ANCHOR(Rule, Inst, Tag, Training, i)
  Do until Rule makes no errors on Training
        or no improvement in Laplacian
    EXTEND_RULE(Rule, Inst, Tag, Training)

31 ANCHOR

ANCHOR(Rule, Inst, Tag, Training, i)
  Base_1 = Rule + terms just within extraction i
  Test first i slots of Base_1 on Training
  While Base_1 does not cover Tag
    EXTEND_RULE(Base_1, Inst, Tag, Training)
  Base_2 = Rule + terms just outside extraction i
  Test first i slots of Base_2 on Training
  While Base_2 does not cover Tag
    EXTEND_RULE(Base_2, Inst, Tag, Training)
  Rule = Base_1
  If Base_2 covers more of Training than Base_1
    Rule = Base_2

32 Extending a rule
- Each proposed extension of a rule is tested on the training set
- the proposed rule with the lowest Laplacian expected error is selected as the next version of the rule
  - until the rule either makes no errors, or
  - until the Laplacian is below a threshold and none of the extensions reduces the Laplacian

33 Extending a rule
- If several proposed rules have the same Laplacian, WHISK uses heuristics that prefer a semantic class over a word
- rationale: prefer the least restrictive rule that fits the data -> such rules probably operate better on unseen data
- terms near the extraction boundaries are also preferred

34 EXTEND_RULE

EXTEND_RULE(Rule, Inst, Tag, Training)
  Best_Rule = NULL; Best_L = 1.0
  If Laplacian of Rule within error tolerance
    Best_Rule = Rule
    Best_L = Laplacian of Rule
  For each Term in Inst
    Proposed = Rule + Term
    Test Proposed on Training
    If Laplacian of Proposed < Best_L
      Best_Rule = Proposed
      Best_L = Laplacian of Proposed
  Rule = Best_Rule

35 Rule set may not be optimal
- WHISK cannot guarantee that the rules it grows are optimal
  - optimal = the lowest Laplacian expected error on the hand-tagged training instances
- terms are added and evaluated one at a time
  - it may happen that adding two terms together would create a reliable rule with high coverage and few errors, while adding either term alone does not keep the rule from making extraction errors

36 Rule set may not be optimal
- If WHISK makes a "wrong" choice of terms to add, it may miss a reliable, high-coverage rule, but it will continue adding terms until the rule operates reliably on the training set
  - such a rule will be more restrictive than the optimal rule and will tend to have lower coverage on unseen instances

37 Structured text
- When the text is rigidly structured, extraction rules can be learned easily from only a few examples
  - e.g., structured text on the web is often created automatically by a formatting tool that delimits the variable information with exactly the same HTML tags or other labels

38 Structured text
Sample text (from a web weather forecast):
  Thursday  partly cloudy  High: 29 C / 84 F  Low: 13 C / 56 F
Rule:
- ID:: 4
- Pattern:: *( Day ) '…' * '…' ( * ) '…' * '…' ( * ) '…' * '…' ( * ) '…'
  - the quoted literals are the exact HTML tags that delimit each field in the page source
- Output:: Forecast {Day $1}{Conditions $2}{High $3}{Low $4}

39 Structured text
- WHISK can learn the previous rule from two training instances (as long as the variable information is not accidentally identical)
- in experiments, this rule gave 100% recall at 100% precision
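
A sketch on invented HTML markup (the real tags from slide 38 are not shown above), illustrating how a fixed tag sequence acts as an exact delimiter for each of the four slots:

import re

# Hypothetical forecast markup; the tags and layout are assumptions for illustration.
html = "<td><b>Thursday</b><br>partly cloudy<br>High: 29 C / 84 F<br>Low: 13 C / 56 F</td>"

# Literal tag sequences delimit the slots, mirroring the structure of rule 4
rule_4 = re.compile(r"<b>(.*?)</b><br>(.*?)<br>High: (.*?)<br>Low: (.*?)</td>")
day, conditions, high, low = rule_4.search(html).groups()
print({"Day": day, "Conditions": conditions, "High": high, "Low": low})
# {'Day': 'Thursday', 'Conditions': 'partly cloudy', 'High': '29 C / 84 F', 'Low': '13 C / 56 F'}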

40 Evaluation
- perfect recall and precision can generally be obtained for structured text in which each set of slots is delimited by some unique sequence of HTML tags or labels
- for less structured text, the performance varies with the difficulty of the extraction task (50%-100%)
  - for the rental ad domain, a single rule accounts for 70% of the recall (with 97% precision)

41 Other applications using IE
- multilingual IE
- question answering systems
- (news) event detection and tracking

42 Multilingual IE
- Assume we have documents in two languages (English/French), and the user requires templates to be filled in one of the languages (English) from documents in either language
  - "Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres."
  - "Gianluigi Ferrero attended the annual meeting of Vercom Corp in London."

43 Both texts should produce the same template fill:
- <meeting_event> :=
  - organisation: 'Vercom Corp'
  - location: 'London'
  - type: 'annual meeting'
  - present: <person>
- <person> :=
  - name: 'Gianluigi Ferrero'
  - organisation: UNCLEAR

44 Multilingual IE: three ways of addressing the problem
- Solution 1
  - a full French-English machine translation (MT) system translates all the French texts into English
  - an English IE system then processes both the translated and the original English texts to extract English template structures
  - this solution requires a separate full IE system for each target language and a full MT system for each language pair

45 Multilingual IE: three ways of addressing the problem
- Solution 2
  - separate IE systems process the French and English texts, producing templates in the original source language
  - a 'mini' French-English MT system then translates the lexical items occurring in the French templates
  - this solution requires a separate full IE system for each language and a mini-MT system for each language pair

46 Multilingual IE: three ways of addressing the problem
- Solution 3
  - a general IE system, with separate French and English front ends
  - the IE system uses a language-independent domain model in which 'concepts' are related via bi-directional mappings to lexical items in multiple language-specific lexicons
  - this domain model is used to produce a language-independent representation of the input text: a discourse model

47 Multilingual IE: three ways of addressing the problem
- Solution 3, continued
  - the required information is extracted from the discourse model, and the mappings from concepts to the English lexicon are used to produce templates with English lexical items
  - this solution requires a separate syntactic/semantic analyser for each language, and the construction of mappings between the domain model and a lexicon for each language

48 Multilingual IE
- Which parts of the IE process/system are language-specific?
- Which parts of the IE process are domain-specific?

49 Question answering systems: TREC (8-10)
- Participants were given a large corpus of newspaper/newswire documents and a test set of questions (open domain)
- the questions were restricted to a limited class of types
- each question was guaranteed to have at least one document in the collection that explicitly answered it
- the answer was guaranteed to be no more than 50 characters long

50 Example questions from TREC-9
- How much folic acid should an expectant mother get daily?
- Who invented the paper clip?
- What university was Woodrow Wilson president of?
- Where is Rider College located?
- Name a film in which Jude Law acted.
- Where do lobsters like to live?

51 More complex questions
- What is epilepsy?
- What is an annuity?
- What is Wimbledon?
- Who is Jane Goodall?
- What is the Statue of Liberty made of?
- Why is the sun yellow?

52 TREC
- Participants returned a ranked list of five [document-id, answer-string] pairs per question
- all processing was required to be strictly automatic
- some of the questions were syntactic variants of an original question

53 Variants of the same question
- What is the tallest mountain?
- What is the world's highest peak?
- What is the highest mountain in the world?
- Name the highest mountain.
- What is the name of the tallest mountain in the world?

54 Examples of answers
- What is a meerkat?
  - The meerkat, a type of mongoose, thrives in…
- What is the population of the Bahamas?
  - Mr. Ingraham's charges of 'impropriety' are unlikely to excite the 245,000 people of the Bahamas
- Where do lobsters like to live?
  - The water is cooler, and lobsters prefer that

55 TREC
- Scoring
  - if the correct answer is found in the first pair, the question gets a score of 1
  - if the correct answer is found in the k-th pair, the score is 1/k (max k = 5)
  - if the correct answer is not found, the score is 0
  - total score for a system: the average of the scores over all questions (see the sketch below)
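
A small worked rendering of this scoring scheme (the three example ranks are invented):

# TREC-style scoring: 1/k for the first correct answer among the five returned
# pairs, 0 if none is correct; the system score is the mean over all questions.
def question_score(rank_of_first_correct):  # rank in 1..5, or None if not found
    return 1.0 / rank_of_first_correct if rank_of_first_correct else 0.0

scores = [question_score(r) for r in (1, 3, None)]  # three illustrative questions
print(sum(scores) / len(scores))                    # (1 + 1/3 + 0) / 3 ≈ 0.44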

56 FALCON, QAS
- Harabagiu et al.: FALCON: Boosting Knowledge for Answer Engines, 2000
- Harabagiu et al.: Answering Complex, List and Context Questions with LCC's Question-Answering Server, 2001

57 FALCON
- NLP methods are used to derive the question semantics: what is the type of the answer?
- IR methods (a search engine) are used to find all text paragraphs that may contain the answer
- incorrect answers are filtered out from the answer candidates (NLP methods)

58 FALCON
- Knowledge sources:
  - old questions (different variants) and the corresponding answers
  - WordNet: alternatives for keywords

59 FALCON: system
- Question processing
  - named entity recognition and phrase recognition -> a semantic form for the question
  - one of the words/phrases in the question may indicate the type of the answer
  - this word is mapped to the answer taxonomy (which uses WordNet) to find a semantic class for the answer type (e.g. quantity, food)
  - the other words/phrases are used as keywords for a query

60 FALCON: system
- Paragraph processing
  - the question keywords are structured into a query that is passed to a search engine
  - only text paragraphs defined by the presence of the query keywords within a window of a pre-defined size (e.g. 10 lines) are retrieved
  - if too few or too many paragraphs are returned, the query is reformulated (by dropping or adding keywords), as sketched below
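
A rough sketch of this reformulation loop; the starting query size, the paragraph-count thresholds, and the search callable are assumptions, not FALCON's actual interface:

def retrieve_paragraphs(all_keywords, search, min_n=5, max_n=500):
    # all_keywords: question keywords ordered from most to least important (assumption)
    n = min(3, len(all_keywords))              # start from the most important keywords
    paragraphs = search(all_keywords[:n])
    for _ in range(len(all_keywords)):         # a bounded number of reformulations
        if len(paragraphs) > max_n and n < len(all_keywords):
            n += 1                             # too many paragraphs: add a keyword
        elif len(paragraphs) < min_n and n > 1:
            n -= 1                             # too few paragraphs: drop a keyword
        else:
            break
        paragraphs = search(all_keywords[:n])
    return paragraphs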

61 FALCON: system
- Answer processing
  - each paragraph is parsed and transformed into a semantic form
  - if unification between the question and answer semantic forms is not possible for any of the answer paragraphs, alternations of the question keywords (synonyms, morphological derivations) are considered and sent to the retrieval engine
  - the new paragraphs are then evaluated

62 FALCON: system
- Logical justification
  - an answer is extracted only if a logical justification of its correctness can be provided
  - the semantic forms of questions and answers are translated into logical forms
  - inference rules model, for example, coreference and some general world knowledge (WordNet)
  - if an answer cannot be justified, semantic alternations are considered and a reformulated query is sent to the search engine

63 Definition questions
- A special case of answer type is associated with questions that inquire about definitions
- some questions have a syntactic format indicating that the question asks for the definition of a certain concept
  - such questions are easily identified, as they are matched by a set of patterns

64 Definition questions
- Some question patterns:
  - What {is|are} <concept>?
  - What is the definition of <concept>?
  - Who {is|was|are|were} <person>?
- Some answer patterns:
  - <concept> {is|are}
  - <concept>, {a|an|the}
  - <concept> -
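
A sketch of recognizing definition questions with the question patterns from this slide; the placeholder names above and the regex rendering below are assumptions:

import re

# Definition-question patterns; the more specific pattern is tried first.
QUESTION_PATTERNS = [
    re.compile(r"^what is the definition of (?P<concept>.+)\?$", re.IGNORECASE),
    re.compile(r"^what (?:is|are) (?P<concept>.+)\?$", re.IGNORECASE),
    re.compile(r"^who (?:is|was|are|were) (?P<concept>.+)\?$", re.IGNORECASE),
]

def concept_to_define(question):
    for pattern in QUESTION_PATTERNS:
        m = pattern.match(question.strip())
        if m:
            return m.group("concept")
    return None

print(concept_to_define("What is a meerkat?"))   # 'a meerkat'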

65 FALCON
- clearly the best system in TREC-9 (2000)
  - 692 questions
  - score: 60%
- the work has been continued within LCC as the Question-Answering Server (QAS), which participated in TREC-10

66 Insight Q/A system
- Soubbotin: Patterns of Potential Answer Expressions as Clues to the Right Answers, 2001
- basic idea: for each question type, there is a set of predefined patterns
  - each such indicator pattern has a score for each (relevant) question type

67 Insight Q/A system
- first, answer candidates are retrieved (the most specific words of the question are used as the query)
- the answer candidates are checked for the presence of the indicator patterns
  - the candidates containing the highest-scoring indicators are chosen as the final answers (see the sketch below)
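
A minimal sketch of indicator-pattern scoring in the spirit of these slides; the question type, the patterns and the scores are invented for illustration (the real system used large hand-built pattern sets):

import re

INDICATORS = {  # question type -> [(indicator pattern, score), ...]  (illustrative)
    "Who-Author": [
        (re.compile(r"author of"), 2.0),
        (re.compile(r"\bby [A-Z]\w+"), 1.0),
    ],
}

def rank_candidates(question_type, candidates):
    # score each candidate by the highest-scoring indicator pattern it contains
    def score(text):
        return max((s for p, s in INDICATORS[question_type] if p.search(text)), default=0.0)
    return sorted(candidates, key=score, reverse=True)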

68 Insight Q/A system
- Preconditions for the use of the method:
  - a detailed categorization of question types ("Who-Post", "Who-Author", …)
  - a large variety of patterns for each type (e.g. 23 patterns for the "Who-Author" type)
  - a sufficiently large number of candidate answers for each question
- TREC-10 results: 68% (the best?)

69 New challenges: TREC-10
- What if the existence of an answer is not guaranteed?
  - it is not easy to recognize that an answer is not available
  - in real-life applications, an incorrect answer may be worse than not returning an answer at all

70 New challenges
- each question may require information from more than one document
  - Name 10 countries that banned beef imports from Britain in the 1990s.
- follow-up questions
  - Which museum in Florence was damaged by a major bomb explosion in 1993?
  - On what day did this happen?

71 Question answering in a closed domain
- In the TREC competitions, the types of questions belonged to a closed class, but the topics did not belong to any specific domain (open domain)
- in practice, a question answering system may be particularly helpful in some closed, well-known domain, for example within a company

72 Question answering in a closed domain
- Special features
  - the questions can be of any type, and they may contain errors and spoken-language expressions
  - the same questions (or variants of them) probably recur regularly -> extensive use of old questions
  - closed domain: extensive use of domain knowledge is feasible
    - ontologies, thesauri, inference rules

73 QA vs IE
- open domain vs closed domain?
- IE: static task definition; QA: the question defines the task dynamically
- IE: structured answer ("database record"); QA: the answer is a fragment of text
  - in the future, QA will also move towards more exact answers
- many similar modules can be used
  - language analysis, general semantics (WordNet)

74 News event detection and tracking
- We would like to have a system that
  - reads news streams (e.g. from news agencies)
  - detects significant events
  - presents the contents of the events to the user as compactly as possible
  - alerts the user if new events occur
  - lets the user follow the development of selected events
    - the system alerts the user if follow-up news appears

75 Event detection and tracking
- What is an event?
  - something that happens in some place at some time
  - e.g. the elections in Zimbabwe in 2002
- an event usually represents some topic
  - e.g. elections
- the definition of an event is not always clear
  - an event may later split into several subtopics

76 Event detection and tracking
- For each new text:
  - decision: is this text about a new event?
  - if not, to which existing event chain does it belong?
- methods: text categorization, clustering
  - similarity metrics (see the sketch below)
- also: language analysis
  - name recognition: proper names, locations, time expressions
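
A minimal sketch of the "new event or existing chain?" decision using a bag-of-words cosine similarity; the threshold and the representation are illustrative assumptions, not from the slides:

from collections import Counter
import math

def cosine(a, b):
    # bag-of-words cosine similarity between two texts
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(sum(v * v for v in wb.values()))
    return dot / norm if norm else 0.0

def assign_event(new_text, event_chains, threshold=0.2):
    # event_chains: {event_id: [texts...]}; returns the best chain, or None for a new event
    best_id, best_sim = None, 0.0
    for event_id, texts in event_chains.items():
        sim = max(cosine(new_text, t) for t in texts)
        if sim > best_sim:
            best_id, best_sim = event_id, sim
    return best_id if best_sim >= threshold else None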

77 Event detection and tracking vs IE vs QA
- No query, no task definition
  - the user may choose some event chain to follow, but the system has to be prepared to follow any chain
- open domain; WordNet could be used for measuring the similarity of two texts
- analysis of news stories
  - name recognition (proper names, time expressions, locations, etc.) is important in all of them

78 Closing
- What did we study:
  - the stages of an IE process
  - learning domain-specific knowledge (extraction rules, semantic classes)
  - IE from (semi-)structured text
  - some related approaches/applications

79 Closing
- Exam: next week on Wednesday at (Auditorio)
  - alternative: on Tuesday at (Auditorio)
- some model answers for the exercises will appear soon
- remember the course feedback (Kurssikysely)!