Distributed Databases (Basi di dati distribuite), Prof. M.T. PAZIENZA, academic year 2003-2004.


INFORMATION EXTRACTION AND QUESTION / ANSWERING

Information Extraction Information Extraction (IE) generally refers to automatic approaches for locating important facts in large collections of documents, with the aim of highlighting specific information to be used for enriching other texts and documents: populating summaries, feeding reports, filling in forms, or storing information for further processing (e.g. data mining). The extracted information is usually structured in the form of “templates”.

Information Extraction The process of Information Extraction consists of two major steps: (1) extracting individual “facts” from the text of a document through local text analysis; (2) integrating the extracted facts to produce larger facts or new facts (through inference).

Information Extraction Short history (1) IE originated in the natural language processing community with the MUC conferences (starting in 1987 and sponsored by DARPA) and the definition of a task: within a specific application domain and corpus, a template with the relevant information has to be filled for every event of each foreseen class.

Information Extraction Short history (2) In 1995 further goals for IE were proposed: to identify processing tasks that are largely domain independent (e.g. NE, Named Entity Recognition); to focus on the portability of IE tasks to new event classes; to add three new tasks: coreference resolution, word-sense disambiguation, and predicate-argument syntactic structuring.

Information Extraction Terminology A template is a sort of linguistic pattern (a set of attribute-value pairs, with the values being text strings) described by experts to represent the structure of a specific event in a given domain. The template corresponds to the final output format of the selected information. The scenario identifies the specification of the particular events or relations to be extracted.
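To make the notion of a template concrete, here is a minimal Python sketch of a template as a set of attribute-value pairs; the event type and attribute names (a “management succession” event with company, person_in, etc.) are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of a template as attribute-value pairs.
# The event type and the attribute names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuccessionTemplate:
    company: Optional[str] = None      # organization involved in the event
    person_in: Optional[str] = None    # person taking the position
    person_out: Optional[str] = None   # person leaving the position
    position: Optional[str] = None     # job title, as a text string
    date: Optional[str] = None         # normalized date string

# A filled template: every value is a text string extracted from the document.
filled = SuccessionTemplate(company="Acme Corp.", person_in="J. Smith",
                            position="CEO", date="1996-03-01")
print(filled)
```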

Information Extraction General Architecture for an IE system “An IE system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically”. (by J. Hobbs).

Information Extraction General Architecture for an IE system Each system can be characterized by its own subset of modules drawn from the following set: text zoner, pre-processing, filter, preparser, parser, fragment combiner, semantic interpreter, lexical disambiguation, coreference resolution / discourse processing, template generator.
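As a rough illustration of the cascade idea, the Python sketch below chains placeholder module functions so that each step adds structure to (or filters) the output of the previous one; the function bodies are stubs, not an actual implementation.

```python
# Sketch of an IE cascade: each module transforms the output of the previous one.
# All module bodies below are placeholders.

def text_zoner(text):         return {"zones": [text]}
def preprocess(doc):          doc["sentences"] = doc["zones"][0].split(". "); return doc
def filter_sentences(doc):    return doc          # keep everything in this sketch
def preparse(doc):            return doc
def parse(doc):               return doc
def combine_fragments(doc):   return doc
def interpret(doc):           doc["events"] = []; return doc
def resolve_coreference(doc): return doc
def generate_templates(doc):  return doc["events"]

PIPELINE = [text_zoner, preprocess, filter_sentences, preparse, parse,
            combine_fragments, interpret, resolve_coreference, generate_templates]

def run_ie(text):
    result = text
    for module in PIPELINE:   # apply the cascade left to right
        result = module(result)
    return result

print(run_ie("The board appointed J. Smith as CEO. Trading was heavy."))
```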

Information Extraction Text zoner This module turns a text into a set of text segments. As a minimum result, it separates the formatted from the unformatted regions.
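A minimal zoning sketch, under the assumption (not stated in the slides) that formatted regions are marked by SGML-style tags:

```python
import re

def zone(text):
    """Split a raw document into formatted (tagged) and unformatted regions.
    The <...> tag convention is only an illustrative assumption."""
    formatted = re.findall(r"<[^>]+>.*?</[^>]+>", text, flags=re.S)
    unformatted = re.sub(r"<[^>]+>.*?</[^>]+>", " ", text, flags=re.S)
    return {"formatted": formatted, "unformatted": unformatted.strip()}

doc = "<HEADLINE>Acme names new CEO</HEADLINE> The board appointed J. Smith as CEO."
print(zone(doc))
```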

Information Extraction Pre-processing This module locates sentence boundaries in the text, producing for each sentence a sequence of lexical items (words together with their possible POS tags). It also recognizes multiword expressions (via lexical lookup methods) and recognizes and normalizes certain basic types that occur in the genre, such as dates, times, personal and company names, locations, currency amounts, and so on.
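The sketch below illustrates this step with plain regular expressions: naive sentence splitting, tokenization, and normalization of one basic type (dates, in an assumed month/day/year format); lexicon lookup and POS tagging are omitted.

```python
import re

DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")   # e.g. 3/14/1996 (assumed format)

def preprocess(text):
    # Naive sentence boundary detection on ., ! and ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sent in sentences:
        # Normalize dates to ISO form before tokenizing.
        sent = DATE.sub(lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}", sent)
        tokens = re.findall(r"\w+(?:-\w+)*|\S", sent)    # words and punctuation
        processed.append(tokens)
    return processed

print(preprocess("The board met on 3/14/1996. It appointed a new CEO."))
```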

Information Extraction Filter To speed up processing, this module uses superficial techniques to filter out (from the previously recognized sentences) those that are likely to be irrelevant. In any application, subsequent modules will be looking for patterns of words that signal relevant events. If a sentence has none of these words, there is no reason to process it further.
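Such a filter can be as simple as a keyword test; in the sketch below the trigger words for the event class of interest are invented for illustration.

```python
# Hypothetical trigger words signalling the event class of interest.
TRIGGERS = {"appointed", "named", "resigned", "succeeds", "retired"}

def relevant(sentence_tokens):
    """Cheap relevance test: keep the sentence only if it contains a trigger word."""
    return any(tok.lower() in TRIGGERS for tok in sentence_tokens)

sentences = [["The", "board", "appointed", "J.", "Smith", "as", "CEO", "."],
             ["Trading", "was", "heavy", "on", "Monday", "."]]
print([s for s in sentences if relevant(s)])   # only the first sentence survives
```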

Information Extraction Preparser This module recognizes very common small-scale structures, simplifying the task of the parser. A few systems at this level recognize noun groups (noun phrases up through the head noun) as well as verb groups (verbs together with their auxiliaries). Appositives can be attached to their head nouns with high reliability (e.g. Prime Minister, President of the Republic, etc.).
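A common way to implement this kind of small-scale recognition is regular-expression chunking over POS-tagged tokens. The sketch below uses NLTK's RegexpParser (assuming NLTK and its tokenizer/tagger resources are installed); the grammar is a simplified illustration, not the slides' own rules.

```python
import nltk  # assumes nltk plus its tokenizer and POS-tagger resources are installed

# Simplified chunk grammar: noun groups up through the head noun,
# verb groups as a verb plus its auxiliaries/adverbs.
GRAMMAR = r"""
  NG: {<DT|PRP\$>?<JJ.*>*<NN.*>+}
  VG: {<MD>?<RB>?<VB.*>+}
"""
chunker = nltk.RegexpParser(GRAMMAR)

tokens = nltk.word_tokenize("The new chief executive has resigned.")
tagged = nltk.pos_tag(tokens)
print(chunker.parse(tagged))   # a tree with NG and VG subtrees
```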

Information Extraction Parser This module takes a sequence of lexical items (fragments) and tries to produce a parse tree for the entire sentence. More and more systems are abandoning full-sentence parsing in information extraction applications: being interested only in recognizing fragments, they try only to locate within the sentence the various patterns that are of interest for the application.

Information Extraction Fragment combiner This module provides indications on how to combine the previously obtained parse-tree fragments.

Information Extraction Semantic interpreter This module translates the parse tree or parse-tree fragments into a semantic structure, a logical form, or an event frame. Often lexical disambiguation takes place at this level as well. The method for semantic interpretation is function application, or an equivalent process that matches predicates with their arguments.
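The sketch below illustrates the function-application idea on a single, assumed pattern: a predicate derived from the verb group is applied to the arguments found in the surrounding noun groups to build an event frame; the role names are illustrative.

```python
# Sketch of semantic interpretation by matching a predicate with its arguments.
# Pattern assumed: <NG subject> <VG "appointed"> <NG object> "as" <NG role>.

def interpret_appointment(subject_ng, object_ng, role_ng):
    """Apply the APPOINT predicate to its arguments, yielding an event frame."""
    return {"predicate": "APPOINT",
            "agent": subject_ng,      # who does the appointing
            "person": object_ng,      # who is appointed
            "position": role_ng}      # the position filled

# Fragments as they might come out of the preparser / fragment combiner.
frame = interpret_appointment("the board", "J. Smith", "CEO")
print(frame)
```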

Information Extraction Lexical disambiguation Lexical disambiguation translates a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates. Disambiguation generally happens by constraining the interpretation through the context in which the ambiguous word occurs, possibly together with the “a priori” probabilities of each word sense.
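A minimal sketch of this idea, combining a Lesk-style context overlap with sense priors; the sense inventory below is invented for illustration.

```python
# Toy sense inventory for the ambiguous word "plant":
# each sense has an a-priori probability and a few context clue words.
SENSES = {
    "plant/factory": {"prior": 0.6, "clues": {"production", "workers", "factory", "chemical"}},
    "plant/flora":   {"prior": 0.4, "clues": {"leaves", "garden", "grow", "species"}},
}

def disambiguate(word_senses, context_tokens):
    """Score each sense by prior * (1 + overlap with the sentence context)."""
    context = {t.lower() for t in context_tokens}
    scores = {sense: info["prior"] * (1 + len(info["clues"] & context))
              for sense, info in word_senses.items()}
    return max(scores, key=scores.get)

sentence = "The chemical plant employs two thousand workers".split()
print(disambiguate(SENSES, sentence))   # -> plant/factory
```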

Information Extraction Coreference resolution / discourse processing This module resolves: coreference for basic entities such as pronouns, definite noun phrases, and anaphora; reference for more complex entities such as events, where an event may be identified with an event that was found previously, follow as a consequence of a previously found event, or fill a role in a previous event.
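For the basic-entity part of the task, a very rough baseline resolves a pronoun to the most recent previously mentioned entity with compatible features; the gender feature used below is an illustrative assumption.

```python
# Naive pronoun resolution: pick the most recent compatible antecedent.
# The gender annotations on entities are illustrative assumptions.
PRONOUN_GENDER = {"he": "male", "she": "female", "it": "neuter"}

def resolve_pronoun(pronoun, mentions):
    """mentions: list of (entity_text, gender) pairs in document order."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    for text, gender in reversed(mentions):       # most recent mention first
        if gender == wanted:
            return text
    return None

mentions = [("Acme Corp.", "neuter"), ("J. Smith", "male")]
print(resolve_pronoun("He", mentions))   # -> J. Smith
```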

Information Extraction Template generator The semantic structures generated by the natural language processing modules are used to produce the template, as specified by the final user, only when the events pass the defined threshold of interest.
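A sketch of this final step, assuming each event frame carries a numeric interest score (an assumption; the slides only mention a threshold of interest).

```python
# Turn interpreted event frames into output templates, keeping only those
# whose interest score passes a user-defined threshold.
INTEREST_THRESHOLD = 0.5   # illustrative value

def generate_templates(event_frames, threshold=INTEREST_THRESHOLD):
    templates = []
    for frame in event_frames:
        if frame.get("score", 0.0) >= threshold:
            # Map internal frame slots to the user-facing template fields.
            templates.append({"COMPANY": frame.get("agent"),
                              "PERSON": frame.get("person"),
                              "POSITION": frame.get("position")})
    return templates

events = [{"predicate": "APPOINT", "agent": "the board", "person": "J. Smith",
           "position": "CEO", "score": 0.9},
          {"predicate": "APPOINT", "person": "unknown", "score": 0.2}]
print(generate_templates(events))   # only the first event becomes a template
```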

Information Extraction There is also agreement on a number of features: named entity recognition, coreference resolution, template production, and scenario template production.

Information Extraction Named entity recognition It refers to the identification (inside the text) and extraction of named entities (NEs). NEs generally relate to domain concepts and are associated with semantic classes such as person, organization, place, date, amount, etc. The accuracy of NE recognition is very high (more than 90%) and comparable with that of humans.
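To see NE recognition in practice, an off-the-shelf tagger can be used; the sketch below uses spaCy, assuming the spacy package and the en_core_web_sm model are installed (the example sentence is invented).

```python
import spacy   # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("J. Smith was appointed CEO of Acme Corp. in London on 14 March 1996 for $1 million.")

# Each recognized entity carries a text span and a semantic class label
# (PERSON, ORG, GPE, DATE, MONEY, ...).
for ent in doc.ents:
    print(f"{ent.text:25s} {ent.label_}")
```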

Information Extraction Co-reference resolution It allows identifying identity relations between previously extracted NEs. Anaphora resolution is widely used to recognize relevant information about concepts (NEs) or events scattered across the text: this activity is an important source of information, enabling the system to assign a statistical relevance to recognized events.

Information Extraction Template production As a result of the previous activities, an IE system becomes aware of NEs and their descriptions. This represents a first level of template (called TE, “Template Element”). The collection of TEs may be considered a basic knowledge base that the system accesses to get information on the main domain concepts, as they have been recognized in the text.

Information Extraction Scenario template production It results from the synthesis of several tasks, mainly the identification of Template Elements that relate to one another: it represents an event (scenario) relevant to the domain under analysis, and the recognized values are used to fill in a scenario template.

Information Extraction Adaptive IE systems As an example, several big companies have millions of documents, stored in different parts of the world and available via intranets, in which the knowledge of their employees is encoded. Textual documents cannot be queried in a traditional fashion, and therefore the stored knowledge can neither be used by automatic systems nor be easily managed by humans. Knowledge is difficult to capture, share and reuse among employees, reducing the company's efficiency and competitiveness.

Information Extraction Adaptive IE systems IE is an ideal support for knowledge identification and extraction from Web documents, as it can support document analysis either in a fully automatic approach (unsupervised extraction of information) or in a semi-automatic one (e.g. as support for human annotators in locating relevant facts in documents, via information highlighting). A machine-learning approach may be helpful here.

Information Extraction Adaptive IE systems Machine learning (ML) techniques have been successfully applied to some lower-level NLP tasks. NE recognition, chunking, and co-reference and anaphora resolution are interesting examples of such approaches.

Question / Answering A Q/A system accepts questions in natural language form, searches for answers over a collection of documents, extracts information relevant to the question, and formulates concise answers.

Question / Answering Short history The Q/A tracks of the TREC conferences have supported the definition of a common approach to the problem. Q/A systems are open domain, so their performance is tightly coupled with the complexity of the questions asked and the difficulty of answer extraction.

Question / Answering Taxonomy of Q/A systems Q/A systems can be classified according to: 1. linguistic and knowledge resources; 2. the natural language processing involved; 3. document processing; 4. reasoning methods; 5. whether or not the answer is explicitly stated in a document; 6. whether or not answer fusion is necessary.

Question / Answering Question classes 1. Q/A systems capable of processing factual questions 2. Q/A systems enabling simple reasoning mechanisms 3. Q/A systems capable of answer fusion from different documents 4. Interactive Q/A systems 5. Speculative questions

Question / Answering Current approaches 1. Question analysis 2. Document collection processing 3. Candidate document selection 4. Candidate document analysis 5. Answer extraction 6. Response generation

Question / Answering Question analysis The question is analyzed for subsequent processing. The question may be interpreted in the context of an ongoing dialogue and in the light of a model the system has of the user. The user could be asked to clarify the question before processing.
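One common sub-task of question analysis is deciding the expected answer type. The sketch below does this with simple wh-word rules; the rules and the type labels are illustrative assumptions.

```python
# Rough expected-answer-type classification from the question's wh-word.
# The label set (PERSON, LOCATION, DATE, ...) is an illustrative assumption.
RULES = [("who", "PERSON"), ("where", "LOCATION"), ("when", "DATE"),
         ("how many", "NUMBER"), ("how much", "QUANTITY"), ("why", "REASON")]

def expected_answer_type(question):
    q = question.lower()
    for cue, answer_type in RULES:
        if q.startswith(cue) or f" {cue} " in q:
            return answer_type
    return "OTHER"   # e.g. definition or yes/no questions

print(expected_answer_type("Who is the CEO of Acme Corp.?"))    # -> PERSON
print(expected_answer_type("When was the company founded?"))    # -> DATE
```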

Question / Answering Document collection processing The reference document collection is the knowledge source for answering questions. It needs to be preprocessed before it can be searched efficiently.

Question / Answering Candidate document selection A subset of the document collection is selected, comprising those documents deemed most likely to contain an answer to the question.
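A standard baseline for this stage is bag-of-words retrieval. The sketch below ranks documents by TF-IDF cosine similarity to the question using scikit-learn (assuming scikit-learn is installed; the tiny document collection is invented).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "J. Smith was appointed CEO of Acme Corp. in 1996.",
    "Acme Corp. reported record profits last quarter.",
    "The weather in London was unusually mild.",
]
question = "Who is the CEO of Acme Corp.?"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)          # one TF-IDF vector per document
q_vector = vectorizer.transform([question])

scores = cosine_similarity(q_vector, doc_matrix)[0]
ranked = sorted(zip(scores, documents), reverse=True)     # most promising documents first
for score, doc in ranked[:2]:
    print(f"{score:.2f}  {doc}")
```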

Question / Answering Candidate document analysis Additional detailed analysis of the candidates selected at the preceding stage could be required.

Question / Answering Answer extraction Candidate answers are extracted from the documents and ranked in terms of probable correctness.
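A minimal sketch of answer ranking, assuming question analysis has already produced an expected answer type and the candidate passages have been NE-tagged; the scoring heuristic is illustrative only.

```python
# Rank candidate answers: an entity scores higher if its NE class matches the
# expected answer type and if its passage shares more words with the question.

def score_candidate(entity, entity_type, passage, question, expected_type):
    type_match = 1.0 if entity_type == expected_type else 0.2
    overlap = len(set(passage.lower().split()) & set(question.lower().split()))
    return type_match * (1 + overlap)

question, expected_type = "Who is the CEO of Acme Corp.?", "PERSON"
candidates = [("J. Smith", "PERSON", "J. Smith was appointed CEO of Acme Corp."),
              ("London",   "GPE",    "Acme Corp. is headquartered in London."),
              ("1996",     "DATE",   "The CEO took office in 1996.")]

ranked = sorted(candidates,
                key=lambda c: score_candidate(c[0], c[1], c[2], question, expected_type),
                reverse=True)
print(ranked[0][0])   # -> J. Smith
```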

Question / Answering Response generation A response is returned to the user. It may be affected by the dialogue context and user model, if present, and may in turn lead to these being updated.