Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Modern Information Retrieval Chapter 1: Introduction
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
Probabilistic Language Processing Chapter 23. Probabilistic Language Models Goal -- define probability distribution over set of strings Unigram, bigram,
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Information Retrieval Review
ISP 433/533 Week 2 IR Models.
Modern Information Retrieval Chapter 1: Introduction
CSM06 Information Retrieval Lecture 3: Text IR part 2 Dr Andrew Salway
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
Evaluating the Performance of IR Sytems
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
تمرين شماره 1 درس NLP سيلابس درس NLP در دانشگاه هاي ديگر ___________________________ راحله مکي استاد درس: دکتر عبدالله زاده پاييز 85.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Introduction to Information Retrieval and Web Search.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
Information Retrieval in Practice
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words.
CLEF – Cross Language Evaluation Forum Question Answering at CLEF 2003 ( Bridging Languages for Question Answering: DIOGENE at CLEF-2003.
Text mining.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
12 July 2002Colloquim on Applications of Natural Langauge Corpora, Saarland University Domain-specific Web Corpora and their Applications Gregor Erbach.
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
LATENT SEMANTIC INDEXING Hande Zırtıloğlu Levent Altunyurt.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
GTRI.ppt-1 NLP Technology Applied to e-discovery Bill Underwood Principal Research Scientist “The Current Status and.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 23: Probabilistic Language Models April 13, 2004.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 3. Word Association.
Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas.
Conceptual structures in modern information retrieval Claudio Carpineto Fondazione Ugo Bordoni
Document Databases for Information Management Gregor Erbach FTW, Wien DFKI, Saarbrucken ETL, Tsukuba
Information Retrieval
Modern Information Retrieval Lecture 2: Key concepts in IR.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Natural Language Processing Topics in Information Retrieval August, 2002.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Information Retrieval Lecture 3 Introduction to Information Retrieval (Manning et al. 2007) Chapter 8 For the MSc Computer Science Programme Dell Zhang.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics Semantic distance between two words.
SIMS 202, Marti Hearst Final Review Prof. Marti Hearst SIMS 202.
Search Engine and Optimization 1. Agenda Indexing Algorithms Latent Semantic Indexing 2.
Robust Semantics, Information Extraction, and Information Retrieval
Multimedia Information Retrieval
CSE 635 Multimedia Information Retrieval
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Presentation transcript:

Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken

Outline Information Management Applications Information Retrieval Techniques Categorization, Clustering Summarization Information Extraction Question Answering Points for discussion

Well-known Applications of Collocations Lexicography Machine Translation NL Generation NL Parsing Terminology Extraction Foreign Language Teaching Speech Recognition

Information Management Applications Information Retrieval Text Categorisation (by language, topic, author, genre...) Clustering Summarisation / Keyword Extraction Information Extraction Question Answering

Information Retrieval Most IR systems don't retrieve information, but documents Boolean retrieval: an unordered set of documents are returned as result for a query Ranked retrieval: an ordered list of documents is returned; relevance of documents is determined by matching with a query

IR System Model Documents RQRQ Matching [0, 1] Query Representation RDRD

Query Languages Co-occurence within document information AN D retrieval Negation information AND (NOT retrieval) Multi-word expression "information retrieval" Proximity operators information NEAR retrieval

Evaluation Precision Recall Precision/recall graphs 11 point average precision TREC (Text Retrieval Conference) TREC Tasks: ad-hoc, web, spoken documents, multimedia, cross-language...

Relevance Relevance is matching of a document with an information need expressed through a query Relevance is considered as binary and determined by human assessors for document-query pairs Relevance is modelled by a similarity measure that compares query and document representations

Document and Query Representations in IR Documents and queries are generally represented as a vector of terms weights Documents are treated as bags of words Preprocessing: stemming or morphological analysis POS, chunking, syntax did not improve information retrieval performance

Similarity Measures Term weighting: TF, TF/ICF, TF/IDF Similarity measures determine how close two documents are, or how alike a document and a query are A common similarity measure is the cosine of the angles between the vector representations

Cosine Similarity term2 term

Result Ranking Adjacency or proximity of search terms can be taken into account in ranking of retrieval results This accounts for phrases and collocations Search terms occurring near each other (e.g. within a paragraph) are more likely to be related than search term occurring in different parts of a document

Latent Semantic Indexing LSI: Singular value decomposition, dimensionality reduction LSI associates terms that share the same context, i.e. can be substituted Applications: information retrieval, cross- language IR, language learning, text categorisation, vocabulary tests

Query Expansion Query expansion with related terms (e.g. from WordNet, thesarus) Relevance Feedback: Query expansion with terms from relevant document Blind Relevance Feedback: Query expansion with terms from top-ranking document. Expansion with co-occurring terms improves precision/recall.

Language Models for IR Language models generate queries from documents Estimate probability that a given query was generated by a particular document Uni-gram language models (special case of probabilistic IR)

Cross-language IR Methods: document translation, query translation, parallel/comparable corpora

Document Categorization Similar techniques to IR (document representation, similarity measures) Document base contains categorized documents New document as query which retrieves the best matching documents from database Support Vector Machines achieve very good performance on various text categorization tasks

Document Clustering Similar techniques to IR (document representation, similarity measures) Each cluster is represented by a centroid Iterative hierarchical grouping of similar documents

Summarization Two approaches: Extraction (of sentencs or keywords) and abstraction (summary generation) Indicative vs. informative summaries Query-independent vs. query-biased summaries Evaluation criteria: informativeness, coherence

Information Extraction Tasks: named entity extraction, coreference, template extraction named entities: person, organisation, location, time, date, money, percentage methods: finite-state grammars, finite-state transducers evaluation: precision, recall, f-measure

Question Answering Answer extraction (passage retrieval) vs. Information extraction + answer generation Combination of IR-based and NLP-based approaches (semantic concepts, dependency relations). TREC open domain QA evaluation: extract 50-word passage containing the answer to a factual question

Co-location spaces Linear (speech, text) document (as bag of words) hierarchical structure (tree, dependency relations) semantic/conceptual space (e.g. WordNet) cyberspace (hyperlinks)

Collocations and IM Collocations are multi-word units with statistical associations with restricted semantic compositionality Are they useful for information management applications?

Collocations and Document Representations Common representations treat terms in the document and query as independent Collocations research shows that they are not independent Implications?

Collocations / Query Formulation Use of collocations for query expansion? (e.g. collocation, corpus, association... vs. collocation, facility, service, server, hosting...) Automatic or interactive?

Collocations / Categorisation & Clustering How much can category-specific collocations improve performance? Collocations for identification of genre, author, dialect?

Collocations / IE IE techniques (finite-state shallow parsing) for collocation identification Use of collocation in IE grammars (Gewinn machen, Umsatz erzielen...)

Collocations / QA Use of collocations for finding answers (e.g. function-proper_name)

Collocations and Summarisation Keyword / key phrase extraction Evaluation of coherence: which association measures can be used?

Questions for Discussion Are collocations a useful level of representation for indexing and retrieval? Or are they only useful in establishing semantic representations?