INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur

Information Retrieval
- Problem definition: given a user’s information need, find documents satisfying that need.
- “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.).
- Types of information: text, images/graphics, speech, video, etc. Text is still the most commonly used.

Information Retrieval
- Information retrieval is a research-driven theoretical and experimental discipline.
- The focus is on different aspects of the information-seeking process:
  - Computer scientist: fast and accurate search engine
  - Librarian: organization and indexing of information
  - Cognitive scientist: the process in the searcher’s mind
  - …
- Progress is influenced by advances in computational linguistics, information visualization, cognitive psychology, HCI, …

Information Retrieval
- Basic principle:
  - Document -> list of keywords / content descriptors / terms
  - User’s information need -> (natural-language) query -> list of keywords
  - Measure the overlap between query and documents.

Stages of IR (diagram; stages listed from creation to retrieval)
- Creation
- Indexing, organizing
- Indexed and structured information
- Information retrieval: searching, browsing

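IR process (diagram)
- Document side: real world -> collection of documents -> document representations
- User side: anomalous state of knowledge -> information need -> query
- Matching of the query against the document representations -> results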

Document Representation: Indexing Inverted index
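As a rough illustration of the inverted index named on this slide, the sketch below maps each term to the sorted list of documents that contain it. The function name `build_inverted_index`, the whitespace tokenization and the three sample documents are assumptions made for the example; they do not come from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplistic tokenization (lowercase + whitespace split), for illustration only.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection.
docs = {
    1: "information retrieval with an inverted index",
    2: "boolean retrieval model",
    3: "vector space model",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
print(index["model"])      # [2, 3]
```

Keeping each posting list sorted is what makes the list-merging operations of Boolean retrieval cheap.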

Vocabulary
- Vocabulary (indexing language): the set of concepts (terms or phrases) that can be used to index documents in a collection.
- Controlled
  - Specific to specialized domains
  - Potential for increased consistency of indexing and precision of retrieval
- Uncontrolled (free)
  - Potentially all the terms in the documents
  - Potential for increased recall

Indexing
- Tokenize: identify individual words.
- Stopword removal: eliminate common words, e.g. and, of, the.
- Stemming: reduce words to a common root, e.g. analysis, analyze, analyzing -> analy (an illustrative root); use a standard algorithm such as Porter’s.
- Thesaurus: find synonyms for words in the document.
- Phrases: find multi-word terms, e.g. computer science, data mining; use syntactic/linguistic or statistical methods.
- Named entities: identify names of people, organizations and places; dates; monetary or other amounts, etc.
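The first three indexing steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stopword list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer. It is not the Porter algorithm and will not reproduce Porter's output.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"and", "of", "the", "a", "an", "in", "to", "is"}
# Suffixes tried in order by the crude stemmer below.
SUFFIXES = ("ing", "es", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(index_terms("The indexing of documents and the ranking of results"))
# ['index', 'document', 'rank', 'result']
```

Thesaurus lookup, phrase detection and named-entity recognition would be further stages after these and are left out of the sketch.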

Boolean Retrieval Model
- Keywords are combined using AND, OR, and (AND) NOT, e.g. (medicine OR treatment) AND (hypertension OR “high blood pressure”).
- Efficient and easy to implement (list merging):
  - AND: intersection
  - OR: union
- Drawbacks:
  - OR: one match is as good as many
  - AND: one miss is as bad as all
  - No ranking
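The list merging mentioned above can be sketched as two linear merges over sorted posting lists. The posting lists here are invented for the example query from the slide; `intersect` and `union` are illustrative names, not the API of any particular search library.

```python
def intersect(a, b):
    """AND: merge two sorted posting lists, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge two sorted posting lists, keeping IDs present in either."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


# Hypothetical posting lists (document IDs) for the example query.
postings = {
    "medicine": [1, 4, 7],
    "treatment": [2, 4, 9],
    "hypertension": [4, 5, 9],
    "high blood pressure": [1, 9],
}
left = union(postings["medicine"], postings["treatment"])
right = union(postings["hypertension"], postings["high blood pressure"])
print(intersect(left, right))  # [1, 4, 9]
```

The result is an unordered set of matches: documents 1, 4 and 9 are all returned as equals, which is the "no ranking" drawback listed on the slide.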

Term Weighting
- Any text item (“document”) is represented as a list of terms and associated weights.
- Term = keyword or content descriptor
- Weight = measure of the importance of a term in representing the information contained in the document

Vector Space Model
- Term frequency (tf): repeated words are strongly related to content.
- Inverse document frequency (idf): an uncommon term is more important.
- Normalization by document length:
  - Long documents contain many distinct words.
  - Long documents contain the same word many times.
  - Term weights for long documents should therefore be reduced.
  - Use number of bytes, number of distinct words, Euclidean length, etc.
- Weight = tf × idf / normalization
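One possible reading of "weight = tf × idf / normalization" is sketched below. The slide does not fix the exact formulas, so the choices here are assumptions: raw counts for tf, `log(N / df)` for idf, and Euclidean length for the normalization. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf / length."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Euclidean length of the unnormalized vector (one of the options on the slide).
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / length if length else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical toy collection.
docs = ["information retrieval retrieval", "information need", "web search engine"]
vectors = tfidf_vectors(docs)
# "retrieval" is repeated and rarer than "information", so it gets the larger weight.
print(vectors[0]["retrieval"] > vectors[0]["information"])  # True
```

With this idf, a term that occurs in every document gets weight zero, which matches the intuition that it cannot help discriminate between documents.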

Retrieval
- Measure vocabulary overlap between the user query and documents.
- Use the inverted index.
- Score by the cosine of the angle between document and query vectors.
- Ranked retrieval: return documents in decreasing order of score.
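Cosine ranking can be sketched as follows. To stay short, the example uses raw term counts as weights and scores every document directly; a real system would use weighted vectors and consult the inverted index so that only documents sharing a term with the query are scored. The names `cosine` and `rank` and the sample documents are assumptions for the illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return indices of matching documents, best cosine score first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), i) for i, d in enumerate(docs)]
    # Sort by descending score, breaking ties by document index; drop non-matches.
    return [i for score, i in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


# Hypothetical toy collection.
docs = [
    "road accidents and road safety",
    "stock market report",
    "accidents at work",
]
print(rank("road accidents", docs))  # [0, 2]
```

Unlike the Boolean model, a document matching only one of the two query terms is still returned, just lower in the list.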

Query Expansion
- Searching depends on matching keywords between the user query and the document.
- Nature of language -> searchers and document creators may use different keywords to denote the same “concept”.
  - Example: fatalities in road accidents on G.T. Road
- Vocabulary mismatch -> poor retrieval quality.
- The problem is aggravated by short queries and large, heterogeneous databases.
- Solution: expand the query by adding related words/phrases.
- Issues:
  - Selecting which terms to add to the query
  - Calculating weights for the added terms
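A minimal sketch of one way to expand a query, assuming a hand-built thesaurus is available. Both issues from the slide are settled here in the simplest possible way: terms are selected by thesaurus lookup, and every added term receives a fixed, lower weight. The thesaurus entries, the function name `expand_query` and the weight 0.5 are all hypothetical.

```python
def expand_query(query, thesaurus, added_weight=0.5):
    """Return {term: weight}: original terms at 1.0, related terms at added_weight."""
    weights = {term: 1.0 for term in query.lower().split()}
    for term in list(weights):
        for related in thesaurus.get(term, []):
            # setdefault leaves original query terms at their full weight.
            weights.setdefault(related, added_weight)
    return weights


# Hypothetical thesaurus entries for the example on the slide.
thesaurus = {
    "fatalities": ["deaths", "casualties"],
    "accidents": ["crashes", "collisions"],
}
expanded = expand_query("fatalities in road accidents", thesaurus)
print(sorted(expanded))
# ['accidents', 'casualties', 'collisions', 'crashes', 'deaths', 'fatalities', 'in', 'road']
```

A document that speaks of "deaths in road crashes" now shares terms with the expanded query even though it shares only "in" and "road" with the original one.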

Relevance Feedback
- The original query is used to retrieve some number of documents.
- The user examines some of the retrieved documents and provides feedback about which are relevant and which are non-relevant.
- The system uses the feedback to “learn” a better query:
  - select/emphasize words that occur more frequently in relevant documents than in non-relevant documents;
  - eliminate/de-emphasize words that occur more frequently in non-relevant than in relevant documents.
- The resulting query should bring in more relevant documents and fewer non-relevant documents.
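The slide does not name a specific algorithm. One common way to implement this kind of feedback is a Rocchio-style update, sketched here under that assumption: the new query is the old query plus a fraction of the average relevant document, minus a fraction of the average non-relevant document. The parameter values and the sample vectors are illustrative only.

```python
from collections import Counter


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update over {term: weight} dicts; terms with weight <= 0 are dropped."""

    def centroid(docs):
        # Average vector of a list of documents (empty dict if there are none).
        total = Counter()
        for doc in docs:
            total.update(doc)
        return {t: c / len(docs) for t, c in total.items()} if docs else {}

    pos, neg = centroid(relevant), centroid(non_relevant)
    new_query = {}
    for t in set(query) | set(pos) | set(neg):
        w = alpha * query.get(t, 0.0) + beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
        if w > 0:
            new_query[t] = w
    return new_query


# Hypothetical feedback: the user wanted the car, not the animal.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1, "car": 1}, {"jaguar": 1, "car": 1, "engine": 1}]
non_relevant = [{"jaguar": 1, "cat": 1}]
new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['car', 'engine', 'jaguar']
```

Terms frequent in relevant documents ("car", "engine") enter the query, while "cat", seen only in the non-relevant document, ends up with a negative weight and is eliminated.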

Link/Citation Analysis
- In uncontrolled environments like the WWW, documents are uncontrolled and untrusted, and ranking has commercial implications.
- The presence of terms by itself does not signify relevance:
  - Spamming
  - Importance of the author
- Hence link/citation analysis.

PageRank
- Used in the Google search engine.
- A “global” ranking of every web page is calculated from the hyperlink structure of the web (content is ignored).
- Documents with matching keywords are returned in global rank order.
- Principle: highly linked pages are more important than pages with few links. A page has a high rank if the sum of the ranks of its backlinks is high.
- Most effective for underspecified (general) queries.
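The principle can be illustrated with a small power-iteration sketch. Assumptions not stated on the slide: a damping factor of 0.85, a fixed number of iterations, and the rank of pages without outgoing links being spread evenly over all pages. The four-page link graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute ranks for a graph given as {page: [pages it links to]}."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

Page C has the most backlinks and ranks highest. Page A has a single backlink, but it comes from the highly ranked C, so A still outranks B and D, which have no backlinks at all.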

PageRank

Open Source Search Engines
- Lucene
- Terrier
- Zettair
- …
- Lucene is the search engine used by DSpace.

Lucene/Solr Architecture (diagram; groupings reconstructed from the flattened labels)
- Request handlers: /admin, /select, /spell, Data Import Handler (SQL/RSS), Extracting Request Handler (PDF/Word, via Apache Tika)
- Update handlers: XML, CSV, binary
- Response writers: XML, binary, JSON
- Search components: query, spelling, faceting, highlighting, statistics, debug, More Like This, clustering, distributed search
- Update processors: signature, logging, indexing
- Core services: schema, config, query parsing, caching, faceting, filtering, search, highlighting, analysis, index replication
- Apache Lucene: core search (IndexReader/Searcher), indexing (IndexWriter), text analysis

Evaluation Background
- The user has an information need.
- The information need is converted into a query.
- Documents are relevant or non-relevant.
- The ideal system retrieves all and only the relevant documents.

Set Based Metrics
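The body of this slide did not survive in the transcript. The usual set-based metrics are precision, recall and the F-measure, so the sketch below assumes those are the ones meant: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and F1 is their harmonic mean. The document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical run: 4 documents retrieved, 6 relevant overall, 2 in common.
p, r, f = precision_recall_f1([1, 2, 3, 4], [2, 4, 5, 6, 7, 8])
print(p)  # 0.5
```

The ideal system, which retrieves all and only the relevant documents, scores 1.0 on all three.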

Evaluation Forums
- TREC, CLEF, NTCIR

References
- Manning, Raghavan, Schütze. Introduction to Information Retrieval.
- Lucene in Action. Manning Publications.