Information Retrieval

Presentation transcript:

Information Retrieval Ugochukwu Chimbo EJIKEME

Structured vs. Unstructured Data: Unstructured data is corporate information not stored in the database. In general, structure can refer to: * The structure of the data itself. * The structure of the container that hosts the data. * The structure of the access method used to access the data.

Information Retrieval Systems (IRS) Information-retrieval systems are used to store and query textual data such as documents. They use a simpler data model than do database systems. Traditional examples of information-retrieval systems are online library catalogs and online document-management systems such as those that store newspaper articles.

Characteristics of IRS Documents are typically described by a set of keywords. Information in the database is organized simply as a collection of unstructured documents. An IRS places less emphasis on transactional requirements than a database system does.

Relevance Ranking: Using terms (keywords): ranking using TF-IDF; similarity-based retrieval. Using hyperlinks (Web): popularity ranking (prestige ranking); PageRank; combining TF-IDF and popularity ranking measures.

Ranking using TF-IDF Term Frequency (TF): relevance of a document (d) to a term (t). For “multiple keyword” queries, sum the term frequencies over the n query terms: $\sum_{i=1}^{n} \mathrm{TF}(d, t_i)$

Inverse Document Frequency (IDF) Query: “Facebook Ugo”? Relevance therefore combines term frequency with inverse document frequency. Proximity: the closer the query words are to each other in the document, the higher the rank.
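As a hedged sketch of how term frequency and inverse document frequency can combine into one score: the slide's own formula is not preserved in this transcript, so the code below assumes one common textbook formulation, TF(d, t) = log(1 + n(d, t) / n(d)), with each query term weighted by 1 / n(t), where n(t) is the number of documents containing t. The function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def term_frequency(doc_tokens, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)); an assumed textbook-style definition."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    return math.log(1 + counts[term] / len(doc_tokens))

def relevance(doc_tokens, query_terms, corpus):
    """Sum TF(d, t) / n(t) over the query terms (assumed IDF weight: 1 / n(t))."""
    score = 0.0
    for term in query_terms:
        # n(t): number of documents in the corpus that contain the term.
        n_t = sum(1 for d in corpus if term in d)
        if n_t:
            score += term_frequency(doc_tokens, term) / n_t
    return score

# Toy corpus (hypothetical documents, already tokenized).
corpus = [
    "facebook ugo profile page".split(),
    "facebook news feed facebook ads".split(),
    "facebook login help".split(),
]
query = ["facebook", "ugo"]
scores = [relevance(d, query, corpus) for d in corpus]
best = max(range(len(corpus)), key=lambda i: scores[i])
```

In this toy corpus the first document ranks highest because it is the only one containing the rarer term, which the 1 / n(t) weight rewards.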

Similarity-Based Retrieval Retrieve documents similar to a given document. Similarity may be defined on the basis of common terms, for example with the cosine similarity metric. Relevance feedback: start a new search based on user feedback on a prior search.
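A minimal sketch of the cosine similarity metric, assuming each document is represented by a vector of raw term counts (a real system would more likely use TF-IDF weights). The function name and sample token lists are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc_a = "web search engine".split()
doc_b = "web search ranking".split()
similarity = cosine_similarity(doc_a, doc_b)
```

The score ranges from 0 (no shared terms) to 1 (identical term distributions), so documents can be ranked by their similarity to a reference document.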

Hyperlink Popularity Ranking Rank “popular” documents higher among the set of documents containing the specified keywords. Determining “popularity”: Access rate? It is hard to get accurate data. Bookmarks? These might be private. Links to related pages? A web crawler can be used to analyze external links.

Transfer of prestige: a link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.

PageRank A measure of the popularity of a page based on the popularity of the pages that link to it. Understanding PageRank: in the random walk model, the PageRank of a page is the probability that a random walker is visiting that page at any given point in time. Drawback: it does not take query keywords into account.
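The random walk model can be sketched with a simple iterative computation. This is an illustrative version, not the slide author's method: the damping factor of 0.85 (the probability of following a link rather than jumping to a random page), the handling of pages without outgoing links, and the page names are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The walker follows one of the outgoing links at random.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph: x and z link to y, and y links back to x.
ranks = pagerank({"x": ["y"], "z": ["y"], "y": ["x"]})
```

Here y ends up with the highest rank because two pages link to it, and x outranks z because its only incoming link comes from the popular page y, which mirrors the transfer-of-prestige idea.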

Other Measures of Popularity Click fraction: the search engine provides an indirect link through the search engine site, which records the page click and transparently redirects the browser to the original link. Anchor text + PageRank. Anchor text + PageRank + TF-IDF measures.

The HITS algorithm: Hubs and Authorities. Computes popularity using a set of related pages only. Hub: a page that stores links to many related pages (it may not itself contain actual information on a topic). Authority: a page that contains actual information on a topic (it may not store links to many related pages). Each page gets a prestige value as a hub (hub-prestige) and another prestige value as an authority (authority-prestige).
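A hedged sketch of the mutual reinforcement between hub-prestige and authority-prestige, assuming the set of related pages has already been selected. The iteration count, the Euclidean normalization, and the page names are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, auth) dicts."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of hub-prestige of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub-prestige: sum of authority-prestige of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: "portal" and "blog" only link out; "a" and "b" only receive links.
hub, auth = hits({"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []})
```

In this example "portal" becomes the strongest hub and "a" the strongest authority, while pages with no outgoing links get a hub-prestige of zero.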

Search Engine Spamming: the practice of creating Web pages, or sets of Web pages, designed to get a high relevance rank for some queries, even though the sites are not actually popular sites.

Synonyms, Homonyms, and Ontologies Synonyms: define alternative words for keywords, e.g. class room <==> (class or lecture) room. Homonyms: single words with multiple meanings. Concept-based querying: analyze each document to disambiguate each word in the document, and replace it with the concept that it represents; disambiguation is usually done by looking at other surrounding words in the document.
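Synonym handling can be sketched as query expansion: each keyword becomes a group of acceptable alternatives, and a document matches when it contains at least one alternative from every group. The synonym table, the function names, and the crude substring matching are all assumptions made to keep the example short.

```python
# Hypothetical synonym table; a real system would use a curated thesaurus.
SYNONYMS = {
    "classroom": {"class room", "lecture room"},
}

def expand_query(keywords, synonyms):
    """Turn each keyword into a set containing it and its synonyms."""
    return [{kw} | synonyms.get(kw, set()) for kw in keywords]

def matches(doc_text, expanded_query):
    """AND across keywords, OR across the alternatives of each keyword."""
    text = doc_text.lower()
    return all(any(alt in text for alt in group) for group in expanded_query)

expanded = expand_query(["classroom", "schedule"], SYNONYMS)
found = matches("Lecture room schedule for the fall term", expanded)
```

This does nothing about homonyms: resolving them needs the context-based disambiguation described under concept-based querying.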

Ontologies are hierarchical structures that reflect relationships between concepts. Common relationships include is-a, part-of, etc.

Indexing of Documents An inverted index maps each keyword Ki to a list Si of the documents that contain Ki. Example entries: Document 1 (d1): 56, 89, 201; Document 2 (d2): 12, 18, 19; Document 3 (d3): 5. Inverted index = “d1/56,89,201; d2/12,18,19; d3/5”. *May also include the term frequency in documents.
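A minimal sketch of building and querying an inverted index. Storing the word positions for each document is an assumption (the slide does not say what its numbers denote), and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document id to its text; returns keyword -> {doc id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

def search(index, keywords):
    """Return the ids of documents that contain every keyword (an AND query)."""
    postings = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "web information retrieval",
}
index = build_inverted_index(docs)
result = search(index, ["information", "retrieval"])
```

The length of each position list gives the term's count in that document, which is one way an index can also carry term-frequency data.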

Measuring Retrieval Effectiveness Keywords are maintained in a compressed form (to keep space usage of the index low). The index is sometimes stored such that retrieval is approximate: a few relevant documents may not be retrieved (called a false drop or false negative), or a few irrelevant documents may be retrieved (called a false positive).

Measurement metrics Precision: measures the percentage of retrieved documents that are relevant to a given query. Recall: measures the percentage of the documents relevant to the query that were retrieved.
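The two metrics can be illustrated with a short calculation over sets of document ids. The function name and the sample ids are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
```

Here d2 and d4 are false positives, which lower precision to 2/4, and d5 is a false negative (false drop), which lowers recall to 2/3.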

Beyond Page Ranking: Information Extraction and Question Answering. Information extraction converts information from textual form to a more structured form. Sample application: Google Scholar. A question answering system attempts to provide direct answers to questions posed by users.

Summary Information-retrieval systems are used to store and query textual data such as documents. Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords. Relevance ranking makes use of several types of information, such as: ◦ Term frequency: how important each term is to each document. ◦ Inverse document frequency. ◦ Popularity ranking.

Search engine spamming attempts to get (an undeserved) high ranking for a page. Synonyms and homonyms complicate the task of information retrieval. Concept-based querying aims at finding documents containing specified concepts, regardless of the exact words (or language) in which the concept is specified. Ontologies are used to relate concepts using relationships such as is-a or part-of. Inverted indices are used to answer keyword queries. Precision and recall are two measures of the effectiveness of an information retrieval system. Techniques have been developed to extract structured information from textual data and to give direct answers to simple questions posed in natural language.