Computer communication B: Information retrieval. Repetition, retrieval models, wildcards, web information retrieval, digital libraries.

Presentation transcript:

Computer communication B
Information retrieval
- Repetition
- Retrieval models
- Wildcards
- Web information retrieval
- Digital libraries

IR
Information retrieval started as bibliography retrieval, developed into full-text term retrieval over a dataset, and was finally extended to web information retrieval.
An information retrieval system analyses the contents of the information sources and the users' queries, and matches the two to retrieve the relevant items.
Components:
- The document subsystem
- The indexing subsystem
- The searching subsystem
- The matching subsystem
The searching subsystem is one of the fundamental parts of an information retrieval system.

IR searching models
Searching models can be seen as searching strategies:
- Boolean search model
- Probabilistic retrieval model
- Vector processing model

The Boolean search model
IR systems use Boolean logic so that users can express their search with the operators below. George Boole devised a system of symbolic logic built on three operators:
- The logical sum + (OR) specifies alternatives between (or among) search terms
- The logical product × (AND) requires that two concepts occur together
- The logical difference - (NOT) excludes terms from the search

The Boolean operators
- The logical sum + (OR): house OR castle
- The logical product × (AND): house AND castle
- The logical difference - (NOT): house NOT castle
Boolean operators can be visualized with so-called Venn diagrams.
[Figure: Venn diagrams for AND, OR and NOT over the sets House and Castle]
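The three operators map directly onto set operations. The sketch below is only an illustration: the toy index and its document IDs are invented for the example, and each term is assumed to map to the set of documents that contain it.

```python
# Hypothetical toy index: each term maps to the set of document IDs containing it.
index = {
    "house": {1, 2, 3},
    "castle": {2, 3, 4},
}

house = index["house"]
castle = index["castle"]

either = house | castle      # house OR castle: documents with at least one term
both = house & castle        # house AND castle: documents with both terms
only_house = house - castle  # house NOT castle: documents with house but not castle

print(sorted(either), sorted(both), sorted(only_house))
```

Each result corresponds to one of the shaded regions in a Venn diagram of the two sets.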

The Boolean model: pro and contra
- It is an easy search model
- Despite its simplicity, users often fail to use the three Boolean operators effectively, especially for more complicated queries
- The search is sometimes not very precise: it can return too many items if the query is too broad, or too few if the query is too strict (with the risk of missing important items)
- Boolean search does not permit ranking, i.e. the retrieved items are not ordered by importance

Boolean search: example
Catalogue of the RUG library
- There are index terms: Boole indexed as an author is distinct from boolean or Boole in the title index
- The three Boolean operators are used
- Wildcards are integrated (see later)
NE/DB=1/

Probabilistic retrieval models
These tackle the last problem outlined for the Boolean model: probabilistic models try to rank the documents found in order of decreasing probability of usefulness or relevance to the user.

Vector space models
- Documents are characterized and evaluated according to their index terms
- Each document is identified with a vector
- The dimensions of the vector are the index terms, so a document's vector can have many dimensions
- The value for an index term is the number of times that term appears in the document (sometimes the value is 0)
- A metric for the similarity between two documents is the cosine of the angle between their vectors
- Searches are likewise interpreted as vectors
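A minimal sketch of the vector space model described above, assuming raw term counts as vector values and whitespace tokenization. The helper names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a sparse vector: index term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents; the query is treated as a vector as well.
documents = {
    "d1": "house castle house",
    "d2": "castle garden",
    "d3": "river bridge",
}
doc_vectors = {name: term_vector(text) for name, text in documents.items()}
query_vector = term_vector("house castle")

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranking)
```

Unlike the Boolean model, this yields an ordering of the documents rather than a plain match or no-match decision.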

Vector space models

Evaluation of a search
- Precision: how many of the documents found are relevant to the search?
- Recall: how many of the relevant documents are found by the search?
- Fall-out: how many of the irrelevant documents are found by the search?
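The three measures can be computed as ratios of set sizes. This is a sketch under the usual definitions (precision = relevant found / all found, recall = relevant found / all relevant, fall-out = irrelevant found / all irrelevant); the function name and the sample collection are assumptions made for the example.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, fall_out) for one search."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    relevant_found = retrieved & relevant
    irrelevant = collection - relevant
    irrelevant_found = retrieved - relevant

    precision = len(relevant_found) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_found) / len(relevant) if relevant else 0.0
    fall_out = len(irrelevant_found) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical collection of ten documents, four of them relevant.
collection = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}

precision, recall, fall_out = evaluate(retrieved, relevant, collection)
print(precision, recall, fall_out)
```

Here two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six irrelevant documents was retrieved (fall-out 1/6).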

Wildcards 1
Wildcards are characters that can stand for any subset of all possible characters. In other words, they mark unknown subparts of a term.
- Wildcards are usually written as an asterisk *
- The asterisk usually substitutes zero or more unknown characters
- Example: aphas* → aphasiology, aphasia, aphasic, aphasics, aphasiological etc.
Wildcards are an advantage for the user of the system but costly for the system itself:
- The user does not have to run a separate search for each variant
- But the system needs to interpret the term and test (search) all the possible terms stemming from it
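One way a system might expand a trailing wildcard is to look up the prefix in a sorted vocabulary. The sketch below assumes the vocabulary fits in a sorted Python list; the function name and the word list are illustrative only.

```python
import bisect


def expand_trailing_wildcard(pattern, sorted_vocabulary):
    """Return every vocabulary term matching a pattern of the form 'prefix*'."""
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("only a single trailing * is supported in this sketch")
    prefix = pattern[:-1]
    # All terms sharing the prefix sit next to each other in sorted order.
    start = bisect.bisect_left(sorted_vocabulary, prefix)
    matches = []
    for term in sorted_vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Hypothetical vocabulary of index terms.
vocabulary = sorted(
    ["aphasia", "aphasic", "aphasics", "aphasiology", "apple", "sun", "suns", "sunset"]
)

print(expand_trailing_wildcard("aphas*", vocabulary))
print(expand_trailing_wildcard("sun*", vocabulary))
```

Every expanded term then has to be searched in turn, which is the extra work the system takes on so that the user can type a single query.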

Wildcards 2
Wildcard characters usually substitute a group of letters that cannot stand alone as a word but form a word when joined to a specific root.
- sun* → wc: (nothing) = sun; wc: -s = suns; wc: -set = sunset; …
Searching with a wildcard at the beginning of a word or within a word is harder (there are more resulting possibilities).

Wildcards 3: Permuterm index Wildcard
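A minimal sketch of a permuterm index, assuming the textbook construction: each term gets an end marker `$`, every rotation of the marked term is stored, and a query with one wildcard is rotated so that the wildcard ends up at the end. The function names and vocabulary are illustrative, and the linear scan stands in for the sorted structure (for example a B-tree) a real system would use for the prefix lookup.

```python
def rotations(term):
    """All rotations of the term followed by the end marker '$'."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm_index(vocabulary):
    """Map every rotation back to the term it came from."""
    index = {}
    for term in vocabulary:
        for rotation in rotations(term):
            index[rotation] = term
    return index


def wildcard_lookup(pattern, index):
    """Resolve a pattern with exactly one '*', e.g. 'sun*', '*set' or 's*t'."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch supports exactly one * per pattern")
    before, after = pattern.split("*")
    # Rotate 'before*after' to 'after$before*', then do a prefix match.
    key_prefix = after + "$" + before
    return sorted(
        {term for rotation, term in index.items() if rotation.startswith(key_prefix)}
    )


# Hypothetical vocabulary of index terms (assumed not to contain '$').
permuterm = build_permuterm_index(["sun", "suns", "sunset", "set"])

print(wildcard_lookup("sun*", permuterm))
print(wildcard_lookup("*set", permuterm))
print(wildcard_lookup("s*t", permuterm))
```

The rotation turns a leading or in-word wildcard into a trailing one, at the cost of storing one index entry per rotation of every term.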

Web information retrieval
IR was created for bibliography retrieval. Nevertheless, a great deal of information now has to be accessed on the web, and IR addresses this kind of search as well. Traditional and web IR differ in a number of characteristics.

Web information retrieval
1. The web is far larger and more distributed than the traditional set of information sources
2. The web keeps growing
3. The web has different levels of depth for a search
4. The web has different types and formats of documents
5. The quality of documents on the web varies
6. The information on the web changes rapidly
7. Distributed users

Web information retrieval
1. The web is far larger than the traditional set of information sources
   - Not only is the amount of information and documents larger; a retrieval system built for traditional IR also has to deal with a different set of standards (software etc.). In fact the web does not have a “set of standards”
   - As a consequence the search is more difficult

Web information retrieval
2. The web keeps growing
   - The amount of information on the web is growing (and will probably continue to grow)
   - Conventional text retrieval systems should be tested and adapted to work with larger datasets

Web information retrieval
3. The web has different levels of depth for a search
   - The web has two types of access: the freely accessible surface and the “deep” web, accessible only with passwords or special programs. Web IR can reach only the surface information
4. The web has different types and formats of documents
   - Traditional IR works with texts. On the web there are several types of documents (images, sound files etc.). Both indexing and information retrieval are therefore more complex

Web information retrieval
5. The quality of documents on the web varies
   - IR systems are not designed to check the quality of the information resources, so there is no control over quality
6. The information on the web changes rapidly
   - Traditional text retrieval systems are quite static compared with the rapidly changing web. Keeping track of the changes is a challenge
   - Sources often move, and it is difficult to trace them

Web information retrieval
7. Distributed users
   - The builder of a conventional IR system knows approximately who its target users are. The builder of a web information retrieval system has no “typical” user

Search engines
Search engines are a kind of IR system.
- They run a search from search terms, keywords or key sentences
- Most search engines allow the use of Boolean operators (AND, OR, NOT)
- Special programs called “spiders” regularly collect information on web pages
- The search engine finds documents that match the search
- The search engine does not search the web itself for every query; it searches a database built by the spider programs. This database is regularly updated
- There are also many types of search engine for different specialties (Google, Altavista news, Google Images, etc.)
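The split between collecting pages and answering queries can be sketched with a small inverted index. Everything here is an assumption made for illustration: the page texts, the example.org URLs, the function names, and the choice to combine query terms with an implicit AND. No real crawling takes place.

```python
from collections import defaultdict

# Hypothetical pages as a spider might have collected them: URL -> page text.
crawled_pages = {
    "http://example.org/a": "digital libraries and retrieval",
    "http://example.org/b": "web retrieval with search engines",
    "http://example.org/c": "castle photographs",
}


def build_index(pages):
    """Build the database the engine searches: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Answer a query from the index only; the web itself is not consulted."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # implicit AND between query terms
    return results


index = build_index(crawled_pages)
print(sorted(search(index, "retrieval")))
print(sorted(search(index, "web retrieval")))
```

Re-running `build_index` on freshly collected pages corresponds to the regular database update; between updates, queries see only what the spiders last collected.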

Digital libraries
A digital library:
- “must accomplish all essential services of traditional libraries and also exploit the well-known advantage of a digital storage”
- Digital libraries provide access to different information sources in various forms (text, images, audio files etc.)
- Digital libraries give access to a variety of information through different sources:
  - The web
  - E-journals
  - Online databases
  - Remote digital libraries
- Every digital library has a library-user interface

The digital library
[Figure: USER, digital library interface, e-journals, www, online databases, remote digital libraries]

Digital libraries
Some digital libraries:
- Alexandria Digital Library Project

Digital libraries
- Digital libraries use features of IR systems
- Users can browse or search the collections
- Some digital libraries allow searching across a network of digital libraries
- Boolean search is the most widely used model in digital libraries
- The search is by keywords or sentences, with the use of wildcards

Introduction to modern information retrieval (Chowdhury, G.G.)