1
Computer communication B
Information retrieval
Repetition
Retrieval models
Wildcards
Web information retrieval
Digital libraries
2
IR
Information retrieval started as bibliography retrieval, became full-text term retrieval in a dataset, and was finally expanded to web information retrieval.
The information retrieval system analyses the contents of the information sources and the users' queries, and matches the two to retrieve the relevant items.
COMPONENTS
The document subsystem
The indexing subsystem
The searching subsystem
The matching subsystem
The searching subsystem is one of the fundamental parts of an information retrieval system.
3
IR searching models
Searching models can be seen as searching strategies:
Boolean search model
Probabilistic retrieval model
Vector processing model
4
The Boolean search model
IR systems use Boolean logic to let users express their choices with operators.
George Boole initiated a system of symbolic logic formed by three operators:
The logical sum + (OR): allows the user to specify alternatives between (or among) search terms.
The logical product × (AND): allows the user to search for the coincidence of two concepts.
The logical difference – (NOT): allows the user to exclude terms from the search.
5
The Boolean operators
The logical sum + (OR): house OR castle
The logical product × (AND): house AND castle
The logical difference – (NOT): house NOT castle
Boolean operators can be visualized with so-called Venn diagrams.
[Venn diagrams: AND, OR and NOT for the sets House and Castle]
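The three operators map directly onto set operations over the documents that contain each term. The following is a minimal sketch, assuming a toy index whose terms and document numbers are invented for illustration; it is not taken from any particular IR system.

```python
# Hypothetical toy index: each term maps to the set of document ids that contain it.
index = {
    "house": {1, 2, 4},
    "castle": {2, 3},
}

def postings(term):
    """Return the set of documents containing the term (empty if the term is unknown)."""
    return index.get(term, set())

def boolean_or(a, b):
    # Logical sum: documents containing either term.
    return postings(a) | postings(b)

def boolean_and(a, b):
    # Logical product: documents containing both terms.
    return postings(a) & postings(b)

def boolean_not(a, b):
    # Logical difference: documents containing a but not b.
    return postings(a) - postings(b)

print(boolean_or("house", "castle"))   # {1, 2, 3, 4}
print(boolean_and("house", "castle"))  # {2}
print(boolean_not("house", "castle"))  # {1, 4}
```

The results are plain sets, so nothing says which matching document is more important than another; this is the ranking limitation of the Boolean model.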
6
The Boolean model: pro and contra
It is an easy search model.
Despite its simplicity, users are often unable to use the three Boolean operators effectively, especially for more complicated queries.
The search is sometimes not very precise: it can return too many items if the query is too broad, or too few if the query is too strict (with the risk of missing important items).
Boolean search does not permit ranking, i.e. the retrieved items are not ordered by importance.
7
Boolean search: example
Catalogue of the RUG library
There are index terms: Boole indexed as an author is different from Boolean or Boole as a title index term.
The three Boolean operators are used.
Wildcards are integrated (see later).
http://opc.ub.rug.nl/IMPLAND=Y/SRT=YOP/LNG=NE/DB=1/
8
Probabilistic retrieval models
They address the last problem outlined for the Boolean model: probabilistic models try to rank the found documents in order of decreasing probability of usefulness or relevance to the user.
9
Vector space models
Documents are characterized/evaluated according to their index terms.
Each document is identified with a vector.
The dimensions of the vector are the index terms, so a document can have many dimensions.
The value for an index term is the number of times that term appears in the document (sometimes the value is 0).
A metric for the similarity between two documents is the cosine of the angle between their vectors.
Searches are likewise interpreted as vectors.
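The cosine measure described above can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as vector values and whitespace tokenization; the helper names term_vector and cosine_similarity are invented for the example.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a vector of term counts (absent terms count as 0)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

doc = term_vector("house castle")
query = term_vector("house")
print(round(cosine_similarity(doc, query), 3))  # 0.707
```

Because a query is a vector too, documents can be ranked by their cosine with the query, which is something the Boolean model cannot do.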
10
Vector space models
11
Evaluation of a search
Precision: what fraction of the found documents are relevant to the search?
Recall: what fraction of the relevant documents are found by the search?
Fall-out: what fraction of the irrelevant documents are found by the search?
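The three measures can be computed from three sets: the documents found, the relevant documents, and the whole collection. The sketch below assumes that relevance judgements are already available as a set of document ids; the function name evaluate and the sample numbers are invented for illustration.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, fall_out) for one search."""
    hits = retrieved & relevant            # relevant documents that were found
    irrelevant = collection - relevant     # all irrelevant documents in the collection
    false_hits = retrieved - relevant      # irrelevant documents that were found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(false_hits) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fall_out

collection = set(range(1, 11))   # ten documents, numbered 1..10
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}

precision, recall, fall_out = evaluate(retrieved, relevant, collection)
# precision = 2/3, recall = 2/4, fall-out = 1/6
```

Here two of the three found documents are relevant, two of the four relevant documents are found, and one of the six irrelevant documents is found.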
12
Wildcards 1
Wildcards are characters that can stand in for any string of characters. In other words, they are unknown subparts of a term.
Wildcards are usually signaled with an asterisk *, which substitutes zero or more unknown characters.
Example: aphas* → aphasiology, aphasia, aphasic, aphasics, aphasiological, etc.
Wildcards are an advantage for the user but costly for the system itself.
The user does not have to repeat the search for each variant.
But the system needs to interpret the term and test (search) all the possible terms stemming from it.
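The extra work for the system can be made concrete: a trailing wildcard has to be expanded into every indexed term that begins with the given prefix. The following is a minimal sketch, assuming a small hypothetical vocabulary and a hypothetical helper expand_trailing_wildcard; a real system would use a sorted dictionary or tree rather than a linear scan.

```python
# Hypothetical vocabulary of index terms.
vocabulary = ["aphasia", "aphasic", "aphasics", "aphasiology", "aphid",
              "sun", "suns", "sunset"]

def expand_trailing_wildcard(pattern, terms):
    """Expand a pattern such as 'aphas*' into all terms starting with 'aphas'."""
    if not pattern.endswith("*"):
        # No wildcard: the pattern matches only itself, if it is indexed.
        return [pattern] if pattern in terms else []
    prefix = pattern[:-1]
    # The asterisk stands for zero or more characters, so the bare prefix also matches.
    return [t for t in terms if t.startswith(prefix)]

print(expand_trailing_wildcard("aphas*", vocabulary))
# ['aphasia', 'aphasic', 'aphasics', 'aphasiology']
print(expand_trailing_wildcard("sun*", vocabulary))
# ['sun', 'suns', 'sunset']
```

Each expanded term then has to be searched, which is why one wildcard query can cost the system as much as many ordinary queries.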
13
Wildcards 2
Wildcard characters usually substitute a group of letters that cannot stand alone as a word but can form a word when united with a specific root.
Sun* → wc: 0 = sun; wc: -s = suns; wc: -set = sunset; …
Searching with a wildcard at the beginning of a word or within a word is harder (the set of resulting possibilities is larger).
14
Wildcards 3: Permuterm index
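A permuterm index is commonly described as follows: an end marker $ is appended to each term, every rotation of the marked term is stored, and a query with one wildcard is rotated so that the wildcard falls at the end, which turns it into a prefix lookup. The sketch below follows that textbook description; the function names, the vocabulary and the linear scan over the index are assumptions made for illustration, and a real system would keep the rotations in a sorted structure.

```python
def rotations(term):
    """All rotations of the term with an end marker, e.g. 'sun$', 'un$s', 'n$su', '$sun'."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Map every rotation back to the term it came from."""
    index = {}
    for term in terms:
        for rotation in rotations(term):
            index[rotation] = term
    return index

def lookup(pattern, index):
    """Resolve a pattern containing exactly one '*' (leading, trailing or in the middle)."""
    head, tail = pattern.split("*")
    # Rotate the query so the wildcard is last: head*tail becomes tail$head*.
    prefix = tail + "$" + head
    return sorted({term for rotation, term in index.items()
                   if rotation.startswith(prefix)})

permuterm = build_permuterm(["sun", "suns", "sunset", "set", "castle"])
print(lookup("sun*", permuterm))  # ['sun', 'suns', 'sunset']
print(lookup("*set", permuterm))  # ['set', 'sunset']
print(lookup("s*t", permuterm))   # ['set', 'sunset']
```

The price is index size: a term of n letters contributes n + 1 rotations.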
15
Web information retrieval
IR was created for bibliography retrieval. Nevertheless, much information now has to be accessed on the web, and IR addresses this kind of search as well.
Traditional and web IR differ in a number of characteristics.
16
Web information retrieval
1. The web is far more distributed and larger than the traditional set of information sources
2. The web keeps growing
3. The web has different levels of depth for a search
4. The web has different types and formats of documents
5. The quality of documents on the web varies
6. The information on the web changes rapidly
7. The users are distributed
17
Web information retrieval
1. The web is far larger than the traditional set of information sources
   1. Not only is the amount of information and documents larger, but the retrieval system, compared with traditional IR systems, has to deal with a different set of standards (software etc.). In fact, the web does not have a single “set of standards”.
   2. As a consequence, the search is more difficult.
18
Web information retrieval
2. The web keeps growing
   1. The amount of information on the web is growing (and will probably keep growing).
   2. Conventional text retrieval systems have to be tested and adapted to work with larger datasets.
19
Web information retrieval
3. The web has different levels of depth for a search
   1. The web has two types of access: one free, and a “deep” one accessible only with passwords or special programs. WIR can access only the surface information.
4. The web has different types and formats of documents
   1. Traditional IR works with texts. On the web there are several types of documents (images, sound files etc.). Both indexing and retrieval are therefore more complex.
20
Web information retrieval
5. The quality of documents on the web varies
   1. IR systems are not designed to check the quality of the information resources, so there is no control over quality.
6. The information on the web changes rapidly
   1. This differs from traditional text retrieval systems, which are quite static compared with the rapid changes of the web. Keeping track of the rapid changes is a challenge. The sources often move, and it is difficult to trace them.
21
Web information retrieval
7. The users are distributed
   1. The builder of a conventional IR system knows approximately who its target users are. The builder of a web information retrieval system has no “typical” user.
22
Search engines
Search engines are a kind of IR system.
They let users search with keywords or key phrases.
Most search engines allow the use of Boolean operators (AND, OR, NOT).
Special programs called “spiders” regularly collect information on web pages.
The search engine finds documents that match the search.
The search engine does not search the web itself for every query; it searches a database built by the spider programs. This database is regularly updated.
There are also many types of search engines for different specialties (Google, AltaVista News, Google Images, etc.).
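The separation between collecting pages and answering queries can be illustrated with a toy example. In this sketch the "spider" output is simply a dictionary of page texts; the URLs, the page contents and the function names build_index and search are all hypothetical, and multi-word queries are treated as an implicit AND.

```python
from collections import defaultdict

# Hypothetical pages, standing in for what a spider has already collected.
pages = {
    "http://example.org/a": "house and castle",
    "http://example.org/b": "castle walls",
}

def build_index(collected_pages):
    """Build the database that queries run against: term -> set of page URLs."""
    index = defaultdict(set)
    for url, text in collected_pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Answer a query from the stored index only; no page is fetched at query time."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # every query term must be present
    return results

database = build_index(pages)
print(search(database, "castle"))        # both example pages
print(search(database, "house castle"))  # only http://example.org/a
```

A page that changes after the spider's last visit is still answered from the old index entry until the database is next updated.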
23
Digital libraries
A digital library “must accomplish all essential services of traditional libraries and also exploit the well-known advantage of a digital storage”.
Digital libraries provide access to different information sources in various forms (text, images, audio files etc.).
Digital libraries give access to a variety of information via different sources:
The web
E-journals
Online databases
Remote digital libraries
Every digital library has a library-user interface.
24
The digital library
[Diagram labels: USER; Digital library interface; E-journals; www; Online databases; Remote digital libraries]
25
Digital libraries
Some digital libraries:
Alexandria Digital Library Project
http://www.cdlib.org/
http://www.gutemberg.org
http://www.theeuropeanlibrary.org
26
Digital libraries
Digital libraries use features of IR systems.
Users can browse or search the collections.
Some digital libraries allow searching across a network of digital libraries.
Boolean search is the most widely used search model in digital libraries.
Searches use keywords or sentences, with the use of wildcards.
27
Introduction to modern information retrieval (Chowdhury, G.G.)