Published by Brenda Nelson. Modified over 9 years ago.
1
Web- and Multimedia-based Information Systems Lecture 2
2
Vector Model
Non-binary weights
Degree of similarity
Result ranking possible
Fast & good results
3
Vector Model
Document vector with weights for every index term
Query vector with weights for every index term
Vectors have the dimension of the total number of index terms in the collection
4
Documents in Vector Space
[Figure: documents D1 to D11 plotted in a space spanned by the term axes t1, t2 and t3]
5
Vector Model
Position 1 corresponds to term 1, position 2 to term 2, position t to term t
The weight of the term is stored in each position
6
Vector Model
Cosine of the angle between the vectors taken as similarity measure
Sorting/ranking of results
Threshold for results
More precise answer with more relevant docs on the top
7
Similarity Function
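The formula on this slide did not survive extraction. As a minimal sketch, assuming the similarity function is the cosine measure named on the previous slide (dot product of the two weight vectors divided by the product of their lengths), ranking with a threshold could look like this. The function names and the zero-vector convention are illustrative assumptions, not taken from the lecture.

```python
import math

def cosine_similarity(doc_vector, query_vector):
    """Cosine of the angle between two weight vectors of equal dimension."""
    if len(doc_vector) != len(query_vector):
        raise ValueError("vectors need one weight per index term")
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    doc_norm = math.sqrt(sum(d * d for d in doc_vector))
    query_norm = math.sqrt(sum(q * q for q in query_vector))
    if doc_norm == 0 or query_norm == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (doc_norm * query_norm)

def rank(documents, query_vector, threshold=0.0):
    """Return (score, name) pairs above the threshold, best match first."""
    scored = [(cosine_similarity(vector, query_vector), name)
              for name, vector in documents.items()]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
```

With three index terms, `rank({"D1": [1, 0, 0], "D2": [0, 1, 0], "D3": [1, 1, 0]}, [1, 0, 0])` places D1 before D3 and drops D2, whose similarity is zero.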
8
Vector Model: Index Term Weighting
Binary Weights
Raw Term Weights
Term frequency x Inverse document frequency
9
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector
10
Raw Term Weights
The frequency of occurrence for the term in each document is included in the vector
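The two simplest weighting schemes can be sketched side by side. This assumes the document is already split into tokens and that the vocabulary (the ordered list of index terms) is given. Both function names are hypothetical.

```python
from collections import Counter

def binary_weights(tokens, vocabulary):
    """1 if the index term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def raw_term_weights(tokens, vocabulary):
    """Number of occurrences of each index term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Position i of each vector belongs to vocabulary[i].
vocabulary = ["index", "query", "vector"]
tokens = ["vector", "query", "vector"]
binary = binary_weights(tokens, vocabulary)
raw = raw_term_weights(tokens, vocabulary)
```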
11
Term frequency x Inverse document frequency
12
IDF Example
IDF provides high values for rare words and low values for common words
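The exact formula used on the slides is not part of the extracted text. The sketch below assumes one common variant, idf = log(N / n), where N is the number of documents and n the number of documents containing the term, and multiplies it by the raw term frequency. Function names are illustrative.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """Map each term to log(N / n) over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Weight each term of one document by term frequency x idf."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}
```

A term that occurs in every document gets idf = log(1) = 0 under this variant, so it contributes nothing to the ranking, while a term found in a single document gets the highest value.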
13
Probabilistic Model
Based on probability
For every document, a probability is calculated for:
– Document being relevant
– Document being irrelevant to the query
Documents more likely to be relevant than not are ranked in decreasing order of relevance
14
Text Operations in Detail
Goal: automated generation of index terms
All terms conveying meaning vs. space requirements
Rules for extraction from documents
– Rules for division of terms
  Punctuation
  Dashes
– List of stop words
  Articles, prepositions, conjunctions
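A minimal sketch of such extraction rules, assuming that punctuation and dashes always divide terms and that a small hand-written stop word list is sufficient. The list below is illustrative only, and the function name is hypothetical.

```python
import re

# Illustrative stop words: a few articles, prepositions and conjunctions.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or"}

def extract_index_terms(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into terms, drop stop words."""
    # Every run of characters other than letters and digits acts as a
    # separator, so punctuation and dashes divide terms.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

For example, `extract_index_terms("The state-of-the-art index, and the Web.")` keeps `state`, `art`, `index` and `web`.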
15
Word-oriented Reduction Schemes
Lemmatisation
Smaller term lists
Generalization of terms
Methods
– Reduction to the infinitive
– Reduction to a stem
Algorithmic methods for English
German:
– Biggest problems: prefixes & compounds
– Only with dictionaries
  Explicit listing of all forms
  Or rules to derive forms
16
Stemming
Different methods
Most efficient: affix removal
– Porter Algorithm
– Implement later
– Series of rules to strip suffixes
  s -> nil
  sses -> ss
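The two rules on the slide belong to the first plural-stripping step of the Porter algorithm. The sketch below covers only that step (its published rules are sses -> ss, ies -> i, ss -> ss, s -> nil), not the full algorithm. The rule whose suffix matches first in the list wins, so longer suffixes are listed first. The function name is hypothetical.

```python
# (suffix, replacement) pairs, longest suffix first.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The `ss -> ss` rule looks redundant but protects words such as `caress` from losing their final `s`.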
17
Word Type Index Term Selection
Nouns usually convey most meaning
Elimination of other word types
Clustering of compounds (computer science)
– Noun groups
– Maximum distance between terms
18
Thesauri
"Treasury of words"
For every entry
– Definition
– Synonyms
Useful with a specific knowledge domain where a controlled vocabulary can easily be obtained
Difficult with a large and dynamic document collection such as the web
19
Creation of Inverted List
Create vocabulary
Note document and position in document for each term
Sort list (first by terms, then by positions)
Split terms & positions
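These steps can be sketched as follows, assuming whitespace tokenization and word offsets as positions. Both are simplifications, and the function name is hypothetical.

```python
from collections import defaultdict

def build_inverted_list(documents):
    """documents maps a document id to its text.

    Returns the sorted vocabulary and, per term, a list of
    (document id, position) pairs.
    """
    # Note document and position for each term occurrence.
    entries = []
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            entries.append((term, doc_id, position))
    # Sort by term first, then by document and position.
    entries.sort()
    # Split into the vocabulary and the position lists.
    postings = defaultdict(list)
    for term, doc_id, position in entries:
        postings[term].append((doc_id, position))
    return sorted(postings), dict(postings)
```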
20
Basic Query
Terms of the query are isolated
Get pointers to the positions for every term
Conduct set operations
Get result documents and present them
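A sketch of the query steps for a conjunctive (AND) query. It assumes position lists of the form term -> list of (document id, position); the sample data and function names are made up for illustration.

```python
def documents_for(term, postings):
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, _position in postings.get(term, [])}

def boolean_and(query, postings):
    """Ids of the documents that contain every term of the query."""
    terms = query.lower().split()      # isolate the query terms
    if not terms:
        return []
    result = documents_for(terms[0], postings)
    for term in terms[1:]:
        result &= documents_for(term, postings)  # set intersection
    return sorted(result)

# Hypothetical sample index.
sample_postings = {
    "web": [(1, 0), (2, 2)],
    "search": [(1, 1), (2, 0)],
    "engine": [(1, 2)],
}
```

An OR query would use set union (`|=`) in place of the intersection.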
21
Advanced Query Functionality
Comparison operators for metadata
String of multiple terms
More general: take into account distance and order of terms
Truncation (wildcards)
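Distance and order constraints can be checked directly on stored positions, and right-hand truncation can be answered from the vocabulary. The sketch below assumes position lists of the form term -> list of (document id, position); the function names and parameters are hypothetical.

```python
def within_distance(postings, first, second, max_distance, ordered=True):
    """Documents where `second` occurs at most max_distance words from `first`.

    With ordered=True, `second` must come after `first`; a phrase of two
    adjacent terms is the special case max_distance=1.
    """
    hits = set()
    for doc_a, pos_a in postings.get(first, []):
        for doc_b, pos_b in postings.get(second, []):
            if doc_a != doc_b:
                continue
            gap = pos_b - pos_a
            if not ordered:
                gap = abs(gap)
            if 0 < gap <= max_distance:
                hits.add(doc_a)
    return sorted(hits)

def expand_prefix(vocabulary, prefix):
    """Truncation: all vocabulary terms that start with the prefix."""
    return [term for term in vocabulary if term.startswith(prefix)]
```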
22
Information Retrieval System Evaluation
Functionality analysis
Performance
– Time
– Space
Retrieval performance
– Batch vs. interactive mode
23
Retrieval Performance Measures
Recall
– The fraction of the relevant documents that have been retrieved
Precision
– The fraction of the retrieved documents that are relevant
24
Precision vs. Recall
Users usually do not inspect all results
Example: relevant documents R = {d2, d5}
Result ranking returned by the system:
1. d1
2. d5
3. d2
After the second result, recall is 50% and precision is also 50%
After the third result, recall is 100% and precision is 67%
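The figures in this example can be recomputed with a short sketch that evaluates both measures over the first k results of a ranking. The function name is hypothetical.

```python
def precision_recall_at(ranking, relevant, k):
    """Precision and recall over the first k results of a ranking."""
    if k < 1:
        raise ValueError("k must be at least 1")
    retrieved = ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved)  # share of retrieved docs that are relevant
    recall = hits / len(relevant)      # share of relevant docs that were retrieved
    return precision, recall

ranking = ["d1", "d5", "d2"]
relevant = {"d2", "d5"}
after_second = precision_recall_at(ranking, relevant, 2)
after_third = precision_recall_at(ranking, relevant, 3)
```

Here `after_second` is (0.5, 0.5), and `after_third` has precision 2/3 (about 67%) at a recall of 1.0.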
25
Programming Assignment
26
Different part each week
Web Search Engine
27
WWW Search Engine
[Figure: architecture diagram with the components WWW-Client, WWW-Server, Search Engine, Index, Indexer, DB and Robot; arrows labeled Query, Result List, Results, Files, Request and Documents]
28
Assignment Part 1
Program a web robot
Starts at a user-defined URL
Navigates the Web via hypertext links
Speaks HTTP (see RFC 1945)
Stores the path it took (URLs)
– preferably in a tree-like data structure
Stores result code & important header fields for every request to disk in a format suitable for further processing
29
Assignment Part 1 (cont.)
Implementation in Java
Pure TCP socket communications
No need to save documents in this assignment
Robot shall identify itself via the HTTP User-Agent header
Extensibility required for future assignments
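The assignment itself calls for Java. The Python sketch below only illustrates two of the requirements, independent of language: the bytes of an HTTP/1.0 request that carries a User-Agent header, and a tree-like record of the path the robot took. The robot name `ExampleRobot/0.1`, the class `CrawlNode` and all function names are hypothetical, and no network access takes place.

```python
from dataclasses import dataclass, field
from typing import Optional

USER_AGENT = "ExampleRobot/0.1"  # hypothetical robot name

def build_request(host, path="/", user_agent=USER_AGENT):
    """Bytes of an HTTP/1.0 GET request, ready to write to a TCP socket."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",  # not defined in RFC 1945, but needed by many servers
        f"User-Agent: {user_agent}",
        "",
        "",               # an empty line ends the request headers
    ]
    return "\r\n".join(lines).encode("ascii")

@dataclass
class CrawlNode:
    """One visited URL; children are the URLs reached by following its links."""
    url: str
    status: Optional[int] = None
    headers: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, url):
        child = CrawlNode(url)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the URL path from the root to every node, depth first."""
        here = prefix + (self.url,)
        yield here
        for child in self.children:
            yield from child.paths(here)
```

Keeping the status code and headers on each node leaves room to write them to disk later in whatever format the following assignments need.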
30
Example HTTP session
TCP connection:
telnet www 80
HTTP Request:
GET / HTTP/1.0
Response Headers:
HTTP/1.0 200 Document follows
Date: Tue, 10 Sep 1996 14:34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
Content-length: 9755
Start of content
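A sketch of how a robot might pull the result code and header fields out of such a response. The assignment asks for Java; Python is used here only for illustration. It assumes the response head has already been read from the socket into a string with CRLF line endings, and the function name is hypothetical.

```python
# The response head from the example session, as it arrives on the wire.
RESPONSE = (
    "HTTP/1.0 200 Document follows\r\n"
    "Date: Tue, 10 Sep 1996 14:34:06 GMT\r\n"
    "Server: NCSA/1.4.2\r\n"
    "Content-type: image/gif\r\n"
    "Last-modified: Tue, 10 Sep 1996 13:25:26 GMT\r\n"
    "Content-length: 9755\r\n"
    "\r\n"
)

def parse_response_head(raw):
    """Return (status code, reason phrase, headers) of an HTTP response."""
    # The first empty line separates the headers from the content.
    head, _separator, _content = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return int(code), reason, headers

status, reason, headers = parse_response_head(RESPONSE)
```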