Web- and Multimedia-based Information Systems Lecture 2
Vector Model Non-binary Weights Degree of similarity Result ranking possible Fast & Good results
Vector Model Document Vector with weights for every index term Query Vector with weights for every index term Both vectors have one dimension per index term in the collection
Documents in Vector Space [Figure: documents D1 to D11 plotted as points in a space spanned by the term axes t1, t2 and t3]
Vector Model Position 1 corresponds to term 1, position 2 to term 2, position t to term t The weight of the term is stored in each position
Vector Model Cosine of the angle between the vectors taken as similarity measure Sorting/Ranking of results Threshold for results More precise answer with more relevant docs on the top
Similarity Function
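The formula on this slide did not survive extraction. Since the preceding slide names the cosine of the angle between the vectors as the similarity measure, it is presumably sim(d, q) = (d · q) / (|d| |q|). The following is a minimal sketch under that assumption; the function names, the `threshold` parameter and the handling of all-zero vectors are illustrative choices, not taken from the lecture.

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two weight vectors of equal dimension."""
    if len(doc_vec) != len(query_vec):
        raise ValueError("vectors must have the same dimension")
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_d * norm_q)

def rank(docs, query_vec, threshold=0.0):
    """Score every document, drop those at or below the threshold, sort best first."""
    scored = [(name, cosine_similarity(vec, query_vec)) for name, vec in docs.items()]
    kept = [(name, score) for name, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Three documents over a three-term vocabulary (t1, t2, t3), query on t1 only.
docs = {"D1": [1, 0, 0], "D2": [1, 1, 0], "D3": [0, 0, 1]}
print(rank(docs, [1, 0, 0]))  # D1 first, then D2; D3 scores 0 and is dropped
```

Because the score is normalised by both vector lengths, a long document does not win merely by containing more terms.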
Vector Model Index Terms Weighting Binary Weights Raw Term Weights Term frequency x Inverse document frequency
Binary Weights Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights The frequency of occurrence for the term in each document is included in the vector
Term frequency x Inverse document frequency
IDF Example IDF provides high values for rare words and low values for common words
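The formulas and the numeric example on the weighting slides were lost in extraction. The sketch below shows the three weighting schemes side by side. The tf-idf variant used here (term frequency normalised by the most frequent term in the document, idf = ln(N / n_i) with N documents in total and n_i documents containing the term) is one common choice and an assumption, not necessarily the one shown in the lecture. All function names are illustrative.

```python
import math
from collections import Counter

def binary_weights(doc_terms, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc_terms)
    return [1 if term in present else 0 for term in vocabulary]

def raw_weights(doc_terms, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc_terms)
    return [counts[term] for term in vocabulary]

def idf(term, docs):
    """ln(N / n_i); assumption: a term that occurs nowhere gets weight 0."""
    n_i = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_i) if n_i else 0.0

def tf_idf_weights(doc_terms, vocabulary, docs):
    """Normalised term frequency multiplied by inverse document frequency."""
    counts = Counter(doc_terms)
    max_count = max(counts.values()) if counts else 1
    return [(counts[term] / max_count) * idf(term, docs) for term in vocabulary]

docs = [["web", "search", "web"], ["web", "index"], ["web", "robot", "robot"]]
vocabulary = ["index", "robot", "search", "web"]
print(binary_weights(docs[0], vocabulary))  # [0, 0, 1, 1]
print(raw_weights(docs[0], vocabulary))     # [0, 0, 1, 2]
print(idf("web", docs))                     # 0.0, "web" occurs in every document
print(idf("index", docs))                   # ln(3), "index" occurs in one document
```

The term "web" appears in every document and therefore receives an idf of zero, while the rare term "index" receives the highest value, which matches the statement on the IDF slide.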
Probabilistic Model Based on probability theory For every document, two probabilities are estimated: – Probability that the document is relevant to the query – Probability that the document is irrelevant to the query Documents that are more likely relevant than irrelevant are ranked in decreasing order of relevance
Text Operations in Detail Goal: Automated Generation of Index Terms All terms conveying meaning vs. Space requirements Rules for extraction from documents – Rules for division of terms Punctuation Dashes – List of Stop Words Articles, prepositions, conjunctions
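A minimal sketch of such extraction rules. It assumes that punctuation and dashes always separate terms (one possible division rule among several) and uses a deliberately tiny stop word list; `STOP_WORDS` and `extract_terms` are illustrative names.

```python
import re

# Tiny illustrative list of articles, prepositions and conjunctions.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "but"}

def extract_terms(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into terms and drop stop words."""
    # Every run of characters other than letters and digits is a separator,
    # so punctuation, dashes and whitespace all divide terms.
    tokens = re.split(r"[^0-9a-z]+", text.lower())
    return [token for token in tokens if token and token not in stop_words]

print(extract_terms("The state-of-the-art in Web search, and indexing!"))
# ['state', 'art', 'web', 'search', 'indexing']
```

Splitting on dashes turns "state-of-the-art" into separate terms. A rule that keeps hyphenated words together would preserve more meaning at the cost of a larger vocabulary, which is the trade-off named on the slide.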
Word-oriented Reduction Schemes Lemmatisations Smaller term lists Generalization of terms Methods – Reduction to the infinitive – Reduction to a stem Algorithmic Methods for English German: – Biggest Problems: Prefixes & Compositions – Only with dictionaries Explicit listing of all forms Or rules to derive forms
Stemming Different Methods Most efficient: Affix removal – Porter Algorithm – Implement later – Series of rules to strip suffixes: s -> nil, sses -> ss
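The two rules quoted on the slide belong to the first step (step 1a) of the Porter algorithm, which also contains the rules ies -> i and ss -> ss. The sketch below implements only that step, not the full stemmer; the function name `strip_plural` is illustrative.

```python
def strip_plural(word):
    """Apply the first matching rule of Porter's step 1a.

    Rules are tried longest suffix first, so "sses" wins over "ss" and "s".
    The rule ss -> ss exists only to protect words such as "caress" from
    the final s -> nil rule.
    """
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats", "index"]:
    print(word, "->", strip_plural(word))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat, index -> index
```

The result need not be a dictionary word ("poni"); a stem only has to be the same for all forms of a term.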
Word Type Index Term Selection Nouns usually convey most meaning Elimination of other word types Clustering of compounds (computer science) – Noun groups – Maximum distance between terms
Thesauri "Treasury of words" For every entry – Definition – Synonyms Useful with a specific knowledge domain where a controlled vocabulary can easily be obtained Difficult with a large and dynamic document collection such as the web
Creation of Inverted List Create Vocabulary Note document, position in Document for each term Sort List (first by terms, then by positions) Split Terms & Positions
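The steps above can be sketched as follows. The sketch assumes that a position is the word offset within the document (a character offset would work the same way) and that documents are given as a mapping from id to text; `build_inverted_list` is an illustrative name.

```python
import re
from collections import defaultdict

def build_inverted_list(docs):
    """docs maps a document id to its text; returns (vocabulary, postings)."""
    # Note document and position for each term occurrence.
    occurrences = []
    for doc_id, text in docs.items():
        for position, term in enumerate(re.findall(r"[0-9a-z]+", text.lower())):
            occurrences.append((term, doc_id, position))
    # Sort first by term, then by document and position.
    occurrences.sort()
    # Split into the sorted vocabulary and a position list per term.
    postings = defaultdict(list)
    for term, doc_id, position in occurrences:
        postings[term].append((doc_id, position))
    return sorted(postings), dict(postings)

vocabulary, postings = build_inverted_list({1: "web search engine", 2: "web robot"})
print(vocabulary)       # ['engine', 'robot', 'search', 'web']
print(postings["web"])  # [(1, 0), (2, 0)]
```

Keeping the vocabulary separate from the position lists means that a lookup only has to search the small, sorted vocabulary before following a pointer to the positions.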
Basic Query Terms of the query isolated Get pointer to positions for every term Conduct Set Operations Get result documents and present
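A minimal sketch of this query procedure. It assumes an index that maps each term to a list of (document id, position) pairs and supports only AND and OR over all query terms; the function name and the `operator` parameter are illustrative.

```python
def search(index, query, operator="AND"):
    """Return the set of document ids matching the query terms."""
    # Isolate the terms of the query.
    terms = query.lower().split()
    # Follow the pointer to the positions of every term, keeping the document ids.
    doc_sets = [{doc_id for doc_id, _ in index.get(term, ())} for term in terms]
    if not doc_sets:
        return set()
    # Combine the per-term document sets with a set operation.
    if operator == "AND":
        return set.intersection(*doc_sets)
    if operator == "OR":
        return set.union(*doc_sets)
    raise ValueError("operator must be 'AND' or 'OR'")

index = {"web": [(1, 0), (2, 0)], "search": [(1, 1)], "robot": [(2, 1)]}
print(search(index, "web search"))          # {1}
print(search(index, "search robot", "OR"))  # {1, 2}
```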
Advanced Query Functionality Comparison Operators for Metadata String of multiple terms More general: take into account distance and order of terms Truncation (Wildcards)
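Two of these functions can be sketched on top of a positional index. The sketch assumes word positions, an index mapping terms to (document id, position) pairs, and truncation at the end of a term only; both function names are illustrative.

```python
def within_distance(index, first, second, max_distance=1):
    """Documents in which `second` follows `first` by at most max_distance words.

    With max_distance=1 this is an exact two-word phrase query.
    """
    hits = set()
    for doc_a, pos_a in index.get(first, ()):
        for doc_b, pos_b in index.get(second, ()):
            if doc_a == doc_b and 0 < pos_b - pos_a <= max_distance:
                hits.add(doc_a)
    return hits

def expand_truncation(vocabulary, pattern):
    """Expand a right-truncated pattern such as 'comput*' to vocabulary terms."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(term for term in vocabulary if term.startswith(prefix))
    return [pattern] if pattern in vocabulary else []

index = {"computer": [(1, 0), (2, 3)], "science": [(1, 1), (2, 0)]}
print(within_distance(index, "computer", "science"))     # {1}
print(expand_truncation(["compute", "computer", "science"], "comput*"))
# ['compute', 'computer']
```

Both functions rely on the positions stored in the inverted list, which is why a plain term-to-document mapping is not enough for this kind of query.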
Information Retrieval System Evaluation Functionality Analysis Performance – Time – Space Retrieval Performance – Batch vs. Interactive mode
Retrieval Performance Measures Recall – The fraction of the relevant documents that have been retrieved Precision – The fraction of the retrieved documents that are relevant
Precision vs. Recall Users usually do not inspect all results Example: Relevant documents R={d2, d5} Result ranking returned by system: 1. d1 2. d5 3. d2 After the second result, recall is 50% and precision is also 50% After the third result, recall is 100% and precision is 67%
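The figures in this example can be reproduced by computing both measures after each position in the ranking. A minimal sketch, with an illustrative function name:

```python
def precision_recall_at_ranks(ranking, relevant):
    """Return a list of (rank, recall, precision) for every prefix of the ranking."""
    results = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        recall = found / len(relevant)  # share of the relevant documents seen so far
        precision = found / rank        # share of the inspected documents that are relevant
        results.append((rank, recall, precision))
    return results

for rank, recall, precision in precision_recall_at_ranks(["d1", "d5", "d2"], {"d2", "d5"}):
    print(f"rank {rank}: recall {recall:.0%}, precision {precision:.0%}")
# rank 1: recall 0%, precision 0%
# rank 2: recall 50%, precision 50%
# rank 3: recall 100%, precision 67%
```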
Programming Assignment
Different part each week Web Search Engine
WWW Search Engine [Figure: architecture diagram with the components WWW-Client, WWW-Server, Search Engine, Indexer, Robot, DB and Index, connected by arrows labelled Query, Results, Result List, Request, Files and Documents]
Assignment Part 1 Program a web robot Starts at a user-defined URL Navigates the Web via Hypertext links Speaks HTTP (see RFC1945) Stores the path it took (URLs) – preferably in a tree-like data structure Stores result code & important header fields for every request to disk in a format suitable for further processing
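One possible shape for the link navigation and the tree-like path storage, sketched in Python purely for illustration. The class names `LinkExtractor` and `CrawlNode` are hypothetical, and only `<a href>` links are followed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page.
                    self.links.append(urljoin(self.base_url, value))

class CrawlNode:
    """One visited URL; its children are the URLs reached from it."""

    def __init__(self, url, status=None):
        self.url = url
        self.status = status  # HTTP result code of the request, if known
        self.children = []

    def add_child(self, url, status=None):
        child = CrawlNode(url, status)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the path from the root to every node, depth first."""
        here = prefix + (self.url,)
        yield here
        for child in self.children:
            yield from child.paths(here)

parser = LinkExtractor("http://example.org/a/index.html")
parser.feed('<a href="b.html">B</a> <A HREF="/c">C</A> <a name="x">anchor</a>')
print(parser.links)  # ['http://example.org/a/b.html', 'http://example.org/c']
```

A tree records which page led to which, so the path to any visited URL can be read off from the root; a real robot would also need a set of already visited URLs to avoid loops.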
Assignment Part 1 (cont.) Implementation in Java Pure TCP socket communications No need to save documents in this assignment Robot shall identify itself via HTTP User-Agent header Extensibility required for future assignments
Example HTTP session
TCP connection: telnet www 80
HTTP Request: GET / HTTP/1.0
Response Headers:
HTTP/ Document follows
Date: Tue, 10 Sep :34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep :25:26 GMT
Content-length: 9755
Start of content
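The exchange in the session above can be sketched over a plain TCP socket. The assignment asks for Java; Python is used here only to keep the illustration short, and the same steps map onto `java.net.Socket`. The function names and the robot name `LectureRobot/0.1` are hypothetical, and the parser assumes CRLF line endings and a well-formed status line.

```python
import socket

def build_request(path, user_agent):
    """Build an HTTP/1.0 GET request that identifies the robot via User-Agent."""
    return f"GET {path} HTTP/1.0\r\nUser-Agent: {user_agent}\r\n\r\n".encode("ascii")

def parse_response_head(raw):
    """Split raw response bytes into (status code, reason phrase, header dict)."""
    # The empty line separates the response head from the start of the content.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, *reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), (reason[0] if reason else ""), headers

def fetch_head(host, path="/", port=80, user_agent="LectureRobot/0.1"):
    """Open a TCP connection, send the request and parse the response head."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(path, user_agent))
        data = b""
        while b"\r\n\r\n" not in data:  # read until the head is complete
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_response_head(data)

# Parsing a made-up response (not the truncated one from the slide):
sample = b"HTTP/1.0 200 OK\r\nServer: NCSA/1.4.2\r\nContent-type: image/gif\r\n\r\nGIF89a"
print(parse_response_head(sample))
# (200, 'OK', {'server': 'NCSA/1.4.2', 'content-type': 'image/gif'})
```

The result code and the header dictionary returned by `parse_response_head` are the values the robot is asked to store for every request.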