HYPERGEO 1st technical verification
ARISTOTLE UNIVERSITY OF THESSALONIKI
Baseline Document Retrieval Component
N. Bassiou, C. Kotropoulos, I. Pitas
20/07/2000, Page 1

Introduction
– Information Retrieval: development of algorithms and models for retrieving information from document repositories (speech, image, video)
– Ad-hoc retrieval problem: the user submits a query describing the desired information
– The system returns a list of documents, selected by exact match or ranked by their estimated relevance to the query
– Relevance Feedback
– Text Categorization

Common design features of IR Systems
– Techniques introduced by Robertson and S. Jones:
  » use of simple terms for indexing both request and document texts
  » term weighting that exploits statistical information about term occurrences
  » scoring for request-document matching, using these weights
  » modification of these weights or of the term sets in iterative searching

Common design features of IR Systems (cont.)
– Techniques introduced by Robertson and S. Jones (cont.)
  » Normal implementation: an inverted file organization, using a term list with linked document identifiers plus counting data, and pointers to the actual text
Basic Features:
– Terms and matching:
  » stemmed content words are the terms used for indexing
  » stop words are excluded
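
The inverted file organization and the term handling described above can be sketched as follows. This is a minimal illustration, not the HyperGeo implementation: the stop list, the toy stemmer `simple_stem`, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def simple_stem(word):
    """Toy stemmer (assumption): strips a trailing plural 's'. Not the Porter algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, stem what remains."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return [simple_stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_file(docs):
    """Map each term to {document identifier: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "d1": "The museum of the city",
    "d2": "Hotels and museums in the town",
}
index = build_inverted_file(docs)
print(index["museum"])
```

Each entry of the term list holds the linked document identifiers and the counting data (here, occurrence counts); a pointer to the actual text could be stored alongside each identifier.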

Basic Features (cont.):
– Weights = selectivity
– Weighting Measures:
  a. Collection Frequency:
     n: the number of documents term t(i) occurs in
     N: the number of documents in the collection
  b. Term Frequency: a term that occurs more often in a document is more likely to be important for that document
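
The formula for the collection frequency weight did not survive on this slide. A minimal sketch, assuming the form published by Robertson and Sparck Jones, CFW(i) = log N - log n(i), where N is the number of documents in the collection and n(i) is the number of documents containing term t(i). The function names and the example figures are hypothetical.

```python
import math


def collection_frequency_weight(n_docs_with_term, n_docs_total):
    """CFW(i) = log N - log n(i); assumed form, rarer terms get higher weight."""
    if not 0 < n_docs_with_term <= n_docs_total:
        raise ValueError("need 0 < n(i) <= N")
    return math.log(n_docs_total) - math.log(n_docs_with_term)


def term_frequency(term, doc_terms):
    """TF(i,j): number of occurrences of a term in a document's term list."""
    return doc_terms.count(term)


# Hypothetical collection of 1000 documents.
rare = collection_frequency_weight(10, 1000)
common = collection_frequency_weight(500, 1000)
print(rare > common)
```

A term that appears in every document gets weight 0 under this form, which matches the idea that weights express selectivity.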

Basic Features (cont.):
– Weighting Measures (cont.):
  c. Document Length: used to qualify Term Frequency (the same Term Frequency in a short document and in a long one indicates that the term is more valuable for the short one)
  d. Combined Weight: combination of the weight measures described above
     k1 (= 2): controls the extent of Term Frequency’s influence
     b (= 0.75): controls the extent of Document Length’s influence
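
The combined weight formula itself is missing from the slide. A minimal sketch, assuming the Robertson and Sparck Jones form CW(i,j) = CFW(i) * TF(i,j) * (k1 + 1) / (k1 * ((1 - b) + b * NDL(j)) + TF(i,j)), where NDL(j) is the length of document j divided by the average document length. The function name and the sample values are hypothetical; the defaults k1 = 2 and b = 0.75 are the values given on the slide.

```python
def combined_weight(cfw, tf, dl, avg_dl, k1=2.0, b=0.75):
    """Assumed combined weight of a term in a document.

    cfw    -- collection frequency weight of the term
    tf     -- term frequency of the term in the document
    dl     -- length of the document
    avg_dl -- average document length in the collection
    """
    ndl = dl / avg_dl  # normalized document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)


# Same term frequency in a short and in a long document (hypothetical values).
short_doc = combined_weight(cfw=2.0, tf=3, dl=50, avg_dl=100)
long_doc = combined_weight(cfw=2.0, tf=3, dl=400, avg_dl=100)
print(short_doc > long_doc)
```

Setting b = 0 removes the document length effect entirely, and a larger k1 lets the term frequency influence the weight more strongly.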

Implementation of IR Component in HyperGeo Corpus
– Based on all the statistical measures described above
– Basic Characteristics:
  First Part: Training
  Calculation of all the necessary statistics for each document in the corpus and for each term appearing in these documents:
  1. Term-dependent measures (CFW(i))
  2. Document-dependent measures (DL(j))
  3. Term-document-dependent measures (TF(i,j), CW(i,j))
  4. Storage of the statistics in files

– Basic Characteristics (cont.):
  Second Part: Document Retrieval
  1. The user supplies the query terms
  2. Stemming of the query terms (Simple and Porter Stemmer)
  3. Lookup of each query term in the structure that holds the term-document combined weights
  4. Document score calculation: the sum of the combined weights of all the query terms in the specific document
  5. Document Ranking: the criterion is chosen by the user
     a. according to the estimated score of each document
     b. according to i) the number of query terms that appear in each document and ii) its estimated score
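
Steps 3 to 5 of the retrieval stage can be sketched as follows. This is an illustration under stated assumptions: the query terms are taken to be already stemmed, `cw_index` is a hypothetical in-memory structure mapping each term to its per-document combined weights, and the sample weights are invented.

```python
def score_documents(query_terms, cw_index):
    """Return {doc_id: (number of matching query terms, summed combined weight)}."""
    results = {}
    for term in set(query_terms):
        for doc_id, weight in cw_index.get(term, {}).items():
            matched, score = results.get(doc_id, (0, 0.0))
            results[doc_id] = (matched + 1, score + weight)
    return results


def rank(results, by_match_count=False):
    """Rank by score alone (option a) or by match count, then score (option b)."""
    if by_match_count:
        key = lambda item: (-item[1][0], -item[1][1], item[0])
    else:
        key = lambda item: (-item[1][1], item[0])
    return [doc_id for doc_id, _ in sorted(results.items(), key=key)]


# Hypothetical combined weights: term -> {doc_id: CW}.
cw_index = {
    "museum": {"d1": 1.0, "d2": 3.0},
    "art": {"d1": 1.5},
}
results = score_documents(["museum", "art"], cw_index)
print(rank(results))                       # ranking option a
print(rank(results, by_match_count=True))  # ranking option b
```

The two options can disagree: here d2 has the higher score from a single term, while d1 contains both query terms, so option b places d1 first.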

Output
– Output files:
  » TermFrequency file
  » Combined Weight file
  » Idf file (number and names of the documents each term occurs in)
  » QueryResult file (contains the ranked documents returned by the query)

Results
– Frequencies of the first 20 terms of the corpus

  museum   2050    collect  766    home    582    book   534
  hotel    1348    citi     758    page    573    build  483
  room      781    town     653    hous    556    new    481
  open      779    art      650    servic  548    place  479
  centuri   775    reserv   591    year    548    work   477

– Number of documents the first 20 terms occur in

  museum   298    includ  238    open    230    collect  217
  centuri  263    room    235    hous    229    hotel    215
  year     258    offer   234    place   229    new      214
  citi     251    inform  232    build   226    visit    206
  time     247    locat   230    servic  225    dai      203

Recall-Precision Graph for the query “museum”
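
The points of a recall-precision graph can be computed from a ranked result list and a set of relevance judgments. A minimal sketch, assuming one point is recorded at each rank where a relevant document is retrieved; the function name and the sample data are hypothetical and do not reproduce the graph for the query “museum”.

```python
def recall_precision_points(ranked_doc_ids, relevant_doc_ids):
    """Return (recall, precision) pairs, one per relevant document retrieved."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # recall: share of relevant documents found so far
            # precision: share of retrieved documents that are relevant
            points.append((hits / len(relevant), hits / position))
    return points


# Hypothetical ranking and relevance judgments.
points = recall_precision_points(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(points)
```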

Future Development
– Iterative Searching
  » Relevance Weighting: modification of the request term weights
  » Query Expansion: modification of the request composition by adding more terms (with reweighting of the original terms)
  » Probabilistic Approaches
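
The slide does not say which relevance weighting would be adopted. As one possible illustration, the sketch below assumes the relevance weight published by Robertson and Sparck Jones, RW(i) = log[((r + 0.5)(N - n - R + r + 0.5)) / ((n - r + 0.5)(R - r + 0.5))], where N is the collection size, n the number of documents containing the term, R the number of documents judged relevant, and r the number of relevant documents containing the term. The function name and the sample counts are hypothetical.

```python
import math


def relevance_weight(r, R, n, N):
    """Assumed relevance weight of a term after relevance feedback."""
    if not (0 <= r <= R and r <= n and N - n - R + r >= 0):
        raise ValueError("inconsistent counts")
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical feedback: 5 documents judged relevant in a collection of 100;
# the term occurs in 10 documents overall.
seen_in_relevant = relevance_weight(r=4, R=5, n=10, N=100)
not_seen_in_relevant = relevance_weight(r=0, R=5, n=10, N=100)
print(seen_in_relevant > not_seen_in_relevant)
```

Under this form, a term that occurs mostly in documents judged relevant gains weight, and a term absent from them loses weight, which is the request term reweighting that iterative searching relies on.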