HYPERGEO 1st technical verification, ARISTOTLE UNIVERSITY OF THESSALONIKI
Baseline Document Retrieval Component
N. Bassiou, C. Kotropoulos, I. Pitas
20/07/2000



Introduction
– Information Retrieval: development of algorithms and models for retrieving information from document repositories (speech, image, video)
– Ad-hoc retrieval problem: the user submits a query describing the desired information
– Return a list of documents: exact match, or ranking according to their estimated relevance to the query
– Relevance Feedback
– Text Categorization

Common design features of IR Systems
– Techniques introduced by Robertson and Sparck Jones:
  use of simple terms for indexing both request and document texts
  term weighting exploiting statistical information about term occurrences
  scoring for request-document matching, using these weights
  relevance feedback to modify request weights or term sets in iterative searching

Common design features of IR Systems (cont.)
– Techniques introduced by Robertson and Sparck Jones (cont.)
  Normal implementation: an inverted file organization, using a term list with linked document identifiers plus counting data, and pointers to the actual text
  Basic Features:
  – Terms and matching:
    » stemmed content words → terms used for indexing
    » stop words are excluded
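
The inverted file organization described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the HyperGeo implementation: the stop list, the toy `simple_stem` function, and the sample documents are all assumptions made for the example, and the counting data kept here is just the per-document term frequency.

```python
from collections import defaultdict

# Hypothetical stop list (assumption for illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

def simple_stem(word):
    """Toy stemmer: strip one trailing 's'. A stand-in, not the Porter algorithm."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, stem what remains."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [simple_stem(w) for w in words if w and w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical two-document corpus.
docs = {
    "d1": "The museum is open in the city.",
    "d2": "Hotels and museums of the old town.",
}
index = build_inverted_index(docs)
```

Here the document identifiers double as pointers to the text; a fuller system would store file offsets or paths alongside the counts.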

Basic Features (cont.):
– Weights = selectivity
– Weighting Measures:
  a. Collection Frequency:
     n : the number of documents term t(i) occurs in
     N : the number of documents in the collection
  b. Term Frequency: a term that occurs more often in a document is more likely to be important for that document
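
The formula for the collection frequency weight did not survive in this transcript. The sketch below assumes the usual Robertson and Sparck Jones form, the logarithm of the collection size minus the logarithm of the number of documents containing the term; the natural logarithm, the function name, and the counts in the usage lines are assumptions.

```python
import math

def collection_frequency_weight(docs_with_term, docs_in_collection):
    """Assumed form: log(collection size) - log(documents containing the term)."""
    if not 0 < docs_with_term <= docs_in_collection:
        raise ValueError("need 0 < docs_with_term <= docs_in_collection")
    return math.log(docs_in_collection) - math.log(docs_with_term)

# Hypothetical counts: a rare term is more selective than a common one.
rare = collection_frequency_weight(5, 1000)
common = collection_frequency_weight(500, 1000)
```

A term that occurs in every document gets weight zero, which matches the idea that weights express selectivity.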

Basic Features (cont.):
– Weighting Measures (cont.):
  c. Document Length: used when assessing Term Frequency (the same Term Frequency in a short document and in a long one indicates that the term is more valuable for the short one)
  d. Combined Weight: combination of the weighting measures described above
     k1 (=2) : controls the extent of Term Frequency's influence
     b (=0.75) : controls the extent of Document Length's influence
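
The combined weight formula itself is missing from this transcript. The sketch below assumes the Robertson and Sparck Jones combined weight, which uses the same two tuning constants: CW = CFW * TF * (k1 + 1) / (k1 * ((1 - b) + b * NDL) + TF), where NDL is the document length divided by the average document length. The function and argument names are hypothetical.

```python
def combined_weight(cfw, tf, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """Assumed combined weight of one term in one document.

    cfw: collection frequency weight of the term
    tf: occurrences of the term in the document
    doc_len, avg_doc_len: this document's length and the corpus average
    """
    ndl = doc_len / avg_doc_len  # normalized document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)

# Same term frequency, different lengths (hypothetical numbers).
in_short_doc = combined_weight(cfw=1.0, tf=2, doc_len=50, avg_doc_len=100)
in_long_doc = combined_weight(cfw=1.0, tf=2, doc_len=200, avg_doc_len=100)
```

With b = 0 the length normalization is switched off; with b = 1 the term frequency is fully scaled by document length.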

Implementation of IR Component in HyperGeo Corpus
– Based on all the statistical measures described above
– Basic Characteristics:
  First Part: Training → calculation of all the necessary statistics for each document in the corpus and for each term appearing in these documents
  1. Term-dependent measures (CFW(i))
  2. Document-dependent measures (DL(j))
  3. Term-document-dependent measures (TF(i,j), CW(i,j))
  4. Storage of the statistics in files
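
A training pass of this kind might look like the sketch below. It is an illustration under assumptions, not the HyperGeo code: documents are taken to be already stemmed and stop-word filtered, DL(j) is counted in terms, and JSON is an arbitrary choice of file format. The raw document counts per term are stored so that CFW(i) and CW(i,j) can be derived from them afterwards.

```python
import json
from collections import Counter

def train(tokenized_docs):
    """tokenized_docs: {doc_id: [stemmed terms]}. Returns the raw statistics."""
    doc_len = {d: len(terms) for d, terms in tokenized_docs.items()}       # DL(j)
    tf = {d: dict(Counter(terms)) for d, terms in tokenized_docs.items()}  # TF(i,j)
    # Number of documents each term occurs in (input to CFW(i)).
    doc_freq = Counter(t for terms in tokenized_docs.values() for t in set(terms))
    return {"doc_len": doc_len, "tf": tf, "doc_freq": dict(doc_freq)}

def save_stats(stats, path):
    """Write the statistics to a JSON file (hypothetical storage format)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(stats, fh, indent=2, sort_keys=True)

# Hypothetical two-document corpus of stemmed terms.
stats = train({"d1": ["museum", "museum", "citi"], "d2": ["hotel", "citi"]})
```

Calling `save_stats(stats, "stats.json")` would then persist the result for the retrieval stage.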

– Basic Characteristics (cont.):
  Second Part: Document Retrieval
  1. Query terms are given by the user
  2. Stemming of the query terms (Simple and Porter Stemmer)
  3. Lookup of each query term in the structure that holds the term-document combined weights
  4. Document score calculation: sum of the combined weights of all the query terms in the specific document
  5. Document ranking, in an order chosen by the user:
     a. according to the documents' estimated score
     b. according to i) the number of query terms that appear in the document and ii) its estimated score
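
Steps 3 to 5 can be sketched as follows. This is a minimal illustration that assumes the query terms are already stemmed and that the combined weights are held in a nested dictionary; the function name, the `by_match_count` flag, and the sample weights are hypothetical. Repeated query terms are counted once, which is a design choice of this sketch rather than something the slides specify.

```python
def retrieve(query_terms, weights, by_match_count=False):
    """Score and rank documents.

    weights: {term: {doc_id: combined weight}}
    Returns [(doc_id, score), ...] in ranked order.
    """
    scores, matches = {}, {}
    for term in set(query_terms):
        for doc_id, cw in weights.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + cw
            matches[doc_id] = matches.get(doc_id, 0) + 1
    if by_match_count:
        # Option b: more matching query terms first, then higher score.
        key = lambda d: (-matches[d], -scores[d], d)
    else:
        # Option a: higher score first.
        key = lambda d: (-scores[d], d)
    return [(d, scores[d]) for d in sorted(scores, key=key)]

# Hypothetical combined weights.
weights = {
    "museum": {"d1": 3.0, "d2": 1.0},
    "hotel": {"d2": 1.5},
    "art": {"d3": 5.0},
}
by_score = retrieve(["museum", "hotel", "art"], weights)
by_matches = retrieve(["museum", "hotel", "art"], weights, by_match_count=True)
```

In this example d2 matches two query terms, so option b ranks it first even though d3 has the highest score.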

Output
– Output files: TermFrequency file, Combined Weight file, Idf file (number and names of the documents each term occurs in), QueryResult file (contains the ranked documents returned by the query)

Results
– Frequencies of the first 20 terms of the corpus

  museum   2050   collect  766   home    582   book   534
  hotel    1348   citi     758   page    573   build  483
  room      781   town     653   hous    556   new    481
  open      779   art      650   servic  548   place  479
  centuri   775   reserv   591   year    548   work   477

– Number of documents the first 20 terms occur in

  museum   298   includ  238   open    230   collect  217
  centuri  263   room    235   hous    229   hotel    215
  year     258   offer   234   place   229   new      214
  citi     251   inform  232   build   226   visit    206
  time     247   locat   230   servic  225   dai      203

[Figure: Recall-Precision graph for the query “museum”]
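
The points of a recall-precision graph are usually measured down the ranked list, one point per relevant document found. The sketch below assumes relevance judgments are available as a set of document identifiers; the function name and the sample ranking are hypothetical and do not reproduce the HyperGeo results.

```python
def precision_recall_points(ranked_doc_ids, relevant_ids):
    """Return (recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

# Hypothetical ranking and judgments.
points = precision_recall_points(["d3", "d1", "d7", "d2"], {"d3", "d2"})
```

Plotting recall on the horizontal axis against precision on the vertical axis gives the graph.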

Future Development
– Iterative Searching
  Relevance Weighting: modification of the request term weights
  Query Expansion: modification of the request composition by adding more terms (reweighting of original terms)
  Probabilistic Approaches
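
The slides do not say which relevance weighting scheme is planned. One widely used candidate is the Robertson and Sparck Jones relevance weight, sketched here as an assumption: RW = log(((r + 0.5)(N - n - R + r + 0.5)) / ((n - r + 0.5)(R - r + 0.5))), where R is the number of documents judged relevant so far and r is how many of them contain the term. The function name and the counts are hypothetical.

```python
import math

def relevance_weight(r, R, n, N):
    """Assumed Robertson/Sparck Jones relevance weight.

    r: judged-relevant documents containing the term
    R: judged-relevant documents in total
    n: documents in the collection containing the term
    N: documents in the collection
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)

# Hypothetical feedback: one term found in every relevant document,
# another common term found in none of them.
good_term = relevance_weight(r=5, R=5, n=10, N=1000)
poor_term = relevance_weight(r=0, R=5, n=500, N=1000)
```

After each round of feedback the query terms can be reweighted with such values, and for query expansion the highest-weighted terms from the relevant documents can be added to the request.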