Genetic Learning for Information Retrieval Andrew Trotman Computer Science 365 * 24 * 60 / 40 = 13,140.

Presentation transcript:


Genetic Learning
The Core Algorithm
–Crossover, Mutation, Reproduction
–Fitness proportionate selection
Genetic Algorithms
–Chromosome is an array
Genetic Programming
–Chromosome is an abstract syntax tree
[Figure: crossover of array chromosomes, {A B C D E F} X { }]
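The core algorithm on this slide can be sketched as a small genetic algorithm over array chromosomes. This is a minimal illustration, not the code used in the talk: the function names, the real-valued genes in [0, 1], the single-point crossover, the elitist copy of the best chromosome and all default parameters are assumptions made for the example.

```python
import random

def select(population, fitnesses, rng):
    """Fitness proportionate (roulette-wheel) selection.
    Assumes every fitness value is non-negative."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chromosome
    return population[-1]

def crossover(a, b, rng):
    """Single-point crossover of two equal-length array chromosomes."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Replace each gene with a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else gene for gene in chromosome]

def evolve(fitness, length=6, pop_size=20, generations=25,
           mutation_rate=0.05, seed=0):
    """Run the GA and return the best chromosome seen."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        next_gen = [best[:]]  # reproduction: carry the best forward (elitism, an assumption)
        while len(next_gen) < pop_size:
            mum = select(population, fitnesses, rng)
            dad = select(population, fitnesses, rng)
            next_gen.append(mutate(crossover(mum, dad, rng), mutation_rate, rng))
        population = next_gen
        best = max(population + [best], key=fitness)
    return best

# Toy fitness: the sum of the genes, so fitter chromosomes have larger genes.
best = evolve(sum)
```

Because the best chromosome is copied into each new generation, the returned fitness can never fall below that of the initial population.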

Information Retrieval (Text)
Online Systems
–Dialog, LexisNexis, etc.
Web Systems
–Alta Vista, Excite, Google, etc.
Scientific Literature Systems
–CiteSeer, PubMed, BioMedNet, etc.
Question:
–How should scientific literature be ranked?
Less time searching / More time researching
Higher exposure for “good” work

How Google Works
PageRank
–Document ranking from PageRank
–A document’s PageRank is some factor (d) of the rank of incoming citations
–A document’s influence is some factor of its rank and its outgoing citations
Characteristics of Scientific Literature
–Citations unidirectional (backwards in time)
–12 month publication cycle
–Scientific citation “cliques”
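The PageRank description above can be illustrated with a short power-iteration sketch over a citation graph. This is the commonly published formulation, not necessarily what any production system runs: the function name, the damping value d = 0.85, the fixed iteration count and the handling of documents with no outgoing citations are assumptions for the example.

```python
def pagerank(citations, d=0.85, iterations=50):
    """citations maps each document to the list of documents it cites.
    Returns a dict of document -> rank; the ranks sum to 1."""
    docs = set(citations) | {c for cited in citations.values() for c in cited}
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - d) / n for doc in docs}
        for doc in docs:
            cited = citations.get(doc, [])
            if cited:
                # A document passes a d-scaled share of its rank to each citation.
                share = d * rank[doc] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # No outgoing citations: spread the rank evenly (an assumption).
                for target in docs:
                    new_rank[target] += d * rank[doc] / n
        rank = new_rank
    return rank

# Citations point backwards in time: C cites A and B, B cites A.
ranks = pagerank({"C": ["A", "B"], "B": ["A"], "A": []})
```

In this toy graph the oldest paper, A, collects the most rank because citations only flow backwards in time, which is one of the characteristics of scientific literature listed on the slide.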

How IR works
Indexing
–Build the dictionary
–Construct the Postings ( pairs)
Searching
–Look up terms in dictionary
–Boolean resolution
–Rank on density (probability, vector space, etc.)
Performance
–Recall and precision
Example
–Record1: Of Otago
–Record2: Otago University
–Record3: Otago
–Record4: Of
[Figure: dictionary entries OF, OTAGO and UNIVERSITY, each pointing to its postings]
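The indexing and searching steps can be sketched with the four example records from the slide. The contents of a posting pair are not preserved in the transcript, so the (record id, word position) pairs used here are an assumption, as are the function names.

```python
from collections import defaultdict

# The four example records from the slide.
records = {1: "Of Otago", 2: "Otago University", 3: "Otago", 4: "Of"}

def build_index(records):
    """Build the dictionary (the keys) and the postings (the values).
    Each posting is a hypothetical (record id, word position) pair."""
    index = defaultdict(list)
    for doc_id in sorted(records):
        for position, term in enumerate(records[doc_id].upper().split()):
            index[term].append((doc_id, position))
    return dict(index)

def boolean_and(index, terms):
    """Boolean resolution: ids of records containing every query term."""
    doc_sets = [{doc for doc, _ in index.get(t.upper(), [])} for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

index = build_index(records)
matches = boolean_and(index, ["of", "otago"])
```

Ranking would then order the matching records; this sketch stops at Boolean resolution.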

Structured-IR
Sci-Lit documents have structure
–Title, abstract, conclusions, etc.
[Figure: a marked-up example document with the text "1 University of Otago New Zealand 3 University of Otago top 2 New Zealand sailing" becomes a numbered structure tree with nodes doc:1 rank:7 sport:6 cntry:5 place:3 docid:2 name:4]

Using Structure in Ranking
Documents have structure
–Title, Abstract, Conclusions, etc.
Weight each structure on “importance”
–Title higher than Abstract higher than …
How to choose the weights
–Specified in the query (XIRQL)
–Query feedback
–Learn with a Genetic Algorithm
Adapt ranking model to use structure
Each tree node is a locus
Weights are genes
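One way to picture structure weighting is a score in which every occurrence of a query term is multiplied by the weight of the structure it appears in. The sketch below uses raw term counts as a stand-in for the probabilistic and vector space models the talk adapts. The function name, the example document and the weight values are hypothetical; in the approach described above a genetic algorithm would supply the weights.

```python
def weighted_score(doc, query_terms, weights):
    """doc maps a structure name (title, abstract, ...) to its text.
    weights maps a structure name to its learned weight (one gene per structure)."""
    score = 0.0
    for structure, text in doc.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(term.lower()) for term in query_terms)
        # Structures without a weight default to 1.0 (an assumption).
        score += weights.get(structure, 1.0) * hits
    return score

doc = {
    "title": "genetic learning",
    "abstract": "learning ranking weights with a genetic algorithm",
}
weights = {"title": 3.0, "abstract": 1.0}  # hypothetical gene values
score = weighted_score(doc, ["genetic"], weights)
```

Here the title hit counts three times as much as the abstract hit, so the single query term scores 4.0 instead of 2.0.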

Experiment
–50 training queries
–50 evaluation queries
–25 generations
–Probabilistic IR
–Vector Space IR
Results
PROBABILISTIC IR
–75.5% queries improved
–6.7% increase in MAP (8.8% max)
VECTOR SPACE IR
–61% queries improved
–4.7% increase in MAP (5.4% max)
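The results are reported as changes in MAP (mean average precision). As a reminder of what that measures, here is a minimal sketch of the standard definition. The function names and the toy rankings are illustrative only and are not data from the experiment.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked result list, set of relevant documents) pairs,
    one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries, not taken from the experiment.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```

A ranking function that moves relevant documents closer to the top of each list raises this score, which is what the fitness function of the GA would reward.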

Ranking Algorithms
Multitude exist
–Probability, vector space, Boolean
–Several published nomenclatures
Over 100,000 “published” algorithms
Purpose
–Put relevant documents first
–Sorting
–Performance measures with precision
Sources
–Some guy thought it up

Experiment
–50 training queries
–50 evaluation queries
–31 runs
–Weekend time limit
–Compare to Probabilistic
Results
–67% queries improved
–15% increase in MAP

Function Comparison
Vector Space: [formula not preserved in the transcript]
Probability: [formula not preserved in the transcript]
Learned:
\[
w_{dq} = \sum_{t \in q} (((((((((U / \sqrt{\sqrt{n_t}}) / (m_q / \sqrt{(((L_q / (\sqrt{\sqrt{L_d}} / \sqrt{(U / n_c)})) \cdot \min(m_q, N)) / \sqrt{((((((T_{\max} / \sqrt{U}) / \sqrt{(((\log_2(\sqrt{n_t}) / \sqrt{n_t}) / \sqrt{U_{\max}}) / (M / n_c))}) / \sqrt{(U / n_c)}) - u_q) / m_q) / \sqrt{n_t})})})) / \sqrt{(\log(T_{\max}) / n_c)}) / \sqrt{n_t}) / \sqrt{n_t}) / \sqrt{(L_q / \sqrt{((\sqrt{(\sqrt{\sqrt{L_d}} / \sqrt{(\min(m_q, \sqrt{(((\log(T_{\max}) / n_c) / \sqrt{U_{\max}}) / (m_q / \sqrt{((N \cdot \min((\sqrt{n_c} / \sqrt{U}), L_d)) / \sqrt{N})}))}) / \sqrt{L_d})})} / \sqrt{(T_{\max} / n_c)}) / \sqrt{n_t})})}) / \sqrt{(\min(m_q, N) / n_c)}) / \sqrt{(\log(T_{\max}) / n_c)}) / \sqrt{n_t})
\]
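A learned function of this kind is a genetic programming chromosome, that is, an abstract syntax tree. The sketch below shows one way such a tree could be represented and evaluated. The nested-tuple encoding, the `evaluate` name and the protected operators (division by zero and the logarithm of a non-positive value return 0, the square root takes an absolute value) are assumptions for illustration; the transcript does not say how the original system handled these cases.

```python
import math

def evaluate(node, variables):
    """Evaluate an expression tree. Leaves are variable names or numbers;
    internal nodes are tuples of the form (operator, child, ...)."""
    if isinstance(node, str):
        return float(variables[node])
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    values = [evaluate(child, variables) for child in children]
    if op == "/":
        # Protected division (an assumption): avoid ZeroDivisionError.
        return values[0] / values[1] if values[1] != 0 else 0.0
    if op == "*":
        return values[0] * values[1]
    if op == "-":
        return values[0] - values[1]
    if op == "sqrt":
        return math.sqrt(abs(values[0]))
    if op == "log":
        return math.log(values[0]) if values[0] > 0 else 0.0
    if op == "min":
        return min(values)
    raise ValueError(f"unknown operator: {op}")

# The innermost term of the learned function: U / sqrt(sqrt(n_t)).
tree = ("/", "U", ("sqrt", ("sqrt", "n_t")))
value = evaluate(tree, {"U": 8, "n_t": 16})
```

Crossover in genetic programming swaps subtrees between two such trees, and mutation replaces a subtree with a randomly generated one.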

Conclusions
–Using document structure improved ranking
–Structure weights can be learned with a GA
–GP can be used to learn ranking functions
Speculation
–Combining GA and GP to learn a structure ranking algorithm will outperform either GA or GP alone

Questions?

Random Numbers
–Are your results an artifact of your random number generator?