Genetic Learning for Information Retrieval
Andrew Trotman
Computer Science
Genetic Learning

The core algorithm
–Crossover, mutation, reproduction
–Fitness-proportionate selection

Genetic Algorithms
–Chromosome is an array, e.g. {A B C D E F}
Genetic Programming
–Chromosome is an abstract syntax tree
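The core loop above can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the experiments: the toy fitness function (sum of genes) and all parameter values are assumptions.

```python
import random

def roulette_select(population, fitnesses):
    """Fitness-proportionate (roulette-wheel) selection: an individual's
    chance of being picked is its fitness divided by the total fitness."""
    pick = random.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]

def crossover(a, b):
    """Single-point crossover on array chromosomes."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chrom, rate=0.1):
    """Replace each gene with a fresh random value with probability `rate`."""
    return [random.random() if random.random() < rate else g for g in chrom]

def evolve(population, fitness_fn, generations=25):
    for _ in range(generations):
        fits = [fitness_fn(c) for c in population]
        population = [
            mutate(crossover(roulette_select(population, fits),
                             roulette_select(population, fits)))
            for _ in population
        ]
    return max(population, key=fitness_fn)

# Toy problem: maximise the sum of six genes in [0, 1).
best = evolve([[random.random() for _ in range(6)] for _ in range(20)],
              fitness_fn=sum)
```

For Genetic Programming the same loop applies, but crossover swaps subtrees of the abstract syntax tree instead of array slices.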
Information Retrieval (Text)

Online systems
–Dialog, LexisNexis, etc.
Web systems
–AltaVista, Excite, Google, etc.
Scientific literature systems
–CiteSeer, PubMed, BioMedNet, etc.

Question:
–How should scientific literature be ranked?
–Less time searching / more time researching
–Higher exposure for “good” work
How Google Works

PageRank
–Document ranking from PageRank
–A document’s PageRank is some factor (d) of the rank of its incoming citations
–A document’s influence is some factor of its rank and its outgoing citations

Characteristics of scientific literature
–Citations are unidirectional (backwards in time)
–12-month publication cycle
–Scientific citation “cliques”
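The PageRank recurrence described above can be sketched with power iteration. This is a minimal illustration of the idea, not Google's implementation; the three-document citation graph is an invented example (C cites A and B, B cites A, mimicking backwards-in-time citations).

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank.  `links` maps each document to the documents it
    cites.  A document's rank is (1 - d) plus d times the rank flowing in
    from citing documents, each share divided by the citer's out-degree."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = d * rank[p] / len(outgoing)
                for q in outgoing:
                    new[q] += share
        rank = new
    return rank

# Backwards-in-time citations: the oldest paper A accumulates the most rank.
ranks = pagerank({"A": [], "B": ["A"], "C": ["A", "B"]})
```

Because scientific citations only point backwards in time, rank drains towards older papers, which is one reason plain PageRank is a poor fit for scientific literature.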
How IR Works

Indexing
–Build the dictionary
–Construct the postings (⟨document, term-frequency⟩ pairs)
Searching
–Look up terms in the dictionary
–Boolean resolution
–Rank on term density (probabilistic, vector space, etc.)
Performance
–Recall and precision

Example — Record 1: “Of Otago”, Record 2: “Otago University”, Record 3: “Otago”, Record 4: “Of”
[Figure: dictionary entries OF, OTAGO, UNIVERSITY, each pointing to its postings list]
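The indexing and searching steps above can be sketched directly on the slide's four example records. This is a minimal sketch: the density ranking here is a raw term-frequency sum, standing in for the probabilistic and vector-space models named on the slide.

```python
from collections import defaultdict

# The four records from the example.
records = {1: "of otago", 2: "otago university", 3: "otago", 4: "of"}

# Indexing: a dictionary of terms, each with a postings list of
# (document id, term frequency) pairs.
postings = defaultdict(list)
for docid, text in records.items():
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        postings[term].append((docid, tf))

def search(query):
    """Look each query term up in the dictionary, then rank the matching
    documents on term density (here: summed term frequency)."""
    scores = defaultdict(int)
    for term in query.split():
        for docid, tf in postings.get(term, []):
            scores[docid] += tf
    return sorted(scores, key=scores.get, reverse=True)

results = search("otago university")  # record 2 matches both terms
```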
Structured IR

Sci-lit documents have structure
–Title, abstract, conclusions, etc.

[Figure: an XML document — name “University of Otago”, place “New Zealand”, sport “sailing”, rank “top” — becomes a tree with numbered nodes: doc:1, docid:2, place:3, name:4, cntry:5, sport:6, rank:7]
Using Structure in Ranking

Documents have structure
–Title, abstract, conclusions, etc.
Weight each structure by “importance”
–Title higher than abstract, higher than …
How to choose the weights
–Specified in the query (XIRQL)
–Query feedback
–Learn them with a Genetic Algorithm

Adapt the ranking model to use structure
–Each tree node is a locus
–Weights are genes
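The locus/gene encoding above can be sketched as follows. This is an assumed illustration, not the experimental system: the field names, example document, and weight values are invented, and the scoring is a simple weighted term count standing in for the real ranking model.

```python
# Chromosome: one weight (gene) per document structure (locus).
STRUCTURES = ["title", "abstract", "conclusions"]

def score(doc, query_terms, chromosome):
    """Score a document: term matches in each structure, scaled by that
    structure's learned weight."""
    total = 0.0
    for weight, field in zip(chromosome, STRUCTURES):
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in query_terms)
    return total

doc = {"title": "genetic learning",
       "abstract": "learning to rank",
       "conclusions": "structure helps ranking"}

# A chromosome weighting title above abstract above conclusions.
weights = [3.0, 2.0, 1.0]
s = score(doc, ["learning"], weights)
```

The GA's fitness function for a chromosome would be a retrieval measure (e.g. mean average precision) over the training queries when ranking with these weights.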
Experiment

50 training queries, 50 evaluation queries, 25 generations
Probabilistic IR and Vector Space IR

Results
Probabilistic IR: 75.5% of queries improved; 6.7% increase in MAP (8.8% max)
Vector Space IR: 61% of queries improved; 4.7% increase in MAP (5.4% max)
Ranking Algorithms

A multitude exist
–Probabilistic, vector space, Boolean
–Several published nomenclatures
–Over 100,000 “published” algorithms
Purpose
–Put relevant documents first
–Sorting
–Performance measured with precision
Sources
–Some guy thought it up
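As one concrete instance of the family above, here is a sketch of the classic vector-space tf·idf inner product. This is an illustrative baseline, not one of the slide's specific published algorithms; the document-frequency counts are assumed values.

```python
import math

def tfidf_score(query_terms, doc_terms, df, num_docs):
    """Vector-space inner product: sum over query terms of
    tf(t, d) * idf(t), where idf(t) = log(N / df(t))."""
    score = 0.0
    for t in set(query_terms):
        tf = doc_terms.count(t)
        if tf and df.get(t):
            score += tf * math.log(num_docs / df[t])
    return score

# Assumed collection of 100 documents: "genetic" is rare, "the" is everywhere,
# so only the rare term contributes to the score.
df = {"genetic": 5, "the": 100}
s = tfidf_score(["genetic", "the"], ["genetic", "learning", "the"], df, 100)
```

A GP system searches the space of such formulas directly, composing statistics like tf, df, and document length into new ranking functions.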
Experiment

50 training queries, 50 evaluation queries, 31 runs
Weekend time limit
Compared to Probabilistic IR

Results
67% of queries improved
15% increase in MAP
Function Comparison

Comparing the Vector Space, Probability, and Learned ranking functions. The learned function:

w_{d,q} = Σ_{t∈q} (((((((((U / sqrt(sqrt(n_t))) / (m_q / sqrt((((L_q / (sqrt(sqrt(L_d)) / sqrt((U / n_c)))) * min(m_q, N)) / sqrt(((((((T_max / sqrt(U)) / sqrt((((log₂(sqrt(n_t)) / sqrt(n_t)) / sqrt(U_max)) / (M / n_c)))) / sqrt((U / n_c))) − u_q) / m_q) / sqrt(n_t))))))) / sqrt((log(T_max) / n_c))) / sqrt(n_t)) / sqrt(n_t)) / sqrt((L_q / sqrt(((sqrt((sqrt(sqrt(L_d)) / sqrt((min(m_q, sqrt((((log(T_max) / n_c) / sqrt(U_max)) / (m_q / sqrt(((N * min((sqrt(n_c) / sqrt(U)), L_d)) / sqrt(N))))))) / sqrt(L_d))))) / sqrt((T_max / n_c))) / sqrt(n_t)))))) / sqrt((min(m_q, N) / n_c))) / sqrt((log(T_max) / n_c))) / sqrt(n_t))
Conclusions

Using document structure improved ranking
Structure weights can be learned with a GA
GP can be used to learn ranking functions

Speculation
–Combining GA and GP to learn a structured ranking algorithm will outperform either alone
Questions?
Random Numbers

Are your results an artifact of your random number generator?
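One minimal way to probe the question above is to repeat the whole experiment under several independent seeds (or several generators) and look at the spread of the results. This is an assumed sketch: `noisy_experiment` is a stand-in for a real GA/GP run.

```python
import random

def noisy_experiment(rng):
    """Stand-in for a learning run whose outcome depends on the random
    stream; replace with the real GA/GP experiment."""
    return sum(rng.random() for _ in range(1000)) / 1000

# Re-run under independent seeds.  If the conclusion only holds for one
# seed, it may be an artifact of the random number generator.
results = [noisy_experiment(random.Random(seed)) for seed in range(5)]
spread = max(results) - min(results)
```

Swapping in a different generator entirely (e.g. `random.SystemRandom()`) and checking that the conclusions survive is a stronger version of the same test.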