Genetic Learning for Information Retrieval Andrew Trotman Computer Science 365 * 24 * 60 / 40 = 13,140.


1 Genetic Learning for Information Retrieval Andrew Trotman Computer Science 365 * 24 * 60 / 40 = 13,140

2 Genetic Learning
The Core Algorithm
Crossover, Mutation, Reproduction
Fitness proportionate selection
Genetic Algorithms
Chromosome is an array
Genetic Programming
Chromosome is an abstract syntax tree
[Figure: crossover (X) between two array chromosomes, {A B C D E F} and {1 2 3 4 5 6}]
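The core loop named on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names, the population size, the mutation rate, the real-valued genes, and the use of elitism as the "reproduction" step are all assumptions made for the example.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Fitness proportionate selection: choose an individual with
    probability fitness / total fitness (fitnesses must be non-negative)."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if fitness > 0 and pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off

def crossover(a, b, rng):
    """Single-point crossover of two equal-length array chromosomes."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Replace each gene with a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else g for g in chromosome]

def evolve(fitness, length=6, pop_size=20, generations=25, rate=0.05, seed=0):
    """Run a small GA and return the fittest chromosome found."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        next_pop = [best]  # reproduction: copy the best individual unchanged (assumption)
        while len(next_pop) < pop_size:
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            next_pop.append(mutate(crossover(a, b, rng), rate, rng))
        population = next_pop
        best = max(population, key=fitness)
    return best
```

Calling `evolve(sum)` maximises the sum of the genes; any function that maps a chromosome to a non-negative score could be passed instead.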

3 Information Retrieval (Text)
Online Systems
–Dialog, LexisNexis, etc.
Web Systems
–Alta Vista, Excite, Google, etc.
Scientific Literature Systems
–CiteSeer, PubMed, BioMedNet, etc.
Question:
–How should scientific literature be ranked?
Less time searching / More time researching
Higher exposure for “good” work

4 How Google Works
PageRank
–Document ranking from PageRank
–A document’s PageRank is some factor (d) of the rank of incoming citations
–A document’s influence is some factor of its rank and its outgoing citations
Characteristics of Scientific Literature
–Citations unidirectional (backwards in time)
–12 month publication cycle
–Scientific citation “cliques”
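The PageRank description above can be illustrated with the commonly published iterative formulation, in which each document shares a fraction d of its rank equally among the documents it cites. This is a sketch only: the value d = 0.85, the iteration count, and the handling of documents with no outgoing citations are assumptions, not details given on the slide.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank.

    `links` maps each document to the list of documents it cites.
    """
    docs = set(links) | {t for targets in links.values() for t in targets}
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - d) / n for doc in docs}
        for doc in docs:
            targets = links.get(doc, [])
            if targets:
                # Share this document's rank equally among the documents it cites.
                share = d * rank[doc] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing citations: spread the rank over all documents (assumption).
                share = d * rank[doc] / n
                for target in docs:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With citations that only point backwards in time, as in `{"C": ["A", "B"], "B": ["A"], "A": []}`, the oldest document A accumulates the most rank and the newest document C the least, which is the characteristic of scientific literature the slide points out.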

5 How IR works
Indexing
–Build the dictionary
–Construct the Postings ( pairs)
Searching
–Look up terms in dictionary
–Boolean resolution
–Rank on density (probability, vector space, etc.)
Performance
–Recall and precision
[Figure: example records (Record1: Of Otago; Record2: Otago University; Record3: Otago; Record4: Of) indexed into a dictionary (OF, OTAGO, UNIVERSITY) whose entries point to postings]
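The indexing and searching steps on this slide can be sketched with the four example records. The postings here hold only record numbers, which is a simplifying assumption; the function names are illustrative.

```python
from collections import defaultdict

def build_index(records):
    """Build a dictionary mapping each term to its postings (sorted record ids)."""
    index = defaultdict(list)
    for doc_id in sorted(records):
        for term in sorted(set(records[doc_id].upper().split())):
            index[term].append(doc_id)
    return dict(index)

def boolean_and(index, terms):
    """Boolean resolution: records that contain every query term."""
    postings = [set(index.get(term.upper(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved records that are relevant.
    Recall: fraction of relevant records that were retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

records = {1: "Of Otago", 2: "Otago University", 3: "Otago", 4: "Of"}
index = build_index(records)
```

Here `index["OTAGO"]` is `[1, 2, 3]` and `boolean_and(index, ["of", "otago"])` is `[1]`. Ranking the matching records is a separate step that this sketch leaves out.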

6 Structured-IR
Sci-Lit documents have structure
Title, abstract, conclusions, etc.
[Figure: example marked-up records (1 University of Otago New Zealand; 3 University of Otago top; 2 New Zealand sailing) become a tree with numbered nodes doc:1, docid:2, place:3, name:4, cntry:5, sport:6, rank:7]

7 Using Structure in Ranking
Documents have structure
–Title, Abstract, Conclusions, etc.
Weight each structure on “importance”
–Title higher than Abstract higher than …
How to choose the weights
–Specified in the query (XIRQL)
–Query feedback
–Learn with a Genetic Algorithm
Adapt ranking model to use structure
Each tree node is a locus
Weights are genes
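One way to picture "weights are genes" is a score in which each structure's term matches are multiplied by that structure's weight. The sketch below uses raw term counts purely for illustration; it is not the probabilistic or vector space model used in the experiments, and the field names and weight values are made up.

```python
def weighted_score(doc, query_terms, weights):
    """Sum per-structure term matches, each scaled by that structure's weight.

    `doc` maps a structure name (e.g. "title") to its text; `weights` maps
    the same names to weights, i.e. the genes a GA would tune.
    """
    score = 0.0
    for field, text in doc.items():
        words = text.lower().split()
        matches = sum(words.count(term.lower()) for term in query_terms)
        score += weights.get(field, 0.0) * matches
    return score

# Hypothetical document and weights: Title weighted higher than Abstract.
doc = {"title": "genetic learning",
       "abstract": "learning to rank with genetic learning"}
weights = {"title": 3.0, "abstract": 1.0}
```

For the query `["learning"]` the title contributes 3.0 × 1 and the abstract 1.0 × 2, so `weighted_score(doc, ["learning"], weights)` is 5.0. A GA would treat the values in `weights` as the chromosome and use retrieval performance on the training queries as the fitness.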

8 Experiment
50 training queries
50 evaluation queries
25 generations
Probabilistic IR
Vector Space IR
Results
PROBABILISTIC IR
–75.5% queries improved
–6.7% increase in MAP (8.8% max)
VECTOR SPACE IR
–61% queries improved
–4.7% increase in MAP (5.4% max)
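The results are reported as changes in MAP (mean average precision). For readers unfamiliar with the measure, the standard definition can be sketched as below; the function names and the toy rankings are illustrative only.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k that holds a relevant document,
    divided by the total number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

For example, ranking `["a", "b", "c", "d"]` with relevant documents `{"a", "c"}` gives an average precision of (1/1 + 2/3) / 2, about 0.833.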

9 Ranking Algorithms
Multitude exist
–Probability, vector space, Boolean
–Several published nomenclatures
Over 100,000 “published” algorithms
Purpose
–Put relevant documents first
–Sorting
–Performance measures with precision
Sources
–Some guy thought it up

10 Experiment
50 training queries
50 evaluation queries
31 runs
Weekend time limit
Compare to Probabilistic
Results
–67% queries improved
–15% increase in MAP

11 Function Comparison
Functions compared: Vector Space, Probability, Learned
Learned function:
w_dq = Σ_{t∈q} (((((((((U / sqrt(sqrt(n_t))) / (m_q / sqrt((((L_q / (sqrt(sqrt(L_d)) / sqrt((U / n_c)))) * min(m_q, N)) / sqrt(((((((T_max / sqrt(U)) / sqrt((((log_2(sqrt(n_t)) / sqrt(n_t)) / sqrt(U_max)) / (M / n_c)))) / sqrt((U / n_c))) - u_q) / m_q) / sqrt(n_t))))))) / sqrt((log(T_max) / n_c))) / sqrt(n_t)) / sqrt(n_t)) / sqrt((L_q / sqrt(((sqrt((sqrt(sqrt(L_d)) / sqrt((min(m_q, sqrt((((log(T_max) / n_c) / sqrt(U_max)) / (m_q / sqrt(((N * min((sqrt(n_c) / sqrt(U)), L_d)) / sqrt(N))))))) / sqrt(L_d))))) / sqrt((T_max / n_c))) / sqrt(n_t)))))) / sqrt((min(m_q, N) / n_c))) / sqrt((log(T_max) / n_c))) / sqrt(n_t))
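A function of this shape is what a genetic programming chromosome, an abstract syntax tree, looks like once printed. The sketch below shows one way such a tree could be represented and evaluated. The nested-tuple encoding and the "protected" operators (returning a fixed value instead of failing on division by zero or on the log or square root of a non-positive number) are assumptions for the example; the slides do not say how the author handled these cases.

```python
import math

def evaluate(node, variables):
    """Recursively evaluate an expression tree.

    A node is a variable name (str), a numeric constant, or a tuple
    (operator, child, ...).
    """
    if isinstance(node, str):
        return variables[node]
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    values = [evaluate(child, variables) for child in children]
    if op == "/":
        # Protected division (assumption): avoid a crash on a zero denominator.
        return values[0] / values[1] if values[1] != 0 else 1.0
    if op == "*":
        return values[0] * values[1]
    if op == "-":
        return values[0] - values[1]
    if op == "sqrt":
        return math.sqrt(abs(values[0]))  # protected square root (assumption)
    if op == "log":
        return math.log(abs(values[0])) if values[0] != 0 else 0.0
    if op == "min":
        return min(values)
    raise ValueError(f"unknown operator: {op}")

# The innermost fragment of the learned function: U / sqrt(sqrt(n_t))
fragment = ("/", "U", ("sqrt", ("sqrt", "n_t")))
```

With the illustrative values `{"U": 8.0, "n_t": 16.0}` the fragment evaluates to 8 / 2 = 4.0. Crossover in GP swaps subtrees between two such trees, and mutation replaces a subtree with a new random one.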

12 Conclusions
Using document structure improved ranking
Structure weights can be learned with a GA
GP can be used to learn ranking functions
Speculation
Combining GA and GP to learn a structure ranking algorithm will outperform GA or GP alone

13 Questions?

14 Random Numbers
Are your results an artifact of your random number generator?
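One common way to probe this question is to repeat a stochastic run under several seeds and look at the spread of the outcomes. The sketch below is only an illustration: `run_experiment` is a hypothetical stand-in for a learning run, not anything from the talk.

```python
import random
import statistics

def run_experiment(seed):
    """Hypothetical stand-in for one stochastic learning run; returns a score."""
    rng = random.Random(seed)  # a private generator, so runs do not share state
    return sum(rng.random() for _ in range(100)) / 100

def repeat_over_seeds(experiment, seeds):
    """Run the experiment once per seed; return the mean and standard deviation."""
    scores = [experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

If an improvement is smaller than the seed-to-seed standard deviation, it may not be distinguishable from noise. Repeating the runs with a different generator altogether is a further check that this sketch does not cover.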

