Autumn Web Information Retrieval (Web IR), Handout #11: FICA: A Fast Intelligent Crawling Algorithm. Ali Mohammad Zareh Bidoki, ECE Department, Yazd University
Web Crawling: Search engines do not index the entire Web. Therefore, we have to focus on the most valuable and appealing pages. To do this, a better crawling criterion is required: FICA.
Breadth-First Crawling: [Figure: example link graph with nodes p, q, r, s, t, u, v, w, x, y, z.] BFS advantages: why is it an acceptable algorithm?
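A minimal sketch of breadth-first crawl ordering, for illustration only. The link graph `GRAPH` is hypothetical: only the node names come from the slide, since the edges of the figure are not recoverable. The function name `bfs_order` is likewise an assumption.

```python
from collections import deque

# Hypothetical link graph: node names follow the slide, edges are assumed.
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": [],
    "t": ["y", "z"],
}

def bfs_order(graph, seeds):
    """Return pages in the order a breadth-first crawler would download them."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()          # oldest discovered page first (FIFO)
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:      # enqueue each page at most once
                seen.add(target)
                queue.append(target)
    return order

order = bfs_order(GRAPH, ["p"])
```

Pages are visited level by level from the seed, so pages close to the seeds are downloaded before deeper ones, with no ranking computation at all.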
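Logarithmic Distance Crawling: [Figure: example link graph with nodes p, q, r, s, t, u, v, w, x, y, z.] $d_{pt} = \log 4$, $d_{pz} = \log 4 + \log 2 \approx 0.90$, $d_{pv} = \log 4 + \log 3 \approx 1.08$ (base-10 logarithms). When i points to j, then: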
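The example values on this slide are consistent with one reading: each link out of page i costs the base-10 logarithm of i's out-degree, and the distance to a page is the cheapest such sum over all paths from the source. The sketch below assumes that reading and computes it with a Dijkstra-style search. The graph `GRAPH` is hypothetical, with edges chosen only so that the out-degrees match the slide's examples, and `log_distances` is an illustrative name.

```python
import heapq
import math

# Hypothetical link graph: p has 4 out-links, q has 3, t has 2, so that the
# paths p->t, p->t->z and p->q->v match the slide's example sums.
GRAPH = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "s": [],
    "t": ["y", "z"],
}

def log_distances(graph, source):
    """Smallest logarithmic distance from `source` to each reachable page."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, math.inf):
            continue                      # stale heap entry
        links = graph.get(page, [])
        if not links:
            continue                      # no out-links, nothing to relax
        weight = math.log10(len(links))   # assumed cost of each out-link
        for target in links:
            candidate = d + weight
            if candidate < dist.get(target, math.inf):
                dist[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return dist

distances = log_distances(GRAPH, "p")
```

Under these assumptions `distances["t"]`, `distances["z"]` and `distances["v"]` reproduce the slide's three example values up to rounding.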
FICA: Intelligent surfer model. It is based on reinforcement learning.
FICA (On-line): [Figure: crawler architecture with Seeds, a Priority Queue of URLs (URL1, URL2, …), the FICA scheduler, the Downloader, the Web, and the Repository; arrows are labelled URLs, Web pages, and Text and Metadata.] Distance is used as the priority value.
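The loop suggested by this architecture can be sketched as a crawler that always downloads the queued URL with the smallest distance. This is a simplified illustration, not the FICA scheduler itself: the priority update used here (parent distance plus log10 of the parent's out-degree, keeping the minimum) is an assumption standing in for the reinforcement-learning update, and `crawl`, `fetch_links`, `budget`, and the toy `WEB` dictionary are hypothetical names.

```python
import heapq
import math

def crawl(seeds, fetch_links, budget):
    """Download up to `budget` pages, smallest distance first."""
    best = {url: 0.0 for url in seeds}          # best known distance per URL
    frontier = [(0.0, url) for url in seeds]    # priority queue of (distance, URL)
    heapq.heapify(frontier)
    done = set()
    downloaded = []
    while frontier and len(downloaded) < budget:
        dist, url = heapq.heappop(frontier)
        if url in done:
            continue                            # duplicate queue entry
        done.add(url)
        downloaded.append(url)
        links = fetch_links(url)                # stands in for the downloader
        if not links:
            continue
        step = math.log10(len(links))           # assumed per-link cost
        for target in links:
            candidate = dist + step
            if target not in done and candidate < best.get(target, math.inf):
                best[target] = candidate
                heapq.heappush(frontier, (candidate, target))
    return downloaded

# Toy stand-in for the Web: a page's out-links are known only after download.
WEB = {
    "p": ["q", "r", "s", "t"],
    "q": ["u", "v", "w"],
    "r": ["x"],
    "t": ["y", "z"],
}
order = crawl(["p"], lambda url: WEB.get(url, []), budget=6)
```

Because priorities are assigned as links are discovered, the crawler needs no separate ranking pass over the pages it has already stored.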
Comparison with Others: [Figure: crawler architecture with Seeds, a list of URLs (URL1, URL2, …), the Downloader, the Web, the Repository, and a Partial Ranking Algorithm; an arrow is labelled URLs and Links.]
Experimental Results: The experiment was run on a UK web graph of 18 million web pages. We chose PageRank as the ideal ranking mechanism.
FICA Properties: Its time complexity is O(E log V) (complexity of Partial PageRank is [expression missing]). FICA outperforms the other algorithms in discovering highly important pages. It requires little memory for computation. It is online and adaptive.
FICA as a Ranking Algorithm
Algorithm          Kendall's Tau
Partial PageRank   0.18
Back Link          0.09
Breadth-first      0.11
OPIC               0.62
FICA               0.61
We used Kendall's tau to measure the correlation between two rank lists. The ideal ranking is PageRank.
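For reference, Kendall's tau between two rankings of the same items can be computed by counting concordant and discordant pairs. The sketch below is the plain tau-a form and assumes no ties; the function name `kendall_tau` and the four-item example lists are illustrative and are not data from the experiment.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two orderings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both lists order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs

# Illustrative lists only: the second swaps one adjacent pair of the first.
ideal = ["a", "b", "c", "d"]
other = ["a", "c", "b", "d"]
tau = kendall_tau(ideal, other)   # 5 concordant, 1 discordant, 6 pairs
```

A value of 1 means the two lists agree on every pair, -1 means one list is the reverse of the other, and values near 0 mean little agreement.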
Dynamic Version of FICA