Download presentation
Presentation is loading. Please wait.
Published byAugustine Lawrence Modified over 9 years ago
1
Topical Crawling for Business Intelligence Gautam Pant * and Filippo Menczer ** * Department of Management Sciences The University of Iowa, Iowa City, IA 52246 ** School of Informatics Indiana University, Bloomington, IN 47408
2
Overview Topical Crawling The Business Intelligence Problem Test Bed Crawling Algorithms Results Finding Better Seeds
3
Crawling as Graph Search History Frontier Seeds Node expansion – Downloading and parsing a page Open list - Frontier Closed list – History Expansion order – Crawl path
4
Exhaustive vs. Preferential Crawling Exhaustive - blind expansion order (e.g. Breadth First ) Preferential - heuristic-based expansion order (e.g. Best First) Topical Crawling: the guiding heuristic is based on a topic or a set of topics
5
Business Intelligence Problem Web based information about related business entities Related through the area of competence, research thrust etc. Topical crawlers can help in creating a small but focused collection of Web pages that is rich in information about related business entities
6
Business Intelligence Problem A list of business entities is available We create a focused document collection that can be further explored with ranking, indexing and text-mining tools We investigate the crawling techniques for the task
7
Finding paths in a competitive community.com. edu,.org,.gov.com
8
Test Bed DMOZ Categories – “Companies”, “Consultants”, “Manufacturers” DMOZ 159 topics seeds, targets, keywords and description Each crawler crawl up-to 10,000 pages for each topic
9
Sample Topic
10
Performance Metrics Precision@N Target Recall@N Relevant Targets Crawled |Crawled ∩ Relevant| / |Relevant| |Crawled ∩ Targets| / |Targets|
11
Crawling Infrastructure
12
Crawling Algorithms Breadth First Naïve Best First
13
Crawling Algorithms – DOM Crawler
14
Hub-Seeking Crawler n – number of seed hosts
15
Performance
16
Improving the Seed Set Top 10 hubs based on back- links from Google Avoiding mirrors of DMOZ Augmented seed set
17
Performance
18
Related work Chakrabarti et. al. [1998] Use of Hubs Menczer et. al. [2001] Framework for evaluating topical crawlers Chakrabarti et. al. [2002] Use of DOM
19
Conclusion Investigated the problem of creating a small collection through topical crawling for locating related business entities Hub Seeking crawler that seeks hubs at crawl time and exploits the tag tree structure of Web pages outperforms Naïve Best-First Positive effects of identifying hubs before and during the crawl process Future Work – Find optimal aggregation node Compare the benefits of identifying hubs in competitive vs. collaborative communities
20
Thank You gautam-pant@uiowa.edu Acknowledgements: Robin McEntire (GlaxoSmithKline R&D) Valdis A. Dzelzkalns (GlaxoSmithKline R&D) Paul Stead (GlaxoSmithKline R&D) NSF grant to FM
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.