Topical Crawling for Business Intelligence
Gautam Pant* and Filippo Menczer**
* Department of Management Sciences, The University of Iowa, Iowa City, IA 52246
** School of Informatics, Indiana University, Bloomington, IN 47408

Overview
- Topical Crawling
- The Business Intelligence Problem
- Test Bed
- Crawling Algorithms
- Results
- Finding Better Seeds

Crawling as Graph Search
[Figure labels: Seeds, Frontier, History]
- Node expansion – downloading and parsing a page
- Open list – Frontier
- Closed list – History
- Expansion order – Crawl path
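The graph-search view above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' crawler: `fetch_links` is a hypothetical stand-in for downloading and parsing a page, and `toy_web` is an invented in-memory graph used instead of real HTTP requests. The frontier here is a FIFO queue, so the expansion order is breadth first.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages):
    """Crawl as graph search.

    fetch_links(url) is a hypothetical callback standing in for
    "download and parse a page"; it returns the URLs the page links to.
    """
    frontier = deque(seeds)  # open list: URLs waiting to be expanded
    history = set()          # closed list: URLs already expanded
    crawl_path = []          # expansion order
    while frontier and len(crawl_path) < max_pages:
        url = frontier.popleft()
        if url in history:
            continue  # already expanded via another path
        history.add(url)
        crawl_path.append(url)
        for link in fetch_links(url):  # node expansion
            if link not in history:
                frontier.append(link)
    return crawl_path

# Toy in-memory "Web" (illustrative data only).
toy_web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}
path = crawl(["a"], lambda url: toy_web.get(url, []), max_pages=10)
print(path)
```

Swapping the queue for a priority queue changes only the expansion order; the frontier and history bookkeeping stay the same.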

Exhaustive vs. Preferential Crawling
- Exhaustive – blind expansion order (e.g. Breadth First)
- Preferential – heuristic-based expansion order (e.g. Best First)
Topical Crawling: the guiding heuristic is based on a topic or a set of topics
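As a rough sketch of preferential crawling, the frontier can be kept as a priority queue ordered by a topic-based score. The scoring rule below is an assumption made for illustration: each link inherits the cosine similarity between the topic keywords and the text of the page it was found on. The `pages` dictionary and the `topic` string are invented toy data.

```python
import heapq
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_first_crawl(seeds, pages, topic, max_pages):
    """pages maps url -> (text, links); a toy stand-in for the Web."""
    # heapq is a min-heap, so scores are negated; ties fall back to URL order.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    history, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in history or url not in pages:
            continue
        history.add(url)
        order.append(url)
        text, links = pages[url]
        score = cosine(topic, text)  # assumed heuristic: parent page similarity
        for link in links:
            if link not in history:
                heapq.heappush(frontier, (-score, link))
    return order

pages = {
    "seed": ("business directory", ["pharma", "sports"]),
    "pharma": ("drug discovery research company", ["lab"]),
    "sports": ("football scores and news", ["fans"]),
    "lab": ("drug research lab", []),
    "fans": ("football fans forum", []),
}
topic = "drug discovery research"
order = best_first_crawl(["seed"], pages, topic, max_pages=3)
print(order)
```

With a budget of three pages, this toy crawl reaches "lab" through the on-topic "pharma" page before visiting "sports", whereas a breadth-first order would spend the third download on "sports".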

Business Intelligence Problem
- Web-based information about related business entities
- Entities are related through area of competence, research thrust, etc.
- Topical crawlers can help create a small but focused collection of Web pages that is rich in information about related business entities

Business Intelligence Problem
- A list of business entities is available
- We create a focused document collection that can be further explored with ranking, indexing and text-mining tools
- We investigate crawling techniques for the task

Finding paths in a competitive community
[Figure labels: .com; .edu, .org, .gov; .com]

Test Bed
- DMOZ categories – “Companies”, “Consultants”, “Manufacturers”
- 159 topics
- Seeds, targets, keywords and description for each topic
- Each crawler crawls up to 10,000 pages for each topic

Sample Topic

Crawling Infrastructure

Crawling Algorithms
- Breadth First
- Naïve Best First

Crawling Algorithms – DOM Crawler

Hub-Seeking Crawler
n – number of seed hosts

Performance

Improving the Seed Set
- Top 10 hubs based on back-links from Google
- Avoiding mirrors of DMOZ
- Augmented seed set
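One way to picture the seed-augmentation step is the sketch below. It rests on assumptions the slides do not confirm: candidate hubs are ranked by how many distinct seeds they link to, the back-link lists are supplied as a plain dictionary (the slides obtain them from Google), and mirror detection is left to a caller-supplied predicate. All URLs and the `is_mirror` rule are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def top_hubs(backlinks, is_mirror, k=10):
    """Rank pages by the number of distinct seeds they link to.

    backlinks maps each seed URL to the pages reported as linking to it.
    is_mirror(url) is a hypothetical predicate for pages to exclude.
    """
    counts = Counter()
    for seed, sources in backlinks.items():
        for page in set(sources):  # count each page once per seed
            if page != seed and not is_mirror(page):
                counts[page] += 1
    # Sort by descending count, then by URL so ties are deterministic.
    return sorted(counts, key=lambda page: (-counts[page], page))[:k]

# Illustrative back-link data (not real query results).
backlinks = {
    "http://acme.example/": [
        "http://hub.example/list",
        "http://dmoz.example/cat",
        "http://blog.example/",
    ],
    "http://beta.example/": [
        "http://hub.example/list",
        "http://dmoz.example/cat",
    ],
}
is_mirror = lambda url: urlparse(url).netloc == "dmoz.example"

hubs = top_hubs(backlinks, is_mirror, k=10)
augmented_seeds = list(backlinks) + hubs
print(augmented_seeds)
```

The augmented seed set is simply the original seeds followed by the selected hubs; a crawler started from it can reach other related entities through the hubs' out-links.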

Performance

Related Work
- Chakrabarti et al. [1998]: use of hubs
- Menczer et al. [2001]: framework for evaluating topical crawlers
- Chakrabarti et al. [2002]: use of DOM

Conclusion
- Investigated the problem of creating a small collection through topical crawling for locating related business entities
- The Hub-Seeking crawler, which seeks hubs at crawl time and exploits the tag tree structure of Web pages, outperforms Naïve Best-First
- Positive effects of identifying hubs before and during the crawl process
- Future work: find the optimal aggregation node; compare the benefits of identifying hubs in competitive vs. collaborative communities

Thank You
gautam-pant@uiowa.edu
Acknowledgements: Robin McEntire (GlaxoSmithKline R&D), Valdis A. Dzelzkalns (GlaxoSmithKline R&D), Paul Stead (GlaxoSmithKline R&D), NSF grant to FM
