Exploiting Inter-Class Rules for Focused Crawling
İsmail Sengör Altıngövde, Bilkent University, Ankara, Turkey

Our Research: The Big Picture
- Goal: metadata-based modeling and querying of web resources
- Stages:
  - Semi-automated metadata extraction from web resources (focused crawling fits here!)
  - Extending SQL to support ranking and text-based operations in an integrated manner
  - Developing query processing algorithms
  - Prototyping a digital library application for CS resources

Overview
- Motivation
- Background & related work
- Inter-class rules for focused crawling
- Preliminary results

Motivation
- Crawlers, a.k.a. bots, spiders, robots
- Goal: fetch all the pages on the Web, to enable subsequent useful tasks (e.g., indexing)
- "All pages" means roughly 4 billion pages today (according to Google)
- Requires enormous hardware and network resources
- Consider the growth rate & refresh rate of the Web
- What about hidden-Web and dynamic content?

Motivation
- Certain applications do need such powerful (and expensive) crawlers
  - e.g., a general-purpose search engine
- And some others don't...
  - e.g., a portal on computer science papers, or on people's homepages

Motivation
- Let's relax the problem space: "focus" on a restricted target space of Web pages
  - that may be of some "type" (e.g., homepages)
  - that may be of some "topic" (CS, quantum physics)
- The "focused" crawling effort would use far fewer resources, be more timely, and be better suited for indexing & searching purposes

Motivation
- Goal: design and implement a focused Web crawler that would
  - gather only pages on a particular "topic" (or class)
  - use inter-class relationships while choosing the next page to download
- Once we have this, we can do many interesting things on top of the crawled pages (I plan to be around for a few more years!)

Background: A typical crawler
- Starts from a set of "seed pages"
- Follows all hyperlinks it encounters, to eventually traverse the entire Web
- Applies breadth-first search (BFS); a minimal sketch follows
- Runs endlessly in cycles to revisit modified pages and to access unseen content
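A minimal BFS crawler sketched in Python (illustrative only; politeness, multi-threading, and robust HTML parsing are the subject of the next slides):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def bfs_crawl(seeds, max_pages=100):
    frontier = deque(seeds)              # FIFO queue -> breadth-first order
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip unreachable or non-text pages
        pages[url] = html
        # naive link extraction; a real crawler uses an HTML parser
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```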

Our simple BFS crawler

Crawling issues...
- Multi-threading
  - Use separate, dedicated threads for DNS resolution and actual page downloading
  - Cache and prefetch DNS resolutions
- Content-seen test
  - Avoid duplicate content, e.g., mirrors
- Link extraction and normalization
  - Canonical URLs (see the sketch below)
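A sketch of URL canonicalization and the content-seen test; the canonical form used here (lowercased scheme, hostname without default port, fragment dropped) is one common convention, not the only one:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Reduce a URL to a canonical form so syntactic variants compare equal."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

seen_digests = set()

def content_seen(html):
    """Return True if an identical page body was fetched before (e.g., a mirror)."""
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```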

More issues...
- URL-seen test (sketched below)
  - Avoid being trapped in a cycle!
  - Hash visited URLs with the MD5 algorithm and store them in a database
  - 2-level hashing to exploit spatio-temporal locality
- Load balancing among hosts: be polite!
  - Robot exclusion protocol
  - Meta tags
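A sketch of the URL-seen test and robot exclusion, using Python's standard urllib.robotparser; the database storage and 2-level hashing are omitted:

```python
import hashlib
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

visited = set()   # in the real crawler this lives in a database

def url_seen(url):
    """MD5-hash the URL and check it against the visited set (avoids cycles)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    if digest in visited:
        return True
    visited.add(digest)
    return False

def allowed(url, user_agent="FocusedCrawler"):
    """Honor the robot exclusion protocol (cache one parser per host in practice)."""
    parts = urlsplit(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)
```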

Even more issues?!
- Our crawler is simple, since issues like
  - refreshing crawled web pages,
  - performance monitoring, and
  - hidden-Web content
  are left out...
- And some of the implemented features can still be improved
  - a "busy queue" for the politeness policy!

Background: Focused crawling
- "A focused crawler seeks and acquires [...] pages on a specific set of topics representing a relatively narrow segment of the Web." (Soumen Chakrabarti)
- The underlying paradigm is best-first search instead of breadth-first search (the two frontiers are contrasted in the sketch below)
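Structurally, the two strategies differ only in the frontier data structure; a minimal illustration (URLs and scores are hypothetical):

```python
from collections import deque
import heapq

# Breadth-first: a FIFO queue, visit pages in discovery order
bfs_frontier = deque(["http://a.example", "http://b.example"])
next_url = bfs_frontier.popleft()

# Best-first: a max-priority queue, visit the highest-scoring URL first
# (heapq is a min-heap, so scores are pushed negated)
best_frontier = []
heapq.heappush(best_frontier, (-0.8, "http://a.example"))
heapq.heappush(best_frontier, (-0.3, "http://b.example"))
_, next_url = heapq.heappop(best_frontier)   # -> "http://a.example"
```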

Breadth vs. Best First Search

Two fundamental questions
- Q1: How to decide whether a downloaded page is on-topic or not?
- Q2: How to choose the next page to visit?

Early algorithms
- FISHSEARCH: query driven
  - A1: pages that match a query
  - A2: the neighborhood of those pages
- SHARKSEARCH: uses TF-IDF & the cosine measure from IR to determine page relevance
- Cho et al.: reorder the crawl frontier based on a "page importance" score (PageRank, in-links, etc.)

Chakrabarti's crawler
- Chakrabarti's focused crawler
  - A1: determines page relevance using a text classifier
  - A2: adds URLs to a max-priority queue with their parent page's score, and visits them in descending score order
- What is original is the use of a text classifier!

The baseline crawler
- A simplified implementation of Chakrabarti's crawler
- It is used to present & evaluate our rule-based strategy
- Just two minor changes in our crawler architecture, and done!

Our baseline crawler

The baseline crawler
- An essential component is the text classifier
  - a Naive Bayes classifier called Rainbow
- Training the classifier
  - Data: use a topic taxonomy (the Open Directory, Yahoo!)
  - Better than modeling a negative class

Baseline crawler: Page relevance
- Testing the classifier
  - The user determines the focus topics
  - The crawler calls the classifier and obtains a score for each downloaded page
  - The classifier returns a sorted list of classes and scores (A 80%, B 10%, C 7%, D 1%, ...)
- The classifier determines the page relevance!

Baseline crawler: Visit order
- The radius-1 hypothesis: if page u is an on-topic example and u links to v, then the probability that v is on-topic is higher than the probability that a randomly chosen Web page is on-topic.

Baseline crawler: Visit order
- Hard-focus crawling: if a downloaded page is off-topic, stop following hyperlinks from this page
  - Assume the target is class B, and for page P the classifier gives: A 80%, B 10%, C 7%, D 1%, ...
  - Do not follow P's links at all!

Baseline crawler: Visit order
- Soft-focus crawling:
  - obtains the page's relevance score (its relevance to the target topic)
  - assigns this score to every URL extracted from this particular page, and adds them to the priority queue
- Example: A 80%, B 10%, C 7%, D 1%, ... → insert P's links into the PQ with score 0.10
- Both policies are sketched below
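A sketch of the two policies, assuming a hypothetical classify(page) that returns (class, score) pairs sorted by descending score, as on the previous slides:

```python
import heapq

frontier = []   # max-priority queue of (negated score, url)

def expand(page, links, target, classify, hard=False):
    """Score a downloaded page and enqueue its out-links accordingly."""
    scores = dict(classify(page))
    relevance = scores.get(target, 0.0)
    if hard and max(scores, key=scores.get) != target:
        return                        # hard focus: off-topic page, prune all links
    for url in links:
        # soft focus: every extracted URL inherits the parent's relevance
        heapq.heappush(frontier, (-relevance, url))
```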

Rule-based crawler: Motivation
- Two important observations:
  - Pages not only refer to pages from the same class, but also to pages from other classes
    - e.g., from "bicycle" pages to "first aid" pages
  - Relying on the radius-1 hypothesis alone is not enough!

Rule-based crawler: Motivation
- The baseline crawler cannot support tunneling
  - "University homepages" link to "CS pages", which link to "researcher homepages", which in turn link to "CS papers"
- Determining the score only w.r.t. similarity to the target class is not enough!

Our solution
- Extract rules that statistically capture linkage relationships among the classes (topics), and guide the crawler accordingly
- Intuitively, we determine relationships like "pages in class A refer to pages in class B with probability X": A → B (X)

Our solution
- When the crawler seeks class B and the page P at hand is of class A:
  - consider all paths from A to B
  - compute an overall score S
  - add links from P to the PQ with this score S
- Basically, we revise the radius-1 hypothesis with class linkage probabilities.

How to obtain rules?

An example scenario
- Assume our taxonomy has 4 classes:
  - department homepages (DH)
  - course homepages (CH)
  - personal homepages (PH)
  - sports pages (SP)
- First, obtain the train-0 set
- Next, for each class, assume 10 pages pointed to by the pages in the train-0 set are fetched

An example scenario
[Table on slide: the distribution of links from each class to the other classes, and the inter-class rules derived from that distribution; a derivation sketch follows]
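A sketch of how such rules can be derived from the link-distribution table. The counts below are hypothetical, chosen so that the resulting confidences match the rules quoted later in the talk (e.g., DH → CH (0.8)):

```python
def extract_rules(link_counts):
    """Turn per-class link counts into rules: A -> B with confidence
    links(A, B) / total links out of class A."""
    rules = {}
    for src, targets in link_counts.items():
        total = sum(targets.values())
        for dst, n in targets.items():
            if n:
                rules[(src, dst)] = n / total
    return rules

# Hypothetical link counts for the 4-class scenario above
link_counts = {"DH": {"CH": 8, "PH": 1, "DH": 1},
               "CH": {"PH": 4, "CH": 4, "DH": 2}}
rules = extract_rules(link_counts)    # e.g., rules[("DH", "CH")] == 0.8
```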

Seed and target classes are both from the class PH.

Rule-based crawler
- The rule-based approach successfully uses class linkage information to revise the radius-1 hypothesis and reach an immediate reward

Rule-based crawler: Tunneling
- The rule-based approach also supports tunneling, by a simple application of transitivity
- Consider URL#2 (of class DH):
  - A direct rule: DH → PH (0.1)
  - An indirect rule: from DH → CH (0.8) and CH → PH (0.4), obtain DH → PH (0.8 * 0.4 = 0.32)
  - And thus DH → PH (0.1 + 0.32 = 0.42)

Rule-based crawler: Tunneling
- Observe that:
  - (i) in effect, the rule-based crawler becomes aware of the path DH → CH → PH, although it was trained only with paths of length 1;
  - (ii) the rule-based crawler can successfully imitate tunneling.

Rule-based score computation
- Chain the rules up to some predefined MAX-DEPTH (e.g., 2 or 3)
- Merge the paths with the function SUM
- If no rules apply at all, fall back to the soft-focus score
- Note that
  - the rule db can be represented as a graph, and
  - for a given target class, all cycle-free paths (except the self-loop of T) can be computed (e.g., with a modified BFS); a sketch follows
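A sketch of the computation, written as a depth-limited search over the rule graph rather than the modified BFS mentioned above; each path scores the product of its rule confidences, and paths are merged with SUM:

```python
MAX_DEPTH = 2

def rule_score(rules, src, target, depth=MAX_DEPTH, visited=None):
    """Sum, over all cycle-free rule paths from src to target of length
    <= depth, the product of the rule confidences along the path."""
    visited = visited or {src}
    total = 0.0
    for (a, b), p in rules.items():
        if a != src:
            continue
        if b == target:
            total += p                   # path ends at the target
        elif depth > 1 and b not in visited:
            total += p * rule_score(rules, b, target, depth - 1, visited | {b})
    return total

rules = {("DH", "CH"): 0.8, ("DH", "PH"): 0.1, ("CH", "PH"): 0.4}
print(rule_score(rules, "DH", "PH"))   # 0.1 + 0.8 * 0.4 = 0.42, as on the tunneling slide
```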

Rule-based score computation

Preliminary results: Set-up
- DMOZ taxonomy
  - leaves with more than 150 URLs
  - 1282 classes (topics)
- Train-0 set: 120K pages
- Train-1 set: 40K pages pointed to by 266 interrelated classes (all about science)
- Target topics are also from these 266 classes

Preliminary results: Set-up
- Harvest ratio: the average relevance to the target topic of all pages acquired by the crawler (a one-line sketch follows)
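In symbols, for fetched pages p_1, ..., p_N the harvest ratio is (1/N) * Σ relevance(p_i); a one-line sketch:

```python
def harvest_ratio(relevance_scores):
    """Mean relevance (to the target topic) of all fetched pages."""
    return sum(relevance_scores) / len(relevance_scores) if relevance_scores else 0.0
```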

Preliminary results
- Seeds are from DMOZ and Yahoo!
- Harvest rate improvements range from 3% to 38%
- Coverage also differs

Harvest Rate

Future Work
- More sophisticated rule discovery techniques (e.g., the topic citation matrix of Chakrabarti et al.)
- On-line refinement of the rule database
- Using the entire taxonomy, not only the leaves

Acknowledgments We gratefully thank Ö. Rauf Atay for the implementation.

References
- I. S. Altıngövde and Ö. Ulusoy, "Exploiting Inter-Class Rules for Focused Crawling," IEEE Intelligent Systems Magazine, to appear.
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 352 pages, 2002.
- S. Chakrabarti, M. H. van den Berg, and B. E. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery," in Proc. of the 8th International WWW Conference (WWW8), 1999.

Any questions???