Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti, IIT Bombay; Kunal Punera, IIT Bombay; Mallela Subramanyam, UT Austin


First-generation focused crawling
- Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
- Guess the relevance of an unseen node v from the relevance of u (u → v), as evaluated by a topic classifier
[Diagram: a baseline learner builds class models consisting of term stats from the Dmoz topic taxonomy. The crawler picks the best URL from the frontier URLs priority queue (initialized with seed URLs) and submits each newly fetched page u for classification. If Pr(c*|u) is large enough, then all outlinks v of u are enqueued with priority Pr(c*|u). Also shown: crawl database.]
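The baseline loop on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' software: `focused_crawl`, the callables `fetch_outlinks` and `relevance`, and the `threshold` default are hypothetical stand-ins for the crawler, the topic classifier's Pr(c*|u), and the "large enough" test.

```python
import heapq

def focused_crawl(seeds, fetch_outlinks, relevance, threshold=0.5, max_pages=100):
    """Baseline focused crawl: expand only pages the classifier judges relevant.

    fetch_outlinks(u) -> list of outlink URLs on page u (hypothetical fetcher).
    relevance(u)      -> estimate of Pr(c*|u) in [0, 1] (hypothetical classifier).
    """
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []  # (url, relevance) pairs in fetch order

    while frontier and len(visited) < max_pages:
        _, u = heapq.heappop(frontier)
        outlinks = fetch_outlinks(u)
        score = relevance(u)
        visited.append((u, score))
        if score >= threshold:
            # Every outlink v inherits the same priority Pr(c*|u);
            # the baseline cannot tell the links on one page apart.
            for v in outlinks:
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-score, v))
    return visited
```

On a toy link graph, a page scored below the threshold is still fetched (and counted as a loss), but its outlinks are never enqueued.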

Baseline crawling results
- 20 topics from
- Half to two-thirds of pages fetched are irrelevant
- “Every click on a link is a leap of faith”
- Humans leap better than focused crawlers
- Adequate clues in text + DOM to leap better

How to seek out distant goals?
- Manually collect paths (context graphs) leading to relevant goals
- Use a classifier to predict link-distance to goal from page text
- Prioritize link expansion by estimated distance to relevant goals
- No discrimination among different links on a page
[Diagram: a goal page surrounded by layers at link distance 1, 2, 3; classifier; crawler]

The apprentice+critic strategy
- Baseline learner now acts as the critic
- An instance (u,v) for the apprentice: Pr(c*|v), and Pr(c|u) for all classes c
- Online training: submit (u,v) to the apprentice
- Apprentice assigns more accurate priority to node v
[Diagram: the first-generation pipeline (Dmoz topic taxonomy, class models consisting of term stats, frontier URLs priority queue, crawler picks best, newly fetched page u submitted for classification, "If Pr(c*|u) is large enough...", crawl database), extended with an apprentice learner that has its own class models (+/-). Links u → v are labeled Good or Bad.]

Design rationale
- Baseline classifier specifies what to crawl
  - Could be a user-defined black box
  - Usually depends on large amounts of training data, relatively slow to train (hours)
- Apprentice classifier learns how to locate pages approved by the baseline classifier
  - Uses local regularities in HTML, site structure
  - Less training data, faster training (minutes)
  - Guards against high fan-out (~10)
- No need to manually collect paths to goals

Apprentice feature design
- HREF source page u represented by DOM tree
- Leaf nodes marked with offsets wrt the HREF
- Many usable representations for term at offset d
  - A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
  - ⟨t,p,d⟩, where p is path length from t to d through LCA
[Diagram: DOM tree with nodes ul, li, a HREF, font, em, tt and TEXT leaves, showing the leaf offsets, the term “download”, and the LCA]
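The ⟨t,d⟩ representation can be illustrated with a small sketch. It assumes the DOM leaves have already been extracted as a flat list of text strings in document order, and that the leaf holding the link's anchor text sits at offset 0; the function name `offset_features` and both of those conventions are assumptions for illustration, not details taken from the slides.

```python
def offset_features(leaves, href_index, d_max=5):
    """Return sorted (term, d) pairs for leaves within d_max of the HREF leaf.

    leaves:     text of each DOM leaf in document order (assumed pre-extracted).
    href_index: position of the leaf holding the HREF's anchor text.
    """
    features = set()
    for i, text in enumerate(leaves):
        d = i - href_index  # negative: before the link; 0: the anchor text itself
        if abs(d) <= d_max:
            for term in text.lower().split():
                features.add((term, d))
    return sorted(features)

# Example: the term "download" two leaves before the link gives ("download", -2).
example = offset_features(["Download", "the latest", "release", "notes"], href_index=2)
```

The ⟨t,p,d⟩ variant would additionally need the tree itself to measure the path length through the LCA, which this flat-list sketch deliberately leaves out.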

Offsets of good ⟨t,d⟩ features
- Plot information gain at each d averaged over terms t
- Max at d=0, falls off on both sides, but…
- Clear saw-tooth pattern for most topics. Why?
- …
- Topic-independent authoring idioms, best handled by apprentice
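One way to compute the quantity plotted here is sketched below. It assumes each training instance is a set of (term, d) features plus a binary good/bad label, and treats each feature as present or absent; the function names and this binary setup are illustrative assumptions, since the slides do not spell out the exact computation.

```python
import math
from collections import defaultdict

def entropy(pos, neg):
    """Binary entropy in bits of a pos/neg count pair (0.0 for an empty split)."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def info_gain(instances, feature):
    """Information gain of one (term, d) feature; instances are (feature_set, label)."""
    def h(labels):
        pos = sum(labels)
        return entropy(pos, len(labels) - pos)

    n = len(instances)
    everything = [lab for _, lab in instances]
    present = [lab for feats, lab in instances if feature in feats]
    absent = [lab for feats, lab in instances if feature not in feats]
    return h(everything) - len(present) / n * h(present) - len(absent) / n * h(absent)

def mean_gain_by_offset(instances):
    """Average the gain over all terms t seen at each offset d."""
    gains = defaultdict(list)
    all_features = set().union(*(feats for feats, _ in instances))
    for term, d in all_features:
        gains[d].append(info_gain(instances, (term, d)))
    return {d: sum(g) / len(g) for d, g in gains.items()}
```

Plotting the returned dictionary against d would give a curve of the kind the slide describes, provided the instances come from a real crawl.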

Apprentice learner design
- Instance ⟨(u,v), Pr(c*|v)⟩ represented by
  - ⟨t,d⟩ features for d up to some d_max
  - HREF source topics {⟨c, Pr(c|u)⟩ ∀c}
  - Cocitation: w_1, w_2 siblings, w_1 good ⇒ w_2 good
- Learning algorithm
  - Want to map features to a score in [0,1]
  - Discretize [0,1] into ranges, each a label ℓ
  - Class label ℓ has an associated value q_ℓ
  - Use a naïve Bayes classifier to find Pr(ℓ|(u,v))
  - Score = Σ_ℓ q_ℓ Pr(ℓ|(u,v))
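The scoring rule (expected label value under a naive Bayes posterior) can be sketched as below. The class name `ApprenticeSketch`, the choice of equal-width bins with midpoints as the associated values, and the Laplace-smoothed multinomial model are all assumptions made for this illustration; the slides only state that a naive Bayes classifier over discretized labels is used.

```python
import math
from collections import Counter, defaultdict

class ApprenticeSketch:
    """Toy online learner: naive Bayes over discretized relevance labels."""

    def __init__(self, n_bins=4):
        self.n_bins = n_bins
        # Associated value q for each label: the bin midpoint (an assumption).
        self.q = [(b + 0.5) / n_bins for b in range(n_bins)]
        self.bin_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def _bin(self, relevance):
        # Map a relevance in [0, 1] to a label index; 1.0 falls in the top bin.
        return min(int(relevance * self.n_bins), self.n_bins - 1)

    def train(self, features, relevance):
        """Add one instance: features of link (u, v) and the critic's Pr(c*|v)."""
        b = self._bin(relevance)
        self.bin_counts[b] += 1
        for f in features:
            self.feat_counts[b][f] += 1
            self.vocab.add(f)

    def score(self, features):
        """Return the sum over labels of q * Pr(label | features)."""
        n = sum(self.bin_counts.values())
        log_post = []
        for b in range(self.n_bins):
            # Laplace-smoothed prior and per-feature likelihoods, in log space.
            lp = math.log((self.bin_counts[b] + 1) / (n + self.n_bins))
            denom = sum(self.feat_counts[b].values()) + len(self.vocab) + 1
            for f in features:
                lp += math.log((self.feat_counts[b][f] + 1) / denom)
            log_post.append(lp)
        top = max(log_post)
        weights = [math.exp(lp - top) for lp in log_post]
        z = sum(weights)
        return sum(q * w / z for q, w in zip(self.q, weights))
```

Because `train` only increments counters, each newly fetched page can update the model immediately, which is what makes online use cheap in this sketch.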

Apprentice accuracy
- Better elimination of useless outlinks with increasing d_max
- Good accuracy with d_max < 6
- Using DOM offset info improves accuracy
- Small accuracy gains from:
  - LCA distance in ⟨t,p,d⟩
  - Source page topics
  - Cocitation features

Offline apprentice trials
- Run baseline, train apprentice
- Start new crawler at the same URLs
  - Let it fetch any page it schedules
  - (Recall) limit it to pages visited by the baseline crawler
- Baseline loss > recall loss > apprentice loss
- Small URL overlap

Online guidance from apprentice
- Run baseline
- Train apprentice
- Re-evaluate frontier
  - Apprentice not as optimistic as baseline
  - Many URLs downgraded
- Continue crawling with apprentice guidance
  - Immediate reduction in loss rate
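The "re-evaluate frontier" step amounts to rescoring every queued link with the apprentice and rebuilding the priority queue. A minimal sketch, assuming frontier entries are stored as (negated priority, source URL, target URL) tuples and that `apprentice_score(u, v)` is a hypothetical callable returning a value in [0, 1]:

```python
import heapq

def reprioritize(frontier, apprentice_score):
    """Rebuild the frontier heap using apprentice scores instead of Pr(c*|u).

    frontier:         list of (neg_priority, u, v) entries from the baseline run.
    apprentice_score: callable (u, v) -> score in [0, 1] (hypothetical).
    """
    rescored = [(-apprentice_score(u, v), u, v) for _, u, v in frontier]
    heapq.heapify(rescored)
    return rescored
```

Links from the same source page u, which the baseline gave identical priorities, can now be separated, so many URLs drop in the queue.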

Summary
- New generation of focused crawler
- Discriminates between links, learning online
- Apprentice easy and fast to train online
- Accurate with small d_max, around 4 to 6
- DOM-derived features better than text
- Effective division of labor (‘what’ vs. ‘how’)
- Loss rate reduced by 30 to 90%
- Apprentice better at guessing relevance of unvisited nodes than baseline crawler
- Benefits visible after 100 to 1000 page fetches

Ongoing work
- Extending to larger radius and deeper DOM + site structure
- Public domain C++ software
  - Crawler
    - Asynchronous DNS, simple callback model
    - Can saturate dedicated 4 Mbps with a Pentium II
  - HTML cleaner
    - Simple, customizable, table-driven patch logic
    - Robust to bad HTML, no crash or memory leak
  - HTML to DOM converter
    - Extensible DOM node class