Multiple-goal Search Algorithms and their Application to Web Crawling Dmitry Davidov and Shaul Markovitch Computer Science Department Technion, Haifa 32000, Israel

Introduction
What does the paper talk about?
- Subject: multiple-goal search
- Application domain: Web crawling

Representation
- The Web is viewed as a large graph:
  - Pages are nodes
  - Links are arcs
- Web crawling as graph searching (Kumar et al. 2000)
- Do the traditional graph search algorithms work?

Characteristics
- A set of goal states
- Success criteria:
  - Don't stop as soon as a single goal is found
  - Collect as many goals as possible

Heuristics
How about traditional heuristics?
- Most are based on a heuristic function that estimates the distance from a node to the nearest goal node.
- This is not useful for multiple-goal search.

Example

Method of Experimentation and Evaluation
Two alternative stopping criteria:
- The search stops when it has spent a given allocation of resources.
- The search stops when it has found a given portion of the goal states.
Accordingly, there are two evaluation methods:
- The number of goal states found using the given resources
- The resources spent to find the required portion of the goal states
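The two evaluation methods can be illustrated with a small sketch. It assumes, purely for illustration, that a search run is recorded as a `trace` with one boolean per unit of resource spent (True when that unit uncovered a goal); the function names and the trace format are hypothetical and do not come from the slides.

```python
import math
from typing import List, Optional


def goals_found_within(trace: List[bool], budget: int) -> int:
    """Evaluation 1: number of goals found using at most `budget` resource units."""
    return sum(trace[:budget])


def resources_needed(trace: List[bool], portion: float, total_goals: int) -> Optional[int]:
    """Evaluation 2: resource units spent until `portion` of all goals is found.

    Returns None if the recorded run never reaches the required portion.
    """
    required = math.ceil(portion * total_goals)
    found = 0
    for step, hit in enumerate(trace, start=1):
        found += hit
        if found >= required:
            return step
    return None


# Hypothetical run: 8 resource units spent, 4 goals exist in total.
trace = [False, True, False, False, True, True, False, True]
found_at_5 = goals_found_within(trace, 5)
cost_half = resources_needed(trace, 0.5, 4)
cost_all = resources_needed(trace, 1.0, 4)
```

Both measures are computed from the same run, so one recorded trace is enough to compare algorithms under either stopping criterion.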

Sum-of-distance heuristic
- The distance heuristic does not take goal concentration into account.
- Given S_G, we use the sum of the distances to S_G.
- One problem: the search tends to progress towards all the goals simultaneously.
- One possible remedy to the problem: giving higher weight to progress
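A minimal sketch of the difference between the nearest-goal distance and the sum of distances. It assumes nodes are points on a grid and uses Manhattan distance as a stand-in distance heuristic; both choices, and all names below, are illustrative assumptions rather than part of the presentation.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Stand-in distance heuristic (an assumption for this example)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def min_distance(node: Point, goals: Iterable[Point]) -> int:
    """Traditional heuristic: distance to the nearest goal."""
    return min(manhattan(node, g) for g in goals)


def sum_of_distances(node: Point, goals: Iterable[Point]) -> int:
    """Sum heuristic: total distance to every goal in the goal set."""
    return sum(manhattan(node, g) for g in goals)


# One isolated goal and a cluster of three goals.
goals = [(0, 5), (9, 0), (10, 0), (10, 1)]
a, b = (0, 4), (8, 0)  # a is next to the isolated goal, b is next to the cluster

nearest_a, nearest_b = min_distance(a, goals), min_distance(b, goals)
total_a, total_b = sum_of_distances(a, goals), sum_of_distances(b, goals)
```

The nearest-goal distance cannot tell `a` and `b` apart, while the sum is lower for `b`, the node close to the concentration of goals.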

Front Advancement
- Given either an explicit goal list S_G or a set of distance heuristics to goals or goal groups.
- Instead of measuring the global progress towards the whole goal set, we measure the progress towards each of the goals or goal groups, and prefer steps that lead to progress towards more goals.
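One possible reading of front advancement, sketched under assumptions: for each goal we remember the smallest distance reached by any node on the current search front, and a candidate is scored by the number of goals for which it improves on that distance. The function names, the one-dimensional state space, and the exact scoring rule are hypothetical.

```python
from typing import Callable, Dict, Iterable


def front_distances(front: Iterable[int], goals: Iterable[int],
                    dist: Callable[[int, int], int]) -> Dict[int, int]:
    """Best (smallest) distance from the current search front to each goal."""
    front = list(front)
    return {g: min(dist(n, g) for n in front) for g in goals}


def advancement(candidate: int, best: Dict[int, int],
                dist: Callable[[int, int], int]) -> int:
    """Number of goals the candidate is closer to than the front currently is."""
    return sum(1 for g, d in best.items() if dist(candidate, g) < d)


def line_dist(a: int, b: int) -> int:
    """Toy distance on a number line (an assumption for this example)."""
    return abs(a - b)


best = front_distances(front=[5], goals=[0, 8, 9], dist=line_dist)
score_left = advancement(4, best, line_dist)   # moves towards goal 0 only
score_right = advancement(6, best, line_dist)  # moves towards goals 8 and 9
```

Under this scoring the step to 6 is preferred because it makes progress towards two goals instead of one.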

Yield Heuristic
Definition:
- Deal explicitly with the expected cost and expected benefit of searching from the given node.
- We prefer subgraphs where the cost is low and the benefit is high.
- We would like a high return for our resource investment.
- A heuristic that tries to estimate this return is a yield heuristic.
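As a rough illustration, the return on investment can be written as expected benefit divided by expected cost. The ratio form, the function name, and the numbers below are assumptions made for this sketch; the slides only state that the heuristic estimates the return.

```python
from typing import Dict, Tuple


def yield_estimate(expected_goals: float, expected_cost: float) -> float:
    """Hypothetical yield: expected goals found per unit of resource spent."""
    if expected_cost <= 0:
        raise ValueError("expected_cost must be positive")
    return expected_goals / expected_cost


# Made-up subgraphs: (expected goals, expected cost of searching them).
candidates: Dict[str, Tuple[float, float]] = {
    "A": (3.0, 30.0),
    "B": (2.0, 8.0),
    "C": (5.0, 100.0),
}
ranked = sorted(candidates, key=lambda name: yield_estimate(*candidates[name]), reverse=True)
```

Subgraph C contains the most goals but is expensive to search, so it ranks last; B offers the highest return per unit of cost.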

Yield Heuristic Application
- It can be used in traditional heuristic search algorithms such as best-first search.
- One difference is the stopping criterion: when a goal is encountered, the algorithm collects it and continues.

Multiple-goal best-first search
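The figure for this slide is not part of the transcript. The following is a minimal sketch of a best-first search that collects goals and keeps going, written under assumptions: resources are counted as node expansions, goals are tested when a node is expanded, and `score` is any yield-like estimate where higher is better. The function signature and the toy graph are hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def multiple_goal_best_first(
    start: Hashable,
    successors: Callable[[Hashable], Iterable[Hashable]],
    is_goal: Callable[[Hashable], bool],
    score: Callable[[Hashable], float],
    budget: int,
    goal_quota: Optional[int] = None,
) -> Tuple[List[Hashable], int]:
    """Return (goals found, expansions used).

    Stops when the open list is empty, the expansion budget is spent,
    or `goal_quota` goals have been collected (if a quota is given).
    """
    found: List[Hashable] = []
    seen = {start}
    counter = 0  # tie-breaker so the heap never compares nodes directly
    open_list = [(-score(start), counter, start)]
    expanded = 0
    while open_list and expanded < budget and (goal_quota is None or len(found) < goal_quota):
        _, _, node = heapq.heappop(open_list)
        expanded += 1
        if is_goal(node):
            found.append(node)  # collect the goal and continue searching
        for child in successors(node):
            if child not in seen:
                seen.add(child)
                counter += 1
                heapq.heappush(open_list, (-score(child), counter, child))
    return found, expanded


# Toy graph and made-up yield estimates.
graph = {"s": ["a", "b"], "a": ["g1", "c"], "b": ["g2"], "c": [], "g1": ["g3"], "g2": [], "g3": []}
goal_set = {"g1", "g2", "g3"}
estimate = {"s": 0.0, "a": 0.6, "b": 0.3, "c": 0.0, "g1": 1.0, "g2": 1.0, "g3": 1.0}

by_budget = multiple_goal_best_first("s", graph.__getitem__, goal_set.__contains__, estimate.__getitem__, budget=5)
by_quota = multiple_goal_best_first("s", graph.__getitem__, goal_set.__contains__, estimate.__getitem__, budget=100, goal_quota=1)
```

The `budget` and `goal_quota` parameters correspond to the two stopping criteria listed on the evaluation slide: a fixed resource allocation, or a required number of goals.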

Two Simple Yield Estimations
- Pessimistic estimation
- Optimistic estimation
- A depth limit d can be included in both methods.

Goal Elimination
- Side effect of the yield heuristic: goals that have already been found continue to attract the search front, whereas we would prefer the search to progress towards undiscovered goals.
- Reduce the weight of the subset of discovered goals.
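A minimal sketch of the weighting idea, shown on a sum-of-distances score because it is the simplest place to see the effect. The function name, the `discovered_weight` parameter, and the number-line distance are assumptions for illustration; setting the weight to 0 eliminates discovered goals entirely.

```python
from typing import Callable, Iterable, Set


def weighted_sum_of_distances(
    node: int,
    goals: Iterable[int],
    discovered: Set[int],
    dist: Callable[[int, int], int],
    discovered_weight: float = 0.0,
) -> float:
    """Sum of distances in which already-discovered goals count with a reduced weight."""
    return sum(
        (discovered_weight if g in discovered else 1.0) * dist(node, g)
        for g in goals
    )


def line_dist(a: int, b: int) -> int:
    """Toy distance on a number line (an assumption for this example)."""
    return abs(a - b)


goals = [0, 10]
discovered = {0}  # goal 0 has already been collected

# Without elimination both candidates look equally good.
plain_near_old = weighted_sum_of_distances(1, goals, set(), line_dist)
plain_near_new = weighted_sum_of_distances(8, goals, set(), line_dist)

# With elimination the node near the undiscovered goal wins (lower is better).
elim_near_old = weighted_sum_of_distances(1, goals, discovered, line_dist)
elim_near_new = weighted_sum_of_distances(8, goals, discovered, line_dist)
```

A weight between 0 and 1 would only dampen the pull of discovered goals instead of removing it.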

Learning yield heuristics
- In many domains, such heuristics are very difficult to design.
- We can use a learning approach to acquire such yield heuristics.
- Accumulate partial yield information for every node in the search graph.
- Assume that the yield of the explored part of a subtree is a good predictor of the yield of the unexplored part.
- Accumulate yield statistics for explored nodes.
- Create a set of examples of nodes with high yield and nodes with low yield.
- Apply an induction algorithm to infer a classifier for identifying nodes with high yield.

Inferring from partial yield
- Partial yield of node n at time t
- Needs a predefined depth limit D
- The partial yield is used for the estimation:
  - We use the partial yield of a node to estimate the expected yield of its siblings and their descendants.
  - To compute the partial yield, we keep in the node, for each depth, the number of nodes generated and the number of goals discovered so far up to that depth.
  - When generating a new node, we initialize the counters for the node and recursively update its ancestors up to D levels above.

Inferring from partial yield
- Avoid updating the same ancestor twice when it is reachable through multiple paths: already updated ancestors must be marked.
- The algorithm is shown in Fig. 3.
Partial yield estimation
- The estimated yield of a node is the average yield of its supported children (those with sufficiently large expanded subtrees).
- If there are no such children, the yield is estimated (recursively) by the average yield of the node's parents. The depth values are adjusted appropriately.
- The algorithm is shown in Fig. 4.
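The counter bookkeeping described above can be sketched as follows. This is not a reproduction of Fig. 3: the `Node` class, the choice of cumulative per-depth counters, and the definition of partial yield as goals divided by generated nodes are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

D = 2  # hypothetical depth limit


@dataclass(eq=False)
class Node:
    name: str
    is_goal: bool = False
    parents: List["Node"] = field(default_factory=list)
    # Index d holds counts for the part of the subtree within depth d of this node.
    generated: List[int] = field(default_factory=lambda: [0] * (D + 1))
    goals: List[int] = field(default_factory=lambda: [0] * (D + 1))


def register(node: Node) -> None:
    """Initialize the new node's counters and update ancestors up to D levels above."""
    marked = {id(node)}        # ancestors already updated for this new node
    queue = deque([(node, 0)])  # (ancestor, distance from the new node)
    while queue:
        current, k = queue.popleft()
        for d in range(k, D + 1):
            current.generated[d] += 1
            current.goals[d] += int(node.is_goal)
        if k < D:
            for parent in current.parents:
                if id(parent) not in marked:  # skip ancestors reached by another path
                    marked.add(id(parent))
                    queue.append((parent, k + 1))


def partial_yield(node: Node, d: int) -> float:
    """Assumed definition: goals found per node generated, within depth d."""
    return node.goals[d] / node.generated[d]


# A small graph in which goal g is reachable from the root through both a and b.
root = Node("root")
register(root)
a = Node("a", parents=[root])
register(a)
b = Node("b", parents=[root])
register(b)
g = Node("g", is_goal=True, parents=[a, b])
register(g)
```

Because `root` is marked when it is first reached through `a`, the second path through `b` does not count `g` again; without the marking, `root.generated[2]` would be 5 instead of 4.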

Updating the Partial yield

Yield Estimation Algorithm

Generalizing yield
- Key idea: exploit domain-specific features and use them to induce the estimated yield with an induction algorithm.
- Method: infer the yield function, or simply distinguish between states with high yield and states with low yield.
- Learning cost discussion

Application to WWW domain
- Subject: focused crawling in the Web
- Task: find as many goal pages as possible using limited resources.
- Various other issues related to implementation.

Experimental Methodology
- Most experiments are done in the Web domain.
- The experiments are not run on the real Web, because it:
  - is dynamic and changes over time
  - would need an enormous amount of time
- A reduced Web collection from stanford.edu is used: 350,000 valid and accessible HTML pages.

Experimental Methodology
Algorithms compared:
- Standard BFS
- Best-first: minimal distance
- Best-first: sum and goal elimination
- Best-first: front advancement and goal elimination

Front Advancement in WWW

Learning yield

Combining heuristic and yield

Conclusions
- Presents a new framework for heuristic search: multiple-goal search.
- Introduces the yield heuristic and two methods for online learning of the yield heuristic.
- The framework is applicable to a wide range of problems.

The End Thank You !!!