Reducing Human Interactions in Web Directory Searches
ORI GERSTEL (Cisco), SHAY KUTTEN (Technion), EDUARDO SANY LABER (PUC-Rio), RACHEL MATICHIN and DAVID PELEG (The Weizmann Institute of Science), ARTUR ALVES PESSOA (UFF), CRISTON SOUZA (PUC-Rio)

The big problem Find information in a large collection of data. There are two basic ways to handle information finding. One is a “flat” approach, which views the information as a nonhierarchical structure and provides a query language to extract the relevant data from the database; an example of this approach on the web is the Google search engine. The other method is based on a hierarchical index to the database according to a taxonomy of categories; an example of such an index on the web is Yahoo.

The “small” problem Create the optimal hotlinks for the hierarchical approach. In this approach it is necessary to traverse a path in the taxonomy tree from the root to the desired node in the tree. Human engineering considerations further aggravate this problem, since it is very hard to choose an item from a long list (typical convenient numbers are 7–10). Thus, the degree of the taxonomy tree should be rather low and its average depth, therefore, high. Another problem in the hierarchical approach is that the depth of an item in the taxonomy tree is not based on the access pattern. As a result, items with very high access frequency may require long access paths each time they are needed, while “unpopular” items may still be very accessible in the taxonomy tree. It is desirable to find a solution that does not change the taxonomy tree itself, since this taxonomy is likely to be meaningful and useful to the user.

The solution A partial solution to this problem is currently used on the web, and consists of a list of “hot” pointers which appears at the top level of the index tree and leads directly to the most popular items. We refer to a link from a hotlist to its destination as a hotlink. This approach is not scalable, in the sense that only a small number of items can appear in such a list. In the current article we study a generalization of this “hotlist” approach which allows such lists at multiple levels of the index tree, not just the top level. The resulting structure is termed a hotlink-enhanced index structure (or enhanced structure for short). This article also addresses the optimization problem faced by the index designer, namely, to find a set of hotlinks that minimizes the expected number of links (either tree edges or hotlinks) traversed by a user from the root to a leaf.

The solution cont... There are many applications for such hotlink-enhanced index structures. A partial list includes:
- a web index (such as Yahoo), with multilevel hotlists in which the access statistics are influenced by the accesses of all users to various sites;
- a personalized web index in which the browser records personalized statistics;
- large library index systems;
- application menu trees, which are currently designed by the application developer or statically customized to the needs of a user: by adding hotlists, the application can learn the usage pattern of the user and adjust to changes in this pattern;
- file systems, in which the static tree structure can be augmented by links to frequently accessed subdirectories or files.

Assumptions The model considered here is based on the assumption that the user does not have knowledge of the enhanced index structure. This forces the user to deploy a greedy strategy. In this user model, the user starts at the root of the tree and, at each node in the enhanced structure, can infer from the link labels which of the tree edges or hotlinks available at the current page leads to the page that is closest, in the original tree structure, to the desired destination; the user always chooses that link. In other words, the user is not aware of hotlinks that may exist at other pages in the tree. This means that the user's estimate of the quality of a tree edge or hotlink leading to some node is based solely on that node's position in the original tree.
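To make the greedy user model concrete, here is a minimal sketch in Python. The data layout (a dict of children lists plus a map of hotlinks per node) and the names build_parents, ancestors, and greedy_path are illustrative assumptions, not the paper's code. The sketch restricts the choice to links whose endpoint is an ancestor of the target (or the target itself), a simplifying assumption so the walk always makes progress toward the desired leaf.

```python
def build_parents(children):
    """Parent pointers for an index tree given as {node: [children]}."""
    parent = {}
    for v, kids in children.items():
        for c in kids:
            parent[c] = v
    return parent

def ancestors(node, parent):
    """Nodes on the path from `node` up to the root, inclusive."""
    path = {node}
    while node in parent:
        node = parent[node]
        path.add(node)
    return path

def greedy_path(children, hotlinks, root, target):
    """Sequence of pages a greedy user visits from the root to `target`."""
    parent = build_parents(children)
    on_path = ancestors(target, parent)            # target and its ancestors
    depth = {v: len(ancestors(v, parent)) for v in on_path}
    path, v = [root], root
    while v != target:
        options = children.get(v, []) + hotlinks.get(v, [])
        # among links leading toward the target, take the one that jumps deepest
        v = max((u for u in options if u in on_path), key=lambda u: depth[u])
        path.append(v)
    return path

# Tiny example: root r with children a and b, leaf x under a, one hotlink r -> x.
children = {"r": ["a", "b"], "a": ["x"], "b": []}
hotlinks = {"r": ["x"]}
print(greedy_path(children, hotlinks, "r", "x"))   # ['r', 'x'] -- the hotlink is taken
```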

The algorithm Given a tree T with n nodes representing an index, a hotlink is an edge that does not belong to the tree. The hotlink starts at some node v and ends at (or leads to) some node u that is a descendant of v; we assume without loss of generality that u is not a child of v. Each leaf x of T has a weight p(x), representing the proportion of the user's visits to that leaf relative to the total number of visits. Hence, if normalized, p(x) can be interpreted as the probability that a user wants to access the leaf x. Another parameter of the problem is an integer K, specifying an upper bound on the number of hotlinks that may start at any given node (there is no a priori limit on the number of hotlinks that lead to a given node).

The algorithm Let S be a set of hotlinks constructed on the tree (obeying the bound of K outgoing hotlinks per node) and let D_S(v) denote the greedy path (including hotlinks) from the root to node v. The expected number of operations needed to get to an item is

f(T, p, S) = Σ_x p(x) · |D_S(x)|,

where the sum is over the leaves x of T and |D_S(x)| is the number of links on the greedy path to x. The problem of minimizing this function is referred to as the hotlink enhancement problem. Two different static problems arise, according to whether the probability distribution p is known to us in advance. Assuming a known distribution, our goal is to find a set of hotlinks S which minimizes f(T, p, S) and achieves the optimal cost

f*(T, p) = min_S f(T, p, S).

Such a set is termed an optimal set of hotlinks.
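Continuing the sketch from the Assumptions slide (same illustrative data layout and greedy_path helper), the objective f(T, p, S) can be computed directly by averaging greedy path lengths over the leaf weights:

```python
def expected_cost(children, hotlinks, root, weights):
    """Expected number of links a greedy user traverses to reach a leaf.

    `weights` maps each leaf to its access weight p(x); weights are
    normalized here so they act as probabilities."""
    total = sum(weights.values())
    cost = 0.0
    for leaf, w in weights.items():
        hops = len(greedy_path(children, hotlinks, root, leaf)) - 1
        cost += (w / total) * hops
    return cost

# On the tiny example tree: leaf x costs 2 hops without hotlinks, 1 with r -> x.
print(expected_cost(children, {}, "r", {"x": 1.0}))        # 2.0
print(expected_cost(children, hotlinks, "r", {"x": 1.0}))  # 1.0
```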

The algorithm cont... On the other hand, under the unknown distribution assumption, the worst-case expected access cost for a tree T with a set of hotlinks S is

f(T, S) = max_p f(T, p, S),

where the maximum is taken over all probability distributions p on the leaves, and our goal is to find a set of hotlinks S minimizing f(T, S) and achieving the optimal cost

f*(T) = min_S f(T, S).
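Since f(T, p, S) is linear in p, the maximum over all distributions is attained by putting all the weight on a single most expensive leaf, so the worst-case cost is simply the longest greedy path to any leaf. A one-function sketch, again reusing the illustrative greedy_path helper from above:

```python
def worst_case_cost(children, hotlinks, root, leaves):
    """Worst-case access cost: the longest greedy path to any leaf."""
    return max(len(greedy_path(children, hotlinks, root, x)) - 1 for x in leaves)
```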

The algorithm cont... The algorithm uses dynamic programming, exploiting the greedy user assumption to limit the search. The authors also show how to generalize the solution to trees of arbitrary (even unbounded) degree and to hotlink enhancement schemes that allow up to K hotlinks per node. For an input tree T with n nodes, the running time and space requirements are polynomial in n but exponential in the height of the tree; thus, the algorithm runs in polynomial time for trees of logarithmic depth.
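The paper's dynamic program is not reproduced here. As a companion to the sketches above, the following brute-force reference optimizer (purely illustrative and exponential in the tree size, so usable only on tiny trees) enumerates every assignment of at most one hotlink per node, from a node to a non-child descendant, and keeps the assignment with the smallest expected cost under the greedy user model:

```python
from itertools import product

def descendants(children, v):
    """All proper descendants of v in the index tree."""
    out, stack = [], list(children.get(v, []))
    while stack:
        u = stack.pop()
        out.append(u)
        stack.extend(children.get(u, []))
    return out

def brute_force_hotlinks(children, root, weights):
    """Cheapest K = 1 hotlink assignment found by exhaustive search."""
    nodes = [root] + descendants(children, root)
    # per node: either no hotlink, or one hotlink to a non-child proper descendant
    choices = [[None] + [u for u in descendants(children, v)
                         if u not in children.get(v, [])]
               for v in nodes]
    best_cost, best_set = float("inf"), {}
    for combo in product(*choices):
        hl = {v: [u] for v, u in zip(nodes, combo) if u is not None}
        c = expected_cost(children, hl, root, weights)
        if c < best_cost:
            best_cost, best_set = c, hl
    return best_cost, best_set

# On the tiny example: the optimizer picks the hotlink r -> x, with cost 1.0.
print(brute_force_hotlinks(children, "r", {"x": 1.0}))
```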

Experiments The motivation is twofold:
1. to show that adding hotlinks to websites indeed provides considerable gains in terms of the expected path length experienced by the user;
2. to argue that the algorithm proposed in this work for the hotlink enhancement problem (with known probabilities) allows a practical implementation.
All experiments were executed on a Xeon 2.4 GHz machine with 1GB of RAM, and in all of them we allowed at most one hotlink leaving each node.

Results In order to have a clear idea of the improvement produced by an algorithm, we present our results in terms of the gain metric. The gain of an algorithm for an instance (T, p) is given by (f(T, p, ∅) − f(T, p, S)) / f(T, p, ∅), where S is the set of hotlinks that the algorithm adds to the tree T. Hotlink Assignment to the puc-rio.br Domain. Of special interest in the instance obtained from PUC's domain is the fact that it has actual access probabilities. We note that this instance is one of the biggest we tested, with 17,379 nodes and height H = 8. The optimal solution obtained by the algorithm provides a gain of 18.62%. The algorithm ran in less than 2 seconds and consumed around 10MB of memory.
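The gain metric maps directly onto the cost sketch above (illustrative helper, not the paper's code):

```python
def gain(children, hotlinks, root, weights):
    """Relative reduction in expected access cost produced by a hotlink set."""
    base = expected_cost(children, {}, root, weights)
    return (base - expected_cost(children, hotlinks, root, weights)) / base

print(gain(children, hotlinks, "r", {"x": 1.0}))   # 0.5 on the tiny example
```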

Results Hotlink Assignment to Actual Websites with Zipf's Distribution. A critical aspect of the algorithm is its use of memory for the dynamic programming table. With 23MB, the algorithm provided the optimal solution for 82 out of 84 instances. The 2 hard instances (not solved with 23MB) were those with the greatest heights: one has 10,484 nodes and height 16, requiring 488MB; the other has 512,484 nodes and height 14, requiring more than 1GB.

Conclusion We have presented new exact algorithms for several variations of the problem of assigning hotlinks to hierarchically arranged data, such as web directories. For most of these variations, we have proved that the proposed algorithms run efficiently from a theoretical point of view; in the case of one of the algorithms, the running time is polynomial if the depth of the tree is logarithmic. We have run experiments to evaluate both the efficiency and the efficacy of the algorithm that solves the problem for known distributions and at most one hotlink leaving each node. These experiments show that a significant improvement in the expected number of accesses per search can be achieved in websites using this algorithm. In addition, the proposed algorithm consumed a reasonable amount of computational resources to obtain optimal hotlink assignments.