Published by Rosaline Bates. Modified over 8 years ago.
Model-based Crawling of Rich Internet Applications

M. Emre Dincturk, Suryakant Choudhary, Guy-Vincent Jourdan, Gregor v. Bochmann, Iosif Viorel Onut
School of Electrical Engineering and Computer Science - University of Ottawa

Introduction – RIAs and Crawling

Rich Internet Applications (RIAs) are a new generation of web applications that are more interactive and responsive than traditional web applications. The key factor in RIAs that enhances user interaction is the ability to execute code (such as JavaScript) on the client side (the web browser) to modify the current page into a new one. Client-side code execution is triggered either by user-interaction events (e.g. mouse clicks) or by time events (e.g. time-outs). In addition, using technologies like AJAX (Asynchronous JavaScript and XML), a RIA can communicate with the server asynchronously.

Figure 1. Asynchronous Communication Pattern in RIAs

Example: Figure 2 shows a page in a RIA. The RIA is a file manager for the Web. On this page, a file is currently selected. When the info button (highlighted with a red border) is clicked, the page is modified into a new one (shown in Figure 3). In a RIA, each distinct page is called a client-state (or simply a state) of the application.

Figure 2. A page in a RIA
Figure 3. The page after the info button is clicked

Crawling is the process of automatically exploring a web application to discover the states of the application. A crawling strategy is an algorithm that decides how crawling proceeds. For example, Breadth-First and Depth-First are the standard crawling strategies. The result of crawling is a "model" of the application. A model is a directed graph that represents the discovered states and the connections between them.

Motivation and Aim

Two important motivations for crawling are:
Content Indexing: to make the content of an application searchable by Web users.
Testing: to detect security vulnerabilities, to find accessibility issues, or to test functionality.

Traditional crawlers only follow links (URLs) to discover new pages. As a result, large portions of RIAs, which are only reachable by executing events, are neither searchable nor testable. For RIAs, event-based crawling techniques are needed.

We aim at crawling RIAs efficiently. We define crawling efficiency as the ability to discover the states of the application as soon as possible (using as few events and resets as possible). The standard crawling strategies, Breadth-First and Depth-First, are not efficient for RIAs since they are not designed for the specific needs of RIAs.

Methodology – Model-based Crawling

Our approach to designing efficient crawling strategies is called "Model-based Crawling". Its main idea is:
to define a meta-model, which is a set of assumptions about the behavior of the application,
to design a crawling strategy that is optimized for the case where the application follows these assumptions, and
to specify how to adapt the strategy when crawling applications that do not satisfy these assumptions.

Model-based Crawling Strategies

We currently have three crawling strategies that use the model-based crawling approach. It is important to note that, although each model-based strategy makes some initial assumptions about the application behavior, any RIA (including those that do not follow these assumptions) can be crawled with any model-based strategy.

1. The Hypercube Strategy is based on the Hypercube meta-model, whose assumptions are:
the state reached by a sequence of events from the initial state is independent of the order of the events, and
the enabled events at a state are those at the initial state minus those executed to reach that state.
With these assumptions, the model of an application is assumed to be a hypercube structure. The Hypercube strategy is an optimal strategy for applications that follow the Hypercube meta-model.

Figure 4. A four-dimensional hypercube is the anticipated model for an application whose initial state has 4 events

2. The Menu Strategy is based on the Menu meta-model, whose assumption is that the result of an event execution is independent of the state in which the event is executed: the event always leads to the same resultant state. The Menu strategy first categorizes each event into one of three categories (menu, self-loop and other) and then prioritizes events based on their category and their closeness to the current state.

Figure 5. (top) A model showing two "menu" events (e1 and e2) and a "self-loop" event (e3).
Figure 6. (right) A page in a RIA with menu events (within the red border).

3. The Probability Strategy is based on a statistical model which assumes that an event that was often observed to lead to new states in the past is more likely to lead to new states in the future. The strategy prioritizes events based on their probability of discovering a new state and their closeness to the current state. The probability of an event is estimated dynamically during crawling, using a Bayesian formula [formula not preserved in this text] where S(e) is the number of times the event e discovered a new state, N(e) is the number of times e has been executed from different states, and p_s and p_n are pre-set parameters that set an initial probability.

Implementation

Our algorithms are implemented in a prototype of IBM® Security AppScan® Enterprise. We have also implemented a tool to visualize the extracted model of an application.

Figure 7. Architecture of our RIA crawler
Figure 8. Data-flow diagram for the visualization tool

Results

Crawling Efficiency (costs of discovering all states)
Figure 9. Performance of different strategies in discovering the states of 4 applications (log scale)

Costs of Complete Crawls
Table 1. The costs at the time the crawl terminates
Cost of Reset: 8, 18, 2, 2

Conclusion & Future Work

The experimental results show that the model-based crawling strategies are more efficient than the other existing crawling strategies. Some of the areas that we are currently working on are:
techniques that will allow us to define a meta-model for the application during crawling,
crawling algorithms for complex RIAs, which are applications that have a very large state space and cannot be crawled exhaustively in a reasonable time (such as widget-based applications or applications that display a large catalogue of similar items),
techniques to prevent a crawler from being stuck in a particular part of the application, so that a bird's-eye view of the application can be obtained earlier, and
improving our crawler prototype so that it will be capable of crawling a large number of real RIAs.

Acknowledgments

This work is supported in part by IBM and the Natural Sciences and Engineering Research Council of Canada.

DISCLAIMER

The views expressed in this poster are the sole responsibility of the authors and do not necessarily reflect those of the Center for Advanced Studies of IBM.
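As an illustration of the Probability strategy described under Model-based Crawling Strategies, the sketch below shows one way the per-event estimate could be maintained during a crawl. The formula itself is not preserved in this text, so the form used here, P(e) = (S(e) + p_s) / (N(e) + p_n), is an assumption chosen to be consistent with the stated definitions of S(e), N(e), p_s and p_n; it should not be read as the authors' exact formula. The names EventStats, estimate_probability and record_execution, and the default values p_s = 1 and p_n = 2, are hypothetical and serve only this example.

```python
from dataclasses import dataclass


@dataclass
class EventStats:
    """Counters for a single event e (hypothetical helper type)."""
    new_states: int = 0   # S(e): executions of e that discovered a new state
    executions: int = 0   # N(e): executions of e from different states


def estimate_probability(stats: EventStats, p_s: float = 1.0, p_n: float = 2.0) -> float:
    """Assumed estimate: P(e) = (S(e) + p_s) / (N(e) + p_n).

    p_s and p_n are the pre-set parameters; with no observations the
    estimate equals the initial probability p_s / p_n. The defaults
    are illustrative assumptions, not values taken from the poster.
    """
    return (stats.new_states + p_s) / (stats.executions + p_n)


def record_execution(stats: EventStats, discovered_new_state: bool) -> None:
    """Update the counters after executing the event once."""
    stats.executions += 1
    if discovered_new_state:
        stats.new_states += 1


if __name__ == "__main__":
    stats = EventStats()
    print("initial estimate:", estimate_probability(stats))
    # One execution that finds a new state, then two that do not.
    for found_new in (True, False, False):
        record_execution(stats, found_new)
    print("estimate after three executions:", estimate_probability(stats))
```

Under this assumed form, an event that keeps failing to reveal new states sees its estimate fall as N(e) grows, so a crawler that ranks events by the estimate (together with their closeness to the current state) would gradually deprioritize it.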