Crawling Rich Internet Applications: The State of the Art. Software Security Research Group (SSRG), University of Ottawa, in collaboration with IBM.

Presentation transcript:

Crawling Rich Internet Applications: The State of the Art
Software Security Research Group (SSRG), University of Ottawa, in collaboration with IBM
Suryakant Choudhary, M. Emre Dincturk, Seyed M. Mirtaheri, Ali Moosavi, Gregor von Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut
CASCON 2012, November 5, 2012

Overview
Introduction
  ▫ The evolution of Web applications and crawling
Crawling RIAs
  ▫ Challenges and common assumptions
Research on crawling RIAs
  ▫ Crawling for indexing
  ▫ Crawling for testing
  ▫ Research on crawling strategies
    ▪ Greedy Strategy
    ▪ Model-Based Crawling
    ▪ Hypercube Strategy
    ▪ Probability Strategy
    ▪ Menu Strategy
Experimental results
Future of RIA crawling

Introduction - Traditional Web Applications
  ▫ HTML pages identified by a URL
  ▫ Synchronous communication

Introduction - Rich Internet Applications (1)
Client-side code (JavaScript) execution
The page can be modified by the client-side code.
Document Object Model (DOM): a tree data structure that represents the page in the client.
  ▪ Events: occurrences that cause code execution (mouse click, timeout, etc.)

Introduction - Rich Internet Applications (2)
Asynchronous communication (AJAX)

Introduction - Crawling (1)
Crawling: exploring an application automatically
Motivations
  ▫ Content indexing (by search engines)
  ▫ Testing (for security, accessibility, functionality)
Objectives
  ▫ Find all (or the ‘important’) pages
  ▫ Find the connections between the pages (obtaining a complete model of the application, for example for page ranking)

Introduction - Crawling (2)
Crawling extracts “a model” of the application
  ▫ States are the “distinct” pages
  ▫ Transitions are the connections between the states
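The model described on this slide can be pictured as a directed graph. The following is a minimal sketch in Python, not the authors' implementation; the class name `Model`, its fields, and the state and event names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Directed graph: states are distinct pages, transitions are labelled by events."""

    initial: str
    # transitions[state][event] = state reached by executing the event
    transitions: dict = field(default_factory=dict)

    def add_transition(self, source: str, event: str, target: str) -> None:
        """Record that executing `event` in `source` leads to `target`."""
        self.transitions.setdefault(source, {})[event] = target
        self.transitions.setdefault(target, {})

    @property
    def states(self) -> set:
        """All states seen so far, including the initial one."""
        return set(self.transitions) | {self.initial}


# Hypothetical two-page application.
model = Model(initial="home")
model.add_transition("home", "click_news", "news")
model.add_transition("news", "click_home", "home")
print(sorted(model.states))
```

A crawler would fill such a structure incrementally, adding one transition per executed event.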

Crawling RIAs
RIAs have events that change the page without changing the URL.
  ▫ One URL -> many states
The aim is to find all the states reachable from a given URL by executing events.
  ▫ Initial state: the state reached by loading the URL
  ▫ Reset: loading the URL to go back to the initial state
An event’s behaviour may depend on the state in which it is executed.
Therefore, all the enabled events of each state have to be executed in that state.
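The requirement to execute every enabled event in every state, using resets to return to the initial state, can be sketched as a simple exhaustive crawl. This is an illustrative breadth-first loop over a toy application, assuming a hypothetical `ToyBrowser` interface; a real crawler would drive an actual browser and identify states by their DOM rather than by a ready-made name.

```python
from collections import deque

# Hypothetical toy application: state -> {event: resulting state}.
APP = {
    "s0": {"a": "s1", "b": "s2"},
    "s1": {"a": "s1", "b": "s3"},
    "s2": {"a": "s3"},
    "s3": {},
}


class ToyBrowser:
    """Stand-in for a real browser; the crawler only uses this interface."""

    def __init__(self):
        self.current = "s0"
        self.resets = 0
        self.events_executed = 0

    def reset(self):
        """Reload the URL, which brings the application back to the initial state."""
        self.current = "s0"
        self.resets += 1

    def enabled_events(self):
        return sorted(APP[self.current])

    def execute(self, event):
        self.current = APP[self.current][event]
        self.events_executed += 1
        return self.current


def go_to(browser, path):
    """Reach a known state by a reset followed by a replay of its event path."""
    browser.reset()
    for event in path:
        browser.execute(event)


def crawl(browser):
    """Execute every enabled event of every reachable state; return the model."""
    browser.reset()
    paths = {browser.current: []}  # event sequence from the initial state
    model = {}
    queue = deque([browser.current])
    while queue:
        state = queue.popleft()
        go_to(browser, paths[state])
        model[state] = {}
        for event in browser.enabled_events():
            if browser.current != state:  # the previous event moved us away
                go_to(browser, paths[state])
            target = browser.execute(event)
            model[state][event] = target
            if target not in paths:
                paths[target] = paths[state] + [event]
                queue.append(target)
    return model


browser = ToyBrowser()
model = crawl(browser)
print(model == APP, browser.events_executed, browser.resets)
```

The counters `events_executed` and `resets` show how much work is spent on merely returning to a state, which is the overhead that the crawling strategies discussed later try to reduce.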

Crawling RIAs – Challenges and Assumptions
State identification
  ▫ A state needs to be identified by its DOM.
  ▫ A DOM equivalence relation is needed.
Event identification
Assumption: no server-side states
Assumption: finite representative user inputs
Intermediate states
Efficiency of crawling strategies
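One possible DOM equivalence relation, shown here only as an assumption for illustration, treats two DOMs as the same state when their tag structure is identical, ignoring text and attribute values. The slides do not say which relation the authors use; the function name `state_id` is hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _StructureCollector(HTMLParser):
    """Collects only the tag structure, ignoring text and attribute values."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def state_id(dom_html: str) -> str:
    """Hash of the tag structure; equal hashes mean 'equivalent' DOMs here."""
    collector = _StructureCollector()
    collector.feed(dom_html)
    collector.close()
    return hashlib.sha256("".join(collector.tokens).encode()).hexdigest()


same = state_id("<div><p>Hello</p></div>") == state_id("<div><p>Bye</p></div>")
different = state_id("<div><p>Hello</p></div>") != state_id("<div><span>Hello</span></div>")
print(same, different)
```

A stricter relation (for example, one that also compares text) yields more states, while a looser one yields fewer; the choice depends on the purpose of the crawl.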

Crawling RIAs for Indexing
Duda et al. [1][2][3]
  ▫ Use a Breadth-First crawling strategy
  ▫ Introduced AjaxRank [3]: an adaptation of PageRank to RIAs to sort the results of a search query
Mesbah et al. [4][5] introduced “Crawljax”
  ▫ Uses a strategy similar to Depth-First
  ▫ Outputs static HTML snapshots of the discovered DOMs, which can be indexed by search engines

Crawling RIAs for Testing
Crawljax is also used for testing RIAs
  ▫ Regression testing of Ajax applications [6]
  ▫ Security testing of web widget interactions [7]
  ▫ Invariant-based testing of Ajax applications [8]
Marchetto et al. [9]: testing to reveal faulty behaviour
  ▫ Combines analysis of user traces, static analysis of the code, and human validation to produce a model of an application
Amalfitano et al. [10][11][12][13]
  ▫ Modeling and testing based on user execution traces obtained by
    ▪ User sessions and/or
    ▪ Automated trace generation using a Depth-First strategy

Crawling Strategies for RIAs
Crawling strategy: an algorithm that decides which event should be explored next.
  ▫ An efficient strategy discovers the states as soon as possible (our definition)
  ▫ The time to find all the states is roughly proportional to the number of events executed and resets used during crawling
The standard strategies used in the research mentioned above, Breadth-First and Depth-First, are not efficient for RIAs.
  ▫ They make no predictions about event outcomes.
  ▫ Their strict order of state exploration leads to more event executions and resets, which are needed to transfer from the current state to the state being explored.

Research on Crawling Strategies for RIAs
Greedy Strategy [14]
  ▫ A simple strategy that gives priority to the event closest to the current state
  ▫ Tries to minimize the transfer sequences, but still makes no prediction of event outcomes
Model-Based Crawling Strategies
  ▫ Hypercube Strategy [15]
  ▫ Probability Strategy [16]
  ▫ Menu Strategy [17]
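The greedy choice of "the event closest to the current state" can be sketched as a breadth-first search over the transitions discovered so far. The function name, the argument layout, and the example states are assumptions made for this sketch, not taken from the cited work.

```python
from collections import deque


def closest_unexecuted(current, known, unexecuted):
    """Return (transfer_path, event) for the nearest state with an unexecuted event.

    known:      state -> {event: target} for events already executed
    unexecuted: state -> list of events not executed yet
    Returns None when no such state is reachable without a reset.
    """
    seen = {current}
    queue = deque([(current, [])])
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            return path, unexecuted[state][0]
        for event, target in known.get(state, {}).items():
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [event]))
    return None


# Hypothetical partial model: s0 -a-> s1 -b-> s2.
known = {"s0": {"a": "s1"}, "s1": {"b": "s2"}}
print(closest_unexecuted("s0", known, {"s1": ["x"], "s2": ["c"]}))
print(closest_unexecuted("s0", known, {"s2": ["c"]}))
```

When the search returns `None`, a crawler in this style would typically reset and repeat the search from the initial state.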

Model-Based Crawling
Meta-model: the assumed structure of the application
The crawling strategy is optimized for the case that the application follows these assumptions.
Adaptation of the strategy: the crawling strategy must be able to deal with applications that do not satisfy these assumptions.

The Hypercube Strategy
The Hypercube Meta-Model anticipates that the application has a hypercube model.
The Hypercube Strategy is an “optimal” strategy for this meta-model.
Example: 4-dimensional hypercube
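To make the 4-dimensional example concrete, the sketch below builds a hypercube-shaped model under the assumption that each state is characterized by the set of events executed so far and that the events are independent of each other. The function name and the event names are hypothetical.

```python
from itertools import combinations


def hypercube_model(events):
    """Build state -> {event: next state}, with states as sets of executed events."""
    model = {}
    for k in range(len(events) + 1):
        for subset in combinations(events, k):
            state = frozenset(subset)
            # Each event not yet executed leads to the state that includes it.
            model[state] = {e: state | {e} for e in events if e not in state}
    return model


model = hypercube_model(["e1", "e2", "e3", "e4"])
n_states = len(model)
n_transitions = sum(len(out) for out in model.values())
print(n_states, n_transitions)
```

Under this assumption a hypercube of dimension n has 2^n states and n * 2^(n-1) transitions, so the state space grows quickly with the number of independent events.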

The Probability Strategy
Prioritizes events based on their probability of discovering a new state
  ▪ N(e) = number of executions of event e
  ▪ S(e) = number of new states found by e
  ▪ Bayesian formula for P(e)
  ▪ p_S = 1 and p_N = 2 -> initial probability = 0.5
Aim: choose an event e to explore such that
  ▫ P(e) is high
  ▫ The transfer sequence from the current state to a state where e is unexecuted is short
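The formula itself did not survive in the transcript. One form that is consistent with the quantities on this slide, and with the stated initial probability of 0.5, is P(e) = (S(e) + p_S) / (N(e) + p_N); the sketch below assumes this form, and the function name is hypothetical.

```python
P_S = 1  # prior pseudo-count for "new state" outcomes (value from the slide)
P_N = 2  # prior pseudo-count for executions (value from the slide)


def probability(new_states: int, executions: int) -> float:
    """Estimated probability that the event discovers a new state (assumed form)."""
    return (new_states + P_S) / (executions + P_N)


print(probability(0, 0))  # an event that was never executed
print(probability(0, 2))  # executed twice, found nothing new
print(probability(3, 3))  # executed three times, new state each time
```

With this estimate, an event that keeps leading to known states loses priority gradually, while an untried event starts at the neutral value of 0.5.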

The Menu Strategy
The Menu Meta-Model defines three categories of events:
  ▫ 1. Menu event: leads to the same state regardless of where it is executed (e1 and e2)
  ▫ 2. Self-loop event: does not cause any state change (e3)
  ▫ 3. Other event: an event that is neither of the above
Simple example:
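The three categories can be illustrated by classifying an event from the transitions observed for it so far. This is a sketch under the assumption that observations are available as (source state, target state) pairs; the function name and the "unknown" label for an unobserved event are additions for illustration.

```python
def classify(observations):
    """Classify one event from its observed (source_state, target_state) pairs."""
    if not observations:
        return "unknown"
    if all(src == dst for src, dst in observations):
        return "self-loop"  # never changed the state
    targets = {dst for _, dst in observations}
    if len(targets) == 1:
        return "menu"  # always led to the same state
    return "other"


print(classify([("s0", "s1"), ("s2", "s1")]))
print(classify([("s0", "s0"), ("s1", "s1")]))
print(classify([("s0", "s1"), ("s1", "s2")]))
```

A classification based on few observations is only a hypothesis; further executions of the same event from other states can confirm or contradict it.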

Experimental Results - Strategies
We compare the performance of the model-based strategies with
  ▫ The Optimized Breadth-First Strategy
  ▫ The Optimized Depth-First Strategy
  ▫ The Greedy Strategy (explore the event closest to the current state)
  ▫ Optimal (calculated when the model is known)

Experimental Results - Applications
Real applications
  ▫ Periodic Table (local version)
  ▫ Clipmarks (local version)
Test applications
  ▫ TestRIA
  ▫ Altoro Mutual

Experimental Results – Measuring Efficiency
The efficiency of a strategy is measured by the cost of discovering the states, which is based on the number of events executed and the resets used.
Before crawling an application, we measure the average event execution time and the average reset time for the application.
For simplicity, we assume each event has the same unit cost (the average event execution time).
The cost of a reset is defined in terms of the event execution cost.
The cost of a strategy is calculated as
  (#events executed) + (#resets used) * (cost of reset)
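The cost measure translates directly into a small calculation. In the sketch below the function names and all numbers are hypothetical and serve only to show how the formula is applied.

```python
def reset_cost_in_events(avg_reset_time: float, avg_event_time: float) -> float:
    """Express the cost of one reset in units of one event execution."""
    return avg_reset_time / avg_event_time


def crawl_cost(events_executed: int, resets_used: int, reset_cost: float) -> float:
    """Cost of a strategy: (#events executed) + (#resets used) * (cost of reset)."""
    return events_executed + resets_used * reset_cost


# Hypothetical measurements: a reset takes 800 ms, an event 100 ms.
reset_cost = reset_cost_in_events(800, 100)
total = crawl_cost(events_executed=1000, resets_used=50, reset_cost=reset_cost)
print(reset_cost, total)
```

Expressing everything in event units makes strategies comparable on the same application even when resets are much more expensive than events.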

Experimental Results – Crawling Efficiency
[Figure: four plots of crawling cost per strategy, in logarithmic scale. Panel labels, in order: Cost of Reset 8, Cost of Reset 18, Cost of Reset 2, Cost of Reset 2.]

Future of RIA Crawling
Avoid new states without new information
  ▫ Automatically identify the parts of a page that can be crawled independently, to reduce state space explosion
Adaptive crawling
  ▫ Decide on the meta-model for the application during crawling
Greater diversity
  ▫ Try to get a bird’s-eye view of the application model as soon as possible
Distributed crawling
  ▫ Crawl applications using multiple processes running concurrently, to reduce crawling time

References
[1] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl: Making ajax applications searchable,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89, IEEE Computer Society, 2009.
[2] C. Duda, G. Frey, D. Kossmann, and C. Zhou, “Ajax search: crawling, indexing and searching web 2.0 applications,” Proc. VLDB Endow., vol. 1, pp. 1440–1443, Aug.
[3] G. Frey, “Indexing ajax web applications,” Master’s thesis, ETH Zurich, 2007.
[4] A. Mesbah, E. Bozdag, and A. v. Deursen, “Crawling ajax by inferring user interface state changes,” in Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134, IEEE Computer Society.
[5] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web applications through dynamic analysis of user interface state changes,” TWEB, vol. 6, no. 1, p. 3.
[6] D. Roest, A. Mesbah, and A. van Deursen, “Regression testing ajax applications: Coping with dynamism,” in ICST, pp. 127–136, IEEE Computer Society.
[7] C.-P. Bezemer, A. Mesbah, and A. van Deursen, “Automated security testing of web widget interactions,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09.
[8] A. Mesbah and A. van Deursen, “Invariant-based automatic testing of ajax user interfaces,” in Software Engineering, ICSE. IEEE 31st International Conference on, pp. 210–220, May.
[9] A. Marchetto, P. Tonella, and F. Ricca, “State-based testing of ajax web applications,” in Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130, IEEE Computer Society.

References
[10] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Reverse engineering finite state machines from rich internet applications,” in Proceedings of the Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73, IEEE Computer Society.
[11] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Rich internet application testing using execution trace data,” in Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283, IEEE Computer Society.
[12] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “An iterative approach for the reverse engineering of rich internet application user interfaces,” in Proceedings of the 2010 Fifth International Conference on Internet and Web Applications and Services, ICIW ’10, pp. 401–410, IEEE Computer Society.
[13] D. Amalfitano, A. R. Fasolino, A. Polcaro, and P. Tramontana, “Dynaria: A tool for ajax web application comprehension,” in ICPC, pp. 46–47, IEEE Computer Society.
[14] Z. Peng, N. He, C. Jiang, Z. Li, L. Xu, Y. Li, and Y. Ren, “Graph-based ajax crawl: Mining data from rich internet applications,” in Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on, vol. 3, pp. 590–594, March.
[15] K. Benjamin, G. v. Bochmann, M. E. Dincturk, G.-V. Jourdan, and I. V. Onut, “A strategy for efficient crawling of rich internet applications,” in Proceedings of the 11th international conference on Web engineering, ICWE ’11.
[16] M. E. Dincturk, S. Choudhary, G. v. Bochmann, G.-V. Jourdan, and I. V. Onut, “A statistical approach for efficient crawling of rich internet applications,” in Proceedings of the 12th international conference on Web engineering, ICWE ’12.
[17] S. Choudhary, “M-crawler: Crawling rich internet applications using menu meta-model,” Master’s thesis, EECS, University of Ottawa.

Thank You