Feature Detection in Ajax-enabled Web Applications. Natalia Negara, Nikolaos Tsantalis, Eleni Stroulia. 17th European Conference on Software Maintenance and Reengineering, Genova, Italy.

Presentation transcript:

Feature Detection in Ajax-enabled Web Applications
Natalia Negara, Nikolaos Tsantalis, Eleni Stroulia
17th European Conference on Software Maintenance and Reengineering, Genova, Italy

Motivation
– Sitemaps and meta-keywords are essential for communicating the content of a website to its users and to search engines.
– Extracting them manually is a tedious and quite subjective task.
– Current solutions support static and dynamic websites in which each state is identified by a unique URL (URL-based approaches).

The problem
– In traditional dynamic web applications, the states are explicit and correspond to pages with unique URLs.
– Ajax allows the DOM tree rendered in the client’s browser to be manipulated dynamically.
– As a result, user actions may generate multiple DOM states under the same URL.

The idea
Why don’t we group the DOM states based on their structural similarity?
Assumptions
– DOM states that are strongly similar should be part of the same feature.
– After clustering the DOM states, each cluster should represent a unique feature.
However, determining similarity is not always easy.

[Figure: /petstore/faces/search.jsp]

String-edit distance
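The slide carries only its title, so the following is a minimal sketch of what a string-edit distance between two serialized DOM states could look like. It uses the classic Levenshtein formulation, which is an assumption: the slides do not say which variant was used. The function names and the normalization by the longer string are illustrative choices.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as b
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string
    (an assumed normalization, chosen for illustration)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest
```

Because such a measure works on characters, a single inserted subtree with many nodes can dominate the distance even when the rest of the page is unchanged.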

Tree-edit distance
Apply a tree-alignment algorithm that takes two trees T1 and T2 as input and computes an edit script transforming T1 into T2.
The node-level edit operations in the edit script are:
– Match
– Change
– Insert
– Remove
– Move

Tree-edit distance
[Figure: example with the labels "inserted" and "changed"; 1 node, 76 nodes]

Tree-edit distance

What is a composite edit operation?
[Figure: a composite edit operation, with the labels "inserted" and "changed"]

Composite-change-aware Tree-edit distance

Feature extraction process
1. Collect DOM states (Crawljax)
2. Compute distances between states
3. Apply clustering (hierarchical agglomerative)
4. Determine the number of clusters (1. L method, 2. Silhouette)
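The clustering step of this process can be sketched as follows. This is a minimal illustration of hierarchical agglomerative clustering over a precomputed distance matrix, not the authors' implementation. Average linkage is an assumption (the slides do not name the linkage criterion), and `average_linkage` is a hypothetical helper name.

```python
from itertools import combinations


def average_linkage(dist, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances
    between DOM states. Returns a sorted list of clusters, each a sorted
    list of state indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(dist))]  # start with one state per cluster

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs (assumed criterion).
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)
```

The value of `k` is left as a parameter here because choosing it is a separate step of the process.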

L method [Salvador & Chan, 2004]
[Figure: plot with the "knee" marked]
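As a rough illustration of the L method, the sketch below looks for the knee of an evaluation curve by fitting one straight line to the points left of a candidate split and another to the points right of it, and keeping the split with the lowest size-weighted RMSE. The function names and input format are assumptions; in a clustering setting, `xs` would typically be candidate numbers of clusters and `ys` the corresponding merge distances.

```python
import math


def _fit_rmse(xs, ys):
    """RMSE of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    squared = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / n)


def l_method_knee(xs, ys):
    """Return the x value at the knee of the curve; xs must be sorted ascending."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four points: two per fitted line")
    best_x, best_error = None, math.inf
    for split in range(2, n - 1):  # left = first `split` points, right = the rest
        left = _fit_rmse(xs[:split], ys[:split])
        right = _fit_rmse(xs[split:], ys[split:])
        # Weight each side's error by its share of the points.
        error = (split / n) * left + ((n - split) / n) * right
        if error < best_error:
            best_x, best_error = xs[split - 1], error
    return best_x
```

The returned value is the last x of the left-hand line, which is then read as the suggested number of clusters.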

Silhouette

Silhouette coefficient
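The slide's formula did not survive in the transcript. For reference, the standard silhouette width of a state i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The sketch below averages s(i) over all states. The function name and the input format (a distance matrix plus a list of cluster labels) are assumptions made for illustration.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette width over all states.

    dist is a symmetric distance matrix; labels[i] is the cluster id of state i.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton contributes s(i) = 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        denominator = max(a, b)
        if denominator > 0:  # all-zero distances contribute 0
            total += (b - a) / denominator
    return total / len(labels)
```

Values close to 1 indicate compact, well-separated clusters, so one way to pick the number of clusters is to cut the dendrogram at each candidate k and keep the k with the highest coefficient.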

Evaluation Process
1. Select Ajax web apps
2. Create reference sets of features
3. Perform feature extraction (using 3 distance metrics)
4. Compare the results (Precision, Recall, F-measure)
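The slides do not say how precision, recall and F-measure were computed for a clustering. One common option, shown here purely as an assumption, is to count pairs of states: a pair is a true positive when both the extracted clustering and the reference set place its two states in the same group. The function name `pairwise_scores` is hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, reference):
    """Pairwise precision, recall and F-measure of a clustering.

    predicted and reference are lists of groups, each group an iterable
    of state identifiers. This pair-counting scheme is an assumed
    evaluation method, not necessarily the one used in the presentation.
    """
    def same_group_pairs(groups):
        return {pair for group in groups
                for pair in combinations(sorted(group), 2)}

    predicted_pairs = same_group_pairs(predicted)
    reference_pairs = same_group_pairs(reference)
    common = len(predicted_pairs & reference_pairs)

    precision = common / len(predicted_pairs) if predicted_pairs else 0.0
    recall = common / len(reference_pairs) if reference_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```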

Google Maps
                      # DOM nodes    DOM depth
Average               968            20
Median                [missing]      [missing]
Standard deviation    [missing]      [missing]

Garage.ca
                      # DOM nodes    DOM depth
Average               547            12
Median                540            12
Standard deviation    [missing]      [missing]

Kayak
                      # DOM nodes    DOM depth
Average               894            26
Median                [missing]      [missing]
Standard deviation    [missing]      [missing]

Reference sets
– Screenshots were taken of the DOM states as rendered in the browser.
– Two authors independently grouped the DOM states according to their visual perception of the feature each state takes part in.
– The initial agreement was [value missing] for Garage, [value missing] for Google Maps and [value missing] for Kayak.
– The results were merged by decomposing higher-level features into lower-level ones.

Evaluation Results
[Table, values missing: one row per distance measure (d_S, d_C, d_L); for each of the L method and Silhouette, the columns are k, F-measure, Precision and Recall]
Actual number of clusters: 4

Evaluation Results
[Table, values missing: one row per distance measure (d_S, d_C, d_L); for each of the L method and Silhouette, the columns are k, F-measure, Precision and Recall]
Actual number of clusters: 19

Evaluation Results
[Table, values missing: one row per distance measure (d_S, d_C, d_L); for each of the L method and Silhouette, the columns are k, F-measure, Precision and Recall]
Actual number of clusters: 16

Conclusions & Future Work
We have shown that a composite-change-aware tree-edit distance is better suited to detecting features in Ajax-enabled web applications than traditional distance measures.
Future work: automatic labeling of the discovered features
– By applying Latent Dirichlet allocation (LDA) to the textual content of the DOM states in each group.
– By giving higher weight to text inside headings and other emphasis-related tags.

Thank you!