“An Approach to Identify Duplicated Web Pages” G. Lucca, M. Penta, A. Fasolino Compsac’02 pp.481-486 Today presented by Kenny Kwok.

Presentation transcript:

Why do we need this?
- Web pages are loosely organized
- They are usually coded incrementally
- New pages are often written by reusing the code of existing pages (copy & paste)
- Inline documentation is usually lacking

Why do we need this? (continued)
With techniques to identify duplicated web pages:
- Testing becomes feasible
- Web page maintenance becomes more efficient
- Possible plagiarism can be detected
Duplicated code => clones
Two or more pages are considered clones if:
- They have the same, or a very similar, structure, or
- They are characterized by the same values of the defined metrics

Types of Web Pages
- Server pages
  - Stored on the web server
  - May contain server-side scripts
- Client pages
  - Static pages: saved in a file with permanent content
  - Dynamic pages: built by the server at run time
The paper covers only static pages and server-side scripts. Since its results on server-side scripts are not conclusive, we discuss static pages only.

How to detect duplicated Web pages?
Two proposed approaches:
- Levenshtein distance (edit distance)
- Occurrence frequency

Levenshtein distance
- A.k.a. edit distance
- The minimum number of insertions, deletions, and substitutions needed to transform one string into the other
- Requires O(n^2) computation time, where n is the size of the longer string
- For example, take the strings u and v, aligned as:
    u = ABCDEFG
    v = A--DE-G
- The Levenshtein distance between u and v is D(u, v) = 3 (delete B, C, and F)
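The definition above can be sketched in code. This is a minimal dynamic-programming version of the standard algorithm, not code from the paper; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(u, v):
    """Return the minimum number of single-symbol insertions, deletions,
    and substitutions needed to turn sequence u into sequence v."""
    # Row 0: turning the empty prefix of u into the first j symbols of v
    # takes j insertions.
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        # Column 0: turning the first i symbols of u into the empty
        # string takes i deletions.
        curr = [i]
        for j, b in enumerate(v, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("ABCDEFG", "ADEG"))  # 3
```

The two nested loops visit every pair of positions once, which is where the quadratic running time comes from. Only two rows of the table are kept, so memory stays linear in the length of v.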

Levenshtein distance of Web Pages
- Alphabet symbols: HTML tags (/div, /td, td, img, div, etc.)
- Extract the tags and replace each one with a symbol of the alphabet (e.g. /div -> a, /td -> b, ...)
- This translates the web page into an "HTML-string" composed of those symbols
- The Levenshtein distance between two pages is then the distance between their corresponding HTML-strings
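A rough sketch of the translation step, using Python's standard `html.parser`. The names `TagCollector` and `html_string` are hypothetical, and the alphabet is an assumption: the slides do not give the paper's table, so symbols are simply assigned in order of first appearance.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def html_string(page, alphabet):
    """Translate a page into an HTML-string.

    `alphabet` maps tag -> symbol and is extended in place, so the same
    dict must be shared by all pages that will be compared.
    ASSUMPTION: symbols are assigned on first sight, starting at "a";
    this is not the paper's alphabet table.
    """
    collector = TagCollector()
    collector.feed(page)
    collector.close()
    symbols = []
    for tag in collector.tags:
        if tag not in alphabet:
            alphabet[tag] = chr(ord("a") + len(alphabet))
        symbols.append(alphabet[tag])
    return "".join(symbols)


alphabet = {}
print(html_string("<div><table><tr><td>x</td></tr></table></div>", alphabet))  # abcdefgh
print(html_string("<div><td></td></div>", alphabet))                           # adeh
```

Text content and attributes are ignored, so two pages with the same tag structure but different wording map to the same HTML-string.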

Levenshtein distance (example)
With the following HTML alphabet table:
[alphabet table not preserved in the transcript]
- HTML-string u = hifgieb
- HTML-string v = hidcfgieab

Levenshtein distance (example, continued)
- The optimal alignment of u and v is:
    u = hi--fgie-b
    v = hidcfgieab
- The Levenshtein distance is D(u, v) = 3 (insert d, c, and a)
- Pages are considered duplicated (similar) if their distance is small
- But the paper does not define quantitatively what is meant by "small"

Problems and possible improvements
- May detect misleading similarities
  - Caused by the sequence of HTML attributes
  - False positives: different pages end up with a small distance value
- Suggestion: substitute each composite tag in alphabet A with its equivalent tag in a new alphabet A'
  - But the paper gives no further detail about the alphabet A'

Problems and possible improvements (continued)
- May fail to detect meaningful similarities
  - Caused by different tags of a similar nature, e.g. formatting tags (H1, H2, H3)
- Suggestion: define an alphabet A'' of formatting tags, and remove from the HTML-string every symbol that belongs to A''
  - Again, the paper gives no further detail about the alphabet A''
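The second suggestion could look roughly like this. Since the contents of A'' are not given, `FORMATTING_TAGS` below is an assumption containing only the three tags named on the slide, and `strip_formatting` is a hypothetical helper that works on a list of tag names before they are mapped to symbols.

```python
# ASSUMPTION: stand-in for the alphabet A''; only the tags named on the
# slide are listed, the paper's actual set is not known from the slides.
FORMATTING_TAGS = {"h1", "h2", "h3"}


def strip_formatting(tags, formatting=FORMATTING_TAGS):
    """Drop opening and closing formatting tags from a tag sequence."""
    return [t for t in tags if t.lstrip("/") not in formatting]


page_a = ["div", "h1", "/h1", "p", "/p", "/div"]
page_b = ["div", "h2", "/h2", "p", "/p", "/div"]
print(strip_formatting(page_a) == strip_formatting(page_b))  # True
```

After filtering, the two pages differ only in heading level and so yield identical sequences, which is the kind of similarity the unfiltered comparison would understate.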

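Occurrence frequency
- Each page is represented by its HTML-array, which holds the occurrence count of each HTML tag
- Pages are compared by the Euclidean distance between their HTML-arrays: $ED(u, v) = \sqrt{\sum_{i} (u_i - v_i)^2}$, where $u_i$ and $v_i$ are the counts of the i-th tag
- Much faster to compute
- May identify all the clones found by the previous method
- More likely to report false-positive clones
- Again, the paper does not state the clone criterion in terms of the value of ED; there is no clue how "small" it should be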
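A small sketch of the frequency measure, assuming pages have already been translated into HTML-strings. The function name `euclidean_distance` is chosen here; `collections.Counter` plays the role of the HTML-array.

```python
import math
from collections import Counter


def euclidean_distance(u, v):
    """Euclidean distance between the symbol-count vectors of u and v."""
    cu, cv = Counter(u), Counter(v)
    # A Counter returns 0 for symbols it has not seen, so symbols that
    # occur in only one of the two strings are handled naturally.
    return math.sqrt(sum((cu[s] - cv[s]) ** 2 for s in set(cu) | set(cv)))


print(euclidean_distance("hifgieb", "hidcfgieab"))  # sqrt(3), about 1.732
print(euclidean_distance("ab", "ba"))               # 0.0
```

The second call shows why this measure is prone to false positives: it ignores the order of tags, so any reordering of the same tags has distance 0.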

Experiment Results
- Levenshtein distance:
  - Accurate
  - Slow
- Frequency measure:
  - Introduces false positives
  - Much faster
- Suggestion: use the frequency measure to extract candidates, then use the Levenshtein distance to verify them
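The suggested combination might be sketched as a two-stage filter. Everything here is illustrative: `find_clones` is a hypothetical name, and the thresholds `max_ed` and `max_ld` are assumptions, since the paper (per these slides) does not say how small a distance must be.

```python
import math
from collections import Counter
from itertools import combinations


def euclidean_distance(u, v):
    """Cheap, order-insensitive distance between symbol counts."""
    cu, cv = Counter(u), Counter(v)
    return math.sqrt(sum((cu[s] - cv[s]) ** 2 for s in set(cu) | set(cv)))


def levenshtein(u, v):
    """Slower, order-sensitive edit distance (dynamic programming)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def find_clones(pages, max_ed, max_ld):
    """pages maps a page name to its HTML-string.

    ASSUMPTION: max_ed and max_ld are caller-chosen thresholds; the
    slides give no values for them.
    """
    clones = []
    for (name_a, a), (name_b, b) in combinations(pages.items(), 2):
        # Stage 1: the fast frequency measure discards unlikely pairs.
        if euclidean_distance(a, b) > max_ed:
            continue
        # Stage 2: the edit distance verifies the surviving candidates.
        if levenshtein(a, b) <= max_ld:
            clones.append((name_a, name_b))
    return clones


pages = {"p1": "hifgieb", "p2": "hidcfgieab", "p3": "beigfih"}
print(find_clones(pages, max_ed=2.0, max_ld=3))  # [('p1', 'p2')]
```

Here `p3` uses the same tags as `p1` in reverse order, so it passes the frequency stage with distance 0 but is rejected by the Levenshtein stage.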

Conclusion
- Two methods for detecting web page clones are proposed and evaluated
- Each has its strengths and weaknesses, but they can be combined into a refinement process
- Clone detection techniques are useful for:
  - Identifying cases of plagiarism
  - Highlighting reuse of patterns of HTML tags
  - Facilitating Web maintenance
  - Facilitating the testing process of web applications

Final Note
- The paper does not give the translation alphabet table or explain how to obtain it correctly
- The paper does not state the distance (similarity) criteria used in the experiment
- The experiment does not cover plagiarism detection, although it may be possible

Q&A Thank You