Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata.

Presentation transcript:

Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata

Contents
- Introduction
- Objective
- Significance
- Procedure
- Experiments
- Future directions

The Web: A Directed Graph (V, A)
- Vertices (web pages): $V = \{v_1, v_2, \ldots, v_N\}$
- Arcs (hyperlinks): $A = \{e_{ij} : v_j \to v_i\}$
- Path: $p_1.p_2.\ldots.p_n$ with arcs from $p_i$ to $p_{i+1}$
- Cycle: a path with $p_n = p_1$
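The path and cycle definitions above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dictionary from each page to the set of pages it links to; the page names and the helper names `is_path` and `is_cycle` are illustrative and not taken from the slides.

```python
from typing import Dict, List, Set

# Hypothetical representation: page -> set of pages it links to.
Graph = Dict[str, Set[str]]


def is_path(graph: Graph, pages: List[str]) -> bool:
    """True if there is an arc from every page to the one that follows it."""
    return len(pages) >= 1 and all(
        nxt in graph.get(cur, set()) for cur, nxt in zip(pages, pages[1:])
    )


def is_cycle(graph: Graph, pages: List[str]) -> bool:
    """True if `pages` is a path with at least one arc that ends where it starts."""
    return len(pages) >= 2 and pages[0] == pages[-1] and is_path(graph, pages)


# Made-up toy web used only for illustration.
toy_web: Graph = {
    "index": {"ch1"},
    "ch1": {"ch2"},
    "ch2": {"ch3", "index"},
    "ch3": set(),
}

print(is_path(toy_web, ["index", "ch1", "ch2", "ch3"]))     # True
print(is_cycle(toy_web, ["index", "ch1", "ch2", "index"]))  # True
```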

Sequences of Web Pages
- Paths consisting of adjacent web pages
- Order sensitive
- A surfer may follow one such sequence when browsing pages

Cycles of Web Pages versities/ versities/ versities/United_States/ versities/United_States/

What are we looking for?
- A particular kind of sequence and cycle:
  - Regular
  - Consisting of similar units
  - Units having similar relationships
  - Reasonably sized

Why are these Sequences and Cycles Interesting?
- Individual units form a single object
- These were intended to be together
- They collectively include the complete information
- Despite being part of a collection, individuality is maintained

Significance of Detecting Such Sequences and Cycles
- Compression
  - Merge groups of pages
  - Fewer pages → fewer links
- Pre-fetching
  - Know where the surfer wants to be next
  - Fetch the page(s) before they are requested
  - Saves time
  - Errors: pre-fetching wrong pages

Significance of Detecting Such Sequences and Cycles (Contd.)
- Fair comparison
  - Comparison independent of how content is presented
  - Content split into multiple pages should be treated as equivalent to the same content on a single page
- Better retrieval
  - Retrieval independent of the presentation
  - Output a set of pages instead of a single one as a match

Fair Comparison

Improved Retrieval
- Retrieve only the portions of interest instead of whole (huge) documents
- Avoid rewarding documents merely for having more content

How to Detect Sequences and Cycles of Web Pages?
- Find navigational links
- Find consecutive pages
- Define the condition that the elements of a sequence must satisfy
- Identify subsequences (or units)
- Concatenate
- Check for cycles

Finding Navigational Links: Background
- The purpose of a link may be
  - Navigation
  - Reference
  - Advertisement
- Links between pages on the same server are treated as navigational
- They have also been treated as noise

Finding Navigational Links: Our Method
- Avoid assuming that every link on the same server is navigational
- Navigational links appear mostly at the top or at the bottom of a page
- Navigational links are generally huddled together
- There is less text and there are fewer images around such links
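The heuristics above could be combined roughly as follows. This is only a sketch under stated assumptions: the `Link` record, the relative `position` measure and all three thresholds (`edge`, `max_text`, `max_gap`) are hypothetical choices made for illustration, not values from the presentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    href: str
    position: float   # assumed measure: 0.0 = top of the page, 1.0 = bottom
    nearby_text: int  # assumed measure: characters of plain text around the link


def navigational_links(
    links: List[Link],
    edge: float = 0.15,     # hypothetical: how close to the top/bottom counts
    max_text: int = 40,     # hypothetical: "little text around the link"
    max_gap: float = 0.05,  # hypothetical: "huddled together"
) -> List[Link]:
    """Keep links near the page edges, with little text, that have a close neighbour."""
    candidates = sorted(
        (
            link
            for link in links
            if (link.position <= edge or link.position >= 1 - edge)
            and link.nearby_text <= max_text
        ),
        key=lambda link: link.position,
    )
    result = []
    for i, link in enumerate(candidates):
        close_before = i > 0 and link.position - candidates[i - 1].position <= max_gap
        close_after = (
            i + 1 < len(candidates)
            and candidates[i + 1].position - link.position <= max_gap
        )
        if close_before or close_after:
            result.append(link)
    return result


page_links = [
    Link("prev.html", 0.02, 10),
    Link("next.html", 0.03, 8),
    Link("reference.html", 0.50, 300),
    Link("ad.html", 0.98, 5),
]
print([link.href for link in navigational_links(page_links)])
# ['prev.html', 'next.html']
```

Note that nothing here looks at the server name, so links that cross servers can still be picked up, in line with the point made on the slide.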

Advantages and Limitations
- Simple and fast
- Navigational links across servers are also identified
- Heuristics need not always work: fall back on sophisticated methods

Units of the Sequences
- A → B → C is a unit if C is “related” to B in the same way as B is “related” to A
- “related” is defined in terms of how they are linked
- Relation is stored as “position” of the link
- Several ways of defining “position”
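One possible reading of the unit definition is sketched below. It assumes that "position" means the index of a link in the page's ordered list of outgoing links, which is only one of the several definitions the slide alludes to; the function name `find_units` and the sample pages are hypothetical.

```python
from typing import Dict, List, Tuple


def find_units(out_links: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return triples (A, B, C) where the link A -> B sits at the same
    position in A as the link B -> C does in B."""
    units = []
    for a, links_a in out_links.items():
        for pos, b in enumerate(links_a):
            links_b = out_links.get(b, [])
            if pos < len(links_b):
                c = links_b[pos]
                if a != b and b != c:  # ignore self-links
                    units.append((a, b, c))
    return units


# Each page lists its outgoing links in document order (made-up data).
chapters = {"p1": ["p2"], "p2": ["p3"], "p3": ["p4"], "p4": []}
print(find_units(chapters))
# [('p1', 'p2', 'p3'), ('p2', 'p3', 'p4')]
```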

Combining the units into sequences
- Units: D → E → F, B → C → D, A → B → C, C → D → E
- Combined sequence: A → B → C → D → E → F
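The concatenation step can be sketched as chaining units that overlap in two pages. This is an illustrative approach, not necessarily the authors' procedure; it assumes each pair of consecutive pages has at most one continuation, and `combine_units` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, str, str]


def combine_units(units: List[Unit]) -> List[List[str]]:
    """Chain units such as (A, B, C) and (B, C, D) into A, B, C, D."""
    # Assumption: at most one continuation per pair of consecutive pages.
    next_page: Dict[Tuple[str, str], str] = {(a, b): c for a, b, c in units}
    tails = {(b, c) for _, b, c in units}

    sequences = []
    for start in next_page:
        if start in tails:
            continue  # this pair continues an earlier unit, so it is not a start
        seq = list(start)
        seen = {start}
        pair = start
        while pair in next_page:
            seq.append(next_page[pair])
            pair = (seq[-2], seq[-1])
            if pair in seen:  # guard against looping forever on a cycle
                break
            seen.add(pair)
        sequences.append(seq)
    return sequences


units = [("D", "E", "F"), ("B", "C", "D"), ("A", "B", "C"), ("C", "D", "E")]
print(combine_units(units))
# [['A', 'B', 'C', 'D', 'E', 'F']]
```

A pure cycle has no starting pair under this scheme, so it yields no sequence here and needs a separate cycle check.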

Cycle detection
- Existing cycle detection algorithms
  - Cycle detection in number theory
  - Special case of cycle detection in graph theory
- Stack based algorithm
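The slide does not spell out the stack based algorithm, so the following is a generic sketch rather than the authors' method. It assumes each page has at most one detected successor (for example its "next" link), follows successors while pushing pages onto a stack, and reports a cycle when a page already on the stack is reached; `find_cycle` and the sample data are hypothetical.

```python
from typing import Dict, List


def find_cycle(next_link: Dict[str, str], start: str) -> List[str]:
    """Follow successor links from `start`; return the pages on the cycle
    that is reached, or an empty list if the chain simply ends."""
    path: List[str] = []        # stack of pages visited so far, in order
    index: Dict[str, int] = {}  # page -> its position on the stack
    page = start
    while page is not None and page not in index:
        index[page] = len(path)
        path.append(page)
        page = next_link.get(page)  # None when the page has no successor
    if page is None:
        return []
    return path[index[page]:]


links = {"a": "b", "b": "c", "c": "d", "d": "b"}
print(find_cycle(links, "a"))                   # ['b', 'c', 'd']
print(find_cycle({"a": "b", "b": "c"}, "a"))    # []
```

Keeping the whole stack costs memory proportional to the chain length; the number-theoretic algorithms mentioned on the slide trade that for constant memory.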

Improvements and Speedups
- Trust the “rel” information provided by the pages (that is, by their authors)
- Use keywords like “next” and “previous” to infer the relationships
- Utilize the naming convention of the pages
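The first two ideas can be sketched with Python's standard `html.parser` module. This is an illustrative sketch: the class name, the keyword table and the rule that the first match wins are assumptions made here, not details from the presentation.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class NextPrevParser(HTMLParser):
    """Collect 'next' and 'prev' targets from rel attributes and anchor text."""

    # Hypothetical keyword table; real pages use many more variants.
    KEYWORDS = {"next": "next", "prev": "prev", "previous": "prev"}

    def __init__(self) -> None:
        super().__init__()
        self.relations: Dict[str, str] = {}  # "next" / "prev" -> href
        self._href: Optional[str] = None     # href of the <a> being read
        self._text: List[str] = []           # anchor text collected so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in ("a", "link") or not href:
            return
        # Trust explicit rel information first.
        for rel in (attrs.get("rel") or "").lower().split():
            if rel in self.KEYWORDS:
                self.relations.setdefault(self.KEYWORDS[rel], href)
        if tag == "a":
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Fall back on keywords in the anchor text.
            for word in "".join(self._text).lower().split():
                if word in self.KEYWORDS:
                    self.relations.setdefault(self.KEYWORDS[word], self._href)
            self._href = None


def link_relations(html: str) -> Dict[str, str]:
    parser = NextPrevParser()
    parser.feed(html)
    parser.close()
    return parser.relations


sample = (
    '<link rel="next" href="ch2.html">'
    '<a href="index.html">Previous</a> <a href="other.html">see also</a>'
)
print(link_relations(sample))
# {'next': 'ch2.html', 'prev': 'index.html'}
```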

Experimental Results
- Data
  - Toy data: Python tutorial in HTML
    - Tutorial split into several chapters and sections
    - Several cycles
  - Mutilated data
    - Certain pages deleted (missing links)
- 100% detection in all cases

Other experiments planned
- Real test: unorganized web pages
- Difficulties:
  - Finding navigational links
  - Noise (advertisements, etc.)
  - Dynamically generated pages
  - Will the relationships hold?

Leads us to …
- Concatenate detected sequences for analysis
- Modify retrieval mechanism
- Return sets of pages as results
- Improve mirror/duplicate detection

Future Work
- Consider other relations
- Unifying framework?
- Improve identification of navigational links