Context-Based Metrics For Evaluating Changes to Web Pages Thesis Defense By Suvendu Kumar Dash Texas A&M University.



Overview
- The Walden’s Paths Project
- The Path Manager
- Context-Based Module
- Results
- Future Work
- Questions

The Walden’s Paths Project
Goals: Walden’s Paths is an application for K-12 educators to organize World-Wide Web material for their students’ use. Students using a path get a cohesive view of the material and may browse off the path freely, with the assurance that they can return to the path with ease.

Path Manager
Goals: Path Manager is the module in the Walden’s Paths project that manages the ever-changing web pages.
Previous Work: Detect Content-Based, Presentation, and Structural Changes in the web pages.

Context-Based Module
Goal: To give meaning to the changes with respect to the path, not the previous version of the page.
Input: The output of the parser in Path Manager, which gives the terms present in the document.
Output: Context-Based as well as Content-Based Metrics of Change.

Creation of Context-Based and Content-Based Metrics of Change
[Diagram components: Parser, Signature File, Path File, Contextual Analyzer, Context-Based Metrics and Content-Based Metrics]

Implementation Steps:
1. Find the Term Vector of each web page in the path. This is done by putting all the words present in the document (except stop words such as "a", "and", "the") in a vector.
2. Save the Page Term Vectors for all pages in the Signature File of the path.
3. Compute a Path Vector using a composition of the Term Vectors for all the web pages in the path except the page whose change is being evaluated.
4. Calculate the cosine similarity angle between the Path Vector and the Page Term Vector.
5. Compare this angle to that for the previous version of the page. The difference between these two angles is used to compute the degree of change to the web page.
The algorithm was tested with existing paths to determine change values that convey the different degrees of change for web pages.
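The implementation steps can be sketched in Python as follows. This is a minimal illustration, not the Path Manager code: the stop-word list, the function names, and the choice of summing term counts as the "composition" for the Path Vector are all assumptions, since the slides do not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the slides only name "a", "and", "the".
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def term_vector(text):
    """Count every non-stop word in a page's text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def path_vector(page_vectors, skip_index):
    """Combine the term vectors of all pages except the one being evaluated.

    Summing the counts is an assumed form of "composition".
    """
    total = Counter()
    for i, vec in enumerate(page_vectors):
        if i != skip_index:
            total.update(vec)
    return total


def angle_degrees(u, v):
    """Angle between two term vectors, derived from their cosine similarity."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 90.0  # treat an empty vector as unrelated (assumption)
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def context_change(page_vectors, index, new_text):
    """Difference between the new and old page-to-path angles."""
    path = path_vector(page_vectors, index)
    old_angle = angle_degrees(page_vectors[index], path)
    new_angle = angle_degrees(term_vector(new_text), path)
    return new_angle - old_angle
```

In this sketch, replacing a page with off-topic text moves it further from the path than replacing it with text that shares the path's vocabulary, so the angle difference is larger. How that difference maps to the reported degrees of change is not given in the slides and is left out here.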

Testing Steps:
- Pages were collected from a particular category of the Yahoo! Directory, and paths were built on those pages.
- One of the pages was changed to a page about elephants, and the Context-Based Metrics were evaluated.
- The same page was changed to a CNN Financials page, and the Context-Based Metrics were evaluated.
- The same page was then changed to a similar page (within the same context), and the Context-Based Metrics were evaluated.

Results (From 20 Collections/Paths)
Columns: The whole page changed to a page talking about Elephants | The whole page changed to a page on CNN Financials | The whole page changed to a Similar Page
Rows:
- Angle of the changed page to the original page (in degrees): Average; Ranges (30.77 to ...); Standard Deviation
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): High Level
- Angle of the page to the path (in degrees): Average; Ranges (... to 14.3); Standard Deviation

Results (Path About Movies)
Columns: The whole page changed to a page talking about Elephants | The whole page changed to a page on CNN Financials | The whole page changed to a page of NY Times Movies (Similar Page)
Rows:
- Angle of the changed page to the original page (in degrees)
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): High Level

Results (Path About Search Engines)
Columns: The whole page changed to a page talking about Elephants | The whole page changed to a page on CNN Financials | The whole page changed to a page talking about Internet (Similar Page)
Rows:
- Angle of the changed page to the original page (in degrees)
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): High Level

Results (Path About Texas History)
Columns: The whole page changed to a page about Mexican History | The whole page changed to a page talking about Elephants | The whole page changed to a page on CNN Financials | The whole page changed to a page about Texas History (Similar Page)
Rows:
- Angle of the changed page to the original page (in degrees)
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): High Level, High Level

Results (Path About Indian History)
Columns: The whole page changed to a page talking about Giraffes | The whole page changed to a page talking about Elephants | The whole page changed to a page on CNN Financials | The whole page changed to a page about Indian History (Similar Page)
Rows:
- Angle of the changed page to the original page (in degrees)
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): High Level, High Level

Future Work
Give more weight to headings, bold text, and similar emphasized text. For this to work, the Parser in the Path Manager needs to be modified so that it can extract this information.
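One way the proposed weighting could look is sketched below. It assumes a modified parser that labels each piece of text as a heading, bold text, or body text; the segment format, the `WEIGHTS` values, and the function name are hypothetical, as the slides give no concrete weights.

```python
import re
from collections import Counter

# Hypothetical weights per text kind; the slides do not state values.
WEIGHTS = {"heading": 3.0, "bold": 2.0, "body": 1.0}

# Hypothetical stop-word list, as in the basic term vector.
STOP_WORDS = {"a", "and", "the"}


def weighted_term_vector(segments):
    """Build a term vector from (kind, text) pairs.

    Each occurrence of a word adds the weight of the segment it appears in,
    so words in headings count more than words in body text.
    """
    vec = Counter()
    for kind, text in segments:
        weight = WEIGHTS.get(kind, 1.0)  # unknown kinds count as body text
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                vec[word] += weight
    return vec
```

The resulting vector can be used in place of a plain term-count vector when computing the angle to the path, which is why only the parser output needs to change.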

Results: Path about Elephants (with Headings given more weight)
Columns: Page with one paragraph changed | The whole page changed to a page talking about Giraffe | The whole page changed to a page on CNN Financials | The whole page changed to a page talking about Elephants (Similar Page)
Rows:
- Angle of the changed page to the original page (in degrees)
- Angle of the changed page to the original page (in degrees) with more weight given to headings
- Proportional Algorithm
- Degrees of change (High Level, Medium and Lowest): Medium Level (Green), High Level
- Angle of the page to the path (in degrees)
- Angle of the page to the path (in degrees) with more weight given to headings

Questions?