Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum (Stanford University Libraries, Stanford CA, USA) and Michael L. Nelson.

Presentation transcript:

Thumbnail Summarization Techniques For Web Archives
Ahmed AlSum*, Stanford University Libraries, Stanford CA, USA
Michael L. Nelson, Old Dominion University, Norfolk VA, USA
The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014
* Ahmed AlSum did this work while he was a PhD student at Old Dominion University.

What is a Web Archive?

Thumbnails in Web Archive
Examples: Internet Archive, UK Web Archive

Memento Terminology
Original Resource: URI-R, R
Memento: URI-M, M
TimeMap: URI-T, TM

Thumbnails Creation Challenges
Scalability in time: the Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
Scalability in space: IA will need 355 TB to store one thumbnail per memento.
Page quality

Thumbnails Usage Challenges
This is a partial view of the first 700 thumbnails out of the 10,500 available mementos.

From 10,500 Mementos to 69 Thumbnails.

How many thumbnails do we need? (on the live Web)

40 Thumbnails are good.

METHODOLOGY

Visual Similarity and Text Similarity
Figure labels: Similar, Different, HTML Text

Correlation between Visual Similarity and Text Similarity
Text similarity measures: SimHash, DOM tree, embedded resources, Memento Datetime (capture time)
Each is compared against visual similarity.

Text Similarity: SimHash
Compute 64-bit SimHash fingerprints with k = 4 for the two pages.
Candidate text inputs:
Full HTML text ✔
The main content from the web page
All the text
Templates including the text
The template excluding the text
Calculate the difference using Hamming distance.
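The fingerprint-and-compare step on this slide can be sketched as follows. This is a minimal illustration and not the authors' implementation: the word tokenizer, the equal token weights, and the use of MD5 as the per-token hash are all assumptions, and the slide's k = 4 parameter is not modeled here.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of `text` (word tokens, equal weights)."""
    counts = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits; MD5 is an arbitrary choice for this sketch.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        # A bit is set when more tokens voted 1 than 0 at that position.
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two mementos would be compared with hamming_distance(simhash64(html_1), simhash64(html_2)); a smaller distance suggests more similar text.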

Text Similarity: DOM Tree
Transform each web page into a DOM tree.
Calculate the difference using Levenshtein distance.
Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one input into the other.
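As a rough illustration of the edit-distance idea, the sketch below flattens each page to its sequence of opening tags and runs the classic sequence Levenshtein algorithm over the two sequences. Flattening the tree is a simplification made here for brevity; it is not the tree edit distance cited in the talk, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Flatten a page into its sequence of opening tags (a simplification)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b) -> int:
    """Minimum number of insert, update, and delete operations to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For two mementos, levenshtein(tag_sequence(html_1), tag_sequence(html_2)) gives a structural difference score under this simplification.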

Text Similarity: Embedded resources
Extract the embedded resources for each page.
Calculate the number of resources that have been added and the number that have been removed.
For example, the difference between M1 and M2: addition of 5 resources (2 JavaScript files and 3 images) and removal of 2 resources (1 JavaScript file and 1 image).
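The added/removed count can be sketched with set differences. The resource names below are hypothetical and chosen only to reproduce the M1 and M2 example from the slide; how resources are extracted from the page is left out.

```python
def resource_difference(resources_m1, resources_m2):
    """Return (added, removed) counts of embedded resources going from M1 to M2."""
    old, new = set(resources_m1), set(resources_m2)
    added = new - old      # present in M2 but not in M1
    removed = old - new    # present in M1 but not in M2
    return len(added), len(removed)


# Hypothetical resource lists matching the slide's example.
m1 = ["style.css", "old.js", "old.png"]
m2 = ["style.css", "a.js", "b.js", "1.png", "2.png", "3.png"]

added, removed = resource_difference(m1, m2)
total_change = added + removed
```

Here added is 5 and removed is 2, as in the slide's example.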

Text Similarity: Memento datetime
Calculate the difference, in seconds, between the record capture times of the two pages.
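A small sketch of the capture-time difference, assuming the capture times are available as HTTP-date strings of the kind carried in a Memento-Datetime header. The function name and the sample dates are illustrative only.

```python
from email.utils import parsedate_to_datetime


def seconds_between(datetime_1: str, datetime_2: str) -> float:
    """Absolute difference in seconds between two HTTP-date strings."""
    t1 = parsedate_to_datetime(datetime_1)
    t2 = parsedate_to_datetime(datetime_2)
    return abs((t2 - t1).total_seconds())


# Illustrative capture times one day apart.
gap = seconds_between(
    "Tue, 01 Apr 2014 00:00:00 GMT",
    "Wed, 02 Apr 2014 00:00:00 GMT",
)
```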

Visual Similarity
Measurement: the number of different pixels between two thumbnails.
To compare two thumbnails:
Resize both to each of the following dimensions: 64x64, 128x128, 256x256, and 600x600.
Calculate the Manhattan distance and Zero distance between each pair.
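The two pixel distances can be sketched on small grids. This example assumes the thumbnails have already been resized to the same dimensions and are represented as 2-D lists of grayscale values, and it reads "Zero distance" as the count of pixel positions that differ; both are assumptions made for illustration.

```python
def _check_same_shape(a, b):
    """Raise if the two pixel grids do not have identical dimensions."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("thumbnails must have the same dimensions")


def manhattan_distance(a, b) -> int:
    """Sum of absolute per-pixel differences."""
    _check_same_shape(a, b)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def zero_distance(a, b) -> int:
    """Number of pixel positions whose values differ (assumed meaning)."""
    _check_same_shape(a, b)
    return sum(p != q for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Two tiny 2x2 "thumbnails" for illustration.
thumb_1 = [[0, 10], [20, 30]]
thumb_2 = [[0, 12], [25, 30]]
```

For these grids the Manhattan distance is 7 and the Zero distance is 2.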

Correlation between Visual Similarity and Text Similarity
Figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime
Sources: SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

SELECTION ALGORITHMS

Threshold Grouping

Clustering technique
Input: a TimeMap with n mementos and a set of features, for example F = {SimHash, Memento-Datetime}.
Task: cluster the n mementos into K clusters.

Clustering technique
Figure panels: SimHash feature; SimHash and Datetime features
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
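A compact sketch of K-medoids over SimHash fingerprints with Hamming distance is shown below. It follows the general assign-then-update loop of K-medoids but uses evenly spaced starting medoids, which is a simplification chosen here and not the initialization from the cited paper. The function names and sample fingerprints are hypothetical, and using each cluster's medoid as its representative is an assumption.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def k_medoids(items, k, distance, max_iter=100):
    """Cluster `items` into k groups; returns {medoid_index: [member_indices]}."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    # Simplified initialization: evenly spaced indices along the TimeMap.
    medoids = [i * n // k for i in range(k)]
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            nearest = min(medoids, key=lambda m: distance(item, items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members))
            if members else medoid
            for medoid, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # converged: clusters are keyed by the final medoids
        medoids = new_medoids
    return clusters


# Hypothetical fingerprints: two visibly separate groups of three.
fingerprints = [0b0000, 0b0001, 0b0011, 0b11110000, 0b11110001, 0b11110011]
result = k_medoids(fingerprints, 2, hamming)
```

Each key in the returned dictionary is the index of a medoid, which could serve as the representative memento for that cluster.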

Time Normalization

Selection Algorithms Comparison
                         Threshold Grouping   K clustering   Time Normalization
TimeMap reduction        27%                  9% to 12%      23%
Image loss
Number of features       1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both

Generalization outside the Web Archive
Get k thumbnails from a website that has n pages.

Conclusions
We explored the similarity between the text and the visual appearance of web pages.
We found that SimHash and Levenshtein distance have the highest correlation with visual similarity.
We presented three algorithms to select k thumbnails from the n mementos in a TimeMap.