Published by Emily Mitchell. Modified over 9 years ago.
1
Thumbnail Summarization Techniques for Web Archives
Ahmed AlSum*, Stanford University Libraries, Stanford, CA, USA, aalsum@stanford.edu
Michael L. Nelson, Old Dominion University, Norfolk, VA, USA, mln@cs.odu.edu
The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014
* Ahmed AlSum did this work while he was a PhD student at Old Dominion University.
2
What is a Web Archive?
http://www.cs.odu.edu
3
Thumbnails in Web Archives
Internet Archive, UK Web Archive
4
Memento Terminology
Original Resource (URI-R, R): http://www.amazon.com
Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
TimeMap (URI-T, TM)
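The relationship between a URI-M and its URI-R can be illustrated with a short sketch. It assumes the Internet Archive URI layout shown on this slide (a 14-digit capture timestamp followed by the original URI) and assumes the timestamp is in UTC; the function name `split_uri_m` is hypothetical and not part of any Memento library.

```python
import re
from datetime import datetime, timezone

# Assumed layout: http://web.archive.org/web/<YYYYMMDDhhmmss>/<URI-R>
URI_M_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Split an Internet Archive URI-M into (capture datetime, URI-R)."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a recognized Internet Archive URI-M: " + uri_m)
    timestamp, uri_r = match.groups()
    # Assumption: the 14-digit timestamp is expressed in UTC.
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return captured, uri_r


captured, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(captured.isoformat(), uri_r)
```

Other archives may use a different URI layout, so the pattern would need adjusting for them.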
5
Thumbnail Creation Challenges
Scalability in time: the Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
Scalability in space: IA will need 355 TB to store one thumbnail per memento.
Page quality
6
Thumbnail Usage Challenges
This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com.
7
From 10,500 Mementos to 69 Thumbnails.
8
How many thumbnails do we need?
www.unfi.com on the live Web
10
40 Thumbnails are good.
11
METHODOLOGY
12
Visual Similarity and Text Similarity
[Figure labels: Similar, Different, HTML Text]
13
Correlation between Visual Similarity and Text Similarity
Text similarity: SimHash, DOM tree, embedded resources, Memento Datetime (capture time)
Visual similarity
14
Text Similarity: SimHash
Compute 64-bit SimHash fingerprints with k = 4 for two pages.
Candidate input text:
Full HTML text ✔
The main content of the web page
All the text
The template including the text
The template excluding the text
Calculate the difference using Hamming distance.
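The SimHash comparison above can be sketched as follows. This is an illustrative sketch and not the authors' implementation. Two details are assumptions because the slide does not define them: k = 4 is read as the word-shingle size, and MD5 (truncated to 64 bits) stands in for the per-feature hash. The names `shingles`, `simhash64`, and `hamming_distance` are hypothetical.

```python
import hashlib
import re


def shingles(text, k=4):
    """Return overlapping k-word shingles (assumed meaning of k = 4)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def simhash64(text, k=4):
    """Compute a 64-bit SimHash fingerprint of the text."""
    counts = [0] * 64
    for feature in shingles(text, k):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # keep 64 bits of the hash
        for bit in range(64):
            # Vote +1 when the bit is set in this feature's hash, else -1.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_1 = "<html><body><h1>Welcome</h1><p>Fresh organic food daily</p></body></html>"
page_2 = "<html><body><h1>Welcome</h1><p>Fresh organic food weekly</p></body></html>"
print(hamming_distance(simhash64(page_1), simhash64(page_2)))
```

A small Hamming distance suggests that the two pages share most of their text; identical input always yields a distance of 0.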
15
Text Similarity: DOM Tree
Transform each web page into a DOM tree.
Calculate the difference using Levenshtein distance.
Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one into the other.
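As a rough illustration of the idea, the sketch below flattens each page into the sequence of its opening tags and computes the classic sequence Levenshtein distance between the two sequences. This is a simplification chosen for brevity: it is not a true tree edit distance and is not claimed to be the method used in the talk. The names `TagCollector`, `tag_sequence`, and `levenshtein` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Flatten an HTML document into its sequence of opening tags."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Minimum number of insert, update, and delete operations from a to b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y, free if equal
            ))
        previous = current
    return previous[-1]


old = tag_sequence("<html><body><p>a</p><p>b</p></body></html>")
new = tag_sequence("<html><body><div>a</div><p>b</p><img src='x.png'></body></html>")
print(levenshtein(old, new))  # 2: one update (p to div) and one insert (img)
```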
16
Text Similarity: Embedded Resources
Extract the embedded resources for each page.
Calculate the total number of resources that have been added and resources that have been removed.
For example, the difference between M1 and M2: addition of 5 resources (2 JavaScript files and 3 images) and removal of 2 resources (1 JavaScript file and 1 image).
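The example above can be reproduced with set differences. In this sketch the resource names are made up for illustration, and reporting the total as added plus removed is one reading of the slide, not a confirmed definition. The function name `resource_difference` is hypothetical.

```python
def resource_difference(resources_m1, resources_m2):
    """Return (added, removed, total) resource counts going from M1 to M2."""
    before, after = set(resources_m1), set(resources_m2)
    added = after - before
    removed = before - after
    return len(added), len(removed), len(added) + len(removed)


# Hypothetical resource lists chosen to match the example on the slide.
m1 = ["old.js", "old.png", "common.css"]
m2 = ["common.css", "a.js", "b.js", "1.png", "2.png", "3.png"]
print(resource_difference(m1, m2))  # (5, 2, 7)
```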
17
Text Similarity: Memento Datetime
Calculate the difference, in seconds, between the capture times of the two pages.
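A minimal sketch of this feature, assuming the capture times are available as 14-digit `YYYYMMDDhhmmss` strings like the one embedded in the Internet Archive URI-M shown earlier. The function name `capture_gap_seconds` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit capture timestamp


def capture_gap_seconds(timestamp_1, timestamp_2):
    """Absolute difference between two capture times, in seconds."""
    t1 = datetime.strptime(timestamp_1, TIMESTAMP_FORMAT)
    t2 = datetime.strptime(timestamp_2, TIMESTAMP_FORMAT)
    return abs((t2 - t1).total_seconds())


print(capture_gap_seconds("20110411070244", "20110412070244"))  # 86400.0
```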
18
Visual Similarity
Measurement: the number of different pixels between two thumbnails.
To compare two thumbnails, resize them to each of the following dimensions: 64x64, 128x128, 256x256, and 600x600.
Calculate the Manhattan distance and Zero distance between each pair.
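The two distances can be sketched on small pixel grids. The sketch assumes the thumbnails have already been resized to the same dimensions and converted to grayscale values stored as nested lists; the resizing step itself would need an imaging library and is not shown. Reading "Zero distance" as the count of pixel positions that differ is an assumption based on the measurement stated on the slide. All function names are hypothetical.

```python
def _pixel_pairs(image_a, image_b):
    """Yield corresponding pixel pairs from two equally sized grids."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        yield from zip(row_a, row_b)


def manhattan_distance(image_a, image_b):
    """Sum of absolute pixel-value differences."""
    return sum(abs(p - q) for p, q in _pixel_pairs(image_a, image_b))


def zero_distance(image_a, image_b):
    """Number of pixel positions whose values differ (assumed definition)."""
    return sum(1 for p, q in _pixel_pairs(image_a, image_b) if p != q)


# Two hypothetical 2x2 grayscale thumbnails.
thumb_a = [[0, 10], [20, 30]]
thumb_b = [[0, 12], [25, 30]]
print(manhattan_distance(thumb_a, thumb_b))  # 7
print(zero_distance(thumb_a, thumb_b))       # 2
```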
19
Correlation between Visual Similarity and Text Similarity
[Figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
20
SELECTION ALGORITHMS
21
Threshold Grouping
23
Clustering Technique
Input: a TimeMap with n mementos and a set of features, for example F = {SimHash, Memento-Datetime}.
Task: cluster the n mementos into K clusters.
24
Clustering Technique
[Figure panels: SimHash feature; SimHash and Datetime features]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
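A generic K-medoids loop gives a feel for how n mementos could be grouped into K clusters from a precomputed distance matrix. This sketch is not the cited algorithm verbatim: it uses a naive initialization (the first K items) instead of the initialization described in the cited paper, and it assumes distinct items have nonzero distance. The one-dimensional sample data and the names `assign` and `k_medoids` are hypothetical.

```python
def assign(dist, medoids):
    """Map each medoid index to the item indices closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix; returns the clusters."""
    medoids = list(range(k))  # naive initialization (an assumption)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        updated = []
        for medoid, members in clusters.items():
            if not members:
                # Possible only with zero-distance ties; keep the old medoid.
                updated.append(medoid)
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            updated.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(dist, medoids)


# Hypothetical one-dimensional feature values standing in for mementos.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
print(k_medoids(dist, 2))  # {1: [0, 1, 2], 4: [3, 4, 5]}
```

Each medoid is an actual memento, so it can serve directly as the thumbnail that represents its cluster.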
25
Time Normalization
26
Selection Algorithms Comparison
 | Threshold Grouping | K clustering | Time Normalization
TimeMap Reduction | 27% | 9% to 12% | 23%
Image Loss | 28 | 78 - 101 | 109
# Features | 1 feature | 1 or more | 1 feature
Preprocessing required | Yes | | No
Efficient processing | Medium | Extensive | Light
Incremental | Yes | No | Yes
Online/offline | Both | Both | Both
27
Generalization outside the Web Archive
Get k thumbnails from a website that has n pages.
28
Conclusions
We explored the similarity between the text and the visual appearance of web pages.
We found that SimHash and Levenshtein distance have the highest correlation with visual similarity.
We presented three algorithms to select k thumbnails from n mementos per TimeMap.
aalsum@stanford.edu @aalsum