Synchronicity Real Time Recovery of Missing Web Pages Martin Klein Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011.

Presentation transcript:

Synchronicity Real Time Recovery of Missing Web Pages Martin Klein Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011

2 Who are you again? Ph.D. student w/ MLN since 2005 Diagnostic exam in 2006, dissertation proposal in publications to date Outstanding RA award CS dept CoS dissertation fellowship 3 ACM SIGWEB + 2 misc travel grants CS595 (S10) & CS518 (F10)

3 The Problem

4 The Problem: Web users experience 404 errors. The expected lifetime of a web page is 44 days [Kahle97]; 2% of the web disappears every week [Fetterly03]. Are the pages really gone, or just relocated? Has anybody crawled and indexed them? Do Google, Yahoo!, Bing or the IA have a copy of the page? Information retrieval techniques are needed to (re-)discover content.

5 The Environment: Web Infrastructure (WI) [McCown07]: web search engines (Google, Yahoo!, Bing) and their caches; web archives (Internet Archive); research projects (CiteSeer).

6 Refreshing and Migration in the WI: Digital preservation happens in the WI. Examples shown: Google Scholar, CiteSeerX, Internet Archive.

7 URI – Content Mapping Problem: (1) The same URI maps to the same or very similar content at a later time (U1 → C1 at time A, U1 → C1 at time B). (2) The same URI maps to different content at a later time (U1 → C1 at time A, U1 → C2 at time B). (3) A different URI maps to the same or very similar content at the same or at a later time (U1 → C1 at time A; at time B, U1 → 404 and U2 → C1). (4) The content cannot be found at any URI (U1 → C1 at time A, U1 → ??? at time B).

Content Similarity 8 JCDL July Today

Content Similarity 9 Hypertext August Today

Content Similarity 10 PSP August Today

Content Similarity 11 ECDL October Today

Content Similarity 12 Greynet Today ??

13 Lexical Signatures (LSs): First introduced by Phelps and Wilensky [Phelps00]. A small set of terms capturing the “aboutness” of a document; “lightweight” metadata. (Figure labels: Resource, Abstract, LS, Removal, Hit Rate, Proxy, Cache, Google, Yahoo, Bing.)

14 Generation of Lexical Signatures: Follows the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]. Term frequency (TF): “How often does this word appear in this document?” Inverse document frequency (IDF): “In how many documents does this word appear?”
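
The TF and IDF questions above can be combined into a ranking of a page's terms. The following is a minimal sketch, not the generator used in these experiments: the function name `lexical_signature`, the logarithmic IDF form, the toy document and the document frequencies are all assumptions made for illustration.

```python
import math
from collections import Counter

def lexical_signature(doc_terms, doc_freq, corpus_size, k=5):
    """Return the k terms of a document with the highest TF-IDF score."""
    tf = Counter(doc_terms)  # TF: how often does each word appear in this document?
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)          # assumption: unseen terms get DF = 1
        idf = math.log(corpus_size / df)    # IDF: rare terms get a high weight
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Toy data (hypothetical): a short page and made-up document frequencies.
doc = "elephant trunk tusks the elephant african the the".split()
df = {"the": 1000, "elephant": 10, "trunk": 20, "tusks": 5, "african": 50}
print(lexical_signature(doc, df, corpus_size=1000, k=3))
```

A term that occurs in every document ("the" in the toy data) gets an IDF of zero and therefore never enters the signature, no matter how often it appears on the page.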

15 LS as Proposed by Phelps and Wilensky: “Robust Hyperlink”. 5 terms are suitable. Append the LS to the URL, e.g. texttiling+wilensky+disambiguation+subtopic+iago. Limitations: 1. Applications (browsers) need to be modified to exploit LSs. 2. LSs need to be computed a priori. 3. Works well with most URLs but not with all of them.

16 Generation of Lexical Signatures: Park et al. [Park03] investigated the performance of various LS generation algorithms and evaluated the “tunability” of the TF and IDF components. Weight on TF increases recall (completeness). Weight on IDF improves precision (exactness).

17 Lexical Signatures -- Examples. Columns: Rank/Results, URL, LS. Row 1: 1/243, http://endeavour.cs.berkeley.edu/endeavour achieve inter-endeavour amplifies. Row 2: 1/1,930, http:// libraries conference cyberinfrastructure jcdl. Row 3: 1/25,900, http:// knowledge webcasts kluge library.

18 Synchronicity: A 404 error occurs while browsing. (1) Look for the same or an older page in the WI. (2) If the user is satisfied, return the page. (3) Else, generate an LS from the retrieved page. (4) Query SEs with the LS; if the result is sufficient, return the “good enough” alternative page. (5) Else, get more input about the desired content (link neighborhood, user input, ...). (6) Re-generate the LS, query SEs, ... and return pages. The system may not return any results at all.
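
The control flow of this slide can be sketched as a chain of fallbacks. This is only an illustration under stated assumptions: every callable passed in (`find_copy`, `user_accepts`, `make_ls`, `search`, `more_input`) is a hypothetical placeholder supplied by the caller, and a non-empty result list is taken to mean "sufficient".

```python
def synchronicity(uri, find_copy, user_accepts, make_ls, search, more_input):
    """Hypothetical fallback chain; all callables are supplied by the caller."""
    copy = find_copy(uri)                    # (1) same or older page in the WI
    if copy is not None and user_accepts(copy):
        return [copy]                        # (2) the user is satisfied
    if copy is not None:
        results = search(make_ls(copy))      # (3) LS from the retrieved page
        if results:                          # assumption: non-empty = sufficient
            return results                   # (4) "good enough" alternative page
    extra = more_input(uri)                  # (5) link neighborhood, user input, ...
    if extra is None:
        return []                            # nothing left to try
    return search(make_ls(extra))            # (6) may still be an empty list
```

Passing the steps in as functions keeps the sketch independent of any particular archive or search engine API.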

19 Synchro…What? Synchronicity: the experience of causally unrelated events occurring together in a meaningful manner. Events reveal an underlying pattern, a framework bigger than any of the synchronous systems. Carl Gustav Jung: “meaningful coincidence”. Example: Deschamps – de Fontgibu plum pudding.

Errors

Errors

22 “Soft 404” Errors

23 “Soft 404” Errors

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

25 The Problem: LSs are usually generated following the TF-IDF scheme. TF is rather trivial to compute. IDF requires knowledge about the overall size of the corpus (# of documents) and the # of documents a term occurs in. This is not complicated to compute for bounded corpora (such as TREC). If the web is the corpus, the values can only be estimated.

26 The Idea: Use IDF values obtained from 1. a local collection of web pages; 2. “screen scraping” SE result pages. Validate both methods through comparison to a baseline. Use Google N-Grams as the baseline. Note: N-Grams provide term count (TC) and not DF values – details to come.

27 Accurate IDF Values for LSs Screen scraping the Google web interface

28 The Dataset Local universe consisting of copies of URLs from the IA between 1996 and 2007

29 The Dataset: Same as above; follows a Zipf distribution. 10,493 observations; 254,384 total terms; 16,791 unique terms.

Total terms vs new terms The Dataset 30

Based on all 3 methods URL: Year: 2007 Union: 12 unique terms LSs Example 31

32 Comparing LSs: 1. Normalized term overlap: assume term commutativity; k-term LSs normalized by k. 2. Kendall Tau: modified version, since the LSs to compare may contain different terms. 3. M-Score: penalizes discordance in higher ranks.
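
The first measure, normalized term overlap, is simple enough to sketch directly. The function name and the choice to normalize by the longer of the two signatures (which equals k when both have k terms) are assumptions; the modified Kendall Tau and the M-Score are not reproduced here.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two LSs have in common, ignoring term order."""
    k = max(len(ls_a), len(ls_b))
    if k == 0:
        return 0.0
    # Term commutativity: only set membership matters, not the rank of a term.
    return len(set(ls_a) & set(ls_b)) / k
```

For example, two 5-term signatures that share two terms get a score of 2/5, regardless of where those terms are ranked.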

33 Comparing LSs: Top 5, 10 and 15 terms. LC – local universe; SC – screen scraping; NG – N-Grams.

34 Conclusions: Both methods for the computation of IDF values provide accurate results compared to the Google N-Gram baseline. The screen scraping method seems preferable since its similarity scores are slightly higher and it is feasible in real time.

Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)

36 The Problem: A reliable source is needed to accurately compute IDF values of web pages (in real time). Screen scraping has been shown to work, but validation of the baseline (Google N-Grams) is missing. N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF. What is their relationship?

37 Background & Motivation: Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept, used (among others) to generate lexical signatures (LSs). TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus. When the entire web is the corpus, IDF can only be estimated! Most text corpora provide term count (TC) values. Example: D1 = “Please, Please Me”, D2 = “Can’t Buy Me Love”, D3 = “All You Need Is Love”, D4 = “Long, Long, Long”. Term (TC, DF): All (1, 1), Buy (1, 1), Can’t (1, 1), Is (1, 1), Love (2, 2), Me (2, 2), Need (1, 1), Please (2, 1), You (1, 1), Long (3, 1). TC >= DF, but is there a correlation? Can we use TC to estimate DF?
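
The difference between term count and document frequency can be checked on the four example titles from this slide. The sketch below is illustrative only; the tokenizer (lowercasing, letters and apostrophes) is an assumption.

```python
import re
from collections import Counter

docs = ["Please, Please Me", "Can't Buy Me Love",
        "All You Need Is Love", "Long, Long, Long"]

def tokenize(text):
    """Lowercase a title and split it into words (apostrophes are kept)."""
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: total occurrences across all documents
df = Counter()  # document frequency: number of documents containing the term
for d in docs:
    tokens = tokenize(d)
    tc.update(tokens)
    df.update(set(tokens))  # a set counts each term once per document

for term in sorted(tc):
    print(term, tc[term], df[term])
```

Repetition inside one document ("long", "please") is what pushes TC above DF; a term that never repeats within a document has TC equal to DF.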

38 The Idea: Investigate the relationship between TC and DF within the Web as Corpus (WaC), and between WaC based TC and Google N-Gram based TC. TREC and BNC could be used, but they are not free, and TREC has been shown to be somewhat dated [Chiang05].

39 The Experiment: Analyze the correlation of lists of terms ordered by their TC and DF rank by computing Spearman’s Rho and Kendall Tau. Display the frequency of the TC/DF ratio for all terms. Compare TC (WaC) and TC (N-Grams) frequencies.
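
Both rank correlation measures can be written in a few lines for the tie-free case. This is a textbook sketch, not the code behind the reported results; it assumes two equally long lists of ranks without ties, and the example ranks are made up.

```python
def spearman_rho(x, y):
    """Spearman's rho for two equally long rank lists without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))  # squared rank differences
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

# Hypothetical ranks of five terms by TC and by DF; two terms swap places.
tc_rank = [1, 2, 3, 4, 5]
df_rank = [1, 3, 2, 4, 5]
print(spearman_rho(tc_rank, df_rank), kendall_tau(tc_rank, df_rank))
```

Real term lists contain many ties (terms with the same count), for which a tie-aware variant would be needed.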

40 Experiment Results Investigate correlation between TC and DF within “Web as Corpus” (WaC) Rank similarity of all terms

41 Experiment Results Investigate correlation between TC and DF within “Web as Corpus” (WaC) Spearman’s ρ and Kendall τ

42 Experiment Results Google: screen scraping DF values from the Google web interface Top 10 terms in decreasing order of their TF/IDF values taken from U = 14 ∩ = 6 Strong indicator that TC can be used to estimate DF for web pages!

43 Experiment Results: Frequency of TC/DF Ratio Within the WaC (panels: Integer Values, Two Decimals, One Decimal).

44 Experiment Results Show similarity between WaC based TC and Google N-Gram based TC TC frequencies N-Grams have a threshold of 200

45 Conclusions: TC and DF ranks within the WaC show strong correlation. TC frequencies of the WaC and Google N-Grams are very similar. Together with the results shown earlier (high correlation between the baseline and the two other methods), N-Grams seem suitable for accurate IDF estimation for web pages. This does not mean that everything correlated to TC can be used as a DF substitute!

Inter-Search Engine Lexical Signature Performance (JCDL 2009)

Inter-Search Engine Lexical Signature Performance. Martin Klein, Michael L. Nelson. Elephant Tusks Trunk African Loxodonta Elephant, Asian, African Species, Trunk Elephant, African, Tusks Asian, Trunk

48

Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)

50 How to Evaluate the Evolution of LSs over Time: Idea: conduct an overlap analysis of LSs generated over time. LSs are based on the local universe mentioned above. Neither Phelps and Wilensky nor Park et al. did that; Park et al. just re-confirmed their findings after 6 months.

51 Dataset Local universe consisting of copies of URLs from the IA between 1996 and 2007

10-term LSs generated for LSs Over Time - Example 52

53 LS Overlap Analysis: Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years in which that URL has been observed. Sliding: overlap between two LSs of consecutive years, starting with the first year and ending with the last.
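
The two overlap schemes differ only in which pairs of years are compared. A minimal sketch, assuming each URL's LSs are held in a dict keyed by year (the names and the toy data are hypothetical) and that every LS is non-empty:

```python
def overlap(a, b):
    """Normalized term overlap of two non-empty LSs."""
    return len(set(a) & set(b)) / max(len(a), len(b))

def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    first = ls_by_year[years[0]]
    return {y: overlap(first, ls_by_year[y]) for y in years[1:]}

def sliding_overlap(ls_by_year):
    """Compare each LS with the LS of the next observed year."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}

# Toy data (hypothetical 2-term LSs for one URL).
history = {1996: ["a", "b"], 1997: ["a", "c"], 1998: ["c", "d"]}
print(rooted_overlap(history))
print(sliding_overlap(history))
```

In the toy data the rooted overlap drops to zero by 1998 while the sliding overlap stays at 0.5, which is the kind of divergence the two views are meant to expose.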

54 Evolution of LSs over Time (Rooted): Results: Little overlap between the early years and more recent ones. Highest overlap in the first 1-2 years after creation of the LS. Rarely peaks after that – once terms are gone they do not return.

55 Evolution of LSs over Time (Sliding): Results: Overlap increases over time. Seems to reach a steady state around 2003.

56 Performance of LSs: Idea: query the Google search API with LSs (based on the local universe mentioned above) and identify the URL in the result set. For each URL it is possible that: 1. the URL is returned as the top ranked result; 2. the URL is ranked somewhere between 2 and 10; 3. the URL is ranked somewhere between 11 and 100; 4. the URL is ranked somewhere beyond rank 100, which is considered as not returned.
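
The four retrieval cases amount to a small classification of the rank at which the URL is found. A sketch with hypothetical labels, assuming 1-based ranks and `None` for a URL that does not appear in the result set at all:

```python
def retrieval_case(rank):
    """Map a 1-based result rank (or None if absent) to one of four cases."""
    if rank is None or rank > 100:
        return "undiscovered"   # beyond rank 100 counts as not returned
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top10"
    return "top100"
```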

57 Performance of LSs wrt Number of Terms: Results: 2-, 3- and 4-term LSs perform poorly. 5-, 6- and 7-term LSs seem best. Top mean rank (MR) value with 5 terms. Most top ranked with 7 terms. Binary pattern: either in top 10 or undiscovered. 8 terms and beyond do not show improvement.

58 Performance of LSs wrt Number of Terms: Rank distribution of 5 term LSs. Lightest gray = rank 1; black = rank 101 and beyond; ranks 11-20, 21-30, … colored proportionally. 50% top ranked, 20% in top 10, 30% black.

59 Performance of LSs: Scoring: normalized Discounted Cumulative Gain (nDCG). Binary relevance: 1 for match, 0 otherwise.
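
With binary relevance, nDCG rewards finding the sought URL near the top of the result list. The sketch below uses one common DCG variant (each relevance value divided by log2 of rank + 1); whether the slides use exactly this variant is an assumption.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain; the first result has rank 1."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Binary relevance: 1 where the result matches the sought URL, 0 otherwise.
print(ndcg([1, 0, 0]), ndcg([0, 0, 1]), ndcg([0, 0, 0]))
```

With a single matching URL the score reduces to 1 / log2(rank + 1), so a match at rank 1 scores 1 and a match at rank 3 scores 0.5.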

60 nDCG for LSs consisting of 2-15 terms (mean over all years) Performance of LSs wrt Number of Terms

61 Performance of LSs over Time Score for LSs consisting of 2, 5, 7 and 10 terms

62 Conclusions: LSs decay over time. Rooted: quickly after generation. Sliding: seem to stabilize. 5-, 6- and 7-term LSs seem to perform best: 7 – most top ranked; 5 – fewest undiscovered; 5 – lowest mean rank. 2..4 as well as 8+ terms are insufficient.

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)

64 The Problem Internet Archive - Wayback Machine international.comp:// international.com Lexical Signature (TF/IDF) Charter Aircraft Cargo Passenger Jet Air Enquiry Title ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International 59 copies The Problem

65 The Problem 65 Lexical Signature (TF/IDF) Charter Aircraft Cargo Passenger Jet Air Enquiry The Problem

66 The Problem Title ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International The Problem

67 The Problem If no archived/cached copy can be found... Tags C? B A Link Neighborhood (LNLS) The Problem

68 The Problem

69 Contributions: Compare the performance of four automated methods to rediscover web pages: 1. Lexical signatures (LSs) 2. Titles 3. Tags 4. LNLS. Analysis of title characteristics wrt their retrieval performance. Evaluate the performance of combinations of methods and suggest a workflow for real time web page rediscovery.

70 Experiment - Data Gathering: 500 URIs randomly sampled from DMOZ. Applied filters: .com, .org, .net, .edu domains; English language; min. of 50 terms [Park]. Results in 309 URIs to download and parse.

71 Experiment - Data Gathering: Extract title –... Generate 3 LSs per page – IDF values obtained from Google, Yahoo!, MSN Live. Obtain tags from delicious.com API (only 15%). Obtain link neighborhood from Yahoo! API (max. 50 URIs). Generate LNLS – TF from “bucket” of words per neighborhood – IDF obtained from Yahoo! API.

72 LS Retrieval Performance: 5- and 7-term LSs. Yahoo! returns the most URIs top ranked and leaves the fewest undiscovered. Binary retrieval pattern: a URI is either within the top 10 or undiscovered.

73 Title Retrieval Performance: Non-quoted and quoted titles. Results are at least as good as for LSs. Google and Yahoo! return more URIs for non-quoted titles. Same binary retrieval pattern.

74 Tags Retrieval Performance: The API returns up to the top 10 tags; distinguish between the # of tags queried. Low # of URIs. More later…

75 LNLS Retrieval Performance: 5- and 7-term LNLSs. < 5% top ranked. More later…

76 Combination of Methods: Can we achieve better retrieval performance if we combine 2 or more methods? (Diagram: Query LS, Query Title, Query Tags and Query LNLS in sequence, with a “Done” exit after each of the first three.)

77 Combination of Methods: Table of Top, Top10 and Undis results for LS, LS, TI and TA, per search engine (Google, Yahoo!, MSN Live).

78 Combination of Methods: Top Results for Combination of Methods, per search engine (Google, Yahoo!, MSN Live). Rows: LS5-TI, LS7-TI, TI-LS, TI-LS, LS5-TI-LS, LS7-TI-LS, TI-LS5-LS, TI-LS7-LS, LS5-LS, LS7-LS.

79 Title Characteristics: Length in # of Terms. Length varies between 1 and 43 terms. Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas].

80 Title Characteristics: Length in # of Characters. Length varies between 4 and 294 characters. Short titles (<10) do not perform well. Lengths between 10 and 70 are most common. Lengths between 10 and 45 seem to perform best.

81 Title Characteristics: Mean # of Characters, # of Stop Words. Title terms with a mean of 5, 6 or 7 characters seem most suitable for well performing terms. More than 1 or 2 stop words hurts performance.

82 Concluding Remarks: Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked. Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages. Titles are much cheaper to obtain than LSs. The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% of URIs top ranked. Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.

Is This a Good Title? (Hypertext 2010)

84 The Problem Professional Scholarly Publishing The Problem

85 The Problem Internet Archive - Wayback Machine international.comp:// international.com Lexical Signature (TF/IDF) Charter Aircraft Cargo Passenger Jet Air Enquiry Title ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International 59 copies The Problem

86 The Problem 86 Lexical Signature (TF/IDF) Charter Aircraft Cargo Passenger Jet Air Enquiry The Problem

87 The Problem Title ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International The Problem

88 The Problem Lexical Signature (TF/IDF) Plastic Surgeon Reconstructive Dr Bartell Symbol University ??? The Problem

89 The Problem Title Thomas Bartell MD Board- Certified - Cosmetic Plastic Reconstructive Surgery The Problem

90 The Problem 90 Lexical Signature (TF/IDF) Ronald USS MCSN Torrey Naval Sea Commanding The Problem

91 The Problem Title Home Page ??? Is This a Good Title? The Problem

92 Contributions: Discuss the discovery performance of web page titles (compared to LSs). Analysis of discovered pages regarding their relevancy. Display title evolution compared to content evolution over time. Provide a prediction model for a title’s retrieval potential.

93 Experiment - Data Gathering: 20k URIs randomly sampled from DMOZ. Applied filters: English language; min. of 50 terms. Results in 6,875 URIs. Downloaded and parsed the pages. Extract title and generate LS per page (baseline). (Table of original and filtered URI counts for .com, .org, .net, .edu and their sum.)

94 Title (and LS) Retrieval Performance: Titles; 5- and 7-term LSs. Titles return more than 60% of URIs top ranked. Binary retrieval pattern: a URI is either within the top 10 or undiscovered.

95 Relevancy of Retrieval Results: Do titles return relevant results besides the original URI? Distinguish between discovered (top 10) and undiscovered URIs. Analyze the content of the top 10 results. Measure relevancy in terms of normalized term overlap and shingles between the original URI and the search result, by rank.
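
Shingling compares two pages by the word sequences they share rather than by single terms. The sketch below is one common formulation (Jaccard similarity over word 3-grams); the shingle width and the use of Jaccard are assumptions, since the slides do not state how the shingle values were computed.

```python
def shingles(text, w=3):
    """Set of all runs of w consecutive words in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(a, b, w=3):
    """Jaccard similarity of two shingle sets: 1.0 means identical sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0   # both texts are shorter than w words
    return len(sa & sb) / len(sa | sb)
```

Unlike term overlap, this measure is sensitive to word order, so a page that reuses the same vocabulary in a different arrangement scores low.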

96 Relevancy of Retrieval Results: Term Overlap (panels: Discovered, Undiscovered). High relevancy in the top ranks, with possible aliases and duplicates.

97 Relevancy of Retrieval Results: Shingles (panels: Discovered, Undiscovered). More optimal shingle values than top ranked URIs – possible aliases and duplicates.

Sun Software Products Selector Guides - Solutions Tree Sun Software Solutions Sun Microsystems Products Sun Microsystems - Business & Industry Solutions Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions Title Evolution - Example I Sun Microsystems – Solutions Gateway Page - Sun Solutions Sun Microsystems Solutions & Services Services & Solutions Sun Services & Solutions Sun Solutions Title Evolution – Example I

DataCity of Manassas Park Main Page DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives Title Evolution - Example II computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity toll free Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity Service Disabled Veteran Owned Business SDVOB Title Evolution – Example II

100 Title Evolution Over Time: How much do titles change over time? Copies from fixed size time windows per year. Extract available titles of the past 14 years. Compute the normalized Levenshtein edit distance between the titles of copies and the baseline (0 = identical; 1 = completely dissimilar).

101 Title Evolution Over Time: Title edit distance frequencies. Half the titles of available copies from recent years are (close to) identical. Decay from 2005 on (with fewer copies available). 4 year old title: 40% chance to be unchanged.
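
A normalized edit distance between two titles can be sketched as follows. The dynamic programming part is the standard Levenshtein algorithm; dividing by the length of the longer title is an assumption about how the 0 to 1 range was obtained.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    """0 = identical titles; 1 = completely dissimilar titles."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For instance, `normalized_distance("Sun Solutions", "Sun Solutions")` is 0.0, and two titles without a single matching character position score 1.0.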

102 Title Evolution Over Time Title vs Document Y: avg shingle value for all copies per URI X: avg edit distance of corresponding titles overlap indicated by: green: 90 Semi-transparent: total amount of points plotted [0,1] - over 1600 times [0,0] times Title Evolution Over Time

103 Title Performance Prediction: Quality prediction of a title by the number of nouns, articles etc. and the amount of title terms and characters ([Ntoulas]). Observation of re-occurring terms in poorly performing titles – “Stop Titles”: home, index, home page, welcome, untitled document. The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”! [Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2004, pp
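
The 75% rule can be turned into a simple check. This sketch is an interpretation, not the prediction model from the paper: it assumes the share is measured in title terms (it could equally be characters), uses only the five stop titles listed on the slide, and does not strip punctuation.

```python
# Longer phrases come first so that "home page" wins over "home".
STOP_TITLES = [("home", "page"), ("untitled", "document"),
               ("home",), ("index",), ("welcome",)]

def stop_title_ratio(title):
    """Fraction of the title's terms that belong to a stop title."""
    tokens = title.lower().split()
    if not tokens:
        return 0.0
    covered, i = 0, 0
    while i < len(tokens):
        for phrase in STOP_TITLES:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                covered += len(phrase)
                i += len(phrase)
                break
        else:
            i += 1  # no stop title starts at this term
    return covered / len(tokens)

def predicted_insufficient(title, threshold=0.75):
    """True if the title is predicted to perform poorly as a query."""
    return stop_title_ratio(title) >= threshold
```

Under these assumptions "Home Page" is flagged, while "Welcome to DataCity of Manassas Park" is not, because only one of its six terms is a stop title.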

104 Concluding Remarks: The “aboutness” of web pages can be determined from either the content or from the title. More than 60% of URIs are returned top ranked when using the title as a search engine query. Titles change more slowly and less significantly over time than the web pages’ content. Not all titles are equally good. If the majority of title terms are Stop Titles, its quality can be predicted to be poor.

Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (submitted for publication)

106 The Problem: We have seen that we have a good chance to rediscover missing pages with lexical signatures and titles. BUT: what if no archived/cached copy can be found?

107 The Solution? Tags: Conferences, Digitallibraries, Conference, Library, Jcdl2005.

108 The Questions: What is a good length for a tag based query string? 5 or 7 tags, like lexical signatures? Can we improve retrieval performance when combining tags w/ title- and/or lexical signature-based queries? Do tags contain information about a page that is not in the title/content?

109 The Experiment: URIs with tags are rather sparse in previously created corpora. Creation of a new, tag centered corpus: query Delicious for 5k unique URIs; eventually obtain 4,968 URIs (11 duplicates, 21 URIs w/o tags).

110 The Experiment: Tags queried against the Yahoo! BOSS API. Same four retrieval cases introduced earlier. nDCG w/ same relevance scoring. Mean Average Precision.

111 The Experiment: Jaro-Winkler distance between URIs. Dice similarity between contents.
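
Of the two measures, the Dice coefficient is the shorter one to sketch. The version below works on sets of lowercased words, which is an assumption (character bigrams are another common choice); Jaro-Winkler is not implemented here.

```python
def dice_similarity(text_a, text_b):
    """Dice coefficient of the word sets of two texts, in the range [0, 1]."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    # Twice the shared terms, divided by the sum of both set sizes.
    return 2 * len(a & b) / (len(a) + len(b))
```

For example, "digital library conference" and "digital library search" share two of three words each, giving 2 * 2 / (3 + 3), or about 0.67.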

112 The Experiment: Combining methods.

113 The Experiment: Fact: ~50% of tags do not occur in the page. “Secret”: ~50% of tags do not occur in the current version of the page. Ergo: how about previous versions?

114 Ghost Tags: 3,306 URIs w/ older copies. 66.3% of our tags do not occur in the page. 4.9% of tags occur in a previous version of the page – Ghost Tags represent a previous version better than the current one. But what kind of tags are these? Are they important to the document? To the Delicious user?

115 Ghost Tags: Document importance: TF rank. User importance: Delicious rank. Normalized rank: 0 - top, 1 - bottom.

116 Concluding Remarks: Tags can be used for search! We can improve the retrieval performance by combining tag based search with titles and lexical signatures. Ghost Tags exist! One out of three important terms better describes a previous than the current version of a page. How old are Ghost Tags? When do tags “ghostify”? Wrt importance/change of page?

Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)

118 The Problem: We have seen that we have a good chance to rediscover missing pages with lexical signatures and titles. BUT: what if no archived/cached copy can be found? Plan A: Tags.

119 The Solution? Plan B: Link neighborhood Lexical Signatures.

120 The Questions: What is a good length for a neighborhood based lexical signature? 5 or 7 terms, like lexical signatures? 5..8 terms, like tag-based queries? How many backlinks do we need? Is the 1st level of backlinks sufficient? From where in the linking page should we draw the candidate terms?

121 The Radius Question: Paragraph; entire page; anchor text.
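
The radius determines how much text around a link is used as candidate terms. A minimal sketch of the "anchor text ± n words" case, assuming the linking page is already tokenized and the anchor occupies the hypothetical span `words[start:end]`:

```python
def anchor_window(words, start, end, radius=0):
    """Anchor text words[start:end] plus `radius` words on each side."""
    lo = max(0, start - radius)           # do not run past the page start
    hi = min(len(words), end + radius)    # do not run past the page end
    return words[lo:hi]

# Toy linking page (hypothetical); the anchor text is "jcdl 2005".
page = "see the jcdl 2005 conference page for details".split()
print(anchor_window(page, 2, 4))            # anchor text only
print(anchor_window(page, 2, 4, radius=1))  # anchor text ± 1 word
```

A large enough radius degenerates into the "entire page" case, so the same function covers both ends of the range.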

122 The Dataset: Same as for the JCDL 2010 experiment: 309 URIs; 28,325 first level & 306,700 second level backlinks. Filter for language, file type, content length, HTTP response code, “soft 404s” => 12% discarded. Lexical signature generation: IDF values from Yahoo!; 1..7 and 10 terms.

123 The Results (level-radius-rank): Anchor text

124 The Results – Backlink Level (level-radius-rank): Anchor text ± 5 words

125 The Results – Backlink Level (level-radius-rank): Anchor text ± 10 words

126 The Results – Backlink Level (level-radius-rank): Anchor text ± 10 words

127 The Results – Radius (level-radius-rank): All Radii

128 The Results – Backlink Rank (level-radius-rank): Anchor, Ranks 10, 100, 1000

129 The Results – In Numbers: 1-anchor anchor-10. WINNER: 4 terms, first backlink level only, top 10 backlinks only, anchor text only.

130 Concluding Remarks: Link neighborhood based lexical signatures can help rediscover missing pages. It is a feasible “Plan C” due to the high success rate of cheaper methods (titles, tags, lexical signatures). Fortunately, the smallest parameters perform best (anchor, 10 backlinks, 1st level backlinks). Can we find an optimum for the number of backlinks? (10/100/1000 leaves a big margin.) Can we identify “Stop Anchors”, e.g. click here, acrobat, etc.?