1
Web Archive Information Retrieval
Miguel Costa, Daniel Gomes (speaker)
Portuguese Web Archive
2
"Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources." (Wikipedia, en.wikipedia.org/wiki/Information_retrieval)
3
What are the information needs of web archive users?
4
Collecting information needs (methods): laboratory studies, online questionnaires, and search log mining.
Example log entries:
[03/02/2012 21:16:11] QUERY fcul
[03/02/2012 21:16:19] CLICK RANK=1
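Search log mining like the above can be sketched as a small parser. This is a minimal sketch assuming the log format shown on the slide ("[dd/mm/yyyy hh:mm:ss] QUERY <terms>" and "[dd/mm/yyyy hh:mm:ss] CLICK RANK=<n>"); the exact format of the Portuguese Web Archive logs is an assumption here.

```python
import re

# Assumed log format, inferred from the slide's two example lines.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<event>QUERY|CLICK)\s*(?P<rest>.*)$")

def parse_log_line(line):
    """Parse one search-log line into (timestamp, event, payload)."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts, event, rest = m.group("ts"), m.group("event"), m.group("rest")
    if event == "QUERY":
        return ts, event, {"terms": rest}
    # CLICK lines carry the rank of the clicked result, e.g. "RANK=1"
    return ts, event, {"rank": int(rest.split("=", 1)[1])}

log = [
    "[03/02/2012 21:16:11] QUERY fcul",
    "[03/02/2012 21:16:19] CLICK RANK=1",
]
events = [parse_log_line(line) for line in log]
```

Pairing each QUERY with the CLICK events that follow it yields the query-to-click behavior used to characterize user needs.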
5
Classes of web archive information needs
Navigational (53% to 81%)
– Seeing a web page in the past or how it evolved
– How was the home page of the Library of Congress in 1996?
Informational (14% to 38%)
– Collecting information about a topic written in the past
– How many people worked at the Library of Congress in 1996?
Transactional (5% to 16%)
– Downloading an old file or recovering a site from the past
– Make a copy of the site of the Library of Congress in 1996.
6
What are the information resources provided by web archives?
7
We conducted two surveys, in 2010 and 2014, using questionnaires and public information (e.g. the "List of Web archiving initiatives" page on Wikipedia).
8
Data: a significant amount has been archived by web archiving initiatives (2014)
>68 initiatives across 33 countries
>534 billion web-archived files since 1996 (17 PB)
9
Access method: URL search
Technology based on the Wayback Machine.
URLs from the past that contain the desired information may be unknown to users.
URLs for the same site change over time, e.g. lcweb.loc.gov vs. www.loc.gov
10
Access method: full-text search
Technology based on Lucene extensions (NutchWAX & Solr).
Lucene was not designed for Web Archive Information Retrieval, which leads to low relevance of search results.
11
Our challenge: how to obtain relevant information resources through full-text search?
12
A full-text index is like a huge book glossary: each term (e.g. "Library", "Congress") maps to the millions of archived web pages that contain it. A search returns the archived pages that contain the searched words, and those pages are spread across time.
13
Which archived pages should be presented in the top results? How should the relevance of temporal search results be ranked?
14
Ranking relevance in WAIR
Previous studies showed that temporal information:
– has been exploited to improve IR systems;
– can be extracted from web archives.
Hypothesis: WAIR systems can be improved by exploiting temporal information intrinsic to web archives through Machine Learning (Learning to Rank).
15
Methodology: combine ranking features into a ranking model using learning-to-rank (L2R) methods, including temporal features.
Ranking features:
– f(q,d): frequency of query term q in the text of document d
– i(d): number of inlinks to document d
– …
Test collection: documents, topics, relevance judgments.
The L2R algorithms automatically assign a weight to each ranking feature, producing the best ranking model, e.g.:
r(f(q,d), i(d)) = 0.6 × f(q,d) + 0.4 × i(d)
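The learned model can be as simple as a weighted sum of features, as in the slide's example r(f(q,d), i(d)) = 0.6 × f(q,d) + 0.4 × i(d). The sketch below shows such a linear combination; the feature names and weights are illustrative stand-ins for values an L2R algorithm would assign automatically.

```python
def score(features, weights):
    """Combine per-document ranking features with learned weights."""
    return sum(w * features[name] for name, w in weights.items())

# Illustrative weights, standing in for those learned by the L2R method.
weights = {"term_freq": 0.6, "inlinks": 0.4}

# Hypothetical feature values for two candidate documents.
docs = {
    "d1": {"term_freq": 0.9, "inlinks": 0.2},  # score 0.62
    "d2": {"term_freq": 0.5, "inlinks": 0.7},  # score 0.58
}
ranking = sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)
```

Real L2R models (e.g. Random Forests, RankSVM) are not linear, but the interface is the same: a learned function maps a feature vector for (query, document) to a score used for ranking.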
16
Created a test collection following the Cranfield paradigm
Documents
– 255 million documents (8.9 TB)
– 6 archived-web collections: broad crawls, selective crawls, integrated collections
– 14-year time span (1996-2010)
Topics
– 50 navigational needs (with date range)
– e.g. the page of the Público newspaper before 2000
Relevance judgments
– 3 judges, 3-level relevance scale, 267,822 versions assessed
– 5-fold cross-validation: 3 folds for training, 1 for validation, 1 for testing
17
Test collection sample: topics (navigational needs) and relevance judgments (qrels in TREC format).
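A TREC qrels file lists one judgment per line as "<topic_id> <iteration> <doc_id> <relevance>". The sketch below parses that format; the idea of identifying an archived version by URL plus capture timestamp is an assumption for illustration, not the collection's actual document identifiers.

```python
# Two hypothetical qrels lines on the 3-level relevance scale (0, 1, 2).
qrels_text = """\
1 0 http://publico.pt/19980101000000 2
1 0 http://publico.pt/20050101000000 0
"""

def parse_qrels(text):
    """Read TREC-format qrels into {topic_id: {doc_id: relevance}}."""
    judgments = {}
    for line in text.splitlines():
        topic, _iteration, doc, rel = line.split()
        judgments.setdefault(topic, {})[doc] = int(rel)
    return judgments

judgments = parse_qrels(qrels_text)
```

These judgments supply the labels for training and evaluating the L2R models.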
18
L2R method
3 L2R algorithms
– AdaRank, Random Forests, RankSVM
6 relevance metrics
– NDCG@k and Precision@k, for k = 1, 5, 10
68 ranking features
– standard IR ranking features (e.g. Lucene, BM25, TF-IDF)
– new temporal features
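The two metric families used above can be computed directly from the graded judgments of a ranked result list. A minimal sketch (the 3-level relevance values in `ranked` are illustrative):

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results judged relevant (relevance > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def ndcg_at_k(relevances, k):
    """Normalized Discounted Cumulative Gain over the top-k results."""
    def dcg(rels):
        # Gain (2^rel - 1) discounted by log2 of the 1-based rank + 1.
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical judgments (0, 1, 2) for one ranked result list, best first.
ranked = [2, 0, 1, 0, 2]
```

Precision@k only asks whether results are relevant; NDCG@k also rewards placing the most relevant versions highest, which matters for graded judgments.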
19
Temporal ranking features Intuition: Persistent documents are more relevant for navigational queries.
20
Lifespan is correlated with relevance: documents with higher relevance tend to have a longer lifespan (14 years of web snapshots analyzed).
21
Number of versions is correlated with relevance: documents with higher relevance tend to have more versions (14 years of web snapshots analyzed).
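Both temporal features above can be derived from the capture timestamps of a document's archived versions. A minimal sketch, assuming Wayback-style YYYYMMDDhhmmss timestamps (the timestamp format and example captures are assumptions):

```python
from datetime import datetime

def temporal_features(capture_timestamps):
    """Compute lifespan and version count from a document's capture times."""
    dates = sorted(datetime.strptime(t, "%Y%m%d%H%M%S") for t in capture_timestamps)
    return {
        # Days between the first and last known capture of the document.
        "lifespan_days": (dates[-1] - dates[0]).days,
        # How many archived versions exist.
        "num_versions": len(dates),
    }

# Hypothetical capture timestamps for one URL.
captures = ["19961113000000", "20010704120000", "20101231235959"]
features = temporal_features(captures)
```

These values then join the other 66 features in the vector fed to the L2R algorithms.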
22
L2R with temporal features improves search relevance

Relevance metric | Lucene | NutchWAX | L2R without temporal features (Random Forests) | L2R with temporal features (Random Forests) | Precision gain of L2R with temporal features over Lucene
Precision@1      | 0.28   | 0.32     | 0.64 | 0.76 | 270%
Precision@5      | 0.16   | 0.24     | 0.39 | 0.40 | 250%
Precision@10     | 0.13   | 0.17     | 0.24 | 0.24 | 184%
23
Most important ranking features (initial selection)
We do not use these in Arquivo.pt yet. In production we use:
– 4 ranking features derived from the TREC collection
– 4 fields (URL, title, text body and anchor text of incoming links)
24
Temporal ranking model Intuition: Several ranking models learned for specific periods of time are more effective than a single ranking model.
25
Why several ranking models for WAIR? The web presents different characteristics across time:
– technology
– structure and size of pages
– structure and size of sites
– language
– web-graph structure
– Web 1996 vs. Web 2015?
Should we use a single model that fits all data? No [Kang & Kim 2003; Geng et al. 2008; Bian et al. 2010].
26
Several ranking models yield more relevant results than a single ranking model with temporal features. Best results were obtained with 4 ranking models over the 14 years (+3.3%).
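The period-specific idea can be sketched as routing each archived version to the model trained for its capture year. The four time slices and the per-slice weights below are illustrative assumptions, not the boundaries or weights learned in the study.

```python
# One model per time slice (hypothetical slices and weights).
MODELS = {
    (1996, 1999): {"term_freq": 0.8, "inlinks": 0.2},
    (2000, 2003): {"term_freq": 0.7, "inlinks": 0.3},
    (2004, 2007): {"term_freq": 0.6, "inlinks": 0.4},
    (2008, 2010): {"term_freq": 0.5, "inlinks": 0.5},
}

def pick_model(capture_year):
    """Select the ranking model for the period containing capture_year."""
    for (start, end), model in MODELS.items():
        if start <= capture_year <= end:
            return model
    raise ValueError("year outside archived range")

def score(features, capture_year):
    """Score a document with the model of its own time period."""
    model = pick_model(capture_year)
    return sum(w * features[f] for f, w in model.items())
```

Routing by capture year lets each model fit the characteristics of its era's web instead of averaging over 14 years of change.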
27
Conclusions
1. Web archive users
– Have primarily navigational needs, then informational needs.
– Search as in web search engines.
– Prefer full-text search and older documents.
2. WAIR systems
– Must be optimized for the specific needs of web archive users.
– Some needs are not supported by current Information Retrieval technology.
3. Improving WAIR systems
– Relevant documents tend to have a longer lifespan and more versions.
– WAIR systems can be improved by exploiting ranking features intrinsic to web archives using Machine Learning (L2R).
– Ranking models per time period outperform a single ranking model.
28
Future (hard) work
Feature selection
– Our best ranking function is not yet in production.
– We must choose features that are feasible to implement given the required resources (CPU, memory, processing time, impact on response speed).
Create a test collection for informational needs
– e.g. Who was the president of the USA in 2007?
Image search
– A "must have" for users.
We need your collaboration!
29
All our resources are free and open source
All code available under the LGPL license:
– https://code.google.com/p/pwa-technologies/
Test collection to support evaluation:
– https://code.google.com/p/pwa-technologies/wiki/TestCollection
L2R dataset for WAIR research:
– http://code.google.com/p/pwa-technologies/wiki/L2R4WAIR
Search service since 2010:
– http://archive.pt
OpenSearch API:
– http://code.google.com/p/pwa-technologies/wiki/OpenSearch
The pwa-technologies project on Google Code will migrate to the FullPast project on GitHub.
30
Publications
– A Survey on Web Archiving Initiatives, Gomes et al., 2011.
– Understanding the Information Needs of Web Archive Users, Costa & Silva, 2010.
– Characterizing Search Behavior in Web Archives, Costa & Silva, 2011.
– A Search Log Analysis of a Portuguese Web Search Engine, Costa & Silva, 2010.
– Evaluating Web Archive Search Systems, Costa & Silva, 2012.
– Towards Information Retrieval Evaluation over Web Archives, Costa & Silva, 2009.
– Learning Temporal-Dependent Ranking Models, Costa et al., 2014.
– Creating a Billion-Scale Searchable Web Archive, Gomes et al., 2013.
– Information Search in Web Archives, Miguel Costa, PhD Thesis, 2014.
31
Thank you.