Search and Access Technologies for Large Scale Web Archives Joseph JaJa, Sangchul Song, and Mike Smorul Institute for Advanced Computer Studies Department.

Presentation transcript:

Search and Access Technologies for Large Scale Web Archives
Joseph JaJa, Sangchul Song, and Mike Smorul
Institute for Advanced Computer Studies
Department of Electrical and Computer Engineering
University of Maryland
In collaboration with the Library of Congress and the Internet Archive

Web Archiving
The Web is the main publication and communication medium today, but it is an ephemeral medium.
Web archiving:
– Capture, annotate, and store important web contents together with their contextual and temporal characteristics.
– Preserve them to enable search and access in the long term.
– Unprecedented scale and heterogeneity.
NDIIPP Partners Meeting, June 24, 2009

Goals
Discovery of relevant contents based on unstructured queries involving temporal specifications.
Presentation of pertinent summary information in ranked order according to the temporal context.
Scalable search and access performance.

Existing Access Methods
Chronological Listing Based on URLs
– Used by the Wayback Machine of the Internet Archive, arguably the leader in web archiving.
Directory Organization
– Typically for domain-specific contents, which are organized according to some hierarchical structure.
Full Text Search
– Similar to current web search engines (NutchWax/WERA).

Limitations of Current Technologies
Chronological Listing
– Users are expected to provide URLs.
Hierarchical Listing
– Not scalable: users must explore hierarchical structures with possibly large numbers of entries.
Full Text Search (NutchWax/WERA)
– Ranking of returned results does not take temporal context into consideration.
– Results are listed in the same way as in current web search engines.
– Lacks performance and scalability.

Issue #1: Scalability and Performance
For any search time span, the ENTIRE history has to be examined. (Multiple distributed indices can be maintained instead; however, all the indices still need to be searched.)
[Figure: a timeline with a single inverted index (a … z) and the search time span marked on it.]

Example: Search All, and then Filter
Query: "Find web pages that contain 'September 11th' before 2001"
Search all, and then filter: very inefficient!
[Figure: sample search results in two groups.
Results pertaining to the September 11th attack (4 million+ pages):
– September 11 attacks - Wikipedia, the free encyclopedia (en.wikipedia.org/wiki/September_11,_2001_attacks)
– September 11 Digital Archive (911digitalarchive.org/)
– 9/11 Tributes, September 11 Tributes and Memorials to the Victims …
– National Commission on Terrorist Attacks Upon the United States (11commission.gov/)
– … and 4 million other pages pertaining to the September 11th attack
Results irrelevant to the September 11th attack (600+ pages):
– Ethiopian calendar - Wikipedia, the free encyclopedia (en.wikipedia.org/wiki/Ethiopian_calendar)
– APOD: September 11, Mars Global Surveyor: Aerobraking (apod.nasa.gov/apod/ap html)
– … and only 630 other pages that are irrelevant to the September 11th attack]

Issue #2: Time-independent Ranking
Regardless of the search time span, current ranking schemes always consider the ENTIRE history.
The meaning and popularity of a term change over time, so a ranking scheme should depend not only on the search terms but also on the search time span.
[Figure: a timeline with the search time span marked on it.]

Issue #3: Ineffective Search Result Delivery
Search results are usually delivered as a list of URLs, sorted by relevance rank.
No other grouping or sorting options are available.

Core Technologies Developed
Ranking that depends on the time span specified by the user.
Flexible and intuitive presentations of the returned results, ordered according to the user's specification.
First step toward scalable and efficient 'full text + temporal' search.

Scalable & Efficient Temporal Searches
[Figure: the timeline is divided into time windows at t1, t2, t3, and t4, with one inverted index (a … z) per window: Inverted Index 1 through Inverted Index 5. The search time span is marked on the timeline.]
For a given search time span, only the indices whose time windows overlap the span are involved (two indices in the illustrated case).
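The time-window idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `WindowedIndex`, its methods, the use of fractional years as timestamps, and the sample documents are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class WindowedIndex:
    """Illustrative per-window inverted indices; not the authors' implementation."""

    def __init__(self, boundaries):
        # boundaries [b0, b1, ..., bk] define k half-open windows [b_i, b_{i+1}).
        self.boundaries = list(boundaries)
        self.indices = [defaultdict(set) for _ in range(len(self.boundaries) - 1)]

    def add(self, doc_id, timestamp, text):
        """Index one captured page into the window containing its timestamp."""
        i = bisect_right(self.boundaries, timestamp) - 1
        if not 0 <= i < len(self.indices):
            raise ValueError("timestamp outside the archived range")
        for term in text.lower().split():
            self.indices[i][term].add((timestamp, doc_id))

    def windows_for(self, start, end):
        """Return the window numbers that overlap the closed span [start, end]."""
        first = max(bisect_right(self.boundaries, start) - 1, 0)
        last = min(bisect_right(self.boundaries, end) - 1, len(self.indices) - 1)
        return list(range(first, last + 1))

    def search(self, term, start, end):
        """Look up a term in the overlapping windows only, then trim the edges."""
        hits = set()
        for i in self.windows_for(start, end):
            # .get avoids creating empty posting sets in the defaultdict.
            for timestamp, doc_id in self.indices[i].get(term, ()):
                if start <= timestamp <= end:
                    hits.add((timestamp, doc_id))
        return sorted(hits)


# Hypothetical archive with five one-year windows (timestamps are fractional years).
archive = WindowedIndex([2000, 2001, 2002, 2003, 2004, 2005])
archive.add("a", 2000.5, "calendar september")
archive.add("b", 2001.8, "september attacks")
archive.add("c", 2002.2, "september memorial")
archive.add("d", 2004.1, "september commission")
```

With these sample documents, `archive.windows_for(2001.5, 2002.5)` touches two of the five windows, and `archive.search("september", 2001.5, 2002.5)` never reads the postings of the other three. Only the two boundary windows need a per-posting timestamp check.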

Index Distribution and Parallel Search
[Figure: ADAPT Web Archive Search. Labels: Web Server, with Web Interface, Request Broker, and Result Aggregator; Search Cluster, with four Search Servers holding Inverted Index 1-4, Inverted Index 5-8, Inverted Index 9-12, and one further Inverted Index (each a … z).]
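The request broker and result aggregator roles named on this slide might look roughly like the following sketch. Everything here is an assumption made for illustration: the search servers are plain in-process functions rather than networked machines, the function names are hypothetical, and the URLs and scores are invented sample data.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def make_search_server(postings):
    """Stand-in for one search server; postings maps term -> [(score, url), ...]."""
    def search(term):
        return postings.get(term, [])
    return search


def broker_search(servers, term, k=10):
    """Fan the query out to every server, then keep the k best-scoring hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        partial_results = list(pool.map(lambda server: server(term), servers))
    # Aggregation step: flatten the partial lists and select the top k by score.
    all_hits = [hit for part in partial_results for hit in part]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Hypothetical servers, each holding a different subset of the indices.
servers = [
    make_search_server({"budget": [(0.9, "house.gov/a"), (0.2, "house.gov/b")]}),
    make_search_server({"budget": [(0.7, "senate.gov/c")]}),
    make_search_server({}),
]
top = broker_search(servers, "budget", k=2)
```

In a real deployment each call would be a network request, so the broker's wall-clock time is bounded by the slowest server rather than by the sum of all lookups.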

Time-dependent Ranking
[Figure: the timeline is divided into time windows at t1, t2, t3, and t4, with one inverted index (a … z) per window: Inverted Index 1 through Inverted Index 5. The search time span is marked on the timeline.]
For a given search time span and terms, rankings depend on term popularity during this time span only (rather than the entire history).
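One way to make a ranking depend on the search time span is to compute term statistics only over documents captured within that span. The sketch below uses a smoothed TF-IDF score for this purpose. That scoring formula is a stand-in chosen for illustration, since the slides do not specify the ranking function actually used, and the function name and sample documents are hypothetical.

```python
import math


def rank_in_span(docs, term, start, end):
    """Rank docs captured in [start, end] by TF-IDF computed over that span only.

    docs is a list of (doc_id, timestamp, tokens) tuples.
    """
    in_span = [d for d in docs if start <= d[1] <= end]
    doc_freq = sum(1 for _, _, tokens in in_span if term in tokens)
    # Smoothed IDF: statistics come from the span, not from the whole archive.
    idf = math.log((1 + len(in_span)) / (1 + doc_freq)) + 1.0
    scored = [
        (tokens.count(term) * idf, doc_id)
        for doc_id, _, tokens in in_span
        if term in tokens
    ]
    return sorted(scored, reverse=True)


# Hypothetical captures: the term is rare in 2000 and common in 2002.
docs = [
    ("p1", 2000, ["september", "calendar"]),
    ("p2", 2000, ["mars", "surveyor"]),
    ("p3", 2000, ["ethiopian", "year"]),
    ("p4", 2002, ["september", "attacks", "september"]),
    ("p5", 2002, ["september", "memorial"]),
    ("p6", 2002, ["september", "commission"]),
]
early = rank_in_span(docs, "september", 2000, 2000)
late = rank_in_span(docs, "september", 2002, 2002)
```

In this toy data the term carries more weight in the 2000 span, where it is rare, than in the 2002 span, where every document contains it. A ranking computed over the whole archive would assign the same weight in both cases.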

Search Result Delivery
Results can be presented in several ways:
– Grouped by Time
– Grouped by URL
– Sorted by Relevance
– Sorted by Time
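The grouping and sorting options listed on this slide can be expressed as two small helpers. This is a sketch only: the hit layout `(url, capture_time, score)`, the `'YYYY-MM'` time format, the function names, and the sample hits are assumptions, not details taken from the slides.

```python
from collections import defaultdict


def sort_hits(hits, by="relevance"):
    """hits are (url, capture_time, score) tuples; capture_time is 'YYYY-MM'."""
    if by == "relevance":
        return sorted(hits, key=lambda h: h[2], reverse=True)
    if by == "time":
        return sorted(hits, key=lambda h: h[1])
    raise ValueError(f"unknown sort order: {by}")


def group_hits(hits, by="url"):
    """Group hits by URL or by capture time, keeping input order within groups."""
    position = {"url": 0, "time": 1}[by]
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[position]].append(hit)
    return dict(groups)


# Hypothetical hits from two monthly crawls.
hits = [
    ("house.gov/x", "2004-01", 0.4),
    ("senate.gov/y", "2003-12", 0.9),
    ("house.gov/x", "2003-12", 0.6),
]
by_time = sort_hits(hits, by="time")
by_url_ranked = group_hits(sort_hits(hits, by="relevance"), by="url")
```

Because grouping preserves input order, sorting first and grouping second gives groups whose members are already ordered, which covers combinations such as "grouped by URL, sorted by relevance".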

Collection Used
Collaboration with the Library of Congress and the Internet Archive.
US 108th Congress Web Archive:
– 16 monthly crawls between December 2003 and March
– Web sites of Representatives, Senators, Delegates, and Committees of the 108th US Congress ( ).
– Number of sites: 582
– Number of records: 27 million
– Total size: around 2 TB
Archived in the Library of Congress.

[Figure: system architecture. Labels: ADAPT Web Archive Server; Internet; UMIACS; Search/Return Ranked URLs; Retrieve Web Documents; Search Cluster; Storage Cluster; Processing/Indexing Cluster (Hadoop); WARCs; Library of Congress; Internet Archive; Inverted Indices; Storage Containers.]

Demo

Screen Shots
[Figure: screenshots of the search interface. Labels: May 21,; Search Keywords; Time Span; Options; Group by Time; Ungroup; Collapse Results; Sort by Time; Sort by Relevance; Retrieve Page; Follow Link.]