Information persistence on the Web Judit Bar-Ilan The Hebrew University and Bar-Ilan University and Bluma Peritz The Hebrew University.


Web documents
- They are not like printed/written material
  - If preserved, such material lasts “forever”, e.g. the Code of Hammurabi
- They are not like unrecorded phone calls that disappear into the air

Web documents
- Can exist only for a limited amount of time
- Can be removed, or moved to a different location
- Can undergo changes
  - CNN’s main page is updated approx. every 15 minutes
  - The program page for this conference
- Can be temporarily inaccessible
  - Communication/server problems
- The Web is dynamic

The Web
- On the one hand, it grows continuously
- On the other hand, it changes constantly: not only are new documents added to it, but
  - Existing documents are removed
  - Existing documents undergo changes in
    - content
    - format
    - linkage

Question: How do documents on the Web evolve?
- News pages change very frequently
- How about more “academic” topics?
- As a case study, we analyzed the changes occurring over a period of five years to a set of pages containing the search terms informetric or informetrics
- Almost no other such long-range studies exist
  - Koehler (JASIST, 2002): a “random”, fixed set of Web pages monitored weekly for a period of four years

Data collection
- First data collection point (June 1998)
- Data discovery by submitting the query to the largest search engines existing at the time
  - AltaVista, Excite, HotBot, InfoSeek, Lycos and NorthernLight (aim: exhaustiveness)
- All results retrieved, collated list of URLs created (941 URLs)
- URLs downloaded as soon as possible after searching
- Content of URLs checked for presence of the search terms (866 URLs, 91.9%)
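The content check in the last step can be illustrated with a minimal Python sketch. The slides do not say how the check was implemented; the whole-word, case-insensitive pattern and the function names below are assumptions made for illustration only.

```python
import re

# Assumed pattern: "informetric" or "informetrics" as a whole word,
# matched case-insensitively (the slides do not specify either detail).
TERM_PATTERN = re.compile(r"\binformetrics?\b", re.IGNORECASE)


def satisfies_query(page_text: str) -> bool:
    """Return True if the downloaded page text contains a search term."""
    return TERM_PATTERN.search(page_text) is not None


def filter_relevant(pages: dict[str, str]) -> dict[str, str]:
    """Keep only the URLs whose downloaded text satisfies the query."""
    return {url: text for url, text in pages.items() if satisfies_query(text)}
```

Under these assumptions, a page mentioning only "bibliometrics" would be dropped from the collated list, while one mentioning "Informetrics" would be kept.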

Data collection (cont.)
- Subsequent data collection points: June 1999, 2002 and 2003
- Search engines were queried as before
  - Set of search engines in 2002 & 2003: AltaVista, Fast, Google, HotBot, Teoma and Wisenut
- List of URLs created, pages downloaded
- Previously identified URLs that the search engines did not retrieve in the current round were revisited and their contents downloaded
- This method allowed us to monitor previously discovered URLs, while adding new (or newly discovered) URLs to the set
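One collection round as described above can be sketched in Python. This is an illustrative outline, not the authors' tooling: the function name, the `fetch` callable (which stands in for whatever downloader was used and returns `None` for an unreachable page), and the data shapes are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def collection_round(
    known_urls: set[str],
    engine_results: Iterable[Iterable[str]],
    fetch: Callable[[str], Optional[str]],
) -> tuple[set[str], dict[str, Optional[str]]]:
    """Run one data collection point.

    known_urls: URLs identified at earlier data collection points.
    engine_results: one list of result URLs per search engine queried.
    fetch: hypothetical downloader; returns page text or None on failure.
    """
    # Collate the results of all search engines into one URL set.
    retrieved: set[str] = set()
    for results in engine_results:
        retrieved.update(results)

    # Previously identified URLs the engines no longer return are revisited.
    to_revisit = known_urls - retrieved

    # Download both the freshly retrieved and the revisited URLs.
    pages = {url: fetch(url) for url in sorted(retrieved | to_revisit)}

    # The monitored set grows by the new (or newly discovered) URLs.
    return known_urls | retrieved, pages
```

Passing the downloader in as a parameter keeps the sketch free of network code, so it can be exercised with a stub.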

The observed growth rate during the study period

Not only growth …
- Up to and including 2002, 5034 URLs were discovered
- In June 2003 only 2850 were still available and satisfied the query
- Thus 37.5% of the URLs (1890 URLs)
  - disappeared or
  - ceased to satisfy the query (topic shift)
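The percentage on this slide can be rechecked with a few lines of Python. The variable names are illustrative; the three counts are taken from the slide as given. Note that the two 2003 counts do not add up to 5034, and the slide does not say how the remaining URLs were classified.

```python
# Counts as stated on the slide.
discovered_through_2002 = 5034
lost_by_2003 = 1890            # disappeared or ceased to satisfy the query
still_satisfying_2003 = 2850   # still available and satisfying the query

# Share of previously discovered URLs that were lost by June 2003.
lost_share = lost_by_2003 / discovered_through_2002

# URLs not covered by either count (classification not given on the slide).
unaccounted = discovered_through_2002 - lost_by_2003 - still_satisfying_2003

print(f"lost share: {lost_share:.1%}")
print(f"URLs not covered by either count: {unaccounted}")
```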

… also modifications
- Of the URLs satisfying the query at two consecutive data points, about 50% had undergone some kind of modification
  - The text of the source files was compared
- A stable dataset compared to random sets
  - e.g. in Koehler’s random set of 361 URLs, no changes were observed for only 3%
- An unstable set compared to digital libraries
  - e.g. PubMedCentral, arXiv, CiteSeer: only 3% of the sample disappeared during the one-year period of observation
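A comparison of source text between two consecutive data points might look like the following Python sketch. The slides only state that the text of the source files was compared; exact comparison through a SHA-256 fingerprint, and the function names, are assumptions chosen for simplicity.

```python
import hashlib


def fingerprint(source_text: str) -> str:
    """Hash the page source so snapshots can be compared cheaply."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def modification_share(earlier: dict[str, str], later: dict[str, str]) -> float:
    """Fraction of URLs present at both data points whose source changed.

    Both arguments map URL -> page source for pages satisfying the query.
    """
    common = earlier.keys() & later.keys()
    if not common:
        return 0.0
    modified = sum(
        1 for url in common
        if fingerprint(earlier[url]) != fingerprint(later[url])
    )
    return modified / len(common)
```

Exact comparison counts even a one-character edit as a modification; a study interested in substantive changes only would need a more tolerant measure.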

What is the value of such elusive information?
- Dellavalle et al., Science 2003: Going, going, gone: Lost Internet References
- Bar-Ilan & Peritz, JASIST (to appear): Evolution, Continuity and Disappearance of Documents on a Specific Topic on the Web - A Longitudinal Study of “Informetrics”
  - SIST_notice.pdf
- Internet Archive: saves “snapshots” of the Web at different points in time
  - Wayback Machine

Aug 26, 2000

March 31, 2001

May 28, 2002

April 25, 2003
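Dated snapshots such as the four listed above can be requested from the Wayback Machine by putting a timestamp in the URL path. The Python sketch below builds such addresses; the page address `http://example.org/` is a placeholder, since the transcript does not say which page the snapshots show, and the helper name is hypothetical.

```python
from datetime import date

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, day: date) -> str:
    """Build a Wayback Machine address for the capture closest to a date."""
    return f"{WAYBACK_PREFIX}/{day:%Y%m%d}/{page_url}"


# Snapshot dates from the slides; the page URL is a placeholder.
snapshot_dates = [
    date(2000, 8, 26),
    date(2001, 3, 31),
    date(2002, 5, 28),
    date(2003, 4, 25),
]
for day in snapshot_dates:
    print(wayback_url("http://example.org/", day))
```

If no capture exists for the exact date, the Wayback Machine serves the nearest one it holds, so the page returned may carry a different date.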