WebInfoMall: the Chinese Web Archive. How we got started and how it is now. Huang Lianen and Li Xiaoming, Peking University, China. Digital Archive Workshop.

Presentation transcript:

WebInfoMall: the Chinese Web Archive. How we got started and how it is now.
Huang Lianen and Li Xiaoming, Peking University, China
Digital Archive Workshop, August 27, 2007, Xian, China

Institute of Network Computing and Information Systems
Outline
- Motivation
  - I was not able to give an answer when someone asked me what had been on the Chinese web.
  - In 2100, I'd like to be able to answer concretely if someone asks me what was on the Chinese web in 2001.
- Archiving technology
  - For long-term web crawling and storage, what technology should be used, especially in a university lab environment?
- Exhibition of the archive
  - How do we show the archive to society?

On the elapsing nature of Web data
- Li Xiaoming, "On the estimation of the number of previous Chinese Web pages", Journal of Peking University, Vol. 39, No. 3, May 2003.
- As a by-product, we also obtained the result that the time for 50% of current web pages to disappear is about 0.99 year.
- Given this elapsing nature, can we archive the pages before they are gone?
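The 0.99-year figure can be read as a half-life. The sketch below assumes that pages disappear by simple exponential decay, which the slide does not state; the function name `surviving_fraction` is hypothetical and only illustrates the arithmetic.

```python
HALF_LIFE_YEARS = 0.99  # figure quoted on the slide


def surviving_fraction(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of today's pages still online after `years`.

    Assumption (not from the slides): disappearance follows exponential decay,
    so the surviving fraction halves every `half_life` years.
    """
    return 0.5 ** (years / half_life)


if __name__ == "__main__":
    for t in (0.99, 2.0, 5.0):
        print(f"after {t} years: {surviving_fraction(t):.1%} of pages remain")
```

Under that assumption, most of a given year's pages are gone within a few years, which is the argument for crawling continuously rather than occasionally.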

We have some advantage
- With a search engine, 50% is done!
- The system work started in 2001.

The progress and current status
- The crawl started in 2001, and the first batch of data was put online on Jan 18.
- As of today, the repository holds over 2.5 billion distinct Chinese web pages, or more precisely, pages crawled from mainland China's web.
- It grows by about 1 million pages every day.
- Initially we used tapes for storage, but later changed to hard disks.
- Total online data volume (compressed) ≈ 30 TB, with an offline backup.
- Spring 2002: "historical browsing" was provided. Summer 2006: "backward browsing" entered beta testing.

Example: the InfoMall interface

Example: input

Example: Sina. Bin Laden's headquarters was bombed.

Links are preserved. In the first air strike of the new year, the US Air Force bombed Bin Laden's headquarters.

Links continue to be preserved

Featured collections: SARS

Featured collections: the first manned space vehicle

We ask three questions:
- What's the use?
  - Preserving historical information before it's lost
  - Implying great opportunities for deep mining
  - Providing access to previous information much more conveniently than libraries can, even if they have kept it
- Can we do it? (or at least get a pretty good start)
  - "We": a university lab.
- How do we do it?

Can we do it? (resource requirement)
- "Hard" resources
  - Crawler system: 4 computers at $5,000 each
  - Storage system: about 50 million pages per 1 TB, which amounts to $4,000. If you need a backup, double the investment.
  - Access web server: $4,000
  - Space (not big, but reliable) to put these machines
  - High-speed network connection: ? per month
- "Soft" resources
  - Permission for crawling and keeping
  - A staff member to handle the daily routine matters
  - Persistent enthusiasm for this undertaking
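The hardware figures above can be turned into a rough budget estimate. This is a minimal sketch: the constants come from the slide, while the function name `hardware_budget` and the choice to buy storage in whole terabytes are assumptions made here for illustration. Network, space, and staff costs are left out because the slide gives no numbers for them.

```python
import math

PAGES_PER_TB = 50_000_000      # slide: about 50 million pages per 1 TB
COST_PER_TB_USD = 4_000        # slide: 1 TB of storage amounts to $4,000
CRAWLER_COST_USD = 4 * 5_000   # slide: 4 crawler computers at $5,000 each
WEB_SERVER_COST_USD = 4_000    # slide: access web server


def hardware_budget(pages: int, with_backup: bool = True) -> int:
    """Estimate the one-off hardware cost in USD for archiving `pages` pages.

    Assumption: storage is bought in whole terabytes, rounded up.
    """
    terabytes = math.ceil(pages / PAGES_PER_TB)
    storage = terabytes * COST_PER_TB_USD
    if with_backup:
        storage *= 2           # slide: a backup doubles the storage investment
    return CRAWLER_COST_USD + WEB_SERVER_COST_USD + storage


if __name__ == "__main__":
    for pages in (50_000_000, 500_000_000):
        print(pages, "pages ->", hardware_budget(pages), "USD")
```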

How do we do it?
- Incremental crawling
  - A scheduled daily operation that collects about one to two million new pages a day; each page's fingerprint is compared with those of previous pages
- Data storage and incorporation
  - Once every few weeks, after enough data has been collected
- Accessibility
  - Wayback Machine style
  - Featured exhibitions
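The fingerprint comparison in the incremental crawl can be sketched as follows. The slides do not say which fingerprint function is used or whether the comparison is per URL, so this example assumes a SHA-256 digest of the page bytes and keeps one "latest" fingerprint per URL. The names `fingerprint` and `IncrementalArchive` are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a hex digest that identifies the exact page content."""
    return hashlib.sha256(content).hexdigest()


class IncrementalArchive:
    """Keeps a new snapshot only when a URL's content has changed."""

    def __init__(self):
        self.last_fingerprint = {}   # url -> fingerprint of the latest stored version
        self.pending = []            # (url, content) pairs awaiting incorporation

    def offer(self, url: str, content: bytes) -> bool:
        """Store the page if it is new or changed; return True when stored."""
        fp = fingerprint(content)
        if self.last_fingerprint.get(url) == fp:
            return False             # unchanged since the previous crawl
        self.last_fingerprint[url] = fp
        self.pending.append((url, content))
        return True


if __name__ == "__main__":
    archive = IncrementalArchive()
    print(archive.offer("http://example.com/", b"version 1"))
    print(archive.offer("http://example.com/", b"version 1"))
    print(archive.offer("http://example.com/", b"version 2"))
```

In this sketch the `pending` list stands in for the data that would be merged into the repository every few weeks.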

WebInfoMall: hierarchical module data organization
- Assurance of scalability and dynamic reconfigurability
  - Convenient for coping with changes at all levels
- record : file : batch : disk : node : system
- Matching the logical data organization to the physical device structure as closely as possible
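One way to picture the record : file : batch : disk : node : system hierarchy is as an address whose fields mirror the physical layout. The sketch below is an illustration only; the class `RecordAddress`, its field names, and the path format are assumptions and do not reflect the published WebInfoMall storage format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordAddress:
    """Hypothetical address of one archived page inside the system."""
    node: str     # machine in the cluster
    disk: str     # disk attached to that node
    batch: str    # group of files incorporated together
    file: str     # data file inside the batch
    record: int   # position of the page record inside the file

    def path(self) -> str:
        """Path of the data file, mirroring node/disk/batch/file."""
        return f"{self.node}/{self.disk}/{self.batch}/{self.file}"

    def relocate(self, node: str, disk: str) -> "RecordAddress":
        """Address after a disk is moved; the lower levels stay unchanged."""
        return replace(self, node=node, disk=disk)


if __name__ == "__main__":
    addr = RecordAddress("node03", "disk1", "batch-2007-08", "pages-0001.dat", 42)
    print(addr.path())
    print(addr.relocate("node07", "disk0").path())
```

Because each level only refers to the level below it, moving a disk to another node changes two fields and leaves batch, file, and record untouched, which is the kind of reconfigurability the slide describes.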

The architecture

The operations under the hood

Comparison
- A survey was done by the National Library of China.
- Web InfoMall is the only large-scale web archive in China, and it is operated in a university lab!
- In the flattened world, "small can act big!"

Resource sharing
- We have published the data storage format.
- We provide WebInfoMall data to the research community for free. The beneficiary research units include Peking University, Tsinghua University, Chinese Academy of Sciences, Shanghai Jiaotong University, Renmin University of China, Harbin Institute of Technology, ...
- In particular, we built the largest Chinese Web Test collection, with 200 GB of compressed web pages (CWT200g), for the evaluation of Chinese web information retrieval technologies.

Summary
- WebInfoMall has been the Chinese web archive since 2001, with over 2.5 billion pages in its repository.
- Straightforward technology has been used for building WebInfoMall: Linux box + Berkeley DB + hierarchical module data organization.
- We are looking into different ways to access the data, to get more value than just information preservation and history browsing.

Thanks for your attention