Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research Nicholas Woodward Latin American Network Information Center


Presentation transcript:

Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research Nicholas Woodward Latin American Network Information Center TCDL May 24, 2012

presidencia.gob.hn Before the Coup...

presidencia.gob.hn... During the Coup...

presidencia.gob.hn... After the Coup

Government in exile Website

Why Web Archive

History of archiving Latin America at UT Austin
Benson Library collected gov docs in print since 1920s
Latin America began moving to digital gov docs around 2000
  Download, print and curate
Latin American Government Document Archive begins 2005
  Crawl entire websites, compress and curate data
  Provide access to digital content directly

Latin American Government Document Archive
LAGDA = 280 seeds, about 15 government ministries for each of 18 countries, crawled quarterly since 2005
  Files crawled and archived to date in LAGDA: 70 million
  Data archived: 5.9 TB
  Items added to collection per year: 9-10 million
  HTML pages archived per crawl: 1.6 million
  PDF documents archived per crawl: 260,000
  Monthly average pageviews on LAGDA: 2,918

Latin American Government Documents
Speeches
  Full text
  Audio
  Video
Official Statistics
  Census
  Surveys
  Economic data
Reports
  Regular Annual Reports
  State of the Union
  Sector Reports
Born Digital
  (Website as a digital object)
  (Social Media)

LAGDA: challenges to data mining
Heterogeneous corpus
  Various languages
  Data formats (HTML, Word, PDF, Other)
  Document characteristics
Minimal metadata
Variety of sources (countries, governments, departments)

LAGDA: motivating problem
Goal: Automatically attach labels to documents in a large collection based on training documents
Challenges:
  Keyword search is ineffective due to lack of consistent words
  Training documents may cover broad subject areas

LAGDA: techniques for data mining
Break documents into n-grams
  1-gram {The, quick, brown, fox, jumps, over, the, lazy}
  2-gram {The quick, quick brown, brown fox, fox jumps}
  3-gram {The quick brown, quick brown fox…}
Identify one or more subsets of n-grams with significantly high usage in the training documents
Evaluate all documents in the corpus using these n-grams
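The three steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names, the relative-frequency ratio used to decide what counts as "significantly high usage", the add-one smoothing, and the `min_ratio` threshold are all assumptions introduced here.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of text, in order of appearance."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def select_ngrams(training_docs, background_docs, n, min_ratio=2.0):
    """Pick n-grams used markedly more often in the training documents.

    Assumption: "high usage" is measured as the ratio between an n-gram's
    relative frequency in the training documents and its (add-one smoothed)
    relative frequency in a background sample.
    """
    train = Counter(g for d in training_docs for g in ngrams(d, n))
    back = Counter(g for d in background_docs for g in ngrams(d, n))
    train_total = sum(train.values()) or 1
    back_total = sum(back.values()) or 1
    selected = set()
    for gram, count in train.items():
        train_rate = count / train_total
        back_rate = (back[gram] + 1) / (back_total + 1)
        if train_rate / back_rate >= min_ratio:
            selected.add(gram)
    return selected


def score(doc, selected, n):
    """Fraction of a document's n-grams that fall in the selected subset."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in selected) / len(grams)
```

For example, `ngrams("The quick brown fox jumps", 2)` yields the 2-grams listed on the slide. A document would then be labeled when its `score` against the selected subset exceeds some cutoff, which would have to be tuned on held-out data.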

LAGDA: techniques for data mining
Use this score and others to create a composite score
The company you keep - Examine the text and the links that point to our documents
Natural language processing
  Named entities & Part-of-Speech tagging
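The slide does not say how the individual scores are combined. One simple possibility, shown here purely as an assumption, is a weighted average of per-signal scores that each lie between 0 and 1. The signal names (`ngram`, `links`, `entities`) and the weights are hypothetical.

```python
def composite_score(scores, weights):
    """Weighted average of per-signal scores.

    scores and weights are dicts keyed by signal name; every signal named
    in weights must also appear in scores.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical signals: n-gram score, inbound-link text score, named-entity score.
example = composite_score(
    {"ngram": 0.5, "links": 1.0, "entities": 0.0},
    {"ngram": 2, "links": 1, "entities": 1},
)
```

In practice the weights would need to be fitted or tuned against labeled documents rather than chosen by hand.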

LAGDA: technology for large-scale computing at TACC
Corral data storage system (6 Petabytes)
Longhorn High Performance Cluster
Paradigms for distributed computing (MPI and Hadoop)
  Nodes work in parallel and combine their results
  Allows us to divide and conquer the problem
Open source libraries (Heritrix, Tika, Lucene, OpenNLP)
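The divide-and-conquer pattern described here (workers process separate partitions, then their partial results are merged) can be illustrated on a single machine. The sketch below uses a Python thread pool as a stand-in for cluster nodes; it is not MPI or Hadoop code, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: one worker counts words in its own share of the documents."""
    counts = Counter()
    for doc in partition:
        counts.update(doc.split())
    return counts


def parallel_word_counts(docs, workers=4):
    """Split docs into partitions, count each concurrently, then merge."""
    partitions = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:  # reduce step: combine the partial results
        total.update(partial)
    return total
```

On a real cluster each partition would live on a different node and the merge would be performed by the framework's reduce or gather operation. Threads are used here only to keep the example self-contained.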

LAGDA: initial results
Traditional classification approaches are unsuccessful
Our n-gram approach for classification based on training set outperforms traditional Bayesian Inference Classifier
Results from our composite scores demonstrate additional improvement

big data and libraries: going forward
Challenges posed by web-archived data
  Size, heterogeneity and limited metadata
  Data access that is more dynamic and flexible
How big data can create data-driven research
  Development of use cases and research examples
  Technology at the service of social sciences, humanities and other fields whose research could benefit

Acknowledgments
Kent Norsworthy, LLILAS and Benson Collection
Weijia Xu, TACC
Carolyn Palaima, LLILAS and Benson Collection
UT Libraries

Contact Google: LAGDA