2004.10.21 - SLIDE 1PNC Digital Archives Session Prof. Ray R. Larson University of California, Berkeley School of Information Management and Systems

Slides:



Advertisements
Similar presentations
Putting the Pieces Together Grace Agnew Slide User Description Rights Holder Authentication Rights Video Object Permission Administration.
Advertisements

Copyright, UCL LEADERS: Linking EAD to Electronically Retrievable Sources Developing a Generic Toolkit: Architecture and technology issues ALLC/ACH Conference.
Overview Environment for Internet database connectivity
Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
XML: Extensible Markup Language
Providing Online Access to the HKUST University Archives: EAD to INNOPAC Sintra Tsang and K.T. Lam The Hong Kong University of Science and Technology 7th.
EAD in A2A Bill Stockting, Senior Editor A2A and EAD Working Group: Central Archives of Historical Records, Warsaw, 26 April 2003.
1 Introduction to XML. XML eXtensible implies that users define tag content Markup implies it is a coded document Language implies it is a metalanguage.
Information Retrieval in Practice
A New Learning Tools. Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.
December 9, 2002 Cheshire II at INEX -- Ray R. Larson Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Ray R.
August 20, 2003 ECDL 2003, Trondheim -- Ray R. Larson Distributed IR for Digital Libraries Ray R. Larson School of Information Management & Systems University.
© Tefko Saracevic, Rutgers University1 metadata considerations for digital libraries.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
SLIDE 1IS 242 – Fall 2011 Examples of XML DTDs and XSDs Ray R. Larson University of California, Berkeley School of Information IS 242: XML.
OLC Spring Chapter Conferences Metadata, Schmetadata … Tell Me Why I Should Care? OLC Spring Chapter Conferences, 2004 Margaret.
1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.
Angelika Menne-Haritz The MEX editor - METS and the presentation of digitised archives The MEX editor: METS and the Internet presentation of.
Presented by Karen W. Gwynn LS – Metadata University of Alabama Prof. Steven MacCall Spring 2011.
Overview of Search Engines
Use of METS in CDL Digital Special Collections Brian Tingle.
Guest Lecture LIS 656, Spring 2011 Kathryn Lybarger.
Chapter 1 Internet & Web Basics Key Concepts Copyright © 2013 Terry Ann Morris, Ed.D. 1.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
OCLC Online Computer Library Center CONTENTdm ® Digital Collection Management Software Ron Gardner, OCLC Digital Services Consultant ICOLC Meeting April.
Mark Sullivan University of Florida Libraries Digital Library of the Caribbean.
XML Overview. Chapter 8 © 2011 Pearson Education 2 Extensible Markup Language (XML) A text-based markup language (like HTML) A text-based markup language.
1 XML as a preservation strategy Experiences with the DiVA document format Eva Müller, Uwe Klosa Electronic Publishing Centre Uppsala University Library,
The Metadata Object Description Schema (MODS) NISO Metadata Workshop May 20, 2004 Rebecca Guenther Network Development and MARC Standards Office Library.
TEXT ENCODING INITIATIVE (TEI) Inf 384C Block II, Module C.
1 Technologies for distributed systems Andrew Jones School of Computer Science Cardiff University.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Metadata and Geographical Information Systems Adrian Moss KINDS project, Manchester Metropolitan University, UK
Metadata: Essential Standards for Management of Digital Libraries ALI Digital Library Workshop Linda Cantara, Metadata Librarian Indiana University, Bloomington.
XML A web enabled data description language 4/22/2001 By Mark Lawson & Edward Ryan L’Herault.
DACS Describing Archives: A Content Standard. The Background  Archives, Personal Papers & Manuscripts, 1980s –New Technologies with Web, XML, EAD –Revision.
WEB BASED DATA TRANSFORMATION USING XML, JAVA Group members: Darius Balarashti & Matt Smith.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
SLIDE 1DID Meeting - Montreal Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California,
Archival Description People, Records, and Functions Daniel V. Pitti Institute for Advanced Technology in the Humanities University of Virginia March 2003.
1 CS 502: Computing Methods for Digital Libraries Lecture 19 Interoperability Z39.50.
1 Metadata –Information about information – Different objects, different forms – e.g. Library catalogue record Property:Value: Author Ian Beardwell Publisher.
ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.
Discovery Metadata for Special Collections Concepts, Considerations, Choices William E. Moen School of Library and Information Sciences Texas Center for.
Introduction to metadata
Introduction to Morpho BEAM Workshop Samantha Romanello Long Term Ecological Research University of New Mexico.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
Alternative Architecture for Information in Digital Libraries Onno W. Purbo
Collection Description in the 1 November 2001Collection Description in the Archives Hub Archival perspective Collection description has always been central.
XML and Its Applications Ben Y. Zhao, CS294-7 Spring 1999.
EAD: An Introduction and Primer Christopher J. Prom, Ph.D. Assistant University Archivist University of Illinois Archives July 7, 2003.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
Metadata “Data about data” Describes various aspects of a digital file or group of files Identifies the parts of a digital object and documents their content,
Basic Encoded Archival Description METRO New York Library Council Workshop Presented by Lara Nicosia December 9, 2011 New York, NY.
Metadata and Meta tag. What is metadata? What does metadata do? Metadata schemes What is meta tag? Meta tag example Table of Content.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
A Logistic Regression Approach to Distributed IR Ray R. Larson : School of Information Management & Systems, University of California, Berkeley --
XML Tools (Chapter 4 of XML Book). What tools are needed for a complete XML application? n Fundamental components n Web infrasructure n XML development.
EAD 101: An Introduction to Encoded Archival Description XML and the Encoded Archival Description: Providing Access to Collections Oregon Library Association.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Do Real Archivists Use OAI? Mid-Atlantic Regional Archives Conference Gettysburg, PA October 31, 2003 Chris Prom Assistant University Archivist University.
A centre of expertise in digital information management UKOLN is supported by: Metadata – what, why and how Ann Chapman.
XML 1. Chapter 8 © 2013 Pearson Education, Inc. Publishing as Prentice Hall SAMPLE XML SCHEMA (XSD) 2 Schema is a record definition, analogous to the.
Geospatial metadata Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Information Retrieval in Practice
7th Annual Hong Kong Innovative Users Group Meeting
Search Engine Architecture
An Overview of Data-PASS Shared Catalog
CSE591: Data Mining by H. Liu
Presentation transcript:

SLIDE 1PNC Digital Archives Session Prof. Ray R. Larson University of California, Berkeley School of Information Management and Systems Metadata and Retrieval for Digital Archives

SLIDE 2PNC Digital Archives Session Overview XML and SGML for Digital Archives and Digital Libraries The Encoded Archival Description (EAD) The Cheshire II system and Features The Archives Hub and Distributed Archives Hub

SLIDE 3PNC Digital Archives Session Overview XML and SGML for Digital Archives and Digital Libraries The Encoded Archival Description (EAD) The Cheshire II system and Features The Archives Hub and Distributed Archives Hub

SLIDE 4PNC Digital Archives Session What is SGML/XML? SGML stands for Standard Generalized Markup Language –XML stands for eXtended Markup Language What it is NOT: –Not a visual document description –Not an application specific markup –Not proprietary

SLIDE 5PNC Digital Archives Session What is SGML/XML? What it is: –An international standard (SGML- ISO 8879:1986) –A generic language for describing the structure of documents, and markup that can be used for those documents –Intended for generating markup for content rather than form elements XML is a simplified subset of SGML (being established by W3C)

SLIDE 6PNC Digital Archives Session Why XML is Revolutionary XML enables those developing WWW resources to preserve any “document type” or “database schema” when they publish on the Web XML enables web services to send self- describing “protocol messages” that can be understood by programs, not just “by eye” This information cannot be encoded in HTML XML-encoded information is smart enough to support new classes of Web applications Adapted from Dr. Robert J Glushko

SLIDE 7PNC Digital Archives Session XML Enables New Web Applications Data interchange between Web clients –use Web for application integration without information loss (example: product information in supply chain, EDI) Moving processing from server to client –reduce network traffic and server load (example: download airline schedule, find best flights without “back-and-forth” thrashing) Source Dr. Robert J Glushko

SLIDE 8PNC Digital Archives Session XML Enables New Web Applications Multiple client-side views of same data –expert and novice versions –manager and worker versions – localization (currency or measurement conversions) “Information push” from personalized applications –selecting information based on user preferences (example: custom news feed by matching article keywords against user profile) Source Dr. Robert J Glushko

SLIDE 9PNC Digital Archives Session The First Generation Web ComputersBrowsers.. making information accessible through browsers scripts HTML Eyeballs only No automation Limited integration Source Dr. Robert J Glushko

SLIDE 10PNC Digital Archives Session HTML Airline Schedule Seen “By Eye” Airline Schedule Flight Information United Airlines #200 San Francisco 9:30 AM Honolulu 12:30 PM $ Source Dr. Robert J Glushko

SLIDE 11PNC Digital Archives Session HTML Airline Schedule Seen “By Computer” Airline Schedule Flight Information United Airlines #200 San Francisco 9:30 AM Honolulu 12:30 PM $ Source Dr. Robert J Glushko

SLIDE 12PNC Digital Archives Session Next Generation Web Java Computers.. making information and services accessible to computers (and people) XML Structured searches Agents New models Source Dr. Robert J Glushko

SLIDE 13PNC Digital Archives Session Airline Schedule in XML San Francisco 9:30 AM Honolulu 12:30 PM Source Dr. Robert J Glushko

SLIDE 14PNC Digital Archives Session XML is a Foundation for Interoperability Format based API based WEBEDICORBA / COM XML.. exchange information in an application and vendor neutral format Source Dr. Robert J Glushko

SLIDE 15PNC Digital Archives Session XML and Digital Archives XML provides a platform for data interchange and for search and manipulation of described data XML/SGML provides a standard framework for sharing and distribution of Archival information using the Encoded Archival Description and Encoded Archival Context markup languages

SLIDE 16PNC Digital Archives Session Overview XML and SGML for Digital Archives and Digital Libraries The Encoded Archival Description (EAD) The Cheshire II system and Features The Archives Hub and Distributed Archives Hub

SLIDE 17PNC Digital Archives Session Components of Archival Description Description of records Context of creation: creators Functions and activities documented in records Dedicated descriptive semantics and structure for each component Components interrelated with one another

SLIDE 18PNC Digital Archives Session Records: EAD Encoded Archival Description –Society of American Archivists and Library of Congress –Used internationally –English, Spanish, Dutch, French, and Chinese 1998, 2002

SLIDE 19PNC Digital Archives Session What EAD Is An emerging encoding and structural standard for archival description –Data structure –Communication/interchange –Finding aid / archival description ISAD(G)

SLIDE 20PNC Digital Archives Session What EAD Is Not Content standard Data value standard Archival management system

SLIDE 21PNC Digital Archives Session Principals of Record Description Respect de fonds –Provenance –Original order Hierarchical and symmetrical Inheritance of description

SLIDE 22PNC Digital Archives Session The EAD DTD The EAD DTD is very complex and permits considerable flexibility in expressing the description and topics of the archival collection. The main parts are outlined on the following slides, but include: –A header, including basic descriptive info. –Optional frontmatter –The archival description We will describe only a few of the top-level tags

SLIDE 23PNC Digital Archives Session Major Sections and DTD Defs EAD – EADHeader: – –FILEDESC

SLIDE 24PNC Digital Archives Session Major Sections and DTD Defs The Archival Description: – The Descriptive Identification –

SLIDE 25PNC Digital Archives Session Example EAD Record (Hub) GB 0133 TAB Tabley Muniments John Rylands University Library of Manchester 150 Deansgate Manchester... (Parts removed )… University of Manchester, John Rylands University Library of Manchester <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133"> GB 0133 TAB Tabley Muniments 19th century 1.24 cu.m Warren, family, of Tabley, Cheshire Warren, John Byrne Leicester, , 3rd Baron de Tabley, poet

SLIDE 26PNC Digital Archives Session Example EAD Record (Hub) Administrative/Biographical History The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley ( ), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston ( ) and William Lancaster ( ), but latterly under his own name. His early verse included Praeterita (1863), Eclogues and Monodramas (1864), Studies in Verse (1865), Philocletes (1866), and Orestes (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)…

SLIDE 27PNC Digital Archives Session Example EAD Record (Hub) Scope and Content The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates. Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to "Coins", "Botany", "Poetry", "Literary", "Financial" and bookplates. Preliminary survey list. There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection. The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester.

SLIDE 28PNC Digital Archives Session Example EAD Record (Hub) Index terms Tabley Inferior Cheshire SJ7378 Benson Arthur Christopher Bridges Robert Seymour Duff Sir Mountstuart Elphinstone Grant Knight Gosse Sir Edmund William Knight Milnes Richard Monckton st Baron Houghton Bookplates Botany Numismatics Poetry Modern 19th century

SLIDE 29PNC Digital Archives Session People: EAC EAD is now complemented by “EAC” or the “Encoded Archival Context” It is another XML-based standard for encoding descriptions of record creators: corporate bodies, families, and individuals It is being developed as part of an international effort with hopes of being able to share this information among archives having materials related to those entities

SLIDE 30PNC Digital Archives Session Authority Control Identifying creator entities Recording name or names used by and for them Rule-based heading or entry formation and control

SLIDE 31PNC Digital Archives Session Relations Creators Records Functions and activities Each relation qualified by place and time Records evidence of people acting in particular places and times

SLIDE 32PNC Digital Archives Session Characteristics and Events Person –Sex, education, address, competencies, activities, affiliations, awards … –Biography Corporate body –Type, mandate, location, legal status, assets, structure… –Administrative history Family –Assets and structure, activities, location, legal status… –Family history

SLIDE 33PNC Digital Archives Session Overview XML and SGML for Digital Archives and Digital Libraries The Encoded Archival Description (EAD) The Cheshire II system and Features The Archives Hub and Distributed Archives Hub

SLIDE 34PNC Digital Archives Session Overview of Cheshire II It supports SGML and XML with components and component indexes It is a client/server application Uses the Z39.50 Information Retrieval Protocol, support for SRW, OAI, SOAP, SDLIP also implemented Server supports a Relational Database Gateway Supports Boolean searching of all servers Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search Search engine supports ``nearest neighbor'' searches and relevance feedback GUI interface on X window displays and Windows NT WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire Scriptable clients using Tcl and (new) Python Store SGML/XML as files or “Datastore” database

SLIDE 35PNC Digital Archives Session Cheshire II Searching Z39.50 Internet Images Scanned Text LocalRemote Z39.50

SLIDE 36PNC Digital Archives Session SGML/XML Support Underlying native format for all data is SGML or XML The DTD defines the database contents Full SGML/XML parsing SGML/XML Format Configuration Files define the database location and indexes Various format conversions and utilities available for Z39.50 support (MARC, GRS-1

SLIDE 37PNC Digital Archives Session Distributed Search Tasks Resource Description –How to collect metadata about digital libraries and their collections or databases Resource Selection –How to select relevant digital library collections or databases from a large number of databases Distributed Search –How to perform parallel or sequential searching over the selected digital library databases Data Fusion –How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

SLIDE 38PNC Digital Archives Session An Approach for Distributed Resource Discovery Distributed resource representation and discovery –New approach to building resource descriptions based on Z39.50 –Instead of using broadcast search across resources we are using two Z39.50 Services Identification of database metadata using Z39.50 Explain Extraction of distributed indexes using Z39.50 SCAN

SLIDE 39PNC Digital Archives Session Z39.50 Explain Explain supports searches for –Server-Level metadata Server Name IP Addresses Ports –Database-Level metadata Database name Search attributes (indexes and combinations) –Support metadata (record syntaxes, etc)

SLIDE 40PNC Digital Archives Session Z39.50 SCAN Originally intended to support Browsing Query for –Database –Attributes plus Term (i.e., index and start point) –Step Size –Number of terms to retrieve –Position in Response set Results –Number of terms returned –List of Terms and their frequency in the database (for the given attribute combination)

SLIDE 41PNC Digital Archives Session Z39.50 SCAN Results % zscan title cat {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} … zscan topic cat {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} … Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

SLIDE 42PNC Digital Archives Session Resource Index Creation For all servers, or a topical subset… –Get Explain information –For each index Use SCAN to extract terms and frequency Add term + freq + source index + database metadata to the XML “Collection Document” for the resource –Planned extensions: Post-Process indexes (especially Geo Names, etc) for special types of data –e.g. create “geographical coverage” indexes

SLIDE 43PNC Digital Archives Session MetaSearch Approach MetaSearch Server Map Explain And Scan Queries Internet Map Results Map Query Map Results Search Engine DB2DB 1 Map Query Map Results Search Engine DB 4DB 3 Distributed Index Search Engine Db 6 Db 5

SLIDE 44PNC Digital Archives Session Known Issues and Problems Not all Z39.50 Servers support SCAN or Explain Solutions that appear to work well: –Probing for attributes instead of explain (e.g. DC attributes or analogs) –We also support OAI and can extract OAI metadata for servers that support OAI –Query-based sampling (Callan) Collection Documents are static and need to be replaced when the associated collection changes

SLIDE 45PNC Digital Archives Session Our Collection Ranking Approach We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (I.e., fusion of multiple query results)

SLIDE 46PNC Digital Archives Session Titles only (short query)

SLIDE 47PNC Digital Archives Session Overview XML and SGML for Digital Archives and Digital Libraries The Encoded Archival Description (EAD) The Cheshire II system and Features The Archives Hub and Distributed Archives Hub

SLIDE 48PNC Digital Archives Session Distributed Archives Hub Distributed set of Cheshire Servers with EAD-based archival collections

SLIDE 49PNC Digital Archives Session Distributed Archives Hub Servers Replicated servers Group Servers Search Servers EAD Servers

SLIDE 50PNC Digital Archives Session Further Information Full Cheshire II client and server is open source and available at ftp://cheshire.berkeley.edu/pub/cheshire/ –Includes HTML documentation Project Web Site Cheshire III Archives Hub