2011-11-17 - SLIDE 1I242 - Fall 2011 Connecting Archival Collections: The Social Networks and Archival Context Project Ray R. Larson University of California,

Presentation transcript:

SLIDE 1I242 - Fall 2011 Connecting Archival Collections: The Social Networks and Archival Context Project Ray R. Larson University of California, Berkeley School of Information Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, and Brian Tingle of the California Digital Library for many of the slides here

SLIDE 2I242 - Fall 2011 SNAC Overview Funding and Timeline Project Team Project Objectives and Rationale Data Contributing Institutions Archival Standards Employed Extraction and Matching Prototype Interface

SLIDE 3I242 - Fall 2011 Funding and Timeline National Endowment for the Humanities A Preservation and Access, Research and Development grant Two-year project May 2010-April 2012 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

SLIDE 4I242 - Fall 2011 Project Team Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia) Adrian Turner and Brian Tingle (California Digital Library, University of California) Ray Larson, Krishna Janakiraman (School of Information, University of California, Berkeley)

SLIDE 5I242 - Fall 2011 Project Objectives Archival finding aids currently intermix description of records with description of the creators of records and of persons evident in the records Goal: using EAC-CPF, an international archival authority control standard, facilitate the separation of the description of people from the description of records in archival description Goal: improve the economy and effectiveness of archival description, and thereby improve access and understanding for users of archives, libraries, and museums

SLIDE 6I242 - Fall 2011 Data Contributing Institutions EAD-encoded finding aids –Library of Congress (1159) –Online Archive of California (15,400+) –Northwest Digital Archive (5,563+) –Virginia Heritage (8,390+) Authority records –Library of Congress: NACO/LCNAF (3.8M personal names; 900K corporate names) –Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names) –Virtual International Authority File (intersection with NACO/LCNAF, 5M personal names) Other biographical sources (e.g., DBPedia, IMDB)

SLIDE 7I242 - Fall 2011 Methods and Processing Extract EAC-CPF records from existing EAD-encoded archival descriptions –Extracting both creators and referenced CPF names Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity –Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) –Key challenge: two or more people with the same name; two or more names for the same person Create a prototype historical resource and access system –Historical data and social-professional networks –Links to archive, library, and museum resources (by and about)

SLIDE 8I242 - Fall 2011 Components of Archival Description Description of records Context of creation: creators Functions and activities documented in records Dedicated descriptive semantics and structure for each component Components interrelated with one another

SLIDE 9I242 - Fall 2011 Records: EAD Encoded Archival Description –Society of American Archivists and Library of Congress –Used internationally –English, Spanish, Dutch, French, and Chinese 1998, 2002 Official site at

SLIDE 10I242 - Fall 2011 What EAD Is An emerging encoding and structural standard for archival description –Data structure –Communication/interchange –Finding aid / archival description Based on principles of ISAD(G): General International Standard Archival Description, Second edition

SLIDE 11I242 - Fall 2011 What EAD Is Not Content standard Data value standard Archival management system

SLIDE 12I242 - Fall 2011 Principles of Record Description Respect des fonds –Provenance –Original order Hierarchical and symmetrical Inheritance of description

SLIDE 13I242 - Fall 2011 Archival Records Records are the by-products of people living and working as individuals, in organized groups, in families Records document people living and working People exist in social-professional contexts, in relation to others Records document these relations All records created by the same entity are described together (a fonds or collection) –Creators documented in detail –Many of the people documented in the records are referenced in the description Archival descriptions document interrelations among people and records (documents)

SLIDE 14I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia Source: J. Robert Oppenheimer Papers (LoC) Oppenheimer, J. Robert, Oppenheimer, J. Robert, Bethe, Hans Albrecht, Correspondence Born, Max, Correspondence Boyd, Julian P. (Julian Parks), Correspondence Bush, Vannevar, Correspondence Casals, Pablo, Correspondence Institute for Advanced Study (Princeton, N.J.) Los Alamos Scientific Laboratory EAD Elements

SLIDE 15I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia Source: Leonard Bernstein Collection (LoC) 1 Aaltonen, Erkki Abbado, Claudio […] EAD Elements

SLIDE 16I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia Biographical Sketch José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco , served in the Mexican army from He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, ; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, ; and Matamoros, […] In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […] EAD Elements

SLIDE 17I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia Chronology 1900 Born on Jan. 20 in Hastings, Minnesota Received baccalaureate from Princeton University, major in philosophy. […] 1965 Died on April 4. EAD Elements

SLIDE 18I242 - Fall 2011 The EAD DTD The EAD DTD is very complex and permits considerable flexibility in expressing the description and topics of the archival collection. The main parts are outlined on the following slides, but include: –A header, including basic descriptive info. –Optional frontmatter –The archival description We will describe only a few of the top-level tags

SLIDE 19I242 - Fall 2011 Major Sections and DTD Defs EAD – EADHeader: – –FILEDESC

SLIDE 20I242 - Fall 2011 Major Sections and DTD Defs The Archival Description: – The Descriptive Identification –

SLIDE 21I242 - Fall 2011 Example EAD Record (Hub) GB 0133 TAB Tabley Muniments John Rylands University Library of Manchester 150 Deansgate Manchester... (Parts removed )… University of Manchester, John Rylands University Library of Manchester <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133"> GB 0133 TAB Tabley Muniments 19th century 1.24 cu.m Warren, family, of Tabley, Cheshire Warren, John Byrne Leicester, , 3rd Baron de Tabley, poet

SLIDE 22I242 - Fall 2011 Example EAD Record (Hub) Administrative/Biographical History The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley ( ), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston ( ) and William Lancaster ( ), but latterly under his own name. His early verse included Praeterita (1863), Eclogues and Monodramas (1864), Studies in Verse (1865), Philoctetes (1866), and Orestes (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)…

SLIDE 23I242 - Fall 2011 Example EAD Record (Hub) Scope and Content The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates. Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to "Coins", "Botany", "Poetry", "Literary", "Financial" and bookplates. Preliminary survey list. There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection. The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester.

SLIDE 24I242 - Fall 2011 Example EAD Record (Hub) Index terms Tabley Inferior Cheshire SJ7378 Benson Arthur Christopher Bridges Robert Seymour Duff Sir Mountstuart Elphinstone Grant Knight Gosse Sir Edmund William Knight Milnes Richard Monckton st Baron Houghton Bookplates Botany Numismatics Poetry Modern 19th century

SLIDE 25I242 - Fall 2011 EAC-CPF EAD is now complemented by “EAC”, the “Encoded Archival Context” standard. It is another XML-based standard, used for descriptions of record creators: corporate bodies, persons and families (CPF). It was developed as part of an international effort, with hopes of being able to link and share information among archives having materials related to particular corporate bodies, persons and families

SLIDE 26I242 - Fall 2011 Transformation of EAD to EAC The EAD archival records containing many names are transformed using a complex XSLT transform to many EAC-CPF records –one for each unique name in the EAD record
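The project performs this step with XSLT. Purely as an illustration of the idea (one output record per unique name), here is a rough Python sketch. The sample EAD fragment, the function names `extract_cpf_names` and `to_eac_cpf`, and the skeletal output structure are assumptions made for this example; they are not the project's stylesheet, and the output omits namespaces and most of what a real EAC-CPF record contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment (real finding aids are far larger).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <origination><persname>Oppenheimer, J. Robert</persname></origination>
    </did>
    <controlaccess>
      <persname>Bush, Vannevar</persname>
      <persname>Bush, Vannevar</persname>
      <corpname>Los Alamos Scientific Laboratory</corpname>
    </controlaccess>
  </archdesc>
</ead>
"""

# EAD name elements mapped to EAC-CPF entity types.
ENTITY_TYPES = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_cpf_names(ead_xml):
    """Return one (entity_type, name) pair per unique name, in document order."""
    root = ET.fromstring(ead_xml)
    seen, result = set(), []
    for elem in root.iter():
        if elem.tag in ENTITY_TYPES and elem.text:
            key = (ENTITY_TYPES[elem.tag], elem.text.strip())
            if key not in seen:
                seen.add(key)
                result.append(key)
    return result


def to_eac_cpf(entity_type, name):
    """Build a skeletal EAC-CPF-style record (namespaces and control section omitted)."""
    record = ET.Element("eac-cpf")
    identity = ET.SubElement(ET.SubElement(record, "cpfDescription"), "identity")
    ET.SubElement(identity, "entityType").text = entity_type
    entry = ET.SubElement(identity, "nameEntry")
    ET.SubElement(entry, "part").text = name
    return record


# One record per unique name: the duplicate "Bush, Vannevar" yields a single record.
records = [to_eac_cpf(t, n) for t, n in extract_cpf_names(SAMPLE_EAD)]
```

The sketch covers both the creator (under `origination`) and the referenced names (under `controlaccess`), mirroring the distinction between creators and referenced CPF names made earlier in the deck.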

SLIDE 27I242 - Fall 2011 EAC-CPF Data Examples Examples…

SLIDE 28I242 - Fall 2011 Library and Archive Authority Control Library (or bibliographic) authority control is almost exclusively about the control of names Archival authority control involves biographical-historical description of the CPF entity –Descriptions based on controlled vocabularies, for example, occupations, place of birth and death –But also biographical-historical description Prose Chronological list Archival authority control provides context for understanding records, the context of their creation, the provenance

SLIDE 29I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia person Oppenheimer, J. Robert, AACR2 Oppenheimer, J. Robert (Julius Robert), VIAF Oppenheimer, Julius Robert, VIAF Oppenheimer, Robert VIAF Ou-pẽn-hai-mo, VIAF EAC-CPF example data

SLIDE 30I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia 1904, Apr , Feb. 18 Science--Societies, etc. Male Physicists.

SLIDE 31I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia 1904, Apr. 22 New York, N.Y. Born, New York, N.Y Los Alamos, N. Mex. Director, Los Alamos Scientific Laboratory, Los Alamos, N. Mex (1) Denied security clearance […] (2) Published Science and the Common Understanding […] 1967, Feb. 18 Princeton, N.J. Died, Princeton, N.J.

SLIDE 32I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia <cpfRelation xmlns:xlink=" xlink:type="simple" xlink:role=" xlink:arcrole="correspondedWith"> Bush, Vannevar, recordId: DLC.ms r007

SLIDE 33I242 - Fall 2011 Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia <resourceRelation xmlns:xlink=" xlink:arcrole="creatorOf" xlink:role="archivalRecords” xlink:type="simple” xlink:href=" J. Robert Oppenheimer Papers, (bulk ) Papers (bulk ) MSS35188 Oppenheimer, J. Robert, Manuscript Division. Library of Congress Physicist and director of the Institute for Advanced Study, Princeton, New Jersey. [...] Topics include theoretical physics, development of the atomic bomb, the relationship between government and science, nuclear energy, security, and national loyalty.

SLIDE 34I242 - Fall 2011 Authority Control Identifying creator entities and referenced entities (correspondents, etc.) Recording name or names used by and for them Rule-based heading or entry formation and control

SLIDE 35I242 - Fall 2011 Authority Control But: different EAD records may use different names for the same people –Identifying creator entities and referenced entities (correspondents, etc.) –Recording name or names used by and for them Some records follow rules like AACR2 for names; others don’t.

SLIDE 36I242 - Fall 2011 The Problem Proliferation of the forms of names –Different names for the same person –Different people with the same names Examples –from Books in Print (semi-controlled but not consistent) –ERIC author index (not controlled)

SLIDE 37I242 - Fall 2011 Goethe …etc…

SLIDE 38I242 - Fall 2011 John Muir

SLIDE 40I242 - Fall 2011 EAC-CPF Encoded Archival Context-Corporate bodies, Persons, Families An international communication standard for archival authority control Based on International Council on Archives, International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR(CPF)) SAA Standards Committee, Technical Subcommittee on Encoded Archival Context

SLIDE 46I242 - Fall 2011 Year One Results-Extraction EAC-CPF records extracted –LoC: 43,702 from 1,159 finding aids –OAC: 91,811 from ~15,400 –NWDA: 22,609 from 5,160 –VH: not yet –Total 158,122

SLIDE 47I242 - Fall 2011 Methods and Processing Extract EAC-CPF records from existing EAD-encoded archival descriptions –Extracting both creators and referenced CPF names Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF) –Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) Create a prototype historical resource and access system –Historical data and social-professional networks –Links to archive, library, and museum resources (by and about)

SLIDE 48I242 - Fall 2011 Merging EAC-CPF Records [Diagram; labels: Connect exactly matching records; Connect records using name authority information; Merge; Cheshire Search; LCNAF Repository; ULAN Repository]

SLIDE 50I242 - Fall 2011 Connect Exact Matches The EAC-CPF records provide the names without having to parse texts, etc. This allows us to use simple methods like exact matching –Assume identical name entries mean the same person/corporate body/family –Enter the full names and record IDs into a database and flag IDs with the same names for merging
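A minimal sketch of the exact-match step, assuming an in-memory dictionary in place of the database mentioned on the slide. The record IDs and the function name `flag_exact_matches` are made up for illustration.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name string; return groups with 2+ IDs."""
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical (record ID, name entry) pairs from different repositories.
sample = [
    ("loc-001", "Bush, Vannevar"),
    ("oac-417", "Bush, Vannevar"),
    ("nwda-22", "A. C. Peters & Bro."),
    ("oac-900", "A. C.Peters & Bro."),
]
flagged = flag_exact_matches(sample)
```

Only the two "Bush, Vannevar" records are flagged here; the two Peters entries differ by a single space and so are not connected by this step.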

SLIDE 52I242 - Fall 2011 Search Authority Files For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching) –Search both the “authoritative” and “non-authoritative” forms –Consider any name matching a non-authoritative form to be a candidate match for the authoritative form –Flag EAC records that match the same authority record as potential matches
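The flagging logic alone can be sketched as follows. This does not model Cheshire or VIAF: the real search is probabilistic and Boolean, whereas this stand-in uses exact lookup against a tiny in-memory table. The `AUTHORITY` table, its IDs, and the function names are assumptions for the example.

```python
# Hypothetical authority file: authority ID -> (authoritative form, variant forms).
AUTHORITY = {
    "auth-1": (
        "Oppenheimer, J. Robert",
        ["Oppenheimer, Julius Robert", "Oppenheimer, Robert"],
    ),
    "auth-2": ("Bush, Vannevar", []),
}


def build_index(authority):
    """Map every name form (authoritative or not) to its authority ID."""
    index = {}
    for auth_id, (preferred, variants) in authority.items():
        for form in [preferred, *variants]:
            index[form] = auth_id
    return index


def flag_authority_matches(records, authority):
    """Return authority ID -> record IDs for authority records hit by 2+ EAC records."""
    index = build_index(authority)
    hits = {}
    for record_id, name in records:
        auth_id = index.get(name)
        if auth_id is not None:
            hits.setdefault(auth_id, []).append(record_id)
    return {auth_id: ids for auth_id, ids in hits.items() if len(ids) > 1}


# Hypothetical EAC-CPF (record ID, name) pairs.
eac_records = [
    ("r1", "Oppenheimer, J. Robert"),
    ("r2", "Oppenheimer, Robert"),
    ("r3", "Bush, Vannevar"),
    ("r4", "Unknown, Person"),
]
potential_matches = flag_authority_matches(eac_records, AUTHORITY)
```

Records `r1` and `r2` carry different name strings but resolve to the same authority record, so they are flagged as a potential match even though exact matching would miss them.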

SLIDE 54I242 - Fall 2011 Merge Flagged Records For all of the exact matches and authority matches –Use the Authoritative form of the name –Combine data from each match into a single EAC-CPF record –Retain all source record IDs and information Finally, output the merged EAC-CPF records
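As a rough sketch of the merge step, the function below combines flagged records under the authoritative name while retaining every source ID. The flat dictionaries are hypothetical stand-ins for EAC-CPF records, and the field names (`id`, `name`, `relations`) are assumptions for the example.

```python
def merge_records(authoritative_name, records):
    """Combine flagged records into one, keeping every source ID and name form."""
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "sources": [],
        "relations": [],
    }
    for rec in records:
        merged["sources"].append(rec["id"])
        name = rec["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for rel in rec.get("relations", []):
            if rel not in merged["relations"]:
                merged["relations"].append(rel)
    return merged


# Hypothetical flagged records for one entity.
flagged_records = [
    {"id": "r1", "name": "Oppenheimer, J. Robert", "relations": ["Bush, Vannevar"]},
    {"id": "r2", "name": "Oppenheimer, Robert", "relations": ["Bush, Vannevar", "Born, Max"]},
]
merged = merge_records("Oppenheimer, J. Robert", flagged_records)
```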

SLIDE 55I242 - Fall 2011 Inputs to SNAC merging LoC: 43,702 EAC-CPF records derived from 1159 finding aids OAC: 91,811 EAC-CPF records derived from ~15,400 finding aids NWDA: 22,609 EAC-CPF records derived from 5,568 finding aids Result: 123,920 “unique” names

SLIDE 56I242 - Fall 2011 Another view of the numbers… Person names merged from Person records Institutions merged from Institution records 1669 Families merged from 2263 Family records

SLIDE 57I242 - Fall 2011 But… Exact merging assumes that archives are following LC cataloging practice in their EAD records –There are some problems with this assumption

SLIDE 58I242 - Fall 2011 Some failures for merging… Different abbreviations: –A. & G. Carisch & C. –A. & G. Carisch & Co. And spacing issues: –A. C. Peters & Bro. –A. C. Peters & Brother. –A. C. Peters. (??) –A. C.Peters & Bro. Completeness and alternate rules –Tabb, John B. (John Banister), –Tabb, John Banister, Also differing transliterations for non-Latin scripts
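To make these failure classes concrete, here is a small sketch of a light normalization pass applied to the names on this slide. The `normalize` function and its rules are an assumption for illustration, not the project's method. It suggests that spacing differences are easy to repair, while abbreviation and completeness differences are not.

```python
import re


def normalize(name):
    """Lowercase, turn periods into spaces, collapse whitespace, drop a trailing comma."""
    name = name.lower().replace(".", " ")
    name = re.sub(r"\s+", " ", name).strip()
    return name.rstrip(",").strip()


# Spacing issue: repaired by normalization.
same_after_spacing_fix = normalize("A. C. Peters & Bro.") == normalize("A. C.Peters & Bro.")

# Abbreviation and completeness issues: still unequal after normalization.
abbreviation_still_differs = normalize("A. & G. Carisch & C.") != normalize("A. & G. Carisch & Co.")
completeness_still_differs = normalize("Tabb, John B. (John Banister),") != normalize("Tabb, John Banister,")
```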

SLIDE 59I242 - Fall 2011 Testing new merging methods Work done in conjunction with SNAC for an I School Master’s project called Biograph –Krishna Janakiraman and Sean Marimpietri Using SNAC and merging with Freebase and IMDb

SLIDE 60I242 - Fall 2011 Einstein, Albert, Einstein, Albert. Ainshutain, A Aiyinsitan Einstein, A. Albert Einstein Krishna Janakiraman and Sean Marimpietri - Biograph

SLIDE 61I242 - Fall 2011 Our approach Learn binary classifiers over varying names and existence dates Perturb existing information to generate additional samples within specific error levels Krishna Janakiraman and Sean Marimpietri - Biograph

SLIDE 62I242 - Fall 2011 [Pipeline diagram] Features: names (shingle language model); birth and death dates; names (string distance metrics) TRAIN: learn decision tree classifiers PREDICT: link records Krishna Janakiraman and Sean Marimpietri - Biograph
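A sketch of what a feature vector for one candidate pair of records might look like. The slide does not name the string distance metrics used, so `difflib.SequenceMatcher` from the Python standard library stands in here as an assumption. The record fields and the function name `pair_features` are likewise hypothetical, and the classifier itself is not shown.

```python
from difflib import SequenceMatcher


def pair_features(a, b):
    """Feature vector for a candidate pair: name similarity plus date agreement."""
    similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    birth_match = int(a.get("birth") is not None and a.get("birth") == b.get("birth"))
    death_match = int(a.get("death") is not None and a.get("death") == b.get("death"))
    return [similarity, birth_match, death_match]


# Hypothetical records for the same person under two name forms.
rec_a = {"name": "Einstein, Albert", "birth": 1879, "death": 1955}
rec_b = {"name": "Einstein, A.", "birth": 1879, "death": 1955}
features = pair_features(rec_a, rec_b)
```

Vectors like this, labeled as match or non-match, would be the training input for the decision tree classifiers in the diagram.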

SLIDE 63I242 - Fall 2011 Shingle Language Model for names Name : Einstein Albert Shingle sequence : ein, ins, nst, ste, tei, ein …, ert Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein Krishna Janakiraman and Sean Marimpietri - Biograph
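The shingle idea can be sketched as below, assuming character trigrams taken over the lowercased name with non-letters removed, and a simple bigram model over consecutive shingles. The slide does not spell out how the language model is estimated, so the maximum-likelihood estimate and the function names here are assumptions.

```python
from collections import Counter, defaultdict


def shingles(name, k=3):
    """Character k-grams of the lowercased name with non-letters removed."""
    text = "".join(ch for ch in name.lower() if ch.isalpha())
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def train_bigram_model(names, k=3):
    """Count shingle-to-shingle transitions across a list of name variants."""
    transitions = defaultdict(Counter)
    for name in names:
        seq = shingles(name, k)
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions


def transition_prob(model, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for unseen contexts."""
    if prev not in model:
        return 0.0
    total = sum(model[prev].values())
    return model[prev][nxt] / total if total else 0.0


# Train on a few variant forms of the same name (taken from the earlier slide).
model = train_bigram_model(["Einstein, Albert", "Einstein, A.", "Albert Einstein"])
p_nst_after_ins = transition_prob(model, "ins", "nst")
```

With these variants every occurrence of `ins` is followed by `nst`, which is the kind of high-probability transition the slide describes.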

SLIDE 64I242 - Fall 2011 Shingle Language Model for names Name 1: Einstein Albert Name 2: Ainshtain Albert Name 3: Albert Einstein [Figure: shingle sequences for the three names, e.g. ein, ins, nst, ste, tei, ein for Einstein and ain, ins, nsh, sht, hta, tai, ain for Ainshtain] Krishna Janakiraman and Sean Marimpietri - Biograph

SLIDE 65I242 - Fall 2011 Example Decision Tree for Von Neumann [Figure: decision tree; node labels: Date, String Distance] Krishna Janakiraman and Sean Marimpietri - Biograph

SLIDE 66I242 - Fall 2011 Albert Einstein: TP 78, FP 11, FN 25, TN 145; TPR 75.7%, FPR 7% George W Bush: TP 39, FP 9, FN 6, TN 60; TPR 86.6%, FPR 13% Von Neumann: TP 182, FP 14, FN 27, TN 301; TPR 75.7%, FPR 7% Corpus Average: TPR 72.7%, FPR 17% Krishna Janakiraman and Sean Marimpietri - Biograph
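For reference, the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN). A small sketch applying these standard definitions to the Albert Einstein counts above (the function name `rates` is only illustrative):

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) from confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn)


# Counts shown on the slide for Albert Einstein.
tpr, fpr = rates(tp=78, fp=11, fn=25, tn=145)
print(f"TPR: {tpr:.1%}  FPR: {fpr:.1%}")
```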

SLIDE 67I242 - Fall 2011 How many did we link? […],300 records, thresh = […] […] records, thresh = 0.9

SLIDE 68I242 - Fall 2011 Merging Conclusions There will not be a single merging method, but a staged set of approaches that take us from the simplest exact matches to (we hope) reliable identification of variant forms of a name when corroborated by contextual information (dates, etc.)

SLIDE 70I242 - Fall 2011 SNAC Prototype Developed by Brian Tingle of the California Digital Library Uses XTF for management and search Social network visualization based on links in EAC-CPF records –E.g.: Correspondents, associated persons, associated corporate bodies, etc. Demo (or slides)
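The network behind such a visualization can be thought of as a graph built from the relation links in the records. The sketch below is not the prototype's code (which, per the slide, is built on XTF). The triples are hypothetical, loosely modeled on the `correspondedWith` relation shown in the example data, and `build_network` is an illustrative name.

```python
from collections import defaultdict


def build_network(relations):
    """Build an undirected adjacency map from (source, relation type, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in relations:
        graph[source].add(target)
        graph[target].add(source)
    return graph


# Hypothetical relation triples extracted from EAC-CPF records.
relations = [
    ("Oppenheimer, J. Robert", "correspondedWith", "Bush, Vannevar"),
    ("Oppenheimer, J. Robert", "correspondedWith", "Born, Max"),
    ("Oppenheimer, J. Robert", "correspondedWith", "Bethe, Hans Albrecht"),
]
network = build_network(relations)
neighbours = sorted(network["Oppenheimer, J. Robert"])
```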

SLIDE 81I242 - Fall 2011 For More Information (Project website) rch (public prototype) rch