Matching Names in Parallel
T. Hickey
Access, October
Virtual International Authority File
- Link national authority records
- Build on their authority work
- Move towards universal bibliographic control
- Allow national or regional variations in authorized forms to co-exist
- Support needs for variations in preferred language, script, and spelling
- 10 million WorldCat records in non-English metadata
Joint VIAF Project
Matching Variations
In the LCNAF and PND authority files:
- Same name, same person
- Same name, different people
- Different names, same person
- Missing person in one file
Two Different People – One Name
Adams, Mike:
- PND: a golfer
- LCNAF: author of a Beatles collector's guide
One Person – Two Names
- LCNAF: Morel, Pierre
- PND: Morellus, Petrus
Enhancing the Authorities
[Diagram: a bibliographic record and a derived authority record combine into an enhanced authority record]
Strong Matching Attributes
- A work (title) in common
- Common control numbers (ISBN, ISSN, or LCCN)
- Exact birth and death years
- Joint authors
- Name as subject
Weaker Attributes
- Only one of birth/death date(s) (allows some variation)
- Subject area of works (two levels)
- Format (books, films, musical scores, etc.)
- Language
- Publisher
- Partial title match
- Date of publication
- Country
- Role (author, illustrator, composer, etc.)
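One way to read the two slides above is as a weighted score over shared attributes. A minimal sketch, assuming hypothetical attribute names and purely illustrative weights (the deck does not give the VIAF scoring function):

```python
# Illustrative weights only: strong attributes dominate weaker ones.
STRONG = {"shared_title": 10, "shared_control_number": 10,
          "exact_birth_death": 8, "joint_author": 6, "name_as_subject": 6}
WEAK = {"one_life_date": 2, "partial_title": 2, "subject_area": 1,
        "format": 1, "language": 1, "publisher": 1,
        "pub_date": 1, "country": 1, "role": 1}

def score_pair(evidence):
    """Sum the weights of every attribute two records share."""
    weights = {**STRONG, **WEAK}
    return sum(weights[attr] for attr in evidence if attr in weights)

# A pair sharing a title and exact life dates outscores one that
# agrees only on weaker attributes.
strong_pair = score_pair({"shared_title", "exact_birth_death"})
weak_pair = score_pair({"language", "country", "format"})
```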
Computing It
Standard approach:
- Generate keys and data
- Load information into a database
- Index it
- Extract fields needed
Map/reduce approach:
- Split the database up
- Run parallel jobs
- Bring information together via map/reduce
- Assemble information in stages
Map/Reduce
Two stages:
Map:
- Read in source file (e.g. MARC-21)
- Write out key + data
Reduce:
- Read in the array of data for each unique key
- Write out key + data
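The two stages above can be sketched as a minimal single-machine map/reduce, where the map emits (key, data) pairs and the reduce sees the array of data for each unique key. The toy job (counting records per surname) is an assumption for illustration:

```python
# Minimal map/reduce skeleton: map emits key/data pairs; a shuffle
# groups data by key; reduce runs once per unique key.
from collections import defaultdict

def map_stage(records, map_fn):
    pairs = []
    for rec in records:
        pairs.extend(map_fn(rec))
    return pairs

def reduce_stage(pairs, reduce_fn):
    grouped = defaultdict(list)
    for key, data in pairs:          # shuffle: gather data per key
        grouped[key].append(data)
    return {key: reduce_fn(key, vals) for key, vals in grouped.items()}

records = ["Morel, Pierre", "Morellus, Petrus", "Adams, Mike"]
pairs = map_stage(records, lambda r: [(r.split(",")[0], 1)])
counts = reduce_stage(pairs, lambda key, vals: sum(vals))
```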
Overview of MapReduce
[Figure omitted. Source: Dean & Ghemawat (Google)]
Our Implementation
- Written in Python
- Uses ssh and XML-RPC for control and communication
- Map/reduce seems to add roughly 10% overhead
- An earlier implementation ran on a 48-CPU cluster
- The current VIAF cluster has 12 CPUs across 4 nodes
- Runs Linux and 64-bit Python
VIAF Matching Code
- 17 modules
- 1,100 lines of code
- Plus 600 lines of configuration
- 2,755 lines of tables embedded in the code
VIAF Data Flow
[Diagram: parallel pipelines extract data from the LC Authority file, the LC catalog, the PND Authority file, and the PND catalog. Steps include: get changed authority IDs; build name:ID maps; map authorities (authority ID to bib ID); build buckets keyed surname: forename, date; eliminate forename/date conflicts from buckets; identify potential pairs (pair ID: bib/auth IDs); build and select compare data (pair ID: compare data); compare, producing pair ID: scores.]
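The "build buckets: surname → forename, date" step in the data flow limits comparisons to names that share a surname key. A minimal sketch, where the normalization and record layout are assumptions rather than the VIAF code:

```python
# Bucket authority headings by a normalized surname key; only names
# landing in the same bucket become candidate pairs for comparison.
from collections import defaultdict

def build_buckets(headings):
    buckets = defaultdict(list)
    for auth_id, heading in headings:
        surname, _, rest = heading.partition(",")
        buckets[surname.strip().lower()].append((auth_id, rest.strip()))
    return buckets

headings = [("lc1", "Adams, Mike"), ("pnd1", "Adams, Mike"),
            ("lc2", "Morel, Pierre")]
buckets = build_buckets(headings)
# The two "Adams, Mike" records are compared with each other,
# but neither is ever compared against "Morel, Pierre".
```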
WorldCat Identities
Bring together all of WorldCat's information about people:
- Name(s)
- Works by and about
- Subjects
- Dates
- Fiction/non-fiction
- Roles
- Co-authors
Add links:
- Wikipedia
- Authority files
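Bringing those facets together for one person can be pictured as folding citations into a single identity record. A sketch assuming hypothetical field names and citation structure (the slide lists the facets but not the data model):

```python
# Fold a list of citation dicts into one 'identity' record with the
# facets listed above; field names are illustrative assumptions.
def build_identity(name, citations):
    identity = {"names": {name}, "works_by": [], "works_about": [],
                "subjects": set(), "roles": set()}
    for cite in citations:
        target = "works_about" if cite.get("about") else "works_by"
        identity[target].append(cite["title"])
        identity["subjects"].update(cite.get("subjects", ()))
        identity["roles"].update(cite.get("roles", ()))
    return identity

ident = build_identity("Hickey, T.", [
    {"title": "Matching names in parallel",
     "roles": {"author"}, "subjects": {"authority control"}}])
```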
Sample Identity
Statistics
- Nearly 19 million different 'identities' in WorldCat
- 80 million (nominally) controlled headings
- The WorldCat Identities code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)
Identities Data Flow
[Diagram: a four-stage pipeline. Inputs include WorldCat, FRBR, Audience, Cover Art, Authorities, and Wikipedia; the stages produce NameInfo and Citation data that are merged into Identities.]
Identities Stage 1: Extract Data from WorldCat
- Input: WorldCat (MARC-21)
- Map output: key NameKey, data WorkID
- Reduce output: key WorkID, data NameKey
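Stage 1 as described is a key inversion: the map emits NameKey → WorkID pairs, and the reduce re-emits them keyed by WorkID. A sketch, with a stand-in for the MARC-21 parsing:

```python
# Stage-1 sketch: map keys records by name, reduce inverts to key by work.
from collections import defaultdict

def map_records(bib_records):
    # Stand-in for parsing MARC-21: each record yields (NameKey, WorkID).
    for work_id, name_key in bib_records:
        yield name_key, work_id

def reduce_by_name(pairs):
    by_name = defaultdict(list)
    for name_key, work_id in pairs:
        by_name[name_key].append(work_id)
    # Reduce output keyed by WorkID, as on the slide.
    return [(work_id, name_key)
            for name_key, works in by_name.items() for work_id in works]

bibs = [("w1", "hickey, t"), ("w2", "hickey, t")]
stage1 = reduce_by_name(map_records(bibs))
```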
Identities Stage 2: Extract Data from Authorities
- Input: NACO Authorities file (MARC-21)
- Map output: key NameKey, data XTos and XFroms
- Reduce output: key NameKey
Identities Stage 3: Connect Citations with Names
- Input: Stage 1 output (WorkID's NameKeys)
- Map output: key NameKey
Identities Stage 4: Create Identities
- Input: authority info from Stage 2; merged name info and merged citations from Stage 3
- Map output: pass-through
- Reduce output: key Pnkey
Schedules
- Identities: up this year?
- VIAF: reload and rematch this year; public service up early 2007
Conclusions
- Our merged files (e.g. WorldCat) are really quite large
- More processing power opens up new ways of manipulating and looking at our data
- Parallel processing is the only way to obtain the cycles needed
- Map/reduce is an attractive way to do parallel processing:
  - Forces decomposition
  - Scales well
  - Opens up new possibilities
Thank you
T. Hickey
VIAF.org
Access, October