Published by Jordan McCormick. Modified over 9 years ago.
1
Matching Names in Parallel
T. Hickey
Access 2006, October 2006
2
Virtual International Authority File
- Link national authority records
- Build on their authority work
- Move towards universal bibliographic control
- Allow national or regional variations in authorized forms to co-exist
- Support needs for variations in preferred language, script, and spelling
- 10 million WorldCat records in non-English metadata
3
Joint VIAF Project
4
Matching Variations
In the LCNAF and PND authority files:
- Same name, same person
- Same name, different people
- Different names, same person
- Missing person in one file
5
Two Different People – One Name
Adams, Mike
- PND: a golfer
- LCNAF: author of a Beatles collector's guide
Same name, different people
6
One Person – Two Names
- LCNAF: Morel, Pierre
- PND: Morellus, Petrus
Same person, different names
7
Enhancing the Authorities
[Diagram. Labels recovered from the slide: Bibliographic Record Derived Authority Record Enhanced Authority]
8
Strong Matching Attributes
- A work (title) in common
- Common control numbers (ISBN, ISSN, or LCCN)
- Exact birth and death year
- Joint authors
- Name as subject
9
Weaker Attributes
- Only one of birth/death date(s) (allows some variation)
- Subject area of works (two levels)
- Format (books, films, musical scores, etc.)
- Language
- Publisher
- Partial title match
- Date of publication
- Country
- Role (author, illustrator, composer, etc.)
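One way to picture how strong and weaker attributes could combine is a weighted score over the attributes two records share. The slides give no weights or thresholds, so the attribute names, `STRONG_WEIGHT`, `WEAK_WEIGHT`, and the threshold below are all assumptions made for illustration, not the VIAF code.

```python
# Hypothetical attribute names and weights: the slides rank attributes as
# strong or weaker but give no numbers, so every value here is an assumption.
STRONG = {
    "title", "control_number", "birth_and_death_year",
    "joint_author", "name_as_subject",
}
WEAK = {
    "one_date", "subject_area", "format", "language", "publisher",
    "partial_title", "publication_date", "country", "role",
}
STRONG_WEIGHT = 5
WEAK_WEIGHT = 1


def match_score(shared_attributes):
    """Sum the weights of the attributes two records have in common."""
    score = 0
    for attribute in shared_attributes:
        if attribute in STRONG:
            score += STRONG_WEIGHT
        elif attribute in WEAK:
            score += WEAK_WEIGHT
        else:
            raise ValueError(f"unknown attribute: {attribute}")
    return score


def is_match(shared_attributes, threshold=5):
    """Accept a pair once its score reaches the (assumed) threshold."""
    return match_score(shared_attributes) >= threshold


strong_pair = match_score({"title", "language"})
weak_pair = match_score({"language", "publisher", "country"})
```

With these assumed weights a single strong attribute is enough to accept a pair, while a handful of weak ones is not. A real system would tune both the weights and the threshold against known matches.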
10
Computing It
Standard approach
- Generate keys and data
- Load information into a database
- Index it
- Extract fields needed
Map/Reduce approach
- Split the database up
- Run parallel jobs
- Bring information together via map/reduce
- Assemble information in stages
11
Map/Reduce
Two stages
- Map
  - Read in source file (e.g. MARC-21)
  - Write out key + data
- Reduce
  - Read in array of data for each unique key
  - Write out key + data
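The two stages can be sketched in a few lines of single-process Python. This is only an illustration of the pattern: the function names and the toy records are assumptions, and it reads tuples rather than MARC-21 and does no parallel work.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then run reducer per key."""
    groups = defaultdict(list)
    for record in records:
        for key, data in mapper(record):      # map: write out key + data
            groups[key].append(data)
    output = []
    for key in sorted(groups):                # reduce: one call per unique key
        output.extend(reducer(key, groups[key]))
    return output


def surname_mapper(record):
    """Hypothetical mapper: key each record id by its lowercased surname."""
    record_id, heading = record
    surname = heading.split(",")[0].strip().lower()
    yield surname, record_id


def collect_reducer(key, values):
    """Hypothetical reducer: emit the key with its sorted list of record ids."""
    yield key, sorted(values)


records = [
    ("lc-1", "Adams, Mike"),
    ("pnd-7", "Adams, Mike"),
    ("lc-2", "Morel, Pierre"),
]
result = map_reduce(records, surname_mapper, collect_reducer)
```

In a parallel setting the grouping step is what moves data between machines. The mapper and reducer stay the same, which is why the pattern forces a clean decomposition.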
12
Overview of MapReduce
Source: Dean & Ghemawat (Google)
13
Our Implementation
- Written in Python
- Uses ssh and XML-RPC for control and communication
- Map/Reduce seems to add ~10% overhead
- Ran an earlier implementation on a 48-CPU cluster
- Current VIAF cluster is a 12-CPU cluster on 4 nodes
- Running Linux and 64-bit Python
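As a rough idea of what XML-RPC control might look like, the sketch below has a controller hand splits of work to a worker over the Python standard library's XML-RPC modules. It is an assumption-laden illustration: the task `count_headings` is hypothetical, the worker runs in a local thread instead of being started on another node over ssh, and it uses the Python 3 module names rather than those available in 2006.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def count_headings(headings):
    """Hypothetical worker task: count the headings in one split."""
    return len(headings)


# Worker side: listen on an ephemeral local port and expose the task.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(count_headings)
port = server.server_address[1]
worker = threading.Thread(target=server.serve_forever, daemon=True)
worker.start()

# Controller side: send each split to the worker and gather the replies.
splits = [["Adams, Mike", "Morel, Pierre"], ["Morellus, Petrus"]]
with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
    counts = [proxy.count_headings(split) for split in splits]

# Stop the worker loop and release the socket.
server.shutdown()
server.server_close()
```

On a real cluster each node would run its own server, and the controller would hold one proxy per node.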
14
VIAF Matching Code
- 17 modules
- 1,100 lines of code
- Plus 600 lines of configuration
- 2,755 lines of tables embedded in code
15
VIAF Data Flow
[Diagram. Labels recovered from the slide:]
- Inputs: LC Authority, LC Catalog, PND Authority, PND Catalog
- Steps: extract data; get changed IDs; map authorities; build name:id map; build buckets; eliminate forename, date conflicts from buckets; build compare data; identify compare data; select compare data; compare
- Data passed between steps: changed authority ids; authority id: bib id; name:id; surname: forename,date; potential pairs; id:tag, data; pair id:[bib/auth]id; pair id: compare data; pair id: scores
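The bucket steps in this flow (build buckets keyed by surname, eliminate forename and date conflicts, emit potential pairs) might look roughly like the sketch below. The record layout, the function names, and the rule that a missing date never conflicts are assumptions. The sketch also compares forenames exactly, which would miss cases such as Morel, Pierre and Morellus, Petrus, and it does not restrict pairs to records from different files.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(names):
    """Group (record_id, surname, forename, birth_year) tuples by surname."""
    buckets = defaultdict(list)
    for record_id, surname, forename, birth_year in names:
        buckets[surname.lower()].append((record_id, forename.lower(), birth_year))
    return buckets


def conflicts(a, b):
    """True when two bucket entries disagree on forename or birth year."""
    _, forename_a, year_a = a
    _, forename_b, year_b = b
    if forename_a != forename_b:
        return True
    # Assumption: a missing year (None) is compatible with any year.
    return year_a is not None and year_b is not None and year_a != year_b


def potential_pairs(buckets):
    """Return the id pairs within each bucket that survive the conflict check."""
    pairs = []
    for surname in sorted(buckets):
        for a, b in combinations(buckets[surname], 2):
            if not conflicts(a, b):
                pairs.append((a[0], b[0]))
    return pairs


names = [
    ("lc-1", "Adams", "Mike", 1950),
    ("pnd-1", "Adams", "Mike", None),
    ("lc-2", "Adams", "Mike", 1971),
    ("pnd-2", "Adams", "John", 1950),
]
pairs = potential_pairs(build_buckets(names))
```

Bucketing keeps the comparison cost manageable: only records sharing a surname are ever compared, and cheap conflicts are discarded before the more expensive attribute comparison.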
16
WorldCat Identities
- Bring together all of WorldCat’s information about people
  - Name(s)
  - Works by and about
  - Subjects
  - Dates
  - Fiction/non-fiction
  - Roles
  - Co-authors
- Add links
  - Wikipedia
  - Authority files
17
Sample Identity
18
Statistics
- Nearly 19 million different ‘identities’ in WorldCat
- 80 million (nominally) controlled headings
- The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)
19
Identities Data Flow
[Diagram. Labels recovered from the slide: Stage 1, Stage 2, Stage 3, Stage 4; NameInfo, Citation, Citations; Cover Art, WorldCat, FRBR, Audience, Authorities, Wikipedia, Identities]
20
Identities Stage 1
Extract Data From WorldCat
- Input: WorldCat (MARC-21)
- Map output: NameKey WorkID
- Reduce output: WorkID NameKey
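One possible reading of this stage is that the map step keys each work by the names found on it, and the reduce step re-keys the result by WorkID so a later stage can join on it. The sketch below follows that reading only. The `name_key` normalization, the record layout, and the function names are assumptions, and plain tuples stand in for MARC-21 records.

```python
from collections import defaultdict


def name_key(heading):
    """Hypothetical NameKey: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in heading.lower() if ch.isalnum() or ch == " ")
    return " ".join(kept.split())


def stage1_map(record):
    """Emit (NameKey, WorkID) for every name heading on a work."""
    work_id, headings = record
    for heading in headings:
        yield name_key(heading), work_id


def stage1_reduce(key, work_ids):
    """Emit (WorkID, NameKey) once per distinct work for this name."""
    for work_id in sorted(set(work_ids)):
        yield work_id, key


records = [
    ("w1", ["Adams, Mike", "Morel, Pierre"]),
    ("w2", ["Adams, Mike."]),
]

# Group map output by key, then reduce each group in key order.
grouped = defaultdict(list)
for record in records:
    for key, work_id in stage1_map(record):
        grouped[key].append(work_id)

stage1_output = []
for key in sorted(grouped):
    stage1_output.extend(stage1_reduce(key, grouped[key]))
```

Normalizing headings into a key is what lets "Adams, Mike" and "Adams, Mike." land in the same group.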
21
Identities Stage 2
Extract Data From Authorities
- Input: NACO Authorities file (MARC-21)
- Map output: NameKey XTos XFroms
- Reduce output: NameKey
22
Identities Stage 3
Connect Citations with Names
- Input: Stage 1 output WorkID’s NameKey
- Map output: NameKey
23
Identities Stage 4
Create Identities
- Input
  - Authority info from stage 2
  - Merged name info from stage 3
  - Merged citations from stage 3
- Map output: pass through
- Reduce output: Pnkey
24
Schedules
- Identities
  - Up this year?
- VIAF
  - Reload, rematch this year
  - Public service up early 2007
25
Conclusions
- Our merged files (e.g. WorldCat) are really quite large
- More processing power opens up new ways of manipulating and looking at our data
- Parallel processing is the only way to obtain the cycles needed
- Map-Reduce is an attractive way to do parallel processing
  - Forces decomposition
  - Scales well
  - Opens up new possibilities
26
Thank you
T. Hickey
VIAF.org
http://errol.oclc.org/laf/n82-54463.html
Access 2006, October 2006