Matching names in parallel. T. Hickey. Access 2006, October 2006.

Virtual International Authority File
- Link national authority records
- Build on their authority work
- Move towards universal bibliographic control
  - Allow national or regional variations in authorized forms to co-exist
  - Support needs for variations in preferred language, script, and spelling
  - 10 million WorldCat records in non-English metadata

Joint VIAF Project

Matching Variations
In the LCNAF and PND authority files:
- Same name, same person
- Same name, different people
- Different names, same person
- Missing person in one file

Two Different People – One Name
Adams, Mike
- PND: a golfer
- LCNAF: author of a Beatles collector's guide
Same name, different people

One Person – Two Names
- LCNAF: Morel, Pierre
- PND: Morellus, Petrus
Same person, different names

Enhancing the Authorities
[Diagram labels: Bibliographic Record; Derived Authority; Authority Record; Enhanced Authority]

Strong Matching Attributes
- A work (title) in common
- Common control numbers (ISBN, ISSN, or LCCN)
- Exact birth and death year
- Joint authors
- Name as subject

Weaker Attributes
- Only one of birth/death date(s) (allows some variation)
- Subject area of works (two levels)
- Format (books, films, musical scores, etc.)
- Language
- Publisher
- Partial title match
- Date of publication
- Country
- Role (author, illustrator, composer, etc.)
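One way to read the two attribute lists is as evidence of different weight. The sketch below is an illustration only, not the VIAF matching code: the slides give no numbers, so the weights, the threshold, and the attribute names (`title`, `control_number`, and so on) are all assumptions made for the example.

```python
# Hypothetical weights and names: the slides rank attributes as strong
# or weak but do not say how they are scored.
STRONG = {"title", "control_number", "birth_death_years",
          "joint_author", "name_as_subject"}
WEAK = {"one_date", "subject_area", "format", "language", "publisher",
        "partial_title", "publication_date", "country", "role"}
STRONG_WEIGHT = 3.0  # assumed value
WEAK_WEIGHT = 0.5    # assumed value


def match_score(shared_attributes):
    """Sum the weights of the attributes two records have in common."""
    score = 0.0
    for attribute in shared_attributes:
        if attribute in STRONG:
            score += STRONG_WEIGHT
        elif attribute in WEAK:
            score += WEAK_WEIGHT
        else:
            raise ValueError(f"unknown attribute: {attribute}")
    return score


def is_match(shared_attributes, threshold=3.0):
    """Accept a pair when its combined evidence reaches the assumed threshold."""
    return match_score(shared_attributes) >= threshold
```

With these assumed values a single strong attribute is enough to accept a pair, while it takes six weak attributes to reach the same score.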

Computing it
- Standard approach
  - Generate keys and data
  - Load information into a database
  - Index it
  - Extract fields needed
- Map/Reduce approach
  - Split the database up
  - Run parallel jobs
  - Bring information together via map/reduce
  - Assemble information in stages
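The split, run in parallel, and bring together steps of the Map/Reduce approach can be sketched on one machine. This is a minimal sketch, not the talk's implementation: threads stand in for cluster nodes, the input is assumed to be already reduced to (key, data) tuples, and the function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Split the input into `parts` roughly equal slices."""
    return [records[i::parts] for i in range(parts)]


def map_chunk(chunk):
    """Group one slice of (key, data) pairs by key."""
    partial = defaultdict(list)
    for key, data in chunk:
        partial[key].append(data)
    return partial


def parallel_group(records, parts=4):
    """Process the slices in parallel, then merge the partial results."""
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        for partial in pool.map(map_chunk, split(records, parts)):
            for key, values in partial.items():
                merged[key].extend(values)
    return {key: sorted(values) for key, values in merged.items()}
```

Because each slice is handled independently, the same structure carries over to separate processes or machines; only the merge step needs to see all the partial results.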

Map/Reduce
- Two stages
  - Map
    - Read in source file (e.g. MARC-21)
    - Write out key + data
  - Reduce
    - Read in array of data for each unique key
    - Write out key + data
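The two stages can be illustrated with a single-process stand-in. This is a sketch under stated assumptions: plain dictionaries replace MARC-21 records, and `run_map_reduce`, `surname_mapper`, and `id_list_reducer` are hypothetical names invented for the example.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for a map/reduce job."""
    groups = defaultdict(list)
    # Map: each input record writes out zero or more (key, data) pairs.
    for record in records:
        for key, data in mapper(record):
            groups[key].append(data)
    # Reduce: each unique key is handed the array of data collected for it.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def surname_mapper(record):
    """Hypothetical mapper: key each record id by lower-cased surname."""
    yield record["surname"].lower(), record["id"]


def id_list_reducer(key, values):
    """Hypothetical reducer: emit the sorted ids seen for one key."""
    yield key, sorted(values)
```

For example, three records with surnames Adams, Morel, and Adams produce two output rows, one per unique key, each carrying the ids that shared that surname.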

Overview of MapReduce
[Figure. Source: Dean & Ghemawat (Google)]

Our Implementation
- Written in Python
- Uses ssh and XML-RPC for control and communication
- Map/Reduce seems to add ~10% overhead
- Ran an earlier implementation on a 48-CPU cluster
- Current VIAF cluster is a 12-CPU cluster on 4 nodes
  - Running Linux and 64-bit Python

VIAF Matching Code
- 17 modules
- 1,100 lines of code
- Plus 600 lines configuration
  - 2,755 lines of tables embedded in code

VIAF Data Flow
[Diagram. Inputs: LC Authority, LC Catalog, PND Authority, PND Catalog. Steps and their outputs, as labelled:]
- Extract Data
- build name:id map (name:id)
- map authorities (authority id: bib id)
- get changed Ids (changed authority ids)
- build buckets (surname: forename,date)
- eliminate forename, date conflicts from buckets (potential pairs)
- identify compare data (pair id:[bib/auth]id)
- build compare data (id:tag, data)
- select compare data (pair id: compare data)
- compare (pair id: scores)
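The bucket steps in the data flow (build buckets keyed by surname, eliminate forename and date conflicts, emit potential pairs) might look like the following. This is a hedged sketch, not the VIAF code: the tuple layout, the function names, and the conflict rule (different forename initials, or two known birth years that differ) are assumptions, and the restriction to pairs that span both files is omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(names):
    """Group (record_id, surname, forename, birth_year) tuples by surname."""
    buckets = defaultdict(list)
    for record_id, surname, forename, birth_year in names:
        buckets[surname.lower()].append((record_id, forename, birth_year))
    return buckets


def conflicts(a, b):
    """Assumed rule: initials differ, or both birth years are known and differ."""
    _, forename_a, year_a = a
    _, forename_b, year_b = b
    if forename_a and forename_b and forename_a[0].lower() != forename_b[0].lower():
        return True
    return year_a is not None and year_b is not None and year_a != year_b


def potential_pairs(buckets):
    """Return id pairs within each bucket that survive the conflict check."""
    pairs = []
    for surname in sorted(buckets):
        for a, b in combinations(buckets[surname], 2):
            if not conflicts(a, b):
                pairs.append((a[0], b[0]))
    return pairs
```

Bucketing keeps the number of comparisons manageable: only records that share a surname are ever paired, and cheap conflict checks discard most of those before the more expensive comparison runs.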

WorldCat Identities
- Bring together all of WorldCat's information about people
  - Name(s)
  - Works by and about
  - Subjects
  - Dates
  - Fiction/non-fiction
  - Roles
  - Co-authors
- Add links
  - Wikipedia
  - Authority files

Sample Identity

Statistics
- Nearly 19 million different 'identities' in WorldCat
- 80 million (nominally) controlled headings
- The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)

Identities Data Flow
[Diagram. Inputs: WorldCat, FRBR, Audience, Cover Art, Authorities, Wikipedia. Stages: Stage 1, Stage 2, Stage 3, Stage 4. Intermediate data: NameInfo, Citations. Output: Identities]

Identities Stage 1: Extract Data From WorldCat
- Input: WorldCat (MARC-21)
- Map output: NameKey WorkID
- Reduce output: WorkID NameKey
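Stage 1 as described swaps the key between map and reduce: the map step emits work identifiers under each name key, and the reduce step writes them back out keyed by work. The sketch below shows that shape only. It is an assumption-laden illustration: dictionaries with `work_id` and `name_keys` fields stand in for MARC-21 records, and the function names are hypothetical.

```python
from collections import defaultdict


def stage1_map(record):
    """Emit (NameKey, WorkID) for every name attached to a record."""
    for name_key in record["name_keys"]:
        yield name_key, record["work_id"]


def stage1_reduce(name_key, work_ids):
    """Emit (WorkID, NameKey) once per distinct work for this name."""
    for work_id in sorted(set(work_ids)):
        yield work_id, name_key


def run_stage1(records):
    """Group the map output by NameKey, then apply the reduce step."""
    grouped = defaultdict(list)
    for record in records:
        for name_key, work_id in stage1_map(record):
            grouped[name_key].append(work_id)
    return [pair
            for name_key in sorted(grouped)
            for pair in stage1_reduce(name_key, grouped[name_key])]
```

Writing the output keyed by WorkID means a later stage can group on works and see every name attached to each one.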

Identities Stage 2: Extract Data From Authorities
- Input: NACO Authorities file (MARC-21)
- Map output: NameKey XTos XFroms
- Reduce output: NameKey

Identities Stage 3: Connect Citations with Names
- Input: Stage 1 output (WorkID's NameKey)
- Map output: NameKey

Identities Stage 4: Create Identities
- Input
  - Authority info from stage 2
  - Merged name info from stage 3
  - Merged citations from stage 3
- Map output: pass through
- Reduce output: Pnkey

Schedules
- Identities
  - Up this year?
- VIAF
  - Reload, rematch this year
  - Public service up early 2007

Conclusions
- Our merged files (e.g. WorldCat) are really quite large
- More processing power opens up new ways of manipulating and looking at our data
- Parallel processing is the only way to obtain the cycles needed
- Map-Reduce is an attractive way to do parallel processing
  - Forces decomposition
  - Scales well
  - Opens up new possibilities

Thank you
T. Hickey
VIAF.org
http://errol.oclc.org/laf/n82-54463.html
Access 2006, October 2006
