Published by Megan Suarez. Modified over 11 years ago.
1
FRBR: Algorithms and Applications
T. Hickey, J. Toves, D. Vizine-Goetz
Online Computer Library Center
CLA, November 2004
2
Outline
Algorithms
  FRBR work matching
  Handling author-title variants
Hardware
  Beowulf cluster
Applications
  Bookmarklets
  FictionFinder
Future directions
3
Working with Group 1 Entities
WEMI: Work, Expression, Manifestation, Item
Strict expression-level determination is hard
  We primarily divide by language
Manifestation is easier
  We use the WorldCat master record
4
Work Identification
Algorithm goals:
  Efficient
  Understandable
  Controllable by catalogers
  Uses existing WorldCat records
5
The Algorithm
A key is generated for each record
  Extract author, title
  Look up in LC name authority file
  Added entry information as needed
Form a key from bibliographic record
  Author, title, added entry information
These can be sorted, compared
6
Example
(record count, then key)
  146  Smollett\1721 Expedition of Humphry Clinker
   16  Smollett\1721 Expedition of Humphrey Clinker
    8  Smollett\1721 Humphry Clinker
    4  Smollett\1721 Humphrey Clinker
    2  Smollett\1721 Expedition of Humphry Clinker
    1  Smollett\1721 Calatoriile lui Humphrey Clinker
    1  Smollet\1721 Expedition of Humphry Clinker
    1  Smollett Humphry Klinkers Reisen
7
Example (with authorities)
(record count, then key)
  156  Smollett\1721 Expedition of Humphry Clinker
   16  Smollett\1721 Expedition of Humphrey Clinker
    4  Smollett\1721 Humphrey Clinker
    1  Smollett\1721 Calatoriile lui Humphrey Clinker
    1  Smollet\1721 Expedition of Humphry Clinker
    1  Smollett\1721 Humphry Klinkers Reisen
8
More Detail
Extract author names
  Look up in authority file
  Currently only personal names
  Subfields $abcdq
Extract title
  Always use uniform titles if present
  Look up author/short title (~$a)
  Look up author/long title (~$abfgnp)
  Prefer alternative title for non-English
Create key from author/title
  Always do NACO normalization (has limitations)
  Add information for uncontrolled title-main-entry
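The key-building steps above can be sketched roughly as follows. This is a minimal illustration, not the production algorithm: `normalize` is only a crude stand-in for NACO normalization, the `author\dates/title` layout is inferred from the example keys shown elsewhere in these slides, and `TOY_AUTHORITY` is a hypothetical two-entry table standing in for the LC name authority file.

```python
import re
import unicodedata


def normalize(text):
    """Crude stand-in for NACO normalization (not the full rule set)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, collapse runs of whitespace.
    spaced = re.sub(r"[^\w&]+", " ", stripped.lower())
    return " ".join(spaced.split())


def make_key(author, dates, title, uniform_title=None, authority=None):
    """Build a work key; `authority` maps variant forms to established forms."""
    authority = authority or {}
    # Always prefer the uniform title when the record carries one.
    title_part = normalize(uniform_title or title)
    author_part = normalize(author)
    if dates:
        author_part = author_part + "\\" + normalize(dates)
    # First resolve author variants, then author/title variants.
    author_part = authority.get(author_part, author_part)
    key = author_part + "/" + title_part
    return authority.get(key, key)


# Hypothetical authority entries, for illustration only.
TOY_AUTHORITY = {
    "smollet\\1721": "smollett\\1721",
    "smollett\\1721/humphry clinker": "smollett\\1721/expedition of humphry clinker",
}

if __name__ == "__main__":
    print(make_key("Smollet", "1721", "Humphry Clinker", authority=TOY_AUTHORITY))
```

Because every key is a plain string, records can be grouped into works by sorting and comparing keys, which is what the later counting steps rely on.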
9
Authority Files Rule!
Authors
Author/titles
Bring together variations
Allow override in difficult cases
  Both splitting and joining groups
  Especially important with xISBN matching
  Especially important with non-English metadata
10
Limitations of the Authority File
What's missing:
  Many uniform titles
  Many author variants
  Many title variants
  Language of heading
Partial solution
  Create auxiliary files of mechanically generated matches
11
Results of FRBR Matching on WorldCat
88% of works are singletons (a single manifestation)
30% of manifestations are in 12% of the works
Average size of multiple matches: 3.1 manifestations/work
43.1 million works in 54 million manifestations
54% of holdings on a FRBR work with >1 manifestation
WorldCat manifestations average about 20 holdings
FRBR helps where help is most needed
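A quick arithmetic check of these figures, as a sketch. It assumes the singleton share is measured over works, so that the remaining 12% of works hold more than one manifestation, and it uses only the rounded numbers from the slide. The function and variable names are illustrative.

```python
def check_frbr_figures():
    """Recompute the derived percentages from the rounded slide figures."""
    works_m = 43.1            # million works
    manifestations_m = 54.0   # million manifestations
    multi_share = 0.12        # share of works with more than one manifestation
    avg_multi_size = 3.1      # manifestations per multi-manifestation work

    multi_works_m = works_m * multi_share
    multi_manifs_m = multi_works_m * avg_multi_size
    singleton_works_m = works_m - multi_works_m

    return {
        # Share of all manifestations that sit in multi-manifestation works.
        "share_in_multi_works": multi_manifs_m / manifestations_m,
        # Singletons plus grouped manifestations should come back to the total.
        "reconstructed_total_m": singleton_works_m + multi_manifs_m,
    }


if __name__ == "__main__":
    for name, value in check_frbr_figures().items():
        print(name, round(value, 2))
```

Under that reading the numbers agree: roughly 30% of manifestations fall in the 12% of works that have more than one manifestation, and the reconstructed total is close to 54 million.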
12
More FRBR Results
310,000 works have more than 5 manifestations
1.7 million have more than 2 manifestations
Largest:
  30,000+ for the Bible
  1,537 Shakespeare's Macbeth
  1,026 Dickens's Christmas Carol
13
The Top 10 Works by Holdings
Rank  Work                                       Holdings  Manifs
 1    US Census (various)                         403,252  10,164
 2    Bible (combined)                            271,534  36,738
 3    Mother Goose                                 66,543   1,997
 4    Dante, The Divine Comedy                     59,034   2,714
 5    Homer, The Odyssey                           43,871   2,009
 6    Homer, The Iliad                             42,756   2,388
 7    Twain, Huckleberry Finn                      39,310   1,093
 8    Shakespeare, Hamlet                          37,683   1,917
 9    Carroll, Alice's Adventures in Wonderland    37,614   1,865
10    Tolkien, Lord of the Rings                   37,461     643
14
The Top 10 Works Cataloged in 2003
Rank  Work                                                  Libraries
 1    Rowling, Harry Potter and the Order of the Phoenix        2,406
 2    Clinton, Living History                                  36,738
 3    Rohmann, My Friend Rabbit                                 1,997
 4    Brown, The Da Vinci Code                                  2,714
 5    Gibaldi, MLA Handbook                                     2,009
15
Top 1000 Publication Dates
16
Top 1000 Languages
17
Our Beowulf Cluster
24 nodes
  Each with 2 x 2.6 GHz processors
  4 GBytes memory (96 GBytes total)
  One head node, 23 compute nodes
46 x 40 GBytes disk (~2 Terabytes total)
Gigabit switch
18
What we are using it for
All our bibliographic processing
  FRBR
  Extractions
  Searching
  Matching
19
Ganglia load visualization
20
Starting point
FRBR key generation
  25 hours on a 3.00 GHz workstation with 2 GB of RAM
Generate two key files
  sort by key, uniq by key, sort by occurrence
  sort by key, post processing on keys, uniq by key, sort by occurrence
Merge key files
21
FRBR on the Cluster
44 minutes on the cluster
69 key builders & 23 sort buckets with hyperthreading ON
Generate 23 radix-sorted, post-processed key files
Collapse and sort by occurrence in parallel
Also outputs additional files used by other jobs
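The bucketed "collapse and sort by occurrence" step can be sketched as below. This is an illustration under assumptions, not the cluster code: keys are assigned to buckets with a CRC32 hash rather than the radix scheme the slide mentions, and the buckets are processed one after another in a single process, whereas on the cluster each bucket would go to its own node. The function names are hypothetical.

```python
import zlib
from collections import Counter


def bucket_of(key, n_buckets=23):
    """Deterministically assign a key to one of n_buckets (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def count_keys(keys, n_buckets=23):
    """Partition keys into buckets, collapse duplicates per bucket, sort by occurrence."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[bucket_of(key, n_buckets)].append(key)

    counted = []
    for bucket in buckets:
        # Equivalent of "sort by key, uniq by key" with counts, within one bucket.
        # Identical keys always share a bucket, so each count is complete.
        counted.extend(Counter(sorted(bucket)).items())

    # Sort by occurrence, most frequent first; ties are broken by key.
    return sorted(counted, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "a", "b"]
    print(count_keys(sample))
```

The design point is that no bucket needs to see another bucket's keys, which is what lets the collapse step run in parallel.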
22
Application: Preservation
Identify final copy items
  Do it at the work level
Single-singles
  Single manifestations with single holding
  Found 18 million in WorldCat
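A minimal sketch of the single-single test, assuming each record has already been reduced to a (work key, manifestation id, holdings count) tuple. That record shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def single_singles(records):
    """Return work keys that have exactly one manifestation held by exactly one library.

    `records` is an iterable of (work_key, manifestation_id, holdings) tuples.
    """
    by_work = defaultdict(list)
    for work_key, manifestation_id, holdings in records:
        by_work[work_key].append((manifestation_id, holdings))

    return sorted(
        work_key
        for work_key, manifestations in by_work.items()
        if len(manifestations) == 1 and manifestations[0][1] == 1
    )


if __name__ == "__main__":
    toy_records = [
        ("work-a", "m1", 1),   # single manifestation, single holding
        ("work-b", "m2", 1),   # one of two manifestations of work-b
        ("work-b", "m3", 40),
        ("work-c", "m4", 12),  # single manifestation, but widely held
    ]
    print(single_singles(toy_records))
```

Working at the work level matters here: a manifestation with one holding is not a final copy if the same work survives in other manifestations.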
23
Application: xISBN
A simple Web service
Given an ISBN:
  Identify the workset it is in
  Return all other ISBNs in that workset
Results should be symmetrical!
  Same group retrieved for each ISBN in group
ISBNs sorted by number of library holdings
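The lookup behind such a service could look roughly like this. It is an in-memory sketch, not the actual Web service: the workset layout, the function names, and the holdings counts in the toy data are all invented for illustration. It returns the whole group, including the queried ISBN, which is one simple way to get the symmetry property.

```python
def build_index(worksets):
    """Map each ISBN to the id of the workset that contains it."""
    index = {}
    for workset_id, members in worksets.items():
        for isbn in members:
            index[isbn] = workset_id
    return index


def xisbn(isbn, worksets, index):
    """Return every ISBN in the same workset, most widely held first."""
    normalized = isbn.replace("-", "").upper()
    workset_id = index.get(normalized)
    if workset_id is None:
        return []
    members = worksets[workset_id]
    # Sort by holdings (descending); ties are broken by ISBN for stable output.
    return sorted(members, key=lambda member: (-members[member], member))


# Toy data: workset id -> {ISBN: holdings}. The holdings counts are invented.
TOY_WORKSETS = {
    "w1": {"0192816640": 900, "0820312037": 400, "0140430210": 650},
    "w2": {"0000000001": 5},
}

if __name__ == "__main__":
    toy_index = build_index(TOY_WORKSETS)
    print(xisbn("0-19-281664-0", TOY_WORKSETS, toy_index))
```

Because the result depends only on the workset and not on which member was asked for, every ISBN in a group retrieves the same list.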
24
xISBN Example
http://labs.oclc.org/xisbn/0-19-281664-0 returns:
  0192816640
  0820312037
  0820315370
  0393015920
  0393952274
  0393952835
  0140430210
  0192811320
  0192835947
  0460872885
  1853262706
  0874131219
25
Matching on ISBNs
ISBNs add information beyond author/title
  Allows relaxation of matching
  Introduces possible errors
Offers the possibility of substantial improvement of work matching
26
Merging Worksets Using ISBN Matches
Pair ISBNs with FRBR keys (starts with 10 million ISBNs)
Throw out ISBNs in single worksets
Throw out ISBNs in > 5 worksets (we now have 561,000 ISBNs left)
Are the titles similar enough?
Throw out large groups
Try to be very conservative
Authority file always overrides other matching
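The filtering steps above might be sketched as follows. Several things here are assumptions: the slides do not say how title similarity is measured, so `difflib.SequenceMatcher` with a 0.8 threshold is swapped in purely for illustration, and the large-group filter and the authority-file override are left out. The function name and the `author/title` key layout are likewise illustrative.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_merges(isbn_key_pairs, max_worksets=5, min_similarity=0.8):
    """Propose pairs of work keys to merge because they share an ISBN.

    `isbn_key_pairs` is an iterable of (isbn, work_key) tuples, where a work
    key looks like "author/title".
    """
    keys_by_isbn = defaultdict(set)
    for isbn, key in isbn_key_pairs:
        keys_by_isbn[isbn].add(key)

    candidates = set()
    for keys in keys_by_isbn.values():
        # ISBNs in a single workset tell us nothing new;
        # ISBNs in too many worksets are treated as unreliable.
        if len(keys) < 2 or len(keys) > max_worksets:
            continue
        for first, second in combinations(sorted(keys), 2):
            title_a = first.split("/", 1)[-1]
            title_b = second.split("/", 1)[-1]
            # Conservative check: only keep pairs whose titles are close.
            if SequenceMatcher(None, title_a, title_b).ratio() >= min_similarity:
                candidates.add((first, second))
    return sorted(candidates)


if __name__ == "__main__":
    pairs = [
        ("111", "smollett\\1721/expedition of humphry clinker"),
        ("111", "smollett\\1721/expedition of humphrey clinker"),
        ("222", "smollett\\1721/humphry clinker"),
    ]
    print(candidate_merges(pairs))
```

Each surviving pair would become a cross-reference candidate; in keeping with the last bullet, an authority-file decision would take precedence over any pair proposed this way.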
27
Matches from ISBN Matching
74,000 author variants
~200,000 title variants
These all create additional cross reference records
  Automatically folded into FRBR matching
  Kept separate from NACO file
  Only used in research at this time
28
Examples of Possible Matches
  /mcgraw hill encyclopedia of science & technology
  /mcgraw hill encyclopedia of science & technology\1\aar aor
  /mcgraw hill encyclopedia of science & technology\2\apa boo
  /mcgraw hill encyclopedia of science & technology\3\bor cle
  /mcgraw hill encyclopedia of science & technology\4\cli cyt
  …
  dickens, charles\1812 1870/tale of two cities
  dickens, charles\1812 1870/hard times
  dickens, charles\1812 1870/sketches by boz
  dickens, charles\1812 1870/martin chuzzlewit
  dickens, charles\1812 1870/bleak house
  dickens, charles\1812 1870/little dorrit
  dickens, charles\1812 1870/oliver twist
  …
29
Application: Bookmarklets
30
Clicking on Princeton
31
FictionFinder
Indexes fiction from WorldCat
Uses FRBR workset algorithm
Focused on fiction
Searching and browsing by
  Genre
  Fictitious Characters
  Imaginary Places
  Literary Forms
Links to Google Open WorldCat
Diane Vizine-Goetz's project
32
Humphry Clinker Search
33
Work Display
34
Detail of Language Display
35
First Few English Manifestations
36
Manifestation Display
37
Open WorldCat Link
38
Additional Matches
Match variant titles:
  When the wind blows
  When the wind blows: a novel
FictionFinder identified 10,000 of similar variations
  novela, novella, roman, …
Created auxiliary authority records
Now automatically used when FRBR algorithm is run
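One way such variants could be detected mechanically is to strip a trailing genre designation before comparing titles. The sketch below is an assumption about how that might be done, not a description of the FictionFinder code. It covers only the designations named on the slide plus "novel" from the example, and it requires a colon before the designation to stay conservative.

```python
import re

# Trailing ": a novel", ": novela", ": novella", ": roman" (case-insensitive).
_GENRE_SUFFIX = re.compile(
    r"\s*:\s*(?:a\s+)?(?:novel|novela|novella|roman)\s*$",
    re.IGNORECASE,
)


def strip_genre_suffix(title):
    """Remove a trailing genre designation such as ': a novel' from a title."""
    return _GENRE_SUFFIX.sub("", title).strip()


def is_genre_variant(title_a, title_b):
    """True when two titles differ only by a trailing genre designation."""
    return strip_genre_suffix(title_a).lower() == strip_genre_suffix(title_b).lower()


if __name__ == "__main__":
    print(strip_genre_suffix("When the wind blows: a novel"))
    print(is_genre_variant("When the wind blows", "When the wind blows: a novel"))
```

Pairs flagged this way would be recorded as auxiliary cross-references rather than edits to the titles themselves, so a cataloger-maintained authority record can still override them.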
39
Future
Continued development of FictionFinder
Extending algorithm to serials?
FirstSearch displays
Additional matching criteria
Local authority files?
Integration of auxiliary files for production?
Exploring FRBRizing some European catalogs
Looking at extending beyond Roman characters
40
Links
IFLA FRBR - Final Report
  http://www.ifla.org/VII/s13/frbr/frbr.htm
Article in D-Lib
  http://www.dlib.org/dlib/september02/hickey/09hickey.html
OCLC Research Activities with FRBR
  http://www.oclc.org/research/projects/frbr/
FictionFinder
  http://fictionfinder.oclc.org/
Top 1000
  http://www.oclc.org/research/top1000/