Published by Adrian McMillan. Modified over 11 years ago.
1
ASIS&T 2008 Annual Meeting Columbus, OH 28 October, 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Timothy J. Dickey, Ph.D. Post-Doctoral Researcher OCLC Beyond Data Mining: Delivering the Next Generation of Services from Library Data
2
WorldCat as an Aggregate Collection Data Mining and Analysis of WorldCat: "…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making." Lavoie, B. F., Connaway, L. S., & O'Neill, E. T. (2007). Mapping WorldCat's digital landscape. Library Resources & Technical Services, 51, 106-115, at 107.
3
WorldCat: July 2008 Total holdings: 1,292,763,300 Manifestations (records): 108,828,533 Works: 84,096,107 Digital Items: 3,182,550 Institutions: 69,000 Physical Items: ~1.2 billion
4
Global Origins of WorldCat Materials US 28% UK 8% Canada 3% Rest of World 27% Unknown 17% France 4% Germany 10%
5
Global Origins of WorldCat Materials
Content languages: 478 (49% of WorldCat non-English). Top 5 non-English: German 12 million; French 6.1 million; Spanish 3.5 million; Dutch 2.6 million; Japanese 2.4 million
Materials with non-US origins: 57.9 million (55%). Top 5: Germany 10.0 million; UK 8.8 million; France 4.2 million; Netherlands 2.9 million; Canada 2.9 million
Non-English metadata language: 28 million (66 languages). Top 5: German 11 million; Dutch 5.0 million; Swedish 1.9 million; French 1.8 million; Finnish 0.7 million
6
WorldCat as a Decision-Making Resource Collection management Cooperative collection development Comparative collection analysis Collection assessment Mass digitization Off-site storage Preservation
7
WorldCat as a Decision-Making Resource Services Virtual reference Recommender services Social networking Systems Precision
8
WorldCat as a Decision-Making Resource Three Areas of Data Mining Research: OCLC WorldMap Audience Level Publisher Name Server
9
OCLC WorldMap
10
OCLC WorldMap™: Objectives Geographically represent WorldCat data: titles published in each country; holdings for titles published in each country; languages represented for titles published in each country
11
OCLC WorldMap™: Objectives Geographically represent data from UNESCO, ARL, and NCES for each country: number of libraries; library volumes; certified/degreed librarians; registered library users; library expenditures; cultural heritage institutions (museums and archives); publishers
12
OCLC WorldMap™: Objectives Research prototype Support OCLC data mining research Visually display data for review and analysis Internal use Sales and marketing External use Library collection assessment and comparison Data may be processed AT A GLANCE Complement the AAU/ARL Global Resources Network project Project of the Council on Library and Information Resources (CLIR)
22
http://pubserv.oclc.org:12223/WorldMap/
23
OCLC Audience Level
24
Audience Level: Rationale and Objectives Holdings represent selection decisions by librarians … implies there are more than 1 billion individual selection decisions in the WorldCat holdings file. Selections serve the interests of a library's target community … Associate community (audience level) to library profiles, e.g., ARL, non-ARL academic, public, K-12 school … Thus we can infer materials' audience level from holdings patterns, which in turn can support: collection management; readers' advisory services; reference services; information retrieval
29
Example Computation: Build Community
Library symbol  Library name                     Library type  Weight
OHI             State Library of Ohio            Other         x
OCO             Columbus Metropolitan Library    Public        0.33
CDC             Cedarville University            Academic      0.67
LIM             Lima Public Library              Public        0.33
OUN             Ohio University                  Research      1.00
OSD             SEO Automation Consortium        Other         x
BGU             Bowling Green State University   Academic      0.67
MIA             Miami University                 Academic      0.67
AKR             University of Akron              Academic      0.67
BGF             Firelands College                Academic      0.67
CIN             University of Cincinnati         Research      1.00
TOL             University of Toledo             Academic      0.67
KSU             Kent State University            Research      1.00
HIR             Hiram College                    Academic      0.67
YNG             Youngstown State University      Academic      0.67
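The slides do not spell out the formula behind the audience level, so the following is only a minimal sketch under a stated assumption: that a manifestation's level is the mean weight of the libraries holding it, with "Other" libraries (weight "x") left out as unusable. The names `TYPE_WEIGHTS` and `manifestation_audience_level` are hypothetical; the weights and library types come from the table above.

```python
# Weights by library type, as listed on the slide. "Other" has no weight ("x").
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}


def manifestation_audience_level(holding_types):
    """Return (level, usable_holdings) for one manifestation.

    ASSUMPTION: the level is the mean weight of all weighted holders.
    Holders with no weight (e.g. "Other") are skipped.
    """
    usable = [TYPE_WEIGHTS[t] for t in holding_types if t in TYPE_WEIGHTS]
    if not usable:
        return None, 0  # no usable holdings, so no level can be computed
    return sum(usable) / len(usable), len(usable)


# Library types of the 15 Ohio libraries in the example community,
# treated here as if all of them held the same item.
OHIO_HOLDERS = [
    "Other", "Public", "Academic", "Public", "Research",
    "Other", "Academic", "Academic", "Academic", "Academic",
    "Research", "Academic", "Research", "Academic", "Academic",
]

level, usable = manifestation_audience_level(OHIO_HOLDERS)
print(f"usable holdings: {usable}, audience level: {level:.4f}")
```

Under this reading, an item held mostly by research libraries scores near 1.00 and one held mostly by public libraries scores near 0.33.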
30
FRBRizing Audience Level Results
Calculate Audience Level for each Manifestation
Aggregate weighted holdings for Work
OCLC Number  Total Holdings  Usable Holdings  Manifestation Audience Level
15504400     147             114              0.783825
29613712     172             117              0.769453
40393191     207             136              0.789426
62762763     190             124              0.758274
81016224     1               0                x
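As a rough illustration of "aggregate weighted holdings for Work", the sketch below assumes the work-level value is the mean of the manifestation levels weighted by usable holdings; the slide does not confirm this exact formula. The rows are read from the example table as OCLC number, total holdings, usable holdings, and level, and the function name `work_audience_level` is hypothetical.

```python
# Rows read from the slide: (OCLC number, total holdings, usable holdings, level).
# The last manifestation has no usable holdings, so its level is undefined ("x").
MANIFESTATIONS = [
    (15504400, 147, 114, 0.783825),
    (29613712, 172, 117, 0.769453),
    (40393191, 207, 136, 0.789426),
    (62762763, 190, 124, 0.758274),
    (81016224, 1, 0, None),
]


def work_audience_level(rows):
    """ASSUMPTION: work level = usable-holdings-weighted mean of manifestation levels."""
    weighted_sum = 0.0
    total_usable = 0
    for _oclc, _total, usable, level in rows:
        if level is None or usable == 0:
            continue  # skip manifestations without a computable level
        weighted_sum += usable * level
        total_usable += usable
    return weighted_sum / total_usable if total_usable else None


work_level = work_audience_level(MANIFESTATIONS)
print(f"work-level audience level: {work_level:.4f}")
```

Weighting by usable holdings keeps a manifestation with a single holding from pulling the work-level figure as hard as one with more than a hundred.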
32
Evaluating the OCLC Audience Level Random sample of 30 Zoology books, all audience levels Human subjects Ranked books in increasing order of difficulty Strong statistical correlation between human subjects ranking and programmatic ranking
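The slide reports a strong correlation between the two rankings without naming the statistic. One common choice for comparing rankings is Spearman's rank correlation, sketched below for rankings without ties. The two rankings shown are invented for illustration and are not the study's data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# HYPOTHETICAL rankings of six books by increasing difficulty.
human_ranking = [1, 2, 3, 4, 5, 6]
programmatic_ranking = [1, 3, 2, 4, 6, 5]

rho = spearman_rho(human_ranking, programmatic_ranking)
print(f"Spearman rho: {rho:.3f}")
```

A value near 1 means the programmatic audience level orders the books much as the human subjects did.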
33
Evaluating the OCLC Audience Level
34
http://audiencelevel.oclc.org/
35
OCLC Publisher Name Server
36
Publisher Name Server: Research Objectives Resolve, for data mining and WorldCat quality: ISBN prefixes to publisher name; variant publisher names to a preferred form. Complement Collection Analysis Service (librarians, publishers). Capture and profile attributes of individual publishers: location(s); language(s) of materials published; genre(s)/format(s); dominant subject domain(s); parent company and subsidiaries
37
Publisher Name Server: Methodology Programmatically cluster publisher records using ISBN prefixes. Data clustering (The Free Dictionary): "The science of extracting useful information from large data sets or databases"; classification of similar objects into different groups; partitioning of a data set into subsets (clusters); data in each subset (ideally) share some common trait. Hand parse the entities and resolve ISBN prefixes
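The clustering step can be pictured as grouping publisher name strings by shared ISBN prefix. This is a simplified sketch, not the project's actual code: it assumes ISBN-10s already arrive hyphenated, so the prefix is simply the group and registrant elements. Real records usually lack hyphens and would need the ISBN range tables. The sample records and the names `isbn_prefix` and `cluster_by_prefix` are hypothetical.

```python
from collections import defaultdict


def isbn_prefix(isbn):
    """Return 'group-registrant' from a hyphenated ISBN-10.

    ASSUMPTION: the ISBN has exactly four hyphen-separated parts
    (group, registrant, publication, check digit).
    """
    parts = isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Group the publisher name strings that share an ISBN prefix."""
    clusters = defaultdict(set)
    for publisher_string, isbn in records:
        clusters[isbn_prefix(isbn)].add(publisher_string)
    return dict(clusters)


# ILLUSTRATIVE records only (not real catalog data; check digits not validated).
SAMPLE_RECORDS = [
    ("Oxford University Press", "0-19-000001-0"),
    ("Oxford Univ. Press", "0-19-000002-0"),
    ("OUP", "0-19-000003-0"),
    ("Prentice-Hall, Inc.", "0-13-000001-0"),
    ("Prentice Hall", "0-13-000002-0"),
]

clusters = cluster_by_prefix(SAMPLE_RECORDS)
for prefix, names in sorted(clusters.items()):
    print(prefix, sorted(names))
```

Each cluster then becomes a candidate set of variant forms, which a person reviews when hand parsing the entities.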
38
Publisher Name Server: Database 1750 publishing entities Relational database, preserving hierarchical relationships Begins with high-occurrence entities: Top 10 lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand) Top 10 university presses Mergers and acquisitions, last 8 years
39
Publisher Name Server: Data Captured
Database Fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL
-----
Languages; Formats; Conspectus Subjects
Data Sources: U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover's Handbook Online; Standard and Poor's Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites
DATA MINING
41
Publisher Name Server: Database More than 56,000 separate strings mapped to 1750 entities 8.5 million OCLC records 22% of these are Library of Congress records ~490 million holdings Hierarchical relationships maintained
42
Entity-Parsing in a World of Mergers and Acquisitions Prentice-Hall, Inc.; Pearson Education, Inc.; Addison-Wesley Publishing Company; Allyn and Bacon; Dominie Press; Benjamin/Cummings Publishing Company; Scott, Foresman and Company; HarperCollins Educational Publishers; Longmans, Green, and Co.; Pearson PLC; Pearson Canada; Pearson Technology Group; Copp Clark; Adobe Press; Cisco Press; Penguin Books; Allen Lane; Ladybird Books; Riverhead Books; Puffin Books; Putnam Books; Berkley Publishing Group; Avery
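One way to keep such hierarchical relationships usable for aggregation is a parent map that is walked up to the top-level entity. The sketch below is illustrative only: the parent links are a small assumed subset (the slide's tree layout did not survive extraction), the record counts are invented, and `root_entity` and `aggregate_by_root` are hypothetical names.

```python
# ASSUMED subset of parent links among entities named on the slide.
PARENTS = {
    "Prentice-Hall, Inc.": "Pearson Education, Inc.",
    "Addison-Wesley Publishing Company": "Pearson Education, Inc.",
    "Pearson Education, Inc.": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Penguin Books": "Pearson PLC",
}


def root_entity(name, parents):
    """Follow parent links until an entity with no parent is reached."""
    seen = set()
    while name in parents:
        if name in seen:
            raise ValueError(f"cycle in hierarchy at {name!r}")
        seen.add(name)
        name = parents[name]
    return name


def aggregate_by_root(record_counts, parents):
    """Roll per-entity record counts up to each top-level entity."""
    totals = {}
    for name, count in record_counts.items():
        root = root_entity(name, parents)
        totals[root] = totals.get(root, 0) + count
    return totals


# HYPOTHETICAL record counts, for illustration only.
RECORD_COUNTS = {
    "Prentice-Hall, Inc.": 120,
    "Puffin Books": 30,
    "Pearson PLC": 5,
    "Oxford University Press": 80,
}

totals = aggregate_by_root(RECORD_COUNTS, PARENTS)
print(totals)
```

Storing only the immediate parent means a new merger or acquisition is recorded by changing a single link.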
43
Publisher Profiles Oxford University Press 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat) Pearson PLC Includes 14 subsidiaries and acquisitions Aggregate: 291,433 records (0.27% of WorldCat)
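The percentages quoted for each publisher can be reproduced with simple arithmetic, assuming the denominator is the July 2008 WorldCat record (manifestation) count of 108,828,533 given earlier in the deck. The helper name `share_of_worldcat` is hypothetical.

```python
WORLDCAT_RECORDS_JULY_2008 = 108_828_533  # manifestations (records), from the deck


def share_of_worldcat(records):
    """Percentage of WorldCat records attributed to a publisher entity."""
    return 100 * records / WORLDCAT_RECORDS_JULY_2008


oup_share = share_of_worldcat(210_095)      # Oxford University Press
pearson_share = share_of_worldcat(291_433)  # Pearson PLC aggregate

print(f"Oxford University Press: {oup_share:.2f}%")
print(f"Pearson PLC: {pearson_share:.2f}%")
```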
44
Publisher Profiles – Top Languages
Oxford Univ. Press: English 96.74%; Latin 0.51%; German 0.39%; Chinese 0.39%; French 0.37%; Spanish 0.28%; Afrikaans 0.14%; Middle English 0.13%; Malay 0.09%; Swahili 0.09%
Pearson PLC: English 95.27%; Spanish 1.43%; German 1.33%; French 0.60%; Dutch 0.55%; Latin 0.26%; Malay 0.06%; Ancient Greek 0.05%; Portuguese 0.05%; Italian 0.04%
45
Publisher Profiles – Conspectus Divisions
Oxford Univ. Press: Language/Literature 27.12%; History 11.92%; Music 9.78%; Philosophy/Religion 9.55%; Business/Economics 6.15%; Medicine 4.36%; Law 3.85%; Sociology 3.75%; Political Science 3.58%; Biology 2.60%
Pearson PLC: Language/Literature 18.67%; Business/Economics 13.30%; Computer Science 9.42%; Engineering 8.04%; History 7.59%; Mathematics 6.04%; Education 5.64%; Sociology 4.18%; Philosophy/Religion 3.81%; Physical Sciences 2.75%
46
Publisher Profiles – Conspectus Categories
Oxford Univ. Press: English literature 10.66%; English language 5.86%; Instrumental music 3.48%; Vocal music 3.09%; Literature on music 2.26%; History – Britain 1.82%; Economic history 1.38%; American lit. 1.35%; History – S. Asia 1.30%; General history 1.29%
Pearson PLC: English language 7.74%; Business admin. 4.62%; English literature 3.63%; Economics 2.94%; Comp. programming 2.39%; Electrical engineering 2.24%; Early childhood ed. 2.05%; Computer software 1.88%; U.S. federal law 1.80%; Computer Science 1.54%
47
Publisher Profiles – Conspectus Subjects
Oxford Univ. Press: English – modern 5.57%; English lit – prose 2.51%; English lit – 19th c. 2.23%; Juvenile lit. 1.06%; English lit – poetry 1.03%; English lit – collections 0.80%; Biographies 0.76%; English lit – 1900-1960 0.74%; Shakespeare 0.68%; Sacred choruses 0.66%
Pearson PLC: English – modern 7.68%; Management 2.53%; Programming 1.74%; Arithmetic 1.09%; Economic theory 1.06%; Marketing 1.06%; General algebra 1.04%; Accounting 0.97%; Juvenile lit. 0.93%; English lit – 19th c. 0.89%
48
Projected MARC coding of Authorized Forms 710 Added Entry – Corporate Name Add $4 for publisher name Add $2 NAF where preferred form matches existing authority record (44% of current PNAF) 752 Added Entry – Hierarchical Place Name Add $2 FAST where place of publication matches FAST geographical subject headings
49
Future Research Further data mining Profile aspects of publication output Deeper scaling into WorldCat (beyond ISBN) Plan for long-term maintenance ISBN-13 compliance File expansion of ongoing mergers/ acquisition activities
50
Thank You! Questions and Discussion Lynn Silipigni Connaway, connawal@oclc.org; Timothy J. Dickey, dickeyt@oclc.org