Capturing Untapped Descriptive Data: Creating Value for Librarians and Users Lynn Silipigni Connaway OCLC Research ASIST 2006 Conference November 9, 2006
WorldCat: July 2006. Total holdings: 1,071,507,045; Manifestations (records): 67,282,165; Works: 53,472,668; Digital Items: 1,571,803; Institutions: 26,236; Physical Items*: ~1.6 billion. *Estimated
Origin of materials represented in WorldCat: US 34%; UK 9%; Canada 3%; Rest of World 40%; Unknown 14%
Some aspects of Global WorldCat … Content Languages: % of WC non-English. Top 5 non-English: German: 4.5 million; French: 4.2 million; Spanish: 2.9 million; Dutch: 2.1 million; Chinese: 1.6 million. Non-English Metadata Language: 9.3 million (20 languages). Top 5: Dutch: 4.1 million; French: 1.4 million; German: 1.0 million; Japanese: 0.7 million; Finnish: 0.7 million. Materials w/non-US origins: 35.3 million (52%). Top 5: UK: 6.1 million; Germany: 4.0 million; France: 2.9 million; Netherlands: 2.2 million; Canada: 2.1 million.
OCLC WorldCat™: Decision-making Resource. Collection management: Cooperative collection development; Comparative collection analysis; Collection assessment; Mass digitization; Off-site storage; Preservation. Services: Virtual reference; Recommender services. Systems: Precision.
OCLC WorldCat™: Data Mining Research Projects. Audience Level; Publisher Name Server; WorldMap.
Audience Level: Rationale and Objectives. Holdings represent selection decisions by librarians … implies there are about 1 billion individual selection decisions in the WorldCat holdings file. Selections are made to serve the interests of a library's target community … Associate target community (audience level) to particular library profiles, e.g., ARL, non-ARL academic, public, K-12 school … ? Implies: we can infer a material's audience level from holdings patterns, which in turn can support: Collection management; Readers' advisory services; Reference services; Information retrieval.
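A minimal sketch of the inference described on this slide, assuming each library profile is given a hypothetical audience score between 0 (juvenile) and 1 (specialized research) and that a title's audience level is the holdings-weighted mean of those scores. The profile labels follow the slide; the scores, the function name, and the sample holdings counts are illustrative assumptions, not values from the project.

```python
# Hypothetical audience score per library profile (assumed for illustration only).
TYPE_SCORES = {
    "k12_school": 0.10,
    "public": 0.35,
    "non_arl_academic": 0.65,
    "arl": 0.90,
}


def audience_level(holdings_by_type):
    """Return the holdings-weighted mean audience score for one title.

    holdings_by_type maps a library profile to the number of libraries of
    that profile holding the title. Unknown profiles are ignored. Returns
    None when no usable holdings are present.
    """
    known = {t: n for t, n in holdings_by_type.items() if t in TYPE_SCORES}
    total = sum(known.values())
    if total == 0:
        return None
    weighted = sum(TYPE_SCORES[t] * n for t, n in known.items())
    return weighted / total


# Made-up holdings counts for two imaginary titles.
picture_book = {"k12_school": 400, "public": 900, "arl": 20}
monograph = {"arl": 95, "non_arl_academic": 40, "public": 2}

picture_book_level = audience_level(picture_book)
monograph_level = audience_level(monograph)
```

Under these assumed scores, a title held mostly by school and public libraries lands near the low end of the scale, and one held mostly by ARL libraries lands near the high end.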
Example: Mother Goose
Publisher Name Server: Research Objectives. Resolve, for data mining and quality of WorldCat: ISBN prefixes to publisher name; variant publisher names to a preferred form. Complement Collection Analysis Service: Librarians; Publishers. Capture and make available various attributes of individual publishers: Location of publisher; Language(s) of materials published; Genre(s)/format(s) of materials published; Dominant subject domain(s) of the publisher's output; Parent company and subsidiaries.
Publisher Name Server: Methodology. Programmatically cluster publishers using ISBN prefixes. Data clustering (The Free Dictionary): "The science of extracting useful information from large data sets or databases"; Classification of similar objects into different groups; Partitioning of a data set into subsets (clusters); Data in each subset (ideally) share some common trait. Hand parse the entities and resolve ISBN prefixes.
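The clustering step can be sketched as follows, assuming hyphenated ISBN-10s so that the publisher prefix (group plus registrant element) can be read off directly. The function names are hypothetical, the sample ISBNs are invented and do not carry valid check digits, and picking the most frequent name is only a stand-in for the hand parsing the slide describes.

```python
from collections import Counter, defaultdict


def isbn_prefix(isbn):
    """Return 'group-registrant' from a hyphenated ISBN-10, e.g. '0-13'."""
    parts = isbn.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Group (isbn, publisher_name) pairs by prefix and count name variants."""
    clusters = defaultdict(Counter)
    for isbn, name in records:
        clusters[isbn_prefix(isbn)][name.strip()] += 1
    return clusters


def candidate_preferred_forms(clusters):
    """Propose the most frequent variant in each cluster for human review."""
    return {prefix: names.most_common(1)[0][0] for prefix, names in clusters.items()}


# Invented sample records (illustrative only).
records = [
    ("0-13-111111-1", "Prentice-Hall"),
    ("0-13-222222-2", "Prentice-Hall"),
    ("0-13-333333-3", "Prentice Hall, Inc."),
    ("0-201-11111-1", "Addison-Wesley"),
    ("0-201-22222-2", "Addison-Wesley Pub. Co."),
    ("0-201-33333-3", "Addison-Wesley"),
]

clusters = cluster_by_prefix(records)
preferred = candidate_preferred_forms(clusters)
```

Each cluster keeps every variant with its count, so a reviewer can see which forms were merged before a preferred form is accepted.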
Publisher Name Server: Database. To date: >800 records. Relational database, preserving hierarchical relationships. Begins with high-occurrence entities to identify: Top 10 lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy); Top university presses; Mergers and acquisitions.
Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)
Publisher Name Server: Database. Database Fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL; Languages; Formats; DDC Subjects; LCC Subjects. Data Sources: U.S. Library of Congress, National Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover's Handbook Online; Standard and Poor's Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites; DATA MINING.
Entity-Parsing in a World of Mergers and Acquisitions. Prentice-Hall, Inc.; Pearson Education, Inc.; Addison-Wesley Publishing Company; Allyn and Bacon; Dominie Press; Benjamin/Cummings Publishing Company; Scott, Foresman and Company; HarperCollins Educational Publishers; Longmans, Green, and Co.; Pearson PLC; Pearson Canada; Pearson Technology Group; Copp Clark; Adobe Press; Cisco Press; Penguin Books; Allen Lane; Ladybird Books; Riverhead Books; Puffin Books; Putnam Books; Berkley Publishing Group; Avery.
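One way to preserve parent and subsidiary relationships in a relational database is a self-referencing table queried with a recursive common table expression. The sketch below uses Python's built-in sqlite3 module. The table name, column names, and helper function are hypothetical, and the parent links shown are an assumed reading of the flattened diagram, limited to a handful of the entities named on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE publisher (
        id             INTEGER PRIMARY KEY,
        preferred_name TEXT NOT NULL,
        parent_id      INTEGER REFERENCES publisher(id)  -- NULL for a top-level entity
    )
    """
)

# Assumed hierarchy for illustration; not taken from the actual database.
rows = [
    (1, "Pearson PLC", None),
    (2, "Pearson Education, Inc.", 1),
    (3, "Penguin Books", 1),
    (4, "Prentice-Hall, Inc.", 2),
    (5, "Puffin Books", 3),
]
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?)", rows)


def descendants(conn, root_name):
    """Return (preferred_name, depth) for every entity below root_name."""
    sql = """
        WITH RECURSIVE tree(id, preferred_name, depth) AS (
            SELECT id, preferred_name, 0
            FROM publisher
            WHERE preferred_name = ?
            UNION ALL
            SELECT p.id, p.preferred_name, tree.depth + 1
            FROM publisher AS p
            JOIN tree ON p.parent_id = tree.id
        )
        SELECT preferred_name, depth
        FROM tree
        WHERE depth > 0
        ORDER BY depth, preferred_name
    """
    return conn.execute(sql, (root_name,)).fetchall()


pearson_family = descendants(conn, "Pearson PLC")
```

A merger or acquisition then becomes an update to a single parent_id value, while former names and variant forms can live in separate tables keyed on the same id.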
OCLC WorldMap™: Objectives. Geographically represent library data from UNESCO, ARL, and NCES: Number of libraries; Amount of library expenditures; Number of volumes and titles; Number of librarians; Number of users.
OCLC WorldMap™: Objectives. Research prototype: Test geographical representation of WorldCat; Titles and holdings by country of publication. Support data mining research area: Visually display mined data to ease review and analysis. Internal use: Sales and marketing. External use: Library collection assessment and comparison. Complement the AAU/ARL Global Resources Network project; Project of the Council on Library and Information Resources (CLIR).
OCLC WorldMap™: Technology. First implemented in SVG: Open standard maintained by W3C; Simple XML file; Young technology; Browser support limited; Requires plug-in. Converted to Flash: Browser compatibility; Plug-in compatibility (if a plug-in was installed!). For a detailed comparison of SVG and Flash, see:
OCLC WorldMap™
Potential Future Projects. Audience Level: Integrate into WorldCat.org and OPACs to limit searches and retrieved sources. Publisher Name Server: Integrate into OCLC Collection Analysis Service for publisher business intelligence. WorldMap: Subject information (aboutness); Language of item (Content language, Metadata language); Holdings by country of library.
Presentation will be available at Prototypes available at Project Web Site:
Questions and Discussion Contact Information: