ASIS&T 2008 Annual Meeting Columbus, OH 28 October, 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Timothy J. Dickey, Ph.D. Post-Doctoral.

Presentation transcript:

ASIS&T 2008 Annual Meeting Columbus, OH 28 October, 2008 Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Timothy J. Dickey, Ph.D. Post-Doctoral Researcher OCLC Beyond Data Mining: Delivering the Next Generation of Services from Library Data

WorldCat as an Aggregate Collection
Data Mining and Analysis of WorldCat: "…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making."
Lavoie, B. F., Connaway, L. S., & O'Neill, E. T. (2007). Mapping WorldCat's digital landscape. Library Resources & Technical Services, 51, at 107.

WorldCat: July 2008
  Total holdings: 1,292,763,300
  Manifestations (records): 108,828,533
  Works: 84,096,107
  Digital items: 3,182,550
  Institutions: 69,000
  Physical items: ~1.2 billion

Global Origins of WorldCat Materials
  US: 28%
  UK: 8%
  Canada: 3%
  Rest of World: 27%
  Unknown: 17%
  France: 4%
  Germany: 10%

Global Origins of WorldCat Materials
Content Languages: % of WC non-English
  Top 5 non-English:
    German: 12 million
    French: 6.1 million
    Spanish: 3.5 million
    Dutch: 2.6 million
    Japanese: 2.4 million
Materials w/non-US origins: 57.9 million (55%)
  Top 5:
    Germany: 10.0 million
    UK: 8.8 million
    France: 4.2 million
    Netherlands: 2.9 million
    Canada: 2.9 million
Non-English Metadata Language: 28 million (66 languages)
  Top 5:
    German: 11 million
    Dutch: 5.0 million
    Swedish: 1.9 million
    French: 1.8 million
    Finnish: 0.7 million

WorldCat as a Decision-Making Resource
Collection management
  Cooperative collection development
  Comparative collection analysis
  Collection assessment
  Mass digitization
  Off-site storage
  Preservation

WorldCat as a Decision-Making Resource
Services
  Virtual reference
  Recommender services
  Social networking
Systems
  Precision

WorldCat as a Decision-Making Resource
Three Areas of Data Mining Research:
  OCLC WorldMap
  Audience Level
  Publisher Name Server

OCLC WorldMap

OCLC WorldMap™: Objectives
Geographically represent WorldCat data
  Titles published in each country
  Holdings for titles published in each country
  Languages represented for titles published in each country

OCLC WorldMap™: Objectives
Geographically represent data from UNESCO, ARL, and NCES for each country
  Number of libraries
  Library volumes
  Certified/degreed librarians
  Registered library users
  Library expenditures
  Cultural heritage institutions (museums and archives)
  Publishers

OCLC WorldMap™: Objectives
Research prototype
  Support OCLC data mining research
  Visually display data for review and analysis
Internal use
  Sales and marketing
External use
  Library collection assessment and comparison
Data may be processed AT A GLANCE
Complement the AAU/ARL Global Resources Network project
  Project of the Council on Library and Information Resources (CLIR)

OCLC Audience Level

Audience Level: Rationale and Objectives
Holdings represent selection decisions by librarians
  … implies there are more than 1 billion individual selection decisions in the WorldCat holdings file
Selections serve the interests of a library's target community
  … Associate community (audience level) to library profiles, e.g., ARL, non-ARL academic, public, K-12 school …
Thus we can infer a material's audience level from holdings patterns, which in turn can support:
  Collection management
  Readers' advisory services
  Reference services
  Information retrieval

Example Computation: Build Community

Library symbol  Library name                     Library type  Weight
OHI             State Library of Ohio            Other         x
OCO             Columbus Metropolitan Library    Public        0.33
CDC             Cedarville University            Academic      0.67
LIM             Lima Public Library              Public        0.33
OUN             Ohio University                  Research      1.00
OSD             SEO Automation Consortium        Other         x
BGU             Bowling Green State University   Academic      0.67
MIA             Miami University                 Academic      0.67
AKR             University of Akron              Academic      0.67
BGF             Firelands College                Academic      0.67
CIN             University of Cincinnati         Research      1.00
TOL             University of Toledo             Academic      0.67
KSU             Kent State University            Research      1.00
HIR             Hiram College                    Academic      0.67
YNG             Youngstown State University      Academic      0.67
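The weights in this table can be turned into a single score for a record. The slides do not give the formula, so the following is only a minimal sketch in Python, assuming the audience level is the plain mean of the weights of the holding libraries, that libraries marked "x" are excluded, and that these fifteen libraries all hold the same record. The function name is hypothetical.

```python
# Weights by library type, taken from the example table. "Other" libraries
# carry an "x" in the table and are treated here as unusable (assumption).
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}

def manifestation_audience_level(holding_types):
    """Return (level, usable_count) for the library types holding one record.

    Assumes the level is the arithmetic mean of the usable weights.
    Returns (None, 0) when no holding library has a usable weight.
    """
    usable = [TYPE_WEIGHTS[t] for t in holding_types if t in TYPE_WEIGHTS]
    if not usable:
        return None, 0
    return sum(usable) / len(usable), len(usable)

# The fifteen Ohio libraries from the table, counted by type.
ohio_types = ["Other"] * 2 + ["Public"] * 2 + ["Academic"] * 8 + ["Research"] * 3
level, usable_count = manifestation_audience_level(ohio_types)
print(round(level, 2), usable_count)  # 0.69 13
```

Under these assumptions thirteen of the fifteen holdings are usable, which matches the distinction between total and usable holdings made on the next slide.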

FRBRizing Audience Level Results
Calculate Audience Level for each Manifestation
Aggregate weighted holdings for Work
Table columns: OCLC Number, Total Holdings, Usable Holdings, Manifestation Audience Level
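The aggregation step can be sketched as a holdings-weighted mean over the manifestations of one work. This is an assumption about what "aggregate weighted holdings" means, not the documented OCLC method; the function name and the sample values are hypothetical.

```python
def work_audience_level(manifestations):
    """Combine manifestation levels into one work-level score.

    `manifestations` is a list of (audience_level, usable_holdings) pairs.
    Assumes each manifestation counts in proportion to its usable holdings.
    Returns None when there are no usable holdings at all.
    """
    total = sum(holdings for _, holdings in manifestations)
    if total == 0:
        return None
    return sum(level * holdings for level, holdings in manifestations) / total

# Hypothetical values, not taken from the slides: two editions of one work.
editions = [(0.40, 100), (0.70, 300)]
print(work_audience_level(editions))
```

The widely held edition dominates the result, which is the intended effect of weighting by holdings rather than averaging the editions equally.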

Evaluating the OCLC Audience Level
Random sample of 30 Zoology books, all audience levels
Human subjects
  Ranked books in increasing order of difficulty
Strong statistical correlation between the human subjects' ranking and the programmatic ranking
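The slides do not say which statistic was used. One common way to compare two rankings is Spearman's rank correlation; the sketch below assumes that choice and uses made-up rankings for six books, not the study's data.

```python
def ranks(values):
    """Rank values from 1 to n in increasing order; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical data: human difficulty ranking vs. computed audience level.
human_rank = [1, 2, 3, 4, 5, 6]
computed_level = [0.21, 0.35, 0.30, 0.58, 0.72, 0.90]
print(round(spearman_rho(human_rank, computed_level), 3))  # 0.943
```

A value near 1 means the two rankings order the books almost identically; a value near 0 means no monotonic relationship.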


OCLC Publisher Name Server

Publisher Name Server: Research Objectives
Resolve, for data mining and quality of WorldCat:
  ISBN prefixes to publisher name
  Variant publisher names to a preferred form
Complement Collection Analysis Service
  Librarians
  Publishers
Capture and profile attributes of individual publishers
  Location(s)
  Language(s) of materials published
  Genre(s)/format(s)
  Dominant subject domain(s)
  Parent company and subsidiaries

Publisher Name Server: Methodology
Programmatically cluster publishers' records using ISBN prefixes
Data clustering (The Free Dictionary)
  "The science of extracting useful information from large data sets or databases"
  Classification of similar objects into different groups
  Partitioning of a data set into subsets (clusters)
  Data in each subset (ideally) share some common trait
Hand parse the entities and resolve ISBN prefixes
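The programmatic clustering step might look roughly like the following sketch, which groups publisher name strings by ISBN publisher prefix and proposes the most frequent string in each cluster for hand review. It assumes the prefix has already been extracted from each ISBN; the function names and sample records are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_by_prefix(records):
    """Group publisher name strings by ISBN publisher prefix.

    `records` is an iterable of (isbn_prefix, publisher_string) pairs.
    Returns {prefix: Counter mapping each name string to its frequency}.
    """
    clusters = defaultdict(Counter)
    for prefix, name in records:
        clusters[prefix][name.strip()] += 1
    return clusters

def candidate_preferred_form(name_counts):
    """Suggest the most frequent string as a candidate for manual review."""
    return name_counts.most_common(1)[0][0]

# Illustrative records only; not drawn from WorldCat.
sample = [
    ("0-19", "Oxford University Press"),
    ("0-19", "Oxford Univ. Press"),
    ("0-19", "Oxford University Press"),
    ("0-13", "Prentice-Hall"),
    ("0-13", "Prentice Hall"),
    ("0-13", "Prentice-Hall"),
]
clusters = cluster_by_prefix(sample)
for prefix, names in sorted(clusters.items()):
    print(prefix, candidate_preferred_form(names), len(names))
```

The output of such a pass is only a starting point: the slide's final step, hand parsing the entities, is what decides the preferred form.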

Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
  Top 10 lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand)
  Top 10 university presses
  Mergers and acquisitions, last 8 years

Publisher Name Server: Data Captured
Database Fields:
  Publisher Name, Preferred Form
  Source of Preferred Form
  Former Names
  Variant Forms
  ISBN Prefixes
  HQ City
  HQ Country
  Other Cities
  URL
  Languages
  Formats
  Conspectus Subjects
Data Sources:
  U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field
  Books In Print Online (R.R. Bowker)
  The International ISBN Registry (K.G. Saur)
  Publishers Weekly Online
  Hoover's Handbook Online
  Standard and Poor's Corporate Descriptions
  The Directory of Corporate Affiliations (DIALOG)
  Company websites
  DATA MINING
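The database fields listed on this slide map naturally onto a simple record type. The sketch below is one possible in-memory shape, not the actual schema; all attribute names and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PublisherEntity:
    """One publishing entity; attribute names mirror the slide's field list."""
    preferred_name: str
    preferred_name_source: str = ""
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: str = ""
    hq_country: str = ""
    other_cities: List[str] = field(default_factory=list)
    url: str = ""
    languages: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)
    conspectus_subjects: List[str] = field(default_factory=list)

    def all_names(self) -> List[str]:
        """Every string that should resolve to this entity."""
        return [self.preferred_name] + self.former_names + self.variant_forms

# Illustrative values only; not a row from the actual database.
oup = PublisherEntity(
    preferred_name="Oxford University Press",
    variant_forms=["Oxford Univ. Press", "OUP"],
    isbn_prefixes=["0-19"],
    hq_city="Oxford",
    hq_country="United Kingdom",
    languages=["English"],
)
print(oup.all_names())
```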

Publisher Name Server: Database
More than 56,000 separate strings mapped to 1750 entities
8.5 million OCLC records
  22% of these are Library of Congress records
~490 million holdings
Hierarchical relationships maintained

Entity-Parsing in a World of Mergers and Acquisitions
Entities shown in the diagram:
  Prentice-Hall, Inc.
  Pearson Education, Inc.
  Addison-Wesley Publishing Company
  Allyn and Bacon
  Dominie Press
  Benjamin/Cummings Publishing Company
  Scott, Foresman and Company
  HarperCollins Educational Publishers
  Longmans, Green, and Co.
  Pearson PLC
  Pearson Canada
  Pearson Technology Group
  Copp Clark
  Adobe Press
  Cisco Press
  Penguin Books
  Allen Lane
  Ladybird Books
  Riverhead Books
  Puffin Books
  Putnam Books
  Berkley Publishing Group
  Avery
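Maintaining parent and subsidiary links makes it possible to roll record counts up to the top-level entity. A minimal sketch, assuming a simple parent-to-children mapping: the links shown are a small assumed subset for illustration, and the counts are made up.

```python
def aggregate_records(entity, children, own_counts):
    """Sum record counts for an entity and everything beneath it."""
    total = own_counts.get(entity, 0)
    for child in children.get(entity, []):
        total += aggregate_records(child, children, own_counts)
    return total

# Parent -> children links, assumed for illustration only.
children = {
    "Pearson PLC": ["Pearson Education, Inc.", "Penguin Books"],
    "Penguin Books": ["Puffin Books"],
}

# Made-up record counts per entity, not WorldCat figures.
own_counts = {
    "Pearson PLC": 10,
    "Pearson Education, Inc.": 200,
    "Penguin Books": 150,
    "Puffin Books": 40,
}

print(aggregate_records("Pearson PLC", children, own_counts))    # 400
print(aggregate_records("Penguin Books", children, own_counts))  # 190
```

The same traversal works at any level of the hierarchy, so a profile can be produced for a subsidiary alone or for the whole corporate family.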

Publisher Profiles
Oxford University Press
  119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
Pearson PLC
  Includes 14 subsidiaries and acquisitions
  Aggregate: 291,433 records (0.27% of WorldCat)
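The percentages on this slide can be checked against the record count reported for WorldCat in July 2008. The sketch assumes that the denominator is the 108,828,533 manifestations (records) given earlier in the talk; the function name is hypothetical.

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), July 2008

def share_of_worldcat(records):
    """Return the given record count as a percentage of all WorldCat records."""
    return 100 * records / WORLDCAT_RECORDS

print(f"{share_of_worldcat(210_095):.2f}%")  # Oxford University Press: 0.19%
print(f"{share_of_worldcat(291_433):.2f}%")  # Pearson PLC aggregate: 0.27%
```

Both results agree with the figures on the slide, which supports the assumption about the denominator.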

Publisher Profiles – Top Languages

Oxford Univ. Press:
  English          96.74%
  Latin             0.51%
  German            0.39%
  Chinese           0.39%
  French            0.37%
  Spanish           0.28%
  Afrikaans         0.14%
  Middle English    0.13%
  Malay             0.09%
  Swahili           0.09%

Pearson PLC:
  English          95.27%
  Spanish           1.43%
  German            1.33%
  French            0.60%
  Dutch             0.55%
  Latin             0.26%
  Malay             0.06%
  Ancient Greek     0.05%
  Portuguese        0.05%
  Italian           0.04%

Publisher Profiles – Conspectus Divisions

Oxford Univ. Press:
  Language/Literature    27.12%
  History                11.92%
  Music                   9.78%
  Philosophy/Religion     9.55%
  Business/Economics      6.15%
  Medicine                4.36%
  Law                     3.85%
  Sociology               3.75%
  Political Science       3.58%
  Biology                 2.60%

Pearson PLC:
  Language/Literature    18.67%
  Business/Economics     13.30%
  Computer Science        9.42%
  Engineering             8.04%
  History                 7.59%
  Mathematics             6.04%
  Education               5.64%
  Sociology               4.18%
  Philosophy/Religion     3.81%
  Physical Sciences       2.75%

Publisher Profiles – Conspectus Categories

Oxford Univ. Press:
  English literature      10.66%
  English language         5.86%
  Instrumental music       3.48%
  Vocal music              3.09%
  Literature on music      2.26%
  History – Britain        1.82%
  Economic history         1.38%
  American lit.            1.35%
  History – S. Asia        1.30%
  General history          1.29%

Pearson PLC:
  English language         7.74%
  Business admin.          4.62%
  English literature       3.63%
  Economics                2.94%
  Comp. programming        2.39%
  Electrical engineering   2.24%
  Early childhood ed.      2.05%
  Computer software        1.88%
  U.S. federal law         1.80%
  Computer Science         1.54%

Publisher Profiles – Conspectus Subjects

Oxford Univ. Press:
  English – modern            5.57%
  English lit – prose         2.51%
  English lit – 19th c.       2.23%
  Juvenile lit.               1.06%
  English lit – poetry        1.03%
  English lit – collections   0.80%
  Biographies                 0.76%
  English lit –                   %
  Shakespeare                 0.68%
  Sacred choruses             0.66%

Pearson PLC:
  English – modern            7.68%
  Management                  2.53%
  Programming                 1.74%
  Arithmetic                  1.09%
  Economic theory             1.06%
  Marketing                   1.06%
  General algebra             1.04%
  Accounting                  0.97%
  Juvenile lit.               0.93%
  English lit – 19th c.       0.89%

Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
  Add $4 for publisher name
  Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
  Add $2 FAST where place of publication matches FAST geographical subject headings
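As a rough illustration of the projected 710 coding, the sketch below assembles the subfields for a publisher added entry as plain (code, value) pairs. It assumes "pbl" (the MARC relator code for publisher) as the $4 value and "naf" as the $2 source code; indicators are omitted, and the function names and display format are hypothetical rather than any library's API.

```python
def publisher_added_entry(preferred_name, in_name_authority_file):
    """Build subfields for a 710 added entry as (code, value) pairs.

    $4 carries the relator code "pbl" (publisher). $2 is added only when the
    preferred form matches an existing authority record; "naf" is the
    assumed source code for the name authority file.
    """
    subfields = [("a", preferred_name), ("4", "pbl")]
    if in_name_authority_file:
        subfields.append(("2", "naf"))
    return subfields

def render(tag, subfields):
    """Format a field for display only; this is not a MARC serialization."""
    return tag + " " + " ".join(f"${code} {value}" for code, value in subfields)

print(render("710", publisher_added_entry("Oxford University Press", True)))
print(render("710", publisher_added_entry("Dominie Press", False)))
```

Whether a given preferred form actually matches an authority record is decided elsewhere; the boolean argument stands in for that lookup.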

Future Research
Further data mining
  Profile aspects of publication output
  Deeper scaling into WorldCat (beyond ISBN)
Plan for long-term maintenance
  ISBN-13 compliance
  File expansion of ongoing mergers/acquisition activities
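ISBN-13 compliance largely comes down to converting stored ISBN-10 values. The standard conversion prefixes "978" to the first nine digits and recomputes the check digit with alternating weights of 1 and 3. A minimal sketch, with a hypothetical function name:

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN")
    core = "978" + digits[:9]  # the old ISBN-10 check character is dropped
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over 12 digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
```

Because the publisher prefix sits inside the first nine digits, it survives the conversion unchanged, so prefix-to-publisher mappings only need the added 978 element.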

Thank You! Questions and Discussion Lynn Silipigni Timothy J.