Published by Willis Foster. Modified over 6 years ago.
1
Bandits and Browsing: Data Mining and Network Analysis for Library Collections
Harriett Green, English and Digital Humanities Librarian, University Library
Kirk Hess, Digital Humanities Specialist, University Library
Richard Hislop, Ph.D. candidate, Department of Economics, UIUC
ERRT, April 25, 2012
2
The Problem
How can users effectively find materials in today’s library collections and digital libraries?
Transformation in the acquisition, access, and storage of library collections with digital materials, off-site storage, etc.
Availability of immense amounts of data
IR literature: user searching patterns
3
Project
GOAL: To develop information retrieval and analytical tools that could be incorporated into a possible recommender system
Metadata analysis to help users navigate and retrieve items from the collection
Code libraries will allow interdisciplinary study and research about the library itself.
Network analysis can reveal essential information about the collection's structure.
4
Project Structure
TEAM: Harriett Green (PI), Kirk Hess, Richard Hislop
SUPPORT: I-CHASS Scalable Research Challenge (Michael Simeone, co-PI)
TOOLS: Start-Up Allocation of 30,000 SUs awarded by XSEDE on the SGI Altix UV Blacklight cluster at the Pittsburgh Supercomputing Center, with XSEDE consultation support
5
Questions
What other collection items are like item X?
How do we show people these related items?
What is the topic area that people want?
How do we show people an estimated result of what they want?
How do we create visualizations and recommendations of items in the collection?
6
The Beginning: Sample Data Set
Initially ran analyses on a 40,000-item English collection
Quantified inefficiencies in subject headings
Developed prototypes of analyses to run on the full UIUC Library catalog data
7
XSEDE Analysis
Run analyses on entire UIUC Library catalog data
Conduct network analyses on entire UIUC Library catalog data for subject correlations
Extend betweenness calculation to use weighting based on items checked out together
Find clusters that need to be connected via extra subject headings
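The last step above, finding clusters that are not yet connected, could be sketched as a connected-components search. This is a minimal illustration, assuming headings are nodes and two headings are linked whenever they appear on the same bibliographic record; `build_heading_graph`, `find_clusters`, and the toy records are hypothetical and are not the project's code.

```python
from collections import defaultdict
from itertools import combinations


def build_heading_graph(records):
    """Link two headings whenever they share a record (assumed linking rule)."""
    graph = defaultdict(set)
    for headings in records:
        for heading in headings:
            graph[heading]  # make sure headings with no partner still appear
        for a, b in combinations(sorted(headings), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def find_clusters(graph):
    """Return connected components, largest first, via depth-first search."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


# Hypothetical toy records: each set holds the headings on one record.
toy_records = [
    {"Heading A", "Heading B"},
    {"Heading B", "Heading C"},
    {"Heading D", "Heading E"},
    {"Heading F"},
]
clusters = find_clusters(build_heading_graph(toy_records))
for cluster in clusters:
    print(sorted(cluster))
```

Components that stay separate in such a graph are candidates for the extra subject headings mentioned on the slide.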
8
Analysis of subject headings
Simple subject analysis can uncover lesser known correlations
Extracts from a correlation table:
Heading                                                      Correlation
Renaissance England                                          1
Death – Social Aspects – England – History – 16th Century    0.301
Literature and Science – England – History – 17th Century
English Literature – Psychological Aspects                   0.211
Protestantism and Literature – History – 17th Century        0.172
Desire in Literature                                         0.102
Some of the connections are relatively obvious. For instance, Renaissance England links to ENGLISH LITERATURE EARLY MODERN CRITICISM TEXTUAL or THEATER ENGLAND HISTORY 16TH CENTURY. However, some of the others would require a lot more guesswork, even though they might be highly relevant. The correlation tables provide a good starting place for expanding patrons' searches, but they are limited because they can only look at directly linked headings.
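One way a correlation table like this could be produced is sketched below. The slides do not say which correlation measure was used; this sketch assumes a Pearson (phi) correlation between the presence/absence indicators of two headings across records. The function names and the toy records are hypothetical.

```python
from math import sqrt


def heading_correlation(records, a, b):
    """Phi (Pearson) correlation of two headings' presence across records."""
    n = len(records)
    n_a = sum(1 for r in records if a in r)
    n_b = sum(1 for r in records if b in r)
    n_ab = sum(1 for r in records if a in r and b in r)
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    if denom == 0:
        return 0.0  # a heading on every record (or none) has no variance
    return (n * n_ab - n_a * n_b) / denom


def related_headings(records, target):
    """Rank every other heading by its correlation with the target."""
    others = set().union(*records) - {target}
    scored = [(h, heading_correlation(records, target, h)) for h in others]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy records: each set holds the headings on one record.
toy_records = [
    {"Renaissance England", "Desire in Literature"},
    {"Renaissance England", "Death"},
    {"Renaissance England", "Death"},
    {"Psychological fiction"},
    {"Psychological fiction", "Desire in Literature"},
    {"Narration (Rhetoric)"},
]
for heading, score in related_headings(toy_records, "Renaissance England"):
    print(f"{heading}: {score:.3f}")
```

Under this measure a heading correlates perfectly with itself, which is consistent with the value 1 in the first row of the table.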
9
Metadata analysis
Help users and library staff identify and connect search terms to subject headings and metadata in the catalog
Our initial approach: Use correlation of subject headings in bibliographic records.
Quantifying Efficiency – ECS and ACS.
Result in a recommender system: analysis that will provide lists of related topics.
10
Approach: Finding the right questions
Niche topics are important
Some headings are bridges between subjects
Metadata as a network analysis problem
Heading (ordered by betweenness)                             Degree
Women in literature.                                         1164
Sex role in literature.                                      1174
Narration (Rhetoric)                                         761
English literature--Early modern, History and criticism.     1153
Psychological fiction.                                       342
American literature--20th century--History and criticism.    863
English literature--19th century--History and criticism.     1022
American fiction--20th century--History and criticism.       928
This slide is where I'd mention Google and Amazon. The niche topics are where they would not do so well, as they have no particular incentive to figure out the quality of under-read pages.
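To illustrate why a "bridge" heading can rank high on betweenness without having the highest degree, here is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph. The graph and node names are hypothetical toy data, and this is not the project's implementation (which the slides say was to be extended with checkout-based weights).

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected adjacency dict."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        dist = dict.fromkeys(graph, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy graph: two tight subject clusters joined by one heading.
toy_graph = {
    "a1": {"a2", "bridge"},
    "a2": {"a1", "bridge"},
    "b1": {"b2", "bridge"},
    "b2": {"b1", "bridge"},
    "bridge": {"a1", "a2", "b1", "b2"},
}
scores = betweenness(toy_graph)
degrees = {v: len(neighbors) for v, neighbors in toy_graph.items()}
for v in sorted(scores, key=scores.get, reverse=True):
    print(v, scores[v], degrees[v])
```

In the toy graph every shortest path between the two clusters runs through the bridge node, so it carries all of the betweenness.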
11
Analyzing Circulation Data
Collection use provides information about how to further improve the catalog
Can identify not only the most important known links, but also find connections that need to be added
Database is represented as a network, with traffic between items that are checked out together
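The network representation described here could be built along the following lines. This is a sketch under stated assumptions: a "transaction" is taken to be the list of item IDs one patron checked out together, and the edge weight is simply the number of transactions in which two items co-occur. The function name and the sample IDs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_checkout_weights(transactions):
    """Count how often each pair of items appears in the same transaction."""
    weights = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(items)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy transactions: item IDs checked out together.
toy_transactions = [
    ["book1", "book2", "book3"],
    ["book2", "book1"],
    ["book3", "book4"],
]
weights = co_checkout_weights(toy_transactions)
for pair, count in weights.most_common():
    print(pair, count)
```

Heavily weighted pairs correspond to the known, well-travelled links; pairs with traffic but no shared subject heading would be candidates for connections that need to be added.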
12
Analyzing user transactions
13
Other collection analyses
Collection development can be analyzed across time in acquisition of authors and titles
Changes in library policy
Effect of converting collection from Dewey to LOC?
Effect of book location on check out frequency? (General stacks vs. departmental library vs. high-density storage)
14
Approaches to Collection Analysis
I like this picture (only partly because of how much time it took me to make – the blue shading is a confidence interval, btw). I'm also hoping to use it to set up the idea that people other than librarians can make use of the catalog's information, particularly once we've done some of the 'heavy lifting'. With any luck, our applications will be useful, but there is a ton of stuff that I wouldn't think to do.
15
Challenges for library Recommender System
Google/Amazon/Netflix vs. Voyager and VuFind: different approaches to users
Keyword searching: word frequency, Solr sorting by proximity and frequency
Recommender systems: build user profiles, clustering of users and of documents
Easy Search: tracking by simple click-throughs
16
Future Steps
Analyze other data sets from other libraries’ catalogs
Create a suite of tools that libraries can use to calculate and improve economic efficiency
Code libraries that can be shared and used across library systems: reduce the need to re-solve problems (UTF-8); code uses CSV files for easy integration
Visualize network diagrams of the data for assessments of collections
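As a rough illustration of the CSV and UTF-8 point, the sketch below reads heading data from CSV bytes with the encoding stated explicitly. The column names `record_id` and `heading`, the function name, and the sample rows are assumptions made for this example; the project's actual file layout is not described in the slides.

```python
import csv
import io
from collections import defaultdict


def load_headings(binary_stream):
    """Read (record_id, heading) rows from UTF-8 CSV bytes into sets."""
    # Decode explicitly as UTF-8; newline="" lets the csv module handle
    # line endings itself, as the csv documentation recommends.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    records = defaultdict(set)
    for row in csv.DictReader(text):
        records[row["record_id"]].add(row["heading"])
    return dict(records)


# Hypothetical sample data, including a non-ASCII heading.
sample = (
    "record_id,heading\n"
    "1,Brontë family\n"
    "1,Women in literature.\n"
    "2,Women in literature.\n"
).encode("utf-8")
records = load_headings(io.BytesIO(sample))
print(records)
```

For a file on disk, the same reader could be fed from `open(path, "rb")`, or the text wrapper could be replaced by `open(path, encoding="utf-8", newline="")`.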
17
QUESTIONS?
Thank you!
Harriett Green, Kirk Hess, Richard Hislop