Bibliomining: An Introduction


Presentation on theme: "Bibliomining: An Introduction"— Presentation transcript:

1 Bibliomining: An Introduction

2 Outline
Introduction; Bibliomining Process; Example Applications; Placing Bibliomining in Context; A Research Agenda to Advance Bibliomining

3 Origins and Definition of Bibliomining
Bibliomining = "bibliometrics" + "data mining".
Bibliometrics focuses on the creation of works; data mining (Web usage mining) focuses on the access of works.
Bibliomining is the application of data mining and bibliometric tools to data produced from library services.
Goal: gain a better understanding of library user communities. Frequencies and aggregate measures hide underlying patterns.
Definition: the combination of data mining, bibliometrics, statistics, and reporting tools used to extract patterns of behavior-based artifacts from library systems for aiding decision-making or justifying services.

4 Bibliometrics
Traditional bibliometrics is based on the quantitative exploration of document-based scholarly communication.
Data for bibliometrics: works (authors, collections) and connections (citations, authorship, common terms, and other aspects of the creation and publication process).
These data allow researchers to understand the context in which a work was created, the long-term citation impact of the work, and the differences between fields in regard to their scholastic output patterns.

5 Data for Bibliometrics

6 Bibliometrics (Cont.)
Approaches: frequency-based, visualization, data mining.
Frequency-based measures include frequency of authorship in a subject, commonality of words used, and discovery of a core set of frequently cited works.
Integrating the citations between works allows for very rich exploration of relations between scholars and topics.
Linkages between works are used to aid in automated information retrieval and in visualization of scholarship and of the social networks between those involved with the creation process.
Many newer bibliometric applications involve Web-based resources and hyperlinks that enhance or replace traditional citation information.

7 Social Network

8 User-based Data Mining
One popular area is the examination of how users explore Web spaces (Web usage mining).
The focus is on accesses of different Web pages by a particular user (or IP address).
Patterns of use are discovered through data mining and used to personalize the information presented to the user or to improve the information service.
In user-based data mining, the links between works come from a commonality of use. For example, if one user accesses two works during the same session, then when another user views one of those works, the other might also be of interest.
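The commonality-of-use idea above can be sketched in a few lines. This is only an illustration, not a method taken from the slides: the session data, the work identifiers, and the function names `build_co_access` and `suggest` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_co_access(sessions):
    """Count how often each pair of works is accessed in the same session."""
    co_counts = defaultdict(lambda: defaultdict(int))
    for works in sessions:
        # Each unordered pair of distinct works in a session is one link.
        for a, b in combinations(sorted(set(works)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def suggest(co_counts, work, top_n=3):
    """Return the works most often co-accessed with `work`, most frequent first."""
    neighbours = co_counts.get(work, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical sessions: each inner list is the set of works one user opened.
sessions = [
    ["work_A", "work_B"],
    ["work_A", "work_B", "work_C"],
    ["work_C", "work_D"],
]
co_counts = build_co_access(sessions)
print(suggest(co_counts, "work_A"))
```

Note that the sessions carry no user identifier at all, so the links between works can be built without keeping a record of who accessed them.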

9 Data for User-Based Data Mining
Links between works that result from the users

10 Data for Anonymized Community-Based Web Usage Mining
Demographic Surrogate

11 Bibliomining Process

12 Overview
Determining areas of focus
Identifying internal and external data sources
Collecting, cleaning, and anonymizing the data into a data warehouse
Selecting appropriate analysis tools
Discovery of patterns through data mining and creation of reports with traditional analytical tools
Analyzing and implementing the results

13 Determining Areas of Focus
The focus might come from a specific problem in the library or may be a general area requiring exploration and decision-making.
Directed data mining: problem-focused. Ex. Budget cuts have reduced the staff time for contacting patrons about delinquent materials. Is there a way to predict the chance patrons will return material once it is one week late in order to prioritize our calling lists?
Undirected data mining: considers a general topical area. Ex. How are different departments and types of patrons using the electronic journals? This may produce an overwhelming number of patterns to explore for validation, so it should be considered only when a strong data warehouse is in place.

14 Identifying Data Sources
The bibliomining process requires transactional, non-aggregated, low-level data. Privacy issue?
Internal data sources are those already within the library system: patron database, transactional data, Web server logs.
External data sources: demographic information related to a specific ID number that is located in the computer center or personnel management system; demographic information for zip codes from census data.

15 Data for Bibliomining

16 Conceptual Framework for Data Types in the Bibliomining Data Mining

17 A Framework for the Data
Data about a work
Three kinds of fields: fields that were extracted from the work (like title or author); fields that are created about the work (like subject heading); fields that indicate the format and location of the work (like URL or collection).
These fields come from a MARC record, Dublin Core information, or CMS.
They can also connect into bibliometric information, such as citations or links to other works.
They may require extraction from the original source (in the case of digital reference) or linking into a citation database.
Challenge: no article-level usage reports.

18 A Framework for the Data (Cont.)
Data about the user
Demographic surrogate.
Other fields that come from inferences about the user: zip code, location/department/lab (inference from IP address).

19 A Framework for the Data (Cont.)
Data about the service
Services include searching, circulation, reference, interlibrary loan, and other library services.
Fields common to most services include time and date, library personnel involved, location, method, and whether the service was used in conjunction with other services.
Each library service also has its own set of appropriate fields. Searching: the content of the search and the next steps taken. Interlibrary loan: cost, vendor, and time of fulfillment. Circulation: acquisition process of the work and circulation length.

20 Creating the Data Warehouse
A data warehouse is a database (DB) that is separate from the operational systems and contains a cleaned and anonymized version of the operational data, reformatted for analysis.
The warehousing process uses queries to extract the data from the identified sources, combines those data using common fields, cleans the data, and writes the resulting records into either a flat file or a relational database designed specifically for analysis.
The process can be automated to pull data from the operational systems into the data warehouse on a regular basis.
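The extract, combine, clean, and write steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the table name `circ_fact`, the field names, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the analysis database.

```python
import sqlite3


def build_warehouse(circulation_rows, patron_rows):
    """Join circulation records to patron demographics, then drop the patron ID."""
    # Combine step: index patrons by the common field shared with circulation.
    patrons = {p["patron_id"]: p for p in patron_rows}
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE circ_fact (call_number TEXT, checkout_date TEXT, "
        "department TEXT, patron_type TEXT)"
    )
    for row in circulation_rows:
        patron = patrons.get(row["patron_id"])
        # Cleaning step: skip records that are incomplete or cannot be matched.
        if patron is None or not row.get("call_number"):
            continue
        # Anonymizing step: only the demographic fields are written, never the ID.
        db.execute(
            "INSERT INTO circ_fact VALUES (?, ?, ?, ?)",
            (row["call_number"].strip(), row["checkout_date"],
             patron["department"], patron["patron_type"]),
        )
    db.commit()
    return db


# Hypothetical operational data.
patron_rows = [
    {"patron_id": 1, "department": "Chemistry", "patron_type": "faculty"},
    {"patron_id": 2, "department": "History", "patron_type": "undergraduate"},
]
circulation_rows = [
    {"patron_id": 1, "call_number": " QD31 ", "checkout_date": "2005-03-01"},
    {"patron_id": 2, "call_number": "D16", "checkout_date": "2005-03-02"},
    {"patron_id": 9, "call_number": "Z699", "checkout_date": "2005-03-02"},
]
warehouse = build_warehouse(circulation_rows, patron_rows)
for record in warehouse.execute("SELECT * FROM circ_fact ORDER BY checkout_date"):
    print(record)
```

The third circulation row has no matching patron, so the cleaning step leaves it out of the warehouse.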

21 Creating the Data Warehouse – Protecting Patron Privacy
Going through the data warehousing process requires the library to examine its data sources.
By explicitly determining what to keep and what to destroy, libraries can save the demographic information needed to evaluate communities of users without keeping records of the individuals in those communities.
Two examples follow.

22 Cleaning Transactional Records

23 Cleaning Web Server Transactional Records
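One way to picture the cleaning of a Web server record is to replace the IP address with an inferred location and then discard the address. The sketch below is an assumption-laden illustration: the subnet-to-location table, the labels, and the log field names are hypothetical (the addresses come from ranges reserved for documentation), and it uses only Python's standard `ipaddress` module.

```python
import ipaddress

# Hypothetical mapping from campus subnets to a location label.
SUBNET_LOCATIONS = {
    ipaddress.ip_network("192.0.2.0/25"): "library_lab",
    ipaddress.ip_network("192.0.2.128/25"): "chemistry_dept",
}


def clean_log_entry(entry):
    """Return a copy of the log entry with the IP replaced by a location label."""
    ip = ipaddress.ip_address(entry["ip"])
    location = "off_campus"  # assumed default when no subnet matches
    for network, label in SUBNET_LOCATIONS.items():
        if ip in network:
            location = label
            break
    # The IP address itself is deliberately not copied into the cleaned record.
    return {
        "timestamp": entry["timestamp"],
        "url": entry["url"],
        "location": location,
    }


raw = {"ip": "192.0.2.200", "timestamp": "2005-03-01T10:15:00", "url": "/ejournals/chem"}
print(clean_log_entry(raw))
```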

24 Creating the Data Warehouse – Building the Data Warehouse
Building the data warehouse takes much more time than mining the data.
It is best to start with a narrowly defined bibliomining topic and work through the entire process.
This iterative process also has the advantage of allowing those developing the data warehouse to improve their collection and cleaning algorithms early in the life of the bibliomining project.

25 Selecting Appropriate Analysis Tools
Traditional Reporting; Management information system (MIS); Online Analytical Processing (OLAP); Visualization; Data Mining

26 Analysis Tools – Traditional Reporting
Library decision-makers examine aggregates and averages to understand their service use.
The advantage of the data warehouse is that new questions can be asked not only of the present situation but also of the past.
This allows those doing evaluation or measurement to ask new questions and then create a historical view of those reports in order to understand trends.
Libraries can more easily understand differences in behavior between demographic groups in the library.

27 Analysis Tools – Management information system (MIS)
An MIS provides a manager with the ability to ask basic questions of the data.
ILS packages have some type of basic MIS built in.
An MIS built on top of a data warehouse made for the library will be more powerful and provide information that the library needs to see.
Another addition to an MIS is a critical factor alert system. Example: if hourly circulation (the factor) is below or above a certain level, a manager could be immediately notified so staffing changes could be made.
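The critical factor alert in the example above amounts to a threshold check. A minimal sketch, assuming hypothetical limits and a hypothetical function name `check_critical_factor`; how a manager is actually notified is left out:

```python
def check_critical_factor(name, value, low, high):
    """Return an alert message if `value` is outside [low, high], else None."""
    if value < low:
        return f"ALERT: {name} is {value}, below the lower limit of {low}"
    if value > high:
        return f"ALERT: {name} is {value}, above the upper limit of {high}"
    return None


# Hypothetical limits for hourly circulation at one service desk.
for hourly_circulation in (12, 45, 130):
    message = check_critical_factor("hourly circulation", hourly_circulation, low=20, high=100)
    print(message or f"hourly circulation of {hourly_circulation} is within limits")
```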

28 Analysis Tools – Online Analytical Processing (OLAP)
OLAP provides an interactive view of the data.
Under the surface, the OLAP tool has run thousands of DB queries to combine all of the selected variables with all of the selected measures (aggregation types, timeframes…).
All of the fields are defined ahead of time, and the system runs many queries before anyone uses it.
Response to the manager using the OLAP front-end for reports is instant, which encourages exploration.
Penn Library Data Farm (

29 Analysis Tools – Online Analytical Processing (OLAP) (Cont.)
The user picks one of many variables from a list to examine. Example: use of e-journals under dimensions such as time and subject.
Start with a high-level view of the data in a tabular report (year and general classification).
Expand the report: clicking on a year expands that year into quarters, leaving the subject headings the same and recalculating the data.
The user can then click on another field to drill down into the data.
During exploration, the manager can capture any view of the data and turn it into a regular report.
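The year-to-quarter drill-down described above can be imitated with a small aggregation function. This is a simplified sketch: the usage records and the name `rollup` are hypothetical, and the totals are computed on demand, whereas a real OLAP tool would have precomputed them.

```python
from collections import Counter


def rollup(facts, dimensions):
    """Sum the 'uses' measure over the chosen dimensions."""
    totals = Counter()
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["uses"]
    return dict(totals)


# Hypothetical e-journal usage facts.
facts = [
    {"year": 2005, "quarter": "Q1", "subject": "Science", "uses": 120},
    {"year": 2005, "quarter": "Q2", "subject": "Science", "uses": 80},
    {"year": 2005, "quarter": "Q1", "subject": "History", "uses": 40},
    {"year": 2006, "quarter": "Q1", "subject": "Science", "uses": 150},
]

# High-level view: year by subject.
by_year = rollup(facts, ["year", "subject"])
# Drill down: expand 2005 into quarters, keeping the subject dimension.
drilled = rollup([f for f in facts if f["year"] == 2005], ["quarter", "subject"])
print(by_year)
print(drilled)
```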

30 Analysis Tools – Visualization
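Present the characteristics of data in a visual form.
We use two kinds of charts for visualization.
1. Pie (circle) chart: the advantage of a circle is that its radius can represent the number of articles in a cluster.
2. Radar chart: a radar chart can show the strength of the relationships between clusters, but it is only applicable when a cluster has more than three related clusters.
Next, we introduce how our system presents its visualizations.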

31 Analysis Tools – Data Mining
Data mining is the discovery of valid, novel, and actionable patterns in large amounts of data using statistical and artificial intelligence tools.
There are two main categories of data mining tasks.
Description: understand the data from the past and the present; discover patterns for affinity groups of variables common to different patrons, or clusters of demographic groups that exhibit certain characteristics (association rule mining, clustering).
Prediction: make a statement about the unknown based upon what is known. Classification places an item into a category; estimation produces a numeric value for an unknown variable.
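As an illustration of the descriptive side, association rule mining scores a candidate rule by its support and confidence. The sketch below computes the two standard measures for a single rule; the transactions (sets of services used together by anonymized patrons) and the name `rule_metrics` are assumptions made for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    if total == 0:
        return 0.0, 0.0
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = with_both / total
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical transactions: services used during one visit.
transactions = [
    {"e-journals", "interlibrary loan"},
    {"e-journals", "interlibrary loan", "reference"},
    {"e-journals"},
    {"reference"},
]
support, confidence = rule_metrics(transactions, {"e-journals"}, {"interlibrary loan"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here the rule holds in two of four transactions (support 0.50) and in two of the three transactions that include e-journals (confidence 0.67).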

32 Analysis Tools – Data Mining (Cont.)
Techniques: neural networks, regression, clustering, rule generation, and classification.
Process:
Take a cleaned data set.
Generate new variables from existing ones.
Split the data into model building sets and test sets.
Apply techniques to the model building sets to discover patterns.
Use the test sets to ensure the patterns are more generalizable.
Confirm these patterns with someone who knows the domain.
Related areas: Web usage mining, text mining (+ bibliometrics).
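The process steps above can be walked through with a deliberately simple model. Everything here is an assumption for illustration: the overdue records are synthetic, the derived variable `very_late` and the 7-day cutoff are invented, and a per-group majority vote stands in for the techniques named on the slide.

```python
import random


def add_derived_variable(record):
    """Generate a new variable from an existing one: more than 7 days late?"""
    return {**record, "very_late": record["days_late"] > 7}


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle reproducibly and split into model building and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_model(train):
    """For each value of very_late, keep the most common outcome seen in training."""
    counts = {}
    for record in train:
        tally = counts.setdefault(record["very_late"], [0, 0])
        tally[int(record["returned"])] += 1
    return {key: tally[1] >= tally[0] for key, tally in counts.items()}


def accuracy(model, test, default=True):
    """Share of test records whose outcome the model predicts correctly."""
    hits = sum(1 for r in test if model.get(r["very_late"], default) == r["returned"])
    return hits / len(test)


# Synthetic cleaned data set: items 1 to 20 days late, returned if at most 10 days late.
records = [add_derived_variable({"days_late": d, "returned": d <= 10}) for d in range(1, 21)]
train, test = split_records(records)
model = fit_majority_model(train)
print(f"held-out accuracy: {accuracy(model, test):.2f}")
```

The final step on the slide, confirming the pattern with someone who knows the domain, has no code equivalent and still has to happen.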

33 Analysis Tools – Category & Cluster Results
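*Our system currently holds 541 articles, mainly papers related to information retrieval from the CiteSeer database. This is the home page of our system. On the left are our category topics, which are the main topic of each cluster produced by document clustering. *On the right are the clustering results: the size of a circle represents the number of articles it contains, and the depth of its color represents the similarity within the cluster. The higher the similarity, the darker the color. Category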

34 Analysis Tools – Cluster Detail Information
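Related Topic Citation Relation Cluster Label
Next, we explain how to use this system to find the articles we need. Suppose we want to find articles related to SOM. *We can look for SOM-related topics among the listed categories. *The system then lists the articles related to SOM. *The abstract of a related article can also be viewed directly, which helps users judge whether the document is what they need. *We can also see which topic concepts are related to SOM. Besides being presented as a list, *they are also shown as a radar chart.
Related Article Abstract Cluster Label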

35 Analysis Tools – Citation Relation
The radar chart can be clicked to enlarge it so that users can see it more clearly.

36 DREW Open Effort Project
Digital reference electronic warehouse (DREW).
Develop an XML schema to:
Allow digital reference transactions from different services and in different communication forms to live together in one space.
Allow researchers to access these archives and explore them using a variety of methods.
Capture the results of this research into a management information system, and then allow the reference services to view their own archives through the tools created by the researchers.
Knowledge base, citations and links to other works.

37 Analysis and Implementation
Once the results have been developed, they must be validated.
Test and tweak the model with data that were not used during the development process (training and test).
The most important validation is to have a librarian who is familiar with that particular library context examine the models.
Implement the report/model.
It is essential to monitor the variables that power the models over time; if the mean of a variable strays too far because of changes in the library, the model may have to be reevaluated.
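Monitoring whether the mean of a variable has strayed too far could look like the sketch below. The slides do not say how far is too far, so the rule used here (a shift of more than two baseline standard deviations) is purely an assumption, as are the sample values and the name `has_drifted`.

```python
from statistics import mean, stdev


def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the recent mean is more than `threshold` baseline
    standard deviations away from the baseline mean (an assumed rule)."""
    base_mean = mean(baseline)
    base_sd = stdev(baseline)
    if base_sd == 0:
        return mean(recent) != base_mean
    return abs(mean(recent) - base_mean) / base_sd > threshold


# Hypothetical weekly averages of a variable that feeds a model.
baseline = [10, 12, 11, 9, 13, 10, 11]
print(has_drifted(baseline, [11, 10, 12]))
print(has_drifted(baseline, [18, 20, 19]))
```

A flag from a check like this does not say the model is wrong, only that it is time to reevaluate it against current data.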

38 Example Applications – See Another PPT

39 Placing Bibliomining in Context

40 Conceptual Framework for Decision-Makers

41 Conceptual Framework for Library and Information Scientists
Deduction; Induction; Hypothetico-Deductive-Inductive Method

42 Understanding both Frameworks
In both frameworks, bibliomining is not the end of the exploration process.
It is one tool to be used in combination with other methods of measurement and evaluation, such as LIBQUAL, E-metrics, cost-benefit analyses, surveys, focus groups, or other qualitative explorations.
Using only bibliomining to understand a digital library can result in biased or incomplete results.
While the information provided by bibliomining is useful, it needs to be supplemented by more user-based approaches to provide a more complete picture of the library system.

43 A Research Agenda to Advance Bibliomining

44 Data Collection
Various data sources: integrated library system; Web-based front-end to digital libraries (federated search); a system to support interlibrary loan; a system to support digital reference services; external systems (citation databases, census data).
Open question: how to collect data and match it between systems.
Standards for data: Project COUNTER, NISO Z x (library metrics and statistics) → aggregate-level data.
Cooperation between system creators: an easily exportable data warehouse, with matching between systems through common fields.

45 User Privacy
The bibliomining data warehouse can provide the method for keeping information about the materials used in the library without maintaining specific information about the users of the library.
What effect does this anonymization have on the power of the data mining tools to discover patterns?
Privacy-protecting data mining.
Privacy issues coming from Digital Reference Service (DRS): personal information in the questions.
Text mining and NLP.

46 Variable, Metric, and Model Generation
While researchers have developed metrics for library statistics, they have primarily focused on fields from one data source.
Once the warehouse has been constructed, the possibilities grow for the discovery of interesting variables for mining and metrics for evaluation.
Start in the data mining process, looking for relationships between individual variables that allow for deeper understanding.
Through the patterns discovered with data mining, new metrics and measures can be proposed.
Example: one-time high-demand needs vs. needs that represent the general user base.

47 Integration of Management Information System and Data Mining tools
Integrate the discovered algorithms into the systems that drive digital libraries.
This combination of a built-in data warehouse, interactive reporting module, standards for report description, and modular design will make it much easier for library decision-makers to get involved with bibliomining.
Work toward developing these integrated modules for other systems that support digital libraries.

48 Multi-system Data Warehouses and Knowledge Bases
The creation of services that span many digital libraries.
Library consortia: joining together digital library sources and services while still maintaining identity for those participating (like the National Science Digital Library).
Join data warehouses with libraries that have similar user groups and similar collections.
Agree on demographic surrogates or develop a cross-walk algorithm to map demographics.
Each library needs to ensure that the shared patterns apply to its own library before making decisions based upon them.
Methods are needed for combining utilization and collection metadata between different systems.
Standardize a series of metrics (what do "Hit" and "Visit" mean?).
Create a standard for record-level data (MARC, COUNTER…).

49 Conclusion: moving beyond evaluation to understanding
The final and most long-lasting area of research of bibliomining is improving understanding of digital libraries at a generalized, and perhaps even conceptual, level.
These data warehouses will combine resources traditionally unavailable in this combined form to researchers.
What connections can be made between patron demographics and bibliometric-based social networks of authors?
How much influence do the works written and cited by faculty at an institution have on the patterns of student use of library services?
How do usage patterns differ between departments or demographic groups, and what can the library do to better personalize and enhance existing services?
Qualitative + quantitative.

