1 Writeslike.us Em Tonkin, Andrew Hewson

Presentation transcript:

1 Writeslike.us Em Tonkin, Andrew Hewson

2 Background
Relevant research themes:
- Metadata harvesting and reuse
- Automatic metadata extraction
- Text analysis
- Social network analysis
- Scholarly communication, particularly informal communication

3 Aim
Helping people to find each other:
- Finding other researchers with similar interests to yourself in your geographic area
- Or in your area of research
- Not everybody with similar interests will attend the same conferences!
- Helping students find potential research supervisors
- Encouraging serendipity

4 Relevant technologies
In fact there are an awful lot of these.
Social network analysis:
- Generally requires a very large dataset
- Solvable either by a) being Facebook or similar (but adoption rates are far from 100%), or b) automated analysis of relevant data
- Solution b) is cheap, simple, and very fallible
- Not a new approach: it is at the core of bibliometrics

5 Data extraction

6 Relevant technical problems
Author identity disambiguation:
- Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which).
- Needs to be solved to an acceptable level, which means defining how good 'acceptable' is.
- Formal solutions usually depend on unique identifiers + registries.
- Cheap, moderately effective solution: disambiguate via textual characteristics + metadata.
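
A minimal sketch of the "cheap, moderately effective" route: group author strings by a crude name key, then treat two records as the same person only if their text also overlaps. The slides do not say how the project did this, so the helper names (`name_key`, `same_person`), the Jaccard measure and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re


def name_key(raw):
    """Reduce 'Smith, John', 'John Smith' or 'J. Smith' to ('smith', 'j').

    Deliberately crude: it ignores middle names and merges everyone who
    shares a surname and first initial. That is why text overlap is
    checked as well.
    """
    raw = raw.strip()
    if "," in raw:
        surname, _, given = raw.partition(",")
    else:
        parts = raw.split()
        if not parts:
            return ("", "")
        surname, given = parts[-1], " ".join(parts[:-1])
    # Keep letters only (including non-ASCII letters), lower-cased.
    surname = re.sub(r"[\W\d_]", "", surname.lower())
    initial = re.sub(r"[\W\d_]", "", given.lower())[:1]
    return (surname, initial)


def jaccard(a, b):
    """Share of terms common to both sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_person(rec_a, rec_b, threshold=0.2):
    """Hypothetical rule: matching name key AND enough shared terms."""
    if name_key(rec_a["name"]) != name_key(rec_b["name"]):
        return False
    return jaccard(rec_a["terms"], rec_b["terms"]) >= threshold


# Invented records: 'terms' stands in for words taken from titles/abstracts.
a = {"name": "Smith, John", "terms": {"metadata", "harvesting", "repositories"}}
b = {"name": "J. Smith", "terms": {"metadata", "repositories", "oai"}}
c = {"name": "John Smith", "terms": {"protein", "folding", "kinetics"}}

print(same_person(a, b), same_person(a, c))
```

How good 'acceptable' is then becomes a question of where the threshold sits: lower values merge more distinct people, higher values split more individuals in two.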

7 Methodology
Harvest OAI metadata, which captures a large list of:
- Author names (somewhat randomly formatted)
- Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
- Citations (sometimes)
Spider digital objects and analyse them for formal metadata: retrieve addresses, etc.
Retain the OAI source: a useful clue regarding author affiliations (sometimes).
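
The harvesting step can be sketched as parsing an OAI-PMH `ListRecords` response in the `oai_dc` format. The three namespace URIs are the standard ones. The sample record, the repository name and the function names are invented for illustration, and nothing here is taken from the project's own harvester.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real one would be fetched over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented example title</dc:title>
          <dc:creator>Smith, J.</dc:creator>
          <dc:creator>Jones, Alice</dc:creator>
          <dc:identifier>http://hdl.handle.net/0000/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def dc_values(record, tag):
    """Return the stripped text of every non-empty dc:<tag> in one record."""
    return [
        el.text.strip()
        for el in record.iterfind(f".//dc:{tag}", NS)
        if el.text and el.text.strip()
    ]


def parse_records(xml_text, source):
    """Turn a ListRecords response into a list of plain dicts.

    `source` is kept on every record because the originating repository
    is itself a clue to author affiliation.
    """
    root = ET.fromstring(xml_text)
    parsed = []
    for record in root.iterfind(".//oai:record", NS):
        parsed.append({
            "source": source,
            "oai_id": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            ),
            "creators": dc_values(record, "creator"),
            "titles": dc_values(record, "title"),
            "descriptions": dc_values(record, "description"),  # often empty
            "dates": dc_values(record, "date"),                # often empty
            "identifiers": dc_values(record, "identifier"),
        })
    return parsed


records = parse_records(SAMPLE_RESPONSE, source="repository.example.org")
print(records[0]["creators"])
```

Fields the sample leaves out come back as empty lists, which is the "(sometimes)" of the slide made concrete.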

8 Links from OAI records

9 Links from OAI records (2)
- Just under half of the pages retrieved by crawling links provided within DC records contained one or more accessible documents.
- Around 15% of linked pages resolved to journal endpoints ('paywalls'). These sometimes contain additional useful metadata about the document, though it is not necessarily appropriate to harvest it. However, the copyright ownership is in itself a useful data point.
- Around 40% of institutional repository links were found to contain no accessible data.

10 Links from OAI records (3) 240,000 records were harvested. Out of the 62,000 records containing an actionable http dc:identifier, 35,000 contained a handle.net (15,500) or dx.doi.org (20,000) actionable persistent identifier. DOIs and handles appear to have a similar prevalence in UK institutional repositories.
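
A count like the one above can be sketched as a classification of each dc:identifier value by URL scheme and host. The category labels and the sample identifiers are invented. Matching on the `handle.net` and `doi.org` domains is an assumption about what "actionable persistent identifier" meant in practice.

```python
from collections import Counter
from urllib.parse import urlparse


def classify_identifier(identifier):
    """Label one dc:identifier value as handle, doi, other-http or not-actionable."""
    parsed = urlparse(identifier.strip())
    if parsed.scheme not in ("http", "https"):
        return "not-actionable"  # bare DOIs, ISSNs, local accession numbers...
    host = (parsed.hostname or "").lower()
    if host == "handle.net" or host.endswith(".handle.net"):
        return "handle"
    if host == "doi.org" or host.endswith(".doi.org"):
        return "doi"
    return "other-http"


# Invented identifiers covering each category.
sample_identifiers = [
    "http://hdl.handle.net/0000/example",
    "http://dx.doi.org/10.0000/example",
    "http://repository.example.org/123/",
    "10.0000/example",
]

counts = Counter(classify_identifier(i) for i in sample_identifiers)
print(counts)
```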

11 Methodology (II)
- Analyse text for noun-phrase-like structures: a useful clue as to theme.
- Background information required, such as institution name and the domains/URLs associated with each institution. Retrieved via harvesting from Wikipedia; much of this information is not well structured, so it is unavailable via DBpedia.
- Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
- Search with a contextual network graph algorithm.
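
The slides do not say how noun-phrase-like structures were found. As a stand-in, the sketch below uses a simple stop-word split (in the spirit of RAKE candidate generation) rather than a part-of-speech tagger: runs of words between stop words and punctuation are treated as candidate phrases. The stop-word list, the function name and the example text are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to we with".split()
)


def candidate_phrases(text, max_words=3):
    """Return runs of non-stop-words, split at stop words and punctuation."""
    phrases = []
    # Punctuation (anything that is not a word character, space or hyphen)
    # always ends a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if run:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(" ".join(run))
    return [p for p in phrases if len(p.split()) <= max_words]


text = (
    "Metadata harvesting is the basis of the service. "
    "The results of metadata harvesting are in institutional repositories."
)
themes = Counter(candidate_phrases(text))
print(themes.most_common(3))
```

Like the rest of the approach this is cheap and fallible: verbs and adjectives that are not in the stop-word list end up inside phrases.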

12 Contextual network graph algorithm Like spilling a little ink on one node of the graph: It spreads a predefined distance through the graph of relations between authors, objects, roughly calculated identities, classifications, and other metadata, in a manner defined by the way in which the implementation is tuned. The result is a ranked list of matching nodes and their types, which can then be presented to the user.
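
The ink metaphor can be sketched as a small spreading-activation routine: energy starts at one node, flows along weighted edges for a fixed number of hops, and fades at each step. This shows the general idea only, not the project's implementation. The `decay` and `max_hops` parameters, the `type:label` node naming and the toy graph are all assumptions.

```python
from collections import defaultdict


def add_edge(graph, a, b, weight=1.0):
    """Add an undirected, weighted relation between two nodes."""
    graph[a][b] = weight
    graph[b][a] = weight


def spread_activation(graph, start, max_hops=3, decay=0.5):
    """Rank nodes by how much 'ink' reaches them from `start`.

    Energy passed along an edge is multiplied by the edge weight and by
    `decay`, so distant or weakly related nodes score lower. Tuning
    means choosing the weights, `decay` and `max_hops`.
    """
    scores = defaultdict(float)
    frontier = {start: 1.0}
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                next_frontier[neighbour] += energy * weight * decay
        for node, energy in next_frontier.items():
            scores[node] += energy
        frontier = next_frontier
    scores.pop(start, None)  # the query node itself is not a result
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy graph: node names carry their type as a prefix.
graph = defaultdict(dict)
add_edge(graph, "author:smith", "paper:p1")
add_edge(graph, "paper:p1", "author:jones")
add_edge(graph, "paper:p1", "topic:metadata", weight=0.5)
add_edge(graph, "topic:metadata", "paper:p2", weight=0.5)
add_edge(graph, "paper:p2", "author:lee")

for node, score in spread_activation(graph, "author:smith"):
    node_type, _, label = node.partition(":")
    print(f"{node_type:7} {label:10} {score:.4f}")
```

With three hops the ink stops at `paper:p2`; a fourth hop is needed before `author:lee` appears in the results, which is the "predefined distance" at work.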

13 'Sometimes' and 'usually'
Statistics are:
- Cheap
- Imperfect
- Available
Rapid innovation philosophy:
- Cheap is good
- Simple is good
- Solutions requiring novel/additional uptake of infrastructure are out of reach

14 Results
- Basic concept worked well
- Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
- Interface development actually required more time than the dataset development, and exceeded project length...
- But a useful dataset can be released as linked data and reused for various purposes

15 Results (2) A random sample of authors shows that authors with few publications have little visibility in the formal indexes. Low-quality publication, or early-career researcher?

16 Caveat (emptor?)
- Collecting data has legal implications.
- Displaying data has legal implications, especially when the site is presented as able to perform specific functions, such as “analysing research impact”.
Realistic solution: a disclaimer:
“[Nobody] makes any warranty whatsoever that the operation of the Site will be [...] error-free; that defects will be corrected; [...] as to the results that may be obtained from [...] the Site; or as to the accuracy, completeness, reliability, availability, suitability, quality, non-infringement or operation of any Content, product or service provided on or accessible from [...] the Site.”

17 Future work
- Exploring the legal issues
- Alternative uses of data
- Targeted interface development
- Integration of additional tools/search methods

18 Walkthrough: Basic search (the harder method!)

19 Advanced search

24 Walkthrough

25 Conclusion
- OAI-DC (and Wikipedia!) is a good source of 'semi-structured' data.
- There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social-network-like graphs.
- Applying this type of data to encourage informal academic communication/collaboration is an interesting research field with many potential applications.