Information structuring in the PEER project


Back to meaning
Foudil Bretel (1), Patrice Lopez (1, 2), Maud Medves (1, 2), Alain Monteil (1), Laurent Romary (1, 2)
(1) INRIA, (2) Humboldt Univ. Berlin
PEER: Publishing and the Ecology of European Research
www.peerproject.eu

Sorting out the chaos?
- Vision: channelling heterogeneous (publishers') data into one single (meaningful) format
- Material: PDF with metadata. What can the TEI do with it?
- Articulating a pivot/reference format seen as a strict customization of the TEI
- Exploring the possibility of automatic metadata extraction from PDFs

Why is it so difficult?
- Great heterogeneity of formats among publishers
  - Metadata (and full text)
  - Proprietary, ScholarOne, NLM 2.0, NLM 3.0, …
- Various issues
  - Affiliations
  - Publication date information
  - ISO 3166 codes (countries)
  - Bibliographical references
  - Proprietary metadata fields

The information chaos

Field           Element names found in publisher data
Article title   article-title/title | ArticleTitle | article-title | ce:title | art_title | article_title | nihms-submit/title | ArticleTitle/Title | ChapterTitle
Journal title   j-title | JournalTitle | full_journal_title | jrn_title | journal-title
ISSN (print)    JournalPrintISSN | issn[@issn_type='print'] | issn[@pub-type='ppub'] | PrintISSN | issn-paper
First page      spn | FirstPage | ArticleFirstPage | fpage | first-page
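The element names above can be folded into a single lookup table, which is roughly what a pivot format has to do at the metadata level. The following is a minimal sketch: the canonical keys (`article_title`, `journal_title`, `first_page`) and the function name are hypothetical, and only plain element names are handled (the XPath-style ISSN variants are left out).

```python
# Hypothetical canonical field names mapped to the publisher-specific
# element names listed on the slide.
FIELD_ALIASES = {
    "article_title": [
        "article-title/title", "ArticleTitle", "article-title", "ce:title",
        "art_title", "article_title", "nihms-submit/title",
        "ArticleTitle/Title", "ChapterTitle",
    ],
    "journal_title": [
        "j-title", "JournalTitle", "full_journal_title", "jrn_title",
        "journal-title",
    ],
    "first_page": [
        "spn", "FirstPage", "ArticleFirstPage", "fpage", "first-page",
    ],
}

# Invert the table once so each lookup is a single dictionary access.
ALIAS_TO_FIELD = {
    alias: field
    for field, aliases in FIELD_ALIASES.items()
    for alias in aliases
}


def canonical_field(source_name):
    """Return the canonical field for a publisher element name, or None."""
    return ALIAS_TO_FIELD.get(source_name)
```

A flat table like this only covers renaming; differences in nesting and attribute conventions still need format-specific handling.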

Sorting this out
- Defining a coherent infrastructure to facilitate
  - the long-term management of scholarly content in research institutions
  - smooth interaction between publishers and research institutions
  - a better understanding of what each of us can provide
- Ongoing experimental setting: the EU PEER project

The PEER project
- Initiated by the European Commission (DG INFSO)
- Objective: study the impact of systematically archiving stage-two outputs in "institutional repositories" (cf. Romary & Armbruster 2010)
  - on journals and business models
  - on the wider ecology of scientific research
- Consortium: STM, European Science Foundation (ESF), Goettingen State and University Library (UGOE), Max Planck Gesellschaft (MPG), INRIA

Content submission: publishers
[Workflow diagram. Labels: Eligible Journals / Articles; Publishers; PEER Depot; Authors; Select; 100 % Metadata; 50 % Manuscripts; Publishers Transfer; Publishers Deposit; Publishers Inform]

Content submission: to repositories & LTP archive
[Workflow diagram. Labels: Publishers Deposit; Authors Deposit; PEER Depot; Transfer; Long-Term Preservation (LTP) Depot (e-Depot, KB); Publicly Available PEER Repositories: UGOE, ULD, MPG, KTU, HAL, TDC, SSOAR]

What has been done
Publishers involved in the project:
- BMJ Publishing Group (proprietary format)
- Cambridge University Press (NLM 2.2)
- EDP Sciences (NLM 3.0)
- Elsevier (proprietary format)
- IOP Publishing (NLM 3.0)
- Nature Publishing Group (proprietary format)
- Oxford University Press (ScholarOne)
- Portland Press (NLM 2.0)
- Sage Publications (proprietary format)
- Springer (proprietary format)
- Taylor & Francis Group (ScholarOne)
- Wiley-Blackwell (ScholarOne)

The PEER deposit workflow
[Workflow diagram. Labels: Publishers; PEER Depot; Repositories; HAL; SUB-Göt; MPS; …; KB Preservation]

TEI as a pivot format for interchange
- General strategy: no information should be lost
  - Nearly everything in sourceDesc
  - Plus keywords, summary, copyright
- Strict author description
  - Deep encoding of names
  - Deep encoding of affiliations (Web of Science, 3-level)
  - Deep encoding of addresses: getting the country right
- Precise publishing information
  - Pagination, DOIs, volume, issue, journal name(s)
- Yes, biblStruct is cool!

Example

Source (Springer proprietary format):

    <Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
      <AuthorName DisplayOrder="Western">
        <GivenName>Hucheng</GivenName>
        <FamilyName>Qi</FamilyName>
      </AuthorName>
      <Contact>
        <Email>hqi@durisol.com</Email>
      </Contact>
    </Author>
    ….
    <Affiliation ID="Aff1">
      <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
      <OrgAddress>
        <Street>51 Arthur Street South</Street>
        <Postcode>N0K 1N0</Postcode>
        <City>Mitchell</City>
        <State>ON</State>
        <Country>Canada</Country>
      </OrgAddress>
    </Affiliation>

PEER format (TEI):

    <author>
      <persName>
        <forename type="first">Hucheng</forename>
        <surname>Qi</surname>
      </persName>
      <email>hqi@durisol.com</email>
      <affiliation>
        <orgName type="institution">Durisol, A division of Armtec Limited Partnership</orgName>
        <address>
          <street>51 Arthur Street South</street>
          <postCode>N0K 1N0</postCode>
          <settlement>Mitchell</settlement>
          <region>ON</region>
          <country key="CA">CANADA</country>
        </address>
      </affiliation>
    </author>
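The author example can be reproduced with a few lines of standard-library Python. This is an illustrative sketch only, not the project's actual conversion (the slides mention stylesheets): the wrapping `<Article>` element, the `COUNTRY_KEYS` table and the function name are assumptions made so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# The slide's Springer fragment, wrapped in a hypothetical <Article> root
# so that it parses as one well-formed document.
SPRINGER = """<Article>
  <Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
    <AuthorName DisplayOrder="Western">
      <GivenName>Hucheng</GivenName>
      <FamilyName>Qi</FamilyName>
    </AuthorName>
    <Contact><Email>hqi@durisol.com</Email></Contact>
  </Author>
  <Affiliation ID="Aff1">
    <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
    <OrgAddress>
      <Street>51 Arthur Street South</Street>
      <Postcode>N0K 1N0</Postcode>
      <City>Mitchell</City>
      <State>ON</State>
      <Country>Canada</Country>
    </OrgAddress>
  </Affiliation>
</Article>"""

# Springer address element -> TEI address element, as in the slide.
ADDRESS_MAP = [("Street", "street"), ("Postcode", "postCode"),
               ("City", "settlement"), ("State", "region")]

# Illustrative one-entry table; a real one would cover all countries.
COUNTRY_KEYS = {"Canada": "CA"}


def author_to_tei(author, affiliations_by_id):
    """Build a TEI <author> element from a Springer <Author> element."""
    tei = ET.Element("author")
    pers_name = ET.SubElement(tei, "persName")
    ET.SubElement(pers_name, "forename", type="first").text = \
        author.findtext("AuthorName/GivenName")
    ET.SubElement(pers_name, "surname").text = \
        author.findtext("AuthorName/FamilyName")
    email = author.findtext("Contact/Email")
    if email:
        ET.SubElement(tei, "email").text = email
    # Resolve each referenced affiliation and inline it under the author.
    for aff_id in author.get("AffiliationIDS", "").split():
        source = affiliations_by_id[aff_id]
        aff = ET.SubElement(tei, "affiliation")
        ET.SubElement(aff, "orgName", type="institution").text = \
            source.findtext("OrgName")
        address = ET.SubElement(aff, "address")
        for src_tag, tei_tag in ADDRESS_MAP:
            value = source.findtext("OrgAddress/" + src_tag)
            if value:
                ET.SubElement(address, tei_tag).text = value
        country = source.findtext("OrgAddress/Country")
        if country:
            element = ET.SubElement(address, "country")
            if country in COUNTRY_KEYS:
                element.set("key", COUNTRY_KEYS[country])
            element.text = country.upper()
    return tei


root = ET.fromstring(SPRINGER)
affiliations = {a.get("ID"): a for a in root.iter("Affiliation")}
tei_author = author_to_tei(root.find("Author"), affiliations)
print(ET.tostring(tei_author, encoding="unicode"))
```

The main structural change is that the affiliation, referenced by ID in the source, ends up nested inside the author it belongs to.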

Example

PEER format (TEI):

    <sourceDesc>
      <biblStruct>
        <analytic>…
          <title level="a" type="main" xml:lang="en">
            The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<hi rend="subscript">2</hi>
          </title>
          <title level="a" type="main" xml:lang="de">
            Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<hi rend="subscript">2</hi>
          </title>
        </analytic>
        <monogr>
          <imprint>
            <date when="2009-05-30"/>
            <biblScope type="fpage">1</biblScope>
            <biblScope type="lpage">7</biblScope>
          </imprint>
        </monogr>
        <idno type="DOI">10.1007/s00107-009-0351-z</idno>
        <idno type="publisherID">s00107-009-0351-z</idno>
        <idno type="articleID">351</idno>
      </biblStruct>
    </sourceDesc>
    </fileDesc>

Source (Springer proprietary format):

    <ArticleID>351</ArticleID>
    <ArticleDOI>10.1007/s00107-009-0351-z</ArticleDOI>
    <ArticleSequenceNumber>0</ArticleSequenceNumber>
    <ArticleTitle Language="En">
      The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<Subscript>2</Subscript>
    </ArticleTitle>
    <ArticleTitle Language="De">
      Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<Subscript>2</Subscript>
    </ArticleTitle>
    <ArticleCategory>Originals Originalarbeiten </ArticleCategory>
    <ArticleFirstPage>1</ArticleFirstPage>
    <ArticleLastPage>7</ArticleLastPage>
    <ArticleHistory>
      <RegistrationDate>
        <Year>2009</Year>
        <Month>05</Month>
        <Day>14</Day>
      </RegistrationDate>
      <Received>
        <Year>2008</Year><Month>12</Month><Day>9</Day>
      </Received>
      <OnlineDate>
        <Year>2009</Year>
        <Month>5</Month><Day>30</Day>
      </OnlineDate>
    </ArticleHistory>
    <ArticleCopyright>
      <CopyrightHolderName>Springer-Verlag</CopyrightHolderName>
      <CopyrightYear>2009</CopyrightYear>
    </ArticleCopyright>
    <ArticleContext>
      <JournalID>107</JournalID>
    </ArticleContext>
    </ArticleInfo>
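In this example the TEI `date when="2009-05-30"` matches the Springer `OnlineDate` (year 2009, month 5, day 30), so the conversion has to zero-pad the month and day. A small sketch of that step follows; the choice of `OnlineDate` as the source of the imprint date is read off this one example, and the function names and the trimmed `<ArticleInfo>` input are hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

# Trimmed version of the slide's Springer fragment, keeping only the
# elements that feed the TEI <imprint>.
SPRINGER_INFO = """<ArticleInfo>
  <ArticleFirstPage>1</ArticleFirstPage>
  <ArticleLastPage>7</ArticleLastPage>
  <ArticleHistory>
    <OnlineDate><Year>2009</Year><Month>5</Month><Day>30</Day></OnlineDate>
  </ArticleHistory>
</ArticleInfo>"""


def iso_date(date_element):
    """Turn <Year>/<Month>/<Day> children into a zero-padded ISO 8601 date."""
    parts = [int(date_element.findtext(tag)) for tag in ("Year", "Month", "Day")]
    # datetime.date also rejects impossible dates such as month 13.
    return datetime.date(*parts).isoformat()


def imprint_from_springer(info):
    """Build a TEI <imprint> with the publication date and page range."""
    imprint = ET.Element("imprint")
    ET.SubElement(imprint, "date",
                  when=iso_date(info.find("ArticleHistory/OnlineDate")))
    for src_tag, scope in (("ArticleFirstPage", "fpage"),
                           ("ArticleLastPage", "lpage")):
        ET.SubElement(imprint, "biblScope", type=scope).text = \
            info.findtext(src_tag)
    return imprint


imprint = imprint_from_springer(ET.fromstring(SPRINGER_INFO))
print(ET.tostring(imprint, encoding="unicode"))
```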

… And when no metadata is available

GROBID: GeneRation Of BIbliographic Data
- A text mining tool for extracting bibliographical metadata at large
- Input:
  - Technical and scientific domains
  - Scholarly documents, technical manuals and patents
  - Raw text or text with layout information (PDF)
- Machine learning approach

Metadata extraction from front page
- Extraction of bibliographical information from the article header
- Fields: title, authors, date, abstract, location, affiliation, book title, journal title, email, publication number, web, degree, keywords, etc.
- As features, exploitation of
  - position information (begin/end of line, position in the document)
  - lexical information (vocabulary, large gazetteers)
  - layout information (font size, font style, etc.)
- Conditional Random Fields (CRF) (Peng & McCallum 2004)
- Current training corpus: 1 350 global examples + 200 affiliation/address blocks + 500 author sequences, etc.
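The three feature families named above (position, lexical, layout) can be illustrated with a per-token feature function of the kind usually fed to a CRF toolkit. This is a sketch under assumptions: the token dictionary layout (`text`, `line_start`, `line_end`, `font_size`, `bold`), the feature names and the tiny gazetteer are all hypothetical and are not GROBID's actual feature set.

```python
def token_features(tokens, index, gazetteer):
    """Return a feature dictionary for the token at the given index."""
    token = tokens[index]
    text = token["text"]
    return {
        # Lexical information
        "lower": text.lower(),
        "is_capitalized": text[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in text),
        "in_gazetteer": text.lower() in gazetteer,
        # Position information
        "line_start": token["line_start"],
        "line_end": token["line_end"],
        "relative_position": round(index / max(len(tokens) - 1, 1), 2),
        # Layout information
        "font_size": token["font_size"],
        "bold": token["bold"],
    }


# Hypothetical header tokens, as they might come out of a PDF parser.
HEADER = [
    {"text": "Information", "line_start": True, "line_end": False,
     "font_size": 18, "bold": True},
    {"text": "structuring", "line_start": False, "line_end": True,
     "font_size": 18, "bold": True},
    {"text": "Berlin", "line_start": True, "line_end": True,
     "font_size": 10, "bold": False},
]
CITY_GAZETTEER = {"berlin", "paris"}

features = [token_features(HEADER, i, CITY_GAZETTEER) for i in range(len(HEADER))]
```

A sequence model then labels each token (title, author, affiliation, and so on) from these dictionaries and from the labels of neighbouring tokens.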

Layout & Block Analysis: XY-Cut algorithm
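The slide names the XY-cut algorithm without further detail. As a reminder of the general idea, the sketch below recursively splits a set of bounding boxes at the widest empty gap, first along the vertical axis and then along the horizontal one. It is a simplified textbook-style variant and not GROBID's implementation; the box format `(x0, y0, x1, y1)`, the `min_gap` threshold and the function names are assumptions.

```python
def _split(boxes, axis, min_gap):
    """Split boxes at the widest empty gap along an axis (0 = x, 1 = y).

    Returns a pair of lists, or None when no gap of at least min_gap exists.
    """
    low, high = axis, axis + 2
    ordered = sorted(boxes, key=lambda box: box[low])
    best_gap, best_index = min_gap, None
    reach = ordered[0][high]  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][low] - reach
        if gap >= best_gap:
            best_gap, best_index = gap, i
        reach = max(reach, ordered[i][high])
    if best_index is None:
        return None
    return ordered[:best_index], ordered[best_index:]


def xy_cut(boxes, min_gap=5):
    """Group boxes into blocks, in top-to-bottom, left-to-right order.

    Boxes are (x0, y0, x1, y1) tuples with y increasing downwards.
    """
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    for axis in (1, 0):  # try a horizontal cut first, then a vertical one
        parts = _split(boxes, axis, min_gap)
        if parts:
            first, second = parts
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [list(boxes)]  # no cut possible: the boxes form one block


# A title spanning the page above two text columns.
TITLE = (0, 0, 100, 10)
LEFT = (0, 20, 45, 100)
RIGHT = (55, 20, 100, 100)
blocks = xy_cut([RIGHT, TITLE, LEFT])
```

On this page layout the first cut separates the title from the body, and the second separates the two columns.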

Metadata extraction from header

Metadata consolidation
- Exploitation of external bibliographical databases for correcting/completing results, based on the extraction results
- Crossref: the full bibliographical record can be obtained based on
  - DOI
  - Journal title, volume, first page
  - Title + author first name ➞ frequent!
- Other databases: xISSN, xISBN, Amazon Web Service
- Real time: online requests take between 0.8 and 1.5 seconds
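The three Crossref lookup routes listed above suggest a simple fallback: use the most specific key that the extraction step managed to fill. The sketch below only selects the lookup key and performs no network request. The record field names and the order of preference are assumptions, not something the slides specify.

```python
def consolidation_key(record):
    """Pick the lookup key to use for an extracted record.

    Returns a (strategy, fields) pair, or None if too little was extracted.
    """
    # 1. A DOI identifies the record directly.
    if record.get("doi"):
        return "doi", {"doi": record["doi"]}
    # 2. Journal title, volume and first page.
    journal_fields = ("journal_title", "volume", "first_page")
    if all(record.get(name) for name in journal_fields):
        return "journal", {name: record[name] for name in journal_fields}
    # 3. Title plus author first name (the frequent case on the slide).
    author_fields = ("title", "author_first_name")
    if all(record.get(name) for name in author_fields):
        return "title_author", {name: record[name] for name in author_fields}
    return None


# Hypothetical extraction result with no DOI and no volume.
extracted = {
    "title": "Information structuring in the PEER project",
    "author_first_name": "Laurent",
    "journal_title": None,
}
strategy, fields = consolidation_key(extracted)
```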

Accuracy overview: corpus CORA

Features                       Accuracy  Precision  Recall  F1
Token                          99.65     97.37      94.19   95.75
Field                          94.7
Instance                       74.91
Instance after consolidation   82.20 (+9.7%)
Title                          99.70     98.24      95.48   96.84
Author                         99.38     90.27      96.36   93.21
Date                           99.86     97.53      81.07   87.29
Affiliation                    99.52     98.25      93.26   95.69
Abstract                       98.95     99.64      98.81   99.22
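For reading the table: F1 is the harmonic mean of precision and recall, and the "+9.7%" figure appears to be the relative gain in instance accuracy from 74.91 to 82.20 after consolidation. A short check of both, using the Token row and the two Instance values (function names are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(before, after):
    """Relative improvement from before to after, in percent."""
    return (after - before) / before * 100


token_f1 = round(f1_score(97.37, 94.19), 2)
consolidation_gain = round(relative_gain(74.91, 82.20), 1)
print(token_f1, consolidation_gain)
```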

Extraction from header
[Pipeline diagram, training. Labels: Collection; Pre-processing; Document segmentation (text segmentation, feature generation); Token + features; + catalog; + expected result; train; CRF models (Header, Authors, Affiliations)]

Extraction from header
[Pipeline diagram, training and classification. Labels: Document; Collection; Pre-processing; Document segmentation (text segmentation, feature generation); Segmented document; Token + features; Term candidates + features; + catalog; + expected result; train / classify; CRF models (Header, Authors, Affiliations); terms + labels; post-processing; consolidation; Final biblio. record]

Why GROBID?
- Cataloguing: mass digitization
- User needs: self-archiving of scholarly papers by authors in open archives
  - metadata not easily available
- Extraction of additional metadata (references, keywords, etc.)
  - for enriching/correcting existing ones
  - improvement in search & retrieval
- Easing document access from citation strings
- Playground for experimenting with CRF models for text mining

Lessons
- Reusable infrastructure for various types of academic-publisher relations (e.g. Gold OA agreements)
- biblStruct is cool
  - Cf. Michael's talk: deeply structured
- Standardization in the publishing world is still an open issue
  - … diplomatically put
- The TEI has a role to play in the publishing world
  - Coherence between publication material and other sources
  - E.g. central role of attribution/authorship/affiliation
- Stylesheets to be made available in OxGarage

A TEI customization for scholarly publishing
- A family of formats based on the TEI customization facilities
  - Core editing customization (to be further extended; minimal tool support)
  - Reference customization family for archiving
  - Can be extended to specific domains: maths, physics, SVG graphics, etc.
- Precise representation of bibliographic information
- Specific support through associated tools:
  - XSLT stylesheets (HTML, PDF, TEI2NLM)
  - PDF to TEI facility (GROBID)
  - OpenOffice to TEI facilities (maintained at Oxford)
  - MS Word to TEI facilities (TEI project with ISO)
  - AccessTEI