Information structuring in the PEER project

1 Information structuring in the PEER project
Back to meaning
Information structuring in the PEER project
Foudil Bretel (1), Patrice Lopez (1, 2), Maud Medves (1, 2), Alain Monteil (1), Laurent Romary (1, 2)
(1) INRIA, (2) Humboldt Univ. Berlin
PEER: Publishing and the Ecology of European Research

2 Sorting out the chaos?
Vision: channelling heterogeneous (publishers’) data into one single (meaningful) format
Material: PDF with metadata – what can the TEI do with it?
- Articulating a pivot/reference format seen as a strict customization of the TEI
- Exploring the possibility of automatic metadata extraction from PDFs

3 Why is it so difficult?
Great heterogeneity of formats within publishers
- Metadata (and full text)
- Proprietary, ScholarOne, NLM 2.0, NLM 3.0, …
Various issues
- Affiliations
- Publication date information
- ISO 3166 codes (countries)
- Bibliographical references
- Proprietary metadata fields

4 The information chaos
Article title: article-title/title | ArticleTitle | article-title | ce:title | art_title | article_title | nihms-submit/title | ArticleTitle/Title | ChapterTitle
Journal title: j-title | JournalTitle | full_journal_title | jrn_title | journal-title
ISSN (print): JournalPrintISSN | | | PrintISSN | issn-paper
First page of a paper: spn | FirstPage | ArticleFirstPage | fpage | first-page
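The element-name variants listed above can be folded into a single lookup table. The following is a minimal sketch, not the project's actual mapping: the canonical field names (`article_title`, `journal_title`, `issn_print`, `first_page`) and the function `canonical_field` are assumptions introduced for illustration; only the variant names come from the slide.

```python
# Variant element names as listed on the slide, grouped by a hypothetical
# canonical field name.
VARIANTS = {
    "article_title": [
        "article-title/title", "ArticleTitle", "article-title", "ce:title",
        "art_title", "article_title", "nihms-submit/title",
        "ArticleTitle/Title", "ChapterTitle",
    ],
    "journal_title": [
        "j-title", "JournalTitle", "full_journal_title", "jrn_title",
        "journal-title",
    ],
    "issn_print": ["JournalPrintISSN", "PrintISSN", "issn-paper"],
    "first_page": ["spn", "FirstPage", "ArticleFirstPage", "fpage", "first-page"],
}

# Invert the table once so that each variant resolves in constant time.
LOOKUP = {
    variant: field
    for field, variants in VARIANTS.items()
    for variant in variants
}


def canonical_field(source_name):
    """Return the canonical field for a publisher element name, or None."""
    return LOOKUP.get(source_name)
```

Returning `None` for unknown names keeps unmapped publisher fields visible, so they can be handled explicitly instead of being dropped silently.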

5 Sorting this out
Defining a coherent infrastructure to facilitate:
- The long-term management of scholarly content in research institutions
- Smooth interaction between publishers and research institutions
- Better understanding of what each of us can provide
Ongoing experimental setting: the EU PEER project

6 The PEER project
Initiated by the European Commission (DG INFSO)
Objective: study the impact of systematically archiving stage-two outputs in “institutional repositories” (cf. Romary & Armbruster 2010)
- on journals and business models
- on the wider ecology of scientific research
Consortium: STM, European Science Foundation (ESF), Goettingen State and University Library (UGOE), Max Planck Gesellschaft (MPG), INRIA

7 Content submission - publishers
[Workflow diagram. Labels: Eligible journals / articles; Publishers; PEER Depot; Authors; Select; 100 % metadata; 50 % manuscripts; Publishers transfer; Publishers deposit; Publishers inform]

8 Content submission – to repositories & LTP archive
[Workflow diagram. Labels: Publishers deposit; Authors deposit; PEER Depot; Transfer; Transfer; Long-term preservation: LTP Depot (e-Depot, KB); Publicly available PEER repositories: UGOE, ULD, MPG, KTU, HAL, TDC, SSOAR]

9 What has been done
Publishers involved in the project:
- BMJ Publishing Group (proprietary format)
- Cambridge University Press (NLM 2.2)
- EDP Sciences (NLM 3.0)
- Elsevier (proprietary format)
- IOP Publishing (NLM 3.0)
- Nature Publishing Group (proprietary format)
- Oxford University Press (ScholarOne)
- Portland Press (NLM 2.0)
- Sage Publications (proprietary format)
- Springer (proprietary format)
- Taylor & Francis Group (ScholarOne)
- Wiley-Blackwell (ScholarOne)

10 The PEER deposit workflow
[Workflow diagram. Labels: Repositories; Publishers; HAL; SUB-Göt; PEER Depot; MPS; KB; Preservation]

11 TEI as a pivot format for interchange
General strategy: no information should be lost
- Nearly everything in sourceDesc
- Plus keywords, summary, copyright
Strict author description
- Deep encoding of names
- Deep encoding of affiliations (Web of Science, 3-level)
- Deep encoding of addresses: getting the country right
Precise publishing information
- Pagination, DOIs, volume, issue, journal name(s)
Yes, biblStruct is cool!

12 Example
Source (Springer proprietary format)
<Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
  <AuthorName DisplayOrder="Western">
    <GivenName>Hucheng</GivenName>
    <FamilyName>Qi</FamilyName>
  </AuthorName>
  <Contact> </Contact>
</Author>
….
<Affiliation ID="Aff1">
  <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
  <OrgAddress>
    <Street>51 Arthur Street South</Street>
    <Postcode>N0K 1N0</Postcode>
    <City>Mitchell</City>
    <State>ON</State>
    <Country>Canada</Country>
  </OrgAddress>
</Affiliation>

PEER format (TEI)
<author>
  <persName>
    <forename type="first">Hucheng</forename>
    <surname>Qi</surname>
  </persName>
  <affiliation>
    <orgName type="institution">Durisol, A division of Armtec Limited Partnership</orgName>
    <address>
      <street>51 Arthur Street South</street>
      <postCode>N0K 1N0</postCode>
      <settlement>Mitchell</settlement>
      <region>ON</region>
      <country key="CA">CANADA</country>
    </address>
  </affiliation>
</author>
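The mapping shown in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's converter (the lessons slide mentions stylesheets, so the real transformation is presumably XSLT): the wrapping `<Article>` element, the `COUNTRY_KEYS` table, and the function name `author_to_tei` are hypothetical, while the element names on both sides are taken from the example.

```python
import xml.etree.ElementTree as ET

# Springer address element -> TEI element (names taken from the example).
ADDRESS_MAP = {
    "Street": "street",
    "Postcode": "postCode",
    "City": "settlement",
    "State": "region",
}
# Hypothetical one-entry lookup; a real converter needs a full country table.
COUNTRY_KEYS = {"Canada": "CA"}


def author_to_tei(author, affiliations):
    """Convert one Springer <Author> into a TEI <author> element."""
    tei_author = ET.Element("author")
    pers_name = ET.SubElement(tei_author, "persName")
    name = author.find("AuthorName")
    forename = ET.SubElement(pers_name, "forename", {"type": "first"})
    forename.text = name.findtext("GivenName")
    ET.SubElement(pers_name, "surname").text = name.findtext("FamilyName")

    # The source links affiliations by ID; the TEI target inlines them.
    for aff_id in author.get("AffiliationIDS", "").split():
        source = affiliations[aff_id]
        aff = ET.SubElement(tei_author, "affiliation")
        org = ET.SubElement(aff, "orgName", {"type": "institution"})
        org.text = source.findtext("OrgName")
        address = ET.SubElement(aff, "address")
        for part in source.findall("OrgAddress/*"):
            if part.tag in ADDRESS_MAP:
                ET.SubElement(address, ADDRESS_MAP[part.tag]).text = part.text
            elif part.tag == "Country":
                country = ET.SubElement(address, "country")
                if part.text in COUNTRY_KEYS:
                    country.set("key", COUNTRY_KEYS[part.text])
                country.text = part.text.upper()
    return tei_author


# The <Article> wrapper is only there to make the fragment parseable.
SOURCE = """<Article>
  <Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1">
    <AuthorName DisplayOrder="Western">
      <GivenName>Hucheng</GivenName><FamilyName>Qi</FamilyName>
    </AuthorName>
  </Author>
  <Affiliation ID="Aff1">
    <OrgName>Durisol, A division of Armtec Limited Partnership</OrgName>
    <OrgAddress>
      <Street>51 Arthur Street South</Street><Postcode>N0K 1N0</Postcode>
      <City>Mitchell</City><State>ON</State><Country>Canada</Country>
    </OrgAddress>
  </Affiliation>
</Article>"""

root = ET.fromstring(SOURCE)
affiliations = {a.get("ID"): a for a in root.findall("Affiliation")}
tei_xml = ET.tostring(
    author_to_tei(root.find("Author"), affiliations), encoding="unicode"
)
print(tei_xml)
```

Resolving the affiliation by ID and copying it under the author mirrors the "no information should be lost" strategy: the link that was implicit in the source becomes explicit nesting in the pivot format.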

13 Example
PEER format (TEI)
<sourceDesc>
  <biblStruct>
    <analytic>…
      <title level="a" type="main" xml:lang="en">
        The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<hi rend="subscript">2</hi>
      </title>
      <title level="a" type="main" xml:lang="de">
        Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<hi rend="subscript">2</hi>
      </title>
    </analytic>
    <monogr>
      <imprint>
        <date when=" "/>
        <biblScope type="fpage">1</biblScope>
        <biblScope type="lpage">7</biblScope>
      </imprint>
    </monogr>
    <idno type="DOI"> /s z</idno>
    <idno type="publisherID">s z</idno>
    <idno type="articleID">351</idno>
  </biblStruct>
</sourceDesc>
</fileDesc>

Source (Springer proprietary format)
<ArticleID>351</ArticleID>
<ArticleDOI> /s z</ArticleDOI>
<ArticleSequenceNumber>0</ArticleSequenceNumber>
<ArticleTitle Language="En">
  The investigation of basic processes of rapidly hardening wood-cement-water mixture with CO<Subscript>2</Subscript>
</ArticleTitle>
<ArticleTitle Language="De">
  Untersuchung der Vorgänge bei der schnellen Härtung einer Holz-Zement-Wasser-Mischung mit CO<Subscript>2</Subscript>
</ArticleTitle>
<ArticleCategory>Originals Originalarbeiten </ArticleCategory>
<ArticleFirstPage>1</ArticleFirstPage>
<ArticleLastPage>7</ArticleLastPage>
<ArticleHistory>
  <RegistrationDate> <Year>2009</Year> <Month>05</Month> <Day>14</Day> </RegistrationDate>
  <Received> <Year>2008</Year><Month>12</Month><Day>9</Day></Received>
  <OnlineDate> <Year>2009</Year> <Month>5</Month><Day>30</Day></OnlineDate>
</ArticleHistory>
<ArticleCopyright>
  <CopyrightHolderName>Springer-Verlag</CopyrightHolderName>
  <CopyrightYear>2009</CopyrightYear>
</ArticleCopyright>
<ArticleContext>
  <JournalID>107</JournalID>
</ArticleContext>
</ArticleInfo>
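The source dates in this example are split into `Year`, `Month` and `Day` elements with inconsistent zero-padding ("05" versus "5", "9"), whereas a TEI `when` attribute takes a single normalized value such as `2009-05-30`. The sketch below shows one way to normalize them; the function name `iso_date` is an assumption, and which of the three dates ends up in the TEI record is not stated on the slide.

```python
import xml.etree.ElementTree as ET


def iso_date(element):
    """Build a zero-padded YYYY-MM-DD string from Year/Month/Day children."""
    year = int(element.findtext("Year"))
    month = int(element.findtext("Month"))
    day = int(element.findtext("Day"))
    return f"{year:04d}-{month:02d}-{day:02d}"


# <ArticleHistory> fragment copied from the Springer source above.
history = ET.fromstring(
    "<ArticleHistory>"
    "<RegistrationDate><Year>2009</Year><Month>05</Month><Day>14</Day></RegistrationDate>"
    "<Received><Year>2008</Year><Month>12</Month><Day>9</Day></Received>"
    "<OnlineDate><Year>2009</Year><Month>5</Month><Day>30</Day></OnlineDate>"
    "</ArticleHistory>"
)

# One normalized date per history event, keyed by the source element name.
dates = {event.tag: iso_date(event) for event in history}
print(dates)
```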

14 … And when no metadata is available

15 GROBID: GeneRation Of BIbliographic Data
A text mining tool for extracting bibliographical metadata at large
Input:
- Technical and scientific domains
- Scholarly documents, technical manuals and patents
- Raw text or text with layout information (PDF)
Machine learning approach

16 Metadata extraction from front page

17 Metadata extraction from front page
Extraction of bibliographical information from the article header
Fields: title, authors, date, abstract, location, affiliation, book title, journal title, publication number, web, degree, keywords, etc.
Features exploit:
- position information (begin/end of line, position in the document)
- lexical information (vocabulary, large gazetteers)
- layout information (font size, font style, etc.)
Conditional Random Fields (CRF) (Peng & McCallum 2004)
Current training corpus:
- global examples
- affiliation/address blocks
- author sequences, etc.
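The three feature families named above (position, lexical, layout) can be illustrated with a small per-token feature function of the kind typically fed to a CRF sequence labeller. This is a sketch only: the token dictionary layout, the feature names, and the function `token_features` are assumptions for illustration and do not reproduce GROBID's actual feature set.

```python
def token_features(tokens, i, gazetteer):
    """Return position, lexical and layout features for token i.

    Each token is assumed to be a dict with the keys
    text, line_start, line_end, font_size and bold.
    """
    token = tokens[i]
    text = token["text"]
    previous_size = tokens[i - 1]["font_size"] if i > 0 else token["font_size"]
    return {
        # Position information
        "line_start": token["line_start"],
        "line_end": token["line_end"],
        "doc_position": round(i / max(len(tokens) - 1, 1), 2),
        # Lexical information
        "word": text.lower(),
        "capitalized": text[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in text),
        "in_gazetteer": text.lower() in gazetteer,
        # Layout information
        "bold": token["bold"],
        "font_change": token["font_size"] != previous_size,
    }


tokens = [
    {"text": "Information", "line_start": True, "line_end": False,
     "font_size": 18.0, "bold": True},
    {"text": "structuring", "line_start": False, "line_end": True,
     "font_size": 18.0, "bold": True},
    {"text": "Berlin", "line_start": True, "line_end": True,
     "font_size": 10.0, "bold": False},
]
cities = {"berlin"}  # stand-in for a large gazetteer
features = [token_features(tokens, i, cities) for i in range(len(tokens))]
```

A font change at the start of a line, as for the third token here, is the kind of layout cue that helps separate a title from the author and affiliation block that follows it.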

18 Layout & Block Analysis: XY-Cut algorithm
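As a reminder of how a recursive XY-cut works: the page is split along the widest empty horizontal or vertical gap between text boxes, and each part is split again until no sufficiently wide gap remains. The sketch below is a simplified textbook variant operating on bounding boxes, not GROBID's implementation; the names `find_gap` and `xy_cut` and the `min_gap` threshold are assumptions.

```python
def find_gap(intervals, min_gap):
    """Return the midpoint of the widest empty gap between intervals, or None."""
    intervals = sorted(intervals)
    best_width, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_width:
            best_width = start - reach
            best_cut = (start + reach) / 2
        reach = max(reach, end)
    return best_cut if best_width >= min_gap else None


def xy_cut(boxes, min_gap=10.0):
    """Recursively split boxes (x0, y0, x1, y1) into layout blocks."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try a horizontal cut first (project onto y), then a vertical one (onto x).
    for low, high in ((1, 3), (0, 2)):
        cut = find_gap([(box[low], box[high]) for box in boxes], min_gap)
        if cut is not None:
            before = [box for box in boxes if box[high] <= cut]
            after = [box for box in boxes if box[low] >= cut]
            return xy_cut(before, min_gap) + xy_cut(after, min_gap)
    return [boxes]  # no gap wide enough: this group is one block


# A full-width title above two text columns.
page = [(0, 0, 100, 10), (0, 30, 45, 100), (55, 30, 100, 100)]
blocks = xy_cut(page, min_gap=5.0)
```

With this page, the first cut separates the title from the body, and the second cut separates the two columns.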

19 Metadata extraction from header

20 Metadata extraction from header

21 Metadata consolidation
Exploitation of external bibliographical databases for correcting/completing results based on extraction results
Crossref: the full bibliographical record can be obtained based on:
- DOI
- Journal title, volume, first page
- Title + author first name ➞ frequent!
Other databases: xISSN, xISBN, Amazon Web Service
Real time: online requests between seconds
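The three Crossref lookup keys listed above suggest a simple fallback order: use the most specific key that the extraction step managed to fill. The sketch below only selects the key and makes no network request; the record field names and the function `choose_lookup` are assumptions for illustration, and the priority order is inferred from the order on the slide.

```python
def choose_lookup(record):
    """Pick the most specific lookup key available in an extracted record.

    Returns a (strategy, parameters) pair, or None when the record does
    not contain enough information for any of the three strategies.
    """
    if record.get("doi"):
        return "doi", {"doi": record["doi"]}
    journal_keys = ("journal_title", "volume", "first_page")
    if all(record.get(key) for key in journal_keys):
        return "journal", {key: record[key] for key in journal_keys}
    title_keys = ("title", "author_first_name")
    if all(record.get(key) for key in title_keys):
        return "title_author", {key: record[key] for key in title_keys}
    return None


# A header extraction that found a title and an author but no DOI.
extracted = {"title": "An example title", "author_first_name": "Hucheng"}
strategy, parameters = choose_lookup(extracted)
```

The title-plus-author case is the one the slide marks as frequent, which matches the fact that a DOI or full journal reference is often absent from an article's front page.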

22 Accuracy overview: corpus CORA
Features                      Accuracy        Precision  Recall  F1
Token                         99.65           97.37      94.19   95.75
Field                         94.7
Instance                      74.91
Instance after consolidation  82.20 (+9.7%)
Title                         99.70           98.24      95.48   96.84
Author                        99.38           90.27      96.36   93.21
Date                          99.86           97.53      81.07   87.29
Affiliation                   99.52           98.25      93.26   95.69
Abstract                      98.95           99.64      98.81   99.22
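For reading these figures: F1 is the harmonic mean of precision and recall, and the (+9.7%) figure is consistent with the relative improvement of the instance-level score from 74.91 to 82.20 after consolidation. A short check using the Token row and the two instance scores (the function names are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(before, after):
    """Relative improvement from before to after, in percent."""
    return (after - before) / before * 100


token_f1 = round(f1_score(97.37, 94.19), 2)
consolidation_gain = round(relative_gain(74.91, 82.20), 1)
print(token_f1, consolidation_gain)
```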

23 Extraction from header
[Pipeline diagram, training. Labels: Collection; Pre-processing; Token + features; CRF models (Header, Authors, Affiliations); + catalog; + catalog; + expected result; Document segmentation - text segmentation - feature generation; train]

24 Extraction from header
[Pipeline diagram, training and classification. Labels: Collection; Pre-processing; Token + features; CRF models (Header, Authors, Affiliations); + catalog; + catalog; + expected result; Document segmentation - text segmentation - feature generation; train / classify; post-processing; consolidation; Segmented document; Term candidates + features; terms + labels; Final biblio. record; Document]

25 Why GROBID?
Cataloguing: mass digitization
User needs:
- Self-archiving of scholarly papers by authors in open archives: metadata not easily available
- Extraction of additional metadata (references, keywords, etc.) for enriching/correcting existing ones: improvement in search & retrieval
- Easing document access from citation strings
Playground for experimenting with CRF models for text mining

26 Lessons
- Reusable infrastructure for various types of academic-publisher relations (e.g. Gold OA agreements)
- biblStruct is cool (cf. Michael’s talk: deeply structured)
- Standardization in the publishing world is still an open issue … diplomatically put
- The TEI has a role to play in the publishing world
  - Coherence between publication material and other sources
  - E.g. central role of attribution/authorship/affiliation
- Stylesheets to be made available in OxGarage

27 A TEI customization for scholarly publishing
A family of formats based on the TEI customization facilities
- Core editing customization (to be further extended – minimal tool support)
- Reference customization family for archiving
- Can be extended to specific domains: maths, physics, SVG graphics, etc.
Precise representation of bibliographic information
Specific support through associated tools:
- XSLT stylesheets (HTML, PDF, TEI2NLM)
- PDF to TEI facility (GROBID)
- OpenOffice to TEI facilities (maintained at Oxford)
- MS Word to TEI facilities (TEI project with ISO)
- AccessTEI

