Published by Margaret Rose Cunningham. Modified over 8 years ago.
1
The OpenAIRE Infrastructure
On Measuring Research Impact - The EGI use-case
OpenAIRE project: Claudio Atzori, Paolo Manghi (CNR-ISTI), Ioannis Foufoulas, Harry Dimitropoulos, Antonis Lempesis, Natalia Manola, Elefterios Stamatogiannakis (NKUA)
EGI: Sergio Andreozzi (EGI.eu), Wilhelm Bühler (NGI-DE), Sara Coelho (EGI.eu), Catherine Gater (EGI.eu), Burcu Ortakaya (NGI TR), Genevieve Romier (NGI-FR)
Paolo Manghi, paolo.manghi@isti.cnr.it
2
OpenAIREplus project
The objectives (in a nutshell)
European data infrastructure for scholarly communication
  Facilitating discovery of research outcomes across disciplines and Europe (and beyond): publication and dataset repositories
  Promoting Open Access: publishing and citation best practices, business models for publications and datasets
  Interlinking and contextualizing publications and datasets
Measuring impact of research initiatives
  Open Access vs non-Open Access publishing under public EC funding (SC39)
  Funding schemes: return on investment
  Research initiatives: research impact
Providing both human and technical infrastructure to make this possible!
EGI Forum, Madrid – 16th of September 2013
3
Networking infrastructure
National Open Access Desks
Human Network
Building on Confederation of Open Access Repositories (COAR)
4
Networking activities
Main objectives
Top-down strategy: dissemination, liaison activities, community building (NOADs)
  Engaging all stakeholders in the OA dissemination, linking and usage of publications and related datasets
Bottom-up strategy: aligning with national, European and international initiatives
  e.g., DataCite, ESFRI-infrastructure projects (such as CLARIN, DARIAH), EuroCRIS (CERIF), EUDAT, Europeana
Organization of four national and international workshops
5
Technical infrastructure
Functional architecture
[Architecture diagram; recoverable labels follow]
Data sources:
  Publications: institutional & thematic repositories, OA journals, CRIS systems (e.g. national projects), Zenodo
  Datasets: data repositories
  Context (e.g. projects, author IDs): entity registries (OpenDOAR, EC-CORDA)
Import policies (OpenAIRE guidelines):
  Publications: Dublin Core metadata + FP7 project + license info + registered in OpenDOAR
  Datasets: DataCite metadata + PID + link to pubs in OpenAIRE
Processing: data source validation, cleansing & transforming, de-duplication, information inference by data mining
Services: get support (NOADs), statistics, search & browse, deposit/ingest (claim) publications, curate & collaborate (feedback), access for third-party service providers
Usage policies
6
OpenAIRE Data flow
[Data-flow diagram; labels in extraction order]
Data source import; End-user claims; Native Information Space; De-duplication; Public Information Space; Data Inference; Human Data Curation; Enriched Information Space; OpenAIRE Portal: Discovery & Impact measure
7
OpenAIRE data model
CERIF & DataCite inspiration
8
Importing from data sources
Repositories and CRIS systems
Publication repositories (Dublin Core, OpenAIRE profile)
  Publications with relationships to authors, projects, licenses, access rights, and repository
Repository aggregators (Dublin Core, OpenAIRE profile)
  Publications with relationships to authors, projects, licenses, access rights, repository, and aggregator
Dataset repositories (DataCite, OpenAIRE profile)
  Datasets with relationships to authors, projects, and publications
CRIS systems (CERIF XML, OpenAIRE profile)
  Publications with relationships to authors, organizations, projects, etc.
  Datasets with relationships to authors, organizations, projects, etc.
  Projects with relationships to persons, organizations, funding, etc.
9
Importing from data sources
Entity registries
OpenDOAR (OpenDOAR HTTP)
  Repositories with relationships to organizations
EC CORDA (CORDA XML)
  EC FP7 projects with relationships to persons and organizations
Wellcome Trust (Wellcome Trust HTTP)
  WT projects with relationships to organizations
Re3data.org for data repositories (ongoing)
  Dataset repositories
ORCID (ongoing)
  Researcher identifiers (maybe more)
Exchanging data with data sources!
  Typically relationships to objects out of their domain: publications and datasets, context
10
Importing from end-users
Depositing, claiming, feedback
OpenAIRE Zenodo deposition: authors who have no repository of reference for publications and datasets can deposit files and metadata into the Zenodo repository
CrossRef claiming: authors who have a repository of reference can "claim" their depositions into OpenAIRE by specifying the relevant DOIs
Human data curation: OpenAIRE data curators (and registered end-users, whose feedback is first validated by data curators) can enrich or fix the information space
11
Importing from inference
Examples
Relationship inference by mining publication PDFs
  Project-publication: EC and national projects (e.g. WT)
  Dataset-publication: DOI connections
  Publication-publication: citations by bibliography, text similarity
Property inference by mining publication PDFs
  Title of papers
  Authors of papers
  Affiliating organizations of the authors at time of publication
  Subject classification
12
Information Space
Publication repositories: ~400 repositories
  With links to EC projects: ~110 repositories + 20 Open Access journals + 1 aggregator (NARCIS, NL)
  With Open Access publications (DRIVER): ~390 repositories
Dataset repositories
  DataCite, DRYAD, Pangaea, others
13
Information Space
OpenAIRE Production system (www.openaire.eu, online since 2011):
  40,000 publications linked to EC funding
OpenAIREplus Beta system (public release end of September 2013)
  8 million publications, including DRIVER infrastructure OA publications, OpenAIRE publications, and datasets
Datasets
  DRYAD
  DataCite
  Pangaea
  Biodiversity (external) databases: Geoscience journal?
14
Enabling Technology
D-NET Software toolkit
  Service-oriented data infrastructure enabling technology
  Adoption: projects (DRIVER/II, OpenAIRE/plus, EFG/1914, HOPE, EAGLE) and nations (Spain-Recolecta, Poland, Belgium, Argentina (in progress))
  By: CNR-ISTI (IT), Uni Athens (GR), ICM (PL), Uni Bielefeld (DE)
INVENIO Repository
  Customizable repository platform: workflows and data models
  Adoption: CERN digital library and 30+ institutions world-wide
  By: CERN (Switzerland), collaboration from DESY, EPFL, FNAL, SLAC
15
[D-NET service architecture diagram; labels in extraction order, duplicates removed]
Areas: Data Mediation Area; Data Conversion Area; Data Storage and Indexing Area; Data Enhancement Area; Data Provision Area
External Data Sources
Services: OAI-PMH Harvester Service; Data Source Validator Service; FTP/HTTP Import Service; Data Source Man Service; Metadata Transformation Service; Metadata Cleaner Service; Feature Extraction Service; Metadata (un)packaging Service; Full-Text Index Service; Metadata Store Service; Database Service; De-duplication Service; Human Data Curation Service; Information Inference Service; Generic portal Service (query and stats - impact); SRW/CQL Publisher Service; OAI-PMH/ORE Publisher Service; Linked Data; Object Store Service; Column Store Service
Technologies: Postgres; MongoDB; Hadoop; TextMining; Groovy; XSLT; Solr (sharded); HBASE; MongoDB (GridFS)
16
Deposition of metadata and files relative to datasets and publications
Metadata curation
Preservation over time
http://zenodo-dev.cern.ch/
17
Measuring impact of "research initiatives"
Research initiative: an activity that funds, favors or develops science and wants to measure its impact in terms of research outcome
Today's OpenAIRE-supported research initiatives:
  European Commission: OA vs non-OA publications w.r.t. EC projects (facets: organizations, countries, EC funding schemes, EC subjects)
  e-IRG EGI infrastructure: publications w.r.t. EGI infrastructure (facets: EGI virtual organizations, EGI disciplines)
Future:
  National funding agencies
  Other scientific infrastructures
18
Measuring impact of "research initiatives"
[Diagram; recoverable labels follow]
Sources of publication/dataset & research initiative relationships:
  Collection from repositories, archives, entity registries and CRISs (any relationships & entities)
  Inference by mining (any relationships & entities)
  End-user feedback, "claim" (any relationships & entities)
Automated provision: supported by compliance with the "Guidelines for Content Providers"
(Semi-)automated provision: facilitated by "mandates" typically issued by funding agencies (e.g., reference to initiative embedded in publication text)
Data model represents the concept of "research initiative"
19
EGI community use-case
Claims: screenshot
Under development
20
EGI community use-case
Portal: publication details
Screenshot
21
EGI community use-case
Inference by mining: use cases
Identify publications associated with EGI in terms of:
Associated with EGI projects
  Publication "enabledBy EGI:XYZ"
  Publication "supportedBy EGI:XYZ"
Associated with a certain Virtual Organisation (VO) or National Grid Infrastructure (NGI)
  Publication "used EGI"
  Publication "used NGI:XYZ"
  Publication "producedBy VO:XYZ"
Associated with a certain EGI scientific discipline
  Publication "related to EGI Scientific Discipline:XYZ"
22
EGI community use-case
Inference by mining: acknowledgement statements
Typically used by EC projects & EGI VOs in scientific publications, such as:

Organisation name: WeNMR | Type: EC Project | Grant ID: 261572
Suggested acknowledgement: "The WeNMR project (European FP7 e-Infrastructure grant, contract no. 261572, www.wenmr.eu), supported by the European Grid Initiative (EGI) through the national GRID Initiatives of Belgium, France, Italy, Germany, the Netherlands (via the Dutch BiG Grid project), Portugal, Spain, UK, South Africa, Taiwan and the Latin America GRID infrastructure via the Gisela project is acknowledged for the use of web portals, computing and storage facilities." and the following article describing the WeNMR portals should be cited: Wassenaar et al. (2012). WeNMR: Structural Biology on the Grid. J. Grid. Comp., 10:743-767.

Organisation name: EGI-InSPIRE | Type: EC Project | Grant ID: 261323
Suggested acknowledgement: The authors acknowledge the use of resources provided by the European Grid Infrastructure. For more information, please reference the EGI-InSPIRE paper (http://go.egi.eu/pdnon).

Organisation name: ALICE | Type: VO | Grant ID: n/a
Suggested acknowledgement: The ALICE collaboration gratefully acknowledges the resources and support provided by all Grid centres and the Worldwide LHC Computing Grid (WLCG) collaboration.

Organisation name: LHCb | Type: VO | Grant ID: n/a
Suggested acknowledgement: The Tier1 computing centres are supported by IN2P3 (France), KIT and BMBF (Germany), INFN (Italy), NWO and SURF (The Netherlands), PIC (Spain), GridPP (United Kingdom). We are thankful for the computing resources put at our disposal by Yandex LLC (Russia), as well as to the communities behind the multiple open source software packages that we depend on.

Organisation name: NGI:PT | Type: NGI | Grant ID: n/a
Suggested acknowledgement: This work makes use of results produced with the support of the Portuguese National Grid Initiative. More information in https://wiki.ncg.ingrid.pt

…
23
EGI community use-case
Inference by mining: mining terms
To use in generic rules to discover publications potentially linked to EGI:

Terms | Relationships
"European Grid Infrastructure" | used EGI
"European Grid Initiative" | used EGI
"National Grid Infrastructure" | used EGI
"National Grid Initiative" | used EGI
"WLCG" | used EGI
"France Grilles" | used EGI, used NGI:FR
"Metacentrum" | used EGI, used NGI:CZ
"Swiss National Grid Association" | used EGI, used NGI:CH
"Ibergrid", "NGI-ES" | used EGI, used NGI:ES and/or NGI:PT
"BIGGrid", "The Dutch e-Science Grid" | used EGI, used NGI:NL
"PL-Grid", "Polish National Grid Initiative" | used EGI, used NGI:PL
"TRUBA", "Turkish National Grid Initiative" | used EGI, used NGI:TR
"ISRAGRID", "Israel National Infrastructure for Grid and Cloud Computing" | used EGI, used NGI:IL
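A table like this can be read as a lookup from mining term to inferred relationships. The sketch below is only an illustration of that idea, not the OpenAIRE implementation: the name `TERM_RELATIONSHIPS`, the function `infer_relationships`, the choice of a subset of terms, and the use of plain case-insensitive substring matching are all assumptions.

```python
# Hypothetical subset of the term table from the slide; keys are lower-cased.
TERM_RELATIONSHIPS = {
    "european grid infrastructure": ["used EGI"],
    "european grid initiative": ["used EGI"],
    "wlcg": ["used EGI"],
    "france grilles": ["used EGI", "used NGI:FR"],
    "metacentrum": ["used EGI", "used NGI:CZ"],
    "pl-grid": ["used EGI", "used NGI:PL"],
}


def infer_relationships(text):
    """Return the sorted relationships whose terms occur anywhere in text.

    Assumption: naive substring matching, so a short term such as "wlcg"
    would also match inside a longer word.
    """
    lowered = text.lower()
    found = set()
    for term, relationships in TERM_RELATIONSHIPS.items():
        if term in lowered:
            found.update(relationships)
    return sorted(found)
```

For instance, `infer_relationships("Computations used France Grilles and the WLCG.")` yields both `used EGI` and `used NGI:FR`. A real rule set would need word-boundary handling and the confidence step described on the next slide.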
24
EGI community use-case
Inference by mining: algorithmic approach
Extract full text from publications
  If structured, use "funding" & "acknowledgements" fields
Scan text for matches against any of the EGI organization names provided
For each match, search the surrounding context for general terms & suggested acknowledgements (using word pairs) to add a confidence value to the match and eliminate false matches
For EC projects, we search not only for the project acronym (e.g. EGI-InSPIRE) but also for the grant ID (261323)
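The matching and confidence steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the 30-token context window, and the scoring rule (fraction of the suggested acknowledgement's adjacent word pairs found near the match) are hypothetical choices.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_pairs(tokens):
    """Return the set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))


def score_matches(text, name, suggested_ack, window=30):
    """Find occurrences of an organization name and score each one.

    Returns (token_position, confidence) tuples, where confidence is the
    share of word pairs from suggested_ack that also occur within
    `window` tokens of the match (an assumed scoring rule).
    """
    tokens = tokenize(text)
    name_tokens = tokenize(name)
    ack_pairs = word_pairs(tokenize(suggested_ack))
    n = len(name_tokens)
    results = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != name_tokens:
            continue
        context = tokens[max(0, i - window):i + n + window]
        shared = word_pairs(context) & ack_pairs
        confidence = len(shared) / len(ack_pairs) if ack_pairs else 0.0
        results.append((i, confidence))
    return results
```

A text that paraphrases the suggested acknowledgement scores high even when the wording is not identical, while a passing mention of the same name scores low and can be discarded with a threshold.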
25
EGI community use-case
Inference by mining: results for arXiv.org
Set of 236,130 publications, from 01/2010 to 05/2013
Found 1,819 EGI-related publications (5,500 high-confidence matches in total, with multiple matches per publication)

EGI-organisation | # of unique publications
ALICE | 1091
ATLAS | 255
CALICE | 19
CMS | 248
ICECUBE | 6
LHCB | 13
MAGIC | 25
METACENTRUM | 11
T2K | 195
Total | 1863
26
EGI community use-case
Inference by mining: results for Web of Science (WoS)
Set of 14,014 publications returned by WoS query on "funding text" field, using general search terms & organisation names
Text mining high-confidence results (verified by EGI):

EGI-organisation | # of unique publications
ALICE | 65
ATLAS | 74
CMS | 210
LHCB | 1
METACENTRUM | 206
Total | 556
27
Inference by mining
Lessons learned & best practices
Mandates on how to write acknowledgements are crucial but often missing.
Collect beforehand as much information as possible that may help with the mining. Even information that seems unhelpful may prove useful in the end.
Clean and normalize your input data (character encoding, stop-word removal, character case, special characters, etc.). If this step is not done correctly, every process that follows will suffer.
Design your data mining methods to be very tolerant. In our case, suggested acknowledgements never appeared in the input texts exactly as given, so we had to use word pairs to tolerate the variation.
Manually curate the results to tune your data mining methods. It is very labor intensive, but without it you will be blind to your mistakes.
Design and implement your data processing methods to work in a streamed fashion and to be performant. Streamed design solves the "data bigger than memory" problem; performance design solves the "having to wait one week for results" problem.
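Two of these lessons, input normalization and streamed processing, can be illustrated with a short sketch. It is an assumption-laden example rather than the project's pipeline: the stop-word list is a tiny illustrative sample, and `normalize` and `stream_matches` are hypothetical names.

```python
import unicodedata

# Illustrative stop-word sample only; a real list would be much longer.
STOP_WORDS = frozenset({"the", "of", "and", "by", "a", "an", "in", "to"})


def normalize(text):
    """Unify encoding forms and case, drop special characters and stop words."""
    text = unicodedata.normalize("NFKC", text).lower()
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)


def stream_matches(lines, term):
    """Yield 1-based numbers of lines containing the term.

    `lines` may be any iterable (for example an open file), so the whole
    input never has to fit in memory.
    """
    needle = normalize(term)
    for number, line in enumerate(lines, start=1):
        if needle in normalize(line):
            yield number
```

Because `stream_matches` is a generator, results become available as soon as each line is read. One limitation of this simple per-line form is that it misses terms split across line breaks.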
28
Roadmap
Release
  Results of inference visible from the portal
  Claim user interfaces available from the portal
Plan
  BETA release: ready for testing by 15th of November 2013
  Production release: ready by 1st of October 2013
29
OpenAIRE project factsheet
Coordination
  University of Athens - GR
  Goettingen University Library - DE
  CNR-ISTI - IT
Technical production & operation
  5 partners with expertise in technologies for Digital Libraries and Data Infrastructures
General
  Starting date: Dec 1, 2011
  Duration: 30 months
  Total budget: 5.2 Mi
Scientific communities
  EMBL-EBI: biology
  DANS: social sciences
  STFC/BADC: climate
Networking organization
  5 libraries, active in OA movement
National Open Access Desks
  All member states
  Norway, Switzerland, Turkey, Iceland
30
Questions?
For more:
OpenAIRE infrastructure: http://www.openaire.eu
D-NET software: http://www.d-net.research-infrastructures.eu
INVENIO: http://invenio-software.org/
31
Inference framework
OpenAIRE implements a framework and infrastructure for data inference services
Inference service (pluggable and pipeline-able): applies an algorithm to an input collection and returns an output collection with a given degree of trust
  Scope: publication/dataset files (e.g. PDF, XML), graph of objects, metadata
  Actions: e.g. discipline-specific inference methods, NLP techniques
Inference workflows: sequences of inference steps, each instantiated by an inference service
  Workflow intermediate and final collections are cached and can be used for testing new inference configurations or re-executing only part of a workflow
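The idea of pipeline-able services with cached intermediate collections can be sketched in a few lines. Everything here is hypothetical and far simpler than the actual framework: the `Workflow` class, the `start_at` parameter, the two example services, and the trust value 0.9 are assumptions made for illustration.

```python
class Workflow:
    """A sequence of named inference steps with cached outputs."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, service) pairs
        self.cache = {}     # step name -> output collection

    def run(self, collection, start_at=0):
        """Run all steps, or resume at step `start_at` from the cache."""
        if start_at > 0:
            previous_name = self.steps[start_at - 1][0]
            collection = self.cache[previous_name]
        for name, service in self.steps[start_at:]:
            collection = service(collection)
            self.cache[name] = collection
        return collection


def extract_text(records):
    """Example service: derive a lower-cased text field."""
    return [dict(r, text=r["raw"].lower()) for r in records]


def tag_egi(records):
    """Example service: keep records mentioning EGI, with an assumed trust."""
    return [dict(r, relation="used EGI", trust=0.9)
            for r in records if "egi" in r["text"]]
```

With `wf = Workflow([("extract", extract_text), ("tag", tag_egi)])`, a first call to `wf.run(records)` fills the cache, and `wf.run(None, start_at=1)` re-executes only the tagging step on the cached extraction output, which is the kind of partial re-execution the slide describes.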
32
Identity, Provenance and Trust
Each object in the information space is enriched with information relative to:
  State-less identifiers: generation of stable identifiers
  Data source from which the object was collected
  Workflow performing the collection: data source import, end-user claim, end-user feedback, inference, etc.
  Agent responsible for the collection of the object: registered users, system-driven mechanism
  Trust: value from 0 to 1 stating the level of trust of the object
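One way to picture this enrichment is as a small record attached to every object. The sketch below is an assumption throughout: the slide does not say how identifiers are generated, so hashing the data source and original identifier is only one plausible state-less scheme, and the `Provenance` field names are invented for illustration.

```python
import hashlib
from dataclasses import dataclass


def stable_id(source_id, original_id):
    """Derive an identifier from its inputs alone (no stored state).

    Assumed scheme: SHA-1 over "source::original". The same inputs
    always give the same identifier.
    """
    digest = hashlib.sha1(f"{source_id}::{original_id}".encode("utf-8")).hexdigest()
    return f"{source_id}::{digest}"


@dataclass(frozen=True)
class Provenance:
    """Provenance and trust attached to one object (hypothetical fields)."""
    object_id: str
    data_source: str
    workflow: str  # e.g. "data source import", "end-user claim", "inference"
    agent: str     # e.g. a registered user or a system mechanism
    trust: float   # 0 (no trust) to 1 (full trust)

    def __post_init__(self):
        if not 0.0 <= self.trust <= 1.0:
            raise ValueError("trust must be between 0 and 1")
```

Re-harvesting the same record from the same source then reproduces the same identifier without consulting a lookup table, and out-of-range trust values are rejected when the record is created.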
33
De-duplicating content
Equivalent objects are "merged" into one "representative object"
Publications, the representative object keeps:
  All properties of the merged records (prioritized by degree of trust)
  All files of the merged records: URLs, data source and license
  All relationships to other objects of the merged records: projects, authors, publications, datasets, etc.
Organizations, the representative object keeps:
  All properties of the merged records (prioritized by degree of trust)
  All relationships to other objects of the merged records: projects, publications, datasets, etc.
Persons (ongoing), the representative object keeps:
  All properties of the merged records (prioritized by degree of trust)
  All relationships to other objects of the merged records: organizations, publications, datasets, etc.
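The merge rule can be illustrated with a small function that builds a representative object from records already judged equivalent. The record layout (`id`, `trust`, `properties`, `relationships`) and the function name are assumptions; detecting which records are equivalent is out of scope for this sketch.

```python
def merge_records(records):
    """Merge equivalent records into one representative object.

    All property values are kept, ordered so that values from records
    with higher trust come first; relationships are unioned.
    """
    ordered = sorted(records, key=lambda r: r["trust"], reverse=True)
    properties = {}
    relationships = set()
    for record in ordered:
        for key, value in record["properties"].items():
            values = properties.setdefault(key, [])
            if value not in values:
                values.append(value)
        relationships.update(record["relationships"])
    return {
        "properties": properties,
        "relationships": sorted(relationships),
        "merged_ids": [r["id"] for r in ordered],
    }
```

A consumer that needs a single title can take the first value in each property list, which comes from the most trusted record, while the remaining values stay available for curation.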
34
[System overview diagram; labels in extraction order]
Collection; Caching; MDStore (pubs, datasets, CERIF); RDBMS (projects, orgs, datasources); PDF Store; Harvesting; Uploading; Information Space; Temporary Store; De-duping; Exporting (Linked Data, HTTP, OAI-PMH); Data Source management; Information Inference; Curation; End-user feedback; Data curator tools; Relationships; Metadata; Objects; UIs; Statistics; Research Impact; Enhanced Publications; Search & Browse; Permanent Store; Zenodo; Validator