Realizing the Dream of a Global Digital Library in High-Energy Physics
Annette Holtkamp, Salvatore Mele, Tibor Simko, Tim Smith
CERN, Geneva
DML 2010, Paris, 7 Jul 2010

HEP community
- closely-knit community
  - 20-30k active researchers publishing 10k articles
  - large collaborations (up to 5000 members)
  - very international (even small author groups)
  - authors = readers
- rapid information exchange essential
  - mailing of preprints since the 1960s
  - long OA tradition
  - >90% of HEP journal articles on arXiv
- dominance of community-based information systems
  - arXiv
  - SPIRES

Dominance of community services
From 2007 survey of 2,000 physicists. Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J.Am.Soc.Inf.Sci.60: ,2009 arXiv:

SPIRES (1974-)
- network of databases
  - HEP literature, conferences, institutions, experiments, hepnames, jobs
- SLAC - DESY - Fermilab collaboration
- SPIRES-HEP
  - metadata for 850k objects, ~800 new records per week
  - preprints, journal articles, conference contributions, books, grey literature
  - since 1974, web server since 1991
  - 100k searches/day
- high data quality, manually curated, comprehensive coverage
- high acceptance, user involvement
But:
- outdated technology from the 1970s

Invenio (2002-)
- digital multimedia library system
- platform for CERN Document Server (CDS)
- powerful search engine
  - Google-like speed for up to 5M records
  - combined metadata, reference and fulltext search
- flexible metadata (MARCXML, multimedia)
- personalization and collaborative features
- modular architecture
- Apache/Python/MySQL
- GNU General Public Licence
  - ~30 instances worldwide
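As a rough illustration of the MARCXML metadata mentioned on this slide, the sketch below reads a minimal MARC21 record with Python's standard library. The record content and the helper name `subfields` are invented for this example; only the MARC21 "slim" namespace and the conventional tags 100 (main author) and 245 (title) come from the MARC standard. This is not Invenio code.

```python
import xml.etree.ElementTree as ET

# MARC21 "slim" XML namespace used by MARCXML records.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical record, invented for this example.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Doe, J.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return all values of subfield `code` in datafields carrying `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text for element in record.findall(path, MARC_NS)]


record = ET.fromstring(SAMPLE)
authors = subfields(record, "100", "a")  # main author entries
titles = subfields(record, "245", "a")   # title statements
print(authors, titles)
```

Because every field is addressed by tag and subfield code, the same helper works for any other MARC field without schema changes, which is one way to read "flexible metadata".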

[Figure: ingestion]
[Figure: dissemination]
[Figure: run by (2007-)]

INSPIRE development
- 2007: inception, feasibility study
- 2008: user-level functionalities
  - data conversion
  - citation analysis, search syntax, output formats…
- 2009: cataloguing functionalities
  - metadata maintenance and enrichment tools
- 2010: workflow
  - harvesting, cataloguing…
- April 2010: public beta version

Bibliographic content
- SPIRES content (plus part of CDS): journal articles, conference proceedings, preprints, experimental notes, theses
- going beyond SPIRES: conference slides, multimedia, software, high-level research data…
- going back before 1974
- more material from neighboring disciplines (astrophysics, nuclear physics, mathematics…) cited by core HEP articles

"Fulltext" repository
- all freely accessible articles
  - esp. "endangered" material
- access-restricted articles
  - "hidden archive"
  - first agreements with Springer and APS
- historical material
  - scanning of old preprint series
- beyond articles
  - slides, multimedia, software, wikis…
  - independent citable objects

INSPIRE features I
- advanced search functionality
  - Google-like freetext search
  - complex second-order searches
Example: find the most influential HEP core papers that cite the Hitchin article "Generalized Calabi-Yau manifolds" but don't cite any papers by Polchinski:
    refersto:reportnumber:math/ collection:core cited:100->9999 NOT refersto:author:Polchinski
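To make the logic of such a second-order search concrete, here is a minimal sketch that evaluates the same kind of query over a toy in-memory corpus. The `Record` class, the function names and all data are hypothetical, and the citation threshold is lowered to 2 so the tiny corpus yields hits; the real search engine is not implemented this way.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    recid: int
    authors: list
    collection: str = "core"
    references: list = field(default_factory=list)  # recids this record cites


def citation_counts(records):
    """Count how many records in the corpus cite each record."""
    counts = {r.recid: 0 for r in records}
    for r in records:
        for ref in set(r.references):
            if ref in counts:
                counts[ref] += 1
    return counts


def second_order_search(records, target, excluded_author,
                        min_cites, max_cites, collection="core"):
    """Records in `collection` that cite `target`, have a citation count in
    [min_cites, max_cites], and cite no paper by `excluded_author`."""
    counts = citation_counts(records)
    excluded = {r.recid for r in records if excluded_author in r.authors}
    hits = [
        r.recid for r in records
        if r.collection == collection
        and target in r.references
        and min_cites <= counts[r.recid] <= max_cites
        and not set(r.references) & excluded
    ]
    # Most cited first, mirroring "most influential".
    return sorted(hits, key=lambda recid: counts[recid], reverse=True)


# Toy corpus: record 1 plays the role of the Hitchin article.
records = [
    Record(1, ["Hitchin"], "math"),
    Record(2, ["Polchinski"]),
    Record(3, ["A"], references=[1]),
    Record(4, ["B"], references=[1, 2]),  # cites Polchinski: excluded
    Record(5, ["C"], references=[1]),     # cited only once: below threshold
    Record(9, ["D"], references=[1]),
    Record(6, ["E"], references=[3, 4, 5, 9]),
    Record(7, ["F"], references=[3, 4, 9]),
    Record(8, ["G"], references=[3]),
]

hits = second_order_search(records, target=1, excluded_author="Polchinski",
                           min_cites=2, max_cites=9999)
print(hits)
```

The query is "second-order" because two of its conditions are themselves searches: the set of papers by the excluded author and the citation counts are computed first, then used to filter the candidates.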

INSPIRE features II
- detailed record pages
  - abstract, keywords, references, citations, fulltext, figures
  - various export formats
- comprehensive author pages
  - affiliation history, coauthors, frequent keywords, article classification, citation summary
- citation analysis
  - cited by, co-cited with, self-citations, citation history
- taxonomy-based classification
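The citation-analysis views listed above (cited by, co-cited with, self-citations) can all be derived from one reference graph. The following is a minimal sketch under that assumption; the function names and the toy data are invented for illustration and say nothing about how INSPIRE computes these views.

```python
from collections import Counter

# Toy data: references[p] is the set of papers that p cites,
# authors[p] is the set of author names of p. All values are invented.
references = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
authors = {
    "A": {"Doe"},
    "B": {"Roe"},
    "C": {"Doe", "Poe"},
    "D": {"Lee"},
    "E": {"Kim"},
}


def cited_by(references, paper):
    """Papers whose reference list contains `paper`."""
    return sorted(p for p, refs in references.items() if paper in refs)


def co_cited_with(references, paper):
    """How often each other paper appears in the same reference list as `paper`."""
    counter = Counter()
    for refs in references.values():
        if paper in refs:
            counter.update(r for r in refs if r != paper)
    return counter


def self_citations(references, authors, paper):
    """Citing papers that share at least one author with `paper`."""
    return sorted(p for p in cited_by(references, paper)
                  if authors[p] & authors[paper])


citing = cited_by(references, "A")
co_cited = co_cited_with(references, "A")
self_cites = self_citations(references, authors, "A")
print(citing, dict(co_cited), self_cites)
```

Matching authors by name string, as done here, is only reliable once author names are disambiguated.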

HEP taxonomy
Hierarchical structure of all important HEP concepts (dynamical symmetry breaking) providing
- synonyms (dynamically broken)
- related terms (spontaneous symmetry breaking)
- broader/narrower (symmetry breaking)
- definitions
- subject areas (high-energy physics - theory)
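One way to picture such a taxonomy entry is as a small record with links to other concepts. The sketch below models the slide's example; the `Concept` class, its field names and the `ancestors` helper are assumptions made for illustration, not the actual taxonomy format.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str
    synonyms: list = field(default_factory=list)
    related: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    definition: str = ""  # left empty here; the slide gives no definition text
    subject_areas: list = field(default_factory=list)


# Entries built from the example terms on the slide.
taxonomy = {
    "symmetry breaking": Concept(
        label="symmetry breaking",
        narrower=["dynamical symmetry breaking"],
    ),
    "dynamical symmetry breaking": Concept(
        label="dynamical symmetry breaking",
        synonyms=["dynamically broken"],
        related=["spontaneous symmetry breaking"],
        broader=["symmetry breaking"],
        subject_areas=["high-energy physics - theory"],
    ),
}


def ancestors(taxonomy, label):
    """Walk the `broader` links upwards and return every concept reached."""
    seen = []
    queue = list(taxonomy[label].broader)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # guard against cycles or repeated parents
        seen.append(current)
        if current in taxonomy:
            queue.extend(taxonomy[current].broader)
    return seen


broader_terms = ancestors(taxonomy, "dynamical symmetry breaking")
print(broader_terms)
```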

Taxonomy applications
- fast automatic generation of keywords
  - enabling e.g. prompt alerts
  - manually curated afterwards
- automatic selection of HEP-relevant articles
  - no more delay in border areas caused by manual selection
- improved search algorithm (planned)
  - a search for "SUSY" will also find "supersymmetry"
  - narrow/broaden search
- user tagging (planned)
  - improve INSPIRE-generated classification
  - improve taxonomy
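Two of these applications, automatic keyword generation and synonym-aware search, can be sketched with nothing more than a synonym table. The table, the function names and the sample sentence below are invented for illustration; a production classifier would be considerably more elaborate.

```python
import re

# Hypothetical excerpt of a taxonomy: preferred label -> alternative forms.
SYNONYMS = {
    "supersymmetry": ["SUSY"],
    "dynamical symmetry breaking": ["dynamically broken"],
}


def extract_keywords(text, synonyms=SYNONYMS):
    """Return the preferred label of every concept mentioned in `text`."""
    found = []
    for preferred, alternatives in synonyms.items():
        for form in [preferred, *alternatives]:
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(preferred)
                break  # one match per concept is enough
    return found


def expand_query(term, synonyms=SYNONYMS):
    """Return all known forms of `term`, or just `term` if it is unknown."""
    for preferred, alternatives in synonyms.items():
        forms = [preferred, *alternatives]
        if term.lower() in (form.lower() for form in forms):
            return forms
    return [term]


keywords = extract_keywords("Searches for SUSY with dynamically broken symmetry")
expanded = expand_query("SUSY")
print(keywords, expanded)
```

Mapping every match back to the preferred label is what lets a search for "SUSY" and a search for "supersymmetry" return the same records.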

Author identification
- INSPIRE author ID
  - compatible with other identification schemes
  - active participation in ORCID
- author disambiguation
  - using e.g. lab IDs, affiliation history, coauthors and more
  - INSPIRE IDs already assigned
- automatic association of papers with authors
  - using info on affiliations, coauthors, research topics, from publishers
  - G. Chen: 963 docs, 21 real authors, only 22 docs not assigned, 97.2% success rate
  - INSPIRE ID part of author lists of large collaborations
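The idea of separating several real people who share one name string can be illustrated with a very simple clustering rule: two papers belong to the same person if they share a coauthor or an affiliation, applied transitively. This rule, the function name and the toy papers are assumptions made for this sketch; the actual INSPIRE algorithm uses more signals and is not described on the slide.

```python
def disambiguate(papers):
    """Group papers signed with the same name string into candidate authors.

    Two papers are linked when they share a coauthor or an affiliation;
    linked groups are merged with a small union-find structure.
    """
    parent = list(range(len(papers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            a, b = papers[i], papers[j]
            if a["coauthors"] & b["coauthors"] or a["affiliation"] == b["affiliation"]:
                union(i, j)

    clusters = {}
    for i, paper in enumerate(papers):
        clusters.setdefault(find(i), []).append(paper["id"])
    return sorted(clusters.values())


# Invented papers that all carry the same author name string.
papers = [
    {"id": "p1", "coauthors": {"A. Li"}, "affiliation": "Lab X"},
    {"id": "p2", "coauthors": {"A. Li", "B. Wu"}, "affiliation": "Lab Y"},
    {"id": "p3", "coauthors": {"C. Ray"}, "affiliation": "Lab Z"},
    {"id": "p4", "coauthors": {"D. Sun"}, "affiliation": "Lab Z"},
    {"id": "p5", "coauthors": {"E. Fox"}, "affiliation": "Lab Q"},
]

clusters = disambiguate(papers)
print(clusters)
```

Each resulting cluster is a candidate for one author identifier; papers that end up alone, like `p5` here, are the ones that would need further evidence or manual review.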

Coming sooner…
- personalization
  - personal accounts, bookshelves, display formats, alerts, RSS feeds
  - collaborative tools, user groups
- claim my papers
- user tagging
- fulltext search
  - snippet display
- plot extraction, figure caption search
  - captions in TeX, display via jsMath, TeX symbols searchable
- user submission
  - paper-centric (articles, supplementary material) and beyond

… or later
- innovative metrics
- semantic analysis
- content indexing of plots and tables
- recommender systems
  - combining citations, keywords, fulltext, usage pattern data...
- open API for 3rd party tools and searching
- object aggregation (OAI-ORE)
- OAIS standards for long-term document preservation

Partnerships
- researchers
  - user tagging, user submission
  - improved correction interfaces
  - feedback driving future developments
- information providers
  - close alliance with arXiv
  - data exchange with publishers/databases
  - standardized author identities
- neighboring fields
  - open harvesting and searching
  - ADS (SAO/NASA Astrophysics Data System)
  - DML!