Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego

Presentation transcript:

Machine Learning in the New World of Scholarly Communication
Philip E. Bourne, University of California San Diego
ICMLA 2008

Disclaimer
I am not an expert in machine learning, but I have applied SVMs to biological systems on occasion:
– Protein motions: Gu et al. PLoS Comp. Biol., (7) e90
– P-P interfaces: Chung et al. Proteins (3)

So Why am I Here?
There are events happening in what I broadly refer to as “scholarly communication” which I believe offer new opportunities for those interested in machine learning. What are those opportunities and how can they be exploited?

What If…
– What if … negative data was as easily obtainable as positive data
– What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge on a scale not previously possible
– What if … that knowledge included rich media
– What if … the value of that knowledge could be weighted according to the authority of the source

Some big “Ifs”
But they would offer a much richer medium to learn from.
Take-home message: parts of this medium are already here, and those of us generating that medium are keen to collaborate.

Some big “Ifs”
Let’s take a step back and see where we are today.

Today’s Research Cycle
[Diagram labels: Research, [Grants], Journal Article, Conference Paper, Poster Session, Feds, Societies, Publishers, Reviews, Blogs, Community Service/Data]

Tomorrow’s Research Cycle
The relationship between scientist and publisher is quite different. The publisher is a warehouse for the workflow of scientific endeavor, not just a repository for the end product.

Tomorrow’s Research Cycle: Evidence
– Publisher hubs: Elsevier portals, PLoS collections
– Open access/open review, e.g. Biology Direct
– NIH Roadmap requires data be accessible
– New resources: MetaLab (Borya Shakhnovich)

What If…
– What if … negative data was as easily obtainable as positive data
– What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
– What if … that knowledge included rich media
– What if … the value of that knowledge could be weighted according to the authority of the source

Example: The Protein Structure Initiative
The X-ray Crystallography Pipeline
What if … negative data was as easily obtainable as positive data
Basic steps:
– Target Selection
– Crystallomics: Isolation, Expression, Purification, Crystallization
– Data Collection
– Structure Solution
– Structure Refinement
– Functional Annotation
– Publish
Remains more of an art than a science.

– Positive and negative data are required by the NIH to be deposited immediately
– Data are described by an ontology
– Perhaps some underlying principles can be learnt, particularly as the amount of data is increasing rapidly

What If…
– What if … negative data was as easily obtainable as positive data
– What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
– What if … that knowledge included rich media
– What if … the value of that knowledge could be weighted according to the authority of the source

A First Step is to Have Open and Usable Access to the Scientific Literature
We are taking steps in that direction.

NIH Public Access Policy
“The research supported by the National Institutes of Health (NIH) is essential to improving human health. Public access to this research is vital – today and for generations to come.”
From a letter from NIH Director Zerhouni to grantees, February 3rd, 2005

More and more authors care about improving access to their papers…
“Faced with the option of submitting to an open-access or closed-access journal, we now wonder whether it is ethical for us to opt for closed access on the grounds of impact factor or preferred specialist audience.”
-- Costello and Osrin in The Lancet

Where are we Today?
– NIH and other government funders have mandated open access
– Full text increasingly on-line and potentially usable
– Traditional publishers have used the internet as a distribution medium, but the power of the medium has yet to be realized
– Data increasingly on-line but not integrated with the publication derived from it

The Growth of Open Access Literature

Open Access (Creative Commons License)
1. All published materials available on-line free to all (author pays model)
2. Unrestricted access to all published material in various formats, e.g. XML, provided attribution is given to the original author(s)
3. Copyright remains with the author

Open Access (Creative Commons License)
The catalyst PLoS Comp Biol (3) e

Community Reaction?
Most scientists have no idea that this license means anyone can take their material, enhance it (e.g., via a mashup), and effectively republish it.
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

Consider an Example
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
Fink & Bourne 2007 CT Watch 3(3) 26-31

Database and Journal Integration: The Test Bed
[Diagram labels: Journals, Database]

The Protein Data Bank
– Paper not published unless data are deposited: strong data-to-literature correspondence
– Highly structured data conforming to an extensive ontology
– DOIs assigned to every structure

Seamless Integration between Data and the Literature – What Does That Imply?
– Improving semantic consistency in the literature, best done at the point of authoring
– Post-processing to establish semantic content
– New forms of visualization and interaction at the presentation layer

The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
BioLit: Tools for New Modes of Scientific Dissemination
BioLit integrates biological literature and biological databases and includes:
– A database of journal text
– Authoring tools to facilitate database storage of journal text
– Tools to make static tables and figures interactive

Nucleic Acids Research (S2) W PSP Washington DC Feb

ICTP Trieste, December 10,

What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
[Figure panels: Immunology Literature, Cardiac Disease Literature]

Semantic Consistency is Best Done at the Point of Authoring
What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

BioLit Plugin Project
[Diagram labels: Author, Paper, Word File in Docx format, Publisher]

BioLit Plugin Project
Automated Ontology & ID Tagging within Microsoft Word Documents
– Leverages Office Open XML used in Microsoft Office 2007
– A custom schema is attached to the document and used to automatically XML-tag ontology terms and database identifiers within a research paper
– Ontology tagging aids efficient, accurate automated categorization of research papers and promotes information dissemination
– Conversion of manuscript to NLM DTD for direct submission to publisher
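The tagging idea on this slide can be illustrated with a minimal Python sketch. It is not the BioLit plugin, which works inside Microsoft Word on Office Open XML; it only shows the general technique of wrapping recognized ontology terms and database identifiers in XML elements. Everything specific here is an assumption made for illustration: the two-entry `ONTOLOGY` dictionary and its `EX:` identifiers, the element names `term` and `dbid` (not taken from the NLM DTD), and the simplified pattern for PDB-style identifiers (a digit followed by three alphanumerics, at least one of them a letter).

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mini-ontology (lower-case term -> made-up identifier).
ONTOLOGY = {
    "protein kinase": "EX:0000001",
    "apoptosis": "EX:0000002",
}


def tag_text(text, ontology=ONTOLOGY):
    """Wrap known ontology terms and PDB-style IDs in illustrative XML tags.

    Assumes `ontology` is non-empty and its keys are lower-case.
    """
    # Longest terms first so multi-word terms win over shorter overlaps.
    terms = sorted(ontology, key=len, reverse=True)
    term_alt = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(
        rf"(?P<term>\b(?:{term_alt})\b)"
        # Assumed PDB-style ID: digit + 3 alphanumerics, at least one letter,
        # so plain years such as 2007 are not tagged.
        rf"|(?P<pdb>\b[0-9](?=[A-Za-z0-9]*[A-Za-z])[A-Za-z0-9]{{3}}\b)",
        re.IGNORECASE,
    )
    out = []
    pos = 0
    for m in pattern.finditer(text):
        # Copy the untagged text before the match, escaped for XML.
        out.append(escape(text[pos:m.start()]))
        matched = m.group(0)
        if m.group("term"):
            ident = ontology[matched.lower()]
            out.append(f"<term id={quoteattr(ident)}>{escape(matched)}</term>")
        else:
            out.append(f'<dbid source="PDB">{escape(matched.upper())}</dbid>')
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(tag_text("Apoptosis is regulated by the protein kinase in structure 1abc (2007)."))
```

A pattern-based matcher like this will produce false positives on real manuscripts, which is one reason the talk argues for letting the author confirm tags at the point of authoring.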

BioLit Plugin Project
Rather than post-processing the document, the author controls the semantic tagging.

Plugin Architecture

Ontologies are Stored in a Local Database

User Configurable Selection
– Fully user-configurable ontology and database identifier selection
– All searches occur within the user’s desktop computer
– Desired ontologies are downloaded and installed automatically, and updated periodically
– The BioLit installer XML file provides the application with the information needed to download and install ontologies

What If…
– What if … negative data was as easily obtainable as positive data
– What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
– What if … that knowledge included rich media
– What if … the value of that knowledge could be weighted according to the authority of the source

What Do We Mean by Rich Media?
Non-traditional ways of conveying scientific data and knowledge: video, podcasts, postercasts, blogs…
What if … that knowledge included rich media

YouTube for Scientists
What if … that knowledge included rich media

Motivation
What if … that knowledge included rich media

Pubcast – Video Integrated with the Full Text of the Paper

Pubcast
With voice-to-text conversion, presentation materials, etc., new knowledge is available to supplement the knowledge already in the paper.
What if … that knowledge included rich media

Postercasts
What if … that knowledge included rich media
Again, additional knowledge can be used which until now has not been captured.

What If…
– What if … negative data was as easily obtainable as positive data
– What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge
– What if … that knowledge included rich media
– What if … the value of that knowledge could be weighted according to the authority of the source

First You Have to Identify the Source
What if … the value of that knowledge could be weighted according to the authority of the source

How Do we Weight the Various Knowledge Sources?
– Peer reviewed literature
– Reviews (papers, grants, proceedings)
– Blog postings
– Database entries
What if … the value of that knowledge could be weighted according to the authority of the source

How Do we Weight the Various Knowledge Sources?
– A token system
– Tokens can be authenticated by any user of that content
– Page ranking ??
What if … the value of that knowledge could be weighted according to the authority of the source
PLoS Comp Biol Editorial this month
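One way to picture authority weighting is a weighted vote over conflicting assertions, sketched below in Python. The talk names the source types and mentions authenticated tokens but gives no numbers or formula, so the `BASE_WEIGHT` values, the logarithmic token boost, and the function names are all hypothetical choices made only for illustration.

```python
import math

# Hypothetical base weights per source type; the talk assigns no values.
BASE_WEIGHT = {
    "peer_reviewed": 1.0,
    "review": 0.8,
    "database_entry": 0.6,
    "blog_post": 0.3,
}


def authority_weight(source_type, tokens):
    """Base weight of the source type, boosted by authenticated tokens.

    The log1p boost is an assumption: each extra token helps, with
    diminishing returns.
    """
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return BASE_WEIGHT[source_type] * (1.0 + math.log1p(tokens))


def weighted_label(evidence):
    """Pick the label with the greatest total authority weight.

    `evidence` is a list of (label, source_type, tokens) tuples.
    Returns (winning_label, totals_by_label).
    """
    totals = {}
    for label, source_type, tokens in evidence:
        totals[label] = totals.get(label, 0.0) + authority_weight(source_type, tokens)
    return max(totals, key=totals.get), totals


if __name__ == "__main__":
    evidence = [
        ("binds", "blog_post", 0),
        ("binds", "blog_post", 0),
        ("does_not_bind", "peer_reviewed", 0),
    ]
    print(weighted_label(evidence))
```

The same per-source weights could instead be passed to a learner as per-example sample weights; the vote is used here only because it needs no external library.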

In Conclusion
– Scholarly communication is in a state of rapid change
– Content easily available for machine learning is expanding and includes new content types
– New opportunities are here already

Acknowledgements
SciVee Team
– Apryl Bailey
– Tim Beck
– Leo Chalupa
– Marc Friedman
– Alex Ramos
– Willy Suwanto
BioLit Team
– J. Lynn Fink
– Sergey Kushch
– Marco Martinez
– Greg Quinn
– Parker Williams
CT Watch 2007, 3(3)

Questions?