Tony Hey Corporate Vice President Corporate Vice President Technical Computing Microsoft Corporation Microsoft Corporation Computer and Information Sciences.

Slides:



Advertisements
Similar presentations
IATUL Porto, May 21, 2006 DOI and e-Science Dr Anne E Trefethen Oxford e-Research Centre
Advertisements

Opening the Research Data Lifecycle Workshop Capturing and Sharing Research Data Simon Coles School of Chemistry, University of Southampton, U.K.
S.J. Coles a*, J.G. Frey a, M.B. Hursthouse a, L. Carr b & C.J. Gutteridge b. a School of Chemistry, University of Southampton, UK.; b School of Electronics.
© S.J. Coles 2006 Digital Repositories as a Mechanism for the Capture, Management and Dissemination of Chemical Data Simon Coles School of Chemistry, University.
RCUK, Octiber Archiving research data and research publications. Dr Leslie Carr, Intelligence, Agents Multimedia, University of Southampton Dr Simon.
Digital Repositories: interoperability & common services Closing Remarks Dr Liz Lyon, UKOLN, University of Bath, UK
E-Science and Cyberinfrastructure Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
The Data Lifecycle and the Curation of Laboratory Experimental Data Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
Electronic publishing: issues and future trends Anne Bell.
The Central Role of Data ‘Capturing and Sharing Chemistry Research Data’ Simon Coles School of Chemistry, University of Southampton, U.K.
1 Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21) NSF-wide Cyberinfrastructure Vision People, Sustainability, Innovation,
Client+Cloud The Future of Research Dr. Daniel A. Reed Corporate Vice President Extreme Computing Group & Technology Strategy and Policy.
David De Roure Manchester Edition. John Taylor There are a number of grid applications being developed and there is a whole raft of computer technologies.
1 Cyberinfrastructure Framework for 21st Century Science & Engineering (CF21) IRNC Kick-Off Workshop July 13,
INFSO-RI Enabling Grids for E-sciencE Grid & Data Preservation Boon Low System Development, EGEE Training National.
Workforce Demand and Career Opportunities in University and Research Libraries NAS Symposium on Digital Curation Anne R. Kenney July 19, 2012.
E-Science and Global Grids in the Information Society: The Role of an EU e-IRG? Tony Hey Director of the UK e-Science Core Programme
University of Southampton, U.K.
You Cannot ReSIST Hugh Glaser Electronics & Computer Science University of Southampton DSSE, 28th February 2007.
EPrints Workshop, January eBank UK: Dissemination of research data using EPrints Simon Coles, School of Chemistry, University of Southampton.
The Open Archives Initiative Simeon Warner (Cornell University) Symposium on “Scholarly Publishing and Archiving on the Web”, University.
Digital Library Architecture and Technology
Computing in Atmospheric Sciences Workshop: 2003 Challenges of Cyberinfrastructure Alan Blatecky Executive Director San Diego Supercomputer Center.
Tony Hey Corporate Vice President Corporate Vice President Technical Computing Microsoft Corporation Microsoft Corporation Computer and Information Sciences.
E-Science and Cyberinfrastructure Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
Social Science Data and ETDs: Issues and Challenges Joan Cheverie Georgetown University Myron Gutmann ICPSR – University of Michigan Austin McLean ProQuest.
Open Science Grid For CI-Days Internet2: Fall Member Meeting, 2007 John McGee – OSG Engagement Manager Renaissance Computing Institute.
Tony Hey Corporate Vice President Corporate Vice President Technical Computing Microsoft Corporation Microsoft Corporation Computer and Information Sciences.
CI Days: Planning Your Campus Cyberinfrastructure Strategy Russ Hobby, Internet2 Internet2 Member Meeting 9 October 2007.
Integrated e-Infrastructure for Scientific Facilities Kerstin Kleese van Dam STFC- e-Science Centre Daresbury Laboratory
University of Bergen Library Electronic publishing Bergen – Makerere visit February 2005.
Using the Open Metadata Registry (openMDR) to create Data Sharing Interfaces October 14 th, 2010 David Ervin & Rakesh Dhaval, Center for IT Innovations.
11 Building a Usable Infrastructure for e-Science: An Information Perspective Christine L. Borgman Professor & Presidential Chair in Information Studies.
What is Cyberinfrastructure? Russ Hobby, Internet2 Clemson University CI Days 20 May 2008.
P. Schirmbacher Humboldt-Universität zu Berlin The Changing Process of Scholarly Publishing or the Necessity of a New Culture of Electronic.
The DART Project: building the new collaborative e- research infrastructure Presentation to 2006 AusWeb Conference.
BMC Open Access Colloquium, 8 February Morgan: "Open Access Repositories"
Lewis Shepherd GOVERNMENT AND THE REVOLUTION IN SCIENTIFIC COMPUTING.
Perspectives on Cyberinfrastructure Daniel E. Atkins Professor, University of Michigan School of Information & Dept. of EECS October 2002.
NanoHUB.org and HUBzero™ Platform for Reproducible Computational Experiments Michael McLennan Director and Chief Architect, Hub Technology Group and George.
10/07/2008 Semantic Web Technologies & Higher Education.
SEEK Welcome Malcolm Atkinson Director 12 th May 2004.
The Swiss Grid Initiative Context and Initiation Work by CSCS Peter Kunszt, CSCS.
Infrastructures for Social Simulation Rob Procter National e-Infrastructure for Social Simulation ISGC 2010 Social Simulation Tutorial.
Cyberinfrastructure What is it? Russ Hobby Internet2 Joint Techs, 18 July 2007.
GRID Overview Internet2 Member Meeting Spring 2003 Sandra Redman Information Technology and Systems Center and Information Technology Research Center National.
ESFRI & e-Infrastructure Collaborations, EGEE’09 Krzysztof Wrona September 21 st, 2009 European XFEL.
Interoperability from the e-Science Perspective Yannis Ioannidis Univ. Of Athens and ATHENA Research Center
Cooperative experiments in VL-e: from scientific workflows to knowledge sharing Z.Zhao (1) V. Guevara( 1) A. Wibisono(1) A. Belloum(1) M. Bubak(1,2) B.
Breakout # 1 – Data Collecting and Making It Available Data definition “ Any information that [environmental] researchers need to accomplish their tasks”
Indiana University School of Informatics The LEAD Gateway Dennis Gannon, Beth Plale, Suresh Marru, Marcus Christie School of Informatics Indiana University.
UKOLN is supported by: Introduction to UKOLN Dr Liz Lyon, Director UKOLN, University of Bath, UK Grand Challenge Meeting, June a centre.
26/05/2005 Research Infrastructures - 'eInfrastructure: Grid initiatives‘ FP INFRASTRUCTURES-71 DIMMI Project a DI gital M ulti M edia I nfrastructure.
Applications and Requirements for Scientific Workflow May NSF Geoffrey Fox Indiana University.
The Collaborative Semantic Grid David De Roure University of Southampton, UK
Cyberinfrastructure Overview Russ Hobby, Internet2 ECSU CI Days 4 January 2008.
Cyberinfrastructure: Many Things to Many People Russ Hobby Program Manager Internet2.
Entering the Data Era; Digital Curation of Data-intensive Science…… and the role Publishers can play The STM view on publishing datasets Bloomsbury Conference.
CombeDay Making Data Openly Available Simon Coles.
Toward a common data and command representation for quantum chemistry Malcolm Atkinson Director 5 th April 2004.
Applications and Requirements for Scientific Workflow May NSF Geoffrey Fox Indiana University.
David De Roure Workflows in Support of Large-Scale Science Provenance, a.
UKOLN is supported by: Library futures in the new research landscape. Dr Liz Lyon, UKOLN, University of Bath, UK CURL Members Meeting October 2004, London.
Cyberinfrastructure Overview of Demos Townsville, AU 28 – 31 March 2006 CREON/GLEON.
Brian Nosek University of Virginia -- Center for Open Science -- Improving Openness.
E-Science and Cyberinfrastructure Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Clouds , Grids and Clusters
e-Science and Cyberinfrastructure
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Presentation transcript:

Tony Hey Corporate Vice President Corporate Vice President Technical Computing Microsoft Corporation Microsoft Corporation Computer and Information Sciences Life Sciences Multidisciplinary Research Earth Sciences e-Science and its Implications for the Library Community Social Sciences New Materials, Technologies and Processes

Licklider’s Vision “Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job” “Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job” Larry Roberts – Principal Architect of the ARPANET

Physics and the Web  Tim Berners-Lee developed the Web at CERN as a tool for exchanging information between the partners in physics collaborations  The first Web Site in the USA was a link to the SLAC library catalogue  It was the international particle physics community who first embraced the Web  ‘Killer’ application for the Internet  Transformed modern world – academia, business and leisure

Beyond the Web?  Scientists developing collaboration technologies that go far beyond the capabilities of the Web  To use remote computing resources  To integrate, federate and analyse information from many disparate, distributed, data resources  To access and control remote experimental equipment  Capability to access, move, manipulate and mine data is the central requirement of these new collaborative science applications  Data held in file or database repositories  Data generated by accelerator or telescopes  Data gathered from mobile sensor networks

What is e-Science? ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’ ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’ John Taylor Director General of Research Councils Director General of Research Councils UK, Office of Science and Technology UK, Office of Science and Technology

The e-Science Vision  e-Science is about multidisciplinary science and the technologies to support such distributed, collaborative scientific research  Many areas of science are in danger of being overwhelmed by a ‘data deluge’ from new high- throughput devices, sensor networks, satellite surveys …  Areas such as bioinformatics, genomics, drug design, engineering, healthcare … require collaboration between different domain experts  ‘e-Science’ is a shorthand for a set of technologies to support collaborative networked science

e-Science – Vision and Reality Vision  Oceanographic sensors - Project Neptune  Joint US-Canadian proposal Reality  Chemistry – The Comb-e-Chem Project  Annotation, Remote Facilities and e-Publishing

Undersea Sensor Network Connected & Controllable Over the Internet

Data Provenance

Visual Programming Persistent Distributed Storage

Distributed Computation Interoperability & Legacy Support via Web Services

Live Documents Searching & Visualization Reputation & Influence

Reproducible Research

Collaboration

Handwriting

Dynamic Documents Interactive Data

The Comb-e-Chem Project National X-Ray Service Data Mining and Analysis Automatic Annotation Combinatorial Chemistry Wet Lab HPC Simulation Video Data Stream Diffractometer Middleware Structures Database

National Crystallographic Service Send sample material to NCS service Search materials database and predict properties using Grid computations Download full data on materials of interest Collaborate in e-Lab experiment and obtain structure

A digital lab book replacement that chemists were able to use, and liked

Monitoring laboratory experiments using a broker delivered over GPRS on a PDA

Crystallographic e-Prints Direct Access to Raw Data from scientific papers Raw data sets can be very large - stored at UK National Datastore using SRB software

Grid E-Scientists Entire E-Science Cycle Encompassing experimentation, analysis, publication, research, learning 5 Institutional Archive Local Web Publisher Holdings Digital Library E-Scientists Graduate Students Undergraduate Students Virtual Learning Environment E-Experimentation E-Scientists Technical Reports Reprints Peer- Reviewed Journal & Conference Papers Preprints & Metadata Certified Experimental Results & Analyses Data, Metadata & Ontologies eBank Project

Support for e-Science  Cyberinfrastructure and e-Infrastructure  In the US, Europe and Asia there is a common vision for the ‘cyberinfrastructure’ required to support the e-Science revolution  Set of Middleware Services supported on top of high bandwidth academic research networks  Similar to vision of the Grid as a set of services that allows scientists – and industry – to routinely set up ‘Virtual Organizations’ for their research – or business  Many companies emphasize computing cycle aspect of Grids  The ‘Microsoft Grid’ vision is more about data management than about compute clusters

Six Key Elements for a Global Cyberinfrastructure for e-Science 1. High bandwidth Research Networks 2. Internationally agreed AAA Infrastructure 3. Development Centers for Open Standard Grid Middleware 4. Technologies and standards for Data Provenance, Curation and Preservation 5. Open access to Data and Publications via Interoperable Repositories 6. Discovery Services and Collaborative Tools

The Web Services ‘Magic Bullet’ Company A (J2EE) Open Source (OMII) Company C (.Net) Web Services

Computational Modeling Real-world Data Interpretation & Insight Persistent Distributed Data Workflow, Data Mining & Algorithms

Technical Computing in Microsoft  Radical Computing  Research in potential breakthrough technologies  Advanced Computing for Science and Engineering  Application of new algorithms, tools and technologies to scientific and engineering problems  High Performance Computing  Application of high performance clusters and database technologies to industrial applications

New Science Paradigms  Thousand years ago: Experimental Science - description of natural phenomena - description of natural phenomena  Last few hundred years: Theoretical Science - Newton’s Laws, Maxwell’s Equations … - Newton’s Laws, Maxwell’s Equations …  Last few decades: Computational Science - simulation of complex phenomena - simulation of complex phenomena  Today: e-Science or Data-centric Science - unify theory, experiment, and simulation - unify theory, experiment, and simulation - using data exploration and data mining - using data exploration and data mining  Data captured by instruments  Data generated by simulations  Processed by software  Scientist analyzes databases/files (With thanks to Jim Gray)

CONTENT Scholarly Communication, Institutional Repositories DATA Acquisition, Storage, Annotation, Provenance, Curation, Preservation TOOLS Workflow, Collaboration, Visualization, Data Mining Advanced Computing for Science and Engineering...

Top 500 Supercomputer Trends Industry usage rising Clusters over 50% x86 is winning GigE is gaining

Key Issues for e-Science  Workflows  The LEAD Project  The Data Chain  From Acquisition to Preservation  Scholarly Communication  Open Access to Data and Publications

The LEAD Project The LEAD Project Better predictions for Mesoscale weather

Analysis/Assimilation Quality Control Retrieval of Unobserved Quantities Creation of Gridded Fields Prediction/Detection PCs to Teraflop Systems Product Generation, Display, Dissemination End Users NWS Private Companies Students The LEAD Vision DYNAMIC OBSERVATIONS Models and Algorithms Driving Sensors The CS challenge: Build a virtual “eScience” laboratory to support experimentation and education leading to this vision.

Composing LEAD Services Need to construct workflows that are: Need to construct workflows that are:  Data Driven  The weather input stream defines the nature of the computation  Persistent and Agile  An agent mines a data stream and notices an “interesting” feature. This event may trigger a workflow scenario that has been waiting for months  Adaptive  The weather changes  Workflow may have to change on-the-fly  Resources

Example LEAD Workflow

The e-Science Data Chain  Data Acquisition  Data Ingest  Metadata  Annotation  Provenance  Data Storage  Curation  Preservation

The Data Deluge  In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history  Some normalizations:  The Bible = 5 Megabytes  Annual refereed papers = 1 Terabyte  Library of Congress = 20 Terabytes  Internet Archive (1996 – 2002) = 100 Terabytes  In many fields new high throughput devices, sensors and surveys will be producing Petabytes of scientific data

The Problem for the e-Scientist  Data ingest  Managing a petabyte  Common schema  How to organize it?  How to reorganize it?  How to coexist & cooperate with others?  Data Query and Visualization tools  Support/training  Performance  Execute queries in a minute  Batch (big) query scheduling Experiments & Instruments Simulations facts answers questions ? Literature Other Archives facts

Digital Curation?  In 20 years can guarantee that the operating system and spreadsheet program and the hardware used to store data will not exist  Need research ‘curation’ technologies such as workflow, provenance and preservation  Need to liaise closely with individual research communities, data archives and libraries  The UK has set up the ‘Digital Curation Centre’ in Edinburgh with Glasgow, UKOLN and CCLRC  Attempt to bring together skills of scientists, computer scientists and librarians

Digital Curation Centre  Actions needed to maintain and utilise digital data and research results over entire life-cycle  For current and future generations of users  Digital Preservation  Long-run technological/legal accessibility and usability  Data curation in science  Maintenance of body of trusted data to represent current state of knowledge  Research in tools and technologies  Integration, annotation, provenance, metadata, security…..

Berlin Declaration 2003  ‘To promote the Internet as a functional instrument for a global scientific knowledge base and for human reflection’  Defines open access contributions as including:  ‘original scientific research results, raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material’

NSF ‘Atkins’ Report on Cyberinfrastructure  ‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’  ‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’

MIT DSpace Vision ‘Much of the material produced by faculty, such as datasets, experimental results and rich media data as well as more conventional document-based material (e.g. articles and reports) is housed on an individual’s hard drive or department Web server. Such material is often lost forever as faculty and departments change over time.’ ‘Much of the material produced by faculty, such as datasets, experimental results and rich media data as well as more conventional document-based material (e.g. articles and reports) is housed on an individual’s hard drive or department Web server. Such material is often lost forever as faculty and departments change over time.’

Publishing Data & Analysis Is Changing Roles Authors Publishers Curators Archives Consumers Traditional Scientists Journals Libraries Archives Scientists Emerging Collaborations Project web site Data+Doc Archives Digital Archives Scientists

Data Publishing: The Background In some areas – notably biology – databases are replacing (paper) publications as a medium of communication In some areas – notably biology – databases are replacing (paper) publications as a medium of communication  These databases are built and maintained with a great deal of human effort  They often do not contain source experimental data - sometimes just annotation/metadata  They borrow extensively from, and refer to, other databases  You are now judged by your databases as well as your (paper) publications  Upwards of 1000 (public databases) in genetics

Data Publishing: The issues  Data integration  Tying together data from various sources  Annotation  Adding comments/observations to existing data  Becoming a new form of communication  Provenance  ‘Where did this data come from?’  Exporting/publishing in agreed formats  To other programs as well as people  Security  Specifying/enforcing read/write access to parts of your data

Interoperable Repositories?  Paul Ginsparg’s arXiv at Cornell has demonstrated new model of scientific publishing  Electronic version of ‘preprints’ hosted on the Web  David Lipman of the NIH National Library of Medicine has developed PubMedCentral as repository for NIH funded research papers  Microsoft funded development of ‘portable PMC’ now being deployed in UK and other countries  Stevan Harnad’s ‘self-archiving’ EPrints project in Southampton provides a basis for OAI-compliant ‘Institutional Repositories’  Many national initiatives around the world moving towards mandating deposition of ‘full text’ of publicly funded research papers in repositories

Microsoft Strategy for e-Science Microsoft intends to work with the scientific and library communities: Microsoft intends to work with the scientific and library communities:  to define open standard and/or interoperable high-level services, work flows and tools  to assist the community in developing open scholarly communication and interoperable repositories

Acknowledgements With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale