E-Science: Stuart Anderson, National e-Science Centre


Cool White Dwarves

Issues 1
Astronomers are looking for:
- Many objects in globular clusters
- Very faint objects
- Observations of many locations
But the observations are noisy:
- Artifacts created by the sensor technology, scanning and digitizing.
- Junk in orbit, e.g. satellite tracks.
Computer Science can help:
- Pattern recognition, computational learning, data mining.
- But: astronomers are more picky.

Cool Dwarves are faint and close
The sky is full of faint objects. Cool White Dwarves are close, so they move about relative to the background stars. The illustrated observations cover a period of 30 years. We need to match up very faint objects observed by different equipment at different times.
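
The matching problem on this slide can be illustrated with a small sketch. It is not the method used on the Super Cosmos data: it is a plain nearest-neighbour cross-match, assuming each detection is a hypothetical (identifier, RA, Dec) tuple in degrees and that an upper bound on plausible motion per year is supplied by the astronomers' model. All names and numbers below are made up for illustration.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Small-angle separation in arcseconds between two positions given in degrees."""
    d_ra = (ra1 - ra2) * math.cos(math.radians((dec1 + dec2) / 2.0))
    d_dec = dec1 - dec2
    return math.hypot(d_ra, d_dec) * 3600.0

def match_epochs(epoch_a, epoch_b, years_apart, max_motion_arcsec_per_year):
    """Pair each detection in epoch_a with its nearest detection in epoch_b.

    A pair is kept only if the implied motion stays within the allowed rate.
    Returns a list of (id_a, id_b, motion_in_arcsec_per_year).
    """
    max_sep = max_motion_arcsec_per_year * years_apart
    matches = []
    for id_a, ra_a, dec_a in epoch_a:
        best = None
        for id_b, ra_b, dec_b in epoch_b:
            sep = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= max_sep and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] / years_apart))
    return matches

# Hypothetical detections 30 years apart: (identifier, RA in degrees, Dec in degrees).
plate_early = [("a1", 10.0000, -30.0000), ("a2", 10.0100, -30.0000)]
plate_late = [("b1", 10.0000, -30.0000), ("b2", 10.0100, -29.9975)]

pairs = match_epochs(plate_early, plate_late, years_apart=30.0,
                     max_motion_arcsec_per_year=1.0)
for id_a, id_b, motion in pairs:
    print(id_a, id_b, round(motion, 3))
```

In this toy input the first object does not move and the second shifts by about 9 arcseconds over 30 years, which is the kind of signature a nearby object would show against background stars. A real pipeline would also have to cope with noise, missing detections and differing equipment, none of which this sketch handles.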

Issues 2
- Astronomers have a model of how luminous CWDs are that predicts how distant they are and hence how they move over time.
- We can use computational learning (aka data mining) to recognize CWDs provided we have a model that allows tractable learning.
- We can use the model to create training cases for various learning techniques.
- Astronomers also want to observe the same objects at different wavelengths.
- Models of objects can be used as a basis for data mining to link observations.

Problem Scale
- Cosmos (old technology): megabytes per plate.
- Super Cosmos (current technology): gigabytes per plate.
- Cosmos and Super Cosmos use 1m telescope images.
- Vista (new technology): imaging in visible and infrared using digital detectors, 4m telescope, terabytes per night.
- Sky surveys look at the large-scale structure of space, so many images are involved, e.g. to estimate the density of CWDs in the galaxy.

E-Science and Old Science
- Computational models have been used for many years.
- e-Science systems will include vast collections of observed data.
- Scientific models are the essential organizing principle for data in such systems.
- Currently we are hand-crafting models that organise subsets of the data (e.g. CWDs).
- Can we create experimental environments that allow scientists to create new models of phenomena and test them against data?

Data, Information and Knowledge
Much Grid work identifies a three-layer architecture for data:
- Data is the raw data acquired from sensors (e.g. telescopes, microscopes, particle detectors).
- Information is created when we “clean up” data to eliminate artifacts of the collection process.
- Knowledge is information embedded within an interpretive framework.
Science provides strong interpretive frameworks.
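
The three layers can be pictured as two small transformations. This is only a sketch under stated assumptions: the median-based cleaning rule, the "faint"/"bright" labels and the threshold values are hypothetical stand-ins for a real cleaning step and a real scientific model.

```python
from statistics import median

def to_information(raw_readings, max_deviation):
    """Data -> information: drop readings that look like collection artifacts."""
    centre = median(raw_readings)
    return [r for r in raw_readings if abs(r - centre) <= max_deviation]

def to_knowledge(information, faint_threshold):
    """Information -> knowledge: interpret cleaned values with a (trivial) model."""
    mean = sum(information) / len(information)
    label = "faint" if mean < faint_threshold else "bright"
    return {"mean_brightness": mean, "classification": label}

# Hypothetical raw sensor readings; 250.0 plays the role of an artifact.
raw = [10.1, 9.9, 10.0, 250.0, 10.2]
info = to_information(raw, max_deviation=1.0)
knowledge = to_knowledge(info, faint_threshold=50.0)
print(info)
print(knowledge)
```

The point of the sketch is the separation of concerns: the cleaning step knows only about the collection process, while the interpretation step is where a scientific model would plug in.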

Pattern: More science “in silico”
- Improved sensors, more sensors, huge increase in data volume.
- Need to “clean”, “mine” and structure data.
- Support complex models and large-scale data collections inside the computer(s).
- Support for flexible model development and for using models to organise and access data.
- E.g. in databases: spatial organisation, temporal organisation and support for queries exploiting that structure. Useful for Geoscience?
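
As one possible illustration of "queries exploiting that structure", the sketch below buckets observations into a uniform spatial grid so that a box query inspects only nearby cells, then filters on time. The class name `GridIndex`, the cell size and the sample observations are all hypothetical; production systems would more likely use a database's own spatial and temporal indexing.

```python
from collections import defaultdict

class GridIndex:
    """Uniform grid over (x, y); each cell holds the observations that fall in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, t, payload):
        self.cells[self._cell(x, y)].append((x, y, t, payload))

    def query(self, x_min, x_max, y_min, y_max, t_min, t_max):
        """Return payloads inside the spatial box and the time window."""
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        hits = []
        # Only cells overlapping the box are visited, not the whole collection.
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                for x, y, t, payload in self.cells.get((cx, cy), []):
                    in_box = x_min <= x <= x_max and y_min <= y <= y_max
                    if in_box and t_min <= t <= t_max:
                        hits.append(payload)
        return hits

index = GridIndex(cell_size=1.0)
index.insert(0.5, 0.5, 1995, "obs-A")
index.insert(0.7, 0.6, 2001, "obs-B")
index.insert(5.2, 3.1, 1995, "obs-C")

result = index.query(0.0, 1.0, 0.0, 1.0, 1990, 1999)
print(result)
```

Here the spatial structure prunes the search and the temporal condition is applied as a filter; indexing time as a third grid dimension would be an equally reasonable choice.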

Credits
- Cosmos, Super Cosmos and Vista are projects looking at the large-scale structure of the cosmos, based at the Royal Observatory Edinburgh (RoE).
- Chris Williams, Bob Mann and Andy Lawrence are working on using computational learning to analyse Super Cosmos data at RoE.
- Andy Lawrence is director of the AstroGrid project, a major UK contribution to the international “Virtual Observatory” that will federate the world’s major astronomical data assets.

Whither Data Management?
- Scientific data is not particularly well behaved. In particular, it does not fit the relational model particularly well.
- We need new data models that are better suited to the needs of science (and everyone else too!).
- The model should attempt to support the work of scientists effectively. Current data models are not particularly useful.

Curated Databases
Useful scientific databases are often curated: they are created/maintained with a great deal of “manual” labour.
[Figure: “Database people’s idea of what happens” (a query of the form select xyz from pqr where abc) contrasted with “What really happens” between DB1 and DB2.]

Inter-dependence is Complex
[Figure: dependencies among GERD, TRRD, GenBank, Swissprot, EpoDB, TransFac, GAIA and BEAD.]
A few of the 500 or so public curated molecular biology databases.

Issues in Curated Databases
- Data integration (always a problem). Need to deal with schema evolution.
- Data provenance. How do you track data back to its source? (This information is typically lost.)
- Data annotation. How should annotations spread through this network?
- Archiving. How do you keep all the archives when you are “publishing” a new database every day?
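
The provenance question can be made concrete with a toy sketch in which a copy operation extends a record's history instead of discarding it. This is an assumption-laden illustration, not a description of how any of the databases named in the talk work: the record layout, the function names and the version labels are hypothetical, and "DB1"/"DB2" are just the placeholder names from the earlier slide.

```python
def new_record(value, source_db, version):
    """Create a record whose provenance starts at the database that produced it."""
    return {"value": value, "provenance": [(source_db, version)]}

def copy_record(record, target_db, version):
    """Copy a record into another database, appending a step to its history."""
    return {
        "value": record["value"],
        "provenance": record["provenance"] + [(target_db, version)],
    }

def origin(record):
    """Track a record back to its original source."""
    return record["provenance"][0]

entry = new_record("entry-42", "DB1", "v1")
copied = copy_record(entry, "DB2", "v7")

print(origin(copied))
print(copied["provenance"])
```

Because the history travels with the value, the copy in DB2 can still be traced back to DB1, which is exactly the information the slide says is typically lost.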

Archiving
- Some recent results on efficient archiving (Buneman, Khanna, Tajima, Tan).
- OMIM (On-line Mendelian Inheritance in Man) is a widely used genetic database. A new version is released daily.
- Bottom line: we can archive a year of versions of OMIM with less than 15% more space than the most recent version.
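
One way to see why many versions can cost little extra space is to store each keyed value once, stamped with the range of versions in which it holds. The sketch below is a much simplified, flat illustration of that idea, assuming every entry has a stable key; it is not the algorithm from the cited work, and the class name `Archive` and the sample snapshots are hypothetical.

```python
class Archive:
    """Stores every version of a keyed dataset, sharing unchanged values."""

    def __init__(self):
        # key -> list of [value, first_version, last_version]
        self.entries = {}
        self.latest = 0

    def add_version(self, snapshot):
        """Merge a new snapshot (a dict) into the archive; returns its version number."""
        self.latest += 1
        v = self.latest
        for key, value in snapshot.items():
            runs = self.entries.setdefault(key, [])
            if runs and runs[-1][0] == value and runs[-1][2] == v - 1:
                runs[-1][2] = v              # unchanged: just extend the interval
            else:
                runs.append([value, v, v])   # new or changed: store the value once
        return v

    def retrieve(self, version):
        """Reconstruct the snapshot as it was at the given version."""
        snapshot = {}
        for key, runs in self.entries.items():
            for value, first, last in runs:
                if first <= version <= last:
                    snapshot[key] = value
        return snapshot

    def stored_values(self):
        return sum(len(runs) for runs in self.entries.values())

v1 = {"k1": "alpha", "k2": "beta"}
v2 = {"k1": "alpha", "k2": "beta-revised", "k3": "gamma"}
v3 = {"k1": "alpha", "k3": "gamma"}

archive = Archive()
for snapshot in (v1, v2, v3):
    archive.add_version(snapshot)

print(archive.retrieve(2))
print(archive.stored_values())
```

The three snapshots contain seven values in total, but the archive stores only four, since values that persist across versions are kept once. When most of a database is unchanged from one daily release to the next, this kind of sharing is what keeps the archive close in size to a single version.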

A Sequence of Versions

“Pushing” time down
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

The final result (for the randomly selected data)
Predicted expansion for a year’s archive: < 15%

Summary: technical issues
Why and where:
- better characterization of where (new ideas needed)
- negation/aggregation
Keys:
- inference rules for relative keys
- foreign key constraints
- interaction between keys and DTDs/types
Types for deterministic model (and other models).
Annotation
Temporal QLs and archives

Pattern: Better support for work
- Data is increasingly complex and interdependent.
- “Curating” the data is continuous, and involves international effort to increase the scientific value of the data.
- Understanding the way we work with data is the key to providing adequate support for that work.
- Deeper support for projects working across the globe.

Credits
These issues are being addressed by Peter Buneman at Edinburgh. Peter has recently joined Informatics and NeSC. He has worked for a number of years on Digital Libraries and Biological Data Management.