Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Science and Scientific Discovery New Approaches to Nature’s Complexity Dr. John Rumble President R&R Data Services Gaithersburg MD www.randrdata.com.

Similar presentations


Presentation on theme: "Data Science and Scientific Discovery New Approaches to Nature’s Complexity Dr. John Rumble President R&R Data Services Gaithersburg MD www.randrdata.com."— Presentation transcript:

1 Data Science and Scientific Discovery New Approaches to Nature’s Complexity Dr. John Rumble President R&R Data Services Gaithersburg MD www.randrdata.com John.rumble@randrdata.com

2 To understand scientific and technical data today, we must first understand how the information revolution has changed both Science and Data and their relationship 2 DC Data Science May 2012

3 My Talk 1.Science today 2.The Data Revolution in science 3.Scientific data and scientific databases 4.Data and scientific discovery 5.The challenges of using data science on scientific data 3 DC Data Science May 2012

4 Why Do We Do Science Two primary motivations for advancing science First is our insatiable thirst to understand the world– probably from when we started thinking Second is a direct result of the Industrial Revolution: How does the technology we are inventing actually work? 4 DC Data Science May 2012

5 21st Century Science From the fundamental to the complex – Determining the laws of nature for a few particles to understanding real systems - cells, the atmosphere, the Earth, ecology From reductionism to constructionism – Using our basic knowledge to make models and predict behavior of real systems – that is all systems we find in nature or that we can construct 5 DC Data Science May 2012

6 6 Science Vol. 336, p. 707 (2012)

7 7 DC Data Science May 2012 J. Schmitt et al, Science vol 336, p 708, 2012

8 Today’s Science and E-Science The Data Revolution has enabled E-Science through – Advanced telecommunications and networks – Computation power and storage – New algorithms for data management, visualization, analysis, and mathematics Today, E-Science can be done faster and more powerfully, and scientific communication can occur almost instantly The real revolution, however, is in the relationship between science and data 8 DC Data Science May 2012

9 9 When it hits the New York Times, you know it is for real!

10 Science and Data To understand scientific and technical data today, we must first understand how the information revolution has changed Science and Data and their relationship Science today is not about reduction to a few basic laws Science is about how do we understand and control all aspects of nature How is this done? – By careful measurement, accurate tests, keen observations, and powerful models and simulations that lead to scientific knowledge The results are expressed as scientific data! 10 DC Data Science May 2012

11 Scientific Knowledge What does this really mean? 11 DC Data Science May 2012

12 Scientific Knowledge What does this really mean? 12 DC Data Science May 2012 Recognize a new phenomenon Analyze its components Identify the variables that govern it Isolate the important variables Demonstrate understanding by control Change the phenomenon Scientific knowledge means understanding the independent variables governing a phenomenon and how they influence it

13 Science Today A major theme of science today is that we are able to make accurate measurements on a complex world that – Advance our understanding of nature, – Improve our ability to harness technology, And, in spite of many challenges, – Increase the importance of science to society in the future Scientific data are at the core of modern science 13 DC Data Science May 2012

14 My Talk 1.Science today 2.The Data Revolution in science 3.Scientific data and scientific databases 4.Data and scientific discovery 5.The challenges of using data science on scientific data 14 DC Data Science May 2012

15 The Data Revolution in Science 15 DC Data Science May 2012 Today, E-Science is real Computer at every desk Connectivity: The Internet/WWW explosion Computerized experiments and observations Database tools on every computer Electronic publications Model and simulation-based R&D Comprehensive databases Virtual libraries

16 Four Ways to Generate Scientific Data Observations Experiments Standardized testing Modeling and simulation 16 DC Data Science May 2012

17 Observational Science Today Today we have exciting new capability to observe nature in situ better than ever before – Hubble Space Telescope – High sensitivity seismographs – Bio-macromolecule sequencing instruments – LTER (Long-term ecological research) platforms – Earth-observing satellites – High power computers to analyze data Generates huge amounts of quality data 17 DC Data Science May 2012

18 Experimental Science Today 18 DC Data Science May 2012 Today we have exciting new capability to observe nature in controlled circumstances better than ever before – Atomic force microscopes – Micro-electronics and lasers – High energy accelerators – Femto-second chemical reactors – High power computers to analyze data Generates large amounts of high quality data

19 Testing Today Today we have new capability to test and analyze materials using standard methods – Electronic test equipment – Analytical databases fully integrated into equipment – Analyzing unknown substances – Carbon and other techniques dating objects – Genomic sequencing – National and international standard test procedures – Data analysis tools to generate properties – Self-calibrating instruments Generates medium amounts of high quality data 19 DC Data Science May 2012

20 Computation Today We now also have the ability to create a Virtual World  Models and simulations of complex systems  Techniques to do advanced mathematics  Computers to execute immense calculations  Visualization tools to examine our virtual world Uses and generates large amounts of data 20 DC Data Science May 2012

21 Characteristics of Approaches for Generating Scientific Data 21 DC Data Science May 2012

22 The Data Revolution in Science is Real Observation, experimentation, testing, and calculation all produce, and in some cases use, large amounts of data E-Science has provided an incredible array of tools, technologies, and methods to collect, store, manage, analyze, exploit, preserve, and disseminate these data Science today is more fully based on data and data collections than ever before! 22 DC Data Science May 2012

23 My Talk 1.Science today 2.The Data Revolution in science 3.Scientific data and scientific databases 4.Data and scientific discovery 5.The challenges of using data science on scientific data 23 DC Data Science May 2012

24 Scientific Data and Scientific Databases Data communicate measurement (experimental and observational) and computational results “When you can measure what you are speaking about, and express it in numbers, you know something about it; Lord Kelvin 24 DC Data Science May 2012

25 Types of Scientific Data Numbers Simple text Complex text Equations Graphs Diagrams Pictures Software Rules 25 DC Data Science May 2012 1, 2, 3… ABCs Greek, scripts, symbol E=mc 2

26 All Data Are Not the Same Measurement or property: There is a difference! Measurements are a one-time look at nature Properties are the inherent characteristics of nature – They are Nature Itself 26 DC Data Science May 2012

27 Measurements are for Today 27 DC Data Science May 2012 Measurements are what you see now Capture one point of view Usually limited number of variables changed  One of 1300 measurements of Diego Giacometti

28 Properties are Forever Properties are the real thing Need many repeated measurements Far too many substances and systems to determine properties Will never properties of everything 28 DC Data Science May 2012 The real Diego Giacometti

29 Scientific Knowledge Theories Models Hypotheses Questions Data Measure- ment The Classical Paradigm for Science and Data 29

30 Scientific Knowledge Theories Models Hypotheses Questions Data Measure- ment Data Collections The True data paradigm has always been this 30

31 Scientific Databases in History Preserved data collections (large and small) At first, simply data preservation Data was stored, but not really exploited 1.Accuracy 2.Comprehensiveness 3.Systematizing 31 DC Data Science May 2012

32 Accuracy Newgrange – Ireland 6000 years old Aligned to the rising sun in the winter solstice Depended on careful observational data on the rising sun One data point! 32 DC Data Science May 2012

33 Volume and Accuracy Improving Stonehenge 5000 years old Over 100 stones Complicated stone alignments Marks position of the moon and major stars as well as the sun Storage of several observations 33 DC Data Science May 2012

34 Comprehensive Data Sets 34 DC Data Science May 2012 Galen Greek physician Experimental physiologist Arabic copy from 800 AD Pictorial, descriptive, function describing Representative of botanical and animal catalogs

35 Systematizing a Comprehensive Collection 35 DC Data Science May 2012 Pliny the Elder Roman scholar Natural History (77 AD) One of earliest known encyclopedias of the natural world Systemization of data

36 My Talk 1.Science today 2.The Data Revolution in science 3.Scientific data and scientific databases 4.Data and scientific discovery 5.The challenges of using data science on scientific data 36 DC Data Science May 2012

37 Data and Scientific Discovery The advent of the Baconian Revolution –anchoring scientific understanding to physical observation Led to databases becoming the foundation of scientific discovery True Beginnings of Data Science! 37 DC Data Science May 2012

38 Scientific Databases in History Preserved data collections (large and small) form the foundation of scientific discovery Trends in data preservation and discovery 1.Accuracy 2.Comprehensiveness 3.Systematizing 4.Extraction of essence 5.Explanation of the complex 6.Prediction of new phenomena! 7.Physical theory from data! 38 DC Data Science May 2012

39 Extraction of Essence Tycho Brahae Late 16 th Century Danish Astronomer Made precise measurements that led to Kepler’s theories Led to discovery of simple relationships 39 DC Data Science May 2012

40 Explanation of the Complex 40 DC Data Science May 2012 Charles Darwin Combined with others in geology, zoology and botany A wide variety of facts and phenomena recorded Theory of Evolution had to explain many diverse observations and measurements from different disciplines

41 Prediction of New Phenomena 41 DC Data Science May 2012 Mendeleev and the Chemical Periodic Table Predicting properties of unknown elements from properties (data) of known elements

42 Physical Theory from Data Notes on the Spectral Lines of Hydrogen: Johann Jacob Balmer Annalen der Physik und Chemie 25 80-5 (1885) – “I gradually arrived at a formula which, at least for these four lines, expresses a law by which their wavelengths can be represented by striking precision…From the formula, we obtained for a fifth hydrogen line 3936.65x10-7 mm. “ The development of quantum mechanics Bohr 42 DC Data Science May 2012 Schrödinger

43 Brief History of Modern S&T Databases 1950sCrystal structures (software generated data -1960s Neutron data (modeling weapons) 1970s Analytical chemistry (identify chemicals) Thermochemistry (properties linked) Environmental and toxicology Large physics experiments Space science 1980s Astronomy Materials Earth sciences Biology Genomics 43 DC Data Science May 2012

44 Scientific Databases Today Preserving” Data is Easy Database management tools are inexpensive and powerful Many models for good interfaces exist Collecting data (data deposition) can be routine Expertise is easily available from many sources Building databases today is remarkably easy 44 DC Data Science May 2012

45 Comprehensive Data Collections for 21 st Century Science International Virtual Observatory Structural Genomics Proteomics Climate change Historic geologic Chemistry on demand Biodiversity Brain scans 45 DC Data Science May 2012 All observation for every point in the sky For all living things! 30,000 or 300,000? Water, earth, atmosphere and all they contain Many millennia, the entire planet 60 elements, 5 at a time, many ratios, 10 9 – 10 10 compounds 5M species? or 10M? or 50M? Every person, every thought forever Very large databases will be found in every scientific discipline

46 The Face of 21st Century Science Complex Multi-disciplinary Real systems Virtual as well as physical Access to quality data becomes critical Attention to the problems and challenges of long term preservation of and access to data becomes more important than ever! 46 DC Data Science May 2012

47 Scientific Discovery and Data Collections The Paradigm has Changed Yesterday Collections managed by a small number of people Collections readable by one scientist Collections interpretable by one person Discoveries made by thinking, with analysis by one person 47 DC Data Science May 2012 Today èCollections managed by groups èCollections not readable by any individual èCollections interpretable only with aid of software The Future èDiscoveries aided or made by computers, with verification by people?

48 The Proposition Scientific databases in the future will be even more important source for scientific discovery Data collections are critical for – New insights – New scientific principles – New knowledge – Understanding complex systems Let’s look at 3 problems and the challenges they present 48 DC Data Science May 2012

49 My Talk 1.Science today 2.The Data Revolution in science 3.Scientific data and scientific databases 4.Data and scientific discovery 5.The challenges of using data science on scientific data 49 DC Data Science May 2012

50 Three Problems in the Data Era 1.Too much data 2.Complex systems 3.Complex science 50 DC Data Science May 2012 Scientific Knowledge Theory Models Hypotheses Questions Data Measurement Data Collections Science and Data How the information revolution has changed their relationship

51 Scientific Data in the Future Problem 1: Too much data The Challenges How do you look at large volumes of data? What does data quality mean for large data collections? How do you determine which data are important? 51 DC Data Science May 2012

52 Challenge: Too Much Data Too much data for any one person to read or understand Can use Visualization Data reduction Anomalies and outliers How does anyone read a terabyte of data? Software must be used to “read” data Can we allow software to determine what are important data? 52 DC Data Science May 2012

53 Challenge: Too Much Data Do we have the technology to handle the overwhelming volume of data from new measurement techniques? What to capture when we generate too much data too fast? How to store, represent, manipulate and display too voluminous data? How to find out which data are important? 53 DC Data Science May 2012

54 Challenge: Too Much Data and Data Quality Evaluating data quality How can large amounts of data be evaluated? In real time? As new data are published? How can large data sets be integrated together correctly? What does quality mean in a terabyte of data? For Each data point? Each set of points? Sub-collections? An entire collection? 54 DC Data Science May 2012

55 Challenge: Too Much Data and Data Quality Evaluating data quality Bad data quality leads to bad science and bad decisions based on science One measurement does not make a property Agreement between theory and experiment does not mean both are correct In today’s world with terabytes of data, what does quality mean? 55 DC Data Science May 2012

56 Challenge: Modeling and Data Quality Data Quality Making accurate virtual measurements on virtual systems What is the quality in a calculation? How do you establish uncertainty for a calculation? Which computational results should be stored, and how can those data be handled? How do you discover something new in a mass of computational results? 56 DC Data Science May 2012 Some models have mechanisms for assessing quality HΨ = EΨ Schrödinger equation The variational principle applies only to energy Quality of other properties calculated from the equation is unknown

57 Challenge: Modeling and Data Quality Documenting Quality of a calculation Making accurate virtual measurements on virtual systems How do you establish uncertainty for each step of a calculational result? Science Vol. 336 pp. 159-160 (2012): Software created by public funding must be released, just as with data themselves 57 DC Data Science May 2012 1.Model assumptions (which ind. Var. used) 2.Translation into algorithms 3.Coding 4.Input 5.Finite arithmetic 6.Post-processing analysis

58 Challenge: With Too Much Data, What is Important? When you have a lot of data, what can you do with it? Abstraction of important features How can we find what is important when we have too much data? Or not enough of the correct data? Truly great science is having the insight of what is important Can we teach software how to do that? 58 DC Data Science May 2012 80 trillion cells in body Number of human proteins is estimated to be 30,000 – 70,000 Which are important and why? At least 150 proteins repair DNA damage

59 Scientific Data in the Future Problem 2: Real systems are very complex The Challenges Large number of objects Large number of independent variables Changing scientific language Data Integration 59 DC Data Science May 2012

60 Complexity Challenge: Many Objects There are too many objects to count, observe, measure, or calculate Number of stars Number of species Number of chemicals Number of individuals Number of rocks Number of cells Number of thoughts Number of ecosystems You get the point 60 DC Data Science May 2012

61 Complexity Challenge: Independent Variables How do we use metadata (report the relevant independent variables) to describe what we preserve? Independent variables are a quantitative mechanism for expressing our knowledge about how and why a phenomenon occurs Capturing complete knowledge of independent variables requires a large or (perhaps) even an impossible amount of data One goal of research is to understand which variables are important and why Our knowledge clearly evolves over time 61 DC Data Science May 2012 P=(n/V)RT (Ideal gas law) Dependent variable P=pressure Independent variables n/V = number/volume=density T = temperature

62 Complexity Challenge: Independent Variables Major challenge in data collections is to capture evolution of knowledge of independent variables Must be done in a way as to preserve data set compatibility Let’s work through a quick examples of the complexity and how knowledge changes with time 62 DC Data Science May 2012 Most data have numerous independent variables they are functions of

63 Brain Imaging Recording techniques evolve and improve over time – X-ray, CT, MRI, PET, next? Each technology individually evolves, as do the types of signals collected, their association with brain activity and region Monitoring reactions to stimulus: pain, visual, auditory, tactile, etc. Details must be defined and recorded 63 DC Data Science May 2012

64 Brain History If we imagine the details necessary to describe this, the number of independent variables expand rapidly – Stimuli history – Physiological history – Developmental history – Environmental exposures – Drugs taken – More As with the development of unifying theories of the gross physical world – motion, evolution, chemistry, genetics - the details are necessary to find the dominant factors What are the most important independent variables for recording brain history? Still an open question! 64 DC Data Science May 2012

65 Complexity Challenge: Independent Variables Modern science requires data from many disciplines If we must aggregate different data sets (e.g., over the Web) to do discovery, how do we know data are comparable? How do we integrate data sets with varying numbers of independent variables? Especially if their names and meaning change over time? 65 DC Data Science May 2012

66 Complexity Challenge: Evolving Language These are powerful change factors that cannot be ignored in preserving data How do languages evolve? Contractions of words Reordering sentences Borrowing words Dropping and adding startings and endings Differentiation of concepts Evolution of concepts John McWhorter – The Power of Babel Ontologies can help 66 DC Data Science May 2012

67 Complexity Challenge: Evolving Language Time and Scientific Language Grammar rules appeared only a few hundred years ago Language change factors ignore authority Usage wins over regulations every time! – Are terminology standards actually used? Are efforts such as that on the right doomed to fail? 67 DC Data Science May 2012

68 Complexity Challenge: Evolving Language Time and the evolution of Scientific Language New knowledge requires new language Data preservation efforts must recognize evolution of scientific language Not just independent variables and metadata – the scientific language itself So if you are going to do “discovery,” you’d better know what you are working with 68 DC Data Science May 2012

69 Complexity Challenge: Data Integration Developing standards for scientific data and metadata 69 DC Data Science May 2012 What is the business case for such standards? How can you standardize scientific language if it continues to evolve? How can you determine object equivalency and uniqueness with partial data sets? How do you persuade scientists to back off the state-of-the-art to agree on standards?

70 Complexity Challenge: Data Integration Data Standards: Making exploitation of large data sets possible 70 DC Data Science May 2012 What standards are needed for making data sets work together? How can you trust integrated data sets? No science is an island by itself Science today is multi- discipline, international, multi- lingual, ever-changing Integration can be achieved by standards and clear reporting of measurements As knowledge of variables increases, integrating old and new data becomes more difficult

71 Complexity Challenge: Data Integration Data Ownership We must differentiate between discovery and adding value Observing nature should not lead to data ownership Transforming observations through value-added intellectual effort can create IPR For scientific data, must be very careful not to restrict use by others The same observations led to many different theories of planetary motion – from Aristotle and Ptolemy to Kepler to Newton to today 71 DC Data Science May 2012

72 Complexity Challenge: Data Integration Maintaining full and open access to the large number of databases required for making new scientific discoveries 72 DC Data Science May 2012 What policies are needed for full and open access? Open access aims to provide everyone with the information and data to advance science Open is not necessarily free Long term preservation does cost money – Data and literature collections must be supported How can discoverers profit from their automated discoveries? How do you get the information industry to understand the new paradigm for discovery?

73 Complexity Challenge: Data Integration Data Costs Nothing is ever really free! It costs significant money to generate, capture, manage, store, analyze, use, disseminate, and preserve scientific data Data costs must be integrated into the cost of generation Policy and practice will vary from discipline to discipline, but nothing is ever free 73 DC Data Science May 2012

74 Complexity Challenge: Data Integration Data Repositories Started with crystal structure Genomics Other disciplines following NSF now requiring data management plans Often required to publish papers Curation (everything reported correctly) now automated Model does not translate easily for evolving fields 74 DC Data Science May 2012

75 Complexity Challenge: Data Integration Progression of Data Collection Individual Collegial Institution or discipline repository Evaluated data “Property values” Each step requires more metadata to provide adequate documentation Very difficult to add metadata after the fact For new phenomenon, difficult to know what Ind. Var. are necessary 75 DC Data Science May 2012

76 Scientific Discovery in Preserved Data Problem 3: Real systems are very complex and complex behavior in systems is difficult to find The Challenges How do we recognize real understanding? What is knowledge discovery in the future? 76 DC Data Science May 2012

77 Challenge: Real Understanding Real systems are very complex How can you identify the existence of a unifying theory or concept? Could we have derived quantum mechanics from a complete database of atomic and molecular spectra? What features does quantum mechanics have beyond these data? 77 DC Data Science May 2012

78 Challenge: Real Understanding Real systems are very complex Multiple views of the same phenomena exist The Simple (?) Laws of Interaction String theory Quantum theory Matrix mechanics Maxwell’s theory Quantum electrodynamics Newton’s laws of motion Are all views of nature equally discoverable? By computer-aided discovery? 78 DC Data Science May 2012

79 Challenge: Real Understanding How do we develop real understanding? Real Scientific knowledge? Just because we measure a phenomenon do not mean we understand it – Do we know how many genes there? – Does measuring the mass of the universe makes us understand dark matter? How does data lead to understanding? 79 DC Data Science May 2012

80 Challenge: Real Understanding Knowledge Discovery Large amounts of data can help find new discoveries How to know which data are the most important, the key to discovery Hoe to know something is there to be discovered? Can too much data make discovery more difficult? Will/Can discovery have to be automated? 80 DC Data Science May 2012

81 Data Collections in 21 st Century Science The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them William Bragg 81 DC Data Science May 2012

82 Some Final Thoughts Scientific databases in the future will be even more important source for scientific discovery Preservation of data needed for – New insights – Scientific principles – New knowledge – Understanding complex systems The problems and challenges I have just outlined are not insurmountable – just problems and challenges 82 DC Data Science May 2012

83 Some Final Thoughts Science has changed and with that change, our expectations for science have changed. We now expect science to be a force for shaping the future, not just understanding nature Scientific databases in the future will be even more important source for scientific discovery The Data Revolution has become an enabling force to meet our expectations for 21 st Century Science 83 DC Data Science May 2012


Download ppt "Data Science and Scientific Discovery New Approaches to Nature’s Complexity Dr. John Rumble President R&R Data Services Gaithersburg MD www.randrdata.com."

Similar presentations


Ads by Google