Microsoft “information at your fingertips” for scientists Collaborating with Scientists to build better ways to organize, analyze, and understand.

Slides:



Advertisements
Similar presentations
Closing the Gap Between Global Environmental Sensing Needs and Cyber Infrastructure Tools Jim Gray Jeff Burch Mark Ellisman Miron Livny David Maidment.
Advertisements

Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.
Trying to Use Databases for Science Jim Gray Microsoft Research
Online Science -- The World-Wide Telescope Archetype
World Wide Telescope mining the Sky using Web Services Information At Your Fingertips for astronomers Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
1 Online Science the New Computational Science Jim Gray Microsoft Research Alex Szalay Johns Hopkins.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research Talk at
1 Online Science -- The World-Wide Telescope as an Archetype Jim Gray Microsoft Research Collaborating with: Alex Szalay, Peter Kunszt, Ani
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
The Australian Virtual Observatory e-Science Meeting School of Physics, March 2003 David Barnes.
Astronomy Data Bases Jim Gray Microsoft Research.
Scientific Collaborations in a Data-Centric World Alex Szalay The Johns Hopkins University.
The Data Lifecycle and the Curation of Laboratory Experimental Data Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
Virtual Observatory & Grid Technique ZHAO Yongheng (National Astronomical Observatories of China) CANS2002.
Long-Term Preservation of Astronomical Research Results Robert Hanisch US National Virtual Observatory Space Telescope Science Institute Baltimore, MD.
A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.
Data-Intensive Computing in the Science Community Alex Szalay, JHU.
The aims of SC4DEVO and SC4DEVO-1 Bob Mann Institute for Astronomy and National e-Science Centre, University of Edinburgh.
Online Data Archives for Use In the Classroom Robert Sparks (FNAL/The Prairie School) GLAST/SEU Educator Ambassador Workshop.
Astro-DISC: Astronomy and cosmology applications of distributed super computing.
Requirements from astronomy in the Virtual Observatory era Bob Mann Institute for Astronomy & NeSC University of Edinburgh.
Aus-VO: Progress in the Australian Virtual Observatory Tara Murphy Australia Telescope National Facility.
Long-Term Preservation of Astronomical Research Results Robert Hanisch US National Virtual Observatory Space Telescope Science Institute Baltimore, MD.
eScience -- A Transformed Scientific Method"
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.
Sky Surveys and the Virtual Observatory Alex Szalay The Johns Hopkins University.
Supported by the National Science Foundation’s Information Technology Research Program under Cooperative Agreement AST with The Johns Hopkins University.
Big Data in Science (Lessons from astrophysics) Michael Drinkwater, UQ & CAASTRO 1.Preface Contributions by Jim Grey Astronomy data flow 2.Past Glories.
Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.
1 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Introduction The Cluster Environment The Distance Machine Framework Scales The.
Astronomical data curation and the Wide-Field Astronomy Unit Bob Mann Wide-Field Astronomy Unit Institute for Astronomy School of Physics University of.
Science with the Virtual Observatory Brian R. Kent NRAO.
1 Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray Alex Szalay SLAC Data Management Workshop.
1 The World Wide Telescope an Archetype for Online-Science Jim Gray (Microsoft) Alex Szalay (Johns Hopkins University) Microsoft Academic Days in Silicon.
Public Access to Large Astronomical Datasets Alex Szalay, Johns Hopkins Jim Gray, Microsoft Research.
Fourth Paradigm Science-based on Data-intensive Computing.
The Data Avalanche Jim Gray Microsoft Research Talk at HP Labs/MSR: Research Day July 2004.
Lewis Shepherd GOVERNMENT AND THE REVOLUTION IN SCIENTIFIC COMPUTING.
Instrumentation of the SAM-Grid Gabriele Garzoglio CSC 426 Research Proposal.
Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Reserach Alexander Szalay, JHU Jim Gray, Microsoft Reserach.
Building BIG Data Servers on the Web Jim Gray Microsoft Research Talk at Flash Mob Supercomputer.
EScience May 2007 From Photons to Petabytes: Astronomy in the Era of Large Scale Surveys and Virtual Observatories R. Chris Smith NOAO/CTIO, LSST.
The Sloan Digital Sky Survey ImgCutout: The universe at your fingertips Maria A. Nieto-Santisteban Johns Hopkins University
1 Computing Challenges for the Square Kilometre Array Mathai Joseph & Harrick Vin Tata Research Development & Design Centre Pune, India CHEP Mumbai 16.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
NVO Review -- San Diego Jan The VO compared to Other O‘s Jim Gray Microsoft T HE US N ATIONAL V IRTUAL O BSERVATORY.
Sky Survey Database Design National e-Science Centre Edinburgh 8 April 2003.
Real Web Services Jim Gray Microsoft Research 455 Market St, SF, CA, Talk at Charles Schwab.
German Astrophysical Virtual Observatory Overview and Results So Far W. Voges, G. Lemson, H.-M. Adorf.
AstroGrid NAM 2001 Andy Lawrence Cambridge NAM 2001 Andy Lawrence Cambridge Belfast Cambridge Edinburgh Jodrell Leicester MSSL.
1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Entering the Data Era; Digital Curation of Data-intensive Science…… and the role Publishers can play The STM view on publishing datasets Bloomsbury Conference.
Statistics in WR: Lecture 1 Key Themes – Knowledge discovery in hydrology – Introduction to probability and statistics – Definition of random variables.
1 Where The Rubber Meets the Sky Giving Access to Science Data Jim Gray Microsoft Research Alex.
Microsoft Research San Francisco (aka BARC: bay area research center) Jim Gray Researcher Microsoft Research Scalable servers Scalable servers Collaboration.
HELIO: Discovery and Analysis of Data in Heliophysics Robert Bentley, John Brooke, André Csillaghy, Donal Fellows, Anja Le Blanc, Mauro Messerotti, David.
E-Science and Cyberinfrastructure Tony Hey Corporate VP for Technical Computing Microsoft Corporation.
Enhancements to Galaxy for delivering on NIH Commons
How much information? Adapted from a presentation by:
R For The SQL Developer Kevin Feasel Manager, Predictive Analytics
Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Jim Gray Alex Szalay SLAC Data Management Workshop
BARC Scaleable Servers
Rick, the SkyServer is a website we built to make it easy for professional and armature astronomers to access the terabytes of data gathered by the Sloan.
Jim Gray Researcher Microsoft Research
Jim Gray Microsoft Research
Google Sky.
Presentation transcript:

Microsoft “information at your fingertips” for scientists Collaborating with Scientists to build better ways to organize, analyze, and understand information. Jim Gray Microsoft Research 2 March 2006

Collaborating with Scientists to build better ways to organize, analyze, and understand information. Digital Library & Peer Review TerraServer SkyServer AIDS – Intelligent vaccine design Others: Eco/Hydro/Oceanograpic/

Science Paradigms Thousand years ago: science was empirical describing natural phenomena Last few hundred years: theoretical branch using models, generalizations Last few decades: a computational branch simulating complex phenomena Today: data exploration (eScience) unify theory, experiment, and simulation using data management and statistics –Data captured by instruments Or generated by simulator –Processed by software –Scientist analyzes database / files

Experiment Budgets ¼…½ Software Software for Instrument scheduling Instrument control Data gathering Data reduction Database Analysis Visualization Millions of lines of code Repeated for experiment after experiment Not much sharing or learning Let’s work to change this Identify generic tools Workflow schedulers Databases and libraries Analysis packages Visualizers …

Next-Generation Data Analysis Looking for –Needles in haystacks – the Higgs particle –Haystacks: dark matter, dark energy, turbulence, ecosystem dynamics Needles are easier than haystacks Global statistics have poor scaling –Correlation: N 2, likelihood: N 3 –N Log(N) is most we can consider A way out? –Relax optimal notion (data is fuzzy, answers are approximate) –Limited compute & memory Requires combination of statistics & computer science

Data Access Hitting a Wall Current science practice based on data download (FTP/GREP) Will not scale to the datasets of tomorrow You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years. Oh!, and 1PB ~5,000 disks At some point you need indices to limit search parallel data search and analysis This is where databases can help You can FTP 1 MB in 1 sec You can FTP 1 GB / min (~1$) … 2 days and 1K$ … 3 years and 1M$

What X-info Needs from us (cs) (not drawn to scale) Science Data & Questions Scientists Database To store data Execute Queries Systems Data Mining Algorithms Miners Question & Answer Visualization Tools

Collaborating with Scientists to build better ways to organize, analyze, and understand information. Digital Library & Peer Review TerraServer SkyServer AIDS – Intelligent vaccine design Others: Eco/Hydro/Oceanograpic/

Literature Is Coming Online Agencies and Foundations mandating research be public domain. –NIH (30 B$/y, 40k PIs,…) (see –Welcome Trust –Japan, China, Italy, South Africa,.… –Public Library of Science.. Other agencies will follow NIH Publishers will resist (not surprising) Professional societies will resist (amazing!)

How Does the New Library Work? Who pays for storage access? (unfunded mandate). –Its cheap: 1 milli-dollar per access But… curation is not cheap: –Author/Title/Subject/Citation/….. –Dublin Core is great but… –NLM has a 6,000-line XSD for documents –Need to capture document structure from author Sections, figures, equations, citations,… Automate curation –NCBI-PubMedCentral is doing this Preparing for 1M articles/year MUST be automatic.

Portable PubMedCentral “Information at your fingertips” Helping build PortablePubMedCentral Deployed US, China, England, Italy, South Africa, (Japan soon). Each site can accept documents Archives replicated Federate thru web services Working to integrate Word/Excel/… with PubmedCentral – e.g. WordML, XSD, To be clear: NCBI is doing 99% of the work.

Conference Management Tool Currently support a conference peer-review system (~300 conferences) –Form committee –Accept Manuscripts –Declare interest/recuse –Review –Decide –Form program –Notify –Revise

eJournal Management Tool Add publishing steps –Form committee –Accept Manuscripts –Declare interest/recuse –Review –Decide –Form program –Notify –Revise –Publish Connect to Archives Manage archive document versions Capture Workshop presentations proceedings Capture classroom ConferenceXP Moderated discussions of published articles

Why Not a Wicki? Peer-Review is –It is very structured –It is moderated –There is a degree of confidentiality Wicki is egalitarian –It’s a conversation –It’s completely transparent Don’t get me wrong: –Wicki’s are great –SharePoints are great –But.. Peer-Review is different. –And, incidentally: review of proposals, projects,… is more like peer-review.

Collaborating with Scientists to build better ways to organize, analyze, and understand information. Digital Library & Peer Review TerraServer SkyServer AIDS – Intelligent vaccine design Others: Eco/Hydro/Oceanograpic/

KVM / IP TerraServer – Got us into GeoSpatial Photo of US Was big data (Terabytes) Is high traffic (Mega hits/day) Pioneered tiled access to database Scalable servers 1996…2006 Now LOTS of continuing research in spatialhttp://local.live.com/

World Wide Telescope Virtual Observatory Premise: Most data is (or could be online) So, the Internet is the world’s best telescope: –It has data on every part of the sky –In every measured spectral band: optical, x-ray, radio.. –As deep as the best instruments (2 years ago). –It is up when you are up. The “seeing” is always great (no working at night, no clouds no moons no..). –It’s a smart telescope: links objects and data to literature on them.

Why Astronomy Data? It has no commercial value –No privacy concerns –Can freely share results with others –Great for experimenting with algorithms It is real and well documented –High-dimensional data (with confidence intervals) –Spatial data –Temporal data Many different instruments from many different places and many different times Federation is a goal There is a lot of it (petabytes) IRAS 100  ROSAT ~keV DSS Optical 2MASS 2  IRAS 25  NVSS 20cm WENSS 92cm GB 6cm

Time and Spectral Dimensions The Multiwavelength Crab Nebulae X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese Astronomers. Slide courtesy of Robert CalTech. Crab star 1053 AD

SkyServer.SDSS.org A modern archive –Access to Sloan Digital Sky Survey Spectroscopic and Optical surveys –Raw Pixel data lives in file servers –Catalog data (derived objects) lives in Database –Online query to any and all Also used for education –150 hours of online Astronomy –Implicitly teaches data analysis Interesting things –Spatial data search –Client query interface via Java Applet –Query from Emacs, Python, …. –Cloned by other surveys (a template design) –Web services are core of it.

Demo of SkyServer Shows standard web serverShows standard web server Pixel/image dataPixel/image data Point and clickPoint and click Explore one objectExplore one object Explore sets of objects (data mining)Explore sets of objects (data mining)

SkyQuery ( Distributed Query tool using a set of web services Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England) Has grown from 4 to 15 archives, now becoming international standard WebService Poster Child Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2

SkyServer/SkyQuery Evolution MyDB and Batch Jobs Problem: multi-step data analysis Solution: Allow personal databases on portal Problem: some queries are monsters Solution: “Batch schedule” on portal. Deposits answer in personal database.

Rational Vaccine Design for AIDS HIV mutates rapidly so immense genetic diversity. Vaccine must be able to deal with this diversity. –recognize many DNA sequences (epitopes) on virus surface Rational design: finding cocktails with best epitope coverage –Use sample HIV strains from multiple patients –Compactly encode as many epitopes (or likely epitopes) as possible –Collaboration with Jim Mullins at University of Washington and Simon Mallal, Royal Perth Hospital, Australia David Heckerman

Toy Vaccine Example Observed sequences RGGKLD --- ERFAVN --- RGEV --- KGEKLD --- DRFALN --- KEDL --- KGEKLD --- ERFAVN --- KEEV --- RGGKLD --- DRFALN --- RGDL --- SGGKLD --- ERFAVN --- SGEV --- KGEKLD --- ERFAVN --- KEEV --- RGGKLD --- ERFAVN --- RGEV --- SGGKLD --- ERFAVN --- SGEV --- KGEKLD --- ERFAVN --- KEEV Type of model Number of errors Two useful patterns cover all cases --- RGGKLD --- ERFAVN --- Only in SGG… patterns - 2 errors --- KGEKLD --- DRFALN ---

Other Collaborations Cornell FEA MSR TR , JHU Sensors Berkeley Hydrology U. Washington Oceanography Agro Remote Sensing Others are developing.

Collaborating with Scientists to build better ways to organize, analyze, and understand information. Digital Library & Peer Review TerraServer SkyServer AIDS – Intelligent vaccine design Others: Eco/Hydro/Oceanograpic/