The Data Avalanche Jim Gray Microsoft Research Talk at HP Labs/MSR: Research Day July 2004.

Presentation transcript:

The Data Avalanche Jim Gray Microsoft Research Talk at HP Labs/MSR: Research Day July 2004

How much information is there?
Almost everything is recorded digitally.
Most bytes are never seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk: How much information is there
See Lyman & Varian: How much information
[Figure: scale of sizes from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, marking A Photo, A Book, A Movie, All books (words), All Books MultiMedia, and Everything Recorded]

Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”
“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

MyLifeBits
The guinea pig: Gordon Bell is digitizing his life
Has now scanned virtually all:
–Books written (and read when possible)
–Personal documents (correspondence, memos, bills, legal, …)
–Photos
–Posters, paintings, photo of things (artifacts, …medals, plaques)
–Home movies and videos
–CD collection
–And, of course, all PC files
Recording: phone, radio, TV, web pages… conversations
Paperless throughout ” scanned, 12’ discarded.
Only 30GB excluding videos
Video is 2+ TB and growing fast

25Kday life ~ Personal Petabyte 1PB Will anyone look at web pages in 2020? Probably new modalities & media will dominate then.
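
The "personal petabyte" figure can be sanity-checked with simple arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and uses the slide's 25,000-day life; the per-day budget it prints is derived here and is not a number stated on the slide.

```python
# Back-of-the-envelope check of "25Kday life ~ Personal Petabyte".
# Assumption: decimal units (1 PB = 10**15 bytes, 1 GB = 10**9 bytes).
PB = 10**15
GB = 10**9


def bytes_per_day(total_bytes: int, days: int) -> float:
    """Average daily capture rate that fills total_bytes in the given number of days."""
    return total_bytes / days


life_days = 25_000  # roughly 68 years
daily = bytes_per_day(PB, life_days)
print(f"{daily / GB:.0f} GB/day")  # 40 GB/day
```

In other words, a petabyte over a lifetime leaves room for roughly 40 GB of new recordings every day.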

Challenges
Capture: Get the bits in
Organize: Index them
Manage: No worries about loss or space
Curate / Annotate: automate where possible
Privacy: Keep safe from theft.
Summarize: Give thumbnail summaries
Interface: how to ask/anticipate questions
Present: show it in understandable ways.

80% of data is personal / individual. But what about the other 20%?
Business
–Wal-Mart online: 1PB and growing…
–Paradox: most “transaction” systems < 1 PB.
–Have to go to image/data monitoring for big data
Government
–Government is the biggest business.
Science
–LOTS of data.

CERN Tier 0
Instruments: CERN – LHC Peta Bytes per Year
Looking for the Higgs Particle
Sensors: 1000 GB/s (1TB/s ~ 30 EB/y)
Events: 75 GB/s
Filtered: 5 GB/s
Reduced: 0.1 GB/s ~ 2 PB/y
Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
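
The per-second rates in the filtering chain can be turned into yearly volumes with a one-line conversion. This is a rough sketch that assumes decimal units and a full calendar year of continuous running; the stage names and rates come from the slide, while the function name is made up for illustration.

```python
# Convert the sustained data rates from the slide into yearly volumes.
# Assumptions: decimal units and 365 days of uninterrupted running.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s
GB = 10**9
PB = 10**15
EB = 10**18


def yearly_bytes(rate_gb_per_s: float) -> float:
    """Bytes accumulated in one year at a constant rate given in GB/s."""
    return rate_gb_per_s * GB * SECONDS_PER_YEAR


stages = {"Sensors": 1000, "Events": 75, "Filtered": 5, "Reduced": 0.1}
for name, rate in stages.items():
    print(f"{name:9s} {rate:7.1f} GB/s -> {yearly_bytes(rate) / PB:10.1f} PB/y")
```

Under these assumptions the sensor stream comes to about 31.5 EB/y, in line with the slide's "~30 EB/y". The reduced stream comes to about 3 PB/y of continuous running, so the slide's "~2 PB/y" presumably reflects less than year-round operation; that reading is an inference, not something the slide states.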

Information Avalanche
Both
–better observational instruments and
–better simulations
are producing a data avalanche
Examples
–Turbulence: 100 TB simulation, then mine the information
–BaBar: grows 1TB/day; 2/3 simulation information, 1/3 observational information
–CERN: LHC will generate 1GB/s, 10 PB/y
–VLBA (NRAO) generates 1GB/s today
–NCBI: “only ½ TB” but doubling each year, very rich dataset.
–Pixar: 100 TB/Movie
Image courtesy of C. Meneveau & A. JHU

One Challenge: Move Data from CERN to Remote
1GBps disk-to-disk: gigabyte / second data rates
80TB/day
30 petabytes by …, exabyte by 2014
[Figure: tiered data-flow diagram. Labels: Experiment, CERN Filter, Tier 1 (INP3, RAL, INFN, FNAL), Tier 2 Institute, Tier 3, Tier 4, Physics data cache, Workstations. Link rates shown: ~PBps, ~5 GBps, ~1 GBps, .1 GBps. OC192 = 9.9 Gbps]
Graphics courtesy of Harvey Caltech

Current Status: CERN → Pasadena
Multi Stream TCP/IP: 7.1 Gbps ~900 MBps
–New speed
Single Stream TCP/IP: 6.5 Gbps ~800 MBps
File Transfer Speed: ~450 MBps
[Figure: transfer-rate chart, axis in Mbps with ticks from 0 upward in steps of 1,000]
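
The slide quotes each network rate twice, once in gigabits per second and once in megabytes per second. The conversion is sketched below, assuming decimal prefixes and 8 bits per byte and ignoring protocol overhead; the function name is illustrative.

```python
def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units, 8 bits per byte)."""
    return gbps * 1000 / 8


for label, gbps in [("Multi stream", 7.1), ("Single stream", 6.5)]:
    print(f"{label}: {gbps} Gbps = {gbps_to_mbytes_per_s(gbps):.1f} MBps")
```

The results, about 888 and 813 MBps, match the rounded "~900 MBps" and "~800 MBps" on the slide.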

The Evolution of Science
Observational Science
–Scientist gathers data by direct observation
–Scientist analyzes data
Analytical Science
–Scientist builds analytical model
–Makes predictions.
Computational Science
–Simulate analytical model
–Validate model and make predictions
Data Exploration Science
–Data captured by instruments or data generated by simulator
–Processed by software
–Placed in a database / files
–Scientist analyzes database / files

e-Science
Data captured by instruments or data generated by simulator
Processed by software
Placed in files or a database
Scientist analyzes files / database
Virtual laboratories
–Networks connecting e-Scientists
–Strong support from funding agencies
Better use of resources
–Primitive today

The Big Picture
[Figure: diagram with the labels Experiments & Instruments, Simulations, Literature, Other Archives, facts, questions, answers]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it
How to coexist with others
Query and Vis tools
Support/training
Performance
–Execute queries in a minute
–Batch query scheduling

FTP - GREP
Download (FTP and GREP) are not adequate
–You can GREP 1 MB in a second
–You can GREP 1 GB in a minute
–You can GREP 1 TB in 2 days
–You can GREP 1 PB in 3 years.
Oh!, and 1PB ~3,000 disks
At some point we need indices to limit search, and parallel data search and analysis
This is where databases can help
Next generation technique: Data Exploration
–Bring the analysis to the data!
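
To make the case for indices concrete, the sketch below compares a GREP-style linear scan with a lookup in a sorted index, counting how many records each one touches. It is a toy illustration: the key layout and both function names are invented here and are not taken from the talk.

```python
def linear_find(keys, target):
    """GREP-style search: look at records one by one. Returns (position, probes)."""
    probes = 0
    for i, k in enumerate(keys):
        probes += 1
        if k == target:
            return i, probes
    return -1, probes


def binary_find(sorted_keys, target):
    """Index-style search over sorted keys. Returns (position, probes)."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return (lo if found else -1), probes


keys = list(range(0, 2_000_000, 2))  # one million sorted, even keys
print(linear_find(keys, 1_999_998))  # touches every record
print(binary_find(keys, 1_999_998))  # roughly 20 probes
```

The scan cost grows in proportion to the data, while the indexed lookup grows only with its logarithm, which is why the same question stays answerable as the archive grows from terabytes to petabytes.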

Next-Generation Data Analysis
Looking for
–Needles in haystacks – the Higgs particle
–Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Global statistics have poor scaling
–Correlation functions are N^2, likelihood techniques N^3
As data and computers grow at same rate, we can only keep up with N log N
A way out?
–Relax notion of optimal (data is fuzzy, answers are approximate)
–Don’t assume infinite computational resources or memory
Combination of statistics & computer science
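
The scaling argument can be checked numerically. The sketch below assumes data volume and compute power both grow tenfold and asks how much longer each class of algorithm then takes; the baseline of 10^9 objects and the function name are assumptions made for illustration.

```python
import math


def relative_time(cost, n_factor, cpu_factor, n0=1e9):
    """Runtime after growth divided by runtime today.

    cost: operation count as a function of N
    n_factor: how much the data grew; cpu_factor: how much faster the machine is
    n0: assumed present-day object count (illustrative)
    """
    return cost(n0 * n_factor) / (cpu_factor * cost(n0))


costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n**2,
    "N^3": lambda n: n**3,
}

for name, cost in costs.items():
    print(f"{name:8s} {relative_time(cost, 10, 10):8.2f}x slower")
```

With these inputs the N log N job takes about 1.1 times as long, the N^2 job 10 times as long, and the N^3 job 100 times as long, which is the sense in which only N log N methods keep up.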

Analysis and Databases
Much statistical analysis deals with
–Creating uniform samples
–Data filtering
–Assembling relevant subsets
–Estimating completeness
–Censoring bad data
–Counting and building histograms
–Generating Monte-Carlo subsets
–Likelihood calculations
–Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a database
Move Mohamed to the mountain, not the mountain to Mohamed.
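
Two of the tasks listed above, censoring bad data and building a histogram, can be expressed as a single query so that only the counts leave the database. The following is a minimal sketch using Python's built-in sqlite3 module; the table `obj` and its columns `mag` and `flagged` are hypothetical and do not come from any real survey schema.

```python
import sqlite3

# Hypothetical toy catalog: object id, magnitude, and a bad-data flag.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (objId INTEGER PRIMARY KEY, mag REAL, flagged INTEGER)")
rows = [(1, 17.2, 0), (2, 17.8, 0), (3, 18.1, 0),
        (4, 18.4, 1), (5, 18.9, 0), (6, 19.5, 0)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor flagged rows and count objects per one-magnitude bin, all inside the database.
histogram = conn.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flagged = 0
    GROUP BY bin
    ORDER BY bin
""").fetchall()

print(histogram)  # [(17, 2), (18, 2), (19, 1)]
```

The client receives three rows instead of the whole table, which is the "bring the analysis to the data" idea in miniature.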

Virtual Observatory
Premise: Most data is (or could be) online
So, the Internet is the world’s best telescope:
–It has data on every part of the sky
–In every measured spectral band: optical, x-ray, radio...
–As deep as the best instruments (2 years ago).
–It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no...).
–It’s a smart telescope: links objects and data to literature on them.

Why Astronomy Data?
It has no commercial value
–No privacy concerns
–Can freely share results with others
–Great for experimenting with algorithms
It is real and well documented
–High-dimensional data (with confidence intervals)
–Spatial data
–Temporal data
Many different instruments from many different places and many different times
Federation is a goal
The questions are interesting
–How did the universe form?
There is a lot of it (petabytes)
[Figure: sky images in several bands: IRAS 100µ, ROSAT ~keV, DSS Optical, 2MASS 2µ, IRAS 25µ, NVSS 20cm, WENSS 92cm, GB 6cm]

Time and Spectral Dimensions
The Multiwavelength Crab Nebula
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert CalTech.
Crab star 1053 AD

Estimating Cosmological Constant
CPU Time vs Memory
CPU time is 5000 x N x log 2 N in memory
For large data sets, split into M disk chunks => time goes as M^2
Have 80M objects now, time is 10 days with 32GB – 4x1GHz CPU
Need to run this many times with different DB cuts; more objects soon!
[Figure: CPU time vs memory, with time marks at 1 day, 1 week, 1 month, year, decade]

SkyServer.SDSS.org
A modern archive
–Raw Pixel data lives in file servers
–Catalog data (derived objects) lives in Database
–Online query to any and all
Also used for education
–150 hours of online Astronomy
–Implicitly teaches data analysis
Interesting things
–Spatial data search
–Client query interface via Java Applet
–Query interface via Emacs
–Popular -- 1% of Terraserver
–Cloned by other surveys (a template design)
–Web services are core of it.

Demo of SkyServer
Shows standard web server
Pixel/image data
Point and click
Explore one object
Explore sets of objects (data mining)

Federation
Data Federations of Web Services
Massive datasets live near their owners:
–Near the instrument’s software pipeline
–Near the applications
–Near data knowledge and curation
–Super Computer centers become Super Data Centers
Each Archive publishes a web service
–Schema: documents the data
–Methods on objects (queries)
Scientists get “personalized” extracts
Uniform access to multiple Archives
–A common global schema

Federation: SkyQuery.Net
Combine 4 archives initially
Just added 10 more
Send query to portal, portal joins data from archives.
Problem: want to do multi-step data analysis (not just single query).
Solution: Allow personal databases on portal
Problem: some queries are monsters
Solution: “batch schedule” on portal server, deposits answer in personal database.

SkyQuery Structure
Each SkyNode publishes
–Schema Web Service
–Database Web Service
Portal
–Plans query (2 phase)
–Integrates answers
–Is itself a web service
[Figure: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, and Image Cutout]

SkyQuery: Distributed Query tool using a set of web services
Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England).
Feasibility study, built in 6 weeks
–Tanu Malik (JHU CS grad student)
–Tamas Budavari (JHU astro postdoc)
–With help from Szalay, Thakar, Gray
Implemented in C# and .NET
Allows queries like:
  SELECT o.objId, o.r, o.type, t.objId
  FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
  WHERE XMATCH(o,t)<3.5
    AND AREA(181.3,-0.76,6.5)
    AND o.type=3 AND (o.I - t.m_j)>2
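
The slide does not define what XMATCH computes. As a stand-in, the sketch below pairs objects from two catalogs whose positions lie within a given angular distance, which is one simple way to picture a cross-match. Treat everything in it as an assumption: the plain arcsecond threshold, the brute-force loop, the function names, and the made-up coordinates. It is not SkyQuery's implementation.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds between two (ra, dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def cross_match(cat_a, cat_b, max_arcsec):
    """Brute-force pairing of objects from two catalogs closer than max_arcsec."""
    return [(id_a, id_b)
            for id_a, ra_a, dec_a in cat_a
            for id_b, ra_b, dec_b in cat_b
            if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) < max_arcsec]


# Made-up positions near the AREA(181.3, -0.76, ...) region used in the query.
sdss = [("s1", 181.3000, -0.7600), ("s2", 181.3500, -0.7000)]
twomass = [("t1", 181.3005, -0.7601), ("t2", 181.9000, -0.2000)]

print(cross_match(sdss, twomass, 3.5))  # [('s1', 't1')]
```

A loop over every pair costs N x M comparisons, so a real federation would need a spatial index and would push each archive's part of the work to the node that holds the data.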

MyDB added to SkyQuery
Let users add personal DB
1GB for now.
Use it as a workbook.
Online and batch queries.
Moves analysis to the data
Users can cooperate (share MyDB)
Still exploring this
[Figure: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, Image Cutout, and MyDB]

The Big Picture
[Figure: diagram with the labels Experiments & Instruments, Simulations, Literature, Other Archives, facts, questions, answers]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it
How to coexist with others
Query and Vis tools
Support/training
Performance
–Execute queries in a minute
–Batch query scheduling