Building Peta-Byte Data Stores Jim Claus Shira Anniversary European Media Lab 12 February 2001.

How Much Information Is There?
Soon everything can be recorded and indexed.
Most data will never be seen by humans.
Precious resource: human attention.
Auto-summarization and auto-search are the key technologies.
[Figure: scale of data sizes running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Photo, A Book, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded.]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.

ops/s/$ Had Three Growth Phases Now doubling every year Mechanical Relay 7-year doubling Tube, transistor, year doubling Microprocessor 1.0 year doubling

Gilder’s Law: 3x bandwidth/year for 25 more years
Today:
  – 10 Gbps per channel (per lambda)
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
Aggregate bandwidth doubles every 8 months!
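
The bandwidth figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are invented here, and it assumes that "3x bandwidth/year" means a constant yearly growth factor. Under that assumption the doubling time comes out close to the slide's "every 8 months", and the bundle figure comes out at 1.28 Tbps, which the slide quotes as 1.2 Tbps.

```python
import math

# Figures quoted on the slide.
GBPS_PER_CHANNEL = 10     # one channel (lambda)
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32
GROWTH_PER_YEAR = 3.0     # assumption: constant 3x growth each year


def fiber_gbps():
    """Bandwidth of one fiber in Gbps."""
    return GBPS_PER_CHANNEL * CHANNELS_PER_FIBER


def bundle_tbps():
    """Bandwidth of one bundle in Tbps (decimal prefixes)."""
    return fiber_gbps() * FIBERS_PER_BUNDLE / 1000.0


def doubling_time_months(growth_per_year):
    """Months needed to double under a constant yearly growth factor."""
    return 12.0 * math.log(2) / math.log(growth_per_year)


if __name__ == "__main__":
    print(fiber_gbps())                                       # 40
    print(bundle_tbps())                                      # 1.28
    print(round(doubling_time_months(GROWTH_PER_YEAR), 1))    # 7.6
```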

[Figure: network map of the test path linking Redmond/Seattle, WA; San Francisco, CA; New York; and Arlington, VA. 5626 km, 10 hops.]
Participants: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA.

Storage capacity beating Moore’s law
3 k$/TB today (raw disk)
3 M$/PB
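
The step from 3 k$/TB to 3 M$/PB is a factor of 1000. A minimal sketch, assuming decimal prefixes (1 PB = 1000 TB) and raw-disk cost only; the function name is made up for illustration:

```python
TB_PER_PB = 1000  # assumption: decimal prefixes


def cost_per_pb(dollars_per_tb):
    """Raw-disk cost of one petabyte, given the price of one terabyte."""
    return dollars_per_tb * TB_PER_PB


if __name__ == "__main__":
    print(cost_per_pb(3000))  # 3000000, i.e. 3 M$/PB
```

Mirroring, controllers, and servers are not included, so a complete system would cost more than this figure.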

Microsoft TerraServer: Build a multi-TB SQL Server database
Data must be:
  – 1 TB
  – Unencumbered
  – Interesting to everyone everywhere
  – And not offensive to anyone anywhere
Loaded:
  – 1.5 M place names from Encarta World Atlas
  – 7 M sq km USGS DOQ (1 meter resolution)
  – 10 M sq km USGS topos (2 m)
  – 1 M sq km from Russian Space Agency (2 m)
On the web (world’s largest atlas)
Sell images with commerce server.

TerraServer 4.0 Configuration
3 active database servers (Compaq 8500), plus one Compaq 8500 passive server:
  – SQL\Inst1: Topo & Relief Data
  – SQL\Inst2: Aerial Imagery
  – SQL\Inst3: Aerial Imagery
Web servers: 8 2-proc “Photon” DL360
MetaData 101 GB, Image TB cooked
Logical Volume Structure:
  – 10 x 339 GB volumes, spread across 3 servers
  – 2x4 to photo servers
  – 1x2 for topo/relief server
  – One rack per database
  – All volumes triple mirrored (3x)
  – MetaData on 15k rpm 18.2 GB drives
  – Image Data on 10k rpm 72.8 GB drives
[Diagram: each database server attached to three controllers; volume labels E F G H I, L M N O P, S T U V U.]

TerraServer Activity
Usage Summary, July 1998 – Oct 2000

              Totals            Monthly         Daily
Users         38,285,034        1,367,323       46,127
Page Views    729,063,781       26,037,…        …,390
Image Tiles   3,154,632,…       …,665,458       3,800,762
Db Queries    3,791,078,…       …,395,662       4,567,564
Hits          4,153,678,…       …,345,663       5,004,432

TerraServer.Microsoft.NET: A Web Service
Before .NET: Web Browser, Internet, TerraServer Web Site (HtmlPage, ImageTile), TerraServer SQL Db
With .NET: Application Program, Internet, TerraServer Web Service, TerraServer SQL Db
Web service methods: GetAreaByPoint, GetAreaByRect, GetPlaceListByName, GetPlaceListByRect, GetTileMetaByLonLatPt, GetTileMetaByTileId, GetTile, ConvertLonLatToNearestPlace, ConvertPlaceToLonLatPt, ...

TerraServer Recent/Current Effort
Added USGS Topographic maps (4 TB)
High availability (4 node cluster with failover)
Integrated with Encarta Online
The other 25% of the US DOQs (photos)
Adding digital elevation maps
Open architecture: publish SOAP interfaces.
Adding multi-layer maps (with UC Berkeley)
Geo-Spatial extension to SQL Server

Astronomy is Changing (and so are other sciences)
The World Virtual Observatory
Astronomers have a few PB.
Doubles every 2 years.
Data is public after 2 years.
So:
  – Everyone has ½ the data
  – Some people have 5% more “private data”
So, it’s a nearly level playing field:
  – Most accessible data is public.
Cyberspace is the new telescope:
  – Multi-spectral, very deep, …
Computer Science challenge:
  – Organize these datasets
  – Provide easy access to them.
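
The "everyone has ½ the data" claim follows from the two numbers on the slide. If the archive grows as D(t) = D0 · 2^(t/T) with doubling time T, then the data older than the embargo period E is D(t − E) = D(t) · 2^(−E/T). The sketch below simply evaluates that ratio; it assumes smooth exponential growth, and the function name is invented for illustration.

```python
def public_fraction(doubling_years, embargo_years):
    """Fraction of all data collected so far that is past the embargo,
    assuming the archive grows exponentially with the given doubling time."""
    return 2.0 ** (-embargo_years / doubling_years)


if __name__ == "__main__":
    # Slide figures: doubles every 2 years, public after 2 years.
    print(public_fraction(2, 2))  # 0.5
```

With a shorter embargo the public share rises: a one-year embargo under the same growth rate would make about 71% of the data public.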

The Sloan Digital Sky Survey
Goal: Create a detailed multicolor map of the Northern Sky over 5 years
Special 2.5m telescope
Two surveys in one:
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
Huge CCD mosaic:
  – 30 CCDs 2K x 2K (imaging)
  – 22 CCDs 2K x 400 (astrometry)
Two high resolution spectrographs:
  – 2 x 320 fibers, with 3 arcsec diameter
  – R=2000 resolution with 4096 pixels
  – Spectral coverage from 3900Å to 9200Å
Automated data reduction:
  – Over 70 man-years of development effort (Fermilab + collaboration scientists)
Very high data volume:
  – 40 TB of raw, 3 TB cooked data (all public)
The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA

The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any other measurement before.
[Figure panels: Gregory and Thompson 1978; de Lapparent, Geller and Huchra 1986; da Costa et al. 1995; SDSS Collaboration 2002.]

Area and Size of Redshift Surveys

Experiment with Relational DBMS
See if SQL’s good indexing and scanning compensates for poor object support.
Leverage fast/big/cheap commodity hardware.
Ported 40 GB sample database (from SDSS Sample Scan) to SQL Server 2000.
Building public web site and data server.

20 Astronomy Queries
Implemented spatial access extension to SQL (HTM).
Implement 20 Astronomy Queries in SQL (see paper for details).
15M rows, 378 cols, 30 GB. Can scan it in 8 minutes (disk IO limited).
Many queries run in seconds.
Create Covering Indexes on queried columns.
Create ‘Neighbors’ Table listing objects within 1 arc-minute (5 neighbors on the average) for spatial joins.
Install some more disks!
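
The scan figures imply a sustained disk rate and an average row size, which a short calculation makes explicit. This is a rough sketch assuming decimal units (1 GB = 1000 MB = 10^9 bytes); the function names are invented for illustration.

```python
def scan_rate_mb_per_s(size_gb, minutes):
    """Average throughput needed to scan size_gb in the given time."""
    return size_gb * 1000.0 / (minutes * 60.0)


def bytes_per_row(size_gb, rows):
    """Average row size in bytes for a table of size_gb holding `rows` rows."""
    return size_gb * 1e9 / rows


if __name__ == "__main__":
    print(scan_rate_mb_per_s(30, 8))   # 62.5 MB/s
    print(bytes_per_row(30, 15e6))     # 2000.0 bytes per row
```

At roughly 2000 bytes per row, a covering index on a handful of columns is far smaller than the base table, which is consistent with the slide's remark that many queries run in seconds once such indexes exist.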

Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each other that have very similar colors (the color ratios u-g, g-r, r-i are less than 0.05m).
[Figure: pair of objects separated by 1 arc-minute.]

SQL Query to Find Gravitational Lenses
Find nearby objects with similar color ratios.

    select count(*)
    from Objects L, Objects O, neighbors N
    where L.Obj_id = N.Obj_id
      and O.Obj_id = N.neighbor_Obj_id
      and L.Obj_id < O.Obj_id                  -- no dups
      and ABS((L.u-L.g)-(O.u-O.g)) < 0.05      -- similar color
      and ABS((L.g-L.r)-(O.g-O.r)) < 0.05      -- ratios
      and ABS((L.r-L.i)-(O.r-O.i)) < 0.05      -- (= dif of log)
      and ABS((L.z-L.r)-(O.z-O.r)) < 0.05

Finds 5223 objects, executes in 6 minutes.
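
To see how the join through the Neighbors table behaves, the query can be tried on a toy dataset. The sketch below is not the SDSS setup: it swaps SQL Server for Python's built-in sqlite3 module, uses a made-up three-object table, and assumes a Neighbors table that lists each close pair in both directions. The 0.05 tolerance is taken from the slide.

```python
import sqlite3


def count_lens_candidates(objects, neighbors, tol=0.05):
    """Count neighbor pairs whose four color differences agree within tol.

    objects:   rows of (Obj_id, u, g, r, i, z) magnitudes (toy data)
    neighbors: rows of (Obj_id, neighbor_Obj_id), assumed precomputed
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table Objects (Obj_id integer primary key, "
        "u real, g real, r real, i real, z real)"
    )
    con.execute(
        "create table Neighbors (Obj_id integer, neighbor_Obj_id integer)"
    )
    con.executemany("insert into Objects values (?, ?, ?, ?, ?, ?)", objects)
    con.executemany("insert into Neighbors values (?, ?)", neighbors)
    (count,) = con.execute(
        """
        select count(*)
        from Objects L, Objects O, Neighbors N
        where L.Obj_id = N.Obj_id
          and O.Obj_id = N.neighbor_Obj_id
          and L.Obj_id < O.Obj_id                     -- count each pair once
          and abs((L.u-L.g)-(O.u-O.g)) < :tol
          and abs((L.g-L.r)-(O.g-O.r)) < :tol
          and abs((L.r-L.i)-(O.r-O.i)) < :tol
          and abs((L.z-L.r)-(O.z-O.r)) < :tol
        """,
        {"tol": tol},
    ).fetchone()
    con.close()
    return count


if __name__ == "__main__":
    toy_objects = [
        (1, 20.0, 19.0, 18.5, 18.2, 18.0),
        (2, 21.0, 20.0, 19.5, 19.2, 19.0),  # fainter, same colors as 1
        (3, 20.0, 19.5, 18.5, 18.0, 17.0),  # different colors
    ]
    toy_neighbors = [(1, 2), (2, 1), (1, 3), (3, 1)]
    print(count_lens_candidates(toy_objects, toy_neighbors))  # 1
```

The `L.Obj_id < O.Obj_id` condition is what keeps the pair (1, 2) from being counted a second time as (2, 1).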

SQL Results so far.
Have run 17 of 20 Queries so far.
Working on spectra load and queries now.
Most Queries IO bound (80 MB/sec on 4 disks in 6 minutes).
Covering indexes reduce execution to < 30 secs.
Common to get Grid Distributions:

    select convert(int, ra*30)/30.0  as ra_bucket,
           convert(int, dec*30)/30.0 as dec_bucket,
           count(*)                  as bucket_count
    from Galaxies
    where (u-g) > 1 and r < 21.5
    group by convert(int, ra*30)/30.0,
             convert(int, dec*30)/30.0
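
The grid-distribution query truncates each coordinate to a cell of 1/30 degree (2 arc-minutes) and counts the galaxies passing the color and magnitude cuts in each cell. A minimal Python sketch of the same bucketing on in-memory rows follows; the function name and the sample rows are made up, and `int()` stands in for `convert(int, ...)` since both truncate toward zero.

```python
from collections import Counter


def grid_counts(galaxies, cells_per_degree=30):
    """Count galaxies per (ra, dec) grid cell after the slide's cuts.

    galaxies: iterable of (ra, dec, u, g, r) tuples, angles in degrees.
    """
    counts = Counter()
    for ra, dec, u, g, r in galaxies:
        if (u - g) > 1 and r < 21.5:
            # Truncate to the lower edge of the cell, as in the SQL.
            ra_bucket = int(ra * cells_per_degree) / cells_per_degree
            dec_bucket = int(dec * cells_per_degree) / cells_per_degree
            counts[(ra_bucket, dec_bucket)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        (10.01, 5.01, 20.5, 19.0, 20.0),
        (10.02, 5.02, 21.0, 19.5, 21.0),  # same cell as the first row
        (10.5, 5.01, 20.5, 19.0, 20.0),   # different cell
        (10.01, 5.01, 19.5, 19.0, 20.0),  # fails the (u-g) > 1 cut
        (10.01, 5.01, 20.5, 19.0, 22.0),  # fails the r < 21.5 cut
    ]
    for cell, n in sorted(grid_counts(sample).items()):
        print(cell, n)
```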

Summary
Technology:
  – 1M$/PB: store everything online (twice!)
  – Gigabit to the desktop: store it anywhere
So: You can store everything,
  Anywhere in the world
  Online everywhere
Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory.