1 Store Everything Online In A Database Jim Gray Microsoft Research

Presentation transcript:

1 Store Everything Online In A Database
Jim Gray
Microsoft Research

2 Outline
Store Everything
Online (Disk not Tape)
In a Database

3 How Much is Everything?
Soon everything can be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk: How much information is there
See Lyman & Varian: How much information
[Figure: storage scale running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded.]
Small-unit prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

4 Storage capacity beating Moore's law
3 k$/TB today (raw disk)
1 k$/TB by end of 2002

5 Outline
Store Everything
Online (Disk not Tape)
In a Database

6 Online Data
Can build 1 PB of NAS disk for 5 M$ today.
Can SCAN (read or write) the entire PB in 3 hours.
Operate it as a data pump: continuous sequential scan.
Can deliver 1 PB for 1 M$ over the Internet.
– Access charge is 300$/Mbps bulk rate
Need to Geoplex data (store it in two places).
Need to filter/process data near the source,
– to minimize network costs.
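
The delivery and scan figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 300$/Mbps charge is billed per 30-day month, uses decimal units, and borrows the 35 MBps per-disk rate from the Disk vs Tape slide. The function names are hypothetical.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: charge is per Mbps per 30-day month
PB = 1e15  # decimal petabyte, in bytes


def bulk_transfer_cost_usd(n_bytes, usd_per_mbps_month=300.0):
    """Cost of pushing n_bytes through a link billed per Mbps per month."""
    bits = n_bytes * 8
    bits_per_mbps_month = 1e6 * SECONDS_PER_MONTH
    return bits / bits_per_mbps_month * usd_per_mbps_month


def disks_needed_for_scan(n_bytes, scan_hours, mb_per_s_per_disk):
    """Number of disks that must stream in parallel to scan n_bytes in time."""
    required_bytes_per_s = n_bytes / (scan_hours * 3600)
    return math.ceil(required_bytes_per_s / (mb_per_s_per_disk * 1e6))


delivery_cost = bulk_transfer_cost_usd(PB)
parallel_disks = disks_needed_for_scan(PB, scan_hours=3, mb_per_s_per_disk=35)
print(f"deliver 1 PB: about {delivery_cost / 1e6:.2f} M$")
print(f"3-hour scan of 1 PB needs about {parallel_disks} disks streaming at 35 MBps")
```

Under these assumptions the delivery cost comes out a little under 1 M$, in line with the slide, which is why filtering near the source pays off: every byte not shipped is money not spent.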

7 The Absurd Disk
1 TB, 100 MB/s, 200 Kaps
2.5 hr scan time (poor sequential access)
1 access per second / 5 GB (VERY cold data)
It's a tape!
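
A rough check of the Absurd Disk numbers, as a sketch. It assumes decimal units and reads "200 Kaps" as roughly 200 random accesses per second, which is the reading consistent with "1 access per second / 5 GB". The helper names are hypothetical.

```python
TB = 1e12  # decimal terabyte, in bytes


def scan_hours(capacity_bytes, bytes_per_s):
    """Hours needed to stream the whole device sequentially."""
    return capacity_bytes / bytes_per_s / 3600


def gb_per_access_per_second(capacity_bytes, accesses_per_s):
    """How many GB of stored data share each access-per-second of capability."""
    return capacity_bytes / accesses_per_s / 1e9


absurd_scan = scan_hours(TB, 100e6)
absurd_density = gb_per_access_per_second(TB, 200)
print(f"scan time: {absurd_scan:.1f} hours")
print(f"one access per second for every {absurd_density:.0f} GB")
```

With these inputs the scan takes a bit under three hours, close to the slide's 2.5 hr, and each access per second has to serve 5 GB of data. Capacity grows much faster than bandwidth or access rate, so a big disk behaves more and more like a sequential device.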

8 Disk vs Tape
Disk
– 80 GB
– 35 MBps
– 5 ms seek time
– 3 ms rotate latency
– 3$/GB for drive, 2$/GB for ctlrs/cabinet
– 15 TB/rack
– 1 hour scan
Tape
– 40 GB
– 10 MBps
– 10 sec pick time
– second seek time
– 2$/GB for media, 8$/GB for drive+library
– 10 TB/rack
– 1 week scan
The price advantage of disk is growing; the performance advantage of disk is huge!
At 10K$/TB, disk is competitive with nearline tape.
Guestimates
Cern: 200 TB 3480 tapes
2 col = 50 GB
Rack = 1 TB = 12 drives
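
The per-unit figures above can be turned into cost per TB and single-unit scan time. This is a minimal sketch using only the numbers on the slide, with decimal units; the `Medium` class and its field names are hypothetical. It covers streaming one drive or cartridge end to end, not a whole rack, whose scan time also depends on how many drives share the media.

```python
from dataclasses import dataclass


@dataclass
class Medium:
    name: str
    capacity_gb: float
    mb_per_s: float
    storage_usd_per_gb: float         # drive (disk) or media (tape)
    infrastructure_usd_per_gb: float  # ctlrs/cabinet (disk) or drive+library (tape)

    def usd_per_tb(self):
        """Total cost per decimal TB, storage plus infrastructure."""
        return (self.storage_usd_per_gb + self.infrastructure_usd_per_gb) * 1000

    def unit_scan_minutes(self):
        """Minutes to stream one unit end to end at its sequential rate."""
        return self.capacity_gb * 1000 / self.mb_per_s / 60


disk = Medium("disk", 80, 35, 3, 2)
tape = Medium("tape", 40, 10, 2, 8)

for m in (disk, tape):
    print(f"{m.name}: {m.usd_per_tb():.0f} $/TB, "
          f"{m.unit_scan_minutes():.0f} min to stream one unit")
```

On these figures tape comes to 10 k$/TB, the level at which the slide calls disk competitive with nearline tape, and disk comes to about half that.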

9 Building a Petabyte Disk Store
Cadillac ~ 500k$/TB = 500M$/PB
plus FC switches plus… 800M$/PB
TPC-C SANs (Brand PC 18GB/…) 60 M$/PB
Brand PC local SCSI 20M$/PB
Do it yourself ATA 5M$/PB

10 Cheap Storage and/or Balanced System
Low cost storage (2 x 3k$ servers): 5 K$/TB
2x (800 MHz, 256 MB + 8x80 GB disks + 100 MbE)
RAID5 costs 6 K$/TB
Balanced server (5k$/.64 TB)
– 2x800 MHz (2k$)
– 512 MB
– 8 x 80 GB drives (2K$)
– Gbps Ethernet + switch (300$/port)
– 9k$/TB, 18K$/mirrored TB

11 Next step in the Evolution
Disks become supercomputers
– Controller will have 1 bips, 1 GB RAM, 1 GBps net
– And a disk arm.
Disks will run full-blown app/web/db/os stack.
Distributed computing.
Processors migrate to transducers.

12 It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
At 1 GBps it takes 12 days!
Store it in two (or more) places online (on disk?): a geo-plex.
Scrub it continuously (look for errors).
On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
Can organize the two copies differently (e.g.: one by time, one by space).
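
The scrub-and-refresh cycle described above might look like the following. This is a toy sketch, not a description of any real system: the `Replica` class, the block layout, and the use of SHA-256 checksums are all assumptions made for illustration. The `restore_days` helper reproduces the restore-time arithmetic with decimal units.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Replica:
    """One site of a two-site geoplex (hypothetical in-memory model)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> (data, checksum recorded at write time)

    def put(self, block_id, data):
        self.blocks[block_id] = (data, checksum(data))

    def is_healthy(self, block_id):
        entry = self.blocks.get(block_id)
        return entry is not None and checksum(entry[0]) == entry[1]


def scrub(site_a, site_b):
    """Verify every block on both sites; refresh a bad copy from the good one."""
    repaired = []
    for block_id in sorted(set(site_a.blocks) | set(site_b.blocks)):
        ok_a, ok_b = site_a.is_healthy(block_id), site_b.is_healthy(block_id)
        if ok_a and not ok_b:
            site_b.put(block_id, site_a.blocks[block_id][0])
            repaired.append((site_b.name, block_id))
        elif ok_b and not ok_a:
            site_a.put(block_id, site_b.blocks[block_id][0])
            repaired.append((site_a.name, block_id))
    return repaired


def restore_days(n_bytes, bytes_per_s):
    """Days needed to copy n_bytes back at a given rate."""
    return n_bytes / bytes_per_s / 86400


east, west = Replica("east"), Replica("west")
for block_id, data in [("b1", b"alpha"), ("b2", b"beta")]:
    east.put(block_id, data)
    west.put(block_id, data)

# Simulate silent corruption on one site: data changes, stored checksum does not.
west.blocks["b1"] = (b"garbage", west.blocks["b1"][1])
repairs = scrub(east, west)
print(repairs)
print(f"restoring 1 PB at 1 GBps: {restore_days(1e15, 1e9):.1f} days")
```

The sketch leaves out the case where both copies of a block are bad, which is exactly the situation continuous scrubbing is meant to make unlikely.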

13 Outline
Store Everything
Online (Disk not Tape)
In a Database

14 Why Not file = object + GREP?
It works if you have thousands of objects (and you know them all).
But hard to search millions/billions/trillions with GREP.
Hard to put all attributes in file name.
– Minimal metadata
Hard to do chunking right.
Hard to pivot on space/time/version/attributes.
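
To illustrate why a GREP-style search stops scaling, the sketch below contrasts a linear scan over per-object metadata with a lookup in a prebuilt attribute index. The object names and attributes are made up for the example, and the in-memory dictionary only stands in for what a database index would do.

```python
from collections import defaultdict

# Hypothetical catalog: in a file-per-object design, these attributes would
# have to be squeezed into the file name or parsed out of each file.
objects = [
    {"name": "obj_0001.dat", "region": "north", "day": "2001-03-14", "version": 1},
    {"name": "obj_0002.dat", "region": "south", "day": "2001-03-14", "version": 1},
    {"name": "obj_0003.dat", "region": "north", "day": "2001-03-15", "version": 2},
]


def linear_scan(catalog, **wanted):
    """GREP-like search: touches every object on every query."""
    return [o["name"] for o in catalog
            if all(o.get(k) == v for k, v in wanted.items())]


def build_index(catalog, attributes):
    """Build once, then answer queries on these attributes by direct lookup."""
    index = defaultdict(list)
    for o in catalog:
        index[tuple(o[a] for a in attributes)].append(o["name"])
    return index


by_region_day = build_index(objects, ("region", "day"))
scan_hits = linear_scan(objects, region="north", day="2001-03-15")
index_hits = by_region_day[("north", "2001-03-15")]
print(scan_hits, index_hits)
```

Both searches return the same object, but the scan's cost grows with the number of objects while the index lookup does not. Pivoting on a different attribute combination means building another index, which a database system does for you.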

15 The Reality: it's build vs buy
If you use a file system you will eventually build a database system:
– metadata,
– query,
– parallel ops,
– security,
– reorganize,
– recovery,
– distributed,
– replication, …

16 OK: so I'll put lots of objects in a file
Do It Yourself Database
Good news:
– Your implementation will be 10x faster than the general purpose one, and easier to understand and use than the general purpose one.
Bad news:
– It will cost 10x more to build and maintain
– Someday you will get bored maintaining/evolving it
– It will lack some killer features:
  – Parallel search
  – Self-describing via metadata
  – SQL, XML, …
  – Replication
  – Online update – reorganization
  – Chunking is problematic (what granularity, how to aggregate)

17 Top 10 reasons to put Everything in a DB
1. Someone else writes the million lines of code
2. Captures data and metadata
3. Standard interfaces give tools and quick learning
4. Allows schema evolution without breaking old apps
5. Index and pivot on multiple attributes: space-time-attribute-version…
6. Parallel terabyte searches in seconds or minutes
7. Moves processing & search close to the disk arm (moves fewer bytes: questons return datons)
8. Chunking is easier (can aggregate chunks at server)
9. Automatic geo-replication
10. Online update and reorganization
11. Security
12. If you pick the right vendor, ten years from now, there will be software that can read the data.
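
A few of these reasons (schema evolution, multi-attribute indexes, aggregation at the server) can be seen in miniature with Python's built-in `sqlite3` module. This is only a sketch: the `tile` table, its columns, and the sample rows are invented for the example and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tile (id INTEGER PRIMARY KEY, x INTEGER, y INTEGER, "
    "taken TEXT, version INTEGER, bytes INTEGER)"
)
rows = [
    (0, 0, "2001-06-01", 1, 100),
    (0, 1, "2001-06-01", 1, 150),
    (1, 0, "2001-07-01", 1, 200),
]
conn.executemany(
    "INSERT INTO tile (x, y, taken, version, bytes) VALUES (?, ?, ?, ?, ?)", rows
)

# Reason 5: index on several attributes at once (space and time here).
conn.execute("CREATE INDEX tile_space_time ON tile (x, y, taken)")

# Reason 4: an "old app" query keeps working after the schema grows a column.
old_query = "SELECT x, y FROM tile WHERE taken = ? ORDER BY x, y"
before = conn.execute(old_query, ("2001-06-01",)).fetchall()
conn.execute("ALTER TABLE tile ADD COLUMN sensor TEXT DEFAULT 'unknown'")
after = conn.execute(old_query, ("2001-06-01",)).fetchall()

# Reasons 7 and 8: aggregate at the server and ship back only the summary.
totals = conn.execute(
    "SELECT taken, COUNT(*), SUM(bytes) FROM tile GROUP BY taken ORDER BY taken"
).fetchall()
print(before == after, totals)
```

The query returns one summary row per date instead of every tile, which is the small-scale version of moving fewer bytes.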

18 DB Centric Examples
TerraServer
– All images and all data in the database (chunked as small tiles).
SkyServer & Virtual Sky
– Both image and semantic data in a relational store.
– Parallel search & NonProcedural access are important.
– http://virtualsky.org/servlet/Page?F=3&RA=16h+10m+1.0s&DE=%2B0d+42m+45s&T=4&P=12&S=10&X=5096&Y=4121&W=4&Z=-1&tile.2.1.x=55&tile.2.1.y=20
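
Chunking imagery as small tiles means a viewer only needs the tiles that overlap its viewport. The sketch below shows that mapping under assumed conventions: a 256-pixel square tile, pixel coordinates with the origin at the top left, and tiles keyed by (column, row). None of these details are taken from TerraServer or Virtual Sky.

```python
def tiles_for_viewport(left, top, width, height, tile_size=256):
    """Return the (column, row) keys of every tile overlapping the viewport."""
    first_col = left // tile_size
    last_col = (left + width - 1) // tile_size
    first_row = top // tile_size
    last_row = (top + height - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


needed = tiles_for_viewport(left=300, top=100, width=400, height=300)
print(needed)
```

Each key could then be looked up as a row in a tile table, so the database returns a handful of small chunks rather than a whole image.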

19 Outline
Store Everything
Online (Disk not Tape)
In a Database