How much information? Adapted from a presentation by:

Presentation transcript:

How much information?
Adapted from a presentation by:
Jim Gray, Microsoft Research, http://research.microsoft.com/~gray
Alex Szalay, Johns Hopkins University, http://tarkus.pha.jhu.edu/~szalay/

How much information is there in the world?
What can we store?
What is stored?
Why are we interested?

Infinite Storage? The Terror Bytes are Here
Petrified by Peta Bytes?
1 TB costs 1k$ to buy
1 TB costs 300k$/y to own
Management & curation are expensive
Searching 1 TB takes minutes or hours
But… people can “afford” them, even though they can never actually be seen in your lifetime, so: automate the process
[Figure: scale of prefixes: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo; marker: “We are here”]

How much information is there?
Soon everything can be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale from Kilo up to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta), with markers: A Book, A Photo, Movie, All books (words), All Books MultiMedia, Everything! Recorded]
Small prefixes (negative power of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

First Disk, 1956
IBM 305 RAMAC
4 MB
50 x 24” disks
1200 rpm
100 ms access
35k$/y rent
Included computer & accounting software (tubes, not transistors)

Storage capacity beating Moore’s law
Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y
1000 $/TB today, 100 $/TB in 2007
Moore’s law: 58.70%/year
TB growth: 112.30%/year since 1993
Price decline: 50.70%/year since 1993
Most (80%) data is personal (not enterprise). This will likely remain true.
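The price figures on this slide can be checked with a little arithmetic. The sketch below assumes a constant annual decline (the slide's 50.70%/year) and asks how long a tenfold drop from 1000 $/TB to 100 $/TB would take. The function name is illustrative and not part of the original talk.

```python
import math

def years_to_reach(start_price, target_price, annual_decline):
    """Years for a price to fall from start to target at a constant annual decline rate."""
    # Each year the price is multiplied by (1 - annual_decline).
    return math.log(target_price / start_price) / math.log(1.0 - annual_decline)

# Slide figures: 1000 $/TB today, 100 $/TB projected, 50.70% decline per year.
years = years_to_reach(1000.0, 100.0, 0.507)
print(f"Tenfold price drop takes about {years:.1f} years")
```

Under that assumption a tenfold drop takes a little over three years.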

Disk Storage Cheaper Than Paper
File cabinet:
  Cabinet (4 drawer)        250$
  Paper (24,000 sheets)     250$
  Space (2x3 @ 10€/ft2)     180$
  Total                     700$
  0.03 $/sheet: 3 pennies per page
Disk:
  Disk (250 GB)             250$
  ASCII: 100 million pages, 2e-6 $/sheet (10,000x cheaper): a micro-dollar per page
  Image: 1 million photos, 3e-4 $/photo (100x cheaper): a milli-dollar per photo
Store everything on disk.
Note: Disk is 100x to 1000x cheaper than RAM.
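The per-page costs follow directly from the slide's own numbers. The sketch below redoes the division, reading “100 m pages” and “1 m photos” as millions (an assumption) and using the slide's rounded 700$ total for the cabinet. The variable names are illustrative.

```python
# Figures taken from the slide; all prices in dollars.
cabinet_total = 700.0   # cabinet + paper + floor space, as rounded on the slide
sheets = 24_000
disk_price = 250.0      # one 250 GB disk
ascii_pages = 100e6     # assumption: "100 m pages" means 100 million
photos = 1e6            # assumption: "1 m photos" means 1 million

paper_per_sheet = cabinet_total / sheets
disk_per_page = disk_price / ascii_pages
disk_per_photo = disk_price / photos

print(f"paper: {paper_per_sheet:.3f} $/sheet")
print(f"disk, ASCII: {disk_per_page:.1e} $/page "
      f"({paper_per_sheet / disk_per_page:,.0f}x cheaper)")
print(f"disk, image: {disk_per_photo:.1e} $/photo "
      f"({paper_per_sheet / disk_per_photo:,.0f}x cheaper)")
```

The computed ratios land close to the slide's rounded “10,000x” and “100x” claims.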

Trying to fill a terabyte in a year
Item                           Items/TB   Items/day
300 KB JPEG                    3 M        9,800
1 MB Doc                       1 M        2,900
1 hour 256 kb/s MP3 audio      9 K        26
1 hour 1.5 Mbps MPEG video     290        0.8
Bottom line: we will be able to keep LOTS of video, and vast amounts of smaller data types (audio, photos, documents).
Note: it is probably not worth the time to delete an object.
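The table's method is simple division: one terabyte over the item size, then over the days in a year. The sketch below assumes decimal units (1 TB = 1e12 bytes) and a 365-day year. The slide does not state its conventions, so the output need not match its rounded figures exactly. The helper names are illustrative.

```python
TB = 1e12            # assumption: decimal terabyte
DAYS_PER_YEAR = 365

def items_per_tb(item_bytes):
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes

def stream_bytes(bits_per_second, seconds):
    """Size in bytes of a constant-bit-rate stream."""
    return bits_per_second / 8 * seconds

items = {
    "300 KB JPEG": 300e3,
    "1 MB document": 1e6,
    "1 hour 256 kb/s MP3": stream_bytes(256e3, 3600),
    "1 hour 1.5 Mb/s MPEG": stream_bytes(1.5e6, 3600),
}

for name, size in items.items():
    per_tb = items_per_tb(size)
    print(f"{name:22s} {per_tb:12,.0f} per TB {per_tb / DAYS_PER_YEAR:10,.1f} per day")
```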

Portable Computer: 2006?
100 Gips processor
1 GB RAM
1 TB disk
1 Gbps network
“Some” of your software
Finding things is a data mining challenge.

80% of data is personal / individual. But what about the other 20%?
Business:
  Wal-Mart online: 1 PB and growing…
  Paradox: most “transaction” systems < 1 PB.
  Have to go to image/data monitoring for big data.
Government:
  Government is the biggest business.
Science:
  LOTS of data.

Q: Where will the Data Come From? A: Sensor Applications
Earth Observation: 15 PB by 2007
Medical Images & Information + Health Monitoring: potential 1 GB/patient/y → 1 EB/y
Video Monitoring: ~1E8 video cameras @ 1E5 B/s → 10 TB/s → 100 EB/y → filtered???
Airplane Engines: 1 GB sensor data/flight, 100,000 engine hours/day → 30 PB/y
Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
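These estimates are order-of-magnitude products, and a short sketch makes the multiplication explicit. It rests on several assumptions not stated on the slide: a billion patients, about 1E5 bytes per second per camera (the rate that yields the slide's 10 TB/s total), and 1 GB per engine hour. The results agree with the slide only to within the rounding one expects from such estimates.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
TB, PB, EB = 1e12, 1e15, 1e18

# Medical: 1 GB per patient per year; 1e9 patients is an assumed population.
medical = 1e9 * 1e9

# Video: 1e8 cameras, each assumed to produce about 1e5 bytes per second.
video_rate = 1e8 * 1e5
video = video_rate * SECONDS_PER_YEAR

# Engines: 1e5 engine hours per day, assuming 1 GB per engine hour.
engines = 1e5 * 1e9 * 365

print(f"medical: {medical / EB:.0f} EB/y")
print(f"video:   {video_rate / TB:.0f} TB/s, {video / EB:.0f} EB/y")
print(f"engines: {engines / PB:.1f} PB/y")
```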

Premise: DataGrid Computing
Store exabytes twice (for redundancy)
Access them from anywhere
Implies huge archive/data centers
Supercomputer centers become super data centers
Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …

Thesis
Most new information is digital (and old information is being digitized).
An Information Science Grand Challenge:
  Capture, organize, summarize, and visualize this information
  Optimize human attention as a resource
  Improve information quality

The Evolution of Science
Observational Science:
  Scientist gathers data by direct observation
  Scientist analyzes data
Analytical Science:
  Scientist builds analytical model
  Makes predictions
Computational Science:
  Simulate analytical model
  Validate model and make predictions
Data Exploration Science:
  Data captured by instruments, or generated by simulator
  Processed by software
  Placed in a database / files
  Scientist analyzes database / files

Computational Science Evolves
Historically, Computational Science = simulation.
New emphasis on informatics: capturing, organizing, summarizing, analyzing, visualizing.
Largely driven by observational science, but also needed by simulations.
Too soon to say if comp-X and X-info will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer (from http://www.genome.uci.edu/); Space Telescope]

Next-Generation Data Analysis
Looking for:
  Needles in haystacks: the Higgs particle
  Haystacks: dark matter, dark energy
Needles are easier than haystacks.
Global statistics have poor scaling:
  Correlation functions are N^2, likelihood techniques N^3
  As data and computers grow at the same rate, we can only keep up with N log N
A way out?
  Discard the notion of optimal (data is fuzzy, answers are approximate)
  Don’t assume infinite computational resources or memory
Requires a combination of statistics & computer science.
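The scaling argument can be made concrete with a small calculation. The sketch below assumes, as the slide does, that data volume and compute speed grow by the same factor each year. The starting size (1e6 objects), the doubling rate, and the ten-year horizon are illustrative assumptions.

```python
import math

def relative_runtime(cost, n0, growth, years):
    """Runtime after `years`, relative to today, when both the data size
    and the compute speed grow by the factor `growth` every year."""
    n = n0 * growth ** years
    speedup = growth ** years
    return (cost(n) / speedup) / cost(n0)

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in costs.items():
    ratio = relative_runtime(cost, n0=1e6, growth=2.0, years=10)
    print(f"{name:8s} runtime grows by a factor of {ratio:,.1f}")
```

Under these assumptions the N log N job slows only by a modest logarithmic factor, while the N^2 and N^3 jobs fall behind by three and six orders of magnitude.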

Smart Data (active databases)
If there is too much data to move around, take the analysis to the data!
Do all data manipulations at the database:
  Build custom procedures and functions in the database
  Automatic parallelism guaranteed
  Easy to build in custom functionality
  Databases & procedures being unified
  Examples: temporal and spatial indexing, pixel processing
Easy to reorganize the data:
  Multiple views, each optimal for certain types of analyses
  Building hierarchical summaries is trivial
Scalable to petabyte datasets.
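A minimal sketch of taking the analysis to the data, using Python's built-in sqlite3 module as a stand-in for the database systems the talk has in mind. The table, its columns, and the `magnitude` function are hypothetical. SQLite does not provide the automatic parallelism the slide mentions, so the example only illustrates registering a custom function in the database and returning a summary instead of raw rows.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, flux REAL)"
)
conn.executemany(
    "INSERT INTO objects (ra_deg, dec_deg, flux) VALUES (?, ?, ?)",
    [(10.0, 5.0, 100.0), (20.0, -3.0, 10.0), (30.0, 12.0, 1.0), (40.0, 1.0, 1000.0)],
)

def magnitude(flux):
    """Hypothetical custom function: convert a flux to a logarithmic magnitude."""
    return -2.5 * math.log10(flux)

# Register the Python function so SQL queries can call it inside the database.
conn.create_function("magnitude", 1, magnitude)

# Filtering and summarizing happen in the database; only one row comes back.
count, brightest = conn.execute(
    "SELECT COUNT(*), MIN(magnitude(flux)) FROM objects WHERE dec_deg > 0"
).fetchone()
print(count, brightest)
```

The client receives a two-value summary rather than the full table, which is the point of the slide once the table no longer fits on the client.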

Challenge: Make Data Publication & Access Easy
Augment FTP with data query: return intelligent data subsets.
Make it easy to:
  Publish: record structured data
  Find: find data anywhere in the network; get the subset you need
  Explore datasets interactively
Realistic goal: make it as easy as publishing/reading web sites today.

Data Federations of Web Services
Massive datasets live near their owners:
  Near the instrument’s software pipeline
  Near the applications
  Near data knowledge and curation
  Super Computer centers become Super Data Centers
Each archive publishes a web service:
  Schema: documents the data
  Methods on objects (queries)
Scientists get “personalized” extracts.
Uniform access to multiple archives:
  A common global schema
Challenge: What is the object model for your science?
[Figure: Federation]

Web Services: The Key?
Internet-scale distributed computing
Web SERVER:
  Given a URL + parameters
  Returns a web page (often dynamic)
Web SERVICE:
  Given an XML document (SOAP message)
  Returns an XML document
  Tools make this look like an RPC: F(x,y,z) returns (u, v, w)
  Distributed objects for the web
  + naming, discovery, security, …
[Figure: Your program ↔ Web Server, exchanging web pages over HTTP; Your program ↔ Web Service, exchanging data over SOAP: an object in XML on the wire, data in your address space]
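The “XML document in, XML document out, but it looks like an RPC” idea can be sketched without a network. Everything below is a simplified assumption: the element names are made up, the envelope is not real SOAP, and `service` is a local stand-in for a remote endpoint. Only the shape of the exchange follows the slide.

```python
import xml.etree.ElementTree as ET

def encode_request(method, **params):
    """Serialize a call into an XML request document."""
    root = ET.Element("request", {"method": method})
    for name, value in params.items():
        ET.SubElement(root, name).text = repr(value)
    return ET.tostring(root, encoding="unicode")

def service(xml_in):
    """Stand-in for the remote web service: XML document in, XML document out."""
    req = ET.fromstring(xml_in)
    x, y, z = (float(req.find(k).text) for k in ("x", "y", "z"))
    resp = ET.Element("response")
    # Arbitrary illustrative computation of the three results.
    for name, value in (("u", x + y), ("v", y * z), ("w", x - z)):
        ET.SubElement(resp, name).text = repr(value)
    return ET.tostring(resp, encoding="unicode")

def F(x, y, z):
    """Client stub: looks like a local call, but exchanges XML documents."""
    reply = ET.fromstring(service(encode_request("F", x=x, y=y, z=z)))
    return tuple(float(reply.find(k).text) for k in ("u", "v", "w"))

u, v, w = F(1.0, 2.0, 3.0)
print(u, v, w)
```

In a real deployment the stub would be generated by tooling from the service description and the call to `service` would be an HTTP request. The caller still sees only F(x, y, z) returning (u, v, w).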

Emerging technologies
Look at science
High end computation and storage