Building BIG Data Servers on the Web Jim Gray Microsoft Research Talk at Flash Mob Supercomputer.

Presentation transcript:

Building BIG Data Servers on the Web Jim Gray Microsoft Research Talk at Flash Mob Supercomputer 3 April 2004

Numbers
TeraBytes and Gigabytes are BIG!
Mega – a house in San Francisco
Giga – a very rich person
Tera – ~ the Bush national debt
Peta – more than all the money in the world
A Gigabyte: the Human Genome
A Terabyte: a 150-mile-long shelf of books

How much information is there?
Soon everything can be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk: How much information is there:
See Lyman & Varian: How much information
[Figure: scale of prefixes Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Photo, A Book, Movie, All books (words), All Books MultiMedia, and Everything Recorded]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

e-Science
Data captured by instruments or generated by simulator
Processed by software
Placed in files or a database
Scientist analyzes files / database
Virtual laboratories
–Networks connecting e-Scientists
–Strong support from funding agencies
Better use of resources
–Primitive today

The Big Picture
[Diagram labels: Experiments & Instruments, Simulations, Literature, Other Archives, facts, questions, answers]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
Query and Vis tools
Support/training
Performance
–Execute queries in a minute
–Batch query scheduling

e-Science is Data Mining
There are LOTS of data
–People cannot examine most of it.
–Need computers to do analysis.
Manual or Automatic Exploration
–Manual: person suggests hypothesis, computer checks hypothesis
–Automatic: computer suggests hypothesis, person evaluates significance
Given an arbitrary parameter space:
–Data Clusters
–Points between Data Clusters
–Isolated Data Clusters
–Isolated Data Groups
–Holes in Data Clusters
–Isolated Points
Nichol et al. Slide courtesy of and adapted from Robert CalTech.

Data Analysis
Looking for
–Needles in haystacks – the Higgs particle
–Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Global statistics have poor scaling
–Correlation functions are N^2, likelihood techniques N^3
As data and computers grow at the same rate, we can only keep up with N log N
A way out?
–Discard notion of optimal (data is fuzzy, answers are approximate)
–Don’t assume infinite computational resources or memory
Requires combination of statistics & computer science
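The scaling argument on this slide can be checked with a little arithmetic. The sketch below is only an illustration, not something from the talk: it assumes a catalog of 10^9 objects and a tenfold growth in both data volume and compute speed (both numbers are made up), and reports how much longer each class of algorithm would then take.

```python
import math

def relative_runtime(cost, n, growth):
    """Runtime after data AND compute both grow by `growth`, relative to today.

    cost(n) is the operation count for n objects; a machine that is
    `growth` times faster divides the new operation count by `growth`.
    """
    return cost(n * growth) / (growth * cost(n))

# Cost models named on the slide.
COSTS = {
    "N log N": lambda n: n * math.log(n),
    "N^2 (correlation functions)": lambda n: n ** 2,
    "N^3 (likelihood techniques)": lambda n: n ** 3,
}

if __name__ == "__main__":
    n = 10 ** 9      # assumed catalog size (illustrative)
    growth = 10      # assumed growth factor for both data and compute
    for name, cost in COSTS.items():
        print(f"{name}: {relative_runtime(cost, n, growth):.2f}x slower")
```

Under these assumptions the N log N method slows down only slightly, while the N^2 and N^3 methods fall behind by factors of 10 and 100, which is the sense in which only N log N "keeps up".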

TerraServer/TerraService
US Geological Survey Photo (DOQ) & Topo (DRG) images online.
On Internet since June 1998
Operated by Microsoft Corporation
Cross Indexed with
–Home sales,
–Demographics,
–Encyclopedia
A web service
20 TB data source
10 M web hits/day

USGS Image Data
Digital OrthoQuads
–18 TB, 260,000 files uncompressed
–Digitized aerial imagery
–88% coverage conterminous US
–1 meter resolution
–< 10 years old
Digital Raster Graphics
–1 TB compressed TIFF, 65,000 files
–Scanned topographic maps
–100% U.S. coverage
–1:24,000, 1:100,000 and 1:250,000 scale maps
–Maps vary in age

User Interface Concept
Display Imagery:
–316 million 200 x 200 pixel images
–7 level image pyramid
–Resolution 1 meter/pixel to 64 meter/pixel
Navigation Tools:
–1.5 million place names
–“Click-on” Coverage map
–Longitude and Latitude search
–U.S. Address Search
External Geo-Spatial Links to:
–USGS On-line Stream Gauges
–Home Advisor Demographics
–Home Advisor Real Estate
–Encarta Articles
–Stream flow gauges
Concept: User navigates an ‘almost seamless’ image of earth
–Buttons to pan NW, N, NE, W, E, SW, S, SE
–Click on image to zoom in
–Links to switch between Topo, Imagery, and Relief data
–Links to Print, Download and view meta-data information
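The image pyramid described above can be sketched numerically. The slide gives 200 x 200 pixel tiles and 7 levels running from 1 to 64 meters per pixel; the sketch assumes the resolution doubles at each level (which is consistent with those endpoints) and uses a made-up tile addressing function, `tile_index`, that is not TerraServer's actual scheme.

```python
TILE_PIXELS = 200         # tile edge in pixels (from the slide)
BASE_RESOLUTION_M = 1     # finest level, meters per pixel (from the slide)
LEVELS = 7                # pyramid depth (from the slide)

def pyramid_levels(levels=LEVELS, base=BASE_RESOLUTION_M, tile=TILE_PIXELS):
    """Return (level, meters_per_pixel, tile_edge_in_meters) for each level.

    Assumes the resolution doubles from one level to the next.
    """
    return [(lvl, base * 2 ** lvl, base * 2 ** lvl * tile) for lvl in range(levels)]

def tile_index(x_m, y_m, level):
    """Hypothetical addressing: the tile covering ground point (x_m, y_m), in meters."""
    edge = BASE_RESOLUTION_M * 2 ** level * TILE_PIXELS
    return (int(x_m // edge), int(y_m // edge))

if __name__ == "__main__":
    for lvl, res, edge in pyramid_levels():
        print(f"level {lvl}: {res} m/pixel, tile covers {edge} m x {edge} m")
    print(tile_index(1000, 450, level=0))
```

Zooming in on a click then amounts to recomputing the tile index for the same ground point one level down the pyramid.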

Terra Service New Things
A popular web service
–Exactly the map you want.
Dynamic Map Re-projection
–UTM to Geographic projection
–Dynamic texture mapping?
New Data
–1 foot resolution natural color imagery
–Census Tiger data
Lights Out Management
–MOM
–Auto-backup / restore on drive failure

New “Urban Area” Data
“Redundant Bunch 1”
Microsoft Campus at 4 meter resolution
Ball field at 0.25 meter resolution

TerraServer Becomes a Web Service
TerraServer.net -> TerraService.Net
Web server is for people.
Web Service is for programs
–The end of screen scraping
–No faking a URL: pass real parameters.
–No parsing the answer: data formatted into your address space.
Hundreds of users but a specific example:
–US Department of Agriculture

TerraServer Web Services
Get image meta-data
Query TS Gazetteer
Retrieve TS ImageTiles
Projection conversions
Web Map Client
–OpenGIS “like”
–Landmarks layered on TerraServer imagery
Geo-coded data of well-known objects (points), e.g. Schools, Golf Courses, Hospitals, etc.
Polygons of well-known objects (shapes), e.g. Zip Codes, Cities, etc.
Fat Map Client
–Visual Basic / C# Windows Form
–Access Web Services for all data
[Diagram labels: Terra-Tile-Service, Landmark-Service, Sample Apps]

Web Services
Web SERVER:
–Given a url + parameters
–Returns a web page (often dynamic)
Web SERVICE:
–Given an XML document (soap msg)
–Returns an XML document
–Tools make this look like an RPC: F(x,y,z) returns (u, v, w)
–Distributed objects for the web.
–+ naming, discovery, security, ..
Internet-scale distributed computing
[Diagram labels: Your program, http, Web Server, Web page; Your program, soap, Web Service, object in xml, Data in your address space]
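The "XML document in, XML document out, looks like an RPC" idea can be made concrete with a small sketch. It builds and parses SOAP 1.1 envelopes with Python's standard library only; the service namespace, the method name F, and the parameter names are placeholders chosen to match the F(x,y,z) returns (u, v, w) notation on the slide, not the interface of any real service, and no network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"    # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/hypothetical-service"   # placeholder, not a real service

def build_message(element_name, fields):
    """Wrap named fields in a SOAP envelope; returns the XML as a string."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{element_name}")
    for name, value in fields.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

def parse_response(xml_text, method):
    """Extract the fields of <methodResponse> as a dict of strings."""
    root = ET.fromstring(xml_text)
    result = root.find(f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}{method}Response")
    # Strip the "{namespace}" prefix ElementTree puts on each tag.
    return {child.tag.split("}")[1]: child.text for child in result}

if __name__ == "__main__":
    request = build_message("F", {"x": 1, "y": 2, "z": 3})
    print(request)
    # Stand-in for the server's reply; a real client would POST `request` over HTTP.
    reply = build_message("FResponse", {"u": 4, "v": 5, "w": 6})
    print(parse_response(reply, "F"))
```

In practice a toolkit generates this plumbing from the service description, so the caller only sees a function call whose results arrive as typed data in its own address space.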

Terraserver Architecture
[Diagram labels: Standard Browsers; Smart Clients; Windows Forms; .NET Framework; Map UI; Web Forms; Map Server; Http Handler; TerraServer Web Service; ADO.NET; OLEDB; DB Server; 668 million rows; SQL TB Db. Data exchanged: HTML, Image/jpeg, XML]

TerraServer Schema
[Schema diagram of the TerraServer Terra Database. Labels as extracted: Admin Imagery SourceMeta Imagery Image Type AltState State Name Place Name AltPlace Gazetteer Feature Type Pyramid Image Search External Link External Group External Geo Search Famous Category Famous Place Scale Job Load Job Search Job Search Dest Search Job Log LoadMgmt ImageMeta AltCountry Country Name Small PlaceName Image Source JobQueue JobSystemMedia MediaFile NoImage]

[Diagram of the data loading pipeline. Labels as extracted: Internet Data Center; SQL Server Stored Procs (three times); 2 TB Database; Terra Scale; Read 4 Images; Write 1; Load Process; Terra Cutter; Read Image Files; Corporate Network; Active Server Pages; Loading Scheduling System; Terminal Server; Remote Management; 6 TB Staging Area; Bricks; Fire Wire disks]

TerraServer Hardware
Storage Bricks
–“White-box commodity servers”
–4 TB raw / 2 TB RAID1 SATA storage
–Dual Hyper-threaded Xeon 2.4 GHz, 4 GB RAM
Partitioned Databases (PACS – partitioned array)
–3 Storage Bricks = 1 TerraServer data
–Data partitioned across 20 databases
–More data & partitions coming
Low Cost Availability
–4 copies of the data: RAID1 SATA mirroring, 2 redundant “Bunches”
–Spare brick to repair failed brick: 2N+1 design
–Web Application “bunch aware”: load balances between redundant databases, fails over to surviving database on failure
~100K$ capital expense.
[Diagram label: KVM / IP]
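A "bunch aware" application, as described above, balances load across the redundant bunches and fails over when one goes down. The sketch below shows one simple way to get that behavior with round-robin selection; the class and method names are invented for illustration and this is not TerraServer's code. The `bricks_needed` helper reads "2N+1" as two bunches of N bricks plus one spare, which is an interpretation of the slide rather than something it spells out.

```python
import itertools

class BunchAwareRouter:
    """Round-robin over redundant database bunches, skipping failed ones."""

    def __init__(self, bunches):
        self.bunches = list(bunches)
        self.healthy = set(self.bunches)
        self._cycle = itertools.cycle(self.bunches)

    def mark_failed(self, bunch):
        self.healthy.discard(bunch)

    def mark_repaired(self, bunch):
        if bunch in self.bunches:
            self.healthy.add(bunch)

    def pick(self):
        """Return the next healthy bunch; raise if every bunch is down."""
        if not self.healthy:
            raise RuntimeError("no surviving bunch")
        for bunch in self._cycle:      # terminates: at least one bunch is healthy
            if bunch in self.healthy:
                return bunch

def bricks_needed(bricks_per_bunch):
    """2N+1, read here as two redundant bunches of N bricks plus one spare."""
    return 2 * bricks_per_bunch + 1

if __name__ == "__main__":
    router = BunchAwareRouter(["bunch1", "bunch2"])
    print([router.pick() for _ in range(4)])   # alternates between the two bunches
    router.mark_failed("bunch1")
    print([router.pick() for _ in range(3)])   # only the surviving bunch is used
    print(bricks_needed(3))                    # 3 bricks per copy of the data
```

With 3 storage bricks per copy of the data, this reading of the 2N+1 design gives 7 bricks in total.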

Research Objectives
User/App Goals
–Public: Access to remote sensing data with no GIS expertise required
–Ubiquitous: No special hw/sw required by client
–Delivery: All OnLine/Internet Based, no tape or CD distribution
–Simple: Designed to be used by a “6th grade geography student”
Technology Goals
–Test/show scalability
–Test/show availability
–Test/show lights out: all operations & maintenance occur remotely, with minimal ops and dev staff
–“web service” poster child

Virtual Observatory
Premise: Most data is (or could be) online
So, the Internet is the world’s best telescope:
–It has data on every part of the sky
–In every measured spectral band: optical, x-ray, radio, ..
–As deep as the best instruments (2 years ago).
–It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no ..).
–It’s a smart telescope: links objects and data to literature on them.

Why Astronomy Data?
It has no commercial value
–No privacy concerns
–Can freely share results with others
–Great for experimenting with algorithms
It is real and well documented
–High-dimensional data (with confidence intervals)
–Spatial data
–Temporal data
Many different instruments from many different places and many different times
Federation is a goal
The questions are interesting
–How did the universe form?
There is a lot of it (petabytes)
[Image panels: IRAS 100μ, ROSAT ~keV, DSS Optical, 2MASS 2μ, IRAS 25μ, NVSS 20cm, WENSS 92cm, GB 6cm]

Time and Spectral Dimensions
The Multiwavelength Crab Nebula
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert CalTech.
[Image label: Crab star 1053 AD]

SkyServer.SDSS.org
A modern archive
–Raw pixel data lives in file servers
–Catalog data (derived objects) lives in Database
–Online query to any and all
Also used for education
–150 hours of online Astronomy
–Implicitly teaches data analysis
Interesting things
–Spatial data search
–Client query interface via Java Applet
–Query interface via Emacs
–Popular -- 1% of Terraserver
–Cloned by other surveys (a template design)
–Web services are core of it.

Demo of SkyServer
Shows standard web server
Pixel/image data
Point and click
Explore one object
Explore sets of objects (data mining)

Federation
Data Federations of Web Services
Massive datasets live near their owners:
–Near the instrument’s software pipeline
–Near the applications
–Near data knowledge and curation
–Super Computer centers become Super Data Centers
Each Archive publishes a web service
–Schema: documents the data
–Methods on objects (queries)
Scientists get “personalized” extracts
Uniform access to multiple Archives
–A common global schema

Federation: SkyQuery.Net
Combine 4 archives initially
Just added 10 more
Send query to portal, portal joins data from archives.
Problem: want to do multi-step data analysis (not just single query).
Solution: Allow personal databases on portal
Problem: some queries are monsters
Solution: “batch schedule” on portal server, deposits answer in personal database.

SkyQuery Structure
[Diagram labels: 2MASS, INT, SDSS, FIRST, SkyQuery Portal, Image Cutout]
Each SkyNode publishes
–Schema Web Service
–Database Web Service
Portal
–Plans query (2 phase)
–Integrates answers
–Is itself a web service

SkyQuery: Distributed Query tool using a set of web services
Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England).
Feasibility study, built in 6 weeks
–Tanu Malik (JHU CS grad student)
–Tamas Budavari (JHU astro postdoc)
–With help from Szalay, Thakar, Gray
Implemented in C# and .NET
Allows queries like:
    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 AND (o.I - t.m_j)>2
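The cross-archive join in the query above can be approximated, for intuition only, as a positional match between two object lists. The sketch below pairs objects whose angular separation is within a tolerance. Treating the tolerance as arcseconds, using a brute-force loop, and the dictionary field names `objId`, `ra`, `dec` are all assumptions made for this illustration; the slide does not define XMATCH, and the real operator may be computed quite differently.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula.

    Inputs are in degrees; the result is in arcseconds.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600

def cross_match(left, right, max_sep_arcsec):
    """Brute-force positional join: all (left, right) id pairs within the tolerance."""
    matches = []
    for a in left:
        for b in right:
            sep = angular_separation_arcsec(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= max_sep_arcsec:
                matches.append((a["objId"], b["objId"]))
    return matches

if __name__ == "__main__":
    # Made-up positions near the AREA center used in the example query.
    optical = [{"objId": 1, "ra": 181.3, "dec": -0.76}]
    infrared = [
        {"objId": "a", "ra": 181.3, "dec": -0.7595},   # about 1.8 arcsec away
        {"objId": "b", "ra": 181.4, "dec": -0.76},     # about 6 arcmin away
    ]
    print(cross_match(optical, infrared, max_sep_arcsec=3.5))
```

The double loop is quadratic in catalog size, so a production system would restrict candidates with a spatial index first; the sketch keeps the brute-force form to make the matching condition itself easy to read.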

SkyNode Basic Web Services
Metadata information about resources
–Waveband
–Sky coverage
–Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
–Cone Search
–Image mosaic
–Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
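A cone search, one of the simple search patterns listed above, returns every object within a given angular radius of a sky position. The sketch below is a minimal in-memory version under stated assumptions: the catalog is a list of dictionaries with `ra` and `dec` in degrees (a made-up layout), and membership is tested by comparing dot products of unit vectors against the cosine of the radius. It says nothing about how a SkyNode actually implements the service.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert (RA, Dec) in degrees to a Cartesian unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def cone_search(catalog, ra_deg, dec_deg, radius_deg):
    """Return the rows of `catalog` lying within `radius_deg` of the given position."""
    center = unit_vector(ra_deg, dec_deg)
    # A point is inside the cone when the angle to the center is small,
    # i.e. when the dot product is at least cos(radius).
    min_dot = math.cos(math.radians(radius_deg))
    hits = []
    for row in catalog:
        v = unit_vector(row["ra"], row["dec"])
        if sum(c * x for c, x in zip(center, v)) >= min_dot:
            hits.append(row)
    return hits

if __name__ == "__main__":
    catalog = [
        {"name": "A", "ra": 10.00, "dec": 20.0},
        {"name": "B", "ra": 10.05, "dec": 20.0},
        {"name": "C", "ra": 12.00, "dec": 20.0},
    ]
    inside = cone_search(catalog, ra_deg=10.0, dec_deg=20.0, radius_deg=0.1)
    print([row["name"] for row in inside])
```

Counting or histogramming the result is then a one-liner over the returned rows, which is roughly the level of service the slide groups under simple filtering.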

Portals: Higher Level Services
Built on Atomic Services
Perform more complex tasks
Examples
–Automated resource discovery
–Cross-identifications
–Photometric redshifts
–Outlier detections
–Visualization facilities
Goal:
–Build custom portals in days from existing building blocks (like today in IRAF or IDL)

MyDB added to SkyQuery
[Diagram labels: 2MASS, INT, SDSS, FIRST, SkyQuery Portal, Image Cutout, MyDB]
Let users add personal DB
1 GB for now.
Use it as a workbook.
Online and batch queries.
Moves analysis to the data
Users can cooperate (share MyDB)
Still exploring this

The Big Picture
[Diagram labels: Experiments & Instruments, Simulations, Literature, Other Archives, facts, questions, answers]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
Query and Vis tools
Support/training
Performance
–Execute queries in a minute
–Batch query scheduling

Grid and Web Services Synergy
I believe the Grid will be many web services that share data (computrons are free)
IETF standards provide
–Naming
–Authorization / Security / Privacy
–Distributed Objects: Discovery, Definition, Invocation, Object Model
–Higher level services: workflow, transactions, DB, ..
Synergy: commercial Internet & Grid tools