Published by Hollie Amanda Merritt. Modified over 9 years ago.
1
The Future of MOCHA
Nick Roussopoulos
October 5, 2001
2
Stanford Oct 5, 2001 Nick Roussopoulos
The Problem
Data sources for an enterprise are:
– Distributed
  Internet, intranets, extranets
– Heterogeneous
  Web servers, relational databases, file systems
– Mission-critical
  Weather service, ocean temperature, stock status, …
– Costly to replace or upgrade
  Risk of breaking them and losing the investment
Distributed and heterogeneous data sources
3
The Problem
[Diagram: a client reaching Oracle 8i, Informix, XML data, and text data sources over the Internet]
High volume access from everywhere
4
Client-Server
2-tier architecture
Complex FAT clients
Bad idea
[Diagram: client connected over the Internet to Oracle 8i, Informix, XML data, and text data sources]
5
Middleware
3-tier architecture
Thin & fit clients
[Diagram: client, Internet, integration server, catalog, translator, and data sources (Oracle 8i, Informix, XML data, text data)]
6
Nice but…
Most middleware solutions are static
  Not flexible for dynamic environments
  Not scalable to hundreds of client and server sites
Development cost is high
  One site at a time, at a fixed cost
Maintenance cost is high
  Upgrades are practically redevelopments
7
A dynamic world needs
Code extensibility & auto-deployment
Need for user-defined types and functions
– Polygon
– Composite(): image aggregation
Porting and manual installation of code (C/C++)
– Operating system
– Hardware platform
High cost of code maintenance
– Updates on all platforms
– Version management
Security in hostile platforms
8
Code Deployment Problem
[Diagram: client, Internet, integration server, catalog, translator, and data sources (Oracle 8i, Informix, XML data, text data)]
Not scalable
9
Query Processing
Query execution options
– Limited by site-dependent software
  Composite() must be ported before use
Most processing done at the Integration Server
– Powerful data servers are under-utilized
  I/O nodes
– Excessive data movement over the network
  Network bottleneck
  Slow Internet access
10
Query Processing Problem
[Diagram: the middleware architecture (client, Internet, integration server, catalog, translator, data sources) with transfers of 100MB and 200MB marked]
Inefficient & not scalable
11
Solution
MOCHA: Middleware Based On a Code SHipping Architecture
12
MOCHA Solution: Ship Java Code
Mochlets
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Diagram: client, QPC with code repository and catalog, and DAPs in front of Oracle and Informix, connected over the Internet; site labels Virginia, Maryland, Virginia, Texas; markers labeled Q]
No code porting & no maintenance
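The code-shipping idea on this slide can be sketched in a few lines. This is only an illustration in Python, not MOCHA's Java implementation: the `CODE_REPOSITORY` dictionary, the `DataAccessProvider` class, and the per-pixel-maximum body of `Composite` are assumptions made for the example and do not come from the slides.

```python
# Hypothetical code repository: operator name -> source text to ship.
# The body of Composite (per-pixel maximum) is an assumption for illustration.
CODE_REPOSITORY = {
    "Composite": (
        "def Composite(images):\n"
        "    return [max(pixels) for pixels in zip(*images)]\n"
    ),
}


class DataAccessProvider:
    """Stand-in for a DAP that receives operator code on demand."""

    def __init__(self, repository):
        self._repository = repository
        self._loaded = {}  # operators already deployed at this site

    def load_operator(self, name):
        # Fetch and compile the operator only the first time it is needed.
        if name not in self._loaded:
            namespace = {}
            exec(self._repository[name], namespace)  # trusted code only
            self._loaded[name] = namespace[name]
        return self._loaded[name]


dap = DataAccessProvider(CODE_REPOSITORY)
composite = dap.load_operator("Composite")
result = composite([[1, 5], [4, 2]])  # two tiny "images" of two pixels each
```

Nothing is installed at the data site ahead of time: the operator arrives when a query first needs it and is cached afterwards, which is the property the slide summarizes as "no code porting & no maintenance".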
13
MOCHA Solution: Filter Data @ Source
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Diagram: client, QPC with code repository and catalog, and DAPs in front of Oracle and Informix, connected over the Internet; site labels Virginia, Maryland, Virginia, Texas; 200MB and 100MB of tuples at the sources; results of 200KB and 150KB returned, 350KB in total]
No bandwidth waste
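The bandwidth saving on this slide is simple arithmetic, worked through below. The volumes are the ones shown on the slide; the site labels `site_a` and `site_b`, the pairing of raw volumes with result volumes, and the use of decimal units (1 MB = 1000 KB) are assumptions.

```python
KB_PER_MB = 1000  # decimal units, assumed for simplicity

# Raw tuple volumes at the two sources, in KB (200MB and 100MB on the slide).
raw_tuples_kb = {"site_a": 200 * KB_PER_MB, "site_b": 100 * KB_PER_MB}

# Volumes returned after Composite() runs at the source (200KB and 150KB).
filtered_results_kb = {"site_a": 200, "site_b": 150}

# Data shipping moves every raw tuple; code shipping moves only the results.
data_shipping_kb = sum(raw_tuples_kb.values())
code_shipping_kb = sum(filtered_results_kb.values())
reduction = data_shipping_kb / code_shipping_kb

print(f"data shipping: {data_shipping_kb} KB")
print(f"code shipping: {code_shipping_kb} KB")
print(f"about {reduction:.0f}x less data on the network")
```

Under these assumptions the network carries 350 KB instead of 300 MB.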
14
Software architecture
[Diagram: Client; QPC with Code Repository and Catalog; DAP; data sources (DBMS, OS file)]
15
QPC: The Query Processing Coordinator
QPC controls and coordinates query execution
[Diagram components: Client API, Query Parser, Catalog Manager, Query Optimizer, Execution Engine, Code Loader, SQL & XML Proc. Interface, DAP Access API; connected to XML Catalog, Code Repository, and DAP]
16
DAP: The Data Access Provider
DAP provides QPC with remote access to the data
[Diagram components: DAP Access API, Control Module, Execution Engine, Code Loader, SQL & XML Proc. Interface, Data Source Access Layer (JDBC, I/O API, DOM, JNI), Data Source]
17
Data Server: Storage System
Stores and manages the data sets
– database, web server, file system, XML repository
18
Processing a Query in MOCHA
Query Parsing
Resource Discovery
Query Optimization
Metadata and Control Exchange
Code Deployment Phase
Query Execution
Table Rasters: location, image, week, band
Query:
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
19
Plan Generation
[Diagram: Client, QPC with Code Repository and Catalog, and DAPs for Informix and Oracle; coordination thread and execution thread]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
20
Automatic Code Deployment
[Diagram: Client, QPC with Code Repository and Catalog, and DAPs for Informix and Oracle; coordination thread and execution thread]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
21
Data Processing
[Diagram: Client, QPC with Code Repository and Catalog, and DAPs for Informix and Oracle; coordination thread and execution thread]
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
22
Features of MOCHA
Automatic code deployment
  “Plug-N-Play”
  no system-wide installations
Metadata and Schema Mapping framework
  XML, RDF
  easy to exchange and map schemas
  semi-automatic mapping
Query optimization based on code shipping
– reduce data movement overhead
  filters at the source
  expands at the client
  metrics for code (operator) placement
  optimization for selection, union and join plans
23
MOCHA Demo: Global Land Cover Facility
Integrates the following DAP sites
– University of New Hampshire (Webster), NASA GSFC, UMD-CS, UMD-Geography, UMD-UMIACS SP-2 HPSS
GLCF hosts the QPC
Operations supported:
– Coverage queries
– Visualization of preview images for
– Data sets
  MODIS, TM, AVHRR
– GIS Features
Dynamic sub-setting of TM scenes
Composites of GIS features and AVHRR images
24
Multi-Sensor Analysis of the Los Alamos Fire Event Using MOCHA
Data synergy and multi-resolution instrument analysis using MOCHA
– Access data residing at various data sources
– Utilize image processing tools
Fire analysis required a multi-resolution approach
– MOCHA is independent of instrument or resolution specifics
High resolution: IKONOS and TM data
Moderate resolution: 250m MODIS
Coarse resolution: AVHRR and DMSP
25
MOCHA Search Utility
26
MOCHA Search Utility (cont’d)
27
MOCHA Search Utility (cont’d)
28
MOCHA Query Results
29
MOCHA ETM+ Subsetting Utility
30
May 9, 2000 Los Alamos (Bands 1,2,3)
31
May 9, 2000 Los Alamos (Bands 7,5,4)
32
Multi-Sensor Query
33
Tabular Query Results
34
MODIS: May 11, 2000: During Fire
35
MODIS: May 24, 2000: After Fire
36
DMSP: Night Visibility of Fire
37
IKONOS 4m resolution
38
IKONOS 4m Subset
39
IKONOS 1m resolution
40
IKONOS 1m Subset
41
MOCHA Metadata Publishing Framework
Provides information about system resources
  Data sources
  schemas and mappings
  user-defined types and functions
Automates operation of MOCHA
Incremental system growth
  neither fixed nor hardwired parameters
  no extension by re-compilation
Share metadata with others (Internet)
  machine readable form
42
Metadata Publishing Framework
Query:
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
Table Rasters: location, image, week, band
1. What kind of metadata are needed?
2. How to specify them?
43
MOCHA Catalog Organization
Metadata about “resources”
– Local and global tables
– UDF data types and operators
– Schema mapping rules
– DAPs
Each one has a Uniform Resource Identifier (URI)
  global namespace
– e.g.: mocha://cs1.umd.edu/EarthSci/Polygon
Modeled with RDF, serialized with XML
  easy to understand, use and exchange
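A resource URI such as the one above can be taken apart with a standard URL parser. The sketch below assumes, without confirmation from the slides, that the host part names the site, the last path segment names the resource, and the segments in between form a namespace; `parse_mocha_uri` is a hypothetical helper, not part of MOCHA.

```python
from urllib.parse import urlsplit


def parse_mocha_uri(uri):
    """Split a mocha:// URI into host, namespace, and resource name."""
    parts = urlsplit(uri)
    if parts.scheme != "mocha":
        raise ValueError(f"not a mocha URI: {uri!r}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError(f"URI names no resource: {uri!r}")
    return {
        "host": parts.netloc,
        "namespace": "/".join(segments[:-1]),  # assumed interpretation
        "name": segments[-1],
    }


polygon = parse_mocha_uri("mocha://cs1.umd.edu/EarthSci/Polygon")
print(polygon)
```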
44
RDF Model: Data Types
Resource: mocha://cs1.umd.edu/EarthSci/Raster
mocha:Type: Raster
mocha:Class: Raster.class
mocha:Repository: cs1.umd.edu/EarthSci
mocha:Size: 1 megabyte
mocha:Creator: user1@cs.umd.edu
45
XML Serialization: Data Types
W3C Standards
Easy to specify using GUI tools
Easy to exchange
Crawlers can harvest it
Stored in
– DB
– File System
<rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
  <mocha:Type>Raster</mocha:Type>
  <mocha:Class>Raster.class</mocha:Class>
  <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
  <mocha:Size>1 MB</mocha:Size>
  <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
</rdf:Description>
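Because the catalog entries are plain RDF/XML, any XML parser can read them. The sketch below uses Python's standard `xml.etree.ElementTree`. The `rdf` namespace URI is the standard W3C one; the `mocha` namespace URI is a placeholder invented for this example, and the element names follow the property names from the RDF model slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MOCHA_NS = "http://example.org/mocha#"  # hypothetical namespace URI

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:mocha="{MOCHA_NS}">
  <rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
    <mocha:Type>Raster</mocha:Type>
    <mocha:Class>Raster.class</mocha:Class>
    <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
    <mocha:Size>1 MB</mocha:Size>
    <mocha:Creator>user1@cs1.umd.edu</mocha:Creator>
  </rdf:Description>
</rdf:RDF>"""


def read_descriptions(xml_text):
    """Return {resource URI: {property name: value}} for each description."""
    root = ET.fromstring(xml_text)
    catalog = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Tags look like "{namespace}Type"; keep only the local name.
        props = {child.tag.split("}", 1)[1]: child.text for child in desc}
        catalog[desc.get("about")] = props
    return catalog


catalog = read_descriptions(DOC)
raster = catalog["mocha://cs1.umd.edu/EarthSci/Raster"]
print(raster["Class"], "from", raster["Repository"])
```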
46
Other Resources in MOCHA
Local and Global tables
– data sources + columns + types
UDF Functions
– argument types + return type
– code repository
Schema mapping rules
DAPs
– URL
– login information
47
Schema Mapping in MOCHA
Rasters: location, image, week, band
RastersMD: point1, point2, photo, date, band
Direct column mappings
Complex expressions: rect(), week()
48
MOCHA Schema Mapping Rules
Use XML to encode mapping rules
Schema mapping sub-plans (SMP)
– leaf nodes
image = photo
location = rect(point1, point2)
…
[Diagram: plan tree with SMP leaf nodes]
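The two kinds of rule, direct column mappings and complex expressions, can be illustrated with a small table of functions applied to each source row. This is a sketch under assumptions: the bodies of `rect()` and `week()` (a bounding box and the ISO week number) are guesses at what such functions might do, the `week = week(date)` rule is inferred from the schema-mapping diagram, and MOCHA itself encodes the rules in XML rather than Python.

```python
from datetime import date


def rect(p1, p2):
    # Assumed meaning: bounding box (xmin, ymin, xmax, ymax) of two corners.
    return (min(p1[0], p2[0]), min(p1[1], p2[1]),
            max(p1[0], p2[0]), max(p1[1], p2[1]))


def week(d):
    # Assumed meaning: ISO week number of the acquisition date.
    return d.isocalendar()[1]


# One rule per target column of Rasters, applied to a RastersMD row.
MAPPING_RULES = {
    "location": lambda row: rect(row["point1"], row["point2"]),  # complex
    "image": lambda row: row["photo"],                           # direct
    "week": lambda row: week(row["date"]),                       # complex
    "band": lambda row: row["band"],                             # direct
}


def to_rasters(row):
    """Map one RastersMD row onto the Rasters schema."""
    return {column: rule(row) for column, rule in MAPPING_RULES.items()}


source_row = {"point1": (0, 5), "point2": (10, 0), "photo": b"pixels",
              "date": date(2000, 5, 9), "band": 3}
mapped = to_rasters(source_row)
print(mapped)
```

Because each rule reads only the source row, the mapping can sit at the leaves of a query plan, which is where the slide places the schema mapping sub-plans.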
49
Query Optimization Problem
Issue 1: Cost of query execution
– What is the dominant factor?
Issue 2: Placement of UDF operator execution
– Which go to QPC?
– Which go to DAPs?
Issue 3: How to generate query plans?
– Dynamic programming [SAC+79], [ML86]
– But search space is enormous and full of “bad” plans
  Placement of UDF, joins, execution sites …
50
MOCHA Optimization Framework
Query optimization based on heuristics
cost = network + CPU + I/O
Network is the dominant factor (WAN)
  optimize for it first
CPU and I/O are cheaper
  optimize for them later
Operator placement: Enhanced Hybrid Shipping
  Code
  Data
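One way to read "optimize for network first, CPU and I/O later" is as a two-level comparison: rank plans by network cost, and use CPU plus I/O only to break ties. The sketch below shows that reading. The plan names and cost figures are invented for illustration, and the slides do not say that MOCHA compares plans exactly this way.

```python
# Hypothetical candidate plans with per-component costs in seconds.
plans = [
    {"name": "operators at QPC", "network": 300.0, "cpu": 5.0, "io": 8.0},
    {"name": "operators at DAPs", "network": 0.35, "cpu": 9.0, "io": 8.0},
    {"name": "operators at DAPs, slower scan",
     "network": 0.35, "cpu": 9.0, "io": 12.0},
]


def total_cost(plan):
    # cost = network + CPU + I/O, as on the slide.
    return plan["network"] + plan["cpu"] + plan["io"]


def heuristic_key(plan):
    # Network dominates on a WAN, so compare it first; CPU + I/O breaks ties.
    return (plan["network"], plan["cpu"] + plan["io"])


best = min(plans, key=heuristic_key)
print(best["name"], total_cost(best))
```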
51
Operator Placement in MOCHA
Data-Reducing Operators
– “Filter” the data
– aggregates, predicates, projections, semi-joins
  Composite(), Overlaps(), AvgEnergy()
Push to the DAPs
Return distilled results
Less data movement
[Diagram: Composite()]
52
Operator Placement in MOCHA
Data-Inflating Operators
  “Expand” the data
  projections, image processing, some joins …
  DoubleResolution(), RotateSolid()
Pull to the QPC
Data Shipping policy [FJK96]
Only send back raw arguments
Less data movement
[Diagram: DoubleRes()]
53
Placement Metric: VRF
Volume Reduction Factor: given operator f and relation R,
$\mathrm{VRF}(f) = \frac{VDT}{VDA}$
where
VDT: volume of data transmitted after applying f to R
VDA: volume of data originally present in R
f is Data-Reducing: VRF < 1, e.g. Composite()
f is Data-Inflating: VRF ≥ 1, e.g. DoubleRes()
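The metric and the placement rule it implies fit in a few lines. In this sketch the function names `vrf` and `placement` and the example volumes are assumptions chosen for illustration; only the ratio VDT/VDA and the threshold of 1 come from the slide.

```python
def vrf(vdt, vda):
    """Volume Reduction Factor: data transmitted after f / data in R."""
    if vda <= 0:
        raise ValueError("VDA must be positive")
    return vdt / vda


def placement(vrf_value):
    # Data-reducing operators run at the DAP, data-inflating ones at the QPC.
    return "DAP" if vrf_value < 1 else "QPC"


# Illustrative volumes in KB: an aggregate that keeps 1% of its input, and
# an operator that quadruples it (doubling resolution in both dimensions).
composite_vrf = vrf(vdt=10, vda=1000)
double_res_vrf = vrf(vdt=400, vda=100)

print("Composite():", composite_vrf, placement(composite_vrf))
print("DoubleRes():", double_res_vrf, placement(double_res_vrf))
```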
54
Goal: Plans with small CVRF
Cumulative Volume Reduction Factor: given a plan P to solve query Q over relations R1, …, Rn,
$\mathrm{CVRF}(P) = \frac{CVDT}{CVDA}$
where
CVDT: volume of data transmitted by applying all operators in P to R1, …, Rn
CVDA: volume of data originally present in R1, …, Rn
Optimizer searches for plans that move a minimal amount of data.
[Diagram: search space, CVRF(Plan) ∈ [0,1]]
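Choosing the plan with the smallest CVRF can be sketched as a minimum over candidate plans. The relation sizes and result sizes below reuse the 200MB/100MB and 200KB/150KB figures from the "Filter Data @ Source" slide, expressed in KB with decimal units; the plan names and the dictionary layout are assumptions.

```python
def cvrf(volumes_transmitted, volumes_present):
    """Cumulative VRF: total data moved by the plan / total data in R1..Rn."""
    return sum(volumes_transmitted) / sum(volumes_present)


# Data originally present in each relation, in KB.
relations_kb = [200_000, 100_000]

# Data each candidate plan would send over the network, per relation, in KB.
candidate_plans = {
    "data shipping (operators at QPC)": [200_000, 100_000],
    "code shipping (Composite() at DAPs)": [200, 150],
}

scores = {name: cvrf(sent, relations_kb)
          for name, sent in candidate_plans.items()}
best_plan = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: CVRF = {score:.4f}")
print("chosen:", best_plan)
```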
55
MOCHA Query Optimizer
System R style
– Left-deep plans (joins at QPC)
– cost: execution time (network + CPU + I/O)
– operator placement: VRF and plan cost
– selections, unions and joins
Placement Policy: Enhanced Hybrid Shipping
– Code Shipping: operators at DAPs
– Data Shipping: operators at QPC
– generalizes Hybrid Shipping [FJK96]
56
Sequoia 2000 Benchmark
Goals of first experiment:
– Measure how good code shipping can be
– Validate heuristics being proposed
  VRF
  CVRF
Configured MOCHA with plans that place operators
– at DAP with code shipping
– at QPC with data shipping
57
Reducing vs. Inflating
[Chart: running time (secs) for QPC vs. DAP placement, by query class Q1, Q2, Q3]
Query classes
– Q1: Composite of all images
– Q2: Clipping and sub-setting
– Q3: Double resolution of images
Performance
– composites
  99% data reduction
  4-1 better performance
– clipping and expansion
  80% data reduction
  3-1 better performance
Validates heuristics
58
VRF vs. Selectivity
Selectivity and cardinality not enough for distributed predicate placement
Consider 50% selectivity
  DAP CVRF = 0.01
  QPC CVRF = 1
[Chart: running time (secs) for QPC vs. DAP placement at selectivity 0, .25, .50, .75, 1]
VRF is a better metric
59
WAN Experiment
Sites used:
– University of Maryland (QPC)
– University of Puerto Rico
– Oregon Graduate Institute
– University of North Dakota
– University of Alabama
60
Union with Data-Reducing
EHS is the better option
– Filters data
– 2-1 better performance
– Minimal resource usage
Q6:
Select landuse, location
From polygons
Where perimeter(location) > 2000.0
Sites: UPR and OGI
61
Union with Reducing and Inflating
Q5:
Select landuse, location, triangulate(location)
From Polygons
Where perimeter(location) > 2000.0
EHS is better than DS and QS
  2-1 better than QS
  6-1 better than DS
  Consumes least resources
Sites: UPR and OGI
62
Join with Data-Reducing
EHS is the better option
  3-1 better performance
– Minimal resource usage
Same pattern as with unions
– Data movement is the key
Q8:
Select P.landuse, R.location, R.week
From polygons P, rasters R
Where overlaps(P.location, R.location)
And perimeter(P.location) > 2000.0
Sites: UPR and OGI
63
Union with Extra Load
EHS is still the better option
Extra load has impact on both
Not clear if data shipping wins in real situations
Q9:
Select landuse, location
From polygons
Where perimeter(location) > 2000.0
Sites: UPR and OGI
[Chart labels: Load 20, Load 10]
64
MOCHA System Status
Operational MOCHA prototype
– It’s real!
– over 40,000 lines of 100% Java code (JDK 1.3)
– People involved:
  Manuel Rodriguez-Martinez (lead)
  Mike McGann
  Steve Kelley
  Vadim Katz
  John Townshend, Frank Lindsay, Ben White (Geographers)
  Joseph JaJa (Algorithms)
– Tested with NASA ESIP Federation
  Los Alamos fire
– Supports: Oracle, Postgres, Informix, Sybase, HPSS
65
Features of MOCHA
Automatic Code Deployment
Scalable middleware architecture
Query optimization based on data movement reduction
Metadata publishing framework [RMR00a]
  RDF and XML
  Publish schemas, mappings, types and functions
  Drives automatic code deployment
Schema mapping rules
  expressed in XML
  attach as leaf nodes in query plan
  extensible
66
MOCHA Publications
Research papers and talks
– ACM SIGMOD 2000
– EDBT 2000
Demos
– ACM SIGMOD 2000
– SSDBM 2001
– NASA ESIP meetings and workshops
– U.S. National Academy of Sciences
67
The Future of MOCHA
A Million Site MOCHA
68
The Future of MOCHA
The role of MOCHA in distributed software systems
– sensors
– satellites
– network switches and routers
– laptops, palm computers
– custom-built devices
– cars, planes, boats
– people (firemen), animals (whales)
69
Network of MOCHA-enabled sensors
Sensors are deployed in an area using ad hoc network techniques
Sensors run Java JDK 1.3
Lighter sensors run Java JDK 1.3 Micro Edition
[Diagram label: DAP]
70
Organization of sensors
[Diagram legend: Leader, Normal Sensor, Groups]
Sensors are grouped together for a specific goal or service
  data acquisition
  data aggregation, analysis
  data streaming
Group leaders are responsible for
  establishing themselves (broadcast, voting, …)
  coordination among sensors
  making decisions (agents)
  participating in other higher-level groups (hybrid P2P)
71
Concrete Example (from NASA)
Constellation of satellites (with sensors)
A group observes Gamma radiation
– aggregates measurements
– determines an important radiation event
Group leader tells other peer group leaders to instruct their sensors to observe the Gamma radiation event (reaction).
The system adapts to changes in the environment
72
MOCHA’s Code Shipping feature for
upgrades to fix bugs
fresh code to gather data
– at different resolution
– new aggregates or functions
dynamically configured code
– application-specific security protocol
– location-dependent encryption