
1 Disklets
– Take streams as inputs, generate streams as outputs
– Streams are accessed using an interface that delivers data in buffers of known size
– Cannot allocate or free memory
– Have access only to pre-allocated buffers and scratch memory
– Cannot initiate I/O operations
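A minimal sketch of this contract, written in Java for readability (the paper's disklets are on-disk code; every name below is invented for illustration):

    // Hypothetical disklet contract: the on-disk OS hands the disklet
    // pre-allocated buffers; the disklet never allocates memory or starts I/O.
    interface Disklet {
        // Run once at installation, with pre-allocated scratch space and
        // caller-supplied parameters.
        void init(byte[] scratch, byte[] params);

        // Run as each input buffer (of known size) arrives; results go into
        // the pre-allocated output buffer. Returns the number of bytes produced.
        int process(byte[] in, int inLen, byte[] out);
    }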

2 Streams
– Disk-resident streams: files or ranges within files
– Host-resident streams: used by host-resident code to interact with disklets
– Pipe streams: used to pipe the results of one disklet into another
– Streams are accessed using an interface that delivers data in buffers whose size is known a priori
– Disklets must have at least one input and one output stream
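The three stream kinds can be pictured as a host-side binding (invented names; the slides give no concrete API):

    // Hypothetical description of where a disklet's streams come from and go to.
    enum StreamKind {
        DISK_RESIDENT,  // a file, or a byte range within a file
        HOST_RESIDENT,  // an endpoint read or written by host-resident code
        PIPE            // connects one disklet's output to another's input
    }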

3 Additional Properties of Disklets
– Initialization function that is run when the disklet is installed
– Processing function (read/write) that is run as data is read or written
– Long-term scratch space, a set of parameters to customize behavior, and a finalization function run when the disklet terminates
– A disklet cannot initiate I/O: all I/O operations are initiated by the host-resident program and checked for validity by the host-resident file system
  – Disklets cannot corrupt the file system
  – The OS layer on the disk need not provide file-system functionality
– A disklet is allowed to skip sub-ranges of an input stream by notifying the OS layer on the disk
  – This makes possible algorithms that use indexing through the use of two streams: data delivered on the index stream is used to decide which parts of the data stream are to be read and which are to be skipped (see the sketch below)
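A hedged sketch of that two-stream indexing idea: entries on the index stream name the byte ranges of the data stream worth reading, and the disklet asks the on-disk OS to skip everything else. The entry layout and the skip(...) call are assumptions, not the paper's API:

    // Hypothetical: each 16-byte index entry is (offset, length) of a relevant
    // sub-range of the data stream; the gaps between entries are skipped.
    final class IndexedReader {
        interface DiskOs { void skip(int dataStream, long fromOffset, long toOffset); }

        void onIndexBuffer(java.nio.ByteBuffer index, DiskOs os, int dataStream) {
            long cursor = 0;
            while (index.remaining() >= 16) {
                long offset = index.getLong();
                long length = index.getLong();
                if (offset > cursor) {
                    os.skip(dataStream, cursor, offset); // tell the OS layer not to deliver this range
                }
                cursor = offset + length;                // this range will be delivered
            }
        }
    }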

4 More Disklet Properties
– A disklet cannot allocate or free memory: all memory management is done by the operating-system layer on the disk
– Memory accesses must stay within a sandbox defined by the buffers for the input streams and the long-term scratch space
– The disklet binary is analyzed at download time; disklets that may violate memory safety are rejected
– Communication between a disklet and its environment is restricted to the input and output streams
– Sources and sinks are specified by the host-resident program as part of disklet installation
  – A disklet cannot determine where its input comes from or where its output goes
– Figure 1: disklet pseudocode
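Figure 1 itself is not reproduced in this transcript; the following guess at its shape shows a select-style disklet that stays inside the sandbox, touching only the buffers it is handed:

    // Hypothetical filter disklet: copy 32-byte records whose leading 4-byte
    // key is below a threshold stored in the scratch area. Record layout,
    // sizes, and names are all invented for illustration.
    final class FilterDisklet {
        static final int REC = 32;

        int process(byte[] in, int inLen, byte[] out, byte[] scratch) {
            int threshold = ((scratch[0] & 0xFF) << 24) | ((scratch[1] & 0xFF) << 16)
                          | ((scratch[2] & 0xFF) << 8)  |  (scratch[3] & 0xFF);
            int produced = 0;
            for (int r = 0; r + REC <= inLen; r += REC) {
                int key = ((in[r] & 0xFF) << 24) | ((in[r + 1] & 0xFF) << 16)
                        | ((in[r + 2] & 0xFF) << 8) |  (in[r + 3] & 0xFF);
                if (key < threshold && produced + REC <= out.length) {
                    System.arraycopy(in, r, out, produced, REC); // stay within given buffers
                    produced += REC;
                }
            }
            return produced; // bytes written to the pre-allocated output buffer
        }
    }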

5 DiskOS
– Services
  – Memory management: the stream-based model simplifies memory management, since memory is allocated in contiguous blocks whose size is known a priori and whose lifetime is known
  – Stream communication: all stream buffers are preallocated
  – Disklet scheduling: a disklet is ready to run when new data is available on one or more of its streams
– Host-level support
  – Installation of disklets and management of host-resident streams
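The readiness rule suggests an event-loop sketch like this (invented structure; the slide only states the condition):

    import java.util.*;

    // Hypothetical DiskOS scheduler: a disklet becomes runnable whenever new
    // data arrives on at least one of the streams it subscribes to.
    final class DiskOsScheduler {
        private final Map<Integer, List<Runnable>> waiters = new HashMap<>();

        void subscribe(int streamId, Runnable disklet) {
            waiters.computeIfAbsent(streamId, k -> new ArrayList<>()).add(disklet);
        }

        void onDataAvailable(int streamId) {
            for (Runnable disklet : waiters.getOrDefault(streamId, List.of())) {
                disklet.run(); // in a real system: enqueue on a run queue instead
            }
        }
    }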

6 Utility of Active Disks
– Figure 4: compares conventional and active disks for 4- and 32-disk configurations
  – Select and GroupBy are not helped much by active disks: they perform little computation per byte of data
  – Cube, Sort, Conv, and Earth show major improvements, at least in part due to the ability to distribute processing
– Figure 5: impact of variations in interconnect bandwidth (40 MB/s Ultra-SCSI, 200 MB/s Fibre Channel, 400 MB/s)
  – Cube, Sort, Conv, and Earth are compute-limited, so bandwidth doesn't matter, but active disks still help
  – Select, GroupBy, and Cube are bandwidth-limited with conventional disks but not with active disks
– Figure 7: scalability
– Figure 8: impact of variations in the central processor

7 MOCHA
– Ships Java code implementing query operators and user-defined functions
– Query plans push data-reducing operators to the data source sites while executing data-inflating operators at client sites
– Implemented in Java; runs on top of Informix and Oracle
– Data integration server: provides client applications with a uniform view of, and uniform access mechanisms to, the data at each source
– Imposes a global data model on top of the local data model used at each source
  – A database server can be configured to access a remote data source through a database gateway
  – A mediator can be used as the integration server; wrappers access and translate information from the data sources into the global model

8 More MOCHA
– User-defined, application-specific data types and query operators are contained in libraries that must be linked into clients, integration servers, gateways, or wrappers
– MOCHA targets optimized execution of:
  – Implementations of complex data types and query operators not provided by commercial systems
  – User-defined functions
– Ships code for:
  – Data-reducing operators, i.e. filters: aggregates, predicates, data-mining operators
  – Data-inflating operators: decompression

9 Applications
– Integration of sites with images, audio, text, objects, and programs
  – Invoke objects and user-defined functions
  – Efficiently execute user-defined queries
– Consider an earth-science application that manipulates distributed data
  – Assume one site per state
  – Schema: Rasters(time:Integer, band:Integer, location:Rectangle, image:Raster)
    – Stores weekly energy readings from satellites: time is the week number, band is the energy band, location is the rectangle covering the region under study, and image is the raster image
  – The Rectangle and Raster classes need to be implemented at each site
  – Ongoing local changes to these classes need to be tracked

10 Data Shipping vs. Query Shipping
– Data shipping
  – Most operators in the query are evaluated by the integration server at an integration site
  – Wrappers and gateways are used to extract data items from the sources and translate them into the middleware schema for further processing
  – Cannot assume that all sites have the same ability to process queries
– Query shipping
  – One or more query operators are evaluated at the data source and the results are sent back to the integration server
  – Processing can only be carried out by operators already implemented at the data source
– Hybrid shipping: combines data and query shipping

11 Example of a Query Involving a Data-Reducing Operator
– Select time, location, AvgEnergy(image) From Rasters Where AvgEnergy(image) < 100
– Assume 200 entries in table Rasters; each image is 1 MB; time and band are 4 bytes each; location is 16 bytes; AvgEnergy returns an 8-byte double
– Evaluate the query at the data source and, at worst, you move 200 × 28 B = 5,600 B (about 5.5 KB)
– Evaluate the query at the client and you move 200 MB
– To evaluate at the data source, AvgEnergy needs to be implemented there
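The arithmetic, spelled out with the sizes assumed above:

    // Bytes shipped for the AvgEnergy query under the two placements.
    public class ShippingCost {
        public static void main(String[] args) {
            long rows = 200;
            long timeB = 4, locationB = 16, avgB = 8; // projected columns
            long imageB = 1_000_000;                  // the 1 MB raster argument

            long queryShipping = rows * (timeB + locationB + avgB); // worst case: all rows pass
            long dataShipping  = rows * imageB;                     // ship every raster to the client

            System.out.println("query shipping: " + queryShipping + " B"); // 5600
            System.out.println("data shipping:  " + dataShipping  + " B"); // 200000000
        }
    }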

12 MOCHA Architecture Details
– Major components of MOCHA (overview, Figure 1)
  – Client application: applets, servlets, stand-alone client applications
  – Query Processing Coordinator (QPC) (Figure 2)
    – Controls the execution of all queries and commands
    – Parses and optimizes queries; monitors the execution process
    – Provides access to a repository containing function classes and metadata
    – Provides access to distributed data sites modeled as object-relational sources
    – Infrastructure to carry out SQL queries posed over distributed data sources
    – Can process queries over XML repositories
    – Procedural interface through which HTTP requests, FTP downloads, and file-system access requests can reach data sources
    – Extensible query engine based on iterators; iterators are used to carry out local selections, local joins, remote selections, distributed joins, sorting, etc. (see the sketch below)
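The iterator model behind such engines is the standard open/next/close protocol; a sketch with an assumed interface name and a selection operator built on it:

    // Hypothetical MOCHA-style iterator: every operator exposes the same three
    // calls, so query plans compose as trees of iterators.
    interface QueryIterator<T> {
        void open();  // acquire resources, open child iterators
        T next();     // produce the next tuple, or null when exhausted
        void close(); // release resources, close children
    }

    // Example operator: a local selection wrapping a child iterator.
    final class Selection<T> implements QueryIterator<T> {
        private final QueryIterator<T> child;
        private final java.util.function.Predicate<T> pred;

        Selection(QueryIterator<T> child, java.util.function.Predicate<T> pred) {
            this.child = child;
            this.pred = pred;
        }

        public void open()  { child.open(); }
        public void close() { child.close(); }

        public T next() {
            for (T t = child.next(); t != null; t = child.next()) {
                if (pred.test(t)) return t; // pass only tuples satisfying the predicate
            }
            return null;
        }
    }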

13 More Architectural Details
– Data Access Provider (DAP)
  – Uniform access mechanism to a remote data source
  – Extensible query execution engine that can load and use application-specific code obtained from the network with the help of the QPC
  – The DAP is run close to the data source
  – MOCHA pushes user code and queries down to it
  – Figure 3
– Data Server
  – Stores the data for a particular data site
  – Support for object-relational systems (Oracle 8i, Informix) and flat-file systems

14 More Architectural Details
– Catalog
  – Metadata about user-defined types, user-defined operators, the selectivity of the various operators, and views defined over the data sources
  – Views, data types, and operators are uniquely identified by a Uniform Resource Identifier (URI)
  – Encoded in the Resource Description Framework (RDF)
– Data
  – MWObject: interface that identifies a class as one implementing a MOCHA data type; specifies the methods used to read and write each data value over the network (see the sketch below)
  – The MWLargeObject and MWSmallObject interfaces partition objects into two groups: large objects and small objects
  – Figure 5
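The slides don't give MWObject's method names, so the signatures below are guesses at the contract they describe, with slide 9's Rectangle as the implementing type:

    import java.io.*;

    // Hypothetical rendering of MWObject: a MOCHA data type knows how to
    // write itself to, and read itself from, the network.
    interface MWObject {
        void writeTo(DataOutputStream out) throws IOException;
        void readFrom(DataInputStream in) throws IOException;
    }

    // The 16-byte Rectangle from the earth-science schema as a small object.
    final class Rectangle implements MWObject {
        int x1, y1, x2, y2;

        public void writeTo(DataOutputStream out) throws IOException {
            out.writeInt(x1); out.writeInt(y1); out.writeInt(x2); out.writeInt(y2);
        }

        public void readFrom(DataInputStream in) throws IOException {
            x1 = in.readInt(); y1 = in.readInt(); x2 = in.readInt(); y2 = in.readInt();
        }
    }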

15 Automatic Code Deployment
– Compiled Java classes are shipped
– When an administrator incorporates a new or updated data type, he or she
  – Stores the Java class in a well-known code repository
  – Registers the new type or operator by adding catalog entries giving the name of the type or operator, its associated URI, and other information such as version number and user privileges
– On a request from a client
  – The QPC generates a list of the data types and operators needed to process the query
  – The QPC accesses the catalog and maps each type or operator to its specific implementing class
  – The class is retrieved from the code repository by the QPC's code loader
  – The QPC distributes the pieces of the plan to be executed by each of the DAPs running on the targeted data sites
  – The QPC ships the classes to the client and the DAPs, then ships the classes for the query operators to be executed by each DAP
  – Figure 4
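The retrieval step maps naturally onto standard Java dynamic loading; a sketch of the mechanism (the repository URL and class name are placeholders, and real MOCHA internals may differ):

    import java.net.*;

    // Fetch a registered class from the code repository and instantiate it.
    public class CodeLoader {
        public static Object load(String repositoryUrl, String className) throws Exception {
            URLClassLoader loader = new URLClassLoader(new URL[] { new URL(repositoryUrl) });
            Class<?> cls = loader.loadClass(className);        // name mapped from the catalog
            return cls.getDeclaredConstructor().newInstance(); // assumes a no-arg constructor
        }
    }

    // Usage sketch: Object op = CodeLoader.load("http://repo.example/classes/", "AvgEnergy");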

16 Query Operators
– Projections and predicates (Figure 6a)
– Accumulators (Figure 6b): Reset, Update, Summarize (see the sketch below)
– Memory management
  – Object preallocation and reuse
  – The iterator creates one structure to buffer the columns read from the database and one to store the results returned by each call to Next()
– Communications
  – Java RMI was used to marshal and unmarshal objects; this was inefficient and sometimes gave incorrect results
  – The methods associated with MWObject are used instead to marshal and unmarshal
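Reset/Update/Summarize is the classic aggregate protocol; a sketch with assumed names and an average aggregate as the example:

    // Hypothetical accumulator in the style of Figure 6b.
    interface Accumulator<I, O> {
        void reset();     // clear state before a new group
        void update(I v); // fold one input value into the running state
        O summarize();    // produce the aggregate for the group
    }

    final class Average implements Accumulator<Double, Double> {
        private double sum;
        private long n;
        public void reset() { sum = 0; n = 0; }
        public void update(Double v) { sum += v; n++; }
        public Double summarize() { return n == 0 ? null : sum / n; }
    }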

17 Query Processing
– Cost-based approach: evaluation of data-reducing operators is moved to the DAPs running on the data sites; evaluation of data-inflating operators is moved to the QPC
– The execution cost of an operator X over an input relation R is approximated as
  – Cost(X) = CompCost(X) + NetworkCost(X)
  – CompCost(X): total cost of computing X over R
  – NetworkCost(X): total cost of data movement while executing X on R
    – If X is evaluated at a DAP, this is the cost of moving to the QPC the results generated after applying X to all tuples in R
    – If X is evaluated at the QPC, this is the cost of moving to the QPC each of the arguments of X in each of the tuples in R
– Volume reduction factor (VRF): total volume transmitted after applying X to R, divided by the total volume of R
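Plugging the slide-11 numbers into the network term shows why AvgEnergy is pushed to the DAP (constants are illustrative, and CompCost is ignored here to isolate the network term):

    // NetworkCost under each placement of AvgEnergy over Rasters.
    public class PlaceOperator {
        public static void main(String[] args) {
            long rows = 200;
            long resultBytes = 8;      // the double AvgEnergy returns per tuple
            long argBytes = 1_000_000; // the 1 MB image argument per tuple

            long costAtDap = rows * resultBytes; // ship only the results: 1,600 B
            long costAtQpc = rows * argBytes;    // ship every argument: 200 MB

            double vrf = (double) resultBytes / argBytes; // data-reducing: VRF << 1
            System.out.println("VRF = " + vrf);
            System.out.println(costAtDap < costAtQpc ? "evaluate at the DAP" : "evaluate at the QPC");
        }
    }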

18 Optimization Algorithm
– Cumulative volume reduction factor: CVRF(P) = CVDT / CVDA
  – CVDT: total data volume to be transmitted over the network after applying all the operators in plan P to R1, …, Rn
  – CVDA: total data volume in R1, …, Rn
  – The goal is to minimize CVRF
– The algorithm in Figure 7 is a heuristic that attempts to do this
  – Plans for single-relation expressions are selected so as to best place the complex functions
  – Complex predicates are sorted in increasing order of a metric involving their selectivity and computational cost (a sketch of such an ordering follows)
  – Once the single-table access plans are built, Figure 7a explores all the different possibilities for performing a join, incrementally building a left-deep plan in which a new relation Rj is added to the existing join plan Sj for a subset of the relations
  – After the join plan is complete, the algorithm places the complex operators
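A sketch of the quantities the heuristic works with. The slide does not spell out the ranking metric; cost/(1 − selectivity), a standard predicate-ordering rank, is used below purely as a stand-in:

    import java.util.*;

    // Illustrative CVRF bookkeeping and predicate ordering.
    public class Cvrf {
        record Pred(String name, double selectivity, double compCost) {}

        // CVRF(P) = bytes transmitted after applying the plan / bytes in the inputs.
        static double cvrf(double transmittedBytes, double inputBytes) {
            return transmittedBytes / inputBytes;
        }

        public static void main(String[] args) {
            List<Pred> preds = new ArrayList<>(List.of(
                new Pred("AvgEnergy(image) < 100", 0.5, 90.0),
                new Pred("band = 3",               0.1,  1.0)));
            // Cheap, highly selective predicates go first under the assumed rank.
            preds.sort(Comparator.comparingDouble(p -> p.compCost() / (1.0 - p.selectivity())));
            preds.forEach(p -> System.out.println(p.name()));
            System.out.println("CVRF example: " + cvrf(5_600, 200_000_000)); // from slide 11
        }
    }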

