Download presentation
Presentation is loading. Please wait.
1
1 A Suite of Web-Services for Wavelet- based Analysis of Large Atmospheric Datasets A Suite of Web-Services for Wavelet- based Analysis of Large Atmospheric Datasets Cyrus Shahabi University of Southern California Dept. of Computer Science Los Angeles, CA 90089-0781 shahabi@usc.edu http://infolab.usc.edu
2
2 Outline Motivating Applications Approach: Wavelets ProPolyne: Overview and Features ProPolyne: Details Development Status ProDA Demonstration ProDA Web-Services A Sample Client for Progressive Queries Why Web Services? (Answers to Dan’s Questions!) Conclusion Future Work
3
3 Motivation: New Multidimensional Data Intensive Applications Multidimensional data sets: (w/ dimension & measure) Remote sensory date (from JPL): Sensor readings from GPS ground stations (from NASA): Petroleum sales (from Digital-Government research center): ACOUSTIC data (from UCLA sensor-network project): Market data (from NCR): Large size, e.g., current (toy!) NASA/JPL data set: Past 10 years, sampling twice a day, at a lat-long-alt grid of 64 * 128 * 16, recording 8 bytes of temperature & 16 bytes of dimensions This is 6 MB of data per day; a total of 21 GB for 10 years Increase: twice an hour sampling, 1024 * 4096 * 128 grid, …
4
4 Motivation: Multidimensional Applications I/O and computationally complex queries Range-aggregate queries (w/ aggregate function) Average temperature, given an area and time interval Average velocity of upward movement of the station Total petroleum sales volume of a given product in a given location and year Number of jackets sold in Seattle in Sep. 2003 Tougher queries: Covariance of temperature and altitude (correlation) Variance of sale of petroleum in 2004 in CA Quick response-time (interactive): the results can be approximate and/or progressively become exact
5
5 Specific Example by Brian and Amy
6
6 Recap! Multidimensional data Large data Aggregate queries Approximate answers Progressive answers Multi-resolution compression Wavelets!
7
7 Approach: Enabling Data Manipulation, Query & Analysis in the WAVELET Domain Everybody else’s idea: let’s compress data Reason: save space? No not really! Implicit reason: queries deal with smaller data sets and hence faster (not always true!) More problems: not only query results can never be 100% accurate anymore, but also different queries can have very different error rates given their areas of interest Why? At the data population time, we don’t know which coefficients are more/less important to our queries! ( also observed by [Garofalakis & Gibbons, SIGMOD’02], but they proposed other ways to drop coefficients assuming a uniform workload) Different than the signal-processing objective to reconstruct the entire signal as good as possible
8
8 Approach: Enabling Data Manipulation, Query & Analysis in the WAVELET Domain Our idea/distinction: storage is cheap and queries are ad-hoc; let’s keep all the wavelet coefficients! (no data compression) Opportunity: At the query time, however, we have the knowledge of what is important to the pending query Query ProPolyne: Progressive Evaluation of Polynomial Range-Aggregate Query
9
9 Define range-sum query as dot product of query vector and data vector (also observed by [Gilbert et. al, VLDB’2001]) Offline: Multidimensional wavelet transform of data At the query time: “lazy” wavelet transform of query vector (very fast) Dot product of query and data vectors in the transformed domain exact result Choose high-energy query coefficients only fast approximate result (90% accuracy by retrieving < 10% of data) Choose query coefficients in order of energy progressive result Overview of ProPolyne
10
10 ProPolyne Unique Features “Measure” can be any polynomial on any combination of attributes Can support COUNT, SUM, AVERAGE Also supports Covariance, Kurtosis, etc. All using one set of pre-computed aggregates Independent from how well the data set can be compressed/approximated by wavelets Because: We show “range-sum queries” can always be approximated well by wavelets (not always HAAR though!) Potentially low update cost Can be used for exact, approximate and progressive range-sum query evaluation
11
11 Outline Motivating Applications ProPolyne: Overview and Features ProPolyne: Details Polynomial Range-Sum Queries as Vector Queries Evaluation of Vector Queries Progressive/Approximate Evaluation of Vector Queries Development Status Conclusion Future Work
12
12 Polynomial Range-Sum Queries Polynomial range-sum queries: Q(R,f,I) I is a finite instance of schema F R SubSetOf Dom( F ), is the range f : Dom( F ) R is a polynomial of degree Example: F = (Age, Salary) R : (25 < age < 40) & (55k < salary < 150k) Age Salary 25$50k 28$55k 30$58k 50$100k 55$130k 57 $120k I
13
13 Polynomial Range-Sum Queries as “Vector Queries” The data frequency distribution of I is the function I : Dom( F ) Z that maps a point x to the number of times it occurs in I To emphasize the fact that a query is an operator on the data frequency distribution, we write Example: (25,50)= (28,55)=…= (57,120)=1 and (x)=0 otherwise. Age Salary 25$50k 28$55k 30$58k 50$100k 55$130k 57 $120k I where: if Hence: Or: Vector Query querydata
14
14 Wavelet Transformation of Data and Query Vectors DWT of a aka wavelet coefficients of a Hence, vector queries can be computed in the wavelet- transformed space as: Transform the query vector at submission: O (N d ) ? Nop! Not with our “lazy” wavelet transform for range-aggregate queries
15
15 Fast Evaluation of Vector Queries Using Wavelets … Technical Requirements: Wavelets should have small support (i.e., the shorter the filter, the better) Wavelets must satisfy a “moment condition” Supports any Polynomial Range-Sum up to a degree determined by the choice of wavelets E.g., Haar can only support degree 0 (e.g., COUNT), while db4 can support up to degree 1 (e.g., SUM), and db6 for degree 2 (e.g., VARIANCE) Standard DWT: (N) Our lazy wavelet transform: (l log N), where l is the length of the filter
16
16 Exact Evaluation of Vector Queries Query: SUM(salary) when (25 < age < 40) & (55k < salary < 150k) # of Wavelet Coefficients: 837# of Nonzero Coordinates: 4380
17
17 Progressive Evaluation of Vector Queries
18
18 Outline Motivating Applications Approach: Wavelets ProPolyne: Overview and Features ProPolyne: Details Development Status ProDA Demonstration ProDA Web-Services A Sample Client for Progressive Queries Why Web Services? (Answers to Dan’s Questions!) Conclusion Future Work
19
19 ProDA Demonstration
20
20 3-Tier System Architecture DBMS Oracle 9.2.0.1.0 Web Services.Net C# Client (Query in C#) Standard SOAP (XML / HTTP) ODP.NET
21
21 Database Schema Cube NameDescription# of DimFilter Dim #TitleDim Size IndexDim Value IndexWavelet Coefficient Cube Directory Cube Header Dimension Values Wavelet Coefficients Cube … …
22
22 ProDA Web Services General For wavelet transformation Directory To list the cubes’ metadata Cube To select a cube and access its content Query To submit a Range-Aggregate query and ask for result: exact, approximate or progressive http://mahour.usc.edu/proda/webservices.asmx
23
23 General Web Services double[ ] GetFilterH (int filter) To return high-pass (differences) wavelet filter coefficients double[ ] GetFilterL (int filter) To return low-pass (averages) wavelet filter coefficients string GetFilterName (int filter) To return wavelet filter name float[ ] DWT1 (float[ ] a, int n, int filter) To wavelet transform an array float[ ] DWTn (float[ ] a, int[ ] nn, int ndim, int filter) To wavelet transform a multidimensional array
24
24 Directory Web Services string[ ] AllCubeNames ( ) To list the names of all available cubes string[ ] AllCubeTitles ( ) To list the descriptions of all available cubes int[ ] AllCubeDims ( ) To list the dimension numbers of all cubes
25
25 Cube Web Services void SelectDB (string dbn) To select a cube void CloseConnection ( ) To end the connection int GetFilterIndex ( ) To return the Wavelet Filter used to transform the cube int GetDimSize (int idim) To return the dimension size for dimension #idim float GetDimValue (int idim, int index) To return the dimension value for dimension #idim at index position float[ ] GetDimValues (int idim) To return the dimension values for dimension #idim int GetDimMax (int idim) To return the maximum value for dimension #idim int GetDimMin (int idim) To return minimum value for dimension #idim int DimN ( ) To return the number of dimensions string GetDimTitle (int idim) To return the dimension title for dimension #idim
26
26 Query Web Services (1) Aggregate Functions void Cnt ( ) To submit count query void Sum (int idim) To submit sum query void Avg (int idim) Average query void Var (int idim) Variance query void Std (int idim) Standard deviation void Cov (int idim, int jdim) Covariance query void Cor (int idim, int jdim) Correlation query void PushTerm (float coef, int[ ] pow) To add polynomial query void PushOperator (char op) To add unary or binary operator for polynomials
27
27 A polynomial example Let us re-write Var(x) in post order: Consider Var(x) We can generate this query by using PushTerm and PushOperator calls: PushTerm(1,{2}); PushTerm(1,{1}); PushOperator(‘^’); PushOperator(‘-’); PushTerm(1,{0}); PushOperator(‘/’);
28
28 Query Web Services (2) Cursor Functions long GetRangeSize ( ) To return range size long GetRetrieveSize ( ) To return number of required retrieves void Submit ( ) To submit added polynomial query float Advance (int step) To advance query progress bool HasMore ( ) To check query progress long GetCounter ( ) To return number of retrieves float GetPercentage ( ) To return percentage of query progress float GetResult ( ) To return the latest result void SetRange (int[] startRange, int[] endRange) To specify a range (in multidimension)
29
29 Select a cube and Get some properties pws.SelectDB("…"); int n=pws.GetDimN(); for(int i=0;i<n;i++) Console.WriteLine("Dim"+i+"("+pws.GetDimTitle(i)+")="+pws.GetDimSize(i)); List Datacubes string []dbnames=pws.AllCubeNames(); for(int i=0;i<dbnames.Length;i++) Console.WriteLine(dbnames[i]); A Sample Client for Progressive Query (in C#) Initialization ProDA_WebServices pws=new ProDA_WebServices(); pws.Credentials = System.Net.CredentialCache.DefaultCredentials; pws.CookieContainer = new CookieContainer(); Select Ranges and Submit Query int []rStart={…}; int []rEnd ={…}; pws.SetRange(rStart,rEnd); pws.Var(1); Progressive Results while(pws.HasMore()) Console.Write("Result="+pws.Advance(500)+"["+pws.GetPercentage()+"]\t\r"); pws.CloseConnection(); Creating an instance Storing session state Listing all cube names Asking for result progressively Closing connection Selecting a cube Asking for some properties Defining a range Submitting a query
30
30 Why Web Services? How the technology or arch principals used in the projects could be used by others - even in different domains? Wavelet-based query processing Cursor functions for approximate and progressive query evaluation Have webservices changed the way we publish data, information, or expose processes/applications? Exposure Web services make ProDA available to anyone, anytime, anywhere (e.g., Web service URL’s in NSF reports to make our research available) UIs are redundant Web services are the user interface (spending time on things we know how to do! GENESIS vs. GENESIS-II) Tailor-suited functionality We design the building blocks that scientists put together and customize according to their needs (don’t know what are their queries! And don’t want to limit them.)
31
31 Why Web Services? How has the use of DB and Web Services changed the way we do research? Web services constitute a data-access layer on top of a DB With web services, ProDA seamlessly extends basic DBMS functionality Publishing DB techniques for testing purposes is simplified Is there technology/software that others could utilize? ProDA web services
32
32 Conclusion A novel pre-aggregation strategy Supports conventional aggregates: COUNT, SUM and beyond: multivariate statistics First pre-aggregation technique that does not require measures be specified a priori Measures treated as functions of the attributes at the time Provides a data independent progressive and approximate query answering technique With provably poly-logarithmic worst-case query and update costs And storage cost comparable or better than other pre-aggregation methods
33
33 Future Work Almost Complete Batch queries Efficient layout on disk In Progress …. I/O efficient wavelet transformation and update Hybrid ordering of data and query coefficients Error forecasting Longer Term Best basis per dimension ProPolyne on relation algebra operators
34
34 THANKS! (visit http://infolab.usc.edu) Acknowledgements Collaborators: Brian Wilson and Amy Braverman (JPL) Students: R. Schmidt, M. Jahangiri & D. Sacharidis Sponsors: NASA/JPL: ESIP & REASoN programs NSF: ERC, ITR & CAREER programs Industry: Microsoft and Chevronhttp://infolab.usc.edu
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.