Published by Marian Carr. Modified over 5 years ago.
1
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
Carlos Ordonez, Yiqun Zhang
University of Houston, USA
2
Parallel architecture
Shared-nothing, message-passing
N nodes
Data partitioned before computation
Examples: parallel DBMSs, HDFS, MapReduce, Spark
3
4
Old: separate sufficient statistics
5
New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
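The block structure behind this slide title can be written out as follows. This is a sketch under stated assumptions: X is taken to be d × n with the points x_i as columns, Y is 1 × n, and Γ is defined as Z Zᵀ. The names n, L, and Q follow common sufficient-statistics notation and are not spelled out on the slide itself.

```latex
\[
Z = \begin{bmatrix} \mathbf{1}^T \\ X \\ Y \end{bmatrix},
\qquad
\Gamma = Z Z^T =
\begin{bmatrix}
n & L^T & \mathbf{1}^T Y^T \\
L & Q & X Y^T \\
Y \mathbf{1} & Y X^T & Y Y^T
\end{bmatrix},
\]
\[
n = \mathbf{1}^T \mathbf{1}, \qquad
L = X \mathbf{1} = \sum_{i=1}^{n} x_i, \qquad
Q = X X^T = \sum_{i=1}^{n} x_i x_i^T .
\]
```

Under these assumptions the separate statistics n, L, and Q of the previous slide appear as sub-blocks of a single (d+2) × (d+2) matrix.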
6
Linear Algebra: Our main result for parallel and scalable computation
7
2-phase algorithm
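A minimal Python sketch of the two phases, assuming each point arrives as a pair (x, y) and that Gamma is accumulated as a sum of outer products z zᵀ with z = [1, x, y]. The function names `gamma` and `mean_and_variance` are hypothetical and chosen only for illustration; the slides do not show the authors' implementation.

```python
def gamma(points):
    """Phase 1 (summarization): one pass over the data.

    Accumulates Gamma = sum of z * z^T, where z = [1, x, y].
    """
    g = None
    for x, y in points:
        z = [1.0] + list(x) + [y]
        if g is None:
            g = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                g[a][b] += za * zb
    return g


def mean_and_variance(g, d):
    """Phase 2 (model parameters): uses Gamma only, never the raw data."""
    n = g[0][0]                                   # count
    mean = [g[j][0] / n for j in range(1, d + 1)]  # L / n
    var = [g[j][j] / n - mean[j - 1] ** 2          # diag(Q) / n - mean^2
           for j in range(1, d + 1)]
    return mean, var


# Toy data set (made up for illustration): two points with d = 2.
data = [([1.0, 2.0], 3.0), ([3.0, 6.0], 5.0)]
G = gamma(data)
mean, var = mean_and_variance(G, d=2)
```

Phase 2 touches only the (d+2) × (d+2) matrix, so its cost does not depend on n.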
8
Equivalent equations with projections from Gamma (descriptive, predictive)
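As an illustration of such projections, the following standard identities express a descriptive and a predictive model in terms of sub-blocks of Γ. The notation is an assumption, not taken from the slide: n, L = Σ x_i, and Q = Σ x_i x_iᵀ are the blocks of Γ = Z Zᵀ with Z = [1, X, Y], and X̄ denotes X with a row of ones prepended.

```latex
% Descriptive: mean vector and (population) covariance matrix
\[
\mu = \frac{L}{n}, \qquad
V = \frac{Q}{n} - \frac{L L^T}{n^2}
\]
% Predictive: least-squares linear regression
\[
\bar{X} = \begin{bmatrix} \mathbf{1}^T \\ X \end{bmatrix}, \qquad
\hat{\beta} = \left( \bar{X} \bar{X}^T \right)^{-1} \bar{X} Y^T,
\qquad
\bar{X} \bar{X}^T = \begin{bmatrix} n & L^T \\ L & Q \end{bmatrix}, \quad
\bar{X} Y^T = \begin{bmatrix} \mathbf{1}^T Y^T \\ X Y^T \end{bmatrix}
\]
```

Every matrix on the right-hand sides is a sub-block of Γ, so neither model needs a second pass over the data.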
9
Fundamental properties: non-commutative but distributive
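One way to read the distributive property is that partial Gamma matrices computed independently on each partition add up to the Gamma of the whole data set. The sketch below checks this on a toy example; the helper names `partial_gamma` and `add` are hypothetical, and the rows of `Z` are assumed to be already augmented vectors z = [1, x, y].

```python
def partial_gamma(rows):
    """Gamma of one partition: sum of outer products z * z^T."""
    size = len(rows[0])
    g = [[0.0] * size for _ in range(size)]
    for z in rows:
        for a in range(size):
            for b in range(size):
                g[a][b] += z[a] * z[b]
    return g


def add(g1, g2):
    """Entry-wise sum of two partial Gamma matrices."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


# Four augmented points z = [1, x, y] (made-up values), split over N = 2 nodes.
Z = [[1.0, 1.0, 2.0], [1.0, 3.0, 4.0], [1.0, 5.0, 6.0], [1.0, 7.0, 8.0]]
partitions = [Z[:2], Z[2:]]

# Each node summarizes its own partition; partial results are then added.
merged = partial_gamma(partitions[0])
for part in partitions[1:]:
    merged = add(merged, partial_gamma(part))

whole = partial_gamma(Z)
```

Only the small partial matrices need to be exchanged between nodes, never the partitions themselves.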
10
Parallel Theoretical Guarantees of
11
Dense matrix algorithm: O(d^2 n)
12
Sparse matrix algorithm: O(d n) for hyper-sparse matrix
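A rough sketch of the sparse idea, assuming each augmented point z is stored as a dictionary from dimension index to non-zero value (index 0 holds the constant 1). Only pairs of non-zero entries are multiplied, so the work per point depends on its number of non-zeros rather than on d. The name `sparse_gamma` and the dictionary layout are illustrative assumptions, not the authors' data structure.

```python
from collections import defaultdict


def sparse_gamma(points):
    """Accumulate Gamma over sparse points, touching only non-zero entries."""
    g = defaultdict(float)
    for z in points:
        items = list(z.items())
        for a, va in items:
            for b, vb in items:
                g[(a, b)] += va * vb
    return dict(g)


# Three hyper-sparse points (made-up values): one non-zero besides the constant.
points = [{0: 1.0, 3: 2.0}, {0: 1.0, 1: 4.0}, {0: 1.0, 3: 1.0}]
G = sparse_gamma(points)
```

Entries of Gamma that are never touched, such as the pair (1, 3) here, are simply absent from the result and are read as zero.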
13
Pros: Algorithm evaluation with physical array operators
Since x_i fits in one chunk, joins are avoided (at least 2X I/O with hash or merge join)
Since x_i x_i^T can be computed in RAM, we avoid an aggregation (no sorting of points by i)
No need to store X twice (X and X^T): half the I/O, half the RAM space
No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments
C++ compiled code: fast; vector accessed once; direct assignment (bypassing C++ calls)
14
Running on the cloud
15
Running in the cloud, 100 nodes
16
Conclusions
One-pass summarization matrix operator: parallel, scalable
Algorithm compatible with any parallel shared-nothing system, but better for array systems
Optimization of outer matrix multiplication as a sum (aggregation) of vector outer products
Algorithm: dense and sparse matrix versions required
Gamma matrix must fit in RAM, but n is unlimited
ML methods reduce to two phases: 1: summarization, 2: computing model parameters
Summarization matrix can be exploited in many intermediate computations
17
Future work: Theory
Study Gamma in other models like logistic regression, clustering, factor analysis, HMMs, Kalman filters
Clustering model: for frequent itemsets
Higher-order expected moments, covariates
Numeric stability with unnormalized sorted data: unlikely