Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
Mehmet Koyuturk (1), Ananth Grama (1), and Naren Ramakrishnan (2)
1. Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu
2. Dept. of Computer Sciences, Virginia Tech, naren@cs.vt.edu
Motivation
- Handling large discrete-valued datasets
- Extracting relations between data items
- Summarizing data in an error-bounded fashion
- Clustering of data items
- Finding concise representations for clustered data
Background
Singular Value Decomposition (SVD) [Berry et al., 1995]
- Decompose the matrix as $A = U S V^T$
- $U$ and $V$ are orthogonal matrices; $S$ is diagonal and holds the singular values
- Used for Latent Semantic Indexing (LSI) in information retrieval
- Truncate the decomposition to compress the data
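To make the "truncate to compress" step concrete, here is a minimal sketch using NumPy's `numpy.linalg.svd`. The helper name `truncated_svd` and the small example matrix are illustrative assumptions, not taken from the slides.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the leading k singular triplets (U_k, s_k, Vt_k) of A."""
    # full_matrices=False gives the compact ("economy") factorization.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def reconstruct(U_k, s_k, Vt_k):
    """Rebuild the rank-k approximation U_k diag(s_k) Vt_k."""
    return U_k @ np.diag(s_k) @ Vt_k


# Hypothetical 5x4 binary matrix with two overlapping row patterns (rank 2).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)

A_1 = reconstruct(*truncated_svd(A, 1))   # lossy rank-1 compression
A_2 = reconstruct(*truncated_svd(A, 2))   # rank 2 already captures A
err_1 = np.linalg.norm(A - A_1)           # Frobenius norm of the error
err_2 = np.linalg.norm(A - A_2)
```

Note that the truncated factors are dense real-valued matrices even when A is binary, which is part of what motivates the discrete decompositions discussed next.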
Background
Semi-Discrete Decomposition (SDD) [Kolda and O’Leary, 1998]
- Restrict the entries of U and V to {-1, 0, 1}
- Requires a very small amount of storage
- Can perform as well as SVD in LSI using less than one-tenth the storage
- Effective in finding outlier clusters
- Works well for datasets containing a large number of small clusters
Rank-1 Approximation
- x: presence vector
- y: pattern vector
Rank-1 Approximation
- Problem: given a discrete matrix $A_{m \times n}$, find discrete vectors $x_{m \times 1}$ and $y_{n \times 1}$ to minimize $\|A - x y^T\|_F^2$ = number of non-zeros in the error matrix
- Heuristic: fix $y$, set $s = A y$, and solve for $x$ to maximize $2 x^T s - \|x\|_2^2 \|y\|_2^2$
- Iteratively solve for $x$ and $y$ until no improvement is possible
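The alternating heuristic can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: A is taken to be binary, each row is stored as a set of its non-zero column indices, and the quantity being maximized is 2 x^T A y - |x| |y|, which equals nz(A) minus the number of non-zeros in A - x y^T. All function names are hypothetical.

```python
def solve_x(rows, y):
    """Fix y: select row i when it contains at least half of y's non-zeros."""
    return [1 if 2 * len(r & y) >= len(y) else 0 for r in rows]


def solve_y(rows, n, x):
    """Fix x: select column j when at least half of the selected rows contain it."""
    counts = [0] * n
    for xi, r in zip(x, rows):
        if xi:
            for j in r:
                counts[j] += 1
    nx = sum(x)
    return {j for j in range(n) if nx and 2 * counts[j] >= nx}


def score(rows, x, y):
    """Objective 2 x^T A y - |x| |y| for binary x and y."""
    overlap = sum(len(r & y) for xi, r in zip(x, rows) if xi)
    return 2 * overlap - sum(x) * len(y)


def errors(rows, x, y):
    """Number of non-zeros in the error matrix A - x y^T."""
    return sum(len(r ^ y) if xi else len(r) for xi, r in zip(x, rows))


def rank_one(rows, n, y0, max_iters=100):
    """Alternate between x and y until the objective stops improving."""
    y = set(y0)
    x = solve_x(rows, y)
    best = score(rows, x, y)
    for _ in range(max_iters):
        new_y = solve_y(rows, n, x)
        new_x = solve_x(rows, new_y)
        new_score = score(rows, new_x, new_y)
        if new_score <= best:   # no improvement: keep the previous pair
            break
        x, y, best = new_x, new_y, new_score
    return x, y


# Hypothetical 5x5 binary matrix, one set of column indices per row.
rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
x, y = rank_one(rows, n=5, y0={0})
n_errors = errors(rows, x, y)
```

Each pass touches every stored non-zero a constant number of times, so one iteration costs time proportional to nz(A) plus the number of rows and columns.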
Initialization of pattern vector
- Crucial to escape from local optima
- Must require at most O(nz(A)) time, so as not to dominate the run time of the approximation itself
- Some possible schemes:
  - AllOnes: set all entries to 1. Poor.
  - Threshold: set only the entries whose columns have more non-zeros than a threshold. Can lead to bad local optima.
  - Maximum: set only the entry for the column with the maximum number of non-zeros. Risky, since that column may be shared by many patterns.
  - Partition: partition the rows of the matrix based on a column, then apply the threshold scheme to only one of the parts. Best among these.
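The four schemes can be sketched as below, again storing each row of a binary matrix as a set of column indices. The function names, the choice of partitioning column, and the threshold values are illustrative assumptions; the slides do not specify how they are chosen.

```python
def column_counts(rows, n):
    """Number of non-zeros in each column, computed in O(nz(A) + n) time."""
    counts = [0] * n
    for r in rows:
        for j in r:
            counts[j] += 1
    return counts


def all_ones_init(n):
    """AllOnes: every entry of the pattern vector is set."""
    return set(range(n))


def threshold_init(rows, n, threshold):
    """Threshold: set entries whose columns have more non-zeros than threshold."""
    counts = column_counts(rows, n)
    return {j for j in range(n) if counts[j] > threshold}


def maximum_init(rows, n):
    """Maximum: set only the entry of the column with the most non-zeros."""
    counts = column_counts(rows, n)
    return {counts.index(max(counts))}


def partition_init(rows, n, col, threshold):
    """Partition: keep only the rows containing `col`, then apply Threshold.

    Which part to keep (rows with or without `col`) is an assumption here.
    """
    part = [r for r in rows if col in r]
    return threshold_init(part, n, threshold)


# Hypothetical example matrix: column counts are [3, 3, 2, 2, 2].
rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
y_all = all_ones_init(5)
y_thr = threshold_init(rows, 5, threshold=2)
y_max = maximum_init(rows, 5)
y_par = partition_init(rows, 5, col=3, threshold=1)
```

In this toy example the Partition scheme isolates the smaller {3, 4} pattern, which the Threshold and Maximum schemes miss because columns 0 and 1 dominate the column counts.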
Recursive Algorithm
- At any step, given the rank-one approximation $A \approx x y^T$, split $A$ into $A_1$ and $A_0$ based on rows:
  - if x(i) = 0, row i goes to $A_0$
  - if x(i) = 1, row i goes to $A_1$
- Stop when:
  - the Hamming radius of $A_1$, i.e. the maximum of the Hamming distances between the rows of $A_1$ and the pattern vector, is less than some threshold
  - all rows of $A$ are present in $A_1$ (if $A_1$ does not satisfy the Hamming radius condition, $A_1$ can be split based on Hamming distances)
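A compact sketch of the recursion, under several assumptions that are not in the slides: rows of a binary matrix are stored as sets of column indices, the Maximum scheme is used for initialization, the radius test uses "at most `eps`", and when every row lands in A1 but the radius is too large, rows are split into those within `eps` of the pattern and the rest (falling back to halving the list if that split is empty). All names are hypothetical.

```python
def solve_x(rows, y):
    return [1 if 2 * len(r & y) >= len(y) else 0 for r in rows]


def solve_y(rows, n, x):
    counts = [0] * n
    for xi, r in zip(x, rows):
        if xi:
            for j in r:
                counts[j] += 1
    nx = sum(x)
    return {j for j in range(n) if nx and 2 * counts[j] >= nx}


def score(rows, x, y):
    return 2 * sum(len(r & y) for xi, r in zip(x, rows) if xi) - sum(x) * len(y)


def rank_one(rows, n, y0, max_iters=100):
    x, y = solve_x(rows, y0), set(y0)
    best = score(rows, x, y)
    for _ in range(max_iters):
        new_y = solve_y(rows, n, x)
        new_x = solve_x(rows, new_y)
        new_score = score(rows, new_x, new_y)
        if new_score <= best:
            break
        x, y, best = new_x, new_y, new_score
    return x, y


def decompose(rows, ids, n, eps, out):
    """Append (pattern, row ids) pairs to `out`; assumes rows is non-empty."""
    counts = [0] * n
    for r in rows:
        for j in r:
            counts[j] += 1
    if max(counts) == 0:                      # only all-zero rows remain
        out.append((set(), list(ids)))
        return
    x, y = rank_one(rows, n, {counts.index(max(counts))})
    ones = [k for k, xi in enumerate(x) if xi]        # rows of A1
    zeros = [k for k, xi in enumerate(x) if not xi]   # rows of A0
    dist = {k: len(rows[k] ^ y) for k in ones}        # Hamming distances to y
    radius = max(dist.values())

    def recurse(part):
        decompose([rows[k] for k in part], [ids[k] for k in part], n, eps, out)

    if zeros:
        recurse(zeros)                        # A0 is always processed further
    if radius <= eps or len(rows) == 1:
        out.append((y, [ids[k] for k in ones]))   # A1 is well represented by y
    elif zeros:
        recurse(ones)                         # A1 is strictly smaller than A
    else:
        near = {k for k in ones if dist[k] <= eps}
        if not near:                          # fallback so both parts are non-empty
            near = set(ones[: len(ones) // 2])
        recurse([k for k in ones if k in near])
        recurse([k for k in ones if k not in near])


def cluster(rows, n, eps):
    out = []
    decompose(rows, list(range(len(rows))), n, eps, out)
    return out


rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3, 4}, {3, 4}]
loose = cluster(rows, n=5, eps=1)   # tolerate one mismatch per row
tight = cluster(rows, n=5, eps=0)   # require exact matches
```

Every recursive call receives a strictly smaller, non-empty set of rows, so the sketch terminates; lowering `eps` yields more, finer patterns, which reflects the multi-resolution behavior of the hierarchy.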
Effectiveness of Analysis
Run-time Scalability
- Rank-1 approximation requires O(nz(A)) time
- Total run time at each level of the recursion tree cannot exceed this, since the total number of non-zeros at each level is at most nz(A)
- Run time is linear in nz(A)
[Plots: run time vs. number of columns, run time vs. number of rows, run time vs. number of non-zeros]
Conclusions and Ongoing Work
- The proposed algorithm is:
  - Scalable to extremely high dimensions
  - Effective in discovering dominant patterns
  - Hierarchical in nature, allowing multi-resolution analysis
- Currently working on:
  - Real-world applications of the proposed method
  - Effective initialization schemes
  - Parallel implementation