Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
Mehmet Koyuturk1, Ananth Grama1, and Naren Ramakrishnan2 Dept. of Computer Sciences, Purdue University {koyuturk, 2. Dept. of Computer Sciences, Virginia Tech

Motivation Handling large discrete-valued datasets
Extracting relations between data items Summarizing data in an error-bounded fashion Clustering of data items Finding coinsize representations for clustered data

Background Singular Value Decomposition (SVD) [Berry et.al., 1995]
Decompose matrix into A=USVT U and V orthogonal matrices, S diagonal with singular values Used for Latent Semantic Indexing in Information Retrieval Truncate decomposition to compress data

Background Semi-Discrete Decomposition (SDD) [Kolda and O’Leary, 1998]
Restrict entries of U and V to {-1,0,1} Requires very small amount of storage Can perform as well as SVD in LSI using less than one-tenth the storage Effective in finding outlier clusters works well for datasets containing a large number of small clusters

Rank-1 Approximation x : presence vector y : pattern vector

Rank-1 Approximation Problem: Given discrete matrix Amxn , find discrete vectors xmx1 and ynx1 to Minimize = number of non-zeros in the error matrix Heuristic: Fix y, set solve for x to Maximize Iteratively solve for x and y until no improvement possible

Initialization of pattern vector
Crucial to escape from local optima Must require at most (nz(A)) time, not to Some possible schemes AllOnes: Set all entries to 1, poor. Threshold: Set only the entries that have corresponding columns with # of non-zeros more than a threshold. Can lead to bad local optima. Maximum: Set only the entry that corresponds to the column with max. # of non-zeros. Risky, that column may be shared by lots of patterns. Partition: Partition the rows of matrix based on a column, than apply threshold scheme taking into account only one of the parts. Best among these.

Recursive Algorithm - At any step, given rank-one approximation AxyT, split A to A1 and A0 based on rows : - if x(i)=0 row i goes to A0 - if x(i)=1 row i goes to A1 - Stop when hamming radius of A1, maximum of the hamming distances of A1pattern vector, is less then some threshold all rows of A are present in A1 (if A1does not satisfy hamming radius condition, can split A1 based on hamming distances)

Recursive Algorithm

Effectiveness of Analysis

Run-time Scalability Rank-1 approximation requires O(nz(A)) time
Total run-time at each level in the recursive tree cannot exceed this since total # of nonzeros at each level is at most nz(A)  Run-time is linear in nz(A) runtime vs # columns runtime vs # rows runtime vs # nonzeros

Conclusions and Ongoing Work
Proposed algorithm is Scalable to exteremely high-dimensions Effective in discovering dominant patterns Hierarchical in nature, allowing multi-resolution analysis Currently working on Real-world applications of proposed method Effective initialization schemes Parallel implementation

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Similar presentations

Presentation on theme: "Algebraic Techniques for Analysis of Large Discrete-Valued Datasets"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Similar presentations

Presentation on theme: "Algebraic Techniques for Analysis of Large Discrete-Valued Datasets"— Presentation transcript:

Similar presentations

About project

Feedback