Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
Mehmet Koyuturk [1], Ananth Grama [1], and Naren Ramakrishnan [2]
[1] Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu
[2] Dept. of Computer Sciences, Virginia Tech, naren@cs.vt.edu

Motivation
- Handling large discrete-valued datasets
- Extracting relations between data items
- Summarizing data in an error-bounded fashion
- Clustering of data items
- Finding concise representations for clustered data

Background: Singular Value Decomposition (SVD) [Berry et al., 1995]
- Decompose the matrix as A = U S V^T
- U and V are orthogonal matrices; S is diagonal, with the singular values on its diagonal
- Used for Latent Semantic Indexing (LSI) in information retrieval
- Truncate the decomposition, keeping only the largest singular values, to compress data
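
For concreteness (this example is ours, not part of the original slides), a rank-k truncation of the SVD takes a few lines of NumPy; the matrix A and the rank k below are arbitrary illustrations:

    import numpy as np

    # A small example term-document matrix (arbitrary illustration data).
    A = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
    k = 2  # truncation rank

    # Full SVD: A = U @ diag(s) @ Vt, with orthogonal U, Vt and singular values s.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep the k largest singular values/vectors: the best rank-k approximation
    # of A in the Frobenius norm (Eckart-Young).
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.linalg.norm(A - A_k))  # approximation error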

Background: Semi-Discrete Decomposition (SDD) [Kolda and O'Leary, 1998]
- Restricts the entries of U and V to {-1, 0, 1}
- Requires a very small amount of storage
- Can perform as well as SVD in LSI using less than one-tenth the storage
- Effective in finding outlier clusters; works well for datasets containing a large number of small clusters
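
In each half-iteration of the SDD heuristic one factor is held fixed, and the best discrete vector on the other side can be found exactly by sorting. The sketch below is our dense-NumPy illustration of that subproblem (the function name is ours; the paper's algorithm wraps this in an alternating loop over x, y, and a scalar weight d):

    import numpy as np

    def sdd_solve_x(R, y):
        """Given residual R and fixed y in {-1,0,1}^n, find x in {-1,0,1}^m
        maximizing (x^T R y)^2 / (nnz(x) * nnz(y))."""
        s = R @ y
        order = np.argsort(-np.abs(s))         # rows by decreasing |s_i|
        prefix = np.cumsum(np.abs(s[order]))   # best x^T s using J non-zeros
        J = np.argmax(prefix ** 2 / np.arange(1, len(s) + 1)) + 1
        x = np.zeros_like(s)
        x[order[:J]] = np.sign(s[order[:J]])   # signs on the J largest entries
        return x

Given x and y, the optimal scalar weight is d = x^T R y / (nnz(x) * nnz(y)).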

Rank-1 Approximation
Approximate the matrix by a single outer product, A ≈ x y^T, where
- x : presence vector (which rows of A contain the pattern)
- y : pattern vector (which columns constitute the pattern)

Rank-1 Approximation
Problem: Given a discrete m x n matrix A, find discrete vectors x (m x 1) and y (n x 1) to minimize ||A - x y^T||_F^2, the number of non-zeros in the error matrix.
Heuristic: Fix y and solve for x to maximize 2 x^T A y - |x||y| (where |.| counts non-zeros), which is equivalent to minimizing the error; then fix x and solve for y. Iterate until no improvement is possible.
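
A minimal sketch of the alternating heuristic, assuming A, x, and y are 0/1 NumPy arrays (the function names are ours). For fixed y, the exact update is x_i = 1 precisely when 2 (A y)_i > |y|: including row i then removes more errors than it introduces. By symmetry, the same rule applied to A^T updates y:

    import numpy as np

    def solve_presence(A, y):
        # x_i = 1 exactly when row i matches the pattern on more than
        # half of y's non-zero positions: 2 * (A y)_i > |y|.
        return (2 * (A @ y) > y.sum()).astype(int)

    def rank_one(A, y0, max_iter=100):
        """Alternating heuristic for minimizing nnz(A - x y^T) over 0/1 x, y."""
        y = y0.copy()
        for _ in range(max_iter):
            x = solve_presence(A, y)        # fix y, solve for x (rows)
            y_new = solve_presence(A.T, x)  # fix x, solve for y (columns)
            if np.array_equal(y_new, y):    # converged: no further improvement
                break
            y = y_new
        return x, y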

Initialization of the Pattern Vector
- Crucial for escaping local optima
- Must take at most O(nz(A)) time, so that initialization does not dominate the cost of the approximation itself
Some possible schemes (a sketch of two of them follows):
- AllOnes: set all entries to 1. Poor.
- Threshold: set only the entries whose columns have more non-zeros than a threshold. Can lead to bad local optima.
- Maximum: set only the entry corresponding to the column with the maximum number of non-zeros. Risky, since that column may be shared by many patterns.
- Partition: partition the rows of the matrix based on a column, then apply the threshold scheme taking only one of the parts into account. Best among these.
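
Here is our reading of the Threshold and Partition schemes as NumPy sketches; using the densest column as the partitioning pivot is an assumption, since the slide does not say how that column is picked:

    import numpy as np

    def init_threshold(A, t):
        """Threshold scheme: y_j = 1 for columns with more than t non-zeros."""
        return (A.sum(axis=0) > t).astype(int)

    def init_partition(A, t):
        """Partition scheme (as we read it): partition the rows on a pivot
        column, then apply the threshold scheme to the selected part only.
        Choosing the densest column as pivot is our assumption."""
        pivot = np.argmax(A.sum(axis=0))
        part = A[A[:, pivot] == 1]   # rows having a 1 in the pivot column
        return init_threshold(part, t)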

Recursive Algorithm
- At any step, given a rank-1 approximation A ≈ x y^T, split A into A1 and A0 based on its rows:
  - if x(i) = 0, row i goes to A0
  - if x(i) = 1, row i goes to A1
- Stop when the Hamming radius of A1 (the maximum Hamming distance from a row of A1 to the pattern vector) is below some threshold and all rows of A are present in A1
- If A1 does not satisfy the Hamming radius condition, it can be split further based on the Hamming distances of its rows
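
A sketch of the recursion (ours, not the paper's code), reusing the rank_one helper sketched earlier. The single-densest-column initialization and the minimum-distance split rule are simplifying assumptions:

    import numpy as np

    def decompose(A, radius, patterns=None):
        """Recursively peel (presence, pattern) pairs off a 0/1 matrix A.
        'radius' is the Hamming-radius stopping threshold."""
        if patterns is None:
            patterns = []
        if len(A) == 0 or A.sum() == 0:      # nothing non-trivial left
            return patterns
        y0 = np.zeros(A.shape[1], dtype=int)
        y0[np.argmax(A.sum(axis=0))] = 1     # 'Maximum' initialization, for brevity
        x, y = rank_one(A, y0)               # rank-1 heuristic sketched earlier
        A1, A0 = A[x == 1], A[x == 0]        # split rows on the presence vector
        if len(A0) == 0:                     # all rows of A are present in A1
            d = np.abs(A1 - y).sum(axis=1)   # Hamming distances to the pattern
            if d.max() < radius or d.min() == d.max():
                patterns.append(y)           # uniform enough: record the pattern
            else:                            # else split A1 on Hamming distance
                decompose(A1[d == d.min()], radius, patterns)
                decompose(A1[d > d.min()], radius, patterns)
            return patterns
        decompose(A1, radius, patterns)      # recurse on both parts
        decompose(A0, radius, patterns)
        return patterns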

Effectiveness of Analysis (figures)

Run-time Scalability
- Rank-1 approximation requires O(nz(A)) time
- The total run-time at each level of the recursion tree cannot exceed this, since the total number of non-zeros at each level is at most nz(A)
- The overall run-time is therefore linear in nz(A)
(plots: run-time vs. number of columns, number of rows, and number of non-zeros)

Conclusions and Ongoing Work
The proposed algorithm is:
- Scalable to extremely high dimensions
- Effective in discovering dominant patterns
- Hierarchical in nature, allowing multi-resolution analysis
Currently working on:
- Real-world applications of the proposed method
- Effective initialization schemes
- Parallel implementation