Technical Report of Web Mining Group
Presented by: Mohsen Kamyar
Ferdowsi University of Mashhad, WTLab

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

The main approach in Concept Extraction (hereafter CE) is LSI. LSI is the combination of one matrix algorithm and some probabilistic analyses of its output, applied to the term-document matrix. First we build the term-document matrix (using a measure such as TF-IDF to indicate the importance of a term in a particular document), then feed it to the SVD (Singular Value Decomposition) algorithm, and finally choose the first k columns as concepts.
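As a rough sketch of this pipeline (assuming NumPy and scikit-learn; the toy corpus and the choice k = 2 are illustrative, not from the report):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gold silver truck", "shipment of gold", "delivery of silver"]  # toy corpus

# Term-document matrix M (terms as rows, documents as columns), TF-IDF weighted.
M = TfidfVectorizer().fit_transform(docs).T.toarray()

# SVD: M = U @ diag(S) @ Vt, with singular values S in descending order.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
concepts = U[:, :k]                       # first k columns of U = "concepts"
doc_coords = np.diag(S[:k]) @ Vt[:k, :]   # documents expressed in concept space
```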

Singular Value Decomposition is an algorithm that factors a matrix M (assume M is m×n) into three matrices U, S, and V such that M = U S V^T, where S is a diagonal matrix of singular values, the columns of U are eigenvectors of M M^T (the term-correlation matrix), and the columns of V are eigenvectors of M^T M (the document-correlation matrix). The singular values in S are sorted in descending order, so the first k of them, together with the first k columns of U (or the first k rows of V^T), carry the most important structure.
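These relationships are easy to verify numerically (a sketch assuming NumPy; the random matrix is only a stand-in for a real term-document matrix):

```python
import numpy as np

M = np.random.rand(5, 3)                      # stand-in term-document matrix
U, S, Vt = np.linalg.svd(M, full_matrices=False)

assert np.allclose(M, U @ np.diag(S) @ Vt)    # M = U S V^T
assert np.all(np.diff(S) <= 0)                # singular values sorted descending
# Columns of U are eigenvectors of M M^T, columns of V of M^T M,
# with eigenvalues equal to the squared singular values:
assert np.allclose(M @ M.T @ U, U @ np.diag(S**2))
assert np.allclose(M.T @ M @ Vt.T, Vt.T @ np.diag(S**2))
```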

The steps of SVD can be outlined as below (steps 1-5 form a Householder reflection):
1- Select the first column of matrix M_1; call it u_1.
2- Calculate the length of u_1 and add it to its first element.
3- Set B_1 = |u_1|^2 / 2.
4- Set U_1 = I - B_1^{-1} u_1 u_1^T.
5- Set M_2 = U_1 M_1.
6- Do the same for the first row, then repeat for the remaining rows and columns.
In general, for the i-th column or row, step 2 first sets all elements before the i-th element to zero, then calculates the length and adds the result to the i-th element.
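A minimal NumPy sketch of one such Householder step (the example matrix is illustrative; a full SVD would alternate column and row steps as described above):

```python
import numpy as np

def one_householder_step(M):
    # Steps 1-2: take the first column and add its length to its first element.
    u = M[:, 0].astype(float).copy()
    u[0] += np.linalg.norm(u)
    # Steps 3-4: B = |u|^2 / 2, then the reflector U = I - B^{-1} u u^T.
    B = (u @ u) / 2.0
    U = np.eye(len(u)) - np.outer(u, u) / B
    # Step 5: applying U zeroes the first column below its first element.
    return U @ M

M1 = np.array([[4.0, 1.0],
               [3.0, 2.0],
               [0.0, 1.0]])
M2 = one_householder_step(M1)
print(np.round(M2, 6))   # first column becomes (-5, 0, 0)
```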

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

We can list the main problems of LSI as below:
- The method is based on the sum of squared distances, Σ_i (s_i - t_i)^2, so it is suited to data with a Gaussian (normal) distribution; the term-document matrix, however, follows a Poisson distribution.
- The method is very slow: its computational complexity is O(n^3 m), with n ≪ m.
- The Poisson distribution is a memoryless distribution; in other words, the next occurrence of the random variable X does not depend on previous occurrences.

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

There is a wide variety of clustering methods, but we can group them as below:
- Discrete methods
  - Linear approaches: PCA, K-Means, K-Medians, K-Centers, LSH
  - Non-linear approaches: KPCA, Embeddings
- Artificial-intelligence-based approaches

PCA stands for Principal Component Analysis and is a family of methods that use eigenvector and eigenvalue properties for clustering; SVD is one of the main approaches in the PCA family. It has recently been proved that K-Means and the other members of its family can also be placed in the PCA family. PCA methods are linear and cannot cluster data whose dependence structure is nonlinear, and they are suited to Gaussian-distributed data.
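For reference, a bare-bones PCA via the covariance eigendecomposition (a sketch assuming NumPy; the function name and interface are ours, not from the report):

```python
import numpy as np

def pca(X, k):
    """X: n_samples x n_features. Returns the top-k principal axes and scores."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = (Xc.T @ Xc) / (len(X) - 1)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    W = eigvecs[:, order]                    # principal axes as columns
    return W, Xc @ W                         # axes and projected data
```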

An example of nonlinear dependence (figure omitted).

K-Means, however, has computational complexity O(nm), which is better than SVD's O(n^3 m). LSH is also a linear method with good computational complexity.
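A plain Lloyd-iteration K-Means, to make the O(nm) cost concrete (a sketch assuming NumPy; this is not the core-set variant used in the experiments later):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm; each iteration is O(n*m*k), i.e. O(nm) for fixed k."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Move each center to the mean of its points (keep it if it has none).
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```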

KPCA (Kernel PCA) is a family of methods for nonlinear clustering. There are two groups within KPCA. The first is kernel functions: in this family we must devise a function that converts a nonlinear dependence into a linear one; the Gaussian kernel shown below is one example.
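A sketch of the Gaussian (RBF) kernel (assuming NumPy; sigma is a free bandwidth parameter):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```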

The second is kernel tricks: in this family we map the original space into a higher-dimensional space with specific properties (some methods map the data into a Hilbert space, which is a special case of a Banach space) so that nonlinear dependence becomes linear, and then apply PCA methods; this is where embedding methods come in (see the sketch below). Artificial-intelligence-based clustering is too slow for our purpose.
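Combining a kernel matrix with PCA gives kernel PCA; a minimal sketch (assuming NumPy and the gaussian_kernel above; the double-centering corresponds to centering the data in the implicit feature space):

```python
import numpy as np

def kernel_pca(K, k):
    """Project points onto the top-k kernel principal components.
    K: precomputed n x n kernel matrix (e.g. from gaussian_kernel)."""
    n = len(K)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # double-center the kernel
    eigvals, eigvecs = np.linalg.eigh(Kc)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas                            # projections of the n points
```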

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Our work targets both finding an appropriate kernel function and finding an appropriate embedding, but in this phase we focus on kernel functions. Our idea differs slightly from the main approach: we change the distance function instead of the points in order to reach linearity. There is a technique in statistics and probability theory called the "copula": a copula is a framework for finding a bivariate distribution function for two random variables.

The main idea is as below: two variables are independent if and only if their joint probability equals the product of their marginal probabilities. So we first find an appropriate copula function and then calculate the volume enclosed between the copula surface and the surface of the product of the marginals; this volume can be used as a measure of dependence, which gives us a good kernel function. A wide variety of general-purpose copula functions exist, have been used in different lines of research, and have produced good results.
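A sketch of this dependence measure built on the empirical copula (assuming NumPy; the grid resolution and the use of an empirical rather than a Bernstein-polynomial copula are our simplifications, not the report's method):

```python
import numpy as np

def copula_dependence(x, y, grid=20):
    """Approximate volume between the empirical copula C(u, v) and the
    independence copula u*v; near zero means the variables are independent."""
    n = len(x)
    # Pseudo-observations: ranks rescaled into (0, 1].
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    s = np.linspace(0, 1, grid + 1)[1:]          # grid over the unit square
    # Empirical copula: C(a, b) = (1/n) * #{i : u_i <= a, v_i <= b}.
    U = (u[:, None] <= s).astype(float)          # n x grid indicator matrix
    V = (v[:, None] <= s).astype(float)
    C = U.T @ V / n                              # grid x grid copula estimate
    indep = np.outer(s, s)                       # surface u * v
    return np.abs(C - indep).mean()              # approximate enclosed volume

x = np.random.rand(500)
print(copula_dependence(x, x ** 2))              # large: nonlinear dependence
print(copula_dependence(x, np.random.rand(500))) # near zero: independent
```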

A sample copula surface fitted to sample data using a Bernstein-polynomial copula (figure omitted).

The main advantages of our idea are as follows:
- The entire preprocessing has computational complexity of about O(nm^2). So if we use K-Means (O(nm)), we obtain an algorithm with overall complexity O(nm^2) that detects clusters with nonlinear dependence, whereas SVD costs O(n^3 m), with n ≪ m, and handles only linear dependence.
- Copula functions make no assumption about the data distribution; remarkably, they can even be used for two variables with different distributions. SVD, by contrast, is suited only to Gaussian-distributed data.

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

To test our ideas we did the following:
- First we obtained popular datasets, all from the Machine Learning Group of the School of Computer Science and Informatics, University College Dublin (UCD).
- Next we studied the structure of the SVD and K-Means algorithms (obtaining K-Means via core-sets).
- We used MATLAB to implement the algorithms.
- We tested SVD and K-Means on the datasets. For example, one concept group we obtained for the BBC dataset consists of the terms juventu, cocain, romanian, alessandro, luciano, adrian, chelsea, ruin, bayern, drug, fifa, club, ...; another concept group is about printers, and so on.

The next step is to implement the copula approach in MATLAB and compare its results with those of standard SVD and K-Means.