Technical Report of Web Mining Group Presented by: Mohsen Kamyar Ferdowsi University of Mashhad, WTLab.

Main Approach in Concept Extraction
Problems
Clustering Methods and LSI
Ideas and Our Works
Experimental Results

The main approach to Concept Extraction (CE) is Latent Semantic Indexing (LSI). LSI combines one matrix algorithm with some probabilistic analysis of its output, applied to the term-document matrix. First we build the term-document matrix, using a measure such as TF-IDF to indicate the importance of a term in a particular document. Then we pass it to the SVD (Singular Value Decomposition) algorithm, and finally we take the first k columns as concepts.
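The pipeline described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the toy documents, the plain log(N/df) IDF variant, and the helper names `tfidf_matrix`, `lsi_concepts` and `top_terms` are hypothetical and do not come from the report.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a term-document matrix (terms x documents) weighted by TF-IDF."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1
    tf = counts / counts.sum(axis=0, keepdims=True)  # term frequency per document
    df = (counts > 0).sum(axis=1)                    # documents containing each term
    idf = np.log(len(docs) / df)                     # assumed IDF variant: log(N / df)
    return vocab, tf * idf[:, None]

def lsi_concepts(M, k):
    """Return the first k left singular vectors (concepts in term space) and their singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k]

def top_terms(vocab, concept, n=3):
    """Terms with the largest absolute weight in one concept vector."""
    idx = np.argsort(-np.abs(concept))[:n]
    return [vocab[i] for i in idx]

# Toy corpus (hypothetical), two topics with disjoint vocabularies.
docs = [
    "club football goal club",
    "football goal match",
    "printer ink paper",
    "printer paper toner",
]
vocab, M = tfidf_matrix(docs)
concepts, strengths = lsi_concepts(M, k=2)
```

Calling `top_terms(vocab, concepts[:, 0])` lists the terms that dominate the first concept; on a real corpus the matrix would normally be sparse and a truncated solver would be preferable to a full SVD.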

Singular Value Decomposition factors a matrix M (assumed to be m×n) into three matrices U, S and V such that $M = U S V^T$. S is a diagonal matrix of singular values, the columns of U are eigenvectors of $M M^T$ (the term correlation matrix), and the columns of V are eigenvectors of $M^T M$ (the document correlation matrix). The singular values in S are sorted in descending order, so the first k of them, together with the first k columns of U and the first k rows of $V^T$, carry the most important information.
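Written out, the relations in this slide look as follows. The block uses the thin form of the decomposition, with r the rank of M; the symbols $\sigma_i$, $u_i$, $v_i$ and $M_k$ are notation introduced here for illustration and are not used in the report.

```latex
\[
M = U S V^{T}, \qquad U^{T}U = I, \quad V^{T}V = I, \quad
S = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \dots \ge \sigma_r > 0
\]
\[
M M^{T} = U S^{2} U^{T}, \qquad M^{T} M = V S^{2} V^{T}
\]
\[
M_k = U_k S_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
\]
```

The second line shows why U and V hold the eigenvectors of the two correlation matrices, with eigenvalues $\sigma_i^2$; the third is the rank-k truncation that keeps only the k leading concepts.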

The steps of SVD can be explained as follows:
1- Select the first column of the matrix $M_1$ and call it $u_1$.
2- Calculate the length of $u_1$ and add it to the first element of $u_1$.
3- Set $B_1 = |u_1|^2 / 2$.
4- Set $U_1 = I - B_1^{-1} u_1 u_1^T$.
5- Set $M_2 = U_1 M_1$.
6- Do the same for the first row, then repeat for the other rows and columns.
In general, for the i-th column or row, step 2 first sets all elements before the i-th element to zero, then calculates the length and adds the result to the i-th element.
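These steps appear to describe Householder reflections, which are commonly used to reduce a matrix to bidiagonal form as the first phase of an SVD. The sketch below follows the numbered steps under that reading. The function name `householder`, the example matrix, and the sign choice in step 2 (added to avoid cancellation) are assumptions, not part of the report.

```python
import numpy as np

def householder(x, i=0):
    """Reflector that zeroes x[i+1:] and leaves the entries before position i untouched."""
    u = np.array(x, dtype=float)
    u[:i] = 0.0                                  # ignore entries before position i
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return np.eye(len(u))                    # nothing to eliminate
    sign = 1.0 if u[i] >= 0 else -1.0            # assumed sign choice, avoids cancellation
    u[i] += sign * norm                          # step 2: add the length to the i-th element
    b = (u @ u) / 2.0                            # step 3: B = |u|^2 / 2
    return np.eye(len(u)) - np.outer(u, u) / b   # step 4: U = I - B^{-1} u u^T

M1 = np.array([[4.0, 1.0, 2.0],
               [3.0, 5.0, 1.0],
               [0.0, 2.0, 6.0],
               [0.0, 1.0, 3.0]])

U1 = householder(M1[:, 0])       # built from the first column
M2 = U1 @ M1                     # step 5: entries below M2[0, 0] become zero

V1 = householder(M2[0, :], i=1)  # step 6: same idea for the first row, skipping its first entry
M3 = M2 @ V1                     # entries to the right of M3[0, 1] become zero
```

Repeating the column and row steps for i = 1, 2, ... yields a bidiagonal matrix; the singular values are then obtained from that matrix by a separate iterative stage that the slide does not cover.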

The main problems of LSI are:
- The method is based on the sum of squared distances, $\sum_i (s_i - t_i)^2$, so it suits data with a Gaussian (normal) distribution, whereas the term-document matrix has a Poisson distribution.
- The method is very slow: its computational complexity is $O(n^3 m)$, with $n \ll m$.
The Poisson distribution is memoryless. In other words, the next occurrence of the random variable X does not depend on previous occurrences.

There is a wide variety of clustering methods, but we can group them as follows:
- Discrete methods
  - Linear approaches
    - PCA
    - K-Means
    - K-Medians
    - K-Centers
    - LSH
  - Non-linear approaches
    - KPCA
    - Embeddings
- Artificial-intelligence-based approaches

PCA stands for Principal Component Analysis, a collection of methods that use the properties of eigenvectors and eigenvalues for clustering. SVD is therefore one of the main approaches in the PCA collection. It has recently been shown that K-Means and the other members of its family can also be placed in the PCA family. The PCA family consists of linear approaches, so it cannot cluster data whose dependence is nonlinear. The PCA family is suited to Gaussian-distributed data.

[Figure: an example of nonlinear dependence.]

K-Means, however, has computational complexity $O(nm)$, which is better than that of SVD, $O(n^3 m)$. LSH (Locality-Sensitive Hashing) is also a linear method and has good computational complexity.
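For reference, a minimal K-Means loop is sketched below. It uses plain Lloyd iterations with random initial centers, which is an assumption made for brevity: the report mentions obtaining K-Means through core-sets, and that variant is not shown here. The function name `kmeans` and the toy points are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):                 # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels, centers

# Two well separated toy groups (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(X, k=2)
```

Each iteration touches every point once per center, which is where the method's low per-iteration cost comes from.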

KPCA (Kernel PCA) is a collection of methods for nonlinear clustering. There are two groups in KPCA:
Kernel functions: in this family we must devise a function that converts a nonlinear dependence into a linear one. For an example that uses the Gaussian function, see below.
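A Gaussian kernel combined with the usual kernel PCA recipe can be sketched as follows. The bandwidth `sigma`, the function names, and the concentric-circle data are assumptions chosen for illustration; the report does not give the kernel's parameters.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)                 # clip tiny negatives from rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(K, k):
    """Project the training points onto the k leading kernel principal components."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center the data in feature space
    vals, vecs = np.linalg.eigh(Kc)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # indices of the k largest eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))

# Two concentric circles: a standard case where the structure is not linear.
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * circle, 3.0 * circle])

K = gaussian_kernel_matrix(X, sigma=1.0)
Z = kernel_pca(K, k=2)
```

A linear method such as K-Means can then be run on the rows of `Z` instead of the original points; how well that works depends strongly on the choice of `sigma`.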

Kernel tricks: in this family we map the original space to a higher-dimensional space with specific properties (some methods map the data to a Hilbert space, which is a special case of a Banach space), so that the nonlinear dependence becomes linear and PCA methods can then be applied. This approach relies on embedding methods.

Artificial-intelligence-based clustering methods are too slow for our purpose.

Our work covers both finding an appropriate kernel function and finding an appropriate embedding, but in this phase we focus on kernel functions. Our idea differs slightly from the main approach: to reach linearity, we change the distance function instead of the points. There is a technique in statistics and probability theory called the "copula". A copula is a framework for finding a bivariate distribution function for two random variables.

The main idea is as follows: two variables are independent if their joint probability equals the product of their individual probabilities. So we first find an appropriate copula function and then calculate the volume enclosed between the copula surface and the surface of the product of the variables' probabilities. This volume can be used as a measure of independence, which gives us a good kernel function. A wide variety of general-purpose copula functions exist; they have been used in different studies with good results.
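The volume idea can be sketched numerically. The code below uses the rank-based empirical copula as a stand-in, which is an assumption: the report works with a fitted copula function such as a Bernstein polynomial copula. The function names `ranks` and `copula_independence_gap`, the grid size, and the test data are hypothetical.

```python
import numpy as np

def ranks(a):
    """Ranks 1..n of the values in a (ties broken by position)."""
    order = np.argsort(a, kind="stable")
    r = np.empty(len(a), dtype=int)
    r[order] = np.arange(1, len(a) + 1)
    return r

def copula_independence_gap(x, y, grid=50):
    """Approximate volume between the empirical copula of (x, y) and the surface u*v."""
    n = len(x)
    u = ranks(x) / n                                     # pseudo-observations in (0, 1]
    v = ranks(y) / n
    t = np.arange(1, grid + 1) / grid                    # evaluation grid on (0, 1]
    below_u = (u[:, None] <= t[None, :]).astype(float)   # n x grid indicator matrix
    below_v = (v[:, None] <= t[None, :]).astype(float)
    C = below_u.T @ below_v / n                          # empirical copula on the grid
    product = np.outer(t, t)                             # independence copula u*v
    return np.mean(np.abs(C - product))                  # mean absolute gap over the grid

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)

gap_quadratic = copula_independence_gap(x, x ** 2)          # nonlinear, non-monotone dependence
gap_monotone = copula_independence_gap(x, np.exp(x))        # perfectly monotone dependence
gap_independent = copula_independence_gap(rng.random(400), rng.random(400))
```

In this toy setup the Pearson correlation between `x` and `x ** 2` is essentially zero because the data are symmetric around zero, yet the copula gap is clearly larger than for the independent sample. The gap is close to zero for independent variables and grows with dependence, so it would still need to be turned into a similarity or distance before it can serve as a kernel.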

[Figure: a sample copula function obtained for sample data using a Bernstein polynomial copula function.]
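For orientation, the standard bivariate Bernstein copula has the form below. The order $d$ and the underlying copula $C$ (in practice often the empirical copula) are notation assumed here; the report does not say which order or base copula produced the figure.

```latex
\[
C_B(u, v) = \sum_{i=0}^{d} \sum_{j=0}^{d}
  C\!\left(\tfrac{i}{d}, \tfrac{j}{d}\right) P_{i,d}(u)\, P_{j,d}(v),
\qquad
P_{i,d}(t) = \binom{d}{i}\, t^{i} (1 - t)^{d - i}
\]
```

The Bernstein polynomials $P_{i,d}$ smooth the values of $C$ on a regular grid, which gives a continuous surface that can be compared with the product surface $uv$.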

The main advantages of our idea are as follows:
- The total computational complexity of the preprocessing is about $O(nm^2)$. So if we use K-Means ($O(nm)$), we obtain an algorithm with computational complexity $O(nm^2)$ for detecting clusters with nonlinear dependence (SVD has $O(n^3 m)$ for linear dependence, with $n \gg m$).
- Copula functions do not depend on the data distribution. Notably, they can even be used for two variables with different distributions. SVD, on the other hand, is suited to Gaussian-distributed data.

To test our ideas we did the following:
- First we obtained popular datasets. They are all from the Machine Learning Group, School of Computer Science and Informatics, University College Dublin (UCD).
- Next we studied the structure of the SVD and K-Means algorithms (obtaining K-Means using core-sets).
- We used MATLAB to implement the algorithms.
- We tested SVD and K-Means on the datasets.
For example, one concept group that we obtained for the BBC dataset consists of the following terms: juventu, cocain, romanian, alessandro, luciano, adrian, chelsea, ruin, bayern, drug, fifa, club,... Another concept group is about printers, and so on.

The next step is to implement the copula approach in MATLAB and compare its results with those of standard SVD and K-Means.
