Database Systems Group Research Overview 2010


1 Database Systems Group Research Overview 2010

2 OLAP Statistical Tests
Goal: Isolate factors that cause significant changes in a measured value
– Ex: Increase in age causes increase in risk for heart disease
Combined OLAP with a parametric means comparison test
– Used to pair similar groups and determine whether they are significantly different
– Want to reject the hypothesis that the two groups have the same mean
Developed a GUI for easy user interaction
Zhibo Chen Advisor: Dr. Carlos Ordonez
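A minimal sketch of a means comparison between two OLAP groups. The slide does not name the exact test, so the large-sample z approximation, the function name `means_comparison`, and the sample values are assumptions made for illustration. For small groups a t distribution would be the more usual reference.

```python
import math
from statistics import NormalDist, mean, variance

def means_comparison(group_a, group_b):
    """Two-sided test of H0: both groups have the same mean.

    Assumption of this sketch: a large-sample z approximation.
    Returns (z statistic, p-value).
    """
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measured values for two groups that differ only in age band.
younger = [10, 12, 11, 13, 9, 12, 10, 11, 12, 10]
older = [15, 17, 14, 16, 18, 15, 16, 17, 14, 16]

z, p = means_comparison(younger, older)
reject_h0 = p < 0.05  # reject "same mean" at the 5% level
```

A small p-value here plays the role described on the slide: it is the evidence for rejecting the hypothesis that the paired groups share one mean.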

3 OLAP Statistical Tests
Association Rules (AR)
– Technique used to detect patterns among the items of a dataset
– Ex: High Age, High Cholesterol => Heart Disease
Compare results from both techniques
OLAP Statistical Test discovered more rules than Association Rules
– p-value is more reliable than confidence (considers the pdf)
– OLAP is affected less by the distribution than AR
AR is better when performance is the priority and data is skewed
OLAP Statistical Test is better when data is distributed
Zhibo Chen Advisor: Dr. Carlos Ordonez
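For reference, the support and confidence of a rule such as the one above can be computed as follows. This is an illustrative sketch only: the `rule_metrics` helper and the five patient records are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Records containing the left-hand side of the rule.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    # Records containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Hypothetical patient records, one set of items per patient.
patients = [
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol"},
    {"HighAge"},
    {"HeartDisease"},
]

support, confidence = rule_metrics(
    patients, {"HighAge", "HighCholesterol"}, {"HeartDisease"}
)
```

Confidence is a plain ratio of counts, which is why it ignores the shape of the underlying distribution, unlike a p-value.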

4 OLAP Statistical Test versus Association Rules
Blue and red lines represent the location of the averages of the two groups
– Averages are fairly different from one another
Confidence says that the two groups are similar
– Many blue points above 50
– Many red points above 50
– Confidence is low
Zhibo Chen Advisor: Dr. Carlos Ordonez

5 OLAP Exploration with UDF
On-Line Analytical Processing (OLAP)
– Set of techniques allowing users to explore various aggregations of a dataset
– Ex: dataset with day, month, year, sales
  What were the average sales for Sundays?
  Solve by grouping on day and then extracting Sunday
Normally done outside the database or with OLAP servers
– We want to study how to perform the same techniques inside the DBMS (SQL or UDF)
– Found that users can efficiently perform OLAP exploration using UDFs
Zhibo Chen Advisor: Dr. Carlos Ordonez
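The Sunday example can be sketched inside a DBMS with an aggregate UDF. SQLite (through Python's `sqlite3` module) stands in for the DBMS here; the `sales` table, its rows, and the `udf_mean` aggregate are assumptions of this sketch, not the UDFs from the slides.

```python
import sqlite3

class Mean:
    """Aggregate UDF: running sum and count, finalized into an average."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def step(self, value):
        self.total += value
        self.count += 1

    def finalize(self):
        return self.total / self.count if self.count else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (day TEXT, month INTEGER, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Sunday", 1, 2010, 100.0),
        ("Sunday", 2, 2010, 140.0),
        ("Monday", 1, 2010, 80.0),
        ("Monday", 2, 2010, 60.0),
    ],
)
conn.create_aggregate("udf_mean", 1, Mean)

# Group on day inside the database, then extract Sunday.
avg_by_day = dict(
    conn.execute("SELECT day, udf_mean(amount) FROM sales GROUP BY day")
)
sunday_avg = avg_by_day["Sunday"]
```

Grouping on other columns (month, year, or combinations) follows the same pattern, which is the kind of aggregation exploration the slide describes.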

6 Digital Libraries in a DBMS
Information retrieval techniques have traditionally been exploited outside relational database systems due to storage overhead, the complexity of fitting them into a relational model, and slower performance in SQL implementations.
Searching and querying documents under information retrieval models in relational database systems can be performed with optimized SQL.
We explore three phases:
– Document preprocessing
– Document storage
– Document retrieval (VSM, OPM, DPLM)
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez
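The three phases can be illustrated with a toy vector space model (VSM) ranking in SQL. This is a simplified sketch under stated assumptions, not the optimized SQL from the slides: raw term frequencies are used as weights, SQLite stands in for the DBMS, and the tables `tf`, `doc_norm` and `query` as well as the three documents are hypothetical.

```python
import math
import sqlite3
from collections import Counter

docs = {
    1: "database systems store relational tables",
    2: "clustering groups similar data points",
    3: "relational database query optimization",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tf (doc_id INTEGER, term TEXT, freq INTEGER)")
conn.execute("CREATE TABLE doc_norm (doc_id INTEGER PRIMARY KEY, norm REAL)")
conn.execute("CREATE TABLE query (term TEXT)")

# Phases 1 and 2: preprocessing (lowercase, whitespace tokens) and storage.
for doc_id, text in docs.items():
    counts = Counter(text.lower().split())
    conn.executemany(
        "INSERT INTO tf VALUES (?, ?, ?)",
        [(doc_id, term, freq) for term, freq in counts.items()],
    )
    norm = math.sqrt(sum(f * f for f in counts.values()))
    conn.execute("INSERT INTO doc_norm VALUES (?, ?)", (doc_id, norm))

def search(query_text):
    """Phase 3: rank documents by dot product over document norm.

    The query norm is omitted because it is the same for every document
    and therefore does not change the ranking.
    """
    conn.execute("DELETE FROM query")
    conn.executemany(
        "INSERT INTO query VALUES (?)",
        [(t,) for t in set(query_text.lower().split())],
    )
    return conn.execute(
        """
        SELECT tf.doc_id, SUM(tf.freq) / doc_norm.norm AS score
        FROM tf
        JOIN query ON tf.term = query.term
        JOIN doc_norm ON doc_norm.doc_id = tf.doc_id
        GROUP BY tf.doc_id, doc_norm.norm
        ORDER BY score DESC, tf.doc_id
        """
    ).fetchall()

ranking = search("relational database")
```

Documents that share no term with the query never join and so do not appear in the ranking.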

7 Keyword Search Across Documents and Databases
Sometimes the meaning and structure of a database are unknown.
There are external semi-structured sources that can help to describe it.
We found that we can link these two worlds to identify relationships between the structured data and the semi-structured data.
We believe that the right approach is to do it inside the database.
We implemented a prototype entirely in SQL.
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez

8 Bayesian Statistics
Latest trend in advanced statistics; very demanding: CPU and large data sets
Applied to microarray data in the DBMS. The problem involves high-dimensional data with few samples.
Variable selection is the first issue that we have been trying to solve. Looking for the best model is computationally expensive (2^d candidate models), where d is the number of dimensions.
Applying SQL optimizations and data layout modifications, we obtain selections over > 1 M dimensions in less than 3 seconds, but this is still not enough.
Current work: Gibbs Sampler Variable Selection.
Carlos Garcia-Alvarado Advisor: Dr. Carlos Ordonez

9 PCA
Black-box
Rotation of the input space
Make the representative components evident
No covariance between attributes
Variance represented by the eigenvalues
Deal with high dimensionality
Mario Navas Advisor: Dr. Carlos Ordonez

10 DB Implementation
Summary matrices n, L, Q
Correlation matrix
Eigenvalue decomposition problem
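A sketch of the first two steps, assuming that n is the row count, L the per-attribute linear sum, and Q the matrix of sums of cross-products (the slide lists the matrices without defining them). The correlation between attributes a and b then follows from rho_ab = (n Q_ab - L_a L_b) / (sqrt(n Q_aa - L_a^2) sqrt(n Q_bb - L_b^2)). The function names and the data are hypothetical, and the eigenvalue decomposition step is not shown.

```python
import math

def summary_matrices(rows):
    """Compute n, L and Q in one pass over the data set."""
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q

def correlation_matrix(n, L, Q):
    """Derive the correlation matrix from n, L, Q without rereading the data.

    Assumes no attribute is constant (otherwise a divisor is zero).
    """
    d = len(L)
    spread = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [
        [(n * Q[a][b] - L[a] * L[b]) / (spread[a] * spread[b]) for b in range(d)]
        for a in range(d)
    ]

# Hypothetical data: the second attribute is exactly twice the first.
data = [(1.0, 2.0, 5.0), (2.0, 4.0, 3.0), (3.0, 6.0, 4.0), (4.0, 8.0, 1.0)]
n, L, Q = summary_matrices(data)
rho = correlation_matrix(n, L, Q)
```

Because the correlation matrix depends on the data only through n, L and Q, the expensive pass over the table happens once and the remaining work is on small d x d matrices.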

11 Outlier detection in microarray data
Deal with high dimensionality
Redundancy minimized
Find distance-based outliers in a reduced space
Figure panels: PCA-based Outliers [2D], Distance-based Outliers [7D]; PCA-based Outliers [2D], Distance-based Outliers [126]; Matching top 10

12 Bayesian Classification Based On Decomposition via Clustering
An extension of Naïve Bayes
Class decomposition of the Gaussians using clustering
Using K-Means and E-M
Scalability: query optimizations for computationally and memory-intensive computations
Incremental learning of the classifier
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez

13 Computing Distance & Sufficient Statistics Using SQL & UDFs
Five different SQL optimizations and one User Defined Function (UDF) to compute Euclidean distance in K-Means
Sufficient statistics
– Count, Linear Sum and Quadratic Sum for multiple clusters and multiple classes, computed in a single data set scan using SQL or a UDF
Sasi Kumar Pitchaimalai Advisor: Dr. Carlos Ordonez
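A plain, unoptimized illustration of the two computations in SQL. It does not reproduce any of the five optimizations or the UDF from the slides, and it does not guarantee a single scan. SQLite stands in for the DBMS, the tables `points` and `centroids` and their rows are hypothetical, and squared Euclidean distance is used because it selects the same nearest centroid. The sketch also assumes no point is exactly equidistant from two centroids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.execute("CREATE TABLE centroids (j INTEGER PRIMARY KEY, c1 REAL, c2 REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1, 1.0, 1.0), (2, 1.0, 2.0), (3, 9.0, 9.0), (4, 10.0, 8.0), (5, 2.0, 1.0)],
)
conn.executemany(
    "INSERT INTO centroids VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (2, 10.0, 10.0)],
)

STATS_SQL = """
WITH dist AS (
    -- Squared Euclidean distance from every point to every centroid.
    SELECT p.i, p.x1, p.x2, c.j,
           (p.x1 - c.c1) * (p.x1 - c.c1)
         + (p.x2 - c.c2) * (p.x2 - c.c2) AS d2
    FROM points AS p CROSS JOIN centroids AS c
),
nearest AS (
    -- Keep, for each point, the row of its nearest centroid.
    SELECT d.i, d.x1, d.x2, d.j
    FROM dist AS d
    WHERE d.d2 = (SELECT MIN(e.d2) FROM dist AS e WHERE e.i = d.i)
)
-- Count (N), linear sum (L) and quadratic sum (Q) per cluster.
SELECT j, COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
FROM nearest
GROUP BY j
ORDER BY j
"""

stats = {
    j: {"N": n, "L": (l1, l2), "Q": (q1, q2)}
    for j, n, l1, l2, q1, q2 in conn.execute(STATS_SQL)
}
```

From N, L and Q per cluster, the cluster mean is L / N and the per-dimension variance is Q / N minus the squared mean, so no further pass over `points` is needed to update the model.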

14 Fast Bayesian Classifier Based on FREM
The Algorithm
– Initialization: Randomly initialize k clusters per class from the data set.
– E-step: Compute Mahalanobis distance, find the nearest cluster and then compute sufficient statistics.
– M-step: Recompute the means, variances and weights of the clusters per class. Mixture parameters are updated in this step.
– SplitClusters: Split heavy clusters to reach higher quality solutions and reseed low-weight clusters.
– The E-step and M-step are iterated until the model converges.
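The E-step and M-step loop can be sketched as below. This is a simplified illustration, not FREM itself: it handles a single class, assumes diagonal covariances and hard assignment to the nearest cluster, and omits the SplitClusters step. The function names, the convergence rule (assignments stop changing) and the data are assumptions of the sketch.

```python
import random

def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def fit(points, initial_means, max_iter=50, min_var=1e-6):
    d, k = len(points[0]), len(initial_means)
    means = [list(m) for m in initial_means]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    previous = None
    for _ in range(max_iter):
        # E-step: nearest cluster per point, accumulating N, L, Q.
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        assignment = []
        for x in points:
            j = min(range(k),
                    key=lambda c: mahalanobis_sq(x, means[c], variances[c]))
            assignment.append(j)
            N[j] += 1
            for a in range(d):
                L[j][a] += x[a]
                Q[j][a] += x[a] ** 2
        if assignment == previous:
            break  # converged: no point changed cluster
        previous = assignment
        # M-step: update means, variances and weights from N, L, Q.
        for j in range(k):
            if N[j] == 0:
                continue  # empty cluster keeps its parameters (no reseeding here)
            means[j] = [L[j][a] / N[j] for a in range(d)]
            variances[j] = [max(Q[j][a] / N[j] - means[j][a] ** 2, min_var)
                            for a in range(d)]
            weights[j] = N[j] / len(points)
    return means, variances, weights

# Hypothetical data: two well-separated groups of four points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0), (10.0, 10.0)]
initial = random.Random(0).sample(points, 2)  # random initialization
means, variances, weights = fit(points, initial)
```

The M-step reads only the sufficient statistics gathered in the E-step, so each iteration needs one pass over the data.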

15 Constrained Association Rules in SQL
Association rules are a data mining technique used to discover frequent patterns in a data set.
Real-world applications of this technique are broad and include fields such as medicine and commerce.
We can automatically generate efficient SQL queries for discovering association rules.
Kai Zhao Advisor: Dr. Carlos Ordonez
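As an illustration of generating such a query, the sketch below builds SQL that finds frequent item pairs restricted to an allowed item list, with a minimum support count. The `baskets` table, the `frequent_pairs_sql` generator and the data are hypothetical, the item constraint is only one possible kind of constraint, and the generated query is not claimed to match the optimized queries from the slides.

```python
import sqlite3

def frequent_pairs_sql(n_allowed):
    """Generate SQL for frequent item pairs among n_allowed permitted items."""
    placeholders = ", ".join("?" * n_allowed)
    return f"""
        SELECT a.item, b.item, COUNT(*) AS support
        FROM baskets AS a
        JOIN baskets AS b ON a.tid = b.tid AND a.item < b.item
        WHERE a.item IN ({placeholders}) AND b.item IN ({placeholders})
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY support DESC, a.item, b.item
    """

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [
        (1, "bread"), (1, "milk"), (1, "beer"),
        (2, "bread"), (2, "milk"),
        (3, "milk"), (3, "beer"),
        (4, "bread"), (4, "milk"), (4, "eggs"),
    ],
)

allowed = ["bread", "milk", "beer"]  # constraint: only these items may appear
min_support = 2                      # minimum number of transactions

sql = frequent_pairs_sql(len(allowed))
pairs = conn.execute(sql, allowed + allowed + [min_support]).fetchall()
```

Rules are then derived from the frequent itemsets by dividing the support of the pair by the support of the antecedent item.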

16 Comparison between CAR (constrained association rules) and DT (decision trees)
CAR perform an exhaustive combinatorial search, whereas DT recursively partition the input attribute space.
CAR aim to find all rules above the given thresholds, whereas DT find regions in space where most records belong to the same class.
CAR analyze item combinations, whereas DT select only one input attribute at a time.
Kai Zhao Advisor: Dr. Carlos Ordonez

17 Frequent Subgraph Mining
Frequent subgraph
– A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Figure: frequent patterns (min support is 2), panels (A), (B), (C)
Kai Zhao Advisor: Dr. Carlos Ordonez
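The support definition can be made concrete with a small sketch. It assumes every vertex label is unique within a graph, so a pattern occurs in a graph exactly when all of its edges are present; this sidesteps the subgraph isomorphism test that general frequent subgraph mining requires. The helper names and the three graphs are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two uniquely labeled vertices."""
    return frozenset((u, v))

def support(pattern, graphs):
    """Number of graphs in the dataset that contain every edge of the pattern."""
    return sum(1 for g in graphs if pattern <= g)

def is_frequent(pattern, graphs, min_support):
    return support(pattern, graphs) >= min_support

# Hypothetical dataset of three graphs, each stored as a set of edges.
graphs = [
    {edge("A", "B"), edge("B", "C")},
    {edge("A", "B"), edge("B", "C"), edge("C", "D")},
    {edge("A", "B"), edge("C", "D")},
]

path_abc = {edge("A", "B"), edge("B", "C")}
path_bcd = {edge("B", "C"), edge("C", "D")}

frequent_abc = is_frequent(path_abc, graphs, min_support=2)
frequent_bcd = is_frequent(path_bcd, graphs, min_support=2)
```

With a minimum support of 2, the path A-B-C qualifies because it occurs in two of the three graphs, while B-C-D occurs in only one.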