Applying the ROCAT algorithm to find subspace clusters in categorical data Presented by George Hodulik.


Project Description
Goals:
1. Develop an application that finds relevant overlapping subspace clusters in categorical data from an input SQL query using ROCAT.
2. Apply the application to the public dataset “Social Justice Sexuality Project: 2010 National Survey, including Puerto Rico (ICPSR 34363).”
3. Potentially apply the application to other public datasets.
4. Find optimizations to the algorithm to reduce runtime or improve results.
5. If I have time, I would also like to run other subspace clustering algorithms, such as DHCC, on the dataset and compare results.

The Dataset
“Social Justice Sexuality Project: 2010 National Survey, including Puerto Rico (ICPSR 34363).”
- 5 factors: racial and sexual identity, spirituality and religion, mental and physical health, family formations and dynamics, and civic and community engagement.
- Contains about 5000 data rows (each row is one person's survey responses) and over 100 attributes (the questions).
- Ideal for ROCAT: almost entirely categorical data, and the amount of data is in a range where ROCAT should run smoothly without the problem being trivially easy.

Implementation
Primarily Java (+ MySQL)
- The ROCAT algorithm translates very well into an object-oriented environment.
- It’s a language I am comfortable with.
I also used Excel and a Python script to import the CSV file into a MySQL database table.

Data preprocessing (what I have done so far)
The data needed to be preprocessed for the following reasons:
- Some questions were blanked for confidentiality, so there is no point in keeping these columns.
  Ex. Exact age was blanked, and instead, age was broken into three subgroups.
- Some questions were only answered by 600/5000 people because there were two versions of the survey.
  The columns for questions a person did not answer should not be considered for that person; these columns should only be considered for the 600 people who did answer them.
- Some questions are irrelevant.
  Ex. Whether someone took the paper or electronic version of the survey is probably not important.
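The preprocessing rules above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names and the column names in the usage note are hypothetical, and rows are assumed to be dictionaries of strings in which an empty string means "blank".

```python
def drop_unusable_columns(rows, irrelevant=()):
    """Drop columns that are blank in every row or listed as irrelevant.

    rows: list of dicts mapping column name -> string value ("" means blank).
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if c not in irrelevant and any(r[c].strip() for r in rows)
    ]
    return [{c: r[c] for c in keep} for r in rows]


def rows_with_answers(rows, columns):
    """Keep only the respondents who answered every column in `columns`."""
    return [r for r in rows if all(r[c].strip() for c in columns)]
```

For example, with hypothetical columns `age_exact` (blanked for everyone) and `survey_mode` (judged irrelevant), `drop_unusable_columns(rows, irrelevant={"survey_mode"})` would remove both, and `rows_with_answers(rows, ["q_long_form"])` would restrict the data to the people who saw the longer survey version.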

More on Data preprocessing (not yet implemented)
- For whatever reason, some of the attributes are redundant and should be condensed if possible.
  Ex. There is a question about the race of the subject where the response options are “Only white,” “Only black,” “Only Asian,” etc., but there is another boolean column in the data that is “Subject answered ‘only white’ in earlier question.”
- Once I have the application completely functional, I may preprocess the data further to try to find trends relating to specific questions.
  Ex. To find specific trends between race and religion, I may run the application on only the attributes that relate to race and religion.
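One possible way to spot redundant attributes like the race example is to check whether one column fully determines another. The sketch below is an assumption about how this could be done, not something already in the application; `is_determined_by` and `redundant_pairs` are hypothetical helper names.

```python
def is_determined_by(rows, source, target):
    """True if every value of `source` always appears with a single value of `target`."""
    seen = {}
    for r in rows:
        s, t = r[source], r[target]
        # setdefault records the first target value seen for this source value
        if seen.setdefault(s, t) != t:
            return False
    return True


def redundant_pairs(rows):
    """List (a, b) column pairs where column b can be derived from column a."""
    columns = list(rows[0].keys()) if rows else []
    return [
        (a, b)
        for a in columns
        for b in columns
        if a != b and is_determined_by(rows, a, b)
    ]
```

A pair such as `("race", "only_white")` in the output would suggest that the boolean column adds no information and could be dropped before clustering. Note that a constant column is trivially "determined" by every other column, so such pairs should be read with that in mind.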

Recall: MDL principle to measure relevance
- Goal: Find the compression model that results in the minimum number of bits needed to represent the data.
- The model will tell us the relevant subspace clusters.
- So, the algorithm frequently checks whether a subspace cluster is relevant by checking whether adding it to the model reduces the coding cost.
- I am having trouble properly calculating coding cost, and consequently, my application cannot tell which subspace clusters are the most relevant.
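To make the "does adding this cluster reduce the coding cost?" check concrete, here is a deliberately simplified sketch. It is not the coding cost defined for ROCAT: it assumes each column is coded with an ideal entropy code, charges one membership flag per row, and uses a crude log2 term for naming the cluster's attributes. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits per value, of a list of categorical values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def data_cost(rows, attributes):
    """Bits needed to code the given columns, each with its own ideal code."""
    return sum(len(rows) * entropy_bits([r[a] for r in rows]) for a in attributes)


def cost_with_cluster(rows, attributes, member_idx, cluster_attrs):
    """Simplified two-part cost when one subspace cluster is coded separately."""
    members = set(member_idx)
    inside = [r for i, r in enumerate(rows) if i in members]
    outside = [r for i, r in enumerate(rows) if i not in members]
    other = [a for a in attributes if a not in cluster_attrs]
    # Data part: cluster cells, the remaining cells of the cluster's
    # attributes, and the untouched attributes over all rows.
    cost = data_cost(inside, cluster_attrs)
    cost += data_cost(outside, cluster_attrs)
    cost += data_cost(rows, other)
    # Model part (assumed, crude): one membership flag per row plus
    # log2(#attributes) bits to name each attribute of the cluster.
    flags = [i in members for i in range(len(rows))]
    cost += len(rows) * entropy_bits(flags)
    cost += len(cluster_attrs) * math.log2(len(attributes))
    return cost


def is_relevant(rows, attributes, member_idx, cluster_attrs):
    """A cluster counts as relevant here if it lowers the total cost."""
    baseline = data_cost(rows, attributes)
    return cost_with_cluster(rows, attributes, member_idx, cluster_attrs) < baseline
```

Under these assumptions a pure cluster costs nothing for its own cells, so a cluster pays off only when the savings on its cells outweigh the model part. That is the trade-off that should separate large, informative clusters from tiny or nearly universal ones.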

Recall: ROCAT Algorithm
Input: Data set D
Output: List of subspace clusters in D
3 phases:
- Searching (bulk of algorithm and runtime; implemented)
- Combining (not yet implemented)
- Reassigning (not yet implemented)
As said earlier, my implementation currently cannot decide on its own which subspace clusters are most relevant. Fortunately, I can, so I can share some results.

Potentially interesting current results
466 people answered 25 questions the same way, from which I could conclude:
- Not white, nor person of color, nor Asian/Pacific Islander, nor Native American, nor Hispanic
- They identified as LGBTQ and cisgender
- Not foreign born, nor parents, not third generation or more
- These people said that their medical professionals neither ignored nor seemed uncomfortable with their sexual identity.
Because of the redundant data mentioned earlier, this specific subspace cluster was basically found twice, but with different questions: saying “Not x” for all races x and saying “only ‘other race’” mean the same thing, but there are attributes for both, and the algorithm cannot recognize that they are the same.

Uninteresting Results
- 4571 people said they were neither Asian/Pacific Islander nor Native American.
- 2 people answered 47 questions the same way.
Because I am having trouble properly calculating coding cost, the application currently cannot recognize that these results are less interesting than the previous ones.

Potential optimizations so far (1)
Recall (search phase): find the best pure subspace cluster
- In a situation like this, it makes sense to calculate the coding cost of each candidate cluster, as each candidate is very different.
- However, I have found that calculating the coding cost is a slow process. It is currently my bottleneck, but I could be implementing it inefficiently.
- Note: attributes are added in order from least entropy to greatest.
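The "least entropy to greatest" ordering of attributes can be sketched as below. This is a minimal illustration assuming rows are dictionaries of categorical values; `attributes_by_entropy` is a hypothetical name, not a function from the application.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits per value, of a list of categorical values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def attributes_by_entropy(rows):
    """Return column names sorted from least to greatest entropy."""
    columns = list(rows[0].keys()) if rows else []
    return sorted(columns, key=lambda c: entropy_bits([r[c] for r in rows]))
```

Low-entropy attributes are the ones where most people gave the same answer, so adding them first tends to keep the largest number of rows in the candidate cluster.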

Potential optimizations so far (1, continued)
Consider this case in the Find Best Pure algorithm:
[Figure: candidate clusters C_1, C_2, ..., C_k, C_n]
- C_1, C_2, ..., C_k all have the same number of rows. C_n has fewer rows.
- Obviously, of C_1, C_2, ..., C_k, C_k is the best subspace cluster, since it covers the same rows with more attributes.
- For any C_i, C_(i+1) that have the same number of rows, we should be able to skip the calculation of the coding cost of C_i, because CC(C_i) >= CC(C_(i+1)).
Return C =
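The skipping idea can be sketched as a filter over the candidates' row counts. This assumes the candidates are listed in the order attributes were added, and it relies on the claim above that CC(C_i) >= CC(C_(i+1)) whenever the two have the same number of rows; the function name is hypothetical.

```python
def candidates_worth_costing(row_counts):
    """Return indices of candidates whose coding cost still needs computing.

    row_counts[i] is the number of rows in candidate i. Within each run of
    equal row counts, only the last candidate is kept.
    """
    keep = []
    for i, rows in enumerate(row_counts):
        is_last = i == len(row_counts) - 1
        if is_last or row_counts[i + 1] != rows:
            keep.append(i)
    return keep
```

With four candidates of sizes 100, 100, 100 and 40, only the third and fourth would be costed, so the number of expensive coding cost calculations drops from four to two.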

Potential optimizations so far (2)
Consider this case in the Find Best Pure algorithm:
[Figure: candidate clusters C_1, C_2, ..., C_n; return C_n]
- A subsection of C_1, C_2, etc. is likely to be found in future iterations. Perhaps we can remember these so we do not have to find them again?
- Overall, I do not think you can assume that subsections of C_1, C_2, etc. will be found to be relevant in the future, because returning C_n changes the value distributions of the attributes.
- However, it may be a good approximation if the application were time sensitive.

What to do next
- Properly calculate coding cost to better discriminate relevance between subspace clusters.
- Implement the rest of the algorithm and look for optimizations.
- Preprocess data to reduce redundancy.
- Preprocess data to find specific trends.
- Try other dataset(s).
- Try running a different algorithm. I do not think I will have time to implement another algorithm myself, but I may be able to find someone else’s application and compare results.

Thank you! Questions?