
Stochastic Unsupervised Learning on Unlabeled Data
July 2, 2011
Presented by Jianjun Xie – CoreLogic
In collaboration with Chuanren Liu, Yong Ge, and Hui Xiong – Rutgers, the State University of New Jersey

Our Story
 “Let’s set up a team to compete in another data mining challenge” – a call with Rutgers
 Is it a competition on data preprocessing?
 Transform the problem into a clustering problem:
 How many clusters are we shooting for?
 Which distance measure works better?
 Go with stochastic K-means clustering

Dataset Recap
 Five real-world data sets were extracted from different domains
 No labels were provided during the unsupervised learning challenge
 The withheld labels are multi-class labels
 Some records can belong to several labels at the same time
 Performance was measured by a global score, defined as the Area Under the Learning Curve (AULC)
 A simple linear classifier (a Hebbian learner) was used to calculate the learning curve
 The score focuses on small numbers of training samples by log2-scaling the x-axis of the learning curve
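The exact scoring protocol belongs to the challenge organizers; the snippet below is only a minimal Python sketch of the idea, assuming scores holds the classifier's performance (e.g., AUC) at each training-set size in n_train:

    import numpy as np

    def area_under_learning_curve(n_train, scores):
        """Approximate AULC: area under the learning curve with a
        log2-scaled x-axis, normalized by the span of that axis."""
        x = np.log2(np.asarray(n_train, dtype=float))
        y = np.asarray(scores, dtype=float)
        # trapezoidal integration over the log2 axis
        area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
        return float(area / (x[-1] - x[0]))

    # e.g., performance measured at 1, 2, 4, ..., 64 training samples
    print(area_under_learning_curve([1, 2, 4, 8, 16, 32, 64],
                                    [0.52, 0.55, 0.61, 0.68, 0.74, 0.79, 0.82]))

Because of the log2 axis, points measured with few training samples carry as much weight as the later ones, which is what rewards a good unsupervised representation.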

Evolution of Our Approaches
 Simple data preprocessing
 Normalization: Z-scale (std = 1, mean = 0)
 TF-IDF on text recognition (TERRY dataset)
 PCA:
 PCA on raw data
 PCA on normalized data
 Normalized PCA vs. non-normalized PCA
 K-means clustering
 Cluster on the top N normalized PCs
 Cosine similarity vs. Euclidean distance
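A minimal sketch of this preprocessing path using scikit-learn; the function name, the choice of n_components, and the final row normalization (which makes Euclidean K-means rank neighbors the same way cosine similarity does) are illustrative assumptions, not the authors' exact code:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def top_normalized_pcs(X, n_components=10):
        # Z-scale each feature to mean 0, std 1 (the "normalization" step)
        X_std = StandardScaler().fit_transform(X)
        # PCA on the normalized data; keep the top N components
        pcs = PCA(n_components=n_components).fit_transform(X_std)
        # unit-normalize each row so Euclidean distance between rows
        # behaves like cosine similarity for clustering
        norms = np.linalg.norm(pcs, axis=1, keepdims=True)
        return pcs / np.maximum(norms, 1e-12)

For the TERRY text dataset, X would first be replaced by its TF-IDF weighting (e.g., sklearn.feature_extraction.text.TfidfTransformer) before this step.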

Stochastic Clustering Process (our final approach)
 Given a data set X, a number of clusters K, and a number of iterations N
 For n = 1, 2, …, N:
 Randomly choose K seeds from X
 Perform K-means clustering and assign each record a cluster membership I_n
 Transform I_n into a binary representation
 Combine the N binary representations together as the final result
 Example of the binary representation of clusters:
 Say the cluster labels are 1, 2, 3
 The binary representations will be (1 0 0), (0 1 0), and (0 0 1)
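A minimal sketch of this stochastic clustering process, assuming scikit-learn's KMeans; the default values of K and N and the use of init="random" here are illustrative choices, not the authors' exact settings:

    import numpy as np
    from sklearn.cluster import KMeans

    def stochastic_clustering(X, K=10, N=50, seed=0):
        rng = np.random.RandomState(seed)
        blocks = []
        for n in range(N):
            # one K-means run from K randomly chosen seeds
            km = KMeans(n_clusters=K, init="random", n_init=1,
                        random_state=rng.randint(2**31 - 1))
            labels = km.fit_predict(X)  # cluster membership I_n
            # binary representation: one indicator column per cluster
            onehot = np.zeros((X.shape[0], K))
            onehot[np.arange(X.shape[0]), labels] = 1.0
            blocks.append(onehot)
        # combine the N binary representations side by side
        return np.hstack(blocks)  # shape: (n_records, K * N)

The concatenated binary matrix is the representation handed to the evaluation: because every run starts from different random seeds, records that repeatedly fall into the same cluster end up with similar rows.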

Results of Our Approaches: Dataset Harry – human action recognition

Results: Dataset Rita – object recognition

Results: Dataset Sylvester – ecology

Results: Dataset Terry – text recognition

Results: Dataset Avicenna – Arabic manuscripts

Summary on Results
 Overall rank: 2nd
 Per-dataset comparison (Winner Valid, Winner Final, Winner Rank vs. Our Valid, Our Final, Our Rank) for Avicenna, Harry, Rita, Sylvester, and Terry [the numeric scores in this table were not preserved in the transcript]

Discussions
 Stochastic clustering can generate better results than PCA in general
 Cosine similarity works better than Euclidean distance
 Normalized data works better than non-normalized data for K-means in general
 The number of clusters (K) is an important factor, but it could be relaxed for this particular competition