Zhang Yanxia China-VO Group 2006.11.30 in Guilin Chinese Virtual Observatory.

Presentation transcript:

Zhang Yanxia China-VO Group in Guilin Chinese Virtual Observatory

11/29-12/03, China-VO 2006, Guilin
Outline: Why, What, How, Example, Challenge, Summary

Astronomy facing the "data avalanche"
[Figure panels: IRAS 25 μm, 2MASS 2 μm, DSS optical, IRAS 100 μm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]
Necessity is the mother of invention: DM & KDD

Issues in Astronomy
- Compression (e.g. galaxy images and spectra)
- Classification (e.g. stars, galaxies, or gamma-ray bursts)
- Reconstruction (e.g. of blurred galaxy images, or of the mass distribution from weak gravitational lensing)
- Feature extraction (e.g. signature features of stars, galaxies, and quasars)
- Parameter estimation (e.g. stellar parameter measurement, photometric redshift prediction, orbital parameters of extrasolar planets, or cosmological parameters)
- Model selection (e.g. are there 0, 1, 2, ... planets around a star, or is a cosmological model with non-zero neutrino mass more favorable?)
Source: Ofer Lahav, 2006, astro-ph/, summary of the 4th meeting on "Statistical Challenges in Modern Astronomy", held at Penn State University in June 2006

Science Requirements for DM (Borne K D, 2001, Proc. of the MPA/ESO/MPE Workshop, 671)
- Cross-Identification: the classical problem of associating the source list in one database with the source list in another.
- Cross-Correlation: the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.
- Nearest-Neighbor Identification: the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.
- Systematic Data Exploration: the application of a broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
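Cross-identification as described above can be illustrated with a brute-force positional match. The following is only a minimal sketch, not the method used by any particular survey: the function names, the 3-arcsecond matching radius, and the toy coordinates are all assumptions made for illustration.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(catalog_a, catalog_b, radius_arcsec=3.0):
    """For each source in catalog_a, find the nearest catalog_b source within the radius.

    Catalogs are lists of (id, ra_deg, dec_deg). This is a brute-force O(N*M) loop;
    the radius is a hypothetical choice, not a value taken from the talk.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in catalog_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        if best_id is not None:
            matches[id_a] = (best_id, best_sep * 3600.0)  # separation in arcsec
    return matches

# Toy catalogs with made-up positions (illustration only).
optical = [("opt1", 150.0000, 2.0000), ("opt2", 150.1000, 2.1000)]
infrared = [("ir1", 150.0003, 2.0002), ("ir2", 151.0000, 3.0000)]
matches = cross_identify(optical, infrared)
```

For large catalogs the double loop would normally be replaced by a spatial index (for example a k-d tree or a sky pixelization); the loop is kept here only because it makes the matching rule explicit.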

KDD: Opportunity and Challenges
- Data rich, knowledge poor (the resource)
- Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
- Competitive pressure
- Data mining technology mature
[Figure label: KDD]

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
- bytes: one never sees the whole data set or puts it in the memory of computers
- What knowledge? How to represent and use it?
- Data mining algorithms?

Benefits of Knowledge Discovery
[Figure labels: Volume, Value, EDP, MIS, DSS, Generate, Rapid Response, Disseminate]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

DM: A KDD Process
Data mining is the core of the knowledge discovery process.
[Figure labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Work at each stage of DM
- DM object
- Data preparation
- Data processing
- Analysis and evaluation

Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: mapping a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Summarization: finding a compact description for a subset of data.
- Dependency modeling: finding a model that describes significant dependencies between variables.
- Deviation and change detection: discovering the most significant changes in the data.

Feature selection
- Filter method
- Wrapper method
- Embedded method
- Feature weighted method

Feature extraction
- PCA
- Factor analysis (principal FA / maximum likelihood FA)
- Projection pursuit
- ICA
- Non-linear PCA/ICA
- Random projection
- Principal curves
- MDS
- LLE
- ISOMAP
- Topological continuous map
- Neural network
- Vector quantization
- Kernel PCA/ICA
- LDA (linear discriminant analysis)
- QDA (quadratic discriminant analysis)
- FDA (Fisher discriminant analysis)
- GDA (generalized discriminant analysis)
- KDDA (kernel direct discriminant analysis)

Classification Methods
- Based on statistical theory: SVMs, ML, LDA, FDA, QDA, KNN
- Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP
- Based on decision trees: REPTree, RandomTree, CART, C5.0, J48, DecisionStump, RandomForest, NBTree, AC2, Cal5, ADTree, KDTree
- Based on decision rules: Decision Table, CN2, ITrule, AQ
- Based on Bayesian theory: Naive Bayes classifier, NBTree
- Based on meta learning: AdaBoost, boosting, bagging
- Based on evolution theory: genetic algorithm
- Based on fuzzy theory: fuzzy set, rough set
- Ensembles of classifiers
[Figure labels: Data, Mining algorithm, patterns]
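The "ensembles of classifiers" entry can be illustrated with the simplest combination rule, unweighted majority voting. This is a minimal sketch under assumptions: the three threshold classifiers, their cut values, and the class labels are hypothetical stand-ins, and bagging or boosting as listed above use more elaborate schemes.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers for sample x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three hypothetical base classifiers on a two-feature sample (illustration only).
classifiers = [
    lambda x: "star" if x[0] < 0.5 else "galaxy",
    lambda x: "star" if x[1] < 0.5 else "galaxy",
    lambda x: "star" if x[0] + x[1] < 1.0 else "galaxy",
]

label_a = majority_vote(classifiers, (0.2, 0.3))  # all three vote "star"
label_b = majority_vote(classifiers, (0.2, 0.9))  # two of three vote "galaxy"
```

With an odd number of voters and two classes there are no ties; for other configurations a tie-breaking rule would have to be chosen.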

Regression Methods
- (Penalized) logistic regression
- Bayesian regression analysis
- Additive regression
- Locally weighted regression
- Voted perceptron network
- Projection pursuit regression
- Recursive partitioning regression
- Alternating conditional expectation
- Stepwise regression
- Recursive least squares
- Fourier transform regression
- Rule-based regression
- Principal component regression
- Instance-based regression
- Multivariate adaptive regression splines
- Regression trees (CART, RETIS, M5, random forest, KDTree)
- Simple windowed regression
- SVM
- NN

Methods to estimate errors
- Train-test
- Cross-validation
- Bootstrap
- Leave-one-out
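Of the error-estimation schemes listed, k-fold cross-validation is easy to sketch, and leave-one-out follows as the special case where k equals the number of samples. The code below is an illustrative sketch only: `k_fold_indices`, `cross_val_error`, and the majority-class "model" are hypothetical names and choices, not part of the talk.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds after shuffling."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_val_error(fit, predict, X, y, k=10, seed=0):
    """Fraction of misclassified samples, each predicted by a model that never saw it."""
    errors = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(X)

# Toy data and a trivial majority-class classifier (illustration only).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

def fit(train_X, train_y):
    return max(set(train_y), key=train_y.count)  # the "model" is the majority label

def predict(model, x):
    return model

error_rate = cross_val_error(fit, predict, X, y, k=5)
loo_error_rate = cross_val_error(fit, predict, X, y, k=len(X))  # leave-one-out
```

Because every training split here still has a majority of class 0, the classifier misses exactly the two class-1 samples in both runs, which gives an error rate of 0.2.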

Evaluation of methods
- Accuracy
- Speed
- Comprehensibility
- Time to learn
- Generalization

Model Selection for Classification
- Accuracy
- G-mean
- F-measure
- ROC (Receiver Operating Characteristic) curve
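The first three criteria can be computed directly from the confusion matrix of a binary classifier. The sketch below assumes one common definition of G-mean (the geometric mean of sensitivity and specificity) and the balanced F-measure (F1); the slide does not say which variants were meant, and the labels are toy values. The ROC curve is not covered here.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, G-mean, and F-measure for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on the negative class
    precision = tp / (tp + fp) if tp + fp else 0.0

    accuracy = (tp + tn) / len(pairs)
    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}

# Toy labels (illustration only): 3 true positives, 1 miss, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

On imbalanced samples, accuracy can look good while the rare class is mostly missed, which is the usual motivation for reporting G-mean or F-measure alongside it.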

Model Selection for Regression
- AIC (Akaike information criterion)
- BIC (Bayesian information criterion)
- SRM (Structural Risk Minimization)
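AIC and BIC both trade goodness of fit against the number of free parameters, and the model with the lower value is preferred. The sketch below uses the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L; the helper for Gaussian residuals and the sample residuals are assumptions added for illustration. SRM is not covered.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood for i.i.d. Gaussian residuals with variance RSS/n."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)

# Toy residuals from a hypothetical two-parameter fit (illustration only).
residuals = [1.0, -1.0, 1.0, -1.0]
log_l = gaussian_log_likelihood(residuals)
aic_value = aic(log_l, n_params=2)
bic_value = bic(log_l, n_params=2, n_samples=len(residuals))
```

BIC penalizes extra parameters more strongly than AIC once the sample has more than about seven records (ln n > 2), so it tends to select simpler models on large data sets.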

Example 1: Lim Tjen-Sien et al., Machine Learning, 40, (2000)
33 algorithms on 16 different samples
- 22 decision trees: CART, S-Plus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1
- 9 statistical methods: LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
- 2 neural networks: LVQ, RBF

Example 1 (continued): Lim Tjen-Sien et al., Machine Learning, 40, (2000)

Example 2

Example 3
Zhao, Y., Zhang, Y., 2006, submitted to COSPAR

Example 3
For NB, ADTree, and MLP, the overall accuracy is 97.5%, 98.5%, and 98.1%, respectively.
Zhang, Y., Zhao, Y., 2006, submitted to ChJAA

Example 4
Using best-forward search, j-h, b-v, and j + 2.5 lg Fpeak are the optimal features selected from the 10 features. Decision Table is applied. 10-fold cross-validation is used for training and testing: %
Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR
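A wrapper-style feature search of the kind used in Example 4 can be sketched as plain greedy forward selection: start from an empty set and repeatedly add the feature that most improves a score, such as cross-validated accuracy. This is an illustration under assumptions, not the search actually used in the example; the feature names, the `toy_gain` values, and the scoring function are hypothetical.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedy forward search: add the best feature until the score stops improving.

    `score` maps a list of feature names to a number (higher is better), for
    example the cross-validated accuracy of a classifier trained on that subset.
    """
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical scoring function: three informative features, two useless ones,
# and a small penalty per feature. All numbers are made up for illustration.
toy_gain = {"f1": 0.30, "f2": 0.25, "f3": 0.20}

def toy_score(subset):
    return 0.2 + sum(toy_gain.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best = forward_selection(["f1", "f2", "f3", "f4", "f5"], toy_score)
```

In practice `toy_score` would be replaced by a function that trains the chosen classifier on the candidate subset and returns its 10-fold cross-validated accuracy, which makes each step of the search considerably more expensive.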

Example 5
k-nearest neighbor classifier
Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese Science
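The k-nearest neighbor rule named in Example 5 assigns a sample the majority label among its k closest training samples. The following is a minimal brute-force sketch, assuming Euclidean distance and k = 3; the two-feature training points and class labels are invented for illustration and are not data from the cited work.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up training set in a two-dimensional feature space (illustration only).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (1.0, 1.1), (1.2, 0.9), (0.9, 1.0)]
train_y = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]

label_near_origin = knn_predict(train_X, train_y, (0.15, 0.10))
label_near_one = knn_predict(train_X, train_y, (1.0, 1.0))
```

Because the method relies on distances, features measured on very different scales are usually normalized first, and k is typically tuned by cross-validation.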

Example 6
Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173

Challenges and Influential Aspects
- Handling of different types of data with different degrees of supervision
- Changing data and knowledge
- Understandability of patterns, various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
- Interactive, visualization knowledge discovery
- Different sources of data (distributed, heterogeneous databases; noisy, missing, and irrelevant data; etc.)
- Massive data sets, high dimensionality (efficiency, scalability)

Summary
- Linear or non-linear
- Gaussian or non-Gaussian
- Continuous or discrete
- Missing or not
- Comparison of the number of attributes with that of records
Choose the appropriate method or ensemble of algorithms according to the task and data characteristics.

Prospect
With the wings of DM, find more, better, or the best knowledge!
Thank you for your attention!
