Mining the MACHO dataset
Markus Hegland, Mathematical Sciences Institute, ANU
Margaret Kahn, ANU Supercomputer Facility

Outline:
- The MACHO project
- Woods data set
- Data exploration and data properties
- Data preprocessing
- Feature sets
- Classification using additive models
- Training process
- Web site

The MACHO Project
- To find evidence of dark matter from the gravitational lensing effect
- Observations at Mt Stromlo
- 10^7 observed stars
- CCD images

Woods Data Set
- 792 stars identified as long-period variables
- Chosen from the full MACHO data set
- Original data processed by SODOPHOT to give red and blue light curves
- Missing data, large errors, unequal sampling

Stars from the Woods data set

Two typical long-period stars

Data Preprocessing
- Data sampling is not uniform, so Fourier transforms cannot be used.
- Periodic stars satisfy f(t+p) = f(t) for some period p.
- Long-period variable stars are not exactly periodic, e.g. f(t) = f(t+p) + g(t) where g is small compared with f.
- Use periodic smoothing to estimate missing data.

Periodic Smoothing
An estimate for f can be determined by minimizing a penalized least-squares functional. The function f is modeled as a piecewise linear function. In practice p is not known, but it can be estimated by a method such as Pisarenko's method. For now the second penalty multiplier is taken much smaller than the first.
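The functional itself is not reproduced in the transcript. A plausible form, consistent with the description above (a data-fit term plus two penalties, the first enforcing approximate p-periodicity and the second smoothness), would be:

```latex
J(f) = \sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2
     + \alpha_1 \int \bigl(f(t+p) - f(t)\bigr)^2 \, dt
     + \alpha_2 \int f'(t)^2 \, dt ,
\qquad \alpha_2 \ll \alpha_1 .
```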

Feature Sets
- Features are calculated to characterize the light curves.
- Magnitudes are observed for both the red and blue frequency ranges. The difference between these is the logarithm of the ratio of the intensities of blue and red light, called the colour index.
- Summary features of the light curves are obtained from the colour and the magnitudes by forming the average (or median) over time, the amplitude of the fluctuations, the average frequency or time scale, and a measure of the time-scale fluctuations.

Features contd.
- Correlation between red and blue magnitudes.
- 9 features are calculated and stored for each light curve.
- These features, NOT the original light curve data, are used as predictor variables for the classifier.
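The transcript does not give the exact feature definitions. A sketch of how such a nine-feature vector might be computed follows; the particular estimators (percentile amplitudes, mean-crossing time scales) are stand-ins for illustration, not the tool's actual definitions:

```python
import numpy as np

def crossing_intervals(t, x):
    """Times between successive mean-crossings of x(t); a crude proxy
    for the light curve's characteristic time scale."""
    s = np.sign(x - x.mean())
    idx = np.flatnonzero(np.diff(s) != 0)
    if idx.size < 2:
        return np.array([t[-1] - t[0]])
    return np.diff(t[idx])

def light_curve_features(t, red, blue):
    """Nine summary features for one star: median, amplitude, time scale
    and time-scale spread for both the magnitude and the colour index,
    plus the red/blue correlation."""
    colour = blue - red  # difference of magnitudes = log of the intensity ratio
    feats = {}
    for name, x in (("mag", red), ("colour", colour)):
        dt = crossing_intervals(t, x)
        feats[name + "_median"] = float(np.median(x))
        feats[name + "_amplitude"] = float(np.percentile(x, 95) - np.percentile(x, 5))
        feats[name + "_timescale"] = float(dt.mean())
        feats[name + "_timescale_spread"] = float(dt.std())
    feats["red_blue_correlation"] = float(np.corrcoef(red, blue)[0, 1])
    return feats  # these nine numbers, not the raw curve, feed the classifier
```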

Classification Using Additive Models
- ANOVA decomposition: Friedman (MARS, 1991), Hastie-Tibshirani (GAM, 1990), Wahba (1990).
- For example, such a function could approximate a classification function deciding to which of two classes (0 or 1) a particular star belongs.
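The decomposition referred to here (its formula is not in the transcript) is presumably the standard ANOVA expansion of a multivariate function; an additive model keeps only the constant and the first-order terms:

```latex
f(x_1, \dots, x_d) = f_0 + \sum_{i} f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j) + \cdots
\qquad \text{(additive model: } f \approx f_0 + \textstyle\sum_i f_i(x_i) \text{)}
```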

Additive Models
In general, an additive model is expressed as a sum of unknown smooth functions that have to be estimated from the data. The model is fitted using a local scoring algorithm, which iteratively fits weighted additive models by a backfitting algorithm. Backfitting is a Gauss-Seidel method that iteratively smooths partial residuals by weighted linear least squares.
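A minimal sketch of the backfitting loop, assuming an unweighted fit for simplicity; `knn_smoother` is a hypothetical plug-in smoother for illustration, not the one used by ADDFIT or GAM:

```python
import numpy as np

def knn_smoother(x, r, k=15):
    """Hypothetical plug-in smoother: average the residual r over the
    k nearest x-values (by rank). Any scatterplot smoother would do."""
    order = np.argsort(x)
    fitted = np.empty_like(r, dtype=float)
    for rank, i in enumerate(order):
        lo = max(0, rank - k // 2)
        hi = min(len(x), rank + k // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted

def backfit(X, y, smoother=knn_smoother, n_iter=20):
    """Gauss-Seidel backfitting for the additive model
    y ~ alpha + sum_j f_j(X[:, j])."""
    n, d = X.shape
    alpha = y.mean()
    F = np.zeros((n, d))                  # current estimates f_j(x_ij)
    for _ in range(n_iter):
        for j in range(d):
            # partial residual: remove everything except the j-th component
            partial = y - alpha - F.sum(axis=1) + F[:, j]
            F[:, j] = smoother(X[:, j], partial)
            F[:, j] -= F[:, j].mean()     # keep each f_j centred
    return alpha, F
```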

Possible basis functions for the approximation space in 1D:
- Indicator functions
- Hat functions
- Hierarchical hat functions
ADDFIT uses 1D basis functions.
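For reference, the standard hat (piecewise-linear) basis function on a uniform grid with spacing h and nodes x_i is:

```latex
\phi_i(x) = \max\Bigl(0,\; 1 - \frac{|x - x_i|}{h}\Bigr), \qquad x_i = x_0 + ih .
```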

Boosting
Boosting is a machine learning procedure that improves the accuracy of a given learning algorithm. The AdaBoost procedure used in this code calls a weak learning procedure several times and maintains a distribution of weights over the training set. Initially all weights are equal; the weights of incorrectly classified examples are then increased so that the weak learner concentrates more on them.
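A minimal sketch of the AdaBoost reweighting scheme for labels in {-1, +1}; `weak_fit` is a placeholder for any weighted weak learner, not the MACHO tool's actual learner:

```python
import numpy as np

def adaboost(X, y, weak_fit, n_rounds=50):
    """AdaBoost for labels y in {-1, +1}. weak_fit(X, y, w) trains a
    weak learner on weighted data and returns a predict function
    mapping X to {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # weights start uniform (equal, not zero)
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = weak_fit(X, y, w)
        pred = h(X)
        err = w[pred != y].sum()
        if err >= 0.5:                 # no better than chance: stop
            break
        a = 0.5 * np.log((1 - err) / max(err, 1e-12))
        learners.append(h)
        alphas.append(a)
        if err == 0:                   # perfect weak learner: keep it and stop
            break
        w = w * np.exp(-a * y * pred)  # misclassified points gain weight
        w /= w.sum()
    return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, learners)))
```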

Training Program
- Start with an initial training set of accepted stars, that is, stars of the type of interest. It is helpful to also have a set of unacceptable stars to guide the trainer.
- Additive models are used to form a classification function from the feature-set data of the initial training set.
- This function is then applied to the full data set and the stars are ranked by their function values.

- The light curves are displayed in decreasing order of function value. Ideally the training-set stars should appear first.
- Further acceptable and unacceptable stars can be chosen by clicking on the relevant button, and a new classification is then carried out.
- Continue the process until satisfied with the star ranking.

Web-Based Data Mining Tool
Software link to the MACHO demo. This software contains Python code to read ASCII star data files, remove any stars with insufficient good data, and then calculate several features from each star. These features are then used by the training program to select groups of like stars. The programs incorporate a data-caching method that keeps the data in binary form for quicker access. The caching software was written by Ole Nielsen and can be downloaded from the ANU data mining web page.
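Ole Nielsen's caching package is not reproduced here; a minimal stand-in illustrating the idea (pickle each result to disk keyed by the call, so repeated runs skip the slow ASCII parsing) might look like:

```python
import hashlib, os, pickle

def cached(func, cache_dir=".cache"):
    """Store each result as a binary pickle keyed by the function name
    and its arguments; subsequent calls load the binary form directly."""
    os.makedirs(cache_dir, exist_ok=True)
    def wrapper(*args):
        key = hashlib.md5(repr((func.__name__, args)).encode()).hexdigest()
        path = os.path.join(cache_dir, key + ".pck")
        if os.path.exists(path):             # cache hit: load binary form
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args)                 # cache miss: compute and store
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

# usage: read_star = cached(read_star)  # wrap the ASCII reader once
```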

Procedure to run:
1. Determine the initial training set. When prompted, enter the star numbers for acceptable stars. Stars 1 and 2 are already entered as a default.
2. When the web browser appears with the 60 top-ranked stars, those already deemed acceptable will have the accept button disabled, and those already rejected will have the reject button disabled.

3. Choose more acceptable or unacceptable stars by clicking on the relevant button. Previous decisions can be changed.
4. After choosing a few stars, click on the continue button to see the next 60 top-ranked stars, or go down to further pages to make more choices.
5. Continue until satisfied with the initial star ranking, then stop by clicking quit or restart.