Brookhaven Science Associates U.S. Department of Energy A Multi-Stage Expert System for Aerosol Classification Statisticians: Raymond Mugno and Wei Zhu.

Presentation transcript:

Brookhaven Science Associates, U.S. Department of Energy
A Multi-Stage Expert System for Aerosol Classification
Statisticians: Raymond Mugno and Wei Zhu
Computer Scientists: Peter Imrich and Klaus Mueller
Environmental Chemists: Dan Imre and Alla Zelenyuck

Our Project
- Working with Environmental Chemists at Brookhaven National Laboratory
- Design an Expert System to classify aerosols
- Existing tools are not accurate enough
- Existing tools are not fast enough

Our Project: An Expert System
- Use statistical techniques to reduce data
- Use visualization techniques to show reduction
- Give experts a tool to view classification
- Experts can change how data is reduced, by moving particles and making “classification rules”

Our Project: The Chemistry
- Aerosols are collected by a SPLAT mass spectrometer
- 5200 voltage readings are collected for each particle
- The 5200 voltage drops are converted to a 450-dimensional vector of intensities
- These vectors are the mass spectra

Mass Spectra
- Each spectrum can be represented as a vector, v = (v1, v2, …, v450)
- The subscript j = 1, 2, …, 450 represents the atomic weight of an element in the particle
- The value of vj represents the amount of that element

Dataset
- Over 1 million mass spectra collected from Houston, Texas
- Filtered to 238,160 actual particles
- Each particle has a unique time stamp, down to the millisecond
- Each particle has a time of flight, from which mass can be determined
- Each particle has a mass spectrum, an ordered array of 450 integer peaks

The Two-Level Classification Scheme
- Level 1 (dimension reduction): Classify large numbers of mass spectra into clusters with very similar particles, i.e. similar mass spectra, using K-means clustering.
- Level 2 (class determination): Guided by chemical experts, combine clusters using binary classifiers. Determine the particle/cluster membership (acid, aromatics, or finer classes).

Clustering Analysis
- Statistical tool used to classify multidimensional entities
- Hierarchical clustering
- Non-hierarchical clustering

K-Means Clustering Analysis
- Start with k seeds (representative entities of the k clusters) and a threshold distance
- For each entity, find the distance to each seed
- If the minimum distance is less than the threshold, add the entity to that cluster
- If the minimum distance is greater than the threshold, the entity becomes the seed of a new cluster

K-Means Clustering Analysis
- After iterating through all the entities, update the seeds
- Iterate through the particles again
- Continue iterating through the particles until none of the entities change clusters or another stopping criterion is met
- Differs from classification because the final number of classes and the classes themselves are not set in advance
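The procedure on the two K-means slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `threshold_kmeans`, the Euclidean default distance, the `max_iter` cap, and the choice to treat a distance exactly equal to the threshold as "close enough" are all assumptions made for the example.

```python
from math import dist
from statistics import mean


def threshold_kmeans(points, seeds, threshold, distance=dist, max_iter=5):
    """Seeded clustering in which far-away points spawn new clusters.

    points    : list of equal-length numeric tuples
    seeds     : initial cluster representatives
    threshold : maximum distance at which a point may join a cluster
    Returns (labels, seeds), where labels[i] is the cluster index of points[i].
    """
    seeds = [list(s) for s in seeds]
    labels = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        for idx, p in enumerate(points):
            dists = [distance(p, s) for s in seeds]
            best = min(range(len(seeds)), key=dists.__getitem__)
            if dists[best] > threshold:
                # Too far from every seed: the point becomes a new seed.
                seeds.append(list(p))
                best = len(seeds) - 1
            if labels[idx] != best:
                labels[idx] = best
                changed = True
        # Update each seed to the average of its current members.
        for k in range(len(seeds)):
            members = [points[i] for i, lab in enumerate(labels) if lab == k]
            if members:
                seeds[k] = [mean(col) for col in zip(*members)]
        if not changed:
            break  # no entity changed cluster during this pass
    return labels, seeds


labels, seeds = threshold_kmeans(
    [(0, 0), (0.1, 0), (5, 5), (5.1, 5)], seeds=[(0, 0)], threshold=1.0
)
print(labels, len(seeds))
```

Because new seeds are only ever appended, cluster indices stay stable across passes, which keeps the "did anything change" check simple.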

First Level Classification: K-Means Clustering
- Start with 25 seeds from the experts, each the average spectrum of a particle class
- For each particle, the distance between its spectrum and each seed’s spectrum is calculated
- If the minimum distance is less than a threshold distance, the particle is put into the corresponding cluster
- If the minimum distance is greater than the threshold distance, the current particle is set as a new seed

Distance Function
- Distance = 1 - r
- r is the Pearson Correlation Coefficient
- We label the seed spectrum as (x1, x2, …, xn) and the particle spectrum as (y1, y2, …, yn), where xi or yi represents the magnitude of the “peak” at location i, and i is the mass-to-charge ratio (i = 13, 14, …, 250)
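As a rough illustration of the distance described on this slide, the sketch below computes 1 - r using the standard Pearson formula. The function name `pearson_distance` and the decision to raise an error for a constant spectrum (where r is undefined) are assumptions, not details from the presentation.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    if len(x) != len(y) or not x:
        raise ValueError("spectra must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: a flat spectrum has no defined correlation.
        raise ValueError("correlation is undefined for a constant spectrum")
    return 1 - sxy / sqrt(sxx * syy)


# Same shape at a different scale gives distance 0; a mirrored shape gives 2.
print(pearson_distance([1, 2, 3], [2, 4, 6]))
print(pearson_distance([1, 2, 3], [3, 2, 1]))
```

The distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), so a threshold such as 0.3 corresponds to requiring r of at least 0.7.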

Notation
- x39 = 20, x41 = 5, x53 = 2
- y26 = 1, y28 = 25, y68 = 2

Why use Correlation Coefficient?
- Spectra with “peaks” of the same proportion at the same locations will have a small distance between them
- Classify similarly shaped spectra together for dimension reduction

First Level Results
- Started with 25 seeds
- Threshold to create a new cluster set to 0.3
- Processed 238,160 particles (5 iterations)
- Finished with 2000 clusters/seeds
- Seeds are updated to be the average of the spectra in the cluster
- Dimension reduction of about 120-fold (238,160 / 2000 ≈ 119)

Cluster Example 1, based on an Organic seed (23)
- 19 major peaks from 27 to 97

Cluster Example 2, based on an Fe seed (13)
- 4 major peaks at 54, 55, 56 and 57

New Cluster - Example 1 (464)
- 6 major peaks at 23, 24, 28, 30, 36, and 39

New Cluster - Example 2 (640)
- 6 major peaks at 53, 54, 55, 56, 57, and 58

New Cluster - Example 3 (574)
- 19 major peaks from 23 to 131

Measuring Within-Cluster Similarity
- Calculated the average distance from each particle to its cluster’s center
- Calculated the standard deviation of the distances to the cluster’s center for each cluster
- Found the particle furthest from the cluster’s center
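The three within-cluster measures listed above could be computed along these lines. This is a sketch under stated assumptions: the function name `cluster_spread` is hypothetical, Euclidean distance stands in for whatever metric is in use, and the population standard deviation is used because the slide does not say which form was calculated.

```python
from math import dist
from statistics import mean, pstdev


def cluster_spread(members, center, distance=dist):
    """Summarize how tightly a cluster's members sit around its center."""
    dists = [distance(m, center) for m in members]
    furthest = max(range(len(dists)), key=dists.__getitem__)
    return {
        "mean_distance": mean(dists),
        "std_distance": pstdev(dists),  # assumption: population form
        "furthest_index": furthest,
        "furthest_distance": dists[furthest],
    }


stats = cluster_spread([(0, 0), (3, 4), (6, 8)], center=(0, 0))
print(stats)
```

Passing a different `distance` callable (for example a correlation-based one) would reuse the same summary for spectra.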

Comments on the First Level Classifier
- Using a distance threshold of 0.3 yielded clusters where the within-cluster similarity level is very high
- Ideal for dimension reduction: instead of working with the 238,160 original spectra, we can work with the 2000 cluster seeds for a second level classification
- A dimension reduction of about 120-fold!

Second Level Classification
- 2000 clusters is too many
- Find clusters that are very similar and combine them
- Find clusters that are very similar, classify them into a general group, and have the chemical experts subdivide the general groups

Second Level Classification: Hierarchical Clustering
- Find the pairwise distance between every pair of entities
- Merge the two “closest” entities
- Repeat the procedure until there is only 1 entity, or until the merging distance threshold is met

Second Level Classification: Hierarchical Clustering
- Single Linkage (distance between closest entities)
- Average Linkage (average distance over all entities)
- Centroid Linkage (distance between average entities)
- Complete Linkage (distance between furthest elements)
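The merge loop and the four linkage rules from the two hierarchical clustering slides can be illustrated with a naive agglomerative sketch. It is written for clarity rather than speed, uses Euclidean distance as a stand-in metric, and the names `linkage_distance`, `agglomerate`, and `stop_distance` are hypothetical.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, method):
    """Distance between two clusters a and b (lists of points)."""
    if method == "centroid":
        centroid_a = [sum(col) / len(a) for col in zip(*a)]
        centroid_b = [sum(col) / len(b) for col in zip(*b)]
        return dist(centroid_a, centroid_b)
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "single":
        return min(pairwise)   # closest pair of elements
    if method == "complete":
        return max(pairwise)   # furthest pair of elements
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="centroid", stop_distance=None):
    """Merge the two closest clusters until one remains or the
    closest pair is further apart than stop_distance."""
    clusters = [[tuple(p)] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        d = linkage_distance(clusters[i], clusters[j], method)
        if stop_distance is not None and d > stop_distance:
            break
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerate([(0,), (1,), (10,), (11,)], "single", stop_distance=2)
print(clusters)
```

The recorded `merges` list holds the merge order and distances, which is the information a dendrogram is drawn from.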

Second Level Classification
- Find clusters that are very similar, classify them into a general group, and have the chemical experts subdivide the general groups
- Using Centroid Linkage
- Each cluster is represented by a seed, the average spectrum of that cluster

Second Level Classification
- Use a Binary Matching metric to group clusters
- From a particle spectrum vector v, create a binary vector w
- If vi > Peak_Threshold of the total of v’s peaks, wi = 1; otherwise wi = 0
- Experts gave Peak_Threshold = 10 to filter out noise
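The thresholding step can be sketched as below. The slide's wording is ambiguous about the units of Peak_Threshold, so this example offers both readings and labels them as assumptions: with `relative=True` the cutoff is taken as a percentage of the spectrum's total intensity, and with `relative=False` it is an absolute intensity. The function name `binarize` and the `relative` flag are hypothetical.

```python
def binarize(spectrum, peak_threshold=10, relative=True):
    """Turn a spectrum into a 0/1 vector marking peaks above a cutoff.

    relative=True  : cutoff is peak_threshold percent of the total intensity
                     (one possible reading of the slide)
    relative=False : cutoff is peak_threshold as an absolute intensity
                     (the other possible reading)
    """
    if relative:
        cutoff = peak_threshold / 100 * sum(spectrum)
    else:
        cutoff = peak_threshold
    return [1 if value > cutoff else 0 for value in spectrum]


spectrum = [12, 50, 5, 133]               # total intensity is 200
print(binarize(spectrum))                  # cutoff 20 under the relative reading
print(binarize(spectrum, relative=False))  # cutoff 10 under the absolute reading
```

Either way, the point of the step is the same: small peaks are treated as noise and only the presence or absence of significant peaks is kept.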

Second Level Classification
- Metric for comparing 2 binary vectors, w and x
- Binary Score = number of peaks that w and x have in common
- Max Peaks = the larger of the number of peaks in w and the number of peaks in x
- Distance = 1 - (Binary Score) / (Max Peaks)
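A small sketch of this metric follows. The function name `binary_distance` is hypothetical, and returning 0 when neither vector has any peaks is an assumption, since the slide does not cover that case.

```python
def binary_distance(w, x):
    """Distance = 1 - (peaks in common) / (larger peak count)."""
    if len(w) != len(x):
        raise ValueError("binary vectors must have the same length")
    binary_score = sum(1 for a, b in zip(w, x) if a == 1 and b == 1)
    max_peaks = max(sum(w), sum(x))
    if max_peaks == 0:
        return 0.0  # assumption: two empty spectra are treated as identical
    return 1 - binary_score / max_peaks


# Two shared peaks, and the larger vector has three peaks: 1 - 2/3.
print(binary_distance([1, 1, 0, 1], [1, 0, 0, 1]))
```

Dividing by the larger peak count means a spectrum whose peaks are a small subset of another's is still considered fairly distant from it.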

Second Level Classification: Circular Dendrogram
- Seeds are located around the circumference
- More similar seeds are merged closer to the outer edge of the circle

Second Level Classification

Second Level Classification
- User can zoom in on different areas of the dendrogram
- From the dendrogram, user can obtain seed spectra
- From the dendrogram, user can obtain cluster information, such as the number of particles

Second Level Classification

Conclusion
- Chemists now have a tool to view the distribution of the data
- Can look at 2000 seeds instead of 238,000 particles
- Gives chemists insight into the distribution of particle classes
- Chemists can give feedback that will improve the metrics

Future Work - Molecule Library

Future Work - Time Series Analysis
- Time series distribution of average (or median) aerosol size
- Time series distribution of atmospheric composition
- Time series models for oxidation rate comparison

Future Work - Comparison of Classifiers
1. Hierarchical clustering (Hinz et al., 1995)
2. Non-hierarchical clustering (Trieger et al., 1995)
3. Fuzzy clustering (Hinz et al., 96/99; Trieger et al., 95)
4. Discriminant analysis (Alsberg et al., 1998)
5. Neural network (Song et al., 1999)
6. Classification tree (Harrington et al., 1989)
Which approach is the best?

Future Work - Spatial Temporal Analysis
- Analyzing the spatial & temporal evolution of airborne particles
- Random field theory
- Interactive classification & analysis via the graphical user interface
The Human-Machine Interface: examine chemical significance of clusters; adjust classifiers for better classification; explore the spatial/temporal trend & model goodness-of-fit graphically...

In the News...
Technology Takes on Terrorism
By Earl Lane, Washington Bureau
Newsday: February 26, 2002