ESR in a Consulting Environment


ESR in a Consulting Environment
Maurizio Sanarico, Chief Data Scientist, SDG Group

The secondments
First secondment:
- Understand the basics
- Implement Mapper in Python (Cartographer)
Second secondment:
- Study the selection of the overlapping parameter
- Assess the merits of fitting local models in correspondence with the nodes of Mapper
- Also work on developing a calendar for multiple time series analysis
Throughout, exposed to our methodology and programming practice

Why Topological Data Analysis for the secondments
- It is powerful, has a strong mathematical foundation, is very general, and is complementary to mainstream machine learning
- It is interpretable
- It can be used for:
  - Exploratory / hypothesis-generating analysis
  - Partitioning the space to apply more focused local models and limit the curse of dimensionality
  - Generating new variables with largely untapped information content (e.g., metrization of persistence landscapes with Lp norms)
  - Characterizing the performance of a predictive model

Topological Data Analysis (TDA)
- An emerging discipline combining algebraic topology, geometry, and statistics
- Aims to extend the constructs of algebraic topology to data clouds
- Proposed for the two secondments at SDG because it is one of the applied research lines we are pursuing, and to introduce new ideas into mainstream machine learning and statistical methods

TDA: the main streams
- Persistent homology: extends homology to data
  - Multidimensional persistent homology
  - Local persistent homology
  - Zig-zag persistent homology
- Mapper: builds a topological network
  - Multiscale Mapper
  - Multinerve Mapper
  - Both address stability problems
- Morse-Smale regression and complex analysis: explore and characterize extrema in gradient fields

TDA: Persistent Homology
- Characterizes the topological content of data
- Finds topological invariants
- Main tools: the persistence diagram (PD), barcodes, persistence landscapes, and persistence images
- Vectorizing the PD generates features that can be used in statistical models, adding information that standard variables do not contain
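
The vectorization idea can be sketched in plain numpy: evaluate a persistence landscape on a grid and take its Lp norm as a scalar feature (the "metrization with Lp norms" mentioned earlier). This is a minimal illustrative sketch; the function names and the two-point toy diagram are assumptions, not from the slides.

```python
import numpy as np

def landscape(diagram, k, ts):
    """Evaluate the k-th persistence landscape lambda_k at grid points ts.
    diagram is an array of (birth, death) pairs."""
    b = diagram[:, 0][:, None]
    d = diagram[:, 1][:, None]
    # Tent function of each (b, d) pair: min(t - b, d - t), clipped at 0.
    tents = np.clip(np.minimum(ts[None, :] - b, d - ts[None, :]), 0.0, None)
    if k > tents.shape[0]:
        return np.zeros_like(ts)
    tents.sort(axis=0)          # ascending along the point axis
    return tents[-k, :]         # k-th largest tent value at each t

def landscape_lp_norm(diagram, k, p=2.0, grid=512):
    """Approximate the Lp norm of lambda_k by a Riemann sum on a grid."""
    ts = np.linspace(diagram.min(), diagram.max(), grid)
    vals = landscape(diagram, k, ts)
    dt = ts[1] - ts[0]
    return (np.sum(np.abs(vals) ** p) * dt) ** (1.0 / p)

# Toy persistence diagram with two features.
pd_pairs = np.array([[0.0, 1.0], [0.2, 0.6]])
feat = [landscape_lp_norm(pd_pairs, k) for k in (1, 2)]
```

The resulting `feat` vector (one Lp norm per landscape level) can be appended to standard covariates in any downstream statistical model.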

TDA: Mapper
- Explores the topological and (local) geometrical structure of data
- Identifies outliers and groups with different shapes, including shapes difficult to discover with other methods (flares, loops, local branches, ...)
- Prepares data for further analysis
- Characterizes the performance of a model (the Fibers of Failure method), showing where to focus improvement actions (e.g., oversampling regions of the data space, or defining a special correction layer in a deep neural network)
- Can also incorporate supervised learning methods

TDA: Focus on Mapper
- Relatively easy to program, but conceptually not so simple
- From a modeling point of view, requires several non-obvious choices
- Very flexible
- Stability may be an issue (addressed theoretically by the multiscale version of Mapper)
- Some loosely related methods, all of which Mapper can embed:
  - Projection pursuit
  - Isomap
  - Locally linear embedding
  - MDS

TDA: Focus on Mapper
- Main operation: map high-dimensional data to a simplicial complex (a combinatorial representation of the data composed of simplices) encoding topological and geometrical information at a specified resolution
- Less sensitive to the choice of metric
- Achieves substantial simplification
- Preserves topological structure

TDA: Focus on Mapper
Theoretical framework: build compressed data representations that are
- Independent of the coordinate frame used
- Invariant under deformation
- Combinatorial representations of continuous multidimensional data / objects

Mapper in practice Practical use requieres selecting: Filter function(s) (also called len(s)) Overlapping parameter Resolution Clustering algorithm

TDA: a bird's-eye view (from M. Piekenbrock, 2018)

Mapper: the choices
- Define a reference map / filter function f: X → Z
- Construct a covering {U_α}, α ∈ A, of Z (A is called the index set)
- Construct the subsets X_α via f⁻¹(U_α)
- Apply a clustering algorithm C to the sets X_α
- Obtain a cover f*(U) of X by taking the path-connected components of each X_α
- Clusters form the nodes / 0-simplices; non-empty intersections form the edges / 1-simplices
- The Mapper is the nerve of this cover: M(U, f) = N(f*(U))
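
The pipeline above (filter, cover, clustering, nerve) can be sketched in pure numpy for a one-dimensional lens. This is an illustrative toy, not the Cartographer implementation mentioned earlier: the interval cover, the threshold-based single-linkage clustering, and all parameter values are assumptions chosen for the example.

```python
import numpy as np
from itertools import combinations

def mapper_graph(X, lens, n_intervals=6, overlap=0.3, eps=0.4):
    """Minimal Mapper sketch: 1-D lens, overlapping interval cover,
    single-linkage clustering at threshold eps, nerve of the cover."""
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / (n_intervals * (1 - overlap) + overlap)
    step = length * (1 - overlap)
    nodes = []                       # each node = set of point indices
    for i in range(n_intervals):
        a = lo + i * step
        idx = np.where((lens >= a) & (lens <= a + length))[0]
        # Cluster the preimage: merge points closer than eps (single linkage).
        clusters = []
        for j in idx:
            hit = [c for c in clusters
                   if any(np.linalg.norm(X[j] - X[m]) < eps for m in c)]
            merged = {j}.union(*hit) if hit else {j}
            clusters = [c for c in clusters if c not in hit] + [merged]
        nodes.extend(clusters)
    # Nerve: connect nodes whose point sets intersect.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Noisy circle: with the x-coordinate as lens, the Mapper graph
# should recover the loop structure.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.03, (300, 2))
nodes, edges = mapper_graph(X, X[:, 0])
```

Middle cover intervals split the preimage into an upper and a lower arc (two nodes each), while the end intervals give one node; the shared points in overlapping intervals produce the edges that close the loop.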

Some further concepts
Mapper provides a compressed description of the shape of a data set (expressed via the codomain of the mapping). Mapper is quite general:
- Any mapping function can be used (or a composition of different functions)
- The cover may be constructed arbitrarily
- Any clustering algorithm may be used
- The resulting graph is often much easier to interpret than, e.g., individual scatter plots of pairwise relationships
- Mapper is typically paired with high-dimensional data, and is generally used to see the 'true' shape or structure of the data
- Mapper is the core algorithm behind the AI company Ayasdi Inc., with applications including:
  - Anti-money laundering
  - Detecting payment fraud
  - Assessing health risks

Some real examples: ticket classification
- Two cases: 1) tickets regarding IT applications; 2) tickets from communication equipment
- Isolation Forest and k-NN used as filter functions
- Color scale: proportion of high-priority tickets in the node, from deep blue = none to dark red = all
- May be used to classify new tickets according to priority
- May also be used to assign an expected time-to-closing for new tickets given their characteristics
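
The k-NN filter function used here can be sketched directly: each point's lens value is its distance to its k-th nearest neighbour, so points in sparse regions (candidate outliers) receive high values. A minimal numpy sketch; the brute-force distance matrix and the planted outlier are illustrative assumptions.

```python
import numpy as np

def knn_distance_lens(X, k=5):
    """Lens assigning each point its distance to its k-th nearest
    neighbour: high values flag sparse / outlying regions."""
    # Full pairwise Euclidean distance matrix (fine for small data sets).
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    dists.sort(axis=1)
    return dists[:, k]      # column 0 is the point itself (distance 0)

rng = np.random.default_rng(1)
cloud = rng.normal(0, 1, (200, 3))
cloud[0] = [10.0, 10.0, 10.0]      # plant one obvious outlier
lens = knn_distance_lens(cloud)
```

Feeding this lens into Mapper concentrates the anomalous tickets in high-valued nodes of the resulting graph.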

Some real examples: prediction error analysis
- Inspired by the "Fibers of Failure" approach of Carlsson et al. (2018)
- Data: backbone tickets in communication equipment
- Filters:
  - k-NN (nearest-neighbour distance, with k = 5)
  - An unsupervised Isolation Forest
  - The individual failure probability of the ticket
  - The difference between the individual probability and the average probability
  - The ratio between the probability and its standard deviation
- Color proportional to the probability of failure; tooltip = actual state (failure / not failure). Deep blue = higher failure probability, dark red = lower failure probability
- The predictive model used was XGBoost
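
Three of the probability-based filters listed above can be derived from a fitted model's predicted probabilities alone. A minimal numpy sketch; the function name and toy scores are illustrative, and the third column follows the slide's "ratio between the probability and its standard deviation" literally.

```python
import numpy as np

def failure_lens_features(proba):
    """Build three Mapper filter functions from predicted failure
    probabilities: the probability itself, its deviation from the
    average probability, and its ratio to the standard deviation."""
    proba = np.asarray(proba, dtype=float)
    mu, sigma = proba.mean(), proba.std()
    return np.c_[proba, proba - mu, proba / sigma]

# Toy predicted probabilities, e.g. from an XGBoost classifier.
scores = failure_lens_features([0.1, 0.2, 0.15, 0.9, 0.05])
```

Each column can then serve as a separate lens when building the Fibers-of-Failure Mapper graph over the model's inputs.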

Other cases at SDG
- Characterization of medical services in a foreign country to detect fraud and anomalies and to define current and best practices
- Use of new features, derived via persistent homology and obtained by metrization using persistence images and persistence landscapes, in predictive models for pharmaceutical manufacturing processes
- A version of Mapper with some enhancements, running in a distributed computing environment, is being developed (still at an early stage)

Next slides
- Analysis of clinical data from Colombian hospitals
- Analysis of outpatient (ambulatory) data from Colombia
- MNIST data classified with a convolutional neural network (Fibers of Failure analysis)