ESR in a Consulting Environment


1 ESR in a Consulting Environment
Maurizio Sanarico, Chief Data Scientist, SDG Group

2 The secondments
First secondment:
• Understand the basics
• Implement Mapper in Python (Cartographer)
Second secondment:
• Study the selection of the overlapping parameter
• Assess the merits of fitting local models at the nodes of Mapper
• Also working on developing a calendar for multiple time series analysis
• Exposed to our methodology and programming practice

3 Why Topological Data Analysis for Secondments
• It is powerful, has a strong mathematical background, and is very general and complementary to mainstream machine learning
• It is interpretable
• It can be used for exploratory / hypothesis-generating analysis
• Space partitioning, to apply more focused local models and limit the curse of dimensionality
• Generate new variables with largely untapped content (e.g. metrization of persistence landscapes with Lp-norms)
• Characterize the performance of a predictive model

4 Topological Data Analysis (TDA)
• An emerging discipline combining algebraic topology, geometry and statistics
• Aims to extend algebraic topology constructs to data clouds
• Proposed for the two secondments at SDG because it is one of the applied research lines we are following, and to introduce some new ideas into mainstream machine learning and statistical methods

5 TDA: the main streams
Persistent homology: extends homology to data
• Multidimensional persistent homology
• Local persistent homology
• Zig-zag persistent homology
Mapper: builds a topological network
• Multiscale Mapper
• Multinerve Mapper
• Both address stability problems
Morse-Smale regression and complex analysis: explore and characterize extrema in gradient fields

6 TDA: Persistent Homology
• Characterize the topological content of data
• Find topological invariants
• Main tools: persistence diagrams (PD), barcodes, persistence landscapes and persistence images
• Vectorization of a PD generates features that can be used in statistical models to add information that standard variables do not contain
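As an illustration of the vectorization idea, the sketch below evaluates a persistence landscape from a persistence diagram and approximates its Lp-norm, which gives one scalar feature per landscape level. It is a minimal sketch, not the method used at SDG: the function names, the grid size and the midpoint-rule integration are assumptions made for this example, and the diagram is assumed to be a non-empty list of finite (birth, death) pairs.

```python
def landscape(diagram, k, t):
    """Value at t of the k-th persistence landscape (k = 1 is the top level).

    Each (birth, death) pair contributes a "tent" max(0, min(t - b, d - t));
    the k-th landscape is the k-th largest tent value at t.
    """
    tents = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True)
    return tents[k - 1] if k <= len(tents) else 0.0


def landscape_lp_norm(diagram, k, p=2, grid=1000):
    """Approximate the Lp-norm of the k-th landscape with a midpoint Riemann sum."""
    lo = min(b for b, _ in diagram)
    hi = max(d for _, d in diagram)
    h = (hi - lo) / grid  # width of one integration cell
    total = sum(landscape(diagram, k, lo + (i + 0.5) * h) ** p for i in range(grid))
    return (total * h) ** (1.0 / p)


# Hypothetical toy diagram with two overlapping features.
toy_diagram = [(0.0, 2.0), (1.0, 3.0)]
features = [landscape_lp_norm(toy_diagram, k, p=2) for k in (1, 2)]
```

The resulting `features` list could be appended to the standard variables of a predictive model; how many levels k and which p to use are modeling choices.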

7 TDA: Mapper
• Explore the topological and (local) geometrical structure of data
• Identify outliers and groups with different shapes (including shapes that are difficult to discover with other methods, such as flares, loops and local branches)
• Prepare data for further analysis
• Characterize the performance of a model (Fibers of Failure method), showing where to focus improvement actions (e.g., oversampling regions of the data space, or defining a special layer in a deep neural network as a correction layer)
• Can also incorporate supervised learning methods

8 TDA: Focus on Mapper
• Relatively easy to program
• Conceptually not so simple
• From a modeling point of view, it requires several non-obvious choices
• Very flexible
• Stability may be an issue (theoretical solution through the multiscale version of Mapper)
• Some loosely related methods (Mapper can embed all of them): projection pursuit, Isomap, locally linear embedding, MDS

9 TDA: Focus on Mapper
Main operation: from high-dimensional data to a simplicial complex (a combinatorial representation of data obtained by composing a set of simplices) representing topological and geometrical information at a specified resolution
• Less sensitive to the metric
• Achieves substantial simplification
• Preserves topological structure

10 TDA: Focus on Mapper
Theoretical framework: build compressed data representations that are
• Independent of the coordinate frames used
• Deformation invariant
• A combinatorial representation of continuous multidimensional data / objects

11 Mapper in practice
Practical use requires selecting:
• Filter function(s) (also called lens(es))
• Overlapping parameter
• Resolution
• Clustering algorithm
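To make the roles of the resolution and the overlapping parameter concrete, here is a minimal sketch of one common way to cover the range of a one-dimensional filter with equal-length overlapping intervals. The function name `interval_cover` and the convention that `overlap` is the fraction of an interval shared with its neighbour are assumptions for this example; other Mapper implementations may parameterize the cover differently.

```python
def interval_cover(lo, hi, resolution, overlap):
    """Return `resolution` equal-length intervals covering [lo, hi].

    `overlap` (0 <= overlap < 1) is the fraction of each interval's length
    that it shares with the next interval.
    """
    if resolution < 1 or not 0 <= overlap < 1:
        raise ValueError("need resolution >= 1 and 0 <= overlap < 1")
    # The interval length L solves: L + (resolution - 1) * L * (1 - overlap) = hi - lo
    length = (hi - lo) / (1 + (resolution - 1) * (1 - overlap))
    step = length * (1 - overlap)  # distance between consecutive left endpoints
    return [(lo + i * step, lo + i * step + length) for i in range(resolution)]


# Hypothetical filter range [0, 10], 4 intervals, 50% overlap.
cover = interval_cover(0.0, 10.0, resolution=4, overlap=0.5)
```

A higher resolution gives more, smaller intervals (a finer graph); a larger overlap makes neighbouring intervals share more points, which tends to create more edges.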

12 TDA: A bird's-eye view (from M. Piekenbrock, 2018)

13 Mapper: the choices
• Define a reference map / filter function $f: X \to Z$
• Construct a covering $\mathcal{U} = \{U_\alpha\}_{\alpha \in A}$ of $Z$ ($A$ is called the index set)
• Construct the subsets $X_\alpha = f^{-1}(U_\alpha)$
• Apply a clustering algorithm $C$ to each set $X_\alpha$
• Obtain a cover $f^*(\mathcal{U})$ of $X$ by considering the path-connected components of the $X_\alpha$ (in practice, the clusters stand in for these components)
• Clusters form the nodes / 0-simplices
• Non-empty intersections form the edges / 1-simplices
• The Mapper is the nerve of this cover: $M(\mathcal{U}, f) = N(f^*(\mathcal{U}))$

14 Some further concepts
Mapper provides a compressed description of the shape of a data set (expressed via the codomain of the mapping)
Mapper is quite general:
• Any mapping function can be used (or a composition of different functions)
• The cover may be constructed arbitrarily
• Any clustering algorithm may be used
• The resulting graph is often much easier to interpret than, e.g., individual scatter plots of pairwise relationships
• Mapper is often paired with high-dimensional data, and is generally used to see the ‘true’ shape or structure of the data
• Mapper is the core algorithm behind the AI company Ayasdi Inc., with applications such as anti-money laundering, detecting payment fraud and assessing health risks

15 Some real examples: Ticket Classification
• Two cases: 1) tickets regarding IT applications, 2) tickets from communication equipment
• Isolation Forest and k-NN used as filter functions
• Color scale: proportion of tickets with a high-priority state in the node, from deep blue = none to dark red = all
• May be used to classify new tickets according to priority
• May also be used to assign an expected time-to-closing to new tickets given their characteristics
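The node coloring described on this slide amounts to a per-node proportion. The sketch below shows that computation under assumed data structures: `nodes` is a list of sets of ticket indices (as a Mapper graph would provide) and `is_high_priority` is a list of booleans per ticket. Both names and the toy values are hypothetical.

```python
def node_priority_share(nodes, is_high_priority):
    """Fraction of high-priority tickets in each (non-empty) Mapper node."""
    return [sum(is_high_priority[i] for i in node) / len(node) for node in nodes]


# Hypothetical toy input: two overlapping nodes over four tickets.
toy_nodes = [{0, 1, 2}, {2, 3}]
toy_flags = [True, False, False, True]
shares = node_priority_share(toy_nodes, toy_flags)
```

A share of 0 would map to deep blue and a share of 1 to dark red on the scale described above.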

16

17 Some real examples: Prediction error analysis
Inspired by the «Fibers of Failure» approach of Carlsson et al. (2018)
Data: backbone tickets from communication equipment
Filters:
• k-NN (nearest-neighbor distance, with k=5)
• An unsupervised Isolation Forest
• The individual probability of each ticket
• The difference between the individual probability and the average probability
• The ratio between the probability and its standard deviation
Color is proportional to the probability of failure; tooltip = actual state (failure / not failure). Deep blue = higher failure probability, dark red = lower failure probability
The predictive model used was XGBoost
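Some of the filters listed above are simple to compute directly. The following sketch covers the k-NN distance and the two probability-based lenses; it is an illustration under stated assumptions, not the code used in the project. In particular, the slide does not say which standard deviation the ratio uses, so the population standard deviation of the predicted probabilities across all tickets is assumed here, and all function names are hypothetical.

```python
import math
import statistics


def knn_distance(points, k=5):
    """Distance from each point to its k-th nearest neighbour (itself excluded).

    Brute force, O(n^2); requires more than k points.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def probability_lenses(probs):
    """Return (difference from the mean, ratio to the standard deviation).

    Assumption: the standard deviation is taken over all predicted probabilities.
    """
    mean = statistics.fmean(probs)
    sd = statistics.pstdev(probs)
    if sd == 0:
        raise ValueError("all probabilities are equal; ratio lens undefined")
    return [p - mean for p in probs], [p / sd for p in probs]


# Hypothetical toy data: six points on a line plus one far-away point.
toy_points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (50.0,)]
knn = knn_distance(toy_points, k=5)
diff, ratio = probability_lenses([0.2, 0.4, 0.6, 0.8])
```

Large k-NN distances flag isolated observations, so this lens plays a role similar to an anomaly score.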

18

19 Other cases at SDG
• Characterization of medical services in a foreign country to discover fraud and anomalies, and to define current and best practices
• Use of new features, derived from persistent homology through metrization of persistence images and persistence landscapes, in predictive models applied to pharmaceutical manufacturing processes
• A version of Mapper with some enhancements, working in a distributed computing environment, is being developed (still in early stages)

20 Next slides
• Analysis of clinical data from Colombian hospitals
• Analysis of ambulatory (outpatient) data from Colombia
• MNIST data classified with a convolutional neural network (Fibers of Failure analysis)

21

22

23

