First Principle Data Analysis Database Use and Design – Spring 2016 © Philippe Bonnet 2014.


1 First Principle Data Analysis Database Use and Design – Spring 2016 © Philippe Bonnet 2014

2 You are an Analyst
You are tasked with some form of data analysis.
Simplistic:
– What is the purchasing behavior of customers of different ages? You have access to the history of purchase transactions for a large collection of customers.
More realistic:
– As the new manager of a sports club, what kind of insight can I get about the members? You have access to all internal data in the club and to open data related to competitions (local and international).
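As a rough illustration of the simplistic question, the sketch below groups purchase amounts by age band and averages them. The data, the 20-year band width, and the function names are all made up for the example; a real analysis would start from the actual transaction history.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase history: (customer_age, purchase_amount).
purchases = [(19, 12.0), (23, 30.0), (37, 55.0), (41, 45.0), (68, 20.0), (72, 40.0)]


def age_band(age, width=20):
    """Map an age to a band label such as '20-39' (band width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def average_spend_by_age_band(rows, width=20):
    """Group purchase amounts by age band and return the mean amount per band."""
    groups = defaultdict(list)
    for age, amount in rows:
        groups[age_band(age, width)].append(amount)
    return {band: mean(amounts) for band, amounts in groups.items()}


summary = average_spend_by_age_band(purchases)
for band in sorted(summary):
    print(band, summary[band])
```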

3 You are an Analyst
The first step is to explore the data at your disposal:
– What data do you have?
Inspecting the contents of the data set
– Establishing what is there
– Finding irregularities (for cleaning)
Characterizing data sets
– What is missing?
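One possible way to look for irregularities before cleaning is to scan each record for missing, non-numeric, or impossible values. The sketch below assumes hypothetical records with "id", "age" and "amount" fields stored as strings; the three checks are examples, not an exhaustive list.

```python
# Hypothetical raw records, as they might arrive from a CSV export.
records = [
    {"id": "1", "age": "34", "amount": "19.90"},
    {"id": "2", "age": "", "amount": "5.00"},     # missing age
    {"id": "3", "age": "-4", "amount": "12.50"},  # impossible age
    {"id": "4", "age": "51", "amount": "n/a"},    # non-numeric amount
]


def find_irregularities(rows, fields=("age", "amount")):
    """Return (record id, field, problem) triples for suspicious values."""
    problems = []
    for row in rows:
        for field in fields:
            raw = row.get(field, "").strip()
            if raw == "":
                problems.append((row["id"], field, "missing"))
                continue
            try:
                value = float(raw)
            except ValueError:
                problems.append((row["id"], field, "not numeric"))
                continue
            if value < 0:
                problems.append((row["id"], field, "negative"))
    return problems


issues = find_irregularities(records)
for issue in issues:
    print(issue)
```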

4 Data Exploration
“If we need a short suggestion of what exploratory data analysis is, I would suggest that:
- It is an attitude AND
- A flexibility AND
- Some graph paper (or transparencies, or both).
No catalogue of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper - and transparencies - are there, not as a technique, but rather as recognition that the picture-examining eye is the best finder we have of the wholly unanticipated.”
John Tukey, 1961

5 Visual Analytics
Information density depends on display size:
– < 1M: Mobile devices
– 10M: Desktop
– 100+M: Large displays
Principles:
– Dynamic interfaces: Overview, zoom & filter, details-on-demand

6 Visual Analytics
1. Check out Ben Shneiderman
2. Check out xkcd!
– My favourite is Money
– more examples

7 InnoDB Example

8 InnoDB Example

9 Multi-Variable
Spotfire / Tableau

10 Multi-Variable
World Bank Climate Variability Tool

11 Spatio-Temporal
Map-D (+ gnip)

12 Spatio-Temporal
Energy Consumption at Berkeley

13 Tree
TreeMap

14 Network
NodeXL

15 Text
TagClouds, Word Tree, Phrase Net: Many Eyes
https://xkcd.com/657/

16 How to characterize a data set?
Cardinality
Dimensions (attribute names, types)
Sample(s)
Summaries:
– For each dimension: number of distinct values, number of missing values, distribution, statistics, e.g.:
Central tendencies: mean / median / mode
– Depending on the distribution: symmetric / skewed / bimodal
Quantifying heterogeneity: range, inter-quantile range
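The per-dimension summaries listed on this slide can be sketched with Python's standard library. The sample values and the `characterize` helper are hypothetical, and `None` is assumed to mark a missing value. The interquartile range (third quartile minus first quartile) is used here as one common inter-quantile range.

```python
from collections import Counter
from statistics import mean, median, multimode, quantiles


def characterize(values):
    """Summarize one dimension of a data set; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # the three quartile cut points
    return {
        "cardinality": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "distribution": Counter(present),  # value -> frequency
        "mean": mean(present),
        "median": median(present),
        "modes": multimode(present),
        "range": max(present) - min(present),
        "iqr": q3 - q1,
    }


# Hypothetical "age" dimension with two missing values and one outlier.
ages = [21, 25, 25, 30, None, 34, 41, 25, None, 90]
summary = characterize(ages)
for name, value in summary.items():
    print(name, value)
```

On this skewed sample the mean (36.375) sits well above the median (27.5), which is why the choice of central tendency depends on the distribution.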

17 Example - SNAP

18 You are an Analyst
You went through the data exploration phase.
You should now mine insight from the data (input):
– output: Descriptive (patterns) / Predictive (model) / Prescriptive (actions)
– method: Supervised / Unsupervised
You should define/answer questions:
– Regression: defining relationship among objects
– Clustering: finding groups of similar objects
– Classification: to which group does an object belong
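As a small example of the regression case (a predictive output obtained with a supervised method), the sketch below fits a straight line by ordinary least squares. The choice of a simple linear model, the toy age/spend data, and the `fit_line` name are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: customer age versus average spend.
ages = [20, 30, 40, 50]
spend = [25.0, 35.0, 45.0, 55.0]

slope, intercept = fit_line(ages, spend)
predicted = slope * 60 + intercept  # model-based prediction for age 60
print(slope, intercept, predicted)
```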

19 Algorithms
Input: data set
Output:
– set of clusters (extension, intension)
– decision tree
– mathematical model
– set of rules
Types of algorithms
– Regression, clustering, classification
– Association rules
– Sequence analysis
In the context of SQL Server 2014 Analysis Services
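To make the "set of rules" output more concrete, here is a minimal sketch of how a single association rule might be scored. Support and confidence are the standard measures for such rules, although the slides do not define them; the basket data and function names are hypothetical, and this is not how SQL Server Analysis Services implements its algorithm.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```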

20 Reference Algorithm #1 – k-means
Clustering algorithm:
– Initialization: pick k (arbitrary) points in the data set as centroids for k clusters.
– Loop until convergence:
step 1: assign each data point to the cluster whose centroid is closest (based on Euclidean distance)
step 2: for each cluster, recompute the centroid based on the assignment from step 1
k-means converges to a (local) optimum. There is no guarantee of reaching a global optimum.
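The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the seeded random initialization, the iteration cap, the rule of leaving an empty cluster's centroid unchanged, and the sample points are all assumptions. Convergence is taken to mean that the assignment no longer changes.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k arbitrary data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to the closest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

Because the result depends on the initial centroids, running the function with several different `seed` values and keeping the best clustering is a common way to reduce the risk of a poor local optimum.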

21 Take-away Points
Data exploration = visual analytics and data set characterization
– Visual Analytics:
Overview, zoom & filter, details-on-demand
Be creative
– Characterizing a data set:
Dimensions, cardinality
Central tendencies
Heterogeneity
Mining insights:
– Descriptive / Predictive / Prescriptive
– Three main approaches: Regression / Clustering / Classification

