Copyright © 2012, SAS Institute Inc. All rights reserved.
ANALYTICS IN THE BIG DATA ERA
ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNTS OF DATA
MAURIZIO SALUSTI
SAS
NEW QUESTIONS WITH BIG DATA
New ways are needed to manage data that are distributed and not structured in the classical way: we need a different paradigm to organize data and, above all, to query them.
Collecting several sources and managing them together opens several new problems:
Relational data (graph data) can be useful for understanding how events spread in a population.
Data in motion coming from devices in the field (sensors) provide dynamic patterns, often without a history of their form.
Data are not always in a structured data model.
Often we need to join data that do not share the same keys.
Often data arrive as a periodic flow in real time.
Often we need to recognize patterns in data that change frequently.
ANALYSIS
SQL queries are often of little use for reaching these data:
Information is not organized into database structures.
Data provide information in very different ways: for example, text is not easy to query with traditional query languages.
Merging is driven by fuzzy keys, where records are assigned to groups according to a statistical relationship.
Events can be driven by relationships with other data rather than by a specific behavior.
Sampling cannot always be applied to extract data.
Data cannot always be joined to define an ABT.
Often you need to know how the environment can influence changes in events.
Often we need to merge information collected in different time windows.
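The idea of merging on fuzzy keys can be sketched as follows. This is only a minimal illustration, not the method used in the presentation: it assumes string keys and uses the similarity ratio from Python's standard difflib module as a stand-in for the "statistical relationship". The function names, the sample records and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two keys, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left key with its most similar right key, if similar enough.

    The threshold is an illustrative assumption, not a recommended value.
    """
    matches = {}
    for lkey in left:
        best = max(right, key=lambda rkey: similarity(lkey, rkey))
        if similarity(lkey, best) >= threshold:
            matches[lkey] = best
    return matches


# Hypothetical records from two sources whose keys do not match exactly.
customers = {"Acme Corp.": 120, "Globex Inc": 75, "Initech": 30}
suppliers = {"ACME Corp": "IT", "Globex Inc.": "US", "Umbrella": "UK"}

pairs = fuzzy_join(customers, suppliers)
for left_key, right_key in pairs.items():
    print(left_key, "<->", right_key)
```

Keys with no sufficiently similar counterpart (here "Initech") are left unmatched rather than forced into a group.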
BIG DATA ALSO REQUIRES SEVERAL METHODOLOGICAL STRATEGIES:
Methods for pattern recognition coming from statistical inference, using the SEMMA paradigm for supervised and unsupervised data patterns.
Other methods coming from stochastic process analysis, for both continuous time and discrete events, such as diffusion processes or Markov chains.
Time series forecasting: stochastic processes in continuous time with continuous space.
Multivariate analysis applied to semantic rules to discover text patterns.
Graph analysis.
SAS PROCEDURES
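As a small illustration of the discrete-event case, the transition probabilities of a Markov chain can be estimated by counting observed transitions. The sketch below assumes a first-order chain and a single observed sequence; the state names and the sample sequence are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate first-order transition probabilities from one state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        state: {target: n / sum(row.values()) for target, n in row.items()}
        for state, row in counts.items()
    }


# Hypothetical sequence of device states observed over time.
observed = ["ok", "ok", "warn", "ok", "ok", "warn", "fail", "ok"]
probs = transition_probabilities(observed)
for state, row in probs.items():
    print(state, row)
```

Each row of the result sums to 1 and gives the estimated probability of moving from one state to the next.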
ANALYTICAL CATEGORIES AND TARGET USAGE
Text Mining: Parsing large-scale text collections; Extract entities; Auto. stemming & synonym detection
Data Mining: Complex relationships; Tree-based classification; Variable selection
Optimization: Local search optimization; Large-scale linear & mixed integer problems; Graph theory
Econometrics: Probability of events; Severity of random events
Forecasting: Large-scale, multiple hierarchy problems
Statistics: Binary target & continuous no. predictions; Linear, Non-Linear, & Mixed Linear modeling
Data coming from different sources can be tied together using different methods, such as linear or nonlinear canonical decomposition.
Pattern variability in data in motion, such as data coming from devices, can be sampled, or the pattern distribution can be simulated.
Sparse vector data with missing values can be simulated using particular regression methods.
Discrete choice among different events can be modeled using multinomial discrete models.
Automatic time series forecasting can consider many series at the same time.
GRAPH ANALYSIS
Graph analysis can be used to:
Measure the importance of nodes and the relationships among them.
Measure changes in a network over time.
Identify how events spread through the network using a particular diffusion process.
(Diagram labels: Network, Node, Link)
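Two of these uses can be sketched on a toy network. The example below is only an assumption-laden illustration: it measures node importance with degree centrality (one of many possible measures) and models spreading with a deliberately simple deterministic rule in which every neighbor of an affected node becomes affected at the next step. The graph, the function names and the spreading rule are hypothetical.

```python
def degree_centrality(graph):
    """Share of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def spread(graph, seeds, steps):
    """Deterministic spreading: all neighbors of affected nodes become affected."""
    active = set(seeds)
    history = [set(active)]
    for _ in range(steps):
        newly = {nbr for node in active for nbr in graph[node]} - active
        active |= newly
        history.append(set(active))
    return history


# Hypothetical undirected network stored as an adjacency mapping.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

centrality = degree_centrality(graph)
history = spread(graph, {"A"}, steps=3)
print(centrality)
for step, affected in enumerate(history):
    print(step, sorted(affected))
```

Comparing the affected sets step by step gives a rough picture of how quickly an event starting at one node reaches the rest of the network.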
Scenario
REAL TIME MONITORING SYSTEM:
Building and managing the behavioral patterns of the measures for each type of sensor, to detect abnormal processes through alarm rules (offline process).
Building scenarios of how events spread and influence different parts of the system.
Monitoring measures to detect anomalies and to check the validity of the rules over time (online process).
Producing models to predict abnormalities in the medium term.
Scenario
INTEGRATED PROCESS CONTROL:
Shewhart-type control charts, with identification of the role of the history of the measures and of the trend-cycle components according to the Box-Jenkins methodology.
Multivariate analysis of processes: this is the main tool for statistical process control of measures in relation to each other, considering Markov chain or diffusion processes.
Classification of system components: the machines can be classified according to their behavior and some information about their specific characteristics.
Identifying alarm patterns: diagnostic threshold rules identified by the control charts to minimize false alarms, depending on the history of the event to be monitored in real time.
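The basic idea behind a Shewhart-type chart can be sketched in a few lines. This is a simplified illustration only: it assumes independent measures, estimates the limits as the baseline mean plus or minus three sample standard deviations, and ignores the trend-cycle and multivariate aspects mentioned above. The function names and the sample values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control data."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, limits):
    """Indices of the values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical historical measures from a process assumed to be in control.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(baseline)

# Hypothetical new measures checked against the historical limits.
alarms = out_of_control([10.0, 10.3, 11.5, 9.9], limits)
print(limits, alarms)
```

Widening or narrowing `k` trades missed anomalies against false alarms, which is the balance the alarm rules are meant to manage.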
ADMINISTRATION SYSTEM: EXAMPLE
Historical process data storage
Measures metadata and classification
Event process threshold management for the alert process
Extraction rules DABT
System interface
Pattern recognition and event handling module
REAL TIME MONITORING SYSTEM: EXAMPLE
Real-time modelling.
Alert rules and pattern thresholds module for real-time checks.
Data streaming analysis and update of historical data.
Real-time feedback
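A streaming check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture of the system shown: it keeps a running mean and variance (Welford's online algorithm) as the "historical data", raises an alert when a reading lies more than k standard deviations from the running mean, and updates the history only with readings that did not raise an alert. The class name, parameters and sample readings are hypothetical.

```python
import math


class StreamMonitor:
    """Online alert rule based on a running mean and standard deviation."""

    def __init__(self, k=3.0, warmup=5):
        self.k = k            # alert threshold in standard deviations
        self.warmup = warmup  # readings collected before any alert is raised
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations

    def _update(self, x):
        """Welford update of the running statistics with one reading."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def check(self, x):
        """Return True if x triggers an alert; otherwise add it to the history."""
        alarm = False
        if self.n >= self.warmup:
            sigma = math.sqrt(self.m2 / (self.n - 1))
            alarm = abs(x - self.mean) > self.k * sigma
        if not alarm:
            self._update(x)
        return alarm


# Hypothetical sensor readings arriving one at a time.
monitor = StreamMonitor()
readings = [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 35.0, 20.3]
alarms = [monitor.check(x) for x in readings]
print(alarms)
```

Excluding alerted readings from the history keeps a single anomaly from inflating the thresholds, at the cost of adapting more slowly if the process genuinely shifts.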