June 2013 BIG DATA SCIENCE: A PATH FORWARD

CONFIDENTIAL | 2
linkedin.com/in/danmallinger/ @danmallinger www.thinkbiganalytics.com
- Data Science Lead @ Think Big
- Product/Brand Obsessive
- Teacher
- Occasional Engineer

CONFIDENTIAL | 3 TODAY
A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

CONFIDENTIAL | 4 INFRASTRUCTURE, TALENT, & CAPABILITIES
- Understand our organizational needs for data science
- Infrastructure: Technological tools and platforms.
- Talent: Staff hired and trained.
- Capabilities: Data science techniques utilized.
[Figure: skills map covering Hadoop, NoSQL, Analytics, SQL/MPP, Real Time, Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math, Visualization, Clustering, Categorization, Continuous Models, Text Analysis]

CONFIDENTIAL | 5 ANALYTICS TOOLS
- Boxed Solutions: Mahout & Platform
- Toolkits: RHadoop, scikit-learn, etc.
- You will need toolkits to solve unique problems
- but smart techniques make that easier.
- Boxed solutions are limited
- but can be a good source of early velocity.

CONFIDENTIAL | 6 DATA
Presenter:
- Gigabytes from Stack Overflow
- Questions from users
- With metadata
- Users have reputations
- Questions open or closed
Audience:
- Follow along
- Thinking about your data
- To learn in a
- Familiar context and
- Plan

CONFIDENTIAL | 7 STEP 1: EXPLORE
Patterns through Hive:
    select count(1) as total,
           sum(has_code),
           avg(body_count),
           stddev_samp(body_count),
           corr(reputation, owner_questions),
           histogram_numeric(body_count, 10)
    from questions;
Patterns through Tableau [figure]
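A small sketch of the same summaries in plain Python can help sanity-check the Hive output on a sample pulled to a laptop. The rows below are invented for illustration; only the column names come from the query. The histogram is left out because `histogram_numeric` builds approximate bins that this sketch does not try to reproduce.

```python
import math
from statistics import mean, stdev

# Hypothetical sample rows; only the column names come from the Hive query.
questions = [
    {"has_code": 1, "body_count": 100, "reputation": 10, "owner_questions": 1},
    {"has_code": 0, "body_count": 200, "reputation": 20, "owner_questions": 3},
    {"has_code": 1, "body_count": 300, "reputation": 30, "owner_questions": 2},
    {"has_code": 1, "body_count": 400, "reputation": 40, "owner_questions": 5},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(rows):
    """Compute the non-histogram aggregates from the exploration query."""
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "with_code": sum(r["has_code"] for r in rows),
        "avg_body": mean(body),
        "sd_body": stdev(body),  # sample standard deviation
        "corr": pearson(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
    }


summary = summarize(questions)
```

If the numbers from a sample like this disagree sharply with the cluster-wide query, that is usually a sign of a sampling or parsing problem worth chasing before modeling.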

CONFIDENTIAL | 8 STEP 2: FEATURE BUILDING
- Summaries of unstructured data
- Time-since metrics
    select transform(…) using 'python …'
- Clustering: Browsing cohorts
    /bin/mahout canopy
Panel labels: SQL Windowing; Cross-Record Features
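As an illustration of the kind of script a Hive `transform` clause can call, here is a minimal sketch that turns a question body into two summary features. Hive streams rows to the script as tab-separated lines and reads tab-separated lines back. The input layout (id, then body), the `<code>` tag test, and the word-count definition of `body_count` are assumptions made for this example, not details from the deck.

```python
def body_features(body):
    """Summarize unstructured text as (has_code, body_count).

    Assumes code is marked with a <code> tag and that body_count
    is a whitespace-delimited word count; both are illustrative choices.
    """
    has_code = 1 if "<code>" in body else 0
    body_count = len(body.split())
    return has_code, body_count


def transform(lines):
    """Map tab-separated 'id<TAB>body' lines to 'id<TAB>has_code<TAB>body_count'."""
    for line in lines:
        question_id, body = line.rstrip("\n").split("\t", 1)
        has_code, body_count = body_features(body)
        yield f"{question_id}\t{has_code}\t{body_count}"


# Hypothetical input rows standing in for what Hive would stream to the script.
sample = [
    "1\tHow do I sort a list?\n",
    "2\tThis fails: <code>x = 1</code>\n",
]
rows = list(transform(sample))
```

Run as a streaming script, the same `transform` function would be fed `sys.stdin` and its results printed one per line.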

CONFIDENTIAL | 9 PARALLEL MODELS IN HADOOP
Sample (don’t parallelize)
Naturally parallel
- SVD
- Random Forests
Estimators and Ensembles
- Bootstrapping
- Localizing
Advanced Parallelization
- Linear models with SGD
- Neural networks

CONFIDENTIAL | 10 STEP 3: STRUCTURED MODEL (BAGGING)
- Single R model
- run many times
- over samples
- and aggregated
    m <- C5.0(status ~ …)
Bagging a Model
- Mapper 1: Define n reducer keys. Send any record to reducer i with probability p.
- Reducer 1: Key: Id of sample. Value: List of records. Perform analysis over records.
- Reducer 2: Key: One. Value: List of models. Aggregate the models (e.g. average).
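The mapper/reducer flow for bagging can be simulated in a single process to see how the pieces fit. This is a sketch under stated assumptions: the per-sample model is a simple per-class mean of one numeric feature, swapped in for the C5.0 tree so the example needs no external libraries, and the records, feature values, and parameter defaults are all invented.

```python
import random
from collections import defaultdict
from statistics import mean


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit(records):
    """Reducer 1: fit one model per sample (stand-in model: per-class mean)."""
    by_label = defaultdict(list)
    for x, label in records:
        by_label[label].append(x)
    if len(by_label) < 2:
        return None  # skip a sample that happens to miss a class
    return {label: mean(xs) for label, xs in by_label.items()}


def aggregate(models):
    """Reducer 2: aggregate the models by averaging their parameters."""
    return {label: mean(m[label] for m in models) for label in models[0]}


def predict(model, x):
    """Predict the label whose averaged class mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def bag(records, n_samples=5, p=0.6, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)  # stands in for the shuffle between map and reduce
    for sample_id, record in mapper(records, n_samples, p, rng):
        samples[sample_id].append(record)
    models = [m for m in map(fit, samples.values()) if m is not None]
    return aggregate(models)


# Hypothetical records: (numeric feature, status).
records = [(x, "open") for x in (8, 9, 10, 9, 8, 10, 9, 8, 10, 9)]
records += [(x, "closed") for x in (1, 2, 3, 2, 1, 3, 2, 1, 3, 2)]
model = bag(records)
```

Averaging parameters works for this toy model; for tree models such as C5.0, the aggregation step would more naturally average or vote over predictions instead.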

CONFIDENTIAL | 11 WHERE ARE WE?
- We’ve created a structured model
- to flag questions that won’t be closed
- using Big Data.
- But we haven’t used unstructured data.

CONFIDENTIAL | 12 TEXT ANALYSIS
Is “the big dog” really different from “dog is big”?
How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
Language has lexical and syntactical features.
Different techniques leverage these in different ways:
- Bag of Words: Structure doesn’t matter
- n-gram: Structure matters (but not that much)
- Feature Extraction: BACON! BACON! BACON!
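The eggs-and-tofu example from the slide can be checked directly. The sketch below uses a deliberately naive tokenizer (lowercase, split on whitespace), which is an assumption for illustration; it shows that the two sentences are identical as bags of words but differ as bigrams.

```python
from collections import Counter


def tokens(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """Token counts with word order discarded."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Counts of consecutive n-token windows, which keep local word order."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = ngrams(a) == ngrams(b)
shared_bigrams = set(ngrams(a)) & set(ngrams(b))
```

Here `same_bag` is true while `same_bigrams` is false: the only bigram the two sentences share is `("eggs", "but")`, so a bigram model can tell apart the opposite preferences that a bag of words merges.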

CONFIDENTIAL | 13 STEP 4: UNSTRUCTURED MODEL
- Similar to Hadoop’s Word Count
- Create counts for token/category pairs
- Use counts to calculate Information Gain
Information Gain
- MR Job 1: Calculate information gain (IG) for all tokens.
- MR Job 2: Select tokens with largest IG. Create structured data for record, tokens:
    question #4 | 0 | 1 | 0 | 1 | 1
- MR Job 3: Build a classifier over the newly structured data (prior slides)
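The first two jobs can be sketched in a single process. The code below counts token/category pairs, computes information gain per token as the class entropy minus the entropy conditioned on token presence, keeps the top tokens, and encodes a record as a 0/1 row. The documents, categories, and the choice to count each token once per document are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(pair_counts, category_counts):
    """IG(token) = H(category) - H(category | token present or absent)."""
    categories = list(category_counts)
    n = sum(category_counts.values())
    base = entropy([category_counts[c] for c in categories])
    gains = {}
    for token in {t for t, _ in pair_counts}:
        with_t = [pair_counts.get((token, c), 0) for c in categories]
        without_t = [category_counts[c] - k for c, k in zip(categories, with_t)]
        n_t = sum(with_t)
        cond = (n_t / n) * entropy(with_t) + ((n - n_t) / n) * entropy(without_t)
        gains[token] = base - cond
    return gains


def top_tokens(gains, k):
    """Job 2, part 1: keep the k tokens with the largest gain (ties by name)."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def encode(doc_tokens, vocabulary):
    """Job 2, part 2: turn a record into a 0/1 presence row over the vocabulary."""
    return [1 if t in doc_tokens else 0 for t in vocabulary]


# Hypothetical documents: (set of tokens, category).
docs = [
    ({"homework", "please"}, "closed"),
    ({"homework", "urgent"}, "closed"),
    ({"python", "error"}, "open"),
    ({"python", "please"}, "open"),
]
# Job 1: word-count style tallies of token/category pairs and of categories.
pair_counts = Counter((t, cat) for toks, cat in docs for t in toks)
category_counts = Counter(cat for _, cat in docs)

gains = information_gain(pair_counts, category_counts)
vocabulary = top_tokens(gains, 2)
row = encode({"python", "please"}, vocabulary)
```

In this toy data "homework" and "python" separate the categories perfectly, while "please" appears equally in both and so carries no information.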

CONFIDENTIAL | 14 WHERE ARE WE?
- We’ve created two models
- One structured,
- one unstructured.
- But they don’t work together.

CONFIDENTIAL | 15 STEP 5: ENSEMBLE MODEL
- Join many models together
- By using their output
- As input to ensemble model.
- Best when models perform differently
- Exploit differences with nonlinearities
- Like interaction effects.
Ensembling
- Mapper 1: Load multiple models. Score the models per record and output.
- Reducer 1: Key: Id of record. Value: List of model outputs. Join model outputs to make new records.
- MR Job 2: Build a model over the output data as if it was raw data.
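The ensembling flow can be simulated in one process to show where the interaction effect comes from. Everything specific here is hypothetical: the two base models are hand-written rules standing in for the structured and text models, the records are invented, and the second-stage model is a simple lookup table (majority label per combination of base outputs), chosen because it captures interactions without any external library.

```python
from collections import Counter, defaultdict


def structured_model(record):
    """Hypothetical structured model: low reputation suggests 'closed' (1)."""
    return 1 if record["reputation"] < 10 else 0


def text_model(record):
    """Hypothetical text model: the token 'homework' suggests 'closed' (1)."""
    return 1 if "homework" in record["body"].split() else 0


MODELS = [("structured", structured_model), ("text", text_model)]


def mapper(records):
    """Mapper 1: score every model on every record."""
    for record in records:
        for name, model in MODELS:
            yield record["id"], (name, model(record))


def reducer(pairs):
    """Reducer 1: join model outputs per record id into one new record."""
    grouped = defaultdict(dict)
    for record_id, (name, output) in pairs:
        grouped[record_id][name] = output
    return {rid: tuple(outs[name] for name, _ in MODELS)
            for rid, outs in grouped.items()}


def fit_ensemble(features, labels):
    """Job 2: majority label for each combination of base-model outputs."""
    votes = defaultdict(Counter)
    for record_id, feats in features.items():
        votes[feats][labels[record_id]] += 1
    return {feats: counts.most_common(1)[0][0] for feats, counts in votes.items()}


# Hypothetical records; closed = 1 means the question was closed.
records = [
    {"id": 1, "reputation": 5, "body": "homework please", "closed": 1},
    {"id": 2, "reputation": 5, "body": "python error", "closed": 0},
    {"id": 3, "reputation": 500, "body": "homework help", "closed": 0},
    {"id": 4, "reputation": 500, "body": "python question", "closed": 0},
    {"id": 5, "reputation": 3, "body": "urgent homework", "closed": 1},
    {"id": 6, "reputation": 8, "body": "python bug", "closed": 0},
]
labels = {r["id"]: r["closed"] for r in records}

features = reducer(mapper(records))
ensemble = fit_ensemble(features, labels)
predictions = {rid: ensemble[feats] for rid, feats in features.items()}
```

On these rows each base model makes mistakes alone, but the table learns that a question is closed only when both models agree, which is the kind of interaction the slide describes. The ensemble is fit and scored on the same toy rows here; a real run would fit it on held-out model outputs.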

CONFIDENTIAL | 16 WHERE ARE WE?
- We’ve created two models:
- one structured,
- one unstructured
- and have ensembled them
- to create a single, powerful model
- and solve a practical business problem.

CONFIDENTIAL | 17 HOW DID WE GET HERE?
- This required simple infrastructure
- a blend of analysis and scripting skills
- an understanding of BIG data science techniques
- but not a team of PhDs or a billion dollars.

CONFIDENTIAL | 18 Questions? www.thinkbiganalytics.com @danmallinger
