Published by Joella Tucker. Modified over 8 years ago.
1
June 2013 BIG DATA SCIENCE: A PATH FORWARD
2
CONFIDENTIAL | 2
Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer
linkedin.com/in/danmallinger/ | @danmallinger | www.thinkbiganalytics.com
3
TODAY
A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
4
INFRASTRUCTURE, TALENT, & CAPABILITIES
Understand our organizational needs for data science:
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.
Hadoop, NoSQL, Analytics, SQL/MPP, Real Time, Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math, Visualization, Clustering, Categorization, Continuous Models, Text Analysis
5
ANALYTICS TOOLS
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.
You will need toolkits to solve unique problems, but smart techniques make that easier.
Boxed solutions are limited but can be a good source of early velocity.
6
DATA
Presenter: Gigabytes from Stack Overflow. Questions from users, with metadata. Users have reputations. Questions are open or closed.
Audience: Follow along, thinking about your data, to learn in a familiar context and plan.
7
STEP 1: EXPLORE
Patterns through Hive:
    select count(1) as total, sum(has_code), avg(body_count),
           stddev_samp(body_count), corr(reputation, owner_questions),
           histogram_numeric(body_count, 10)
    from questions;
Patterns through Tableau
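To sanity-check the Hive results on a small local sample, the same summary statistics can be sketched in plain Python. This is a minimal sketch, not the presenter's code: the helpers `summarize` and `pearson` are hypothetical, rows are assumed to be dicts keyed by the column names from the query, and the histogram is left out.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation, the quantity Hive's corr() reports."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(rows):
    """Summarize a list of question rows (hypothetical dict layout)."""
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),                              # count(1)
        "with_code": sum(r["has_code"] for r in rows),   # sum(has_code)
        "avg_body": statistics.mean(body),               # avg(body_count)
        "sd_body": statistics.stdev(body),               # sample stddev, as in stddev_samp
        "corr": pearson(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
    }
```

Running this on a few hundred sampled rows gives a cheap cross-check before trusting the numbers from the full table.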
8
STEP 2: FEATURE BUILDING
Summaries of unstructured data
Time-since metrics
    select transform(…) using 'python …'
Clustering: Browsing cohorts
    /bin/mahout canopy
SQL Windowing: Cross-record features
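A time-since metric of the kind a Hive transform script could compute might look like the sketch below. Everything specific here is an assumption: the function name `add_time_since`, the two-column layout (owner id, then creation time in epoch seconds), and the requirement that rows arrive grouped by owner and sorted by time. Hive passes rows to a transform script as tab-separated lines on standard input.

```python
def add_time_since(lines):
    """Append a column: seconds since the same owner's previous question.

    Assumes tab-separated rows (owner_id, created_ts), grouped by owner
    and sorted by time. The first row for each owner gets -1.
    """
    prev_owner, prev_ts = None, None
    for line in lines:
        owner, ts = line.rstrip("\n").split("\t")[:2]
        ts = int(ts)
        gap = ts - prev_ts if owner == prev_owner else -1
        yield f"{owner}\t{ts}\t{gap}"
        prev_owner, prev_ts = owner, ts


# In a real transform script the driver would read standard input:
#     import sys
#     for row in add_time_since(sys.stdin):
#         print(row)
```

Keeping the logic in a generator makes it easy to test on a plain list of strings before running it on the cluster.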
9
PARALLEL MODELS IN HADOOP
Sample (don’t parallelize)
Naturally parallel: SVD, Random Forests
Estimators and Ensembles: Bootstrapping, Localizing
Advanced Parallelization: Linear models with SGD, Neural networks
10
STEP 3: STRUCTURED MODEL (BAGGING)
A single R model is run many times over samples, and the results are aggregated:
    m <- C5.0(status ~ …)
Bagging a Model
Mapper 1: Define n reducer keys. Send any record to reducer i with probability p.
Reducer 1: Key: ID of sample. Value: List of records. Perform analysis over records.
Reducer 2: Key: One. Value: List of models. Aggregate the models (e.g. average).
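The mapper and reducer roles above can be simulated in memory to see how the pieces fit. This is a sketch under stated assumptions, not the Hadoop job from the talk: a one-variable least-squares line stands in for the C5.0 model because its coefficients can be averaged directly, and the names `mapper`, `fit_line`, and `bag` are hypothetical.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each sample key i with probability p."""
    for rec in records:
        for i in range(n_samples):
            if rng.random() < p:
                yield i, rec


def fit_line(points):
    """Reducer 1: fit y = slope * x + intercept by least squares."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, rec in mapper(records, n_samples, p, rng):
        groups[key].append(rec)
    # Skip samples that cannot support a fit (fewer than two distinct x values).
    models = [fit_line(pts) for pts in groups.values()
              if len({x for x, _ in pts}) >= 2]
    # Reducer 2: all models share one key and are averaged.
    slopes, intercepts = zip(*models)
    return sum(slopes) / len(models), sum(intercepts) / len(models)
```

With tree models such as C5.0, the aggregation step would more likely be a vote over predictions than an average of parameters.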
11
WHERE ARE WE?
Using Big Data, we’ve created a structured model to flag questions that won’t be closed. But we haven’t used unstructured data.
12
TEXT ANALYSIS
Is “the big dog” really different from “dog is big”? How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
Language has lexical and syntactic features. Different techniques leverage these in different ways:
Bag of Words: Structure doesn’t matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
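The eggs and tofu sentences make the difference concrete: they contain exactly the same words, so a bag of words cannot tell them apart, while bigrams can. A minimal sketch, assuming simple lowercase whitespace tokenization; the helper names are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is kept."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # True: identical word counts
same_bigrams = ngrams(a) == ngrams(b)           # False: ("i", "like") vs ("i", "hate")
```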
13
STEP 4: UNSTRUCTURED MODEL
Information Gain
Similar to Hadoop’s Word Count: create counts for token/category pairs, then use the counts to calculate Information Gain.
MR Job 1: Calculate information gain (IG) for all tokens.
MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3: Build a classifier over the newly structured data (prior slides).
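The first two jobs can be sketched in memory as follows. This is an illustration under assumptions, not the MapReduce code from the talk: each document is taken to be a set of tokens plus a category, each token is treated as a present/absent feature, and the function names are hypothetical. Information gain here is the category entropy minus the entropy remaining after splitting on the token.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs):
    """docs: list of (set_of_tokens, category). Returns {token: IG}."""
    n = len(docs)
    class_counts = Counter(cat for _, cat in docs)
    # Like word count, but keyed by (token, category) pairs.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    base = entropy(class_counts.values())
    gains = {}
    for tok in {t for t, _ in pair_counts}:
        present = [pair_counts[(tok, c)] for c in class_counts]
        absent = [class_counts[c] - pair_counts[(tok, c)] for c in class_counts]
        n_present = sum(present)
        gains[tok] = (base
                      - (n_present / n) * entropy(present)
                      - ((n - n_present) / n) * entropy(absent))
    return gains


def top_tokens(gains, k):
    """Job 2, part one: keep the k tokens with the largest IG."""
    return sorted(gains, key=gains.get, reverse=True)[:k]


def encode(doc_tokens, selected):
    """Job 2, part two: one 0/1 column per selected token."""
    return [1 if tok in doc_tokens else 0 for tok in selected]
```

The encoded rows have the same shape as the slide's `question #4 | 0 | 1 | 0 | 1 | 1` example and can be fed to any structured classifier.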
14
WHERE ARE WE?
We’ve created two models: one structured, one unstructured. But they don’t work together.
15
STEP 5: ENSEMBLE MODEL
Ensembling
Join many models together by using their output as input to an ensemble model. This works best when the models perform differently: exploit the differences with nonlinearities, like interaction effects.
Mapper 1: Load multiple models. Score the models per record and output.
Reducer 1: Key: ID of record. Value: List of model outputs. Join model outputs to make new records.
MR Job 2: Build a model over the output data as if it were raw data.
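The two jobs can be sketched in memory like this. It is a toy illustration under assumptions, not the presenter's implementation: base models are plain Python callables that return a score, and the second-stage model is a small lookup table over thresholded scores, chosen only because it is nonlinear and can capture interactions between the base models. The names `score_records` and `fit_meta` are hypothetical.

```python
from collections import Counter, defaultdict


def score_records(records, models):
    """Job 1: score every record with every model, then join by record id."""
    joined = defaultdict(dict)
    for rec_id, rec in records.items():        # mapper: score per record
        for name, model in models.items():
            joined[rec_id][name] = model(rec)
    # reducer: one new record per id, holding all model outputs
    return {rid: tuple(scores[name] for name in models)
            for rid, scores in joined.items()}


def fit_meta(rows, labels, threshold=0.5):
    """Job 2: learn the majority label for each pattern of high/low scores."""
    votes = defaultdict(Counter)
    for rid, scores in rows.items():
        cell = tuple(s >= threshold for s in scores)
        votes[cell][labels[rid]] += 1
    table = {cell: c.most_common(1)[0][0] for cell, c in votes.items()}

    def predict(scores):
        # Returns None for a score pattern never seen in training.
        return table.get(tuple(s >= threshold for s in scores))

    return predict
```

In practice the second stage would be any classifier trained on the joined scores, ideally on records the base models did not see during their own training.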
16
WHERE ARE WE?
We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
17
HOW DID WE GET HERE?
This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.
18
Questions?
www.thinkbiganalytics.com | @danmallinger