June 2013 BIG DATA SCIENCE: A PATH FORWARD. CONFIDENTIAL | 2  Data Science Lead.

Presentation transcript:

June 2013 BIG DATA SCIENCE: A PATH FORWARD

CONFIDENTIAL | 2
- Data Science Think Big
- Product/Brand Obsessive
- Teacher
- Occasional Engineer

CONFIDENTIAL | 3 TODAY
A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

CONFIDENTIAL | 4 INFRASTRUCTURE, TALENT, & CAPABILITIES
- Understand our organizational needs for data science
- Infrastructure: Technological tools and platforms.
- Talent: Staff hired and trained.
- Capabilities: Data science techniques utilized.
Infrastructure: Hadoop | NoSQL | Analytics | SQL/MPP | Real Time
Talent: Scripting | MapReduce | Data Exploration | Basic Modeling | PhD Math
Capabilities: Visualization | Clustering | Categorization | Continuous Models | Text Analysis

CONFIDENTIAL | 5 ANALYTICS TOOLS
- Boxed Solutions: Mahout & Platform
- Toolkits: RHadoop, Scikit, etc.
- You will need toolkits to solve unique problems, but smart techniques make that easier.
- Boxed solutions are limited, but can be a good source of early velocity.

CONFIDENTIAL | 6 DATA
Presenter:
- Gigabytes from Stack Overflow
- Questions from users
- With metadata
- Users have reputations
- Questions open or closed
Audience:
- Follow along
- Thinking about your data
- To learn in a familiar context and plan

CONFIDENTIAL | 7 STEP 1: EXPLORE
    select count(1) as total,
           sum(has_code),
           avg(body_count),
           stddev_samp(body_count),
           corr(reputation, owner_questions),
           histogram_numeric(body_count, 10)
    from questions;
- Patterns through Hive
- Patterns through Tableau
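For readers without a Hive cluster at hand, the same exploratory summaries can be sketched on a small in-memory sample. This is only an illustration: the rows below are made up, the column order is an assumption, and the equal-width histogram is a simplification of what `histogram_numeric` returns.

```python
import math
from statistics import mean, stdev

# Hypothetical sample rows: (has_code, body_count, reputation, owner_questions)
rows = [
    (1, 120, 1500, 12),
    (0, 40, 10, 1),
    (1, 300, 820, 7),
    (0, 75, 55, 2),
    (1, 210, 4300, 30),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def equal_width_histogram(values, bins):
    """Count values in equal-width bins (a simplified stand-in for a histogram)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def summarize(rows, bins=10):
    """Compute the per-column summaries from the exploration query."""
    has_code = [r[0] for r in rows]
    body = [r[1] for r in rows]
    reputation = [r[2] for r in rows]
    owner_questions = [r[3] for r in rows]
    return {
        "total": len(rows),
        "with_code": sum(has_code),
        "avg_body": mean(body),
        "sd_body": stdev(body),  # sample standard deviation
        "corr_rep_questions": pearson(reputation, owner_questions),
        "body_hist": equal_width_histogram(body, bins),
    }

summary = summarize(rows)
```

On real data the point of the Hive query is that these aggregates run in one pass over the full table; the sketch is only useful for checking intuitions on a sample.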

CONFIDENTIAL | 8 STEP 2: FEATURE BUILDING
- Summaries of unstructured data
- Time-since metrics
    select transform(…) using 'python …'
- Clustering: Browsing cohorts
    /bin/mahout canopy
- SQL Windowing
- Cross-Record Features
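A streaming script of the kind `select transform(…) using 'python …'` calls might look like the sketch below. Hive passes each row to the script as a tab-separated line and reads tab-separated lines back. The column layout (question id, asked-at timestamp, owner sign-up timestamp, body), the timestamp format, and the `<code>` check are all assumptions made for illustration, not the presenter's script.

```python
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout

def build_features(line):
    """Turn one tab-separated input row into a tab-separated feature row."""
    question_id, asked_at, owner_created_at, body = line.rstrip("\n").split("\t", 3)
    asked = datetime.strptime(asked_at, TIME_FORMAT)
    created = datetime.strptime(owner_created_at, TIME_FORMAT)
    days_since_signup = (asked - created).days   # a time-since metric
    body_count = len(body.split())               # a summary of unstructured text
    has_code = int("<code>" in body)             # crude check for a code snippet
    return "\t".join(map(str, (question_id, days_since_signup, body_count, has_code)))

def run(stream, out):
    """Stream rows the way a TRANSFORM script would: one row in, one row out."""
    for line in stream:
        if line.strip():
            out.write(build_features(line) + "\n")

# Local dry run with one made-up row.
demo_in = io.StringIO(
    "42\t2013-06-01 12:00:00\t2013-05-02 12:00:00\tHow do I <code>x</code> this\n"
)
demo_out = io.StringIO()
run(demo_in, demo_out)
```

When used from Hive, the script would call `run(sys.stdin, sys.stdout)` instead of the dry run.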

CONFIDENTIAL | 9 PARALLEL MODELS IN HADOOP
- Sample (don’t parallelize)
- Naturally parallel: SVD, Random Forests
- Estimators and Ensembles: Bootstrapping, Localizing
- Advanced Parallelization: Linear models with SGD, Neural networks

CONFIDENTIAL | 10 STEP 3: STRUCTURED MODEL (BAGGING)
- Single R model, run many times over samples, and aggregated
    m <- C5.0(status ~ …)
Bagging a Model:
- Mapper 1: Define n reducer keys. Send any record to reducer i with probability p.
- Reducer 1: Key: Id of sample. Value: List of records. Perform analysis over records.
- Reducer 2: Key: One. Value: List of models. Aggregate the models (e.g. average).
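The mapper/reducer flow for bagging can be simulated in a few lines of plain Python. This is a sketch under stated assumptions: the "model" fitted per sample is just the share of open questions (a stand-in for the C5.0 tree on the slide), the records are synthetic, and "send any record to reducer i with probability p" is read as an independent draw per record and per sample.

```python
import random
from collections import defaultdict
from statistics import mean

def map_to_samples(records, n_samples, p, rng):
    """Mapper: emit (sample_id, record); each record joins each sample with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def fit_model(sample):
    """Reducer 1: fit one model per sample. Stand-in model: the open rate in the sample."""
    return mean(status for _, status in sample)

def aggregate(models):
    """Reducer 2: combine the per-sample models, here by averaging them."""
    return mean(models)

def bag(records, n_samples=10, p=0.5, seed=0):
    """Run the two-stage flow in memory and return the aggregated model."""
    rng = random.Random(seed)
    samples = defaultdict(list)  # plays the role of the shuffle: group by sample id
    for sample_id, record in map_to_samples(records, n_samples, p, rng):
        samples[sample_id].append(record)
    models = [fit_model(sample) for sample in samples.values()]
    return aggregate(models)

# Hypothetical records: (reputation, status), where status 1 = open and 0 = closed.
records = [(rep, int(rep > 50)) for rep in range(100)]
estimate = bag(records)
```

Because each sample is fitted independently, the first reducer stage parallelizes trivially; only the final aggregation sees all the models, and it sees them under a single key.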

CONFIDENTIAL | 11 WHERE ARE WE?
- We’ve created a structured model to flag questions that won’t be closed using Big Data.
- But we haven’t used unstructured data.

CONFIDENTIAL | 12 TEXT ANALYSIS
Is “the big dog” really different from “dog is big”? How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
Language has lexical and syntactical features. Different techniques leverage these in different ways:
- Bag of Words: Structure doesn’t matter
- n-gram: Structure matters (but not that much)
- Feature Extraction: BACON! BACON! BACON!
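The eggs-and-tofu pair makes the difference between the first two techniques concrete. In the sketch below (whitespace tokenization and lowercasing are simplifying assumptions), the two sentences have identical bags of words but different bigrams.

```python
from collections import Counter

def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(tokens(text))

def ngrams(text, n=2):
    """Counts of consecutive n-token windows; local word order is kept."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # True: same words, same counts
same_bigrams = ngrams(a) == ngrams(b)           # False: ("like", "eggs") vs ("hate", "eggs")
```

A bag-of-words model cannot tell these two opinions apart, while even bigrams carry enough structure to separate them.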

CONFIDENTIAL | 13 STEP 4: UNSTRUCTURED MODEL
- Similar to Hadoop’s Word Count
- Create counts for token/category pairs
- Use counts to calculate Information Gain
Information Gain:
- MR Job 1: Calculate information gain (IG) for all tokens.
- MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1
- MR Job 3: Build a classifier over the newly structured data (prior slides)
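The first two jobs can be sketched in memory as follows. This assumes the usual definition of information gain for a token (entropy of the category labels minus the entropy that remains after splitting on whether the token is present); the documents and category names are made up, and the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a collection of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, token):
    """docs is a list of (set_of_tokens, category) pairs."""
    labels = Counter(cat for _, cat in docs)
    with_tok = Counter(cat for toks, cat in docs if token in toks)
    without_tok = Counter(cat for toks, cat in docs if token not in toks)
    n = len(docs)
    # Entropy left after splitting on token presence, weighted by split size.
    remaining = sum(
        (sum(part.values()) / n) * entropy(part.values())
        for part in (with_tok, without_tok)
        if part
    )
    return entropy(labels.values()) - remaining

def top_tokens(docs, k):
    """Job 2, part one: keep the k tokens with the largest information gain."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    return sorted(vocab, key=lambda t: information_gain(docs, t), reverse=True)[:k]

def to_row(doc_tokens, selected):
    """Job 2, part two: one 0/1 indicator per selected token."""
    return [int(t in doc_tokens) for t in selected]

# Hypothetical tokenized questions with their categories.
docs = [
    ({"homework", "java"}, "closed"),
    ({"homework"}, "closed"),
    ({"java", "error"}, "open"),
    ({"error"}, "open"),
]
selected = top_tokens(docs, 2)
row = to_row({"java", "error"}, selected)
```

In a MapReduce setting the `Counter` objects correspond to the token/category counts produced by a Word Count style job; only the final entropy arithmetic needs those counts side by side.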

CONFIDENTIAL | 14 WHERE ARE WE?
- We’ve created two models: one structured, one unstructured.
- But they don’t work together.

CONFIDENTIAL | 15 STEP 5: ENSEMBLE MODEL
- Join many models together by using their output as input to an ensemble model.
- Best when models perform differently
- Exploit differences with nonlinearities, like interaction effects.
Ensembling:
- Mapper 1: Load multiple models. Score the models per record and output.
- Reducer 1: Key: Id of record. Value: List of model outputs. Join model outputs to make new records.
- MR Job 2: Build a model over the output data as if it were raw data.
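A minimal in-memory sketch of this ensembling flow is shown below. Everything concrete in it is an assumption: the two base models are hand-written rules standing in for the structured and unstructured models, the records are invented, and the second-stage model is a simple lookup table (majority label per combination of base outputs), chosen because it can capture interactions between the models.

```python
from collections import Counter, defaultdict

def score_records(records, models):
    """Mapper: emit (record_id, (model_name, score)) for every model."""
    for record_id, features, _ in records:
        for name, model in models.items():
            yield record_id, (name, model(features))

def join_scores(scored):
    """Reducer: group the model outputs by record id into one new record each."""
    joined = defaultdict(dict)
    for record_id, (name, score) in scored:
        joined[record_id][name] = score
    return joined

def fit_meta_model(joined, labels, model_names):
    """Second job: learn the majority label for each combination of model outputs."""
    votes = defaultdict(Counter)
    for record_id, outputs in joined.items():
        key = tuple(outputs[name] for name in model_names)
        votes[key][labels[record_id]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    # Unseen output combinations return None in this sketch.
    return lambda outputs: table.get(tuple(outputs[name] for name in model_names))

# Hypothetical records: (id, features, label) where label 1 = stays open.
records = [
    (1, {"reputation": 500, "homework": False}, 1),
    (2, {"reputation": 500, "homework": True}, 1),
    (3, {"reputation": 5, "homework": False}, 1),
    (4, {"reputation": 5, "homework": True}, 0),
]
# Stand-in base models: one over structured fields, one over a text-derived flag.
models = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int(not f["homework"]),
}
names = ["structured", "text"]

joined = join_scores(score_records(records, models))
labels = {record_id: label for record_id, _, label in records}
ensemble = fit_meta_model(joined, labels, names)
```

Here neither base model alone predicts the label, but the ensemble does: `ensemble({"structured": 0, "text": 0})` is the only combination that maps to a closed question.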

CONFIDENTIAL | 16 WHERE ARE WE?
- We’ve created two models: one structured, one unstructured,
- and have ensembled them to create a single, powerful model
- and solve a practical business problem.

CONFIDENTIAL | 17 HOW DID WE GET HERE?
- This required simple infrastructure,
- a blend of analysis and scripting skills,
- an understanding of BIG data science techniques,
- but not a team of PhDs or a billion dollars.

CONFIDENTIAL | 18 Questions?