Big Data, Data Mining, Tools

N = ALL

CORRELATION vs CAUSATION

Data Sources...

Data Creation, Storage, Costs

Infrastructure

NoSQL Flavors https://www.youtube.com/watch?v=qI_g07C_Q5I

NoSQL
  Not Only SQL (sort of)
  Greater scalability
  Designed for distributed computing on commodity (not cheap) hardware
  Variety of flavors
  https://www.youtube.com/watch?v=qI_g07C_Q5I

Topic: Algorithms

Tools

Speaking of the Cloud

High Level Flow Example

Hadoop MapReduce

HDFS
  Distributed file system
  Write-once / read-many
  Fault tolerance / redundancy
  Processing logic close to the data
  http://www.ibm.com/developerworks/library/wa-introhdfs/

Traditional word count in Java http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
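
The word count job referenced here follows the usual MapReduce shape: a map step emits (word, 1) pairs, the framework groups the pairs by key, and a reduce step sums each group. The following is a minimal in-memory sketch of that flow in Python, not the Hadoop Java API. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and splitting on whitespace is an assumption about tokenization.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here all three steps run in one process only to make the data flow visible.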

Hive
  CREATE TABLE docs (line STRING);
  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;

Hive with Some Structure
  Data
    123  F
    456  M
    789  M
    111  M
    222  M
    333  F
    444  F
    555  M
  create table if not exists p_genders (
    p_id string,
    gender string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  SELECT * from p_genders;
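
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' tells Hive to split each line of the underlying file on tabs and map the fields to the declared columns in order. A rough Python sketch of that idea, using the sample rows from the slide, is shown below. It assumes the file really is tab-separated, as the table definition implies. parse_rows is a hypothetical helper, not part of Hive, and the per-gender count is an added illustration rather than a query from the slide.

```python
from collections import Counter

# Sample rows from the slide, written as tab-separated lines (an assumption
# about the file layout, based on FIELDS TERMINATED BY '\t').
RAW = "123\tF\n456\tM\n789\tM\n111\tM\n222\tM\n333\tF\n444\tF\n555\tM\n"


def parse_rows(text):
    """Split each non-empty line on tabs into a (p_id, gender) tuple."""
    rows = []
    for line in text.splitlines():
        if line:
            p_id, gender = line.split("\t")
            rows.append((p_id, gender))
    return rows


# Comparable in spirit to: SELECT * FROM p_genders;
rows = parse_rows(RAW)

# Comparable in spirit to: SELECT gender, count(1) FROM p_genders GROUP BY gender;
by_gender = Counter(gender for _, gender in rows)
```

Both columns stay strings, matching the string types in the table definition.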

Pig Latin
  A = load 'S3://pmb4bucket/input/bleakhouse/bleakhouse.txt';
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  C = group B by word;
  D = foreach C generate COUNT(B), group;
  store D into 's3://pmb4hadoop/output/bleakhouse';

Complex Event Processing

Tools

Data Scientist
  Not just a bean counter - it’s about modeling
  General skill set:
    Math (linear algebra, statistics, calculus, discrete math)
    Business sense
    Programming skills
    Communication
    etc., etc., etc.
  https://www.youtube.com/watch?v=ceeiUAmbfZk

Our Schedule
  Setting the goals for a data mining project
  Setting up KNIME
  Gathering and preparing data
  Visualization
  Machine Learning
  Naïve Bayes
  Clustering and Classification
  Dimension reduction

But first…