Big Data, Data Mining, Tools


N = ALL

CORRELATION vs CAUSATION

Data Sources...

Data Creation, Storage, Costs

Infrastructure

NoSQL Flavors https://www.youtube.com/watch?v=qI_g07C_Q5I

NoSQL
Not Only SQL (sort of)
Greater scalability
Designed for distributed computing on commodity (not cheap) hardware
Variety of flavors
https://www.youtube.com/watch?v=qI_g07C_Q5I

Topic: Algorithms

Tools

Speaking of the Cloud

High Level Flow Example

Hadoop MapReduce

HDFS
Distributed file system
Write-once / read-many
Fault tolerance / redundancy
Processing logic close to data
http://www.ibm.com/developerworks/library/wa-introhdfs/

Traditional word count in Java http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
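
The word count linked above is the usual first MapReduce program. As a rough illustration of the same idea without Hadoop, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for this example and are not part of any Hadoop API; a real job would distribute these steps across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three steps in sequence in one process (illustration only)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical input lines, standing in for a file stored in HDFS.
    sample = ["the fog the mud", "fog everywhere"]
    print(word_count(sample))
```

Keeping the three steps as separate functions mirrors how the framework separates mapper and reducer code, with the grouping in between handled by the platform.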

Hive
CREATE TABLE docs (line STRING);
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Hive with Some Structure
Data (tab-separated p_id and gender):
123    F
456    M
789    M
111    M
222    M
333    F
444    F
555    M
CREATE TABLE IF NOT EXISTS p_genders (
  p_id STRING,
  gender STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT * FROM p_genders;
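
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' tells Hive to split each line of the underlying file on tab characters and map the pieces to the declared columns in order. The following Python sketch mimics that mapping on the sample rows from the slide. The helper name parse_rows is hypothetical, and the tally by gender at the end is an added illustration, not something the slide's SELECT computes.

```python
from collections import Counter

# The sample rows from the slide, written as a tab-delimited text file would store them.
RAW = "123\tF\n456\tM\n789\tM\n111\tM\n222\tM\n333\tF\n444\tF\n555\tM\n"

# Column names taken from the CREATE TABLE statement.
COLUMNS = ("p_id", "gender")


def parse_rows(text, delimiter="\t"):
    """Split each non-empty line on the delimiter and pair fields with column names."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(delimiter)
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


if __name__ == "__main__":
    rows = parse_rows(RAW)
    # Roughly what SELECT * FROM p_genders would list.
    for row in rows:
        print(row["p_id"], row["gender"])
    # Added illustration: count rows per gender.
    print(Counter(row["gender"] for row in rows))
```

Both columns stay strings here, matching the STRING types in the table definition.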

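Pig Latin
-- Load each line of the text as a single field
A = load 's3://pmb4bucket/input/bleakhouse/bleakhouse.txt';
-- Split every line into words, one word per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Group identical words together
C = group B by word;
-- Count the records in each group
D = foreach C generate COUNT(B), group;
store D into 's3://pmb4hadoop/output/bleakhouse';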

Complex Event Processing

Tools

Data Scientist
Not just a bean counter - it’s about modeling
General skill set:
Math (linear algebra, statistics, calculus, discrete math)
Business sense
Programming skills
Communication
etc.
https://www.youtube.com/watch?v=ceeiUAmbfZk

Our Schedule
Setting the goals for a data mining project
Setting up KNIME
Gathering and preparing data
Visualization
Machine Learning
Naïve Bayes
Clustering and Classification
Dimension reduction

But first…