Welcome to the Intermountain Big Data Conference!
Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny


Slide 2: Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny
- U Student: Math major with CS minor
- Emphasis on Stats & Machine Learning
- jameslohse.com: download slides and paper
- Contact: supportml.com DATA
- Changing name to Mega Learning LLC, watch for that

Slide 3: Big Data Utah / UTGE
- Big Data Utah and Utah Geek Events
- Nick Baguley / Pat Wright
- On Meetup.com
- November is next
- Next Big Data Utah event is January 13
- Look at UTGE: Big Mountain Data Conference and others

Slide 4: Data Mining and Machine Learning Primer
- Tools and infrastructure for being a Data Scientist can be overwhelming at first
- Much more to it than just programming
- This is true for all development: lots of tools
- So you know Java? How about Maven? Gradle? Eclipse, IntelliJ? Android Studio? Ant, SVN, Git, GitHub, Mercurial, Ivy, etc.?

Slide 5: Big Server vs. Cluster
- Storing large data sets: local vs. cloud? GPU?
- Hadoop / HDFS / HBase for cluster storage
- Cluster of unreliable commodity hardware
- Hadoop is an Apache open source project
- Often associated with MapReduce
- They are not the same; MapReduce can work on a Hadoop file system

Slide 6: Hadoop
- Spreads large data sets across clusters
- Clusters can be very cheap hardware
- Based on Google white papers on MapReduce and the Google File System
- HDFS: Hadoop Distributed File System
- Framework mostly written in Java
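A rough sketch of how a distributed file system can spread one large file across a cluster: the file is cut into fixed-size blocks and each block is copied to several nodes, so losing one cheap machine does not lose data. This is a toy illustration in plain Python, not HDFS code. The function names, node names, tiny block size and round-robin placement are all assumptions made for the example (HDFS itself defaults to 128 MB blocks and 3 replicas, and uses a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block index to `replication` distinct nodes (round-robin).

    Round-robin is a simplification chosen for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for block_index in range(num_blocks):
        placement[block_index] = [
            nodes[(block_index + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical cluster and a deliberately tiny block size.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(
    len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3
)
```

With this placement every block lives on three different nodes, so any single node can fail and all blocks remain readable.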

Slide 7: MapReduce
- Part of Hadoop
- Separate from HDFS, layers on top of HDFS
- Was originally proprietary Google technology
- Splits jobs across a cluster
- Facilitates parallel processing for higher speed
- Implemented in MongoDB, for example
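The MapReduce idea can be sketched on a single machine with the classic word count: a map step emits (key, value) pairs, a shuffle step groups the pairs by key, and a reduce step combines each group. This is a minimal single-process sketch, not the Hadoop API; the function names are made up for illustration. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data science"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be split across machines, which is where the speedup comes from.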

Slide 8: Apache Spark
- MapReduce replacement from UC Berkeley
- In-memory primitives, not disk based
- Cluster management: Spark standalone, YARN, or Mesos
- Distributed storage: interfaces with HDFS, HBase, Cassandra, OpenStack Swift, Amazon S3
- Pseudo-distributed mode for testing locally
- Most active Apache project in 2014

Slide 9: Apache Spark Components
- Spark Core / Resilient Distributed Datasets (RDDs), with APIs in Java, Python and Scala
- Spark SQL: SQL over structured and semi-structured data
- Spark Streaming: Kafka, Flume, Twitter, TCP sockets, ZeroMQ, Kinesis
- MLlib: machine learning library, claimed to be about 10X faster than Apache Mahout
- GraphX: graph processing library
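To give a feel for the RDD style of programming, here is a toy, single-machine imitation in plain Python. It is not the Spark API: `ToyDataset` is a hypothetical class written for this sketch. It shows two ideas only: transformations such as `map` and `filter` are recorded lazily, and nothing is computed until an action such as `collect` runs, after which the result is kept in memory. Real Spark keeps results in memory only when asked to via `cache()` or `persist()`; the automatic caching here is a simplification.

```python
class ToyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)   # the recorded "lineage" of transformations
        self._cache = None           # in-memory result, filled by collect()

    def map(self, fn):
        # Lazy: returns a new dataset with one more recorded step.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded steps once, then reuse the cached result.
        if self._cache is None:
            items = self._source
            for kind, fn in self._steps:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            self._cache = items
        return list(self._cache)


squares_of_evens = (
    ToyDataset(range(10))
    .filter(lambda n: n % 2 == 0)
    .map(lambda n: n * n)
)
result = squares_of_evens.collect()
```

Keeping the list of recorded steps is also what would let a system recompute a lost partition, which is the "resilient" part of the RDD name.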

Slide 10: R
- Like Matlab, more a statistics environment than a pure programming language
- Learn more about R on Coursera (coursera.org)
- Part of the Johns Hopkins “Data Science” track
- Supposedly funny: “A Data Scientist is a statistician who is a better software developer than other statisticians, and a software developer who is a better statistician than other software developers”

Slide 11: CRAN / RStudio / rpy2
- CRAN: Comprehensive R Archive Network
- RStudio is the IDE for R programming
- Free / open source, as a desktop app or as RStudio Server for web access
- rpy2 is a Python interface to R
- Also PyPy, RPy, RPython
- Python taking over as the language for ML

Slide 12: Web Crawlers in Python & Java
- Scrapy (Python)
- Tag Soup (Java)
- Beautiful Soup (Python)
- Taggle is Tag Soup in C++
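The core job these tools do for a crawler is pulling links out of messy HTML. The sketch below does that with only the Python standard library (`html.parser` and `urllib.parse`), so it runs without installing Scrapy or Beautiful Soup. It is not how either library is implemented; the `LinkExtractor` class, the sample HTML and the example URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> tag fed to the parser."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


# Deliberately sloppy markup: the <p> tag is never closed.
SAMPLE_HTML = (
    '<html><body><a href="/talks">Talks</a>'
    '<p>Unclosed paragraph<a href="https://example.org/r">R</a>'
    "</body></html>"
)

parser = LinkExtractor("https://example.com/")
parser.feed(SAMPLE_HTML)
links = parser.links
```

A crawler would fetch each collected link in turn and repeat the extraction, keeping a set of already visited URLs to avoid loops.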

Slide 13: IPython Notebook / Jupyter
- Display / formatting of multiple languages and codesets in one place, for publishing
- Numerous ML-based notebooks online:
- Interesting notebooks:
- Jupyter is now separated from IPython: the “language-agnostic” parts of IPython are now on Jupyter.org

Slide 14: What? NO SQL?
- Not Only SQL: there is SQL
- Solves problems relational databases can't touch
- Amazon, Facebook, Twitter, LinkedIn
- “Eventually consistent”, not ACID
- Many, many choices!
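"Eventually consistent" can be illustrated with two toy replicas that accept writes independently and only agree after they synchronize. This is a simplified sketch using last-write-wins versions, one of several possible conflict-resolution strategies; the `Replica` class, the key names and the integer versions are assumptions for the example, not any particular database's behavior.

```python
class Replica:
    """Toy replica that stores key -> (version, value) and keeps the newest version."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def write(self, key, value, version: int):
        current = self.data.get(key)
        # Last-write-wins: only accept a strictly newer version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other: "Replica"):
        """Pull every entry from another replica, keeping the newest versions."""
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


east = Replica("east")
west = Replica("west")

east.write("profile:42", "old bio", version=1)
west.sync_from(east)

east.write("profile:42", "new bio", version=2)
stale_read = west.read("profile:42")   # west has not seen version 2 yet

west.sync_from(east)
fresh_read = west.read("profile:42")   # replicas agree after syncing
```

The window in which `west` still returns the old value is the price paid for letting each replica answer without coordinating; an ACID database would block or reject instead.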

Slide 15: Key-Value Store
- Stores keys and values, that's it!
- Not up to more complex tasks
- Great for simple needs, very fast!
- Redis, Memcached, Amazon DynamoDB
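The whole interface of a key-value store fits in a few lines, which is why these systems can be so fast. A minimal in-memory sketch, assuming a hypothetical `KeyValueStore` class backed by a Python dict (real stores such as Redis add networking, persistence and expiry on top of the same basic operations):

```python
class KeyValueStore:
    """Minimal in-memory key-value store: set, get and delete by exact key."""

    def __init__(self):
        self._items = {}

    def set(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key) -> bool:
        """Remove the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False


store = KeyValueStore()
store.set("session:abc", "user-17")

hit = store.get("session:abc")
miss = store.get("session:zzz")
removed = store.delete("session:abc")
```

There is no way to ask "which sessions belong to user-17?" without scanning every key, which is the kind of more complex task the slide says this model is not suited for.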

Slide 16: Graph and Other Types
- Graph DB: just that, stores data as a graph with nodes and edges; nodes not all indexed
- Neo4j, FlockDB, OrientDB, IBM DB2, Stardog
- Many other models for databases; each has its own trade-off of speed vs. reliability/consistency
- Others include: Object, Tabular, Tuple store, Triple/quad store, Hosted, Multi-value, Correlation, Cell
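The nodes-and-edges model can be sketched as an adjacency list with labeled edges, plus a breadth-first traversal that follows edges outward from a starting node. This is a toy in-memory structure, not how Neo4j or any other product stores data; the `Graph` class, the edge labels and the sample names are invented for the example.

```python
from collections import defaultdict, deque


class Graph:
    """Toy directed graph: each node maps to a list of (label, neighbor) edges."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))
        self.edges.setdefault(target, [])   # make sure the target node exists

    def reachable(self, start):
        """Return nodes reachable from start, in breadth-first order."""
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for _label, neighbor in self.edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return order


graph = Graph()
graph.add_edge("alice", "KNOWS", "bob")
graph.add_edge("bob", "KNOWS", "carol")
graph.add_edge("alice", "WORKS_AT", "acme")
graph.add_edge("dave", "KNOWS", "alice")

network = graph.reachable("alice")
```

The traversal starts from one known node and walks edges, rather than looking every node up in an index, which is the access pattern graph databases are built around.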

Slide 17: MongoDB, Cassandra, HBase
- Article claims analysis of LinkedIn shows these are becoming the top three NoSQL databases to know: ongodb-cassandra-hbase-three-nosql-databases- to-watch.html

Slide 18: Kaggle.com / Competitions
- Where the money is: Big Data competitions
- If you are at the top of Kaggle you are going to make a lot of money (and change the world?)
- Good community and starter projects
- Facial Keypoints Detection in R
- Big Data Utah also runs competitions


Slide 32: Deploying Shiny Apps
- ShinyApps.io has free limited ac
- RStudio Shiny Server: server/
- Not to be confused with RStudio Server / Pro: d-server/

Slide 33: Thanks for attending!
- Q&A if there's time...
- learning-tools-r-shiny-python/