Welcome to the Intermountain Big Data Conference!
Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny
- U Student: Math major with CS minor
- Emphasis on Stats & Machine Learning
- jameslohse.com: download slides and paper
- Contact: supportml.com
- Changing name to Mega Learning LLC, watch for that
Big Data Utah / UTGE
- Big Data Utah and Utah Geek Events
- Nick Baguley / Pat Wright
- On Meetup.com
- November is next
- Next Big Data Utah event is January 13
- Look at UTGE: Big Mountain Data Conference and others
Data Mining and Machine Learning Primer
- Tools and infrastructure for being a Data Scientist can be overwhelming at first
- Much more to it than just programming
- This is true for all development: lots of tools
- So you know Java? How about Maven? Gradle? Eclipse, IntelliJ? Android Studio? Ant, SVN, Git, GitHub, Mercurial, Ivy, etc.?
Big Server vs. Cluster
- Storing large data sets: local vs. cloud? GPU?
- Hadoop / HDFS / HBase for cluster storage
- Cluster of unreliable commodity hardware
- Hadoop is an Apache open source project
- Often associated with MapReduce
- They are not the same: MapReduce is a processing model that can run over data in a Hadoop file system
Hadoop
- Spreads large data sets across clusters
- Clusters can be very cheap hardware
- Based on Google white papers on MapReduce and Google File System
- HDFS: Hadoop Distributed File System
- Framework mostly written in Java
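As a rough illustration of how a data set can be spread across cheap, unreliable machines, the sketch below cuts a byte string into fixed-size blocks and keeps several copies of each block on different nodes. It is not Hadoop code: the function names, node names, tiny block size, and round-robin placement are all assumptions made for illustration, and real HDFS uses much larger blocks and its own placement policy.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each block to `replicas` distinct nodes (hypothetical round-robin policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replicas)]
        for block_id in range(num_blocks)
    }


def read_back(blocks: List[bytes], placement: Dict[int, List[str]], failed: Set[str]) -> bytes:
    """Reassemble the data, requiring at least one live replica per block."""
    parts = []
    for block_id, block in enumerate(blocks):
        alive = [node for node in placement[block_id] if node not in failed]
        if not alive:
            raise IOError(f"block {block_id} lost: every replica is on a failed node")
        parts.append(block)
    return b"".join(parts)


# Hypothetical four-node cluster with three copies of every block.
data = b"unreliable commodity hardware"
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"], replicas=3)
restored = read_back(blocks, placement, failed={"node2"})
```

With three replicas on four nodes, any two nodes can fail and every block still has a live copy, which is the point of building on unreliable hardware.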
MapReduce
- Part of Hadoop
- Separate from HDFS, layers on top of HDFS
- Was originally proprietary Google technology
- Splits jobs across a cluster
- Facilitates parallel processing for higher speed
- Also implemented in MongoDB, for example
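The map, shuffle, and reduce steps can be sketched in plain Python with the classic word-count example. This runs in a single process and is only meant to show the shape of the model; the function names are assumptions for illustration, not the Hadoop API. On a real cluster, each map and reduce call could run on a different machine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big cluster"])
# counts == {"big": 2, "data": 1, "cluster": 1}
```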
Apache Spark
- MapReduce replacement from UC Berkeley
- In-memory primitives, not disk based
- Cluster management: Spark standalone, YARN or Mesos
- Distributed storage: interfaces with HDFS, HBase, Cassandra, OpenStack Swift, Amazon S3
- Pseudo-distributed mode for testing locally
- One of the most active Apache projects in 2014
Apache Spark Components
- Spark Core / Resilient Distributed Datasets (RDDs): APIs in Java, Python and Scala
- Spark SQL: SQL over structured and semi-structured data
- Spark Streaming: Kafka, Flume, Twitter, TCP sockets, ZeroMQ, Kinesis
- MLlib: Machine Learning Library
- MLlib is reported to be roughly 10X faster than Apache Mahout
- GraphX: graph processing library
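One idea behind RDDs is that transformations such as map and filter are only recorded, and nothing is computed until an action such as collect or reduce asks for a result. The toy class below imitates that behaviour in a single process. `ToyRDD` is a hypothetical name invented for this sketch and is not the PySpark API; it has no partitioning, distribution, or fault recovery.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = list(source)
        self._ops = tuple(ops)  # recorded chain of ("map" | "filter", function)

    def map(self, fn):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        return ToyRDD(self._source, self._ops + (("filter", predicate),))

    def _compute(self):
        """Replay the recorded chain over the source data."""
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        """Action: run the chain and return all results as a list."""
        return list(self._compute())

    def reduce(self, fn):
        """Action: run the chain and fold the results with fn."""
        return fold(fn, self._compute())


numbers = ToyRDD(range(1, 6))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = even_squares.collect()                  # [4, 16]
total = even_squares.reduce(lambda a, b: a + b)  # 20
```

Because each ToyRDD keeps its source and its chain of operations, any result can be recomputed from scratch, which loosely mirrors how a real RDD can rebuild lost data from its lineage.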
R
- Like Matlab, more a statistics environment than a pure programming language
- Learn more about R on Coursera
- Part of Johns Hopkins “Data Science” track
- Supposedly funny: “A Data Scientist is a statistician who is a better software developer than other statisticians, and a software developer who is a better statistician than other software developers”
CRAN / RStudio / Rpy2
- CRAN: Comprehensive R Archive Network
- RStudio is the IDE for R programming
- Free / open source: desktop app, or RStudio Server for web access
- Rpy2 is a Python interface to R
- Also PyPy, Rpy, RPython
- Python taking over as the language for ML
Web Crawlers in Python & Java
- Scrapy (Python)
- Tag Soup (Java)
- Beautiful Soup (Python)
- Taggle is Tag Soup in C++
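A crawler's core loop is: fetch a page, pull out its links, then repeat on those links. The link-extraction step can be sketched with only the Python standard library, whose `html.parser` module tolerates messy "tag soup" markup. This is not how Scrapy or Beautiful Soup are used; the class name, sample HTML, and example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags, even in sloppy HTML."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Parse one page and return the links a crawler would visit next."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately broken markup: unclosed tags and an unquoted attribute.
page = '<p>Unclosed paragraph <a href="/slides">slides<a href=paper.html>paper<b>bold'
links = extract_links(page, "http://example.com/talk/")
# links == ["http://example.com/slides", "http://example.com/talk/paper.html"]
```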
IPython Notebook / Jupyter
- Display / formatting of multiple languages and codesets in one place, for publishing
- Numerous ML-based notebooks online:
- Interesting notebooks:
- Jupyter is now separated from IPython: “language-agnostic” parts of IPython now on Jupyter.org
What? NO SQL?
- Not Only SQL: there is SQL
- Solves problems relational databases can't touch
- Amazon, Facebook, Twitter, LinkedIn
- “Eventually consistent”, not ACID
- Many many choices!
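"Eventually consistent" means a replica may briefly return stale data, but replicas converge once they exchange updates. The toy below shows that with two in-memory replicas and a last-write-wins rule based on a version number. The `Replica` class, the `sync` function, and the versioning scheme are assumptions for this sketch; real systems use more careful conflict resolution.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def write(self, key, value, version: int) -> None:
        """Accept a write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        """Return this replica's current value, which may be stale."""
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a: Replica, b: Replica) -> None:
    """Exchange entries in both directions; the highest version wins."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica("east"), Replica("west")
east.write("status", "draft", version=1)
sync(east, west)
west.write("status", "published", version=2)
stale = east.read("status")       # "draft": east has not seen the update yet
sync(east, west)
converged = east.read("status")   # "published"
```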
Key-Value Store
- Stores keys and values: that's it!
- Not up to more complex tasks
- Great for simple needs, very fast!
- Redis, Memcached, Amazon DynamoDB
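To show how small the key-value model is, here is a minimal in-memory store with set, get, delete, and an optional expiry time, the kind of feature caches commonly offer. It is a sketch only: the `KeyValueStore` class and its method names are hypothetical and do not mirror the Redis, Memcached, or DynamoDB client APIs.

```python
import time


class KeyValueStore:
    """Dictionary-backed store: keys, values, and an optional time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None) -> None:
        """Store a value; if ttl (seconds) is given, it expires after that long."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key, default=None):
        """Return the value, or default if the key is missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]
            return default
        return value

    def delete(self, key) -> bool:
        """Remove a key; returns True if it was present."""
        return self._items.pop(key, None) is not None


store = KeyValueStore()
store.set("user:42", "james")
name = store.get("user:42")   # "james"
```

There are no joins, queries, or schemas here, which is why such stores are fast and also why they are "not up to more complex tasks".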
Graph and other types
- Graph DB: just that, stores data as a graph with nodes and edges; nodes not all indexed
- Neo4j, FlockDB, OrientDB, IBM DB2, Stardog
- Many other models for databases; each has its own trade-off of speed vs. reliability/consistency
- Other types: Object, Tabular, Tuple store, Triple/quad store, Hosted, Multi-value, Correlation, Cell
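The nodes-and-edges model can be sketched as an adjacency list, with a breadth-first search answering a typical graph question: how many hops separate two nodes? The `Graph` class and the people's names are made up for illustration, and this is not the API of Neo4j or any other product listed above.

```python
from collections import defaultdict, deque


class Graph:
    """Undirected graph stored as node -> set of neighbouring nodes."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, a, b) -> None:
        """Connect two nodes in both directions."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def neighbors(self, node):
        """Return a copy of the node's neighbour set (empty if unknown)."""
        return set(self._edges.get(node, ()))

    def hops(self, start, goal):
        """Breadth-first search: fewest edges from start to goal, or None."""
        if start == goal:
            return 0
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            for nxt in self._edges.get(node, ()):
                if nxt == goal:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None


social = Graph()
social.add_edge("ann", "bob")
social.add_edge("bob", "cat")
social.add_edge("dan", "eve")
friend_of_friend = social.hops("ann", "cat")   # 2
unconnected = social.hops("ann", "eve")        # None
```

Relationship questions like this take a few lines on a graph, whereas a relational schema would need repeated self-joins.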
MongoDB, Cassandra, HBase
- Article claims analysis of LinkedIn shows these are becoming the top three NoSQL databases to know: ongodb-cassandra-hbase-three-nosql-databases- to-watch.html
Kaggle.com / competitions
- Where the money is: Big Data competition
- If you are at the top of Kaggle you are going to make a lot of money (and change the world?)
- Good community and starter projects
- Facial Keypoints Detection in R
- Big Data Utah also runs competitions
Deploying Shiny Apps
- ShinyApps.io has free limited ac
- RStudio Shiny Server: server/
- Not to be confused with RStudio Server / Pro: d-server/
Thanks for attending!
- Q&A if there's time...
- learning-tools-r-shiny-python/