Projects on Extended Apache Spark


MapReduce
The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique; in fact, it is important that many pairs with a given key are generated as the Map function is applied to all the input elements. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever the Reduce function produces from the list of values.

WordCount Example
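A minimal sketch of the Map and Reduce functions described above, written in plain Java with no Hadoop or Spark dependency. The class name, method names and the in-memory grouping step are illustrative assumptions, not part of the original slides.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the MapReduce word count.
final class WordCount {

    // Map: turn one input element (a line) into zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: applied once per key to the list of values collected for that key.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<>(key, sum);
    }

    // Driver: run Map on every line, group the pairs by key, then run Reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            Map.Entry<String, Integer> result = reduce(entry.getKey(), entry.getValue());
            counts.put(result.getKey(), result.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

The same key ("to") is emitted by Map several times; the grouping step is what hands Reduce the full list of values for that key.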

Apache Spark
- Fast, general-purpose cluster computing system
- In-memory processing
- High-level built-in APIs in Java, Scala, Python and R

popularity over the years

advanced analytics

SQL
- Spark module for structured data processing
- Can act as a distributed SQL query engine
- Interaction with Spark SQL via SQL and the Dataset API

SQL (Example)
    // spark is an existing SparkSession
    Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
    df.select("name").show();
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+
    df.createOrReplaceTempView("people");

SQL (Example)
    // Query the temporary view registered as "people"
    Dataset<Row> sqlDF = spark.sql("SELECT name FROM people");
    sqlDF.show();
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+

SQL (UDF)
- User-Defined Functions (UDFs)
- Create your own using Java, Python or Scala
- Register a custom UDF and use it in SQL queries

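SQL (UDF Example)
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // A UDF that upper-cases a string column.
    static UDF1<String, String> toUpper = new UDF1<String, String>() {
        @Override
        public String call(final String str) throws Exception {
            return str.toUpperCase();
        }
    };

    // Register the UDF on the session, then call it from SQL.
    spark.udf().register("toUpper", toUpper, DataTypes.StringType);
    spark.sql("SELECT toUpper(name) FROM people").show();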

SQL (UDAF)
- User-Defined Aggregate Functions (UDAFs)
- Built-in functions: count(), countDistinct(), avg(), max(), min()
- Create your own using Java or Scala
- Register a custom UDAF and use it in SQL queries
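A rough sketch of the lifecycle a custom aggregate typically follows: start from an empty buffer, fold each input into it, merge the buffers of different partitions, and produce a final value. The step names mirror the zero/reduce/merge/finish methods of Spark's Aggregator class, but the `Aggregate` interface below is a hypothetical plain-Java stand-in, not the Spark API, and the "range" aggregate (max minus min) is chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java imitation of how an aggregate function is evaluated.
final class RangeUdafSketch {

    // Hypothetical contract; it imitates the steps of an aggregate, not a real Spark type.
    interface Aggregate<IN, BUF, OUT> {
        BUF zero();                       // empty buffer
        BUF reduce(BUF buffer, IN input); // fold one input value into the buffer
        BUF merge(BUF left, BUF right);   // combine buffers from two partitions
        OUT finish(BUF buffer);           // turn the final buffer into the result
    }

    // Intermediate buffer: smallest and largest value seen so far.
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    static final Aggregate<Long, MinMax, Long> RANGE = new Aggregate<Long, MinMax, Long>() {
        @Override
        public MinMax zero() {
            return new MinMax(Long.MAX_VALUE, Long.MIN_VALUE);
        }

        @Override
        public MinMax reduce(MinMax buffer, Long input) {
            return new MinMax(Math.min(buffer.min, input), Math.max(buffer.max, input));
        }

        @Override
        public MinMax merge(MinMax left, MinMax right) {
            return new MinMax(Math.min(left.min, right.min), Math.max(left.max, right.max));
        }

        @Override
        public Long finish(MinMax buffer) {
            // An untouched buffer (no input rows) yields 0 in this sketch.
            return buffer.min > buffer.max ? 0L : buffer.max - buffer.min;
        }
    };

    // Aggregate each partition separately, then merge the partial buffers.
    static long rangeOverPartitions(List<List<Long>> partitions) {
        MinMax total = RANGE.zero();
        for (List<Long> partition : partitions) {
            MinMax partial = RANGE.zero();
            for (Long value : partition) {
                partial = RANGE.reduce(partial, value);
            }
            total = RANGE.merge(total, partial);
        }
        return RANGE.finish(total);
    }

    public static void main(String[] args) {
        // prints 8 (max 9 minus min 1)
        System.out.println(rangeOverPartitions(
                Arrays.asList(Arrays.asList(4L, 9L), Arrays.asList(1L, 7L))));
    }
}
```

Keeping merge separate from reduce is what lets the aggregate be computed per partition and combined afterwards.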

SQL (Virtual Tables)
- Virtual tables are not yet supported by Apache Spark
- ExaremeSpark is an extension that includes virtual tables
- Create your own using MapReduce algorithms
- Register custom virtual tables and use them in SQL queries

SQL (Virtual Tables Example)
    exaremespark.sql("select * from apachelogsplit('/path/of/access_log')").show();
    exaremespark.sql("select * from sample(HowMany, (select * from apachelogsplit('/path/of/access_log')))").show();
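To give a feel for what a virtual table such as apachelogsplit might compute, here is a plain-Java sketch that turns access-log lines into rows. It is not ExaremeSpark code: the column names, the assumption that the log uses the Apache Common Log Format, and the choice to skip unparseable lines are all illustrative guesses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: splits Common Log Format lines into columns, one row per line.
final class ApacheLogSplitSketch {

    // Assumed column names; the real virtual table may expose different ones.
    static final String[] COLUMNS = {"host", "ident", "user", "time", "request", "status", "bytes"};

    // host ident user [time] "request" status bytes
    private static final Pattern COMMON_LOG = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Returns one String[] row per line that matches; other lines are skipped.
    static List<String[]> split(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = COMMON_LOG.matcher(line);
            if (matcher.find()) {
                String[] row = new String[COLUMNS.length];
                for (int i = 0; i < row.length; i++) {
                    row[i] = matcher.group(i + 1);
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326",
                "not a log line");
        for (String[] row : split(lines)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

A query engine would expose rows like these as a table, so that `select * from ...` can filter and project them like any other relation.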

- Official Spark site: https://spark.apache.org/
- GitHub: https://github.com/apache/spark
- Official Spark SQL guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
- Useful online book: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

Projects using extended Spark SQL
- Machine Learning
- Privacy Preserving
  - Differential Privacy
  - K-anonymity
- Text mining
  - A library for text mining (e.g. stemmer, keyword extraction, stopword removal, high-pass text filtering, etc.)
- Whatever is published in top computer science conferences (e.g. SIGMOD, VLDB, etc.)

Other Projects
- Scheduling
- Optimization
- Storage formats
- Data transfer
- Compression