Projects on Extended Apache Spark

MapReduce
The Map function turns each input element into zero or more key-value pairs. A "key" in this sense is not unique; in fact, it is important that many pairs with the same key are generated as the Map function is applied to all the input elements.
The Reduce function is applied, for each key, to the list of values associated with that key. The result is a pair consisting of the key and whatever the Reduce function produces from that list of values.

WordCount Example
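
A minimal sketch of WordCount in the Map/Reduce style described above, written with plain Java collections instead of a cluster framework. The class name `WordCount`, the helper `run`, and the two sample input lines are assumptions made for illustration; grouping the pairs by key stands in for the shuffle step that a real MapReduce runtime performs between Map and Reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCount {
    // Map: turn one input element (a line) into zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: collapse the list of values collected for one key into (key, sum).
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    // Driver (hypothetical): map every line, group values by key, reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> out = reduce(e.getKey(), e.getValue());
            counts.put(out.getKey(), out.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input, chosen only for illustration.
        System.out.println(run(Arrays.asList("to be or not to be", "to do")));
        // prints {be=2, do=1, not=1, or=1, to=3}
    }
}
```

The key "to" is emitted three times by Map, which is exactly the non-unique-key behaviour the Reduce step relies on.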

Fast and general-purpose cluster computing system
In-memory nature
High-level built-in APIs in Java, Scala, Python and R

popularity over the years

advanced analytics

Spark SQL
Spark module for structured data processing
Can act as a distributed SQL query engine
Interaction with Spark SQL via SQL and the Dataset API

SQL (UDF)
User-Defined Functions (aka UDFs)
Create your own using Java, Python or Scala
Register a custom UDF and use it in SQL queries

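SQL (UDF Example)

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // UDF that upper-cases a string value (it does not handle null input).
    static UDF1<String, String> toUpper = new UDF1<String, String>() {
        @Override
        public String call(final String str) throws Exception {
            return str.toUpperCase();
        }
    };

    // Register the UDF under the name used in SQL, then call it from a query.
    // `spark` is the SparkSession; `people` must already be a registered table or view.
    spark.udf().register("toUpper", toUpper, DataTypes.StringType);
    spark.sql("SELECT toUpper(name) FROM people").show();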

SQL (UDAF)
User-Defined Aggregate Functions (aka UDAFs)
Built-in functions: count(), countDistinct(), avg(), max(), min()
Create your own using Java or Scala
Register a custom UDAF and use it in SQL queries
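
A custom aggregate generally follows a four-step shape: start from an empty buffer, fold each input value into it, merge buffers computed on different partitions, and turn the final buffer into a result. The sketch below shows that shape in plain Java for a value range (max minus min), which is not among the built-ins listed above. It is not Spark code: the class `RangeAggregate` and the helper `aggregatePartition` are hypothetical, and the method names are borrowed from Spark's `Aggregator` class (`zero`, `reduce`, `merge`, `finish`) only to make the correspondence easy to see.

```java
import java.util.Arrays;
import java.util.List;

class RangeAggregate {
    // Intermediate buffer: smallest and largest value seen so far.
    static final class Buffer {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
    }

    // Empty buffer, the starting point on every partition.
    static Buffer zero() {
        return new Buffer();
    }

    // Fold one input value into the buffer.
    static Buffer reduce(Buffer b, double value) {
        b.min = Math.min(b.min, value);
        b.max = Math.max(b.max, value);
        return b;
    }

    // Combine two partial buffers computed independently.
    static Buffer merge(Buffer a, Buffer b) {
        a.min = Math.min(a.min, b.min);
        a.max = Math.max(a.max, b.max);
        return a;
    }

    // Produce the final result; NaN if no value was ever seen.
    static double finish(Buffer b) {
        return b.min > b.max ? Double.NaN : b.max - b.min;
    }

    // Hypothetical helper: aggregate the values of one partition.
    static Buffer aggregatePartition(List<Double> values) {
        Buffer b = zero();
        for (double v : values) {
            reduce(b, v);
        }
        return b;
    }

    public static void main(String[] args) {
        // Two "partitions" with sample values, chosen only for illustration.
        Buffer p1 = aggregatePartition(Arrays.asList(1.0, 2.0, 3.0));
        Buffer p2 = aggregatePartition(Arrays.asList(4.0, 6.0));
        System.out.println(finish(merge(p1, p2))); // prints 5.0
    }
}
```

Keeping `merge` separate from `reduce` is what lets partial results from different partitions be combined without revisiting the input rows.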

SQL (Virtual Tables)
Virtual tables are not yet supported by Apache Spark
ExaremeSpark is an extension that adds virtual tables
Create your own using MapReduce algorithms
Register custom virtual tables and use them in SQL queries

SQL (Virtual Tables Example)
    exaremespark.sql("select * from apachelogsplit('/path/of/access_log')").show();
    exaremespark.sql("select * from sample(HowMany, (select * from apachelogsplit('/path/of/access_log')))").show();

Official Spark site: https://spark.apache.org/
GitHub:
Official Spark SQL site:
Useful online book:

Projects using extended Spark SQL
Machine Learning
Privacy Preserving
    Differential Privacy
    K-anonymity
Text mining
    A library for text mining (e.g. stemmer, keyword extraction, stopword removal, high-pass text filtering, etc.)
Whatever is published in top computer science conferences (e.g. SIGMOD, VLDB, etc.)

Other Projects
Scheduling
Optimization
Storage formats
Data transfer
Compression
