Projects on Extended Apache Spark


1 Projects on Extended Apache Spark

2 MapReduce
The Map function turns each input element into zero or more key-value pairs. A "key" in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values.

3 WordCount Example
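The slide's own WordCount figure did not survive in this transcript. As a stand-in, here is a minimal plain-Java sketch of word count in the Map/Reduce shape described on the MapReduce slide. It is not Hadoop or Spark code; the class name `WordCountSketch` and its method names are assumptions made for illustration, and grouping by key is done in memory rather than across a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of MapReduce word count (hypothetical names, no cluster involved).
class WordCountSketch {

    // Map: one input line -> zero or more (word, 1) pairs. Keys are not unique.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: (key, list of values) -> (key, sum of the values).
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    // Driver: apply Map to every input, group values by key, apply Reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> out = reduce(e.getKey(), e.getValue());
            result.put(out.getKey(), out.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do");
        System.out.println(wordCount(lines)); // {be=2, do=1, not=1, or=1, to=3}
    }
}
```

In a real MapReduce system the grouping step (the shuffle) is performed by the framework across machines; only `map` and `reduce` are written by the user.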

4 fast and general-purpose cluster computing system
in-memory nature
high-level built-in APIs in Java, Scala, Python and R

5 popularity over the years

6 advanced analytics

7 SQL
Spark module for structured data processing
can act as a distributed SQL query engine
interaction with Spark SQL via SQL and the Dataset API

8 SQL (Example)
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
df.select("name").show();
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+
df.createOrReplaceTempView("people");

9 SQL (Example)
Dataset<Row> sqlDF = spark.sql("SELECT name FROM people");
sqlDF.show();
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

10 SQL (UDF)
User-Defined Functions (aka UDFs)
Create your own using Java, Python or Scala
Register a custom UDF and use it in SQL queries

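11 SQL (UDF Example)
// needs: org.apache.spark.sql.api.java.UDF1, org.apache.spark.sql.types.DataTypes
static UDF1<String, String> toUpper = new UDF1<String, String>() {
    @Override
    public String call(final String str) throws Exception {
        // guard against NULL column values
        return str == null ? null : str.toUpperCase();
    }
};
// register on the same SparkSession that runs the query
spark.udf().register("toUpper", toUpper, DataTypes.StringType);
spark.sql("SELECT toUpper(name) FROM people").show();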

12 SQL (UDAF)
User-Defined Aggregate Functions (aka UDAFs)
Built-in functions: count(), countDistinct(), avg(), max(), min()
Create your own using Java or Scala
Register a custom UDAF and use it in SQL queries
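To make the idea of a custom aggregate concrete, the following plain-Java sketch computes an average in the zero / reduce / merge / finish shape that Spark's `Aggregator` class also follows. The `SimpleAggregator` interface, the `AverageSketch` class and the `run` helper are hypothetical names used only here; this is not the Spark API and it has no Spark dependency.

```java
import java.util.List;

// Hypothetical interface: mirrors the zero/reduce/merge/finish shape of an
// aggregate function, but is NOT a Spark class.
interface SimpleAggregator<IN, BUF, OUT> {
    BUF zero();                        // empty buffer
    BUF reduce(BUF buffer, IN input);  // fold one input value into a buffer
    BUF merge(BUF left, BUF right);    // combine buffers from two partitions
    OUT finish(BUF buffer);            // turn the final buffer into the result
}

// Average of doubles; the buffer is {sum, count}.
class AverageSketch implements SimpleAggregator<Double, double[], Double> {
    public double[] zero() {
        return new double[] {0.0, 0.0};
    }

    public double[] reduce(double[] buffer, Double input) {
        return new double[] {buffer[0] + input, buffer[1] + 1};
    }

    public double[] merge(double[] left, double[] right) {
        return new double[] {left[0] + right[0], left[1] + right[1]};
    }

    public Double finish(double[] buffer) {
        if (buffer[1] == 0) {
            return null; // like SQL avg() over no rows
        }
        return buffer[0] / buffer[1];
    }

    // Simulates distributed execution: fold each partition, then merge the buffers.
    static <IN, BUF, OUT> OUT run(SimpleAggregator<IN, BUF, OUT> agg,
                                  List<List<IN>> partitions) {
        BUF total = agg.zero();
        for (List<IN> partition : partitions) {
            BUF local = agg.zero();
            for (IN x : partition) {
                local = agg.reduce(local, x);
            }
            total = agg.merge(total, local);
        }
        return agg.finish(total);
    }

    public static void main(String[] args) {
        List<List<Double>> partitions =
            List.of(List.of(1.0, 2.0, 3.0), List.of(4.0, 5.0));
        System.out.println(run(new AverageSketch(), partitions)); // 3.0
    }
}
```

Keeping a (sum, count) buffer rather than a running average is what makes `merge` correct: partial averages from partitions of different sizes cannot simply be averaged again.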

13 SQL (Virtual Tables)
Virtual tables are not yet supported by Apache Spark
ExaremeSpark is an extension that adds virtual tables
Create your own using MapReduce algorithms
Register custom virtual tables and use them in SQL queries

14 SQL (Virtual Tables Example)
exaremespark.sql("select * from apachelogsplit('/path/of/access_log')").show();
exaremespark.sql("select * from sample(HowMany, (select * from apachelogsplit('/path/of/access_log')))").show();
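The deck does not show how `apachelogsplit` is implemented in ExaremeSpark. As a rough illustration of what a log-splitting virtual table might produce, the plain-Java sketch below turns lines in the Apache Common Log Format into rows of named columns. The class name, the column names and the choice to skip malformed lines are all assumptions, not ExaremeSpark behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splits Common Log Format lines into columns.
// This is NOT the ExaremeSpark implementation of apachelogsplit.
class ApacheLogSplitSketch {

    // Assumed column names for the resulting rows.
    static final String[] COLUMNS =
        {"host", "ident", "user", "time", "request", "status", "bytes"};

    // host ident user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Returns one row (one value per column), or null if the line does not match.
    static String[] splitLine(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] row = new String[COLUMNS.length];
        for (int i = 0; i < row.length; i++) {
            row[i] = m.group(i + 1);
        }
        return row;
    }

    // Splits every line, silently skipping lines that do not match (an assumption).
    static List<String[]> split(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] row = splitLine(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] row = splitLine(line);
        for (int i = 0; i < COLUMNS.length; i++) {
            System.out.println(COLUMNS[i] + " = " + row[i]);
        }
    }
}
```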

15 Official Spark site: https://spark.apache.org/
GitHub:
Official Spark SQL site:
Useful online book:

16 Projects using extended Spark SQL
Machine Learning
Privacy Preserving
  Differential Privacy
  K-anonymity
Text mining
  A library for text mining (e.g. stemmer, keyword extraction, stopword removal, high-pass text filtering, etc.)
Whatever is published in top computer science conferences (e.g. SIGMOD, VLDB, etc.)

17 Other Projects
Scheduling
Optimization
Storage formats
Data transfer
Compression

