Projects on Extended Apache Spark
MapReduce
The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values.
WordCount Example
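The Map and Reduce steps described above can be sketched in plain Java, without any cluster framework. This is only an illustration under assumptions: the class name `WordCount`, the helper `run`, and the in-memory grouping step that stands in for the shuffle are all hypothetical and are not taken from the slides or from Spark.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of WordCount in the MapReduce style.
class WordCount {

    // Map: one input line becomes zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: a key and its list of values become one (key, sum) pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<>(key, sum);
    }

    // Driver: apply Map to every line, group values by key, then apply Reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            Map.Entry<String, Integer> out = reduce(entry.getKey(), entry.getValue());
            result.put(out.getKey(), out.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(Arrays.asList("to be or not to be", "to do")));
    }
}
```

Note that the key "to" is emitted three times by Map, which is exactly the non-uniqueness the definition relies on: Reduce sees the list [1, 1, 1] for that key.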
Apache Spark
- fast and general-purpose cluster computing system
- in-memory nature
- high-level built-in APIs in Java, Scala, Python and R
popularity over the years
advanced analytics
SQL
- Spark module for structured data processing
- can act as a distributed SQL query engine
- interaction with Spark SQL via SQL and the Dataset API
SQL (Example)
    Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
    df.select("name").show();
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+
    df.createOrReplaceTempView("people");
SQL (Example)
    Dataset<Row> sqlDF = spark.sql("SELECT name FROM people");
    sqlDF.show();
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+
SQL (UDF)
- User-Defined Functions (aka UDFs)
- Create your own using Java, Python, or Scala
- Register a custom UDF and use it in SQL queries
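SQL (UDF Example)
    // Requires org.apache.spark.sql.api.java.UDF1 and org.apache.spark.sql.types.DataTypes.
    static UDF1<String, String> toUpper = new UDF1<String, String>() {
        @Override
        public String call(final String str) throws Exception {
            // Pass SQL NULL through instead of throwing a NullPointerException.
            return str == null ? null : str.toUpperCase();
        }
    };
    // Register on the same session that runs the query.
    spark.udf().register("toUpper", toUpper, DataTypes.StringType);
    spark.sql("SELECT toUpper(name) FROM people").show();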
SQL (UDAF)
- User-Defined Aggregate Functions (aka UDAFs)
- Built-in aggregate functions: count(), countDistinct(), avg(), max(), min()
- Create your own using Java or Scala
- Register a custom UDAF and use it in SQL queries
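To make the idea of a custom aggregate concrete, here is a minimal plain-Java sketch of the life cycle such a function typically follows: start from an empty buffer, update it per row, merge partial buffers from different partitions, and produce the final value. The interface `Aggregate`, the class `Average`, and the helper `aggregate` are hypothetical names chosen for illustration; this is not Spark's API, and registering a real UDAF with Spark involves Spark's own classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a user-defined aggregate; not Spark's API.
class UdafSketch {

    // Assumed contract: an empty buffer, a per-row update, a merge of partials, a final result.
    interface Aggregate<IN, BUF, OUT> {
        BUF zero();
        BUF update(BUF buffer, IN value);
        BUF merge(BUF left, BUF right);
        OUT finish(BUF buffer);
    }

    // Average over long values; the buffer holds {sum, count}.
    static final class Average implements Aggregate<Long, long[], Double> {
        public long[] zero() {
            return new long[] {0L, 0L};
        }
        public long[] update(long[] buffer, Long value) {
            return new long[] {buffer[0] + value, buffer[1] + 1};
        }
        public long[] merge(long[] left, long[] right) {
            return new long[] {left[0] + right[0], left[1] + right[1]};
        }
        public Double finish(long[] buffer) {
            // NaN for empty input is a choice made for this sketch only.
            return buffer[1] == 0 ? Double.NaN : (double) buffer[0] / buffer[1];
        }
    }

    // Aggregates each partition on its own, then merges the partial buffers.
    static <IN, BUF, OUT> OUT aggregate(Aggregate<IN, BUF, OUT> agg, List<List<IN>> partitions) {
        BUF total = agg.zero();
        for (List<IN> partition : partitions) {
            BUF partial = agg.zero();
            for (IN value : partition) {
                partial = agg.update(partial, value);
            }
            total = agg.merge(total, partial);
        }
        return agg.finish(total);
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(3000L, 4500L),
                Arrays.asList(3500L, 4000L));
        // prints 3750.0
        System.out.println(aggregate(new Average(), partitions));
    }
}
```

The merge step is what lets an aggregate run in a distributed setting: each partition is reduced independently, and only the small buffers are combined.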
SQL (Virtual Tables)
- Virtual tables are not yet supported by Apache Spark
- ExaremeSpark is an extension that includes virtual tables
- Create your own using MapReduce algorithms
- Register custom virtual tables and use them in SQL queries
SQL (Virtual Tables Example)
    exaremespark.sql("select * from apachelogsplit('/path/of/access_log')").show();
    exaremespark.sql("select * from sample(HowMany, (select * from apachelogsplit('/path/of/access_log')))").show();
Official Spark site: https://spark.apache.org/
GitHub: https://github.com/apache/spark
Official Spark SQL site: https://spark.apache.org/docs/latest/sql-programming-guide.html
Useful online book: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
Projects using extended Spark SQL
- Machine Learning
- Privacy Preserving
  - Differential Privacy
  - K-anonymity
- Text mining
  - A library for text mining (e.g. stemmer, keyword extraction, stopword removal, high-pass text filtering, etc.)
- Whatever is published in top computer science conferences (e.g. SIGMOD, VLDB, etc.)
Other Projects
- Scheduling
- Optimization
- Storage formats
- Data transfer
- Compression