Published by Dortha Griffith. Modified over 9 years ago.
1
Spark 1.0 and Beyond. Patrick Wendell, Databricks. spark.incubator.apache.org
2
About me: Committer and PMC member of Apache Spark. “Former” PhD student at Berkeley. Release manager for Spark 1.0. Background in networking and distributed systems.
3
Today’s Talk: Spark background. About the Spark release process. The Spark 1.0 release. Looking forward to Spark 1.1.
4
What is Spark? Fast and expressive cluster computing engine compatible with Apache Hadoop. Efficient: general execution graphs, in-memory storage; up to 10× faster on disk, 100× in memory. Usable: rich APIs in Java, Scala, Python; interactive shell; 2-5× less code.
6
30-Day Commit Activity
7
Spark Philosophy: Make life easy and productive for data scientists. Well-documented, expressive APIs. Powerful domain-specific libraries. Easy integration with storage systems … and caching to avoid data movement. Predictable releases, stable APIs.
8
Spark Release Process: Quarterly release cycle (3 months): 2 months of general development, then 1 month of polishing, QA and fixes. Timeline: Spark 1.0: Feb 1, April 8th, April 8th+. Spark 1.1: May 1, July 8th, July 8th+.
9
Spark 1.0 by the numbers: 3 months of development; 639 patches; 200+ JIRA issues; 100+ contributors.
10
API Stability in 1.X: APIs are stable for all non-alpha projects. Spark 1.1, 1.2, … will be compatible. @DeveloperApi: internal API that is unstable. @Experimental: user-facing API that might stabilize later.
11
Today’s Talk: About the Spark release process. The Spark 1.0 release. Looking forward to Spark 1.1.
12
Spark 1.0 Features: Core engine improvements. Spark Streaming. MLlib. Spark SQL.
13
Spark Core: History server for Spark UI. Integration with YARN security model. Unified job submission tool. Java 8 support. Internal engine improvements.
14
History Server
Configure with:
    spark.eventLog.enabled=true
    spark.eventLog.dir=hdfs://XX
In Spark Standalone, the history server is embedded in the master. In YARN/Mesos, run the history server as a daemon.
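The two event-log properties above can also be set programmatically. The following is a minimal sketch, assuming Spark 1.0 is on the classpath; the object name and application name are hypothetical, and the log directory is the slide's own placeholder, kept as is.

```scala
import org.apache.spark.SparkConf

object EventLogConf {
  // Builds a SparkConf with event logging enabled, so that a history
  // server can later rebuild the application's UI from the logged events.
  def build(): SparkConf =
    new SparkConf(false) // false: do not load spark.* JVM system properties
      .setAppName("EventLogDemo") // hypothetical application name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://XX") // placeholder path from the slide

  def main(args: Array[String]): Unit =
    println(build().toDebugString)
}
```

Setting the properties in code ties them to one application; a shared configuration file is the more likely choice when every job on a cluster should log events.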
15
Job Submission Tool
Apps don’t need to hard-code master:
    conf = new SparkConf().setAppName("My App")
    sc = new SparkContext(conf)
    ./bin/spark-submit \
      --class my.main.Class \
      --name myAppName \
      --master local[4]        # or: --master spark://some-cluster
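As a rough illustration of an application that leaves the master unset, here is a minimal sketch, assuming Spark 1.0 on the classpath. The object name MyApp and the toy computation are hypothetical; the master is expected to come from spark-submit's --master option.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster call here: the master is supplied at submission time,
    // so the same jar can run locally or on a cluster unchanged.
    val conf = new SparkConf().setAppName("My App")
    val sc = new SparkContext(conf)
    try {
      // Toy job (illustrative only): sum the integers 1 to 100.
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"Sum of 1..100 = $total")
    } finally {
      sc.stop() // release cluster resources even if the job fails
    }
  }
}
```

Run directly without spark-submit, this sketch would fail at context creation because no master is configured, which is the trade-off of not hard-coding one.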
16
Java 8 Support
RDD operations can use lambda syntax.
Old:
    class Split extends FlatMapFunction<String, String> {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    }
    JavaRDD<String> words = lines.flatMap(new Split());
New:
    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
17
Java 8 Support. NOTE: Minor API changes. (a) If you are extending Function classes, use implements rather than extends. (b) Return-type sensitive functions: mapToPair, mapToDouble.
18
Python API Coverage. RDD operators: intersection(), take(), top(), takeOrdered(). Metadata: name(), id(), getStorageLevel(). Runtime configuration: setJobGroup(), setLocalProperty().
19
Integration with YARN Security
Supports Kerberos authentication in YARN environments:
    spark.authenticate = true
ACL support for user interfaces:
    spark.ui.acls.enable = true
    spark.ui.view.acls = patrick, matei
20
Engine Improvements: Job cancellation directly from UI. Garbage collection of shuffle and RDD data.
21
Documentation: Unified Scaladocs across modules. Expanded MLlib guide. Deployment and configuration specifics. Expanded API documentation.
22
Spark stack (diagram). Core: Spark RDDs, Transformations, and Actions. Spark Streaming (real-time): DStreams, streams of RDDs. Spark SQL: SchemaRDDs. MLlib (machine learning): RDD-based matrices.
23
Spark SQL
24
Turning an RDD into a Relation
    // Define the schema using a case class.
    case class Person(name: String, age: Int)
    // Create an RDD of Person objects, register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")
25
Querying using SQL
    // SQL statements can be run directly on RDDs
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect()
    // Language integrated queries (à la LINQ)
    val teenagers = people.where('age >= 10).where('age <= 19).select('name)
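The two Spark SQL slides can be combined into one small program. The following is a minimal sketch, assuming the Spark 1.0 SQL API (SQLContext, registerAsTable) is on the classpath; the object name, the helper teenagerNames, and the in-memory sample rows are hypothetical stand-ins for the people.txt file used on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Defined at top level, as reflection-based schema inference generally expects.
case class Person(name: String, age: Int)

object TeenagerQuery {
  // Registers sample rows as a table and returns the names aged 13 to 19.
  def teenagerNames(sc: SparkContext): Array[String] = {
    val sqlContext = new SQLContext(sc)
    import sqlContext._ // implicit RDD-to-SchemaRDD conversion, plus sql()

    // Hypothetical sample data in place of people.txt.
    val people = sc.parallelize(Seq(
      Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)))
    people.registerAsTable("people")

    sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
      .map(row => row.getString(0))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Master is left unset so that spark-submit can supply it.
    val sc = new SparkContext(new SparkConf().setAppName("TeenagerQuery"))
    try teenagerNames(sc).foreach(println)
    finally sc.stop()
  }
}
```

With these sample rows only Justin, aged 19, falls inside the queried range. Later Spark releases changed this API, so the sketch applies to the 1.0 line discussed in the talk.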
26
Import and Export
    // Save SchemaRDDs directly to Parquet
    people.saveAsParquetFile("people.parquet")
    // Load data stored in Hive
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    // Queries can be expressed in HiveQL.
    hql("FROM src SELECT key, value")
27
In-Memory Columnar Storage: Spark SQL can cache tables using an in-memory columnar format. Scan only required columns. Fewer allocated objects (less GC). Automatically selects best compression.
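To show why a columnar layout lets a query scan only the columns it needs, here is a small standalone sketch in plain Scala. It is a conceptual illustration only, not Spark SQL's actual cache implementation; all names in it are hypothetical.

```scala
object ColumnarSketch {
  // Row-oriented layout: one object per record.
  final case class PersonRow(name: String, age: Int)

  // Column-oriented layout: one array per field. A query over `ages`
  // reads a single primitive array and never touches `names`.
  final case class PersonColumns(names: Array[String], ages: Array[Int])

  // Pivots a sequence of rows into per-field arrays.
  def toColumns(rows: Seq[PersonRow]): PersonColumns =
    PersonColumns(rows.map(_.name).toArray, rows.map(_.age).toArray)

  // Scans only the age column; returns 0.0 for an empty table.
  def averageAge(cols: PersonColumns): Double =
    if (cols.ages.isEmpty) 0.0
    else cols.ages.sum.toDouble / cols.ages.length

  def main(args: Array[String]): Unit = {
    val rows = Seq(PersonRow("Michael", 29), PersonRow("Andy", 30), PersonRow("Justin", 19))
    println(averageAge(toColumns(rows)))
  }
}
```

Three rows become two arrays rather than three record objects, which hints at the "fewer allocated objects" point, and a same-typed array is also an easier target for compression.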
28
Spark Streaming: Web UI for streaming. Graceful shutdown. User-defined input streams. Support for creating in Java. Refactored API.
29
MLlib: Sparse vector support. Decision trees. Linear algebra: SVD and PCA. Evaluation support. 3 contributors in the last 6 months.
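For the sparse vector support mentioned above, the following is a minimal sketch, assuming Spark 1.0 MLlib on the classpath. The object name and the sample values are hypothetical.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SparseVectorSketch {
  // Dense form stores every entry, including zeros.
  val dense: Vector = Vectors.dense(1.0, 0.0, 0.0, 3.0)

  // Sparse form stores the size plus the indices and values of the
  // non-zero entries only.
  val sparse: Vector = Vectors.sparse(4, Array(0, 3), Array(1.0, 3.0))

  def main(args: Array[String]): Unit = {
    // Both representations describe the same 4-element vector.
    println(dense.toArray.sameElements(sparse.toArray))
  }
}
```

The sparse form tends to pay off when most entries are zero, as in bag-of-words text features.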
30
MLlib
Note: Minor API change
Before:
    val data = sc.textFile("data/kmeans_data.txt")
    val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
    val clusters = KMeans.train(parsedData, 4, 100)
After:
    val data = sc.textFile("data/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
    val clusters = KMeans.train(parsedData, 4, 100)
31
1.1 and Beyond: Data import/export leveraging Catalyst (HBase, Cassandra, etc.). Shark-on-Catalyst. Performance optimizations: external shuffle, pluggable storage strategies. Streaming: reliable input from Flume and Kafka.
32
Unifying Experience: SchemaRDD represents a consistent integration point for data sources. spark-submit abstracts the environmental details (YARN, hosted cluster, etc.). API stability across versions of Spark.
33
Conclusion: Visit spark.apache.org for videos, tutorials, and hands-on exercises. Help us test a release candidate! Spark Summit on June 30th: spark-summit.org. Meetup group: meetup.com/spark-users.