CS 239 – Big Data Systems Fall 2018 Harry Xu UCLA
My Research Background
  Programming languages and compilers
    Static and dynamic program analysis
    Compiler
    Runtime system
  Big Data systems
    Dataflow systems
    Graph systems
    Distributed systems
    Single-machine disk-based systems
  Some industrial experience
    Microsoft – created and solely developed an optimizing compiler for Cosmos/Scope that improved the overall performance of production jobs by up to 3X
    IBM – created and developed a series of profiling tools for large-scale systems
  Big Data system support for scalable program analysis
  Language/runtime support for scalable systems
[Figure: Application circle and Infrastructure circle; label: BigDatalog]
This Course: Big Data Systems
  What it is about
    Low-level infrastructures
    Programming models
    Runtimes
    Scalability and efficiency
  What it is NOT about
    High-level applications
    Workloads
    Data collection and usage
  An example
    We are going to discuss some papers on machine learning systems
    We are NOT going to discuss learning models and algorithms because I don’t know much about them
Industrial Relevance
  Many papers came directly from industry
    GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
    HDFS (Yahoo)
    Azure, Trill, Dryad, Naiad (Microsoft)
    Spark, Tachyon (Databricks)
  Applications vs. systems
    Many people can develop applications
    Few people can develop systems
    Applications are specific to domains, while the skills required to build infrastructures are generic
Goals to Achieve
  Understand what systems are available for data analytics
  Understand fundamental challenges in system design
  Understand how to design a customized system for a certain workload
  Gain experience with system development by proposing and implementing a new idea
What This Course is Related To
  Distributed systems
  Database systems
  Computer Architecture
  Networking
  Storage (memory, disk, file system, etc.)
  Graph algorithms
  Statistics
  Machine learning
Aspects of Big Data Processing
  Where to put data?
  How to process data at scale?
  How to process different types of data?
    Structured data
    Unstructured data
    Streaming data
    Graph data
    Data for model training
  How to take advantage of technological advances?
  How to make processing efficient?
Topics Covered (I)
  Distributed storage systems
    HDFS, GFS, Bigtable, Spanner, and Azure storage
  Dataflow engines
    MapReduce, Dryad, AsterixDB, Spark
  Batch processing
    Hive, Spark SQL, and SCOPE
  Resource Management
    Mesos, YARN, LATE, Borg, Sparrow
Topics Covered (II)
  Stream processing
    Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
  Graph processing
    Pregel, Ligra, GraphChi, Xstream, GridGraph
  Machine learning
    TensorFlow, Parameter Servers, Project Adam
Why Do We Need Those Systems?
  Enablers
  Better performance
  Scalability
  Efficiency
  Energy
  Easy/flexible programmability
Course Structure
  Paper critiques
    Due before each presentation day
  Presentation
    20-25 mins
  Participation in active discussion
  Project
    2-3 students form a group, working on an innovative idea in system development
Things about Presentations/Critiques
  Reuse slides as much as possible
  A good rule of thumb is to follow this order
    What problems does the paper solve?
    Why are they (serious) problems?
    Why aren’t they already solved?
    What are the main challenges?
    How did the authors overcome them?
    What evidence did the authors show that the problems are solved?
  Questions, concerns, opportunities for improvement