CS 239 – Big Data Systems Fall 2018

CS 239 – Big Data Systems Fall 2018 Harry Xu UCLA

My Research Background
- Programming languages and compilers
  - Static and dynamic program analysis
  - Compiler
  - Runtime system
- Big Data systems
  - Dataflow systems
  - Graph systems
  - Distributed systems
  - Single-machine disk-based systems
- Some industrial experience
  - Microsoft: created and solely developed an optimizing compiler for Cosmos/Scope that improved the overall performance of production jobs by up to 3X
  - IBM: created and developed a series of profiling tools for large-scale systems
- Big Data system support for scalable program analysis
- Language/runtime support for scalable systems

[Figure: diagram with the labels "BigDatalog", "Application Circle", and "Infrastructure Circle"]

This Course: Big Data Systems
- What it is about
  - Low-level infrastructures
  - Programming models
  - Runtimes
  - Scalability and efficiency
- What it is NOT about
  - High-level applications
  - Workloads
  - Data collection and usage
- An example
  - We are going to discuss some papers on machine learning systems
  - We are NOT going to discuss learning models and algorithms because I don’t know much about them

Industrial Relevance
- Many papers came directly from industry
  - GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
  - HDFS (Yahoo)
  - Azure, Trill, Dryad, Naiad (Microsoft)
  - Spark, Tachyon (Databricks)
- Applications vs. systems
  - Many people can develop applications
  - Few people can develop systems
  - Applications are specific to domains, while the skills required to build infrastructures are generic

Goals to Achieve
- Understand what systems are available for data analytics
- Understand fundamental challenges in system design
- Understand how to design a customized system for a certain workload
- Gain experience with system development by proposing and implementing a new idea

What This Course is Related To
- Distributed systems
- Database systems
- Computer architecture
- Networking
- Storage (memory, disk, file system, etc.)
- Graph algorithms
- Statistics
- Machine learning

Aspects of Big Data Processing
- Where to put data?
- How to process data at scale?
- How to process different types of data?
  - Structured data
  - Unstructured data
  - Streaming data
  - Graph data
  - Data for model training
- How to take advantage of technological advances?
- How to make processing efficient?

Topics Covered (I)
- Distributed storage systems
  - HDFS, GFS, Bigtable, Spanner, and Azure storage
- Dataflow engines
  - MapReduce, Dryad, AsterixDB, Spark
- Batch processing
  - Hive, Spark SQL, and SCOPE
- Resource management
  - Mesos, YARN, LATE, Borg, Sparrow

Topics Covered (II)
- Stream processing
  - Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
- Graph processing
  - Pregel, Ligra, GraphChi, X-Stream, GridGraph
- Machine learning
  - TensorFlow, Parameter Servers, Project Adam

Why Do We Need Those Systems?
- Enablers
- Better performance
- Scalability
- Efficiency
- Energy
- Easy/flexible programmability

Course Structure
- Paper critiques
  - Due before each presentation day
- Presentation
  - 20-25 mins
- Participation in active discussion
- Project
  - 2-3 students form a group, working on an innovative idea in system development

Things about Presentations/Critiques
- Reuse slides as much as possible
- A good rule of thumb is to follow this order:
  1. What problems does the paper solve?
  2. Why are they (serious) problems?
  3. Why aren’t they already solved?
  4. What are the main challenges?
  5. How did the authors overcome them?
  6. What evidence did the authors show that the problems are solved?
  7. Questions, concerns, opportunities for improvement