CS 239 – Big Data Systems Fall 2018


CS 239 – Big Data Systems, Fall 2018
Harry Xu, UCLA

My Research Background
- Programming languages and compilers
  - Static and dynamic program analysis
  - Compiler
  - Runtime system
- Big Data systems
  - Dataflow systems
  - Graph systems
  - Distributed systems
  - Single-machine disk-based systems
- Some industrial experience
  - Microsoft – created and solely developed an optimizing compiler for Cosmos/Scope that improved the overall performance of production jobs by up to 3X
  - IBM – created and developed a series of profiling tools for large-scale systems
- Big Data system support for scalable program analysis
- Language/runtime support for scalable systems

[Figure: an "Application Circle" and an "Infrastructure Circle"; labeled item: BigDatalog]

This Course: Big Data Systems
- What it is about
  - Low-level infrastructures
  - Programming models
  - Runtimes
  - Scalability and efficiency
- What it is NOT about
  - High-level applications
  - Workloads
  - Data collection and usage
- An example
  - We are going to discuss some papers on machine learning systems
  - We are NOT going to discuss learning models and algorithms because I don't know much about them

Industrial Relevance
- Many papers came directly from industry
  - GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
  - HDFS (Yahoo)
  - Azure, Trill, Dryad, Naiad (Microsoft)
  - Spark, Tachyon (Databricks)
- Applications vs. systems
  - Many people can develop applications
  - Few people can develop systems
  - Applications are specific to domains, while the skills required to build infrastructures are generic

Goals to Achieve
- Understand what systems are available for data analytics
- Understand fundamental challenges in system design
- Understand how to design a customized system for a certain workload
- Gain experience with system development by proposing and implementing a new idea

What This Course is Related To
- Distributed systems
- Database systems
- Computer architecture
- Networking
- Storage (memory, disk, file system, etc.)
- Graph algorithms
- Statistics
- Machine learning

Aspects of Big Data Processing
- Where to put data?
- How to process data at scale?
- How to process different types of data?
  - Structured data
  - Unstructured data
  - Streaming data
  - Graph data
  - Data for model training
- How to take advantage of technological advances?
- How to make processing efficient?

Topics Covered (I)
- Distributed storage systems
  - HDFS, GFS, Bigtable, Spanner, and Azure storage
- Dataflow engines
  - MapReduce, Dryad, AsterixDB, Spark
- Batch processing
  - Hive, Spark SQL, and SCOPE
- Resource management
  - Mesos, YARN, LATE, Borg, Sparrow
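As a rough illustration of the programming model behind dataflow engines such as MapReduce, here is a minimal single-process sketch of a MapReduce-style word count. It is not the API of Hadoop, Spark, or Google's MapReduce; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and a real engine would distribute the map, shuffle, and reduce phases across machines.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Combine all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Hypothetical driver: map every line, group by key, then reduce."""
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["big data systems", "big data"]))
```

The user supplies only `map_fn` and `reduce_fn`; the grouping step in between is what the engine provides, and it is the step a distributed implementation would have to partition and make fault tolerant.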

Topics Covered (II)
- Stream processing
  - Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
- Graph processing
  - Pregel, Ligra, GraphChi, X-Stream, GridGraph
- Machine learning
  - TensorFlow, Parameter Servers, Project Adam

Why Do We Need Those Systems?
- Enablers
- Better performance
  - Scalability
  - Efficiency
  - Energy
- Easy/flexible programmability

Course Structure
- Paper critiques
  - Due before each presentation day
- Presentation
  - 20-25 mins
  - Participation in active discussion
- Project
  - 2-3 students form a group, working on an innovative idea in system development

Things about Presentations/Critiques
- Reuse slides as much as possible
- A good rule of thumb is to follow this order:
  1. What problems does the paper solve?
  2. Why are they (serious) problems?
  3. Why aren't they already solved?
  4. What are the main challenges?
  5. How did the authors overcome them?
  6. What evidence did the authors show that the problems are solved?
  7. Questions, concerns, opportunities for improvement