META-pipe Architecture and design
Outline
- Architecture
- Authorization server
- Background: Spark
- What happens when a user submits a job?
- Failure handling
Architecture
AAI: SAML/OAuth 2.0 integration (authorization server)
Overview
Authorization server

Features:
- SAML 2.0 integration designed for the Elixir AAI
- OAuth 2.0: implicit flow, authorization code grant, client credentials (special clients only), bearer token introspection
- OIDC UserInfo endpoint
- Mapping table between internal user IDs and remote user IDs at the IdP
- Simple authorization based on URI prefix: storage/users/alex authorizes storage/users/alex/test.txt
- YAML-based configuration

Technologies:
- Dropwizard web framework
- Apache Oltu OAuth library
- Spring Security SAML
- Hibernate ORM
- PostgreSQL
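The URI-prefix rule is small enough to sketch. Below is a minimal, hypothetical Scala version (the object and method names are illustrative, not from the META-pipe code base); the only subtlety is normalizing the prefix so that storage/users/alex does not accidentally authorize storage/users/alexa.

// Hypothetical sketch of the URI-prefix authorization rule; not the
// actual META-pipe implementation.
object PrefixAuthz {
  def isAuthorized(grantedPrefix: String, resource: String): Boolean = {
    // Append "/" so "storage/users/alex" cannot match "storage/users/alexa"
    val prefix = if (grantedPrefix.endsWith("/")) grantedPrefix
                 else grantedPrefix + "/"
    resource == grantedPrefix || resource.startsWith(prefix)
  }

  def main(args: Array[String]): Unit = {
    assert(isAuthorized("storage/users/alex", "storage/users/alex/test.txt"))
    assert(!isAuthorized("storage/users/alex", "storage/users/alexa/x.txt"))
  }
}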
Background: Spark
Spark
“Apache Spark is a fast and general engine for large-scale data processing” - Spark website
- Provides interactive response times on large amounts of data
- Written in Scala, but can also be used from Java, Python and R
- Fault tolerant
RDD - Overview
- Immutable representation of a dataset
- Deterministic instantiation and transformation
- Distributed (partitions)
- Instantiated by transforming another RDD or from an input source, like a file on HDFS
- Computation close to the data
- Fault tolerant (based on lineage)
RDD - Example

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect() // Returns Seq[String]

Function serialization (example taken from the Spark paper)
RDD - Transformations and actions

Transformations:
Method                           Signature
map(f: T => U)                   RDD[T] => RDD[U]
filter(f: T => Bool)             RDD[T] => RDD[T]
groupByKey()                     RDD[(K, V)] => RDD[(K, Seq[V])]
join()                           (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
partitionBy(p: Partitioner[K])   RDD[(K, V)] => RDD[(K, V)]

Actions:
Method                   Signature
count()                  RDD[T] => Long
collect()                RDD[T] => Seq[T]
reduce(f: (T, T) => T)   RDD[T] => T
save(path: String)       Outputs the RDD to a storage system, e.g., HDFS or Amazon S3
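To make the transformation/action split concrete, here is a small sketch (not from the slides) that uses only methods from the tables above; spark is the same context handle as in the earlier example, and the input path is a placeholder. Nothing executes until the action count() is called.

val lines = spark.textFile("hdfs://...")                   // placeholder path
val pairs = lines.map(line => (line.split('\t')(0), line)) // transformation: lazy
val grouped = pairs.groupByKey()                           // transformation: RDD[(K, Seq[V])]
val distinctKeys = grouped.count()                         // action: triggers the job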
Submitting a job
Spark on Stallo
JobService
- Service that sits between the user interface and the execution back-end
- Isolates back-end errors from the end user
- Keeps track of:
  - which jobs (with parameters) have been submitted by which users
  - references to input and output datasets
  - different attempts to run a job (retries)
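A hypothetical sketch of this bookkeeping (the slides do not show the actual schema, so all names and fields below are assumptions):

// Assumed data model for the JobService's bookkeeping; illustrative only.
case class DatasetRef(id: String, location: String)  // reference to a stored dataset

case class Attempt(number: Int,                      // retry counter
                   startedAt: Long,                  // epoch millis
                   status: String)                   // e.g. "RUNNING", "FAILED", "DONE"

case class Job(id: String,
               userId: String,                       // who submitted the job
               parameters: Map[String, String],      // tool parameters as submitted
               inputs: Seq[DatasetRef],              // input dataset references
               outputs: Seq[DatasetRef],             // output dataset references
               attempts: Seq[Attempt])               // one entry per (re)run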
Job service workflow
Causes for failure
Systems becoming unavailable:
- Stallo reboot
- Shared file system unavailable
- Power outage
Administration:
- Re-deployment of META-pipe (new version, tool update)
- Reboot of the Spark cluster after a configuration update
Bugs:
- Tool parser errors
- Unexpected exceptions
Invalid input:
- The FASTQ file turned out to be a video file.
How to recover?
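One plausible recovery strategy (an assumption; the slides only pose the question) is to let the JobService retry a bounded number of times for transient failures, while failing fast on invalid input, where retrying cannot help:

// Illustrative retry loop; not the actual META-pipe recovery logic.
case class InvalidInputException(msg: String) extends Exception(msg)

def runWithRetries[T](maxAttempts: Int)(attempt: () => T): Either[Throwable, T] = {
  var last: Throwable = null
  var i = 0
  while (i < maxAttempts) {
    try return Right(attempt())
    catch {
      case e: InvalidInputException => return Left(e) // retrying cannot fix bad input
      case e: Exception             => last = e       // transient failure: retry
    }
    i += 1
  }
  Left(last) // retries exhausted: surface the error, keep the attempts on record
}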
User Interfaces
User submits a job
Submitting in a new process
- qsub
- spark-submit (cluster mode)
- VM creation
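As a sketch of the spark-submit path (the entry-point class, jar name and application arguments below are placeholders, not META-pipe's real ones), the back-end can fork the driver as a separate OS process, e.g. via scala.sys.process:

import scala.sys.process._

// Hypothetical invocation; class name, jar and arguments are placeholders.
val cmd = Seq(
  "spark-submit",
  "--deploy-mode", "cluster",          // driver runs inside the cluster
  "--class", "example.metapipe.Main",  // assumed entry point
  "metapipe-assembly.jar",             // assumed application jar
  "--job-id", "1234"                   // assumed application argument
)
val exitCode = cmd.!                   // blocks until the child process exits
if (exitCode != 0)
  println(s"spark-submit failed with exit code $exitCode")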
Snapshotting
- Spark tool RDDs are dumped to disk when computed
- Simple if-test to see if a tool has already run
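The if-test can be as simple as checking whether the tool's dump directory already exists. The sketch below (assumed layout and names, using the standard Hadoop FileSystem API) reloads the snapshot when it exists, and otherwise computes the RDD and saves it:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Assumed snapshotting helper; illustrative, not META-pipe's actual code.
def runSnapshotted(sc: SparkContext, snapshotPath: String)
                  (compute: => RDD[String]): RDD[String] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(snapshotPath))) {
    sc.textFile(snapshotPath)            // tool already ran: reuse the dump
  } else {
    compute.saveAsTextFile(snapshotPath) // run the tool and dump the RDD
    sc.textFile(snapshotPath)            // hand back the persisted copy
  }
}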
Challenges (TODO)
- Automatic scaling based on queue size
- Monitoring and logging
- Big Data