Big Data Open Source Software and Projects ABDS in Summary I


Big Data Open Source Software and Projects
ABDS in Summary I
I590 Data Science Curriculum, August 15 2014
Geoffrey Fox, gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

HPC-ABDS Layers
Here are 17 functionalities. Technologies are presented in this order: the 4 cross-cutting functionalities first, then the 13 layers in the order of the layered diagram, starting at the bottom.
Cross-cutting functionalities:
1. Message and Data Protocols
2. Distributed Coordination
3. Security & Privacy
4. Monitoring
Layers, starting at the bottom of the diagram:
5. IaaS Management from HPC to hypervisors
6. DevOps
7. Interoperability
8. File systems
9. Cluster Resource Management
10. Data Transport
11. SQL / NoSQL / File management
12. In-memory databases & caches / Object-relational mapping / Extraction Tools
13. Inter-process communication: collectives, point-to-point, publish-subscribe
14. Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
15. High level Programming
16. Application and Analytics
17. Workflow-Orchestration

Apache Thrift
http://en.wikipedia.org/wiki/Apache_Thrift
Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development".
It combines a software stack with a code generation engine to build services that work efficiently, to a varying degree, and seamlessly between C#, C++ (on POSIX-compliant systems), Cappuccino, Cocoa, Delphi, Erlang, Go, Haskell, Java, Node.js, OCaml, Perl, PHP, Python, Ruby and Smalltalk.
Note that this type of capability is augmented by serializers such as Kryo for Java.

Google Protobuf (Protocol Buffers)
http://en.wikipedia.org/wiki/Protocol_Buffers
Protocol Buffers are a method of serializing structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats. They are useful for programs that communicate with each other over a wire or that store data.
The method involves an interface description language that describes the structure of some data, and a program that generates source code from that description in various programming languages. The generated code produces or parses a stream of bytes that represents the structured data.
Protocol Buffers are serialized into a binary wire format which is compact, forwards-compatible, and backwards-compatible, but not self-describing (that is, there is no way to tell the names, meaning, or full datatypes of fields without an external specification).
Supported languages include C++, Java, and Python.
Protocol Buffers are very similar to the Apache Thrift protocol (used by Facebook, for example), except that the public Protocol Buffers implementation does not include a concrete RPC protocol stack to use for defined services.
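The "compact but not self-describing" property can be made concrete with a small sketch. It hand-encodes a single integer field in the Protocol Buffers varint wire format, without using the protobuf library; the function names are illustrative assumptions, and only non-negative integers are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint, low-order group first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint that starts at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Encode one field as tag (field number plus wire type 0) followed by the value."""
    tag = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)


encoded = encode_int_field(1, 150)
print(encoded.hex())  # 089601
```

The three bytes carry only the field number, the wire type, and the value. The field's name and declared type live in the external schema, which is why a reader without that schema cannot fully interpret the message.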

Apache Avro
http://avro.apache.org/docs/current/
Apache Avro relies on schemas defined with JSON. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake.
Avro differs from Thrift and Protocol Buffers in these ways:
- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
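The idea of untagged data resolved by field name can be sketched in plain Python. This is not the Avro library or its binary encoding: the two schemas, the `write_untagged` and `read_resolved` helpers, and the sample record are all hypothetical, chosen only to show how a reader's schema is reconciled with a writer's schema when both are available.

```python
import json

# Schema the data was written with (hypothetical example).
writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "int"}]}
""")

# Schema the reading program expects: "age" dropped, "email" added with a default.
reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "email", "type": "string", "default": "unknown"}]}
""")


def write_untagged(record, schema):
    """Store values only, in writer-schema field order, with no names or type tags."""
    return [record[field["name"]] for field in schema["fields"]]


def read_resolved(values, writer, reader):
    """Rebuild a record for the reader by matching fields symbolically, by name."""
    written = {field["name"]: value
               for field, value in zip(writer["fields"], values)}
    result = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result


stored = write_untagged({"name": "Ada", "age": 36}, writer_schema)
resolved = read_resolved(stored, writer_schema, reader_schema)
print(stored)
print(resolved)
```

Because the stored values carry no field identifiers, the reader depends entirely on having the writer's schema at hand, which is the trade-off the slide describes.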

Apache ZooKeeper
http://en.wikipedia.org/wiki/Apache_ZooKeeper
ZooKeeper is an important technology for providing reliable control metadata in distributed scalable systems. It is a distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.
ZooKeeper's architecture supports high availability through redundant services. Clients can thus ask another ZooKeeper server if the first fails to answer.
ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a trie (digital tree) data structure. Clients can read from and write to the nodes, and in this way have a shared configuration service. Updates are totally ordered.
ZooKeeper is used by companies including Rackspace, Yahoo and eBay, as well as by open source systems such as the enterprise search platform Solr and the stream processing system Storm.
See also Giraffe, described as an improved technology: http://grid.hust.edu.cn/xhshi/projects/giraffe.htm
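A toy, single-process model may help picture the hierarchical name space and the total ordering of updates. The `ToyNamespace` class below is a hypothetical illustration, not the ZooKeeper API: it does no replication, and the sequence number simply counts updates in the order they are applied.

```python
import itertools


class ToyNamespace:
    """In-memory stand-in for a hierarchical key space (not the ZooKeeper API)."""

    def __init__(self):
        self._nodes = {"/": b""}
        self._counter = itertools.count(1)
        self.log = []  # (sequence number, path), in the order updates were applied

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent {parent!r} does not exist")
        if path in self._nodes:
            raise KeyError(f"{path!r} already exists")
        return self._apply(path, data)

    def set(self, path, data):
        if path not in self._nodes:
            raise KeyError(f"{path!r} does not exist")
        return self._apply(path, data)

    def get(self, path):
        return self._nodes[path]

    def children(self, path):
        """List the names of the direct children of path."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self._nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def _apply(self, path, data):
        # Every update gets the next sequence number, giving one global order.
        seq = next(self._counter)
        self._nodes[path] = data
        self.log.append((seq, path))
        return seq


ns = ToyNamespace()
ns.create("/config")
ns.create("/config/db", b"host=a")
ns.set("/config/db", b"host=b")
print(ns.children("/config"))  # ['db']
print(ns.log)
```

In a real deployment the ordering has to be agreed across several servers, which is the hard part this sketch leaves out.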

JGroups
http://en.wikipedia.org/wiki/JGroups
JGroups is a reliable multicast system written in the Java language and released as open source under the LGPL.
JGroups adds a "grouping" layer over a transport protocol, internally keeping a list of participants. This list is used to:
- Make the application aware of the listeners
- Make some or all transmissions reliable
- Allow totally ordered transmissions
JGroups is a toolkit for reliable multicast communication. It can be used to create groups of processes whose members can send messages to each other. JGroups enables developers to create reliable multipoint (multicast) applications where reliability is a deployment issue. It also relieves the application developer from implementing this logic themselves. This saves significant development time and allows the application to be deployed in different environments without having to change code.
The most powerful feature of JGroups is its flexible protocol stack, which allows developers to adapt it to exactly match their application requirements and network characteristics. The benefit of this is that you only pay for what you use. By mixing and matching protocols, various differing application requirements can be satisfied.
JGroups comes with a number of protocols: UDP (IP Multicast), TCP, and JMS (but anyone can write their own).
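The "mix and match" protocol stack can be pictured with a small Python sketch. JGroups itself is a Java library and this is not its API: the `Sequencer`, `Checksum`, and `Stack` classes are hypothetical layers invented for illustration. Each layer transforms a message on the way down to the wire and undoes that on the way up.

```python
import zlib


class Sequencer:
    """Stamps outgoing messages with a 4-byte sequence number; strips it on receipt."""

    def __init__(self):
        self.next_seq = 0

    def down(self, payload):
        stamped = self.next_seq.to_bytes(4, "big") + payload
        self.next_seq += 1
        return stamped

    def up(self, payload):
        return payload[4:]


class Checksum:
    """Appends a CRC32 on send and verifies it on receipt."""

    def down(self, payload):
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def up(self, payload):
        body, crc = payload[:-4], int.from_bytes(payload[-4:], "big")
        if zlib.crc32(body) != crc:
            raise ValueError("corrupted message")
        return body


class Stack:
    """Applies layers top to bottom when sending and bottom to top when receiving."""

    def __init__(self, layers):
        self.layers = layers  # top of the stack first

    def send(self, payload):
        for layer in self.layers:
            payload = layer.down(payload)
        return payload

    def receive(self, wire):
        for layer in reversed(self.layers):
            wire = layer.up(wire)
        return wire


stack = Stack([Sequencer(), Checksum()])
wire = stack.send(b"hello")
print(len(wire), stack.receive(wire))
```

Dropping `Checksum()` from the list gives a cheaper stack without integrity checking, which is the "only pay for what you use" point in miniature.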

OpenStack Keystone
http://www.ibm.com/developerworks/cloud/library/cl-openstack-keystone/index.html
Keystone integrates the OpenStack functions for authentication, policy management, and catalog services, including registering all tenants and users, authenticating users and granting tokens for authorization, creating policies that span all users and services, and managing a catalog of service endpoints.
The core object of an identity-management system is the user: a digital representation of a person, system, or service using OpenStack services. Users are often assigned to containers called tenants, which isolate resources and identity objects. A tenant can represent a customer, account, or any organizational unit.
Security policies are enforced with a rule-based authorization engine. After a user has been authenticated, the next step is to determine the level of authorization. Keystone encapsulates a set of rights and privileges with a notion called a role. The tokens that the identity service issues include a list of roles that the authenticated user can assume. It is then up to the resource service to match the set of user roles with the requested set of resource operations and either grant or deny access.
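The last step, where a resource service matches the roles in a token against the requested operation, can be sketched as follows. This is not Keystone's API or policy format: the `Token` class, the `POLICY` table, the operation names, and the tenant check are hypothetical and only illustrate the role-matching idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """What a resource service might see after authentication (hypothetical shape)."""
    user: str
    tenant: str
    roles: frozenset


# Hypothetical policy: operation name -> roles allowed to perform it.
POLICY = {
    "volume:read": {"member", "admin"},
    "volume:delete": {"admin"},
}


def is_authorized(token, operation, resource_tenant):
    """Grant access only within the token's tenant and only for a permitted role."""
    if token.tenant != resource_tenant:
        return False  # tenants isolate resources from one another
    allowed = POLICY.get(operation, set())  # unknown operations are denied
    return bool(token.roles & allowed)


alice = Token(user="alice", tenant="physics", roles=frozenset({"member"}))
print(is_authorized(alice, "volume:read", "physics"))    # True
print(is_authorized(alice, "volume:delete", "physics"))  # False
```

Denying operations that are absent from the policy table is a design choice of this sketch, made so that a missing rule fails closed.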

Apache Sentry
http://sentry.incubator.apache.org/
Sentry provides role-based authorization designed to work with Cloudera Impala (used by Impala in its release) and Apache Hive.
It was originally called Cloudera Access and moved to the Apache Incubator in August 2013.

Apache Ambari
Apache Ambari is contributed by Hortonworks and has multiple cluster management and monitoring functions:
- Provisioning a Hadoop cluster: Ambari includes an intuitive Web interface that allows one to easily provision, configure and test all the Hadoop services and core components and achieve a wizard-driven installation of Hadoop across any number of hosts. Ambari also provides the powerful Ambari Blueprints API for automating cluster installations without user intervention.
- Managing a Hadoop cluster: Ambari provides tools to simplify cluster management. The Web interface allows you to control the lifecycle of Hadoop services and components, modify configurations and manage the ongoing growth of your cluster.
- Monitoring a Hadoop cluster: Ambari pre-configures alerts for watching Hadoop services and visualizes cluster operational data in a simple Web interface, allowing one to monitor the health of a Hadoop installation.

Nagios
Nagios (http://www.nagios.org/) is an open source (GPL) software application for computer system monitoring, network monitoring and infrastructure monitoring. Nagios offers monitoring and alerting services for servers, switches, applications, and services. It alerts the users when things go wrong and alerts them a second time when the problem has been resolved.
The “core” is open source, but there is also a commercial (enterprise) version.
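The "alert on failure, alert again on recovery" behavior amounts to notifying on state changes rather than on every check. A minimal sketch of that idea, unrelated to Nagios's own configuration or plugin interface, could look like this; the `alerts_for` function, the message wording, and the assumption that services start healthy are all illustrative.

```python
def alerts_for(check_results):
    """Yield an alert whenever a service changes between healthy and failing.

    check_results is an iterable of (service_name, is_ok) pairs in time order.
    """
    last_ok = {}
    for service, ok in check_results:
        previous = last_ok.get(service, True)  # assumption: services start healthy
        if previous and not ok:
            yield f"PROBLEM: {service} is down"
        elif not previous and ok:
            yield f"RECOVERY: {service} is back up"
        last_ok[service] = ok


results = [
    ("web", True),
    ("web", False),  # first failure triggers an alert
    ("web", False),  # still failing, so no repeat alert
    ("db", True),
    ("web", True),   # recovery triggers the second alert
]
for alert in alerts_for(results):
    print(alert)
```

Tracking only the last known state per service keeps repeated failures from flooding the user with duplicate alerts.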

Ganglia
http://en.wikipedia.org/wiki/Ganglia_(software)
Ganglia is a BSD-licensed, scalable, distributed system monitoring tool for high-performance computing systems such as clusters and grids. It allows the user to remotely view live or historical statistics (such as CPU load averages or network utilization) for all machines that are being monitored. It is based on a hierarchical design targeted at federations of clusters.
SDSC bundled Ganglia monitoring into their Rocks Installation Tool.
Ganglia is more concerned with gathering metrics and tracking them over time, while Nagios has focused on being an alerting mechanism (see http://www.ibm.com/developerworks/library/l-ganglia-nagios-1/).

Inca
The Inca Monitoring Tool (http://inca.sdsc.edu/) is an open source system from SDSC enabling user-level monitoring with a powerful reporting mechanism. Inca detects Grid (cluster) infrastructure problems by executing periodic, automated, user-level testing of Grid software and services. It supports multiple “reporters” for different tests. For example, there are 196 Inca reporters available to test and measure aspects of FutureGrid systems: https://portal.futuregrid.org/tutorials/inca