Big Data Open Source Software and Projects ABDS in Summary XIV: Layer 11C Data Science Curriculum March 5 2015 Geoffrey Fox


Big Data Open Source Software and Projects
ABDS in Summary XIV: Layer 11C
Data Science Curriculum, March 5 2015
Geoffrey Fox
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Functionality of 21 HPC-ABDS Layers
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) A) File management B) NoSQL C) SQL
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
14) A) Basic Programming model and runtime, SPMD, MapReduce B) Streaming
15) A) High level Programming B) Application Hosting Frameworks
16) Application and Analytics
17) Workflow-Orchestration
Here are 21 functionalities (including the subparts of 11, 14 and 15): 4 cross-cutting at the top, then 17 in the order of the layered diagram, starting at the bottom.

Database Rankings I
Covers NoSQL and SQL. Ranking criteria:
Number of mentions of the system on websites, measured as the number of results in search engine queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we search for the system name together with the term database, e.g. "Oracle" and "database".
General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
Number of job offers in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
Number of profiles in professional networks in which the system is mentioned. We use the internationally most popular professional network, LinkedIn.
Relevance in social networks. We count the number of Twitter tweets in which the system is mentioned.

Database Rankings II NoSQL

Apache Derby
Apache Derby is a relational database management system written in Java and based on the SQL and JDBC standards.
Derby offers a small footprint (~2.6 megabytes) and an embedded JDBC driver, and is easy to deploy and use.
Derby originated in 1996 at Cloudscape Inc., a startup in Oakland, CA. Cloudscape was acquired by Informix and later by IBM.
IBM donated the code to Apache in 2004, creating the Derby incubator project. Derby is now a subproject of Apache DB.
Derby has been bundled with the JDK (including the Java 7 release), rebranded as “Java DB”.
Typically used as an embedded database; its performance is not competitive as a standalone system.
Derby ranked #50 (March).

SQLite
SQLite is a public-domain, lightweight RDBMS designed to be used as a library (i.e. embedded) rather than as a standalone server.
The browsers Google Chrome, Opera, Safari and the Android Browser all allow storing information in, and retrieving it from, a SQLite database within the browser, using the Web SQL Database technology.
Mozilla Firefox and Mozilla Thunderbird store a variety of configuration data (bookmarks, cookies, contacts etc.) in internally managed SQLite databases, and even offer an add-on to manage SQLite databases.
Skype is a widely deployed application that uses SQLite.
It is used inside the main smartphone operating systems: Apple, Microsoft, Blackberry, Symbian, Android.
SQLite is ACID-compliant and implements most of the SQL standard, using a dynamically and weakly typed SQL syntax.
SQLite ranked #9 (March).
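The "embedded" and "dynamically typed" points can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The `bookmarks` table and its rows are made up for illustration and are not from the slides.

```python
import sqlite3

# An in-memory database: there is no server process, the engine runs
# as a library inside this program.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, title TEXT, visits INTEGER)"
)

# Dynamic typing: the declared column type is only an affinity, so a
# non-numeric string can still be stored in the INTEGER-affinity column.
conn.execute("INSERT INTO bookmarks (title, visits) VALUES (?, ?)", ("DB-Engines", 12))
conn.execute("INSERT INTO bookmarks (title, visits) VALUES (?, ?)", ("SQLite home", "many"))
conn.commit()

# typeof() reports the storage class actually used for each value.
rows = conn.execute(
    "SELECT title, visits, typeof(visits) FROM bookmarks ORDER BY id"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

The second row keeps the string as text even though the column is declared INTEGER, which is the behavior the slide calls dynamic and weak typing.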

MySQL
Popular GNU GPL-licensed SQL database, i.e. relational database management system (RDBMS).
– Second to SQLite in number of installations among open source RDBMSs.
Now owned by Oracle, with open source and supported versions.
Part of LAMP, which refers to an archetypal model of web service solution stacks, originally consisting of four components: Linux, the Apache HTTP Server, the MySQL relational database management system, and the PHP programming language.
– As a solution stack, LAMP is suitable for building dynamic web sites and web applications.
Used in cloud architectures, though not often as the central storage engine but rather for “small” metadata and the like.
Though MySQL began as a low-end alternative to more powerful proprietary databases, it has gradually evolved to support higher-scale needs as well.
It is still most commonly used in small to medium scale single-server deployments, either as a component in a LAMP-based web application or as a standalone database server.
Much of MySQL's appeal originates in its relative simplicity and ease of use, which is enabled by an ecosystem of open source tools such as phpMyAdmin.
In the medium range, MySQL can be scaled by deploying it on more powerful hardware, such as a multi-processor server.
MySQL ranked #2, behind Oracle (March).

Galera Cluster
Galera is a multi-master extension for MySQL that enables multiple MySQL nodes to handle reads and writes, with each write synchronized across all nodes.
There is also a MySQL Cluster version.
Galera comes from a fork of MySQL.
MariaDB, a MySQL fork, now ships MariaDB Galera Cluster as its cluster solution too.
MariaDB ranked #25 (March).

PostgreSQL
PostgreSQL is an open source, high quality object-relational database (ORDBMS) with many similarities to MySQL (http://en.wikipedia.org/wiki/PostgreSQL).
Originally PostgreSQL was known for more features and MySQL for higher performance and better ease of use, but with time the systems have become more similar.
PostgreSQL is developed by the PostgreSQL Global Development Group, a diverse group of many companies and individual contributors.
It is free and open source software, released under the terms of the PostgreSQL License, a permissive free software license.
Michael Stonebraker, a distinguished Berkeley faculty member, developed Ingres, on which PostgreSQL (Post Ingres) is based. He has just received the Turing Award!
PostgreSQL ranked #5.

CUBRID
CUBRID is an open source SQL-based relational database management system (RDBMS) with object extensions, developed by Naver Corporation for web applications.
The name CUBRID combines the two words cube and bridge: cube stands for a sealed box that provides security for its contents, while bridge stands for data bridge.
First released in November 2008 and written in C.
The feature that distinguishes CUBRID from other relational database systems is its 3-tier client-server architecture, which consists of the database server, the connection broker and the application layer.
Note that the server is under the GNU license, while the clients use a more lenient BSD-style license.
Ranked #153 (March 2015).

Oracle, DB2, SQL Server
Oracle is the dominant commercial object-relational database management system (ORDBMS), with a reputation for high quality and high cost. Object-relational means that objects, classes and inheritance are directly supported in database schemas and in the query language.
Started in 1977, when Larry Ellison and friends founded Software Development Laboratories (SDL).
A linked page compares many other proprietary and open source systems.
Microsoft SQL Server, IBM DB2 and, to a lesser extent, Sybase and Teradata are other major commercial RDBMSs.
Has all sorts of extensions, such as spatial query support, and all sorts of “editions” (Enterprise, Standard, Express).
There is substantial debate comparing this classic approach to Hadoop-based approaches like Hive, which parallelize with greater performance.
Oracle supports ACID (Atomicity, Consistency, Isolation, Durability), a set of properties that guarantee that database transactions are processed reliably.
– Compare with eventually consistent services, which provide BASE (Basically Available, Soft state, Eventual consistency) semantics.
– BASE can return inconsistent answers before multiple distributed updates have converged.
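As a rough illustration of the "A" in ACID, the sketch below runs a two-step transfer inside one transaction and shows that a failure in the second step also undoes the first. It uses Python's built-in `sqlite3` module as a stand-in (not Oracle), and the `accounts` table, the names and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit made in
        # the first UPDATE has been rolled back along with it.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

An eventually consistent (BASE) store gives no such all-or-nothing guarantee across replicas, which is why readers may see intermediate states until the updates converge.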

SciDB
SciDB is an array database designed for multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications.
– Arrays are natively supported, including parallel operations on them.
– It is developed by the company Paradigm4, co-founded by Michael Stonebraker of PostgreSQL fame.
– The license is the Affero General Public License (AGPL).
Key features include:
– Support of provenance.
– Out-of-memory arrays.
– Massive scale math on the arrays for linear algebra and analytics.
– Uncertainty can be modeled by associating error bars with data.
– Efficient storage.
Partly motivated as a database community answer to Hadoop.
Ranked #150 in http://db-engines.com/en/ranking (March 2015).

Rasdaman
Rasdaman ("raster data manager") is a Database Management System which adds capabilities for storage and retrieval of massive multi-dimensional arrays, such as sensor, image, and statistics data.
A frequently used synonym for arrays is raster data, as in 2-D raster graphics; this is what motivated the name Rasdaman.
However, Rasdaman has no limitation on the number of dimensions: it can serve, for example, 1-D measurement data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z exploration data, and 4-D ocean and climate data.
There is a Rasdaman query language, rasql.

Pivotal Greenplum
Greenplum is a classic SQL database, acquired by EMC in 2010 and bundled in their spinoff Pivotal in 2012.
Greenplum is commercial, built on PostgreSQL and aimed at data warehousing.
– Note that the PostgreSQL license encourages use in commercial products, whereas the MySQL GNU license doesn’t.
Parallelism by master-slave replication.
Ranked #36 in March.

Public Cloud SQL as a Service
Provides traditional databases as a service on clouds.
Azure SQL Service, based on SQL Server (http://msdn.microsoft.com/en-us/library/azure/dn aspx).
Google Cloud SQL, based on MySQL.
Amazon Relational Database Service (Amazon RDS), with MySQL, PostgreSQL, Oracle and SQL Server.

N1QL
N1QL builds on the SQL language and includes many of SQL's features in addition to features associated with document-oriented databases.
Designed for Couchbase, a document-oriented caching NoSQL store.
N1QL allows for joins, filter expressions, aggregate expressions, subqueries, data manipulation language (DML), and many other features to build a rich application.

Google F1
F1 supports AdWords.
The strong consistency properties of F1 and its storage system come at the cost of higher write latencies compared to MySQL.
The commit latency on F1 is quite high (at ms). Read latency takes a hit as well, with simple reads in the 5-10 ms range.
Google’s core ad business runs on F1, at a scale of 10s of TBs across 1000s of machines.
The ability to keep good enough performance by scaling the system matters more than individual latency.

IBM dashDB
Similar to Amazon Redshift (level 15A), offering warehouse capabilities for data stored in NoSQL Cloudant.
01.ibm.com/software/data/dashdb/

BlinkDB
BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data.
BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
Builds on Hive/Shark but uses sampling to get high performance.
Using a 100 node cluster, BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200× faster than Hive), with an error of 2-10%.
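The idea of answering from a sample and attaching error bars can be sketched in a few lines of Python. This is a toy under stated assumptions, not BlinkDB's method: it draws one uniform random sample and uses a normal-approximation confidence interval, and the `latencies` column, the 1% sample size and the helper name `approximate_mean` are all hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval when z = 1.96.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic "table column" standing in for a large data set.
data_rng = random.Random(42)
latencies = [data_rng.uniform(0.0, 100.0) for _ in range(200_000)]

exact = statistics.fmean(latencies)  # full scan
estimate, margin = approximate_mean(latencies, sample_size=2_000)  # 1% of rows
print(f"exact mean  = {exact:.2f}")
print(f"approximate = {estimate:.2f} +/- {margin:.2f}")
```

Shrinking the sample makes the query cheaper and the margin wider, which is the accuracy versus response time trade-off described above.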