Big Data Open Source Software and Projects ABDS in Summary VII: Level 10 I590 Data Science Curriculum August 15 2014 Geoffrey Fox


Big Data Open Source Software and Projects ABDS in Summary VII: Level 10 I590 Data Science Curriculum August Geoffrey Fox School of Informatics and Computing Digital Science Center Indiana University Bloomington

HPC-ABDS Layers
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) SQL / NoSQL / File management
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: collectives, point-to-point, publish-subscribe
14) Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
15) High-level Programming
16) Application and Analytics
17) Workflow-Orchestration
Here are 17 functionalities. Technologies are presented in this order: the 4 cross-cutting functionalities at the top, then the other 13 in the order of the layered diagram, starting at the bottom.

BitTorrent
BitTorrent is a protocol for peer-to-peer file sharing that is used to distribute large amounts of data over the Internet. It is one of the most common protocols for transferring large files, and peer-to-peer networks have been estimated to collectively account for approximately 43% to 70% of all Internet traffic, depending on geographical location.
– In November 2004, BitTorrent was responsible for 35% of all Internet traffic.
– As of February 2013, BitTorrent was responsible for 3.35% of all worldwide bandwidth, more than half of the 6% of total bandwidth dedicated to file sharing.
BitTorrent's popularity has decreased, but it was very influential in creating a revolution in music distribution.
Programmer Bram Cohen, a former University at Buffalo graduate student in computer science, designed the protocol in April 2001 and released the first available version on 2 July 2001; the final version followed later.
BitTorrent clients are available for a variety of computing platforms and operating systems, including an official client released by BitTorrent, Inc.
A key aspect of BitTorrent is that a given file transfer involves multiple hosts, each delivering part of the file (for example, an mp3) to be shared.
– It works only for cases like music distribution, where a given file exists in many places on the Internet.
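
The idea that several hosts each deliver part of one file can be sketched in a few lines of Python. This is a toy simulation, not the BitTorrent wire protocol: the peers are plain dictionaries, the piece size is unrealistically small, and the per-piece SHA-1 check is included as an assumption about how a downloader can reject a bad piece from one host and take it from another. All names (`split_into_pieces`, `download`, `peer_a`, and so on) are hypothetical.

```python
import hashlib

PIECE_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces):
    """One SHA-1 digest per piece, used to verify pieces on arrival."""
    return [hashlib.sha1(piece).hexdigest() for piece in pieces]


def download(expected_hashes, peers):
    """Assemble the file, taking each piece from the first peer whose copy verifies."""
    assembled = []
    for index, expected in enumerate(expected_hashes):
        for peer in peers:
            piece = peer.get(index)
            if piece is not None and hashlib.sha1(piece).hexdigest() == expected:
                assembled.append(piece)
                break
        else:
            raise ValueError(f"no peer has a valid copy of piece {index}")
    return b"".join(assembled)


original = b"a shared music file"
pieces = split_into_pieces(original)
hashes = piece_hashes(pieces)

# Hypothetical peers: each holds only some of the pieces.
peer_a = {i: p for i, p in enumerate(pieces) if i % 2 == 0}
peer_b = {i: p for i, p in enumerate(pieces) if i % 2 == 1}
peer_c = {0: b"XXXX"}  # a corrupted copy of piece 0, which is rejected

result = download(hashes, [peer_c, peer_a, peer_b])
```

No single peer holds the whole file here, which is why the approach depends on the file existing in many places.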

HTTP
The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web.
Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text. HTTP is the protocol used to exchange or transfer hypertext.
The standards development of HTTP was coordinated by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C), culminating in the publication of a series of Requests for Comments (RFCs), most notably RFC 2616 (June 1999), which defined HTTP/1.1, the version of HTTP most commonly used today. In June 2014, RFC 2616 was retired and HTTP/1.1 was redefined by a new series of RFCs that includes RFCs 7230, 7231, 7232, 7233, and 7234.
– IETF and W3C are important standards-setting bodies in the Internet and Web areas respectively.
– By default, HTTP uses port 80 and HTTPS uses port 443.
Although not the most efficient protocol, HTTP is a universal data transfer mechanism.
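
As a rough illustration of the default ports and of what an HTTP/1.1 request looks like on the wire, the sketch below builds a minimal GET request from a URL without opening any connection. The helper name `build_get_request` and the example URLs are assumptions made for this example; a real client would normally use a library such as Python's `http.client` rather than assembling bytes by hand.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def build_get_request(url):
    """Return (host, port, request_bytes) for a minimal HTTP/1.1 GET."""
    parts = urlsplit(url)
    default_port = DEFAULT_PORTS[parts.scheme]
    port = parts.port or default_port
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the port only when it is not the default one.
    host_header = parts.hostname if port == default_port else f"{parts.hostname}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return parts.hostname, port, request


host, port, request = build_get_request("http://example.org/index.html")
secure_host, secure_port, _ = build_get_request("https://example.org")
```

The blank line that ends the header block (`\r\n\r\n`) is what tells the server the request is complete.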

FTP
The File Transfer Protocol (FTP) is a standard network protocol used to transfer computer files from one host to another over a TCP-based network, such as the Internet. FTP is built on a client-server architecture and uses separate control and data connections between the client and the server.
– FTP users may authenticate themselves using a clear-text sign-in protocol, normally in the form of a username and password, but can connect anonymously if the server is configured to allow it.
– For secure transmission that protects the username and password and encrypts the content, FTP is often secured with SSL/TLS (FTPS). SSH File Transfer Protocol (SFTP) is sometimes used instead, but it is a technologically different protocol.
The first FTP client applications were command-line applications developed before operating systems had graphical user interfaces, and they are still shipped with most Windows, Unix, and Linux operating systems. Many FTP clients and automation utilities have since been developed for desktops, servers, mobile devices, and hardware, and FTP has been incorporated into productivity applications, such as Web page editors.
FTP uses port 20 (data) and port 21 (control).
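
The separation between control and data connections can be made concrete with FTP's passive mode, which the slide does not mention and is brought in here as an assumption. In that mode the server answers a `PASV` command on the control connection with a `227` reply that names the address and port the client should use for the data connection; the port is encoded as two numbers, `p1 * 256 + p2`. The function name and the sample reply below are illustrative only.

```python
import re


def parse_pasv_reply(reply):
    """Extract (host, port) for the data connection from a 227 reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a passive-mode reply: {reply!r}")
    numbers = [int(n) for n in match.groups()]
    host = ".".join(str(n) for n in numbers[:4])
    # The data port is sent as a high byte and a low byte.
    port = numbers[4] * 256 + numbers[5]
    return host, port


# Sample reply as it would arrive on the control connection (port 21).
data_host, data_port = parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)")
```

Python's standard `ftplib` module handles this exchange internally, so application code rarely needs to parse the reply itself.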

Globus Online (GridFTP)
GridFTP is an extension of the standard File Transfer Protocol (FTP) for high-speed, reliable, and secure data transfer. The protocol was defined within the GridFTP working group of the Open Grid Forum. There are multiple implementations of the protocol; the most widely used is the one provided by the Globus Toolkit.
The aim of GridFTP is to provide more reliable and higher-performance file transfer, for example to enable the transmission of very large files. GridFTP is used extensively within large science projects such as the Large Hadron Collider and by many supercomputer centers and other scientific facilities.
GridFTP also addresses the problem of incompatibility between storage and access systems. Previously, each data provider would make its data available in its own specific way, providing a library of access functions.
– This made it difficult to obtain data from multiple sources, requiring a different access method for each, and thus dividing the total available data into partitions.
– GridFTP provides a uniform way of accessing the data, encompassing functions from all the different modes of access, building on and extending the universally accepted FTP standard.
FTP was chosen as the basis because of its widespread use and because it has a well-defined architecture for extensions to the protocol (which may be dynamically discovered).
Better performance is achieved by transmitting multiple streams in parallel.
Numerous GridFTP clients have been developed. The Globus Online software-as-a-service system is particularly popular: it hosts the control functions on public clouds and can be considered a SaaS version of GridFTP.
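
The multiple-stream idea can be illustrated with a small Python sketch: the file is divided into contiguous byte ranges and each range is copied by its own worker. This is only an in-memory analogy under stated assumptions (threads stand in for network streams, and `byte_ranges` and `parallel_copy` are hypothetical names); it does not use or reproduce any GridFTP or Globus API.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, streams):
    """Split [0, total_size) into `streams` contiguous ranges of near-equal size."""
    base, extra = divmod(total_size, streams)
    ranges, start = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def parallel_copy(source, streams=4):
    """Copy `source` into a new buffer, one worker per byte range."""
    destination = bytearray(len(source))

    def copy_range(bounds):
        start, end = bounds
        destination[start:end] = source[start:end]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        # list() forces completion and surfaces any worker exception.
        list(pool.map(copy_range, byte_ranges(len(source), streams)))
    return bytes(destination)


data = bytes(range(256)) * 10
copied = parallel_copy(data, streams=4)
```

Because every range has a fixed offset, the pieces can arrive in any order and still land in the right place, which is what makes parallel streams safe to combine.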

SSH
Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote command execution, and other secure network services between two networked computers. It connects, via a secure channel over an insecure network, a server and a client running SSH server and SSH client programs, respectively. The protocol specification distinguishes between two major versions, referred to as SSH-1 and SSH-2.
– SSH uses public-key cryptography to authenticate the remote computer and to allow it to authenticate the user, if necessary.
The best-known application of the protocol is access to shell accounts on Unix-like operating systems, but it can also be used in a similar fashion for accounts on Windows. It was designed as a replacement for Telnet and other insecure remote shell protocols such as the Berkeley rsh and rexec protocols, which send information, notably passwords, in plaintext, rendering them susceptible to interception and disclosure using packet analysis.
The encryption used by SSH is intended to provide confidentiality and integrity of data over an unsecured network, such as the Internet.
SSH uses port 22 by default.
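
One way a client can authenticate the remote computer is to compare a fingerprint of the host's public key with a fingerprint recorded earlier. The sketch below computes a fingerprint in the `SHA256:` plus unpadded base64 style that OpenSSH displays. The key material is a made-up placeholder rather than a real host key, and `sha256_fingerprint` and `host_is_trusted` are hypothetical helper names, so this shows the comparison step only, not the SSH handshake.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_b64):
    """Fingerprint a base64-encoded public key blob as 'SHA256:<base64 digest>'."""
    key_blob = base64.b64decode(public_key_b64)
    digest = hashlib.sha256(key_blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def host_is_trusted(presented_key_b64, known_fingerprint):
    """Accept the host only if its key matches the fingerprint recorded earlier."""
    return sha256_fingerprint(presented_key_b64) == known_fingerprint


# Placeholder key material; a real host key blob has a defined binary layout.
hypothetical_key = base64.b64encode(b"hypothetical host key blob").decode("ascii")
tampered_key = base64.b64encode(b"a different key blob").decode("ascii")

recorded = sha256_fingerprint(hypothetical_key)
same_host = host_is_trusted(hypothetical_key, recorded)
other_host = host_is_trusted(tampered_key, recorded)
```

A mismatch like the second case is what prompts an SSH client to warn that the host key has changed.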

Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
It originated at Cloudera and is written primarily in Java, with the following components:
– Event: a singular unit of data that is transported by Flume (typically a single log entry).
– Source: the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
– Sink: the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS.
– Channel: the conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
– Agent: any physical Java virtual machine running Flume. It is a collection of sources, sinks, and channels.
– Client: produces the Event and transmits it to the Source operating within the Agent.
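
The relationship between the components listed above can be sketched as a toy in-memory pipeline. This is not Flume's Java API: the class names (`Channel`, `LogSource`, `ListSink`, `Agent`) and the dictionary used to represent an event are assumptions chosen to mirror the vocabulary of the slide.

```python
from collections import deque


class Channel:
    """Conduit between source and sink: buffers events in arrival order."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None


class LogSource:
    """Passive source: a client hands it log lines, which it wraps as events."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, line):
        self.channel.put({"body": line})


class ListSink:
    """Sink that drains the channel into a list standing in for the destination."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.delivered.append(event["body"])
            event = self.channel.take()


class Agent:
    """A collection of a source, a channel, and a sink."""

    def __init__(self):
        self.channel = Channel()
        self.source = LogSource(self.channel)
        self.sink = ListSink(self.channel)


agent = Agent()
for log_line in ["GET /index.html 200", "GET /missing 404"]:  # the client's events
    agent.source.receive(log_line)
agent.sink.drain()
delivered = agent.sink.delivered
```

Decoupling the source from the sink through the channel is what lets each side run at its own pace.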

Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop graduated from the Apache Incubator in March 2012 and is now a Top-Level Apache project.
Sqoop supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import.
– Imports can also be used to populate tables in Hive or HBase.
– Exports can be used to put data from Hadoop into a relational database.
Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop. Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of Sqoop.
Couchbase Server, originally Membase, is an open source, distributed NoSQL document-oriented database that is optimized for interactive applications; it is related to but different from CouchDB.
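
The notion of an incremental load, importing only what changed since the last import, can be sketched with Python's built-in `sqlite3` module. This does not invoke Sqoop and is not its implementation: the `customers` table, its columns, and the `incremental_import` helper are hypothetical, and the approach assumed here is simply to remember the highest `id` seen so far and fetch only rows above it.

```python
import sqlite3


def incremental_import(conn, last_id):
    """Fetch rows added since `last_id`; return them with the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# First run imports everything; the saved position makes later runs incremental.
first_batch, last_id = incremental_import(conn, 0)

conn.execute("INSERT INTO customers VALUES (3, 'Edsger')")
second_batch, last_id = incremental_import(conn, last_id)
third_batch, last_id = incremental_import(conn, last_id)
```

Storing the position between runs plays the role of the saved job described above: rerunning it picks up only the new rows.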