CC5212-1 P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 2: Distributed Systems I Aidan Hogan

Slides:



Advertisements
Similar presentations
P2P data retrieval DHT (Distributed Hash Tables) Partially based on Hellerstein’s presentation at VLDB2004.
Advertisements

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture III: 2014/03/23.
Clayton Sullivan PEER-TO-PEER NETWORKS. INTRODUCTION What is a Peer-To-Peer Network A Peer Application Overlay Network Network Architecture and System.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture II: 2014/03/17.
Technical Architectures
Cis e-commerce -- lecture #6: Content Distribution Networks and P2P (based on notes from Dr Peter McBurney © )
Topics in Reliable Distributed Systems Lecture 2, Fall Dr. Idit Keidar.
Introduction to Peer-to-Peer (P2P) Systems Gabi Kliot - Computer Science Department, Technion Concurrent and Distributed Computing Course 28/06/2006 The.
Based on last years lecture notes, used by Juha Takkinen.
A. Frank 1 Internet Resources Discovery (IRD) Peer-to-Peer (P2P) Technology (1) Thanks to Carmit Valit and Olga Gamayunov.
Introduction  What is an Operating System  What Operating Systems Do  How is it filling our life 1-1 Lecture 1.
1 Client-Server versus P2P  Client-server Computing  Purpose, definition, characteristics  Relationship to the GRID  Research issues  P2P Computing.
Chord-over-Chord Overlay Sudhindra Rao Ph.D Qualifier Exam Department of ECECS.
Topics in Reliable Distributed Systems Fall Dr. Idit Keidar.
Dr. Kalpakis CMSC621 Advanced Operating Systems Introduction.
Introduction to client/server architecture
Middleware for P2P architecture Jikai Yin, Shuai Zhang, Ziwen Zhang.
Tiered architectures 1 to N tiers. 2 An architectural history of computing 1 tier architecture – monolithic Information Systems – Presentation / frontend,
Applied Architectures Eunyoung Hwang. Objectives How principles have been used to solve challenging problems How architecture can be used to explain and.
Client/Server Architectures
P2P File Sharing Systems
INTRODUCTION TO PEER TO PEER NETWORKS Z.M. Joseph CSE 6392 – DB Exploration Spring 2006 CSE, UT Arlington.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 12 Slide 1 Distributed Systems Architectures.
1 Lecture 20: Parallel and Distributed Systems n Classification of parallel/distributed architectures n SMPs n Distributed systems n Clusters.
Introduction of P2P systems
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Peer-to-Peer Networks University of Jordan. Server/Client Model What?
Unit – I CLIENT / SERVER ARCHITECTURE. Unit Structure  Evolution of Client/Server Architecture  Client/Server Model  Characteristics of Client/Server.
Architectures of distributed systems Fundamental Models
Personal Computer - Stand- Alone Database  Database (or files) reside on a PC - on the hard disk.  Applications run on the same PC and directly access.
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Super-peer Network. Motivation: Search in P2P Centralised (Napster) Flooding (Gnutella)  Essentially a breadth-first search using TTLs Distributed Hash.
DISTRIBUTED COMPUTING Introduction Dr. Yingwu Zhu.
Distributed Computing Systems CSCI 4780/6780. Distributed System A distributed system is: A collection of independent computers that appears to its users.
1 Peer-to-Peer Technologies Seminar by: Kunal Goswami (05IT6006) School of Information Technology Guided by: Prof. C.R.Mandal, School of Information Technology.
Architecture Models. Readings r Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 3 m Note: All figures from this book.
Paper Survey of DHT Distributed Hash Table. Usages Directory service  Very little amount of information, such as URI, metadata, … Storage  Data, such.
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
ICS362 – Distributed Systems Dr. Ken Cosh Week 2.
ITGS Network Architecture. ITGS Network architecture –The way computers are logically organized on a network, and the role each takes. Client/server network.
ADVANCED COMPUTER NETWORKS Peer-Peer (P2P) Networks 1.
Peer to Peer Network Design Discovery and Routing algorithms
6.894: Distributed Operating System Engineering Lecturers: Frans Kaashoek Robert Morris
Peer to Peer Computing. What is Peer-to-Peer? A model of communication where every node in the network acts alike. As opposed to the Client-Server model,
CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Distributed Hash Tables Steve Ko Computer Sciences and Engineering University at Buffalo.
Algorithms and Techniques in Structured Scalable Peer-to-Peer Networks
INTERNET TECHNOLOGIES Week 10 Peer to Peer Paradigm 1.
P2P Search COP6731 Advanced Database Systems. P2P Computing  Powerful personal computer Share computing resources P2P Computing  Advantages: Shared.
P2P Search COP P2P Search Techniques Centralized P2P systems  e.g. Napster, Decentralized & unstructured P2P systems  e.g. Gnutella.
Malugo – a scalable peer-to-peer storage system..
Background Computer System Architectures Computer System Software.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2016 Lecture 2: Distributed Systems I Aidan Hogan
2.2 Interfacing Computers MR JOSEPH TAN CHOO KEE TUESDAY 1330 TO 1530
CSE 486/586 Distributed Systems Distributed Hash Tables
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2016 Lab 1: Wikipedia Word Count Aidan Hogan
An example of peer-to-peer application
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 2: Introduction to Distributed Systems Aidan Hogan
CSE 486/586 Distributed Systems Distributed Hash Tables
Principles of Network Applications
CHAPTER 3 Architectures for Distributed Systems
Introduction to client/server architecture
#01 Client/Server Computing
CC Procesamiento Masivo de Datos Otoño Lecture 2: Distributed Systems
GCSE OCR 3 A451 Computing Client-server and peer-to-peer networks
CC Procesamiento Masivo de Datos Otoño Lecture 2 Distributed Systems
CSE 486/586 Distributed Systems Distributed Hash Tables
#02 Peer to Peer Networking
#01 Client/Server Computing
Presentation transcript:

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 2: Distributed Systems I Aidan Hogan

MASSIVE DATA NEEDS DISTRIBUTED SYSTEMS …

Monolithic vs. Distributed Systems One machine that’s n times as powerful? n machines that are equally as powerful? vs.

Parallel vs. Distributed Systems Parallel System – often = shared memory Distributed System – often = shared nothing Memory Processor Memory Processor Memory Processor Memory

What is a Distributed System? “A distributed system is a system that enables a collection of independent computers to communicate in order to solve a common goal.”

What is a Distributed System? “An ideal distributed system is a system that makes a collection of independent computers look like one computer (to the user).”

Disadvantages of Distributed Systems Advantages Cost – Better performance/price Extensibility – Add another machine! Reliability – No central point of failure! Workload – Balance work automatically Sharing – Remote access to services Disadvantages Software – Need specialised programs Networking – Can be slow Maintenance – Debugging sw/hw a pain Security – Multiple users Parallelisation – Not always applicable

WHAT MAKES A GOOD DISTRIBUTED SYSTEM?

Distributed System Design Transparency: Abstract/hide: – Access: How different machines are accessed – Location: What machines have what/if they move – Concurrency: Access by several users – Failure: Keep it a secret from the user “An ideal distributed system is a system that makes a collection of independent computers look like one computer (to the user).”

Distributed System Design Flexibility: – Add/remove/move machines – Generic interfaces Reliability: – Fault-tolerant: recover from errors – Security: user authentication – Availability: uptime/total-time

Distributed System Design Performance: – Runtimes (processing) – Latency, throughput and bandwidth (data) Scalability – Network and infrastructure scales – Applications scale – Minimise global knowledge/bottlenecks!

DISTRIBUTED SYSTEMS: CLIENT–SERVER ARCHITECTURE

Client–Server Model Client makes request to server Server acts and responds (For example: , WWW, Printing, etc.)

Client–Server: Thin Client Few computing resources for client: I/O – Server does the hard work (For example: PHP-heavy websites, SSH, , etc.)

Client–Server: Fat Client Fuller computing resources for client: I/O – Server sends data: computing done client-side (For example: Javascript-heavy websites, multimedia, etc.)

Client–Server: Mirror Servers User goes to any machine (replicated/mirror)

Client–Server: Proxy Server User goes to “forwarding” machine (proxy)

Server Client–Server: Three-Tier Server DataLogicPresentation HTTP GET: Total salary of all employees SQL: Query salary of all employees Add all the salaries Create HTML page

Client–Server: n-Tier Server Slide from Google’s Jeff Dean: Presentation Logical (multiple tiers) Data

DISTRIBUTED SYSTEMS: PEER-TO-PEER ARCHITECTURE

Peer-to-Peer (P2P) Client–Server Clients interact directly with a “central” server Peer-to-Peer Peers interact directly amongst themselves

Peer-to-Peer: Unstructured Ricky Martin’s new album? (For example: Kazaa, Gnutella)

Peer-to-Peer: Unstructured Pixie’s new album? (For example: Kazaa, Gnutella)

Peer-to-Peer: Structured (Central) In central server, each peer registers – Content – Address Peer requests content from server Peers connect directly Central point-of-failure (For example: Napster … central directory was shut down) Ricky Martin’s new album?

Peer-to-Peer: Structured (Hierarchical) Super-peers and peers

Peer-to-Peer: Structured (DHT) Distributed Hash Table (key,value) pairs key based on hash Query with key Insert with (key,value) Peer indexes key range Hash: 000 Hash: 111 (For example: Bittorrent’s Tracker)

Peer-to-Peer: Structured (DHT) Circular DHT: – Only aware of neighbours – O(n) lookups Implement shortcuts – Skips ahead – Enables binary-search- like behaviour – O(log(n)) lookups Pixie’s new album? 111

Peer-to-Peer: Structured (DHT) Handle peers leaving (churn) – Keep n successors New peers – Fill gaps – Replicate

Comparison of P2P Systems 2) Unstructured Search requires flooding (n lookups) Connections → n 2 No central point of failure Peers control their data Peers control neighbours 3) Structured Search follows structure (log(n) lookups) Connections →log(n) No central point of failure Peers assigned data Peers assigned neighbours For Peer-to-Peer, what are the benefits of (1) central directory vs. (2) unstructured, vs. (3) structured? 1) Central Directory Search follows directory (1 lookup) Connections → 1 Central point of failure Peers control their data No neighbours

P2P vs. Client–Server Client–Server Data lost in failure/deletes Search easier/faster Network often faster (to websites on backbones) Often central host – Data centralised – Remote hosts control data – Bandwidth centralised – Dictatorial – Can be taken off-line Peer-to-Peer May lose rare data (churn) Search difficult (churn) Network often slower (to conventional users) Multiple hosts – Data decentralised – Users (often) control data – Bandwidth decentralised – Democratic – Difficult to take off-line What are the benefits of Peer-to-Peer vs. Client–Server?

DISTRIBUTED SYSTEMS: HYBRID EXAMPLE (BITTORRENT)

BitTorrent: Search Server BitTorrent Search (Server) “ ricky martin ” Client–Server

BitTorrent: Tracker BitTorrent Peer Tracker (or DHT)

BitTorrent: File-Sharing

BitTorrent: Hybrid Uploader 1.Creates torrent file 2.Uploads torrent file 3.Announces on tracker 4.Monitors for downloaders 5.Connects to downloaders 6.Sends file parts Downloader 1.Searches torrent file 2.Downloads torrent file 3.Announces to tracker 4.Monitors for peers/seeds 5.Connects to peers/seeds 6.Sends & receives file parts 7.Watches illegal movie Local / Client–Server / Structured P2P / Direct P2P (Torrent Search Engines target of law-suits)

DISTRIBUTED SYSTEMS: IN THE REAL WORLD

Real-World Architectures: Hybrid Often hybrid! – Architectures herein are simplified/idealised – No clear black-and-white (just good software!) – For example, BitTorrent mixes different paradigms – But good to know the paradigms

Physical Location: Cluster Computing Machines (typically) in a central, local location; e.g., a local LAN in a server room

Physical Location: Cluster Computing

Physical Location: Cloud Computing Machines (typically) in a central, remote location; e.g., a server farm like Amazon EC2

Physical Location: Cloud Computing

Physical Location: Grid Computing Machines in diverse locations

Physical Location: Grid Computing

Physical Locations Cluster computing: – Typically centralised, local Cloud computing: – Typically centralised, remote Grid computing: – Typically decentralised, remote

LIMITATIONS OF DISTRIBUTED SYSTEMS: EIGHT FALLACIES

Eight Fallacies By L. Peter Deutsch (1994) – James Gosling (1997) “Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause big trouble and painful learning experiences.” — L. Peter Deutsch Each fallacy is a false statement!

1. The network is reliable Machines fail, connections fail, firewall eats messages flexible routing retry messages acknowledgements!

2. Latency is zero M1: Store X M1 M2: Copy X from M1 M2 There are significant communication delays avoid “races” local order ≠ remote order acknowledgements minimise remote calls – batch data! avoid waiting – multiple-threads

3. Bandwidth is infinite M1: Copy X (10GB) M1 M2 Limited in amount of data that can be transferred avoid resending data direct connections caching!! M1: Copy X (10GB)

4. The network is secure M1: Send Medical History M1 Network is vulnerable to hackers, eavesdropping, viruses, etc. send sensitive data directly isolate hacked nodes – hack one node ≠ hack all nodes authenticate messages secure connections

5. Topology doesn’t change Message M5 thru M2, M3, M4 How machines are physically connected may change (“churn”)! avoid fixed routing – next-hop routing? abstract physical addresses flexible content structure M2 M3 M4 M5 M1

6. There is one administrator Different machines have different policies! Beware of firewalls! Don’t assume most recent version – Backwards compat.

7. Transport cost is zero It costs time/money to transport data: not just bandwidth (Again) minimise redundant data transfer – avoid shuffling data – caching direct connection compression?

8. The network is homogeneous Devices and connections are not uniform interoperability! – Java vs..NET? route for speed – not hops load-balancing

Eight Fallacies (to avoid) 1.The network is reliable 2.Latency is zero 3.Bandwidth is infinite 4.The network is secure 5.Topology doesn’t change 6.There is one administrator 7.Transport cost is zero 8.The network is homogeneous Severity of fallacies vary in different scenarios! Which fallacies apply/do not apply for: Gigabit ethernet LAN? BitTorrent The Web

LABS REVIEW/PREP

Why did it work in memory? We processed a lot of data. Why did it work in memory? Not so many unique words … – but lots of new proper nouns – Heap’s law: – U(n) ≈ Kn β – English text K ≈ 10 β ≈ 0.6

What if it doesn’t work in memory? How could we implement a word- count (or a bi-gram count) using the hard disk for storage?

Most generic method: use sorting tengo que aprender más español tan pronto que puedo y tengo que tomar cada oportunidad para practicar como ahora aprender cada como español más oportunidad para practicar pronto puedo que tan tengo tomar y ahora1 aprender1 cada1 como1 español1 más1 oportunidad1 para1 practicar1 pronto1 puedo1 que3 tan1 tengo2 tomar1 y1 que3 tengo2 ahora1 aprender1 cada1 como1 español1 más1 oportunidad1 para1 practicar1 pronto1 puedo1 tan1 tomar1 y1 How can we use the disk to sort?

External Merge-Sort 1: Batch Sort in batches bigram121 bigram42 bigram732 bigram42 bigram123 bigram149 bigram42 bigram1294 bigram123 bigram42 bigram6 bigram123 bigram42 bigram121 bigram732 Input on-disk (Input size: n) In-memory sort (Batch size b) Output batches on-disk ( ⌈ n/b ⌉ batches) bigram42 bigram121 bigram732 bigram42 bigram123 bigram149 bigram1294 bigram42 bigram123 bigram149 bigram1294 bigram6 bigram42 bigram123 bigram6 bigram42 bigram123

External Merge-Sort 2: Merge bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294 In-memory sortInput batches on-disk ( ⌈ n/b ⌉ batches) bigram42 bigram121 bigram732 bigram42 bigram123 bigram149 bigram1294 bigram6 bigram42 bigram123 Sorted output (Output size: n)

Counting bigrams is then easy? bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294 bigram6, 1 bigram42, 4 bigram121, 1 bigram 123, 3 bigram149, 1 bigram732, 1 bigram1294, 1 Could use merge-sort again to order by occurrence!

Does external merge-sorting scale? If you have too many batches to read simultaneously, disk will go nuts – Use lots of main-memory to reduce batch count – Only merge k at a time Any problem with external merge-sorting as we scale really high? If we have n batches and merge them k at a time, how many passes will we need? Any solution(s)?

RECAP

Topics Covered What is a (good) Distributed System? Client–Server model – Fat/thin client – Mirror/proxy servers – Three-tier Peer-to-Peer (P2P) model – Central directory – Unstructured – Structured (Hierarchical/DHT) – BitTorrent

Topics Covered Physical locations: – Cluster (local, centralised) vs. – Cloud (remote, centralised) vs. – Grid (remote, decentralised) 8 fallacies – Network isn’t reliable – Latency is not zero – Bandwidth not infinite, – etc.

Topics Covered External Merge Sorting – Split data into batches – Sort batches in memory – Write batches to disk – Merge sorted batches into final output

Questions?