11.2 Processes
Sajitha Naduvil-vadukootu
Overview
- Review: Processes
- NFS: Network File System (1985)
- HDFS: Hadoop Distributed File System (2006)
- Spark Architecture (2010)
- IPFS: The Permanent Web (2014)
- Future Work
Review: Processes
- Processes form the basis of how work gets done in a system.
- How are they organized? Single-threaded or multithreaded; stateful or stateless.
- Architecture: client-server, clusters, master-worker, peer-to-peer.
- How communication is done.
- Code migration: sending the computation to the machine that holds the data, instead of communicating the data over the network.
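The code migration point can be illustrated with a small in-process simulation. This is only a sketch: `DataNode`, `fetch_all`, `run`, and the `items_sent` counter are hypothetical names invented for the example, and the "network" is just a counter of items that would have to travel between machines.

```python
class DataNode:
    """A hypothetical machine that stores a data set locally."""

    def __init__(self, records):
        self._records = list(records)
        self.items_sent = 0  # counts items that cross the simulated network

    def fetch_all(self):
        """Data shipping: return every record to the caller."""
        self.items_sent += len(self._records)
        return list(self._records)

    def run(self, fn):
        """Code migration: run the caller's function next to the data."""
        result = fn(self._records)
        self.items_sent += 1  # only the result travels back
        return result


node = DataNode(range(1000))

# Option 1: pull all the data to the client, then compute locally.
total_by_shipping_data = sum(node.fetch_all())
sent_by_shipping_data = node.items_sent

# Option 2: send the computation to the node and receive only the answer.
node.items_sent = 0
total_by_migrating_code = node.run(sum)
sent_by_migrating_code = node.items_sent
```

Both options produce the same total, but the second moves a single value instead of the whole data set, which is the motivation for sending application code to workers in the systems that follow.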
NFS: Network File System
- Provides access to remote file systems, transparent to clients.
- Integrated into the Unix kernel through a Virtual File System (VFS) interface.
- Uses a Remote Procedure Call (RPC) package for communication.
- Calls from client to server are synchronous, and the server is stateless.
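A stateless server can be sketched as a read handler that receives everything it needs in each request. The code below is a simplified illustration, not the NFS protocol: `handle_read` and `client_read_file` are hypothetical names, a local path stands in for the opaque file handle that NFS uses, and an ordinary function call stands in for the synchronous RPC.

```python
import os
import tempfile


def handle_read(path, offset, count):
    """Server side: serve one request using only the request's arguments.

    Nothing is remembered between calls, so the server could restart
    between any two requests without the client noticing.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(count)


def client_read_file(path, chunk_size):
    """Client side: the client, not the server, tracks the current offset."""
    offset = 0
    chunks = []
    while True:
        data = handle_read(path, offset, chunk_size)  # synchronous call
        if not data:
            break
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks)


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stateless servers keep no open-file table")
    demo_path = tmp.name

demo_contents = client_read_file(demo_path, chunk_size=8)
os.remove(demo_path)
```

Because every request names the file, the offset, and the length, a retried request returns the same bytes, which keeps crash recovery simple for both sides.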
HDFS (Hadoop Distributed File System)
- Abstracts the cluster’s storage, presenting a single file system.
- Goals: flexibility (schema-less), durability, fault tolerance, balanced data distribution.
- Relaxed POSIX semantics: a file has a single writer at a time (enforced by a lease), so there is no locking protocol for concurrent writes to the same file.
- Splits files into blocks (chunks) and replicates them for fault tolerance.
- MapReduce as the data processing model.
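Splitting a file into blocks and replicating each block can be sketched in a few lines. This is a toy model under stated assumptions: the function names are hypothetical, the block size is tiny for readability, and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
survives_two_failures = readable_after_failure(placement, ["n3", "n4"])
survives_all_failures = readable_after_failure(placement, ["n1", "n2", "n3", "n4"])
```

With three replicas per block, losing any two nodes still leaves at least one copy of every block, which is the fault-tolerance argument behind replication.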
Spark Architecture
- Designed for iterative jobs over large data sets and for interactive queries.
- Runs on top of HDFS and is included in applications as a library.
- Application code is sent to the workers.
- Master-worker architecture.
- Resilient Distributed Datasets (RDDs).
- Job scheduling.
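The idea behind a Resilient Distributed Dataset can be sketched without Spark itself: a dataset remembers the chain of transformations (its lineage) that produced it, so a lost partition can be recomputed from the source data. `MiniRDD` below is a hypothetical toy class written for this example; it is not the Spark API and it runs in a single process.

```python
class MiniRDD:
    """Toy partitioned dataset that records its lineage (not the Spark API)."""

    def __init__(self, partitions, lineage=()):
        self._source = partitions  # list of lists, one per partition
        self._lineage = lineage    # tuple of (kind, function) steps

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """(Re)build one partition by replaying the lineage on its source."""
        items = iter(self._source[index])
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        """Action: compute every partition and gather the results."""
        result = []
        for index in range(len(self._source)):
            result.extend(self.compute_partition(index))
        return result


numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

all_results = even_squares.collect()
# If the worker holding partition 1 failed, only that partition is rebuilt.
recovered_partition = even_squares.compute_partition(1)
```

The functions passed to `map` and `filter` play the role of the application code that is shipped to the workers, and replaying the lineage stands in for recomputing lost data rather than replicating it.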
IPFS: The InterPlanetary File System
- Global file system that can access very large data.
- Peer-to-peer.
- High throughput for accessing large (petabyte-scale) data files.
- No single point of failure.
- Peers don’t need to trust each other.
- Inspired by BitTorrent (a file-sharing application) and HTTP (a protocol).
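One reason peers need not trust each other is content addressing: data is requested by a hash of its content, so the receiver can check what it gets. The sketch below is a simplification under stated assumptions: it uses a plain SHA-256 hex digest as the address (IPFS uses its own identifier format), and `UntrustedPeer` and `fetch_verified` are hypothetical names.

```python
import hashlib


def content_address(data):
    """Derive the address of a block from its content."""
    return hashlib.sha256(data).hexdigest()


class UntrustedPeer:
    """A peer that stores blocks by address and may misbehave."""

    def __init__(self):
        self._blocks = {}

    def put(self, data):
        address = content_address(data)
        self._blocks[address] = data
        return address

    def get(self, address):
        return self._blocks[address]

    def tamper(self, address, fake_data):
        """Simulate a malicious or corrupted peer."""
        self._blocks[address] = fake_data


def fetch_verified(peer, address):
    """Fetch a block and reject it if it does not hash to its address."""
    data = peer.get(address)
    if content_address(data) != address:
        raise ValueError("content does not match its address")
    return data


peer = UntrustedPeer()
address = peer.put(b"hello, permanent web")
good_fetch = fetch_verified(peer, address)

peer.tamper(address, b"something else")
try:
    fetch_verified(peer, address)
    tampering_detected = False
except ValueError:
    tampering_detected = True
```

Since the address commits to the content, any peer can serve a block and the requester can verify it locally, which also removes the need for a single authoritative server.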
Future Work
- Resource allocation in a master-worker architecture can be improved, or even automated, by monitoring the computing capacity of the worker nodes.
- Instead of a single master, whose memory becomes a constraint, use multiple masters that collaborate, or master-less systems (grids, peer-to-peer).
- A more unified interface is needed for accessing the underlying data.
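The first suggestion, allocation driven by monitored capacity, could look roughly like the greedy scheduler below. It is a minimal sketch of one possible policy, not a proposal from the cited systems: `assign_tasks` is a hypothetical function, capacities and task costs are abstract units, and ties are broken alphabetically by worker name.

```python
def assign_tasks(task_costs, free_capacity):
    """Give each task to the worker that currently reports the most free capacity.

    task_costs:    dict of task name -> cost, processed in insertion order
    free_capacity: dict of worker name -> monitored free capacity
    Returns (assignment, remaining free capacity).
    """
    free = dict(free_capacity)  # work on a copy of the monitored values
    assignment = {}
    for task, cost in task_costs.items():
        # sorted() makes tie-breaking deterministic (alphabetical order).
        worker = max(sorted(free), key=lambda name: free[name])
        if free[worker] < cost:
            raise RuntimeError(f"no worker has capacity for {task}")
        free[worker] -= cost
        assignment[task] = worker
    return assignment, free


assignment, remaining = assign_tasks(
    {"t1": 2, "t2": 2, "t3": 1},
    {"w1": 4, "w2": 2},
)
```

In a real cluster the capacity figures would come from periodic reports by the workers, and the master would repeat this decision as those reports change.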