Big Data Technologies
Lecture 4: Scalability: Algorithm + Data + Hardware
Assoc. Prof. Marc FRÎNCU, PhD. Habil.
marc.frincu@e-uvt.ro
Scalability
- Ability of a system to handle an increasing volume of work
- Capacity of a system to grow in order to process larger data
- Ideally, doubling the processing power doubles the volume that can be processed
- λ - the slope of this ideal linear scaling curve
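Written out in the slides' own notation, with λ as the slope of the ideal linear curve, the ideal capacity at N processors is:

C(N) = λ · N

so C(2N) = 2 · C(N): doubling the processing power doubles the volume processed. Real systems fall below this line, as the following slides show.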
Scalability
Horizontal (in/out)
- Adding processing nodes alongside existing ones
- Commodity clusters: groups of machines networked via Gigabit Ethernet, InfiniBand, Myrinet, ...
- Requires data replication and synchronization mechanisms
Vertical (up/down)
- Adding more resources to existing nodes
- Virtualization: adding more cores, RAM, disk, etc. to a VM
- Cloud computing (on demand)
- Limited by the physical capacity of a node
Virtualization
- Creates a virtual version of an OS, server, storage device, network, ...
- Allows sharing physical resources among multiple VMs (multi-tenancy)
- Enables the installation of hardware-independent software
- Enables the configuration of images usable on a wide range of devices
- VMs are managed by a hypervisor (VMM): hardware abstraction; the guest OS accesses the hardware through the VMM
Virtualization
[Figure: classic software stack vs. virtualized software stack]
Containers
- Lightweight VMs: expose the OS interface directly through the native interface
- No VMM; the host OS provides all the required support
- Examples: Linux containers, Solaris containers, BSD jails
Advantages
- Fast allocation
- Performance similar to running directly on the OS
- Lightweight
Docker
- Extension of Linux containers (LXC)
- Previously named dotCloud
- namespaces: restrict what a container can see
- cgroups: restrict how much of a resource a container can use
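To make the namespace idea concrete, here is a minimal C sketch (not from the lecture; the hostname "container-demo" is an arbitrary choice) that detaches a process into its own UTS namespace, one of the kernel mechanisms Docker builds on. Linux only; run as root, since unshare needs CAP_SYS_ADMIN.

/* Minimal UTS namespace demo: the hostname change below is
 * invisible to the rest of the system. Linux only, needs root. */
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWUTS */
#include <stdio.h>
#include <string.h>
#include <unistd.h>     /* sethostname, gethostname */

int main(void) {
    /* Detach from the parent's UTS namespace. */
    if (unshare(CLONE_NEWUTS) != 0) {
        perror("unshare (are you root?)");
        return 1;
    }
    const char *name = "container-demo";   /* assumed name */
    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return 1;
    }
    char buf[64];
    gethostname(buf, sizeof(buf));
    printf("hostname inside the new UTS namespace: %s\n", buf);
    return 0;
}

A real runtime such as Docker additionally unshares the PID, mount, and network namespaces and places the process in cgroups to cap its CPU and memory use.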
Scalability
Strong scaling
- Measure execution time while keeping the data volume constant and increasing the number of processors
- Expectation: execution time drops k times when k processors are used
Weak scaling
- Measure execution time while increasing the number of processors and keeping the work volume per processor constant
- Expectation: execution time stays constant
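In formula form (symbols assumed, not from the slide): with T(1) the execution time on one processor and T(k) the time on k processors,

strong scaling speedup:    S(k) = T(1) / T(k)    (ideal: S(k) = k)
strong scaling efficiency: E(k) = S(k) / k       (ideal: 1)
weak scaling efficiency:   E(k) = T(1) / T(k)    (ideal: 1, since each processor keeps the same amount of work)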
Scalability
Myth: the more we parallelize code, the faster it runs
- Ideally, 2x resources = 2x faster
In reality
- Code is not 100% parallelizable
- Communication & I/O cost time
- Resources are limited
- Adding resources does bring an improvement, but a limited one
σ - the fraction of the code that is not parallelizable
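This limit is Amdahl's law. In the slide's notation, with σ the serial fraction, the speedup on N processors is bounded:

S(N) = 1 / (σ + (1 − σ)/N) ≤ 1/σ

Even with unlimited processors the speedup never exceeds 1/σ: for example, σ = 0.05 (5% serial code) caps the speedup at 20x no matter how many processors are added.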
Universal Scalability Law
- The more load the system receives, the less additional work it performs
- k - communication penalty coefficient
- Sweet spot: there is no point in adding resources beyond it
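Written out (Gunther's formulation, reusing the λ, σ, and k already defined in these slides), the relative capacity at N processors is:

C(N) = λ · N / (1 + σ(N − 1) + k · N(N − 1))

The sweet spot is the N that maximizes C(N); setting the derivative to zero gives N* = sqrt((1 − σ)/k). Beyond N*, the communication penalty term k · N(N − 1) dominates and throughput actually decreases, which is why adding resources past the sweet spot is pointless.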
Examples
- Community detection in social networks
- Weather forecast
Communication price
[Figure: speedup vs. number of processors - communication lowers speedup; adding more processors causes a drop in speedup; a hybrid approach has an advantage]
Communication advantage
Example: matrix multiplication
OpenMP
- For small dimensions: advantage of shared memory
- For large dimensions: the application does not scale
MPI
- For small dimensions: communication cost dominates
- For large dimensions: scalability (throughput, speedup)
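A minimal sketch of the shared-memory variant (assumed code, not the lecture's benchmark; the size N = 512 is an arbitrary choice): a naive matrix multiplication parallelized with OpenMP. No communication is needed because all threads share one address space, which is exactly the small-dimension advantage named above.

/* Compile: gcc -fopenmp -O2 matmul.c -o matmul */
#include <stdio.h>
#include <omp.h>

#define N 512   /* assumed problem size */

int main(void) {
    static double a[N][N], b[N][N], c[N][N];

    /* Fill inputs with arbitrary values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    double t0 = omp_get_wtime();
    /* Rows of c are divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    printf("%d threads, %.3f s, c[0][0]=%f\n",
           omp_get_max_threads(), omp_get_wtime() - t0, c[0][0]);
    return 0;
}

An MPI version would instead distribute blocks of rows across processes and exchange them explicitly, paying the communication cost up front but scaling past a single node for large dimensions.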
Impact of data & algorithm
For the same algorithm, different data can impact its scalability
Example: graph processing
- Platform: Amazon EC2 m3.large (2 Intel Xeon E5-2670 cores, 7.5 GB RAM, 100 GB SSD, 1 Gbit Ethernet)
- 2 data sets: CARN, WIKI
- No. of nodes: 3, 6, 9
- 3 algorithms:
  - Hashtag Aggregation: at each step compute a statistic about a given tag in the graph
  - Meme Tracking: analyze meme spread in a graph
  - TDSP (Time Dependent Shortest Path): used in routing; recompute the shortest path at each step
Impact of data & algorithm
CARN
- Large diameter
- Degree distribution: uniform
WIKI
- Small diameter
- Degree distribution: power law
Idea
- Partition the graph across many processors
- The number of interprocessor edges drives the communication cost
- Increasing the number of partitions reduces scalability due to interprocessor communication (TDSP, MEME); see the sketch below
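To illustrate why interprocessor edges matter, a toy C sketch (graph and partition assignment are all assumed values) that counts the cut edges of a partitioned graph; in a distributed run, each cut edge turns into interprocessor messages.

#include <stdio.h>

int main(void) {
    /* Toy graph as an edge list, and a 2-way partition. */
    int edges[][2] = {{0,1},{1,2},{2,3},{3,0},{1,3}};
    int part[] = {0, 0, 1, 1};   /* vertex -> partition */
    int n_edges = sizeof(edges) / sizeof(edges[0]);

    int cut = 0;
    for (int e = 0; e < n_edges; e++)
        if (part[edges[e][0]] != part[edges[e][1]])
            cut++;   /* endpoints live on different processors */

    printf("cut edges: %d of %d\n", cut, n_edges);
    return 0;
}

A power-law graph such as WIKI tends to produce many cut edges when partitioned, which is one reason adding partitions can hurt rather than help.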
Impact of data & algorithm
Example: detecting influence spread in parallel on large graphs
[Figure: execution time breakdown per phase - setup, I/O, processing (% parallel), shutdown]
Impact of parallel APIs
Various MPI implementations
Impact of hardware platform
Example: weather forecast (WRF)
- Blue Gene scales well
[Figure: speedup vs. number of processors]
Lecture sources
https://www.slideshare.net/vividcortex/quantifying-scalability-with-the-usl
http://www1.chapman.edu/~radenski/research/papers/mergesort-pdpta11.pdf
https://arxiv.org/pdf/1012.2273.pdf
http://serc.iisc.ernet.in/~simmhan/pubs/simmhan-ipdps-2015.pdf
https://books.google.ro/books?id=Jtha3wRWCkQC&pg=PA485&lpg=PA485
http://lass.cs.umass.edu/~shenoy/courses/spring16/lectures/Lec06.pdf
https://robinsystems.com/blog/containers-deep-dive-lxc-vs-docker-comparison/
Next lecture
Data analysis
- Heterogeneous vs. homogeneous data
- Independent vs. dependent data
- Graphs: BSP model
- Data flows
Processing platforms
- MapReduce
- Spark Streaming
- Apache Giraph