BigData@polito - Inter-departmental Lab Idilio Drago / Marco Mellia
Outline: Introduction to the BigData@Polito lab; The Big Data cluster (hardware, software, basic benchmark); How to access and use the cluster?; Examples of current usage of the cluster
BigData@Polito Lab – Why? Big data: when the data is such that processing it becomes part of the challenge (volume, velocity, variety, etc.). Goal: extract useful knowledge (data mining, machine learning, clustering, …). Big data cluster: open, flexible, scalable; based on open-source software; for experimental activities in research and teaching.
Big data vs HPC. Big data: focus on storage; simple operations on large data; embarrassingly parallel tasks; divide and conquer principle; move code to where the data is located (PB). HPC: focus on fast computing (cores, RAM, GHz, …); message passing etc.; move superfast little data to superfast CPUs (TFLOPS).
BigData@Polito Lab. Involved departments: DET, DAUIN, DISMA, DIGEP. Physical cluster location: Aula T – Ing. del Cinema. Scientific committee members: Marco Mellia (Telecommunication Networks Group, DET); Elena Baralis (Database and Data Mining Group, DAUIN); Emilio Paolucci and Paolo Neirotti (DIGEP); Mauro Gasparini and Francesco Vaccarino (DISMA); Pietro Michiardi (Distributed Systems Group, EURECOM, France).
History
Key ideas of big data frameworks. Data locality principle: move algorithms to the data, not data to the algorithms. Failures are the norm, not the exception: the framework takes care of splitting data, synchronizing tasks, and recovering from failures of a task or a server. Data-intensive workloads: MapReduce is a batch processing framework designed to perform full reads of the input, thus avoiding random access. Horizontal scalability based on commodity servers: e.g., doubling the number of servers ideally halves the processing time.
MapReduce – Toy example: How often does a word appear in a collection of documents?
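The toy example can be sketched on a single machine as follows. This is only an illustration of the map, shuffle, and reduce steps in plain Python, not Hadoop code; the function names and the sample documents are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    # In a real cluster each document (or block) would be mapped on the
    # node that stores it; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))


# Hypothetical input, used only for illustration.
docs = ["big data cluster", "big data lab", "data mining"]
counts = word_count(docs)
print(counts)
```

In a real framework the map calls run in parallel next to the data, and the shuffle moves the pairs over the network so that each reducer sees all counts for its words.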
BigData@Polito – The Hardware
4 switches: N3048
18 workers: DELL R720XD, 2 x Intel E5-2630v2 (6 cores), 96 GB memory, 12 HDs of 3 TB (JBOD), 4+1 GbE network
12 workers: SuperMicro, 1 x Intel Xeon (6 cores), 64 / 32 GB memory, 5 HDs of 2 TB (JBOD), 2+1 GbE network
3 masters: DELL R620, 2 x Intel E5-2630v2 (6 cores), 128 GB memory, 3 HDs of 600 GB in RAID, 4+1 GbE network
Workers in total: 576 logical cores (with HT), over 2 TB RAM, 276 HDs, 768 TB of storage, ~45 GB/s “nominal” disk read speed (dd)
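The worker totals can be re-derived from the per-machine figures. The short check below assumes that "6 cores" is per CPU socket and that hyper-threading exposes two logical cores per physical core; the dictionary layout is just an illustration.

```python
# Worker groups as listed on the slide.
workers = [
    {"count": 18, "cpus": 2, "cores_per_cpu": 6, "disks": 12, "disk_tb": 3},
    {"count": 12, "cpus": 1, "cores_per_cpu": 6, "disks": 5, "disk_tb": 2},
]
THREADS_PER_CORE = 2  # assumption: hyper-threading doubles the logical cores

logical_cores = sum(
    w["count"] * w["cpus"] * w["cores_per_cpu"] * THREADS_PER_CORE for w in workers
)
total_disks = sum(w["count"] * w["disks"] for w in workers)
raw_storage_tb = sum(w["count"] * w["disks"] * w["disk_tb"] for w in workers)

# Per-disk rate implied by the ~45 GB/s nominal aggregate read speed.
per_disk_mb_s = 45_000 / total_disks

print(logical_cores, total_disks, raw_storage_tb, round(per_disk_mb_s))
```

The implied rate of roughly 160 MB/s per disk is in the range expected for sequential reads from a single hard drive.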
BigData@Polito – Logical Setup. Link aggregation with bonding (balance-alb): all machines are connected to both switches in their racks. P2P communication is limited to 1 Gbps.
The Software Based on the Cloudera platform
Architecture. HDFS: Hadoop Distributed File System. YARN: Yet Another Resource Negotiator. Applications: MapReduce, Spark, etc.
HDFS: What is the usable disk capacity? Replication is set to 3: the client writes each block to its own node first, then the other rack is used for a second and a third copy. Therefore our cluster's actual capacity is 256 TB (768 TB raw / 3). Replicas guarantee resilience to disk failures (and we have already had some). They also give flexibility in the allocation of executors.
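The capacity figure follows directly from the replication factor. A minimal sketch, assuming the whole 768 TB of raw worker storage is given to HDFS and ignoring space reserved for the operating system or temporary data:

```python
def usable_capacity_tb(raw_tb, replication_factor):
    """Every block is stored replication_factor times, so only
    1 / replication_factor of the raw space holds unique data."""
    return raw_tb / replication_factor


usable = usable_capacity_tb(768, 3)
print(f"{usable:.0f} TB usable")
```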
YARN: How are the resources shared? Scheduling Policy Preemption
YARN: How are the resources shared? Dominant Resource Fairness: equalizes the “dominant share” of users. Example host: <9 CPU, 18 GB>. Task of User 1: <1 CPU, 4 GB>, dominant resource: memory. Task of User 2: <3 CPU, 1 GB>, dominant resource: CPU. Preemption occurs after 2 min: it is normal to wait some time before a job starts running, and it is normal to see containers being killed.
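The host and task sizes above can be run through a simplified version of Dominant Resource Fairness. The sketch below uses progressive filling: it repeatedly gives one task to the user with the lowest dominant share among those whose next task still fits. It illustrates the idea only and is not the YARN scheduler's code; the function and user names are hypothetical.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Simplified DRF by progressive filling.

    capacity: total amount of each resource, e.g. (cpus, memory_gb)
    demands:  per-task demand of each user, same resource order
    Returns the number of tasks granted to each user.
    """
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(
            Fraction(tasks[user] * d, c) for d, c in zip(demands[user], capacity)
        )

    def fits(user):
        # True if one more task of this user fits in the free resources.
        return all(u + d <= c for u, d, c in zip(used, demands[user], capacity))

    while True:
        candidates = [user for user in demands if fits(user)]
        if not candidates:
            return tasks
        user = min(candidates, key=dominant_share)
        tasks[user] += 1
        used = [u + d for u, d in zip(used, demands[user])]


allocation = drf_allocate((9, 18), {"user1": (1, 4), "user2": (3, 1)})
print(allocation)
```

With these numbers User 1 ends up with 3 tasks (12 of 18 GB) and User 2 with 2 tasks (6 of 9 CPUs), so both hold two thirds of their dominant resource.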
Spark applications
MLlib algorithms
Example – Spark execution overview. The application creates a driver process. The application gets its executor processes. It sends the code and tasks to the executors. Our current setup allows applications to have more than 500 executors (500+ threads reading and processing the data in parallel).
Raw HDFS read speed: because of overhead, the cluster can read at most about 13 GB/s (without any processing), well below the ~45 GB/s “nominal” disk read speed.
Terasort: roughly, this cluster can sort 1 TB in ~10 min (mapred)
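To put the two benchmark figures side by side, the small calculation below converts them to comparable units. It assumes decimal units (1 TB = 10^12 bytes) and treats "~10 min" as exactly 600 seconds, so the results are rough estimates only.

```python
TB = 10**12  # assumption: decimal terabyte
GB = 10**9


def throughput_gb_s(num_bytes, seconds):
    """Average throughput in GB/s for moving num_bytes in the given time."""
    return num_bytes / seconds / GB


# Terasort: 1 TB sorted in about 10 minutes.
sort_rate = throughput_gb_s(1 * TB, 10 * 60)

# Raw HDFS read at about 13 GB/s: time needed just to scan 1 TB.
scan_seconds = (1 * TB) / (13 * GB)

print(f"sort: {sort_rate:.2f} GB/s, raw scan of 1 TB: {scan_seconds:.0f} s")
```

Sorting is several times slower than a plain scan because the data must also be shuffled across the network and written back to disk.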
How do I request a user account? First: is this cluster/framework the best solution? The cluster has an independent LDAP/Kerberos system controlling access and HDFS permissions. Contact the person responsible in your department. DET: Marco Mellia, Maurizio Munafò, Idilio Drago, … DAUIN: Elena Baralis, Paolo Garza, … … Fill in the form available at http://bigdata.polito.it/contact
How do I use the cluster? Go to http://bigdata.polito.it/content/access-instructions
Research. Scope: new algorithms and data science. Application and transport layers: analysis of network traffic in real time. Application layer: analysis of OSN contents. Traffic classification and engineering; network security (e.g., malware detection); user and community profiling; recommendation systems.
Teaching. Computer Engineering MS, current offering: Data Mining; Artificial Intelligence; Big Data Management. New track on Data Science: Data Modeling + Data Engineering + Software Engineering + Data Mining & Analytics.
Questions?