Update on Scalable SA Project #OFADevWorkshop Hal Rosenstock Mellanox Technologies.

Slides:



Advertisements
Similar presentations
Tableau Software Australia
Advertisements

Database Architectures and the Web
The Development of Mellanox - NVIDIA GPUDirect over InfiniBand A New Model for GPU to GPU Communications Gilad Shainer.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 6 Managing and Administering DNS in Windows Server 2008.
Highly Available Central Services An Intelligent Router Approach Thomas Finnern Thorsten Witt DESY/IT.
Oracle Clustering and Replication Technologies CCR Workshop - Otranto Barbara Martelli Gianluca Peco.
Lecture 6 – Google File System (GFS) CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation.
Hands-On Microsoft Windows Server 2003 Networking Chapter 7 Windows Internet Naming Service.
Hands-On Microsoft Windows Server 2003 Administration Chapter 9 Administering DNS.
1 DNS,NFS & RPC Rizwan Rehman, CCS, DU. Netprog: DNS and name lookups 2 Hostnames IP Addresses are great for computers –IP address includes information.
Database System Architectures  Client-server Database System  Parallel Database System  Distributed Database System Wei Jiang.
DNS. Outline r Domain Name System r DNS Hierarchy r Resolution.
IB ACM InfiniBand Communication Management Assistant (for Scaling) Sean Hefty.
DNS and Active Directory Integration
© 2013 Mellanox Technologies 1 NoSQL DB Benchmarking with high performance Networking solutions WBDB, Xian, July 2013.
Copyright © 2007 InfiniBand ® Trade Association. Other names and brands are properties of their respective owners. IB Cross-Subnet Communication OpenFabrics.
SRP Update Bart Van Assche,.
Computer System Architectures Computer System Software
Chapter 16 – DNS. DNS Domain Name Service This service allows client machines to resolve computer names (domain names) to IP addresses DNS works at the.
OFED 1.x Roadmap & Release Process November 06 Jeff Squyres, Woodruff, Robert J, Betsy Zeller, Tziporet Koren,
1 Application Layer Lecture 6 Imran Ahmed University of Management & Technology.
OFED 1.2 Lessons, 1.3 Planning and Field Support May 07 Tziporet Koren.
Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.
InfiniBand Routing Solution Approach Yaron Haviv, CTO, Voltaire
Presented by, MySQL AB® & O’Reilly Media, Inc. 0 to 60 in 3.1 Tyler Carlton Cory Sessions.
Scalable name and address resolution infrastructure -- Ira Weiny/John Fleck #OFADevWorkshop.
Management Scalability Author: Todd Rimmer Date: April 2014.
Scalable RDMA Software Solution Sean Hefty Intel Corporation.
Distributed monitoring system. Why Monitor? Solve them! Identify Problems Ensure conduct Requirements Manage many computers Spot trends in the system.
Configuring Name Resolution and Additional Services Lesson 12.
Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.
InfiniBand Routers Ian Colloff : QLogic LWG Co-Chair.
Management Tools Development related to DoE Hal Rosenstock.
11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.
OFED 1.3 InfiniBand Management Update Hal Rosenstock.
Virtual Machines Created within the Virtualization layer, such as a hypervisor Shares the physical computer's CPU, hard disk, memory, and network interfaces.
DNS DNS overview DNS operation DNS zones. DNS Overview Name to IP address lookup service based on Domain Names Some DNS servers hold name and address.
Clustering Servers Chapter Seven. Exam Objectives in this Chapter:  Plan services for high availability Plan a high availability solution that uses clustering.
Introduction to Windows Server 2003,. 2 Objectives Identify the key features of each platform that makes up the Windows Server 2003 family Understand.
Hands-On Microsoft Windows Server 2003 Chapter 1 Introduction to Windows Server 2003, Standard Edition.
Stairway to the cloud or can we take the highway? Taivo Liik.
OpenFabrics Interface WG A brief introduction Paul Grun – co chair OFI WG Cray, Inc.
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
OFED 1.2 Management Update Hal Rosenstock.
Renesas Electronics America Inc. © 2010 Renesas Electronics America Inc. All rights reserved. Overview of Ethernet Networking A Rev /31/2011.
Mellanox Connectivity Solutions for Scalable HPC Highest Performing, Most Efficient End-to-End Connectivity for Servers and Storage September 2010 Brandon.
ECHO A System Monitoring and Management Tool Yitao Duan and Dawey Huang.
Cloud Computing project NSYSU Sec. 1 Demo. NSYSU EE IT_LAB2 Outline  Our system’s architecture  Flow chart of the hadoop’s job(web crawler) working.
CNAF Database Service Barbara Martelli CNAF-INFN Elisabetta Vilucchi CNAF-INFN Simone Dalla Fina INFN-Padua.
Operational and Application Experiences with the Infiniband Environment Sharon Brunett Caltech May 1, 2007.
Linux Management Enhancements Hal Rosenstock.
OpenFabrics Developers Summit SC06 QoS Update and Implementation RFC Eitan Zahavi, Mellanox Technologies Nov 2006.
Replicazione e QoS nella gestione di database grid-oriented Barbara Martelli INFN - CNAF.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
InfiniBand Routing in OFA Jason Gunthorpe – Obsidian Sean Hefty – Intel Hal Rosenstock – Voltaire.
Configuring SQL Server for a successful SharePoint Server Deployment Haaron Gonzalez Solution Architect & Consultant Microsoft MVP SharePoint Server
SDN controllers App Network elements has two components: OpenFlow client, forwarding hardware with flow tables. The SDN controller must implement the network.
Enhancements for Voltaire’s InfiniBand simulator
Understanding and Improving Server Performance
Netscape Application Server
Infiniband Architecture
LCG 3D Distributed Deployment of Databases
Distributed Network Traffic Feature Extraction for a Real-time IDS
Open Source distributed document DB for an enterprise
DNS.
Simple Connectivity Between InfiniBand Subnets
CHAPTER 3 Architectures for Distributed Systems
Conditions Data access using FroNTier Squid cache Server
Early Measurements of a Cluster-based Architecture for P2P Systems
Application taxonomy & characterization
Presentation transcript:

Update on Scalable SA Project #OFADevWorkshop Hal Rosenstock Mellanox Technologies

The Problem And The Solution SA queried for every connection Communication between all nodes creates an n 2 load on the SA In InfiniBand architecture (IBA), SA is a centralized entity Other n 2 scalability issues –Name to address (DNS) Mainly solved by a hosts file –IP address translation Relies on ARPs Solution: Scalable SA (SSA) –Turns a centralized problem into a distributed one March 30 – April 2, 2014#OFADevWorkshop2 n^2 SA load

Analysis March 30 – April 2, 2014#OFADevWorkshop3 SMSA 500 MB 1.6 billion path records 40,000 nodes 50k queries per second ~ 9 hours ~ 1.5 hours calculation

SSA Architecture March 30 – April 2, 2014#OFADevWorkshop4 Localized caching Data Processing Database replication Management Core Distribution Access Client Access Client Distribution Access Client

Distribution Tree Built with rsockets AF_IB support Parent selected based on “nearness” based on hops as well as balancing based on fanouts March 30 – April 2, 2014#OFADevWorkshop5

rsockets AF_IB rsend/rrecv performance On “luna” class machines as sender and receiver with 4x QDR links and 1 intervening switch –8 core Intel(R) Xeon(R) CPU 2.00GHz Default rsocket tuning parameters No CPU utilization measurements yet SMDB: ~0.5 GB (for 40K nodes) March 30 – April 2, 2014#OFADevWorkshop6 Data Transfer Size in BytesElapsed Time 0.5 GB0.669 seconds 1.0 GB1.342 seconds

Distribution Tree Number of management nodes needed is dependent on subnet size and node capability (CPU speed, memory) –Combined nodes Fanouts in distribution tree for 40K compute nodes –10 distribution per core –20 access per distribution –200 consumer per access March 30 – April 2, 2014#OFADevWorkshop7

Core Layer March 30 – April 2, 2014#OFADevWorkshop8 SM SM’ Nodes join SSA tree Core found at SM LID raw SM DB  SSA DB extraction and comparison Manage SSA group - distribution control - monitoring - rebalancing

Core Performance Initial subnet up for ~20K nodes fabric –Extraction: sec –Comparison: sec SUBNET UP after no change in fabric –Extraction: sec –Comparison: sec SUBNET UP after single switch unlink and relink –Extraction: sec –Comparison: sec Measurements above on Intel(R) Xeon(R) CPU 2.00GHz 8 cores & 16G RAM March 30 – April 2, 2014#OFADevWorkshop9

Distribution Layer March 30 – April 2, 2014#OFADevWorkshop10 SM SM’ Transaction log - incremental updates - lockless Data agnostic Distributes SSA DB - relational data model - data versioning (epoch value)

Access Layer March 30 – April 2, 2014#OFADevWorkshop11 SM SM’ Epoch value - lightweight notification - minimal job impact Data aware Formats data - select SA queries - higher-level queries

Access Layer Notes Calculates SMDB into PRDB on per consumer basis –Multicore/CPU computation Only updates epoch if PRDB for that consumer has changed March 30 – April 2, 2014#OFADevWorkshop12

Access Layer Measurements/Future Improvement(s) Half world (HW) PR calculations for 10K node simulated subnet Using GUID buckets/core approach, parallelizing HW PR calculation works ~16 times faster on 16 core CPU –Single threaded takes 8 min 30 sec for all nodes –Multi threaded (thread per core) takes 33 seconds –Parallelization will be less than linear with CPU cores Future Improvement(s) –One HW path record per leaf switch used for all the hosts that are attached to the same leaf switch March 30 – April 2, 2014#OFADevWorkshop13

Compute Nodes (Consumer/ACM) March 30 – April 2, 2014#OFADevWorkshop14 SM SM’ Localized cache - compares epoch - pull updates Integrated with IB ACM - via librdmacm Publish local data - hostname - IP addresses

ACM Notes ACM pulls PRDB at daemon startup and when application is resolving routes/paths –Minimize OS jitter during running job ACM is moving to plugin architecture –ACM version 1 (multicast backend) –SSA backend Other ACM improvements being pursued –More efficient cache structure –Single underlying PathRecord cache ? March 30 – April 2, 2014#OFADevWorkshop15

Combined Node/Layer Support Core and access Distribution and access March 30 – April 2, 2014#OFADevWorkshop16

Reliability March 30 – April 2, 2014#OFADevWorkshop17 SM SM’ Local databases - log files for consistency Primary and backup parents Error reporting - parent notifies core of error

System Requirements AF_IB capable kernel –3.11 and beyond librdmacm with AF_IB and keepalive support –Beyond release libibverbs libibumad –Beyond release OpenSM – release or beyond March 30 – April 2, 2014#OFADevWorkshop18

OpenMPI RDMA CM AF_IB connector contributed to master branch recently –Thanks to Vasily Mellanox –Need to work out release details Not in 1.7 or 1.6 releases March 30 – April 2, 2014#OFADevWorkshop19

Deployment March 30 – April 2, 2014#OFADevWorkshop20 SMSA IB ACM Shipped by distros IB ACM Shipped by distros IB SSA Core package IB SSA Core package IB SSA Distribution package IB SSA Distribution package Mgmt Nodes Compute Nodes

Project Team Hal Rosenstock (Mellanox) - Maintainer Sean Hefty (Intel) Ira Weiny (Intel) Susan Colter (LANL) Ilya Nelkenbaum (Mellanox) Sasha Kotchubievsky (Mellanox) Lenny Verkhovsky (Mellanox) Eitan Zahavi (Mellanox) Vladimir Koushnir (Mellanox) March 30 – April 2, 2014#OFADevWorkshop21

Development Mostly by Mellanox –Review by rest of project team Verification/regression effort as well March 30 – April 2, 2014#OFADevWorkshop22

Initial Release Path Record Support Limitations (Not Part of Initial Release) –QoS routing and policy –Virtualization (alias GUIDs) Preview – June Release - December March 30 – April 2, 2014#OFADevWorkshop23

Future Development Phases 1.IP address and name resolution 1.Collect up SSA tree 2.Redistribute mappings 3.Resolve path records directly from IP address/names 2.Event collection and reporting 1.Performance monitoring March 30 – April 2, 2014#OFADevWorkshop24

Summary A scalable, distributed SA Works with existing apps with minor modification Fault tolerant Please contact us if interested in deploying this! March 30 – April 2, 2014#OFADevWorkshop25

#OFADevWorkshop Thank You