Microsoft Project Olympus Rack Management
Mallik Bulusu, Cloud Firmware Architect, Microsoft Corporation
What are these images?
Super view of London – courtesy NASA
Super view of Phoenix, AZ – courtesy NASA
Building Block for Data Center Management
Rack is the fundamental building block for effective Data Center Management. Rack manager is the brain.
Cloud tenants – Set of VMs and OS instances
Set of Servers
Set of Racks
Rack Manager Hardware
Rack Manager Firmware / Software
Rack Manager Requirements
Rack Manager Functions
  Power Management
  Out-of-band Server Management
RM Instances in Cloud
  Discrete (Consumes 1U)
  Integrated into PMDU
Communication
  Network
  TTY Console
Hardware Signaling
  Server Presence
  Server On/Off
  Server Throttle
Power Metering & Control
Remote Debug
Remote Media
Out of band FW update and recovery
  UEFI, CPLD, FPGA, PSU
[Diagram: rack boundary containing Rack Manager, Fabric Switch, and Servers]
Rack Manager Design Choices
ASPEED vs ARM
  Dual NIC
  Different Peripherals
  Higher accuracy ADC
  Offload Power Calculations
  Miscellaneous provisioning of components etc.
Network Port: Shared vs. Discrete
  NCSI: single port with two MAC addresses
  Rack Manager: Dedicated
Rack Manager Data Traffic Interfaces
[Diagram: rack-bound and fabric-bound traffic paths. Labels: Redfish, SSH, Legacy REST, IPMI]
Rack Manager Block Diagram
[Diagram labels]
  LEDs – Attention, Power, Debug, Status
  PCIe x16 edge finger & PCIe x8 edge finger to interface with PMDU or backplane
  Temp Sensor
  Humidity Sensor
  To Fabric
  To Mgmt. Switch
  FRU
  GPIO Buffers: x48 for Blade presence, x48 for Blade enable
  eMMC
  GPIOs for Boot strap, Throttle bypass, power control etc.
  Debug
  QSPI
  To DIGI
  1GB DDR3L
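The block diagram lists 48 GPIO buffers for blade presence and 48 for blade enable. The following is a minimal sketch of how firmware might decode such a bank. It assumes the presence lines are read as a 48-bit integer in which bit n-1 set means slot n is populated. The bit order, the polarity, and the function names are assumptions for illustration and are not taken from the slides.

```python
SLOT_COUNT = 48  # from the block diagram: x48 presence, x48 enable


def present_slots(presence_bits: int) -> list[int]:
    """Return the 1-based slot numbers whose presence bit is set."""
    if presence_bits < 0 or presence_bits >> SLOT_COUNT:
        raise ValueError("presence bitmap must fit in 48 bits")
    return [slot for slot in range(1, SLOT_COUNT + 1)
            if presence_bits & (1 << (slot - 1))]


def enable_mask(slots_to_enable: list[int], presence_bits: int) -> int:
    """Build an enable bitmap, refusing to enable unpopulated slots."""
    present = set(present_slots(presence_bits))
    mask = 0
    for slot in slots_to_enable:
        if slot not in present:
            raise ValueError(f"slot {slot} is not populated")
        mask |= 1 << (slot - 1)
    return mask
```

Checking presence before building the enable mask is one possible policy; real firmware might instead allow enabling an empty slot so that a blade powers on when inserted.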
Management Switch Requirements
General purpose switch (L2+)
  48x RJ45 ports
Management Requirements
  Console Port (OOB)
  Ethernet
  1+1 AC Redundant Power Supply
  PSUs must support hot on-line repair
  Fans are not required
Orientation
  Cold Aisle: RJ45 ports, Uplinks, UART
  Hot Aisle: PSUs (for replaceability)
Configuration Requirements
  48 individual VLANs, 48 DHCP address pools
  Single leasable address per pool
  Support for indefinite DHCP leases
  Configuration update via UART RJ45 port
  Supports TFTP firmware recovery
[Diagram: rack boundary containing Rack Manager, Fabric Switch, and Servers]
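The configuration requirements (48 VLANs, 48 DHCP pools, one leasable address per pool, indefinite leases) can be illustrated with a small plan generator. This is only a sketch: the base network 10.10.0.0/16, the /30 subnet size, the starting VLAN ID of 101, and all names below are hypothetical choices, not values from the slides.

```python
import ipaddress
from dataclasses import dataclass
from itertools import islice
from typing import Optional

PORT_COUNT = 48  # one VLAN and one DHCP pool per RJ45 port


@dataclass(frozen=True)
class PortPlan:
    port: int
    vlan_id: int
    subnet: ipaddress.IPv4Network
    gateway: ipaddress.IPv4Address
    lease_address: ipaddress.IPv4Address  # the only address in the pool
    lease_seconds: Optional[int]          # None stands for an indefinite lease


def build_plan(base: str = "10.10.0.0/16", first_vlan: int = 101) -> list[PortPlan]:
    """Assign each port its own VLAN and a /30 with one leasable address."""
    supernet = ipaddress.ip_network(base)
    plans = []
    for index, subnet in enumerate(islice(supernet.subnets(new_prefix=30), PORT_COUNT)):
        gateway, lease = list(subnet.hosts())  # a /30 has exactly two hosts
        plans.append(PortPlan(port=index + 1, vlan_id=first_vlan + index,
                              subnet=subnet, gateway=gateway,
                              lease_address=lease, lease_seconds=None))
    return plans
```

With one address per pool, a server's management address is determined by the switch port it is plugged into, which is presumably why the slide pairs single-address pools with indefinite leases.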
Rack Manager Build Process
[Diagram: Yocto build flow. Labels: Dependencies, Recipes Repo, Kernel Recipes, U-Boot, Yocto, QEMU, BSP, Sources, BitBake toolchain, BIN, Applications + Services]
Rack Manager Firmware Stack
Embedded Firmware
  Small footprint (min-kernel)
  Yocto build framework
Connectivity
  Ethernet: SSH for CLI, Redfish HTTPS
  UART: CLI Console
Controls
  Hardware OFF/ON, AC Relay, Server OOB BMC Interface
Robust FW Update
Recovery
  Boot Loader Recovery Console
  Factory Restore or rollback
  Ethernet + TFTP
Bare-metal imaging
  Ethernet – bootp + TFTP
Rack Manager Firmware Recovery
Normal Update
  TFTP/SCP Secure File Copy
  Forklift - Execute in place
U-Boot Recovery
  Read-only partition
  WDT Recovery
  TFTP + Ethernet
  U-Boot recovery console
Remote Pin Strap
  Ethernet – bootp + TFTP
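The recovery paths on this slide (normal boot, watchdog-triggered recovery from a read-only partition, and a remote pin strap that forces bootp + TFTP) suggest a boot-source decision. The sketch below is one possible reading. The priority order, the reset threshold, and every name in it are assumptions, not the shipped boot loader logic.

```python
from enum import Enum


class BootSource(Enum):
    PRIMARY = "primary image"
    RECOVERY = "read-only recovery partition"
    NETWORK = "bootp + TFTP network image"


MAX_WDT_RESETS = 3  # hypothetical threshold; the slides give no number


def select_boot_source(pin_strap_asserted: bool,
                       wdt_resets: int,
                       primary_valid: bool) -> BootSource:
    """Pick where to boot from, most forceful override first."""
    if pin_strap_asserted:
        # Remote pin strap: operator explicitly requested network imaging.
        return BootSource.NETWORK
    if not primary_valid or wdt_resets >= MAX_WDT_RESETS:
        # Repeated watchdog resets or a bad image: fall back to recovery.
        return BootSource.RECOVERY
    return BootSource.PRIMARY
```

For example, `select_boot_source(False, 3, True)` selects the recovery partition even though the primary image validates, on the assumption that repeated watchdog resets indicate a runtime failure.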
Rack Manager Firmware Logical Flow
Auth at Interface: Ethernet / REST, CLI / TTY
Privilege: Permissions token
Log / Audit
Execution: Task execution
Logical to Physical Action
Hardware Control
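The logical flow (authenticate at the interface, check privilege, log, execute, then map the logical action to hardware control) can be sketched as a small pipeline. The users, roles, command names, and the `FakeRelayBank` stand-in are all hypothetical and exist only to show the ordering of the stages. A real implementation would not store plain-text passwords.

```python
import hmac
from dataclasses import dataclass, field

# Hypothetical tables for illustration only.
USERS = {"alice": ("s3cret", "admin"), "bob": ("hunter2", "user")}
REQUIRED = {"show_ports": "read", "set_port": "control"}
GRANTS = {"admin": {"read", "control"}, "user": {"read"}}


@dataclass
class FakeRelayBank:
    """Stand-in for the physical relay hardware."""
    state: dict = field(default_factory=dict)

    def set(self, port: int, on: bool) -> None:
        self.state[port] = on


@dataclass
class CommandPipeline:
    hardware: FakeRelayBank
    audit: list = field(default_factory=list)

    def run(self, user: str, password: str, command: str, **args):
        if command not in REQUIRED:
            raise ValueError(f"unknown command: {command}")
        # 1. Auth at interface
        record = USERS.get(user)
        if record is None or not hmac.compare_digest(
                record[0].encode(), password.encode()):
            self.audit.append((user, command, "auth-failed"))
            raise PermissionError("authentication failed")
        # 2. Privilege check against the role's permissions
        if REQUIRED[command] not in GRANTS[record[1]]:
            self.audit.append((user, command, "denied"))
            raise PermissionError("insufficient privilege")
        # 3. Log / audit before acting
        self.audit.append((user, command, "accepted"))
        # 4. Execution: logical request mapped to a physical action
        if command == "set_port":
            self.hardware.set(args["port"], args["on"])
            return "ok"
        return dict(self.hardware.state)
```

Writing the audit entry before touching hardware means a denied or failed request still leaves a record, which matches the slide placing Log / Audit ahead of Execution.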
Firmware Commands
Rack Power: Rack Power Status; Rack Power Reading; Log Power Faults; Rack Manager Power Limit Control; Rack Power Limit Row Alert; Rack Power Reading, per phase; Rack Power Meter version; Clear Rack Power Fault Log; Clear Rack Max Power; Rack Power Telemetry; Rack Power Throttle Control; Rack Power Throttle Report; Rack Throttle HW Bypass
Users: shows roles and users; shows users by role; shows users by user group; add new users; update user role; update user password; delete existing user
Rack Manager: show rack manager version; show rack information; show inventory of rack; rack manager health; PDU power port status; PDU On/Off; Rack Attention LED; Rack Manager Status LED; Rack Audit Log; Rack Telemetry Log; Log management, clearing/moving; Rack Manager asset info; Rack Manager FRU programming; Rack Manager Relay Status; Rack Manager Relay Control; Firmware Update; Firmware Recovery; Rack Manager Console
Rack Manager Network: Configure network settings; Configure network interface; add static routes; enable/disable interfaces
Rack Manager Serial: Serial Port Console Control
Management Switch: Switch Firmware Update; Switch Config Update; Switch Reset; Switch Console; Switch Status; Switch Port Status
Rack Services: TFTP Server Control; TFTP Client; NFS Server Control; NTP server and Control
Rack Manager Configuration: Rack Manager session list; Rack Manager session kill
Native IPMI commands
Firmware Commands – Contd.
Server Management: Server information; Server health; Sensors and health; Server Fan health; Server Rack location; Server LED status/control; Server State; Server Power Control; Server Soft Power Control; Default Power State; System Presence; POST code logging; System Event Logs; Power Cycle Control; FPGA health status; FPGA temperature; FPGA version
Server Management (continued): FPGA pass-through mode; FPGA asset info; FPGA Update; FPGA Recovery; Remote Media booting; Remote HW Debug; Remote Kernel Debug; Boot order control; Next boot control; BIOS Configuration; BIOS Update; BIOS Physical Presence; BMC Update; BMC Command Console; BMC Cmd pass-through; Server Console Session; Console Session Control
Server Power: Server Power Limit Control; Server Power Alert Actions; Server Power Policy Actions; Server Power Limiting; Server Throttle Status
PSU Status, Battery Status: PSU Status; Battery presence; Battery test; PSU Firmware Status; PSU Firmware Update; Firmware and boot loader version; Clear PSU Fault log; Clear Phase Fault Log
Additional Firmware Commands
Server Management: Server Presence/Location; Server Extended Logs; Server Temperature / Fans; Server IPT (JTAG debug); Server FPGA (debug / recovery); Server Kernel Debug; Server Remote Media; Server Power Policies; Server Firmware Update; Server Firmware Recovery; Server PSU health; Server Cmd pass-through
Rack: Rack Power; Rack Temperature; Rack Inventory; Firmware Update; Rack Telemetry Log; SSH Shell; TFTP Server / Client
Management Switch: Switch Firmware Update; Switch Config Update; Switch Console; Switch Power Control; Switch Status
Rack Manager Interfaces
SSH CLS Interface
  SSH
  Serial Console
  Complete Walled Garden CLI Environment
REST Interface
  Functionally comparable to CMv1/CMv2
Redfish support
  Example Schema Hierarchy
  Rack vs. Server Schema Comparison
  Example Schema Extensions
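To make the Redfish schema hierarchy concrete, the sketch below walks a sample Chassis collection and extracts the member links. The `Members` array and the `@odata.id` link property are standard Redfish conventions. The member names `Rack1` and `Server1` and the sample payload itself are hypothetical and do not come from the Rack Manager implementation. Fetching the document over HTTPS is omitted.

```python
import json

# Hypothetical payload shaped like a Redfish resource collection.
SAMPLE_CHASSIS_COLLECTION = json.loads("""
{
  "@odata.id": "/redfish/v1/Chassis",
  "Name": "Chassis Collection",
  "Members@odata.count": 2,
  "Members": [
    {"@odata.id": "/redfish/v1/Chassis/Rack1"},
    {"@odata.id": "/redfish/v1/Chassis/Server1"}
  ]
}
""")


def member_paths(collection: dict) -> list[str]:
    """Return the resource paths linked from a Redfish collection."""
    return [member["@odata.id"] for member in collection.get("Members", [])]


def leaf_names(collection: dict) -> list[str]:
    """Return the last path segment of each member, e.g. 'Rack1'."""
    return [path.rstrip("/").rsplit("/", 1)[-1] for path in member_paths(collection)]
```

A client would typically start at the service root `/redfish/v1/` and follow `@odata.id` links like these rather than hard-coding resource paths.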
Links
Project Olympus Software Implementation
  @ OCP: http://www.opencompute.org/wiki/Server/ProjectOlympus
  @ Github: https://github.com/opencomputeproject/Project_Olympus
Software Implementation @ Github: https://github.com/Project-Olympus/rackmanager-bsp
Texas Instruments AM4376 Datasheet: http://www.ti.com/lit/ds/symlink/am4376.pdf
Redfish: https://www.dmtf.org/standards/redfish
Back-up
AC Power Monitoring
Monitors AC power at the rack level
Senses AC voltage for each phase
Senses AC current for each phase (current transformer)
12 A/D converters on the Rack Manager
Scales and calculates rack power
Accuracy: +/-2.5% from 20% to 100% of max load
Error increases by 0.5% below 20% load
Requires factory calibration
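The "scale and calculate" step can be sketched as follows: convert raw ADC readings to volts and amps using calibration constants, then sum per-phase power. The slides do not say how the constants are stored or whether power factor is measured, so the linear scale/offset model, the `power_factor` input, and all names here are assumptions. The error band below 20% load is read as 2.5% + 0.5% = 3.0%, which is one interpretation of the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """Linear calibration for one ADC channel (assumed model)."""
    scale: float   # engineering units per ADC count, from factory calibration
    offset: float  # engineering units at zero counts

    def convert(self, counts: float) -> float:
        return counts * self.scale + self.offset


def rack_power_watts(phases) -> float:
    """Sum real power over phases.

    phases: iterable of (v_rms_volts, i_rms_amps, power_factor) tuples.
    """
    return sum(v * i * pf for v, i, pf in phases)


def error_band_percent(load_fraction: float) -> float:
    """Expected accuracy band (+/- percent) for a given fraction of max load."""
    if not 0.0 <= load_fraction <= 1.0:
        raise ValueError("load_fraction must be between 0 and 1")
    return 2.5 if load_fraction >= 0.20 else 3.0
```

For three phases at 230 V and 10 A each with unity power factor, `rack_power_watts` returns 6900 W, and at that load (assuming it is above 20% of maximum) the reading would be expected to fall within +/-2.5% of the true value.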