Smart Storage and Linux An EMC Perspective Ric Wheeler
Why Smart Storage? Central control of critical data One central resource to fail over in disaster planning Banks, trading floors, and airlines want zero downtime Smart storage is shared by all hosts & OS’es Amortize the costs of high availability and disaster planning over all of your hosts Use different OS’es for different jobs (UNIX for the web, IBM mainframes for data processing) Zero-time “transfer” from host to host when both are connected Enables cluster file systems
Data Center Storage Systems Change the way you think of storage Shared Connectivity Model “Magic” Disks Scales to new capacity Storage that runs for years at a time Symmetrix case study Symmetrix 8000 Architecture Symmetrix Applications Data center class operating systems
Traditional Model of Connectivity Direct Connect Disk attached directly to host Private - OS controls access and provides security Storage I/O traffic only Separate system used to support network I/O (networking, web browsing, NFS, etc)
Shared Models of Connectivity VMS Cluster Shared disk & partitions Same OS on each node Scales to dozens of nodes IBM Mainframes Shared disk & partitions Same OS on each node Handful of nodes Network Disks Shared disk/private partition Same OS Raw/block access via network Handful of nodes
New Models of Connectivity Every host in a data center could be connected to the same storage system Heterogeneous OS & data format (CKD & FBA) Management challenge: No central authority to provide access control (Diagram: shared storage connected to IRIX, DGUX, FreeBSD, MVS, VMS, Linux, Solaris, HPUX and NT hosts)
Magic Disks Instant copy Devices, files or databases Remote data mirroring Metropolitan area 100’s of kilometers 1000’s of virtual disks Dynamic load balancing Behind-the-scenes backup No host involved
Scalable Storage Systems Current systems support 10’s of terabytes Dozens of SCSI, fibre channel, ESCON channels per host Highly available (years of run time) Online code upgrades Potentially 100’s of hosts connected to the same device Support for chaining storage boxes together locally or remotely
Longevity Data should be forever Storage needs to overcome network failures, power failures, blizzards, asteroid strikes … Some boxes have run for over 5 years without a reboot or halt of operations Storage features No single point of failure inside the box At least 2 connections to a host Online code upgrades and patches Call home on error, ability to fix field problems without disruptions Remote data mirroring for real disasters
Symmetrix Architecture 32 PowerPC 750-based “directors” Up to 32 GB of central “cache” for user data Support for SCSI, Fibre Channel, ESCON, … 384 drives (over 28 TB with 73 GB units)
Symmetrix Basic Architecture
Data Flow through a Symm
Read Performance
Prefetch is Key Read hit gets RAM speed, read miss is spindle speed What helps cached storage array performance? Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping Hints from the host could help prediction What might hurt performance? Clustering small, unrelated writes into contiguous blocks (foils prefetch on later read of the data) Truly random read I/Os
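To make the hit/miss point concrete, here is a minimal sketch of the kind of sequential-detect-and-prefetch heuristic a cached array might use; the class and its parameters are illustrative, not the Symmetrix implementation.

    # Illustrative only: detect sequential reads and prefetch ahead so the
    # next requests hit cache (RAM speed) instead of the spindle.
    class PrefetchingCache:
        def __init__(self, backend_read, prefetch_depth=8):
            self.backend_read = backend_read      # block -> data, at spindle speed
            self.prefetch_depth = prefetch_depth  # how far ahead to read on a run
            self.cache = {}                       # block -> data, at RAM speed
            self.last_block = None

        def read(self, block):
            hit = block in self.cache
            if not hit:
                self.cache[block] = self.backend_read(block)   # read miss
            if self.last_block is not None and block == self.last_block + 1:
                # Sequential run detected: pull the next blocks in early.
                for b in range(block + 1, block + 1 + self.prefetch_depth):
                    if b not in self.cache:
                        self.cache[b] = self.backend_read(b)
            self.last_block = block
            return self.cache[block], hit

Contiguous, extent-based allocation keeps logical block order close to physical order, so a heuristic like this fires; truly random reads never trigger it, and clustering unrelated writes together breaks the pattern the prefetcher looks for.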
Symmetrix Applications Instant copy TimeFinder Remote data copy SRDF (Symmetrix Remote Data Facility) Serverless Backup and Restore Fastrax Mainframe & UNIX data sharing IFS
Business Continuance Problem The “normal” daily operations cycle: online day, then backup / DSS, then resume the online day at sunrise The “race to sunrise” leaves about 4 hours of data inaccessibility (2 am to 6 am)
TimeFinder Creation and control of a copy of any active application volume Capability to allow the new copy to be used by another application or system Continuous availability of production data during backups, decision support, batch queries, DW loading, Year 2000 testing, application testing, etc. Ability to create multiple copies of a single application volume Non-disruptive re-synchronization when the second application is complete (Diagram: production application volumes, e.g. Sales, each paired with a business continuance volume; the BCV is a copy of real production data used for backups, decision support, data warehousing and Euro conversion)
Business Continuance Volumes A Business Continuance Volume (BCV) is created and controlled at the logical volume level Physical drive sizes can be different, logical size must be identical Several ACTIVE copies of data at once per Symmetrix
Using TimeFinder Establish BCV Stop transactions to clear buffers Split BCV Start transactions Execute against BCVs Re-establish BCV
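A sketch of that sequence as code; the callables below are stand-ins for the real SYMAPI/SYMCLI calls and application hooks, not actual EMC interfaces.

    # Hypothetical orchestration of the TimeFinder steps above.
    def timefinder_cycle(establish, split, quiesce_app, resume_app, job, bcv):
        establish()        # synchronize the BCV with the production volume
        quiesce_app()      # stop transactions so buffers are flushed
        split()            # freeze a point-in-time image on the BCV
        resume_app()       # production work continues against the standard volume
        job(bcv)           # backup / decision support / testing runs on the BCV
        establish()        # re-establish: resync changed tracks for the next cycle

    # Example wiring with trivial stand-ins:
    timefinder_cycle(
        establish=lambda: print("establish BCV"),
        split=lambda: print("split BCV"),
        quiesce_app=lambda: print("stop transactions"),
        resume_app=lambda: print("start transactions"),
        job=lambda vol: print(f"backup running against {vol}"),
        bcv="BCV-001",
    )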
Re-Establishing a BCV Pair BCV pair “PROD” and “BCV” have been split Tracks on “PROD” updated after split Tracks on “BCV” updated after split Symmetrix keeps a table of these “invalid” tracks after the split At re-establish of the BCV pair, “invalid” tracks are written from “PROD” to “BCV” Synch complete
Restore a BCV Pair BCV pair “PROD” and “BCV” have been split Tracks on “PROD” updated after split Tracks on “BCV” updated after split Symmetrix keeps a table of these “invalid” tracks after the split At restore of the BCV pair, “invalid” tracks are written from “BCV” to “PROD” Synch complete
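The two slides above describe the same bookkeeping used in opposite directions. A toy model (not EMC's implementation) of the invalid-track table:

    # Toy model: after a split, writes to either side mark tracks invalid;
    # re-establish copies those tracks PROD -> BCV, restore copies BCV -> PROD.
    class SplitBcvPair:
        def __init__(self, tracks):
            self.prod = dict(tracks)     # track id -> data on the production volume
            self.bcv = dict(tracks)      # identical point-in-time copy at split
            self.invalid = set()         # tracks that differ since the split

        def write_prod(self, track, data):
            self.prod[track] = data
            self.invalid.add(track)

        def write_bcv(self, track, data):
            self.bcv[track] = data
            self.invalid.add(track)

        def reestablish(self):
            for t in self.invalid:       # only the invalid tracks move
                self.bcv[t] = self.prod[t]
            self.invalid.clear()         # synch complete

        def restore(self):
            for t in self.invalid:       # same table, opposite direction
                self.prod[t] = self.bcv[t]
            self.invalid.clear()         # synch complete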
Make as Many Copies as Needed Establish BCV 1 Split BCV 1 Establish BCV 2 Split BCV 2 Establish BCV 3 (Diagram: M1 and M2 with BCV 1, BCV 2 and BCV 3; 4 PM, 5 PM, 6 PM)
The Purpose of SRDF Local data copies are not enough Maximalist: Provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been. Minimalist: Provide a means for generating periodic physical backups of the data.
Synchronous Data Mirroring Write is received from the host into the cache of the source I/O is transmitted to the cache of the target ACK is provided by the target back to the cache of the source Ending status is presented to the host Symmetrix systems destage writes to disk Useful for disaster recovery
Semi-Synchronous Mirroring An I/O write is received from the host/server into the cache of the source Ending status is presented to the host/server. I/O is transmitted to the cache of the target ACK is sent by the target back to the cache of the source Each Symmetrix system destages writes to disk Useful for adaptive copy
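The difference between the two modes is only in when the host sees ending status, as this sketch shows (dicts and a callback stand in for the caches and the channel; this is illustrative, not the SRDF code path):

    def synchronous_write(source_cache, target_cache, host_ack, block, data):
        source_cache[block] = data   # 1. write received into the source cache
        target_cache[block] = data   # 2. transmitted to the target cache
        # 3. the target ACK returns to the source ...
        host_ack(block)              # 4. ... only then is ending status presented
        # both systems destage the cached write to disk later

    def semi_synchronous_write(source_cache, target_cache, host_ack, block, data):
        source_cache[block] = data   # 1. write received into the source cache
        host_ack(block)              # 2. ending status presented immediately
        target_cache[block] = data   # 3. remote copy and target ACK happen afterwards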
Backup / Restore of Big Data Exploding amounts of data cause backups to run too long How long does it take you to back up 1 TB of data? Shrinking backup window and constant pressure for continuous application up-time Avoid using the production environment for backup No server CPU or I/O channels No involvement of the regular network Performance must scale to match customer’s growth Heterogeneous host support
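As a rough back-of-the-envelope answer to the 1 TB question (assuming, purely for illustration, a 50 MB/s sustained backup path):

    # 1 TB at an assumed 50 MB/s sustained rate
    data_mb = 1 * 1024 * 1024           # 1 TB in MB
    rate_mb_per_s = 50
    hours = data_mb / rate_mb_per_s / 3600
    print(f"{hours:.1f} hours")         # about 5.8 hours, already longer than a 4 hour window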
Fastrax Overview (Diagram: Location 1 and Location 2; UNIX and Linux hosts running Fastrax-enabled backup/restore applications with SYMAPI; Symmetrix volumes STD1, STD2, BCV1, BCV2, R1/R2; Fibre Channel point-to-point links; Fastrax Data Engine; SCSI tape library)
Host to Tape Data Flow (Diagram: data flow between the host, the Symmetrix, Fastrax and the tape library)
Fastrax Performance Performance scales with the number of data movers in the Fastrax box & the number of tape devices Restore runs as fast as backup No performance impact on the host during restore or backup (Diagram: RAF, DM, SRDF, Fastrax)
Moving Data from Mainframes to UNIX
InfoMover File System Transparent availability of MVS data to Unix hosts MVS datasets available as native Unix files Sharing a single copy of MVS datasets Uses MVS security and locking Standard MVS access methods for locking + security
IFS Implementation (Diagram: the mainframe (IBM MVS / OS390) connects over ESCON or parallel channels; open systems hosts (IBM AIX, HP HP-UX, Sun Solaris) connect over FWD SCSI, Ultra SCSI or Fibre Channel; both reach the MVS data on a Symmetrix with ESP. Minimal network overhead: no data transfer over the network!)
Symmetrix API’s
Symmetrix API Overview SYMAPI Core Library Used by “Thin” and Full Clients SYMAPI Mapping Library SYMCLI Command Line Interface
Symmetrix API’s SYMAPI is the set of high-level functions Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications SYMCLI is the “Command Line Interface” which invokes SYMAPI Used by end customers and some ISV applications.
Basic Architecture Symmetrix Application Programming Interface (SymAPI) Symmetrix Command Line Interpreter (SymCli) Other Storage Management Applications User access to the Solutions Enabler is via SymCli or a storage management application
Client-Server Architecture SymAPI server runs on the host computer connected to the Symmetrix storage controller SymAPI client runs on one or more host computers (Diagram: the server host runs the SymAPI server; client hosts and thin-client hosts run storage management applications on top of the SymAPI client or thin SymAPI client libraries)
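A minimal sketch of that split, with a callable standing in for the network transport; the operation name is made up for illustration and is not a real SymAPI call:

    import json

    def symapi_server(handlers):
        """Runs on the host with a channel to the Symmetrix; dispatches requests."""
        def handle(raw_request):
            req = json.loads(raw_request)
            result = handlers[req["op"]](*req.get("args", []))
            return json.dumps({"result": result})
        return handle

    def thin_client_call(transport, op, *args):
        """Thin client: no local libraries, every call is shipped to the server."""
        reply = transport(json.dumps({"op": op, "args": list(args)}))
        return json.loads(reply)["result"]

    # Example wiring with a hypothetical operation:
    server = symapi_server({"list_devices": lambda: ["dev-001", "dev-002"]})
    print(thin_client_call(server, "list_devices"))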
SymmAPI Components Initialization Discover and Update Configuration Gatekeepers TimeFinder Functions Device Groups DeltaMark Functions SRDF Functions Statistics Mapping Functions Base Controls Calypso Controls Optimizer Controls InfoSharing
Data Object Resolve Mapping chain: RDBMS, data file, file system, logical volume, host physical device, Symmetrix device extents
File System Mapping File System mapping information includes: File system attributes and host physical location. Directory attributes and contents. File attributes and host physical extent information, including inode information and fragment size. (Diagram: i-nodes, directories, file extents)
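A sketch of the shape of that mapping data, from a file down to Symmetrix device extents; the field names are illustrative, not the SymAPI mapping records:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SymmetrixExtent:
        symm_device: str        # Symmetrix logical device
        start_block: int
        block_count: int

    @dataclass
    class HostExtent:
        host_device: str        # host physical device path
        start_block: int
        block_count: int
        backing: List[SymmetrixExtent] = field(default_factory=list)

    @dataclass
    class FileMapping:
        path: str
        inode: int
        fragment_size: int
        extents: List[HostExtent] = field(default_factory=list)

    def resolve(mapping: FileMapping) -> List[SymmetrixExtent]:
        """Walk file -> host physical extents -> Symmetrix device extents."""
        return [s for h in mapping.extents for s in h.backing]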
Data Center Hosts
Solaris & Sun Starfire Hardware Up to 62 IO Channels 64 CPU’s 64 GB of RAM 60 TB of disk Supports multiple domains Starfire & Symmetrix ~20% use more than 32 IO channels Most use 4 to 8 IO channels per domain Oracle instance usually above 1 TB
HPUX & HP 9000 Superdome Hardware 192 IO Channels 64 CPUs 128 GB RAM 1 PB of storage Superdome and Symm 16 LUNs per target Want us to support more than 4000 logical volumes!
Solaris and Fujitsu GP7000F M1000 Hardware 6-48 I/O slots 4-32 CPU’s Cross-Bar Switch 32 GB RAM 64-bit PCI bus Up to 70TB of storage
Solaris and Fujitsu GP7000F M2000 Hardware I/O slots CPU’s Cross-Bar Switch 256 GB RAM 64-bit PCI bus Up to 70TB of storage
AIX 5L & IBM RS/6000 SP Hardware Scale to 512 Nodes (over 8000 CPUs) 32 TB RAM 473 TB Internal Storage Capacity High Speed Interconnect 1GB/sec per channel with SP Switch2 Partitioned Workloads Thousands of IO Channels
IBM RS/6000 pSeries 680 AIX 5L Hardware 24 CPUs (64-bit RS64 IV, 600 MHz) 96 GB RAM GB Internal Storage Capacity 53 PCI slots (33 32-bit / 20 64-bit)
Really Big Data IBM (Sequent) NUMA 16 NUMA “Quads” 4-way / 450 MHz CPUs 2 GB Memory 4 x 100 MB/s FC-SW Oracle with up to 42 TB (mirrored) DB EMC Symmetrix 20 Small Symm 4’s 2 Medium Symm 4’s
Windows 2000 on IA32 Usually lots of small (1u or 2u) boxes share a Symmetrix 4 to 8 IO channels per box Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less) Management is a challenge Will Windows 2000 on IA64 handle big data better?
Linux Data Center Wish List
Lots of Devices Customers can use hundreds of targets and LUN’s (logical volumes) 128 SCSI devices per system is too few Better naming system to track lots of disks Persistence for “not ready” devices in the name space would help some of our features devfs solves some of this Rational naming scheme Potential for tons of disk devices (need SCSI driver work as well)
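A sketch of the kind of persistent, rational naming the wish list asks for: bind a stable name to a device identity (for example a WWN or serial number) rather than to probe order. The names and identities here are made up.

    class PersistentNamer:
        def __init__(self):
            self.by_identity = {}        # device identity -> stable name
            self.next_index = 0

        def name_for(self, identity):
            # In a real implementation this table would be persisted on disk,
            # so a "not ready" device keeps its slot across reboots.
            if identity not in self.by_identity:
                self.by_identity[identity] = f"vol{self.next_index:04d}"
                self.next_index += 1
            return self.by_identity[identity]

    namer = PersistentNamer()
    print(namer.name_for("wwn-0001"))    # hypothetical identities
    print(namer.name_for("wwn-0002"))
    print(namer.name_for("wwn-0001"))    # same identity, same name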
Support for Dynamic Data What happens when the LV changes under a running file system? Adding new logical volumes? Happens with TimeFinder, RDF, Fastrax Requires remounting, reloading drivers, rebooting? API’s can be used to give a “heads up” before events Must be able to invalidate data, name and attribute caches for individual files or logical volumes Support for dynamically loaded, layered drivers Dynamic allocation of devices Especially important for LUN’s Add & remove devices as the fibre channel fabric changes
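A sketch of the “heads up” hook described above: storage software announces that a volume is about to change, and registered listeners invalidate the affected caches instead of forcing a remount or reboot. All names are illustrative.

    class VolumeChangeNotifier:
        def __init__(self):
            self.listeners = {}                 # volume -> list of callbacks

        def register(self, volume, callback):
            self.listeners.setdefault(volume, []).append(callback)

        def about_to_change(self, volume):
            for cb in self.listeners.get(volume, []):
                cb(volume)                      # fire before the swap happens

    def invalidate_caches(volume):
        print(f"invalidating data, name and attribute caches for {volume}")

    notifier = VolumeChangeNotifier()
    notifier.register("volume-17", invalidate_caches)   # hypothetical volume id
    notifier.about_to_change("volume-17")                # e.g. before a BCV swap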
Keep it Open Open source is good for us We can fix it or support it if you don’t want to No need to reverse engineer some closed source FS/LVM Leverage storage API’s Add hooks to Linux file systems, LVM’s, sys admin tools Make Linux manageable Good management tools are crucial in large data centers
New Technology Opportunities Linux can explore new technologies faster than most iSCSI SCSI over TCP for remote data copy? SCSI over TCP for host storage connection? High speed/zero-copy TCP is important to storage here! Infiniband Initially targeted at PCI replacement High speed, high performance cluster infrastructure for file systems, LVM’s, etc Multiple gigabits/sec (2.5 Gb/sec up to 30 Gb/sec) Support for IB as a storage connection? Cluster file systems
Linux at EMC Full support for Linux in SymAPI, RDF, TimeFinder, etc Working with partners in the application space and the OS space to support Linux Oracle Open World Demo of Oracle on Linux with over 20 Symms (could reach 1PB of storage!) EMC Symmetrix Enterprise Storage EMC Connectrix Enterprise Fiber Channel Switch Centralized Monitoring and Management
MOSIX and Linux Cluster File Systems
Our Problem: Code Builds Over 70 OS developers Each developer builds 15 variations of the OS Each variation compiles over a million lines of code Full build uses gigabytes of space, with 100k temporary files User sandboxes stored in home directory over NFS Full build took around 2 hours 2 users could build at once
Our Original Environment Software GNU tool chain CVS for source control Platform Computing’s Load Sharing Facility Solaris on build nodes Hardware EMC NFS server (Celerra) with EMC Symmetrix back end 26 SUN Ultra-2 (dual 300 MHz CPU) boxes FDDI ring used for interconnect
EMC's LSF Cluster
LSF Architecture Distributed process scheduling and remote execution No kernel modifications Prefers to use static placement for load balancing Applications need to link against a special library License server controls cluster access Master node in cluster Manages load information Makes scheduling decisions for all nodes Uses a modified GNU Make (lsmake)
MOSIX Architecture Provide transparent, dynamic migration Processes can migrate at any time No user intervention required Process thinks it is still running on its creation node Dynamic load balancing Use decentralized algorithm to continually level load in the cluster Based on number of CPU's, speed of CPU's, RAM, etc Worked great for distributed builds in 1989
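A toy version of the decentralized load-leveling idea (a sketch of the concept, not the MOSIX algorithm): each node compares its load, normalized by CPU resources, against a small random sample of peers and migrates work toward the least-loaded one.

    import random

    class Node:
        def __init__(self, name, cpus, cpu_speed):
            self.name, self.cpus, self.cpu_speed = name, cpus, cpu_speed
            self.processes = 0

        def load(self):
            return self.processes / (self.cpus * self.cpu_speed)

    def balance_step(nodes, sample_size=2):
        sample_size = min(sample_size, len(nodes) - 1)
        for node in nodes:
            peers = random.sample([n for n in nodes if n is not node], sample_size)
            target = min(peers, key=lambda n: n.load())
            if node.processes and target.load() < node.load():
                node.processes -= 1      # migrate one process toward the lighter node
                target.processes += 1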
MOSIX Mechanism Each process has a unique home node UHN is the node of the process’s creation Process appears to be running at its UHN Invisible to others on its new node after migration UHN runs a deputy Encapsulates system state for the migrated process Acts as a proxy for some location-sensitive system calls after migration Significant performance hit for I/O over NFS, for example
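A sketch of the deputy/proxy idea in miniature (illustrative only): location-sensitive calls from the migrated process make a round trip back to the home node, which is exactly where the I/O-over-NFS penalty mentioned above comes from.

    class Deputy:
        """Lives on the home node; performs location-sensitive calls on behalf
        of the migrated process, so it still appears to run at its UHN."""
        def __init__(self, home_node, pid):
            self.home_node, self.pid = home_node, pid

        def do_call(self, name, *args):
            return f"{name}{args} executed on {self.home_node} for pid {self.pid}"

    class MigratedProcess:
        def __init__(self, deputy, current_node):
            self.deputy, self.current_node = deputy, current_node

        def call(self, name, *args, location_sensitive=False):
            if location_sensitive:
                return self.deputy.do_call(name, *args)   # round trip to home node
            return f"{name}{args} executed locally on {self.current_node}"

    p = MigratedProcess(Deputy("home-node-3", 4242), "compute-node-17")
    print(p.call("compute_step"))                                  # stays local
    print(p.call("read_file", "/home/user/src", location_sensitive=True))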
MOSIX Migration (Diagram: the deputy stays on the home node alongside a local process; the remote, migrated part of the process runs at user level on the other node; the two kernels communicate over the link layer, with file access over NFS)
MOSIX Enhancements MOSIX added static placement and remote execution Leverage the load balancing infrastructure for placement decisions Avoid creation of deputies Lock remotely spawned processes down just in case Fix several NFS caching related bugs Modify some of our makefile rules
MOSIX Remote Execution (Diagram: same two-node layout for remote execution: the remotely spawned process runs on the other node, with the kernels linked at the link layer and file access over NFS)
EMC MOSIX Cluster EMC’s original MOSIX cluster Compute nodes changed from LSF to MOSIX Network changed from FDDI to 100 megabit ethernet The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems Performance was great, but we could do better!
Latest Hardware Changes Network upgrades New switch deployed Nodes to switch use 100 megabit ethernet Switch to NFS server uses gigabit ethernet NFS upgrades 50 gigabyte, striped file systems per user (compared to 9 gigabyte non-striped file systems) Fast/wide differential SCSI between server and storage Cluster upgrades Added 28 more compute nodes Added 4 “submittal” nodes
EMC MOSIX Cluster (Diagram: compute nodes connected through the switch, gigabit Ethernet from the switch to the NFS server, SCSI from the server to the storage)
Performance Running Red Hat 6.0 with kernel (MOSIX and NFS patches applied) Builds are now around minutes (down from hours) Over 35 concurrent builds at once
Build Submissions
Cluster File System & MOSIX (Diagram: cluster nodes attached to the storage over Fibre Channel through a Connectrix switch)
DFSA Overview DFSA provides the structure to allow migrated processes to always do local IO MFS (MOSIX File System) created No caching per node, write through Serverless - all nodes can export/import files Prototype for DFSA testing Works like non-caching NFS
DFSA Requirements One active inode/buffer in the cluster for each file Time-stamps are cluster-wide, increasing Some new FS operations Identify: encapsulate dentry info Compare: are two files the same? Create: produce a new file from SB/ID info Some new inode operations Checkpath: verify path to file is unique Dotdot: give true parent directory
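The required operations, written out as an abstract interface to make their shape concrete; the method names follow the slide, but the signatures are guesses, not the actual DFSA kernel interface.

    from abc import ABC, abstractmethod

    class DFSACapableFS(ABC):
        # New file-system operations
        @abstractmethod
        def identify(self, dentry):
            """Encapsulate enough dentry info to re-find this file from any node."""

        @abstractmethod
        def compare(self, file_a, file_b):
            """Return True if the two handles refer to the same file."""

        @abstractmethod
        def create(self, superblock_info, identity):
            """Produce a new file object from superblock/identity info."""

        # New inode operations
        @abstractmethod
        def checkpath(self, dentry):
            """Verify that the path to the file is unique."""

        @abstractmethod
        def dotdot(self, dentry):
            """Return the true parent directory."""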
Information MOSIX: GFS: Migration Information Process Migration, Milojicic et al., to appear in ACM Computing Surveys. Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.