IBM ProtecTIER Deduplication Solutions
Note to Presenter: Present to customers and prospects to provide them with an overview of IBM’s ProtecTIER Deduplication Solutions.
got data? Too much. And not enough ( blank ) to store it all?
Time, Money, People, Floor Space, Electricity, Air Conditioning
Disk is not always the best answer for data protection issues and requirements. IBM has a broad portfolio of data protection solutions, giving us the freedom to solve customer issues with the most effective technology. Other vendors, like EMC Data Domain, answer every problem with “Use more disk” because disk is the only thing they have to sell!
We Need to do More with Less, and we need to do it smarter
The tidal wave of data continues …
The amount of digital information continues to grow exponentially
And we need to keep more of it, longer
And the costs of losing data are increasingly unacceptable: lost revenues; lost customer confidence; embarrassment in the market; fines from contracts, government agencies; CEO and CFO could go to jail
But budgets are not increasing
[Chart, 2005 to 2010: Data created and copied is expected to grow at 48% CAGR through 2010. Source: Various external consultant reports]
Survey - what are your two biggest storage pain points?
* TheInfoPro Storage Study: F1000 Sample. n=149. Other n=14. *Multiple responses recorded
The pressures on backup administrators are growing
Growth: More new data coming
Backup: Backup takes longer
Manage: Can’t buy more storage
Recover: Recovery takes longer
Using the right balance of high density tape and high performance disk will help . . .
Long Term Retention: Cost effective capacity; Removable & transportable
Compliance: Meet financial & regulatory requirements; Data encryption, WORM
Short Term Retention: Use disk for daily backup & restore operations
Performance: Fast backups; Even faster restores; Meet “backup windows”
And data deduplication is the key to using more disk more cost effectively!
Deduplication technologies have evolved considerably over the last few years and have delivered on the promise of improving backup and recovery operations while consuming less storage, using infrastructure such as network bandwidth more efficiently, and improving SLAs. The benefits of deduplication deliver a very compelling ROI, reduce TCO, and provide opportunities to save money and accomplish backup and recovery goals more efficiently. And unlike some other technologies, difficult economic times actually magnify the value proposition of deduplication. Now, when budgets and spending are tight, is the time for deduplication, a technology that can actually save you money.
ProtecTIER Overview
Protect More. Store Less.® ProtecTIER reduces the required backup disk capacity by up to 25 times!
IBM ProtecTIER Deduplication Innovation and Leadership
Timeline, 2003 to 2011:
6 PhDs begin researching massively scalable deduplication algorithms
First Deduplication Virtual Tape Library deployed into production
First single node system to store over 1PB of deduplicated data
Fastest single node inline deduplication solution
First to deliver Many-to-Many replication
The only “true” enterprise-class deduplication solution on the market today
IBM acquires Diligent
First Deduplication solution for System z
Fastest restore speed – up to 2800 MB/sec!
First non-hash deduplication algorithm developed, designed for 100% data integrity
First to deliver VTL solutions for both Open and Mainframe environments
First true clustered system with Global Deduplication
IBM’s first midrange solution released
Installed in all major industries
Over 1,400 ProtecTIER systems sold to date
Production systems range in size from 5TB to over 700TB
Over 90 PB of physical disk capacity behind ProtecTIER servers in production protecting thousands of PBs of backup data
How ProtecTIER works
Backup with Inline deduplication
[Diagram labels: New Data Stream, Repository, HyperFactor™, Memory Resident Index, ProtecTIER™ Server]
The major difference between ProtecTIER and other standard VTLs is ProtecTIER’s unique patent pending factoring algorithm called HyperFactor. HyperFactor is the data de-duplication technology developed by Diligent Technologies that allows you to store a lot more backup data onto a smaller amount of disk. Here’s how HyperFactor works. As we build the slide, you’ll see that there are unique blocks of data in the repository. HyperFactor keeps track of this data with a memory resident index, like a table of contents, that can map the contents of a 1PB repository in only 4GB of memory. That 250,000:1 ratio between the repository and the index is a significant differentiator and gives orders of magnitude greater granularity than anything in the marketplace. And as we build the slide, you’ll see a stream of data coming from one of the backup servers. This stream contains some data that already exists (as represented by the multi colored icons) and some data that’s new (as represented by the tan icons). HyperFactor is an inline data de-duplicator. As the data stream passes through the HyperFactor data de-duplication engine, HyperFactor looks for data that is “similar” to data it has stored before. (PAUSE) I use the word “similar” rather than “identical” because part of the algorithm’s power is that it looks for similarity instead of exact matches to achieve unmatched performance. The most similar pattern in the repository is found with no I/O operations, and then that data is brought to the server to do a computational compare and store the delta. This is performed without impacting the search time, regardless of the repository size. Because no I/O is required (we’re actually performing a memory search on an index), the search time difference will not be noticeable whether the repository is 10TB or a petabyte.
The location and similarity of the data aren’t affected by naming conventions, shifts or offsets in position, because we are looking at the byte level of the data. Customers often ask “What happens if the index disappears?” Remember the index is used to locate similarity in the repository, and in fact it’s not used in the restore process at all. If the backup application’s data stream needs to be restored, the data that exists in the repository is self describing, which means that a restore can be done without the index since the data itself tells us what is required to restore the stream. As we said, the index is important in finding similarity. Not only is it in the server memory, it is also duplicated in two places on RAID protected disk and kept synchronized.
[Diagram labels: Backup Servers, “Filtered” data]
Up to 1400MB/sec per server or 2000MB/sec with 2 node cluster!
Only 4GB needed to map 1PB of physical disk!
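The "find something similar, compare, store only the delta" flow in these notes can be illustrated with a toy sketch. This is not HyperFactor, whose internals are not described in this deck: Python's standard `difflib.SequenceMatcher` stands in for the byte-level compare, and `make_delta`, `apply_delta` and `literal_bytes` are hypothetical names. The sketch assumes the most similar stored block has already been located, which is the job the memory resident index performs.

```python
from difflib import SequenceMatcher
from typing import List, Tuple, Union

# A delta is a list of ("copy", start, end) ranges into the base block
# and ("literal", data) runs that carry the genuinely new bytes.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def make_delta(base: bytes, new: bytes) -> List[Op]:
    """Encode `new` as references into `base` plus the bytes that differ."""
    ops: List[Op] = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # already stored: reference it
        elif j2 > j1:
            ops.append(("literal", new[j1:j2]))   # replaced or inserted bytes
        # "delete" opcodes need no output: those base bytes are simply skipped
    return ops


def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    """Rebuild the original stream from the base block and the delta alone."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops: List[Op]) -> int:
    """Count the new bytes that actually have to be stored."""
    return sum(len(op[1]) for op in ops if op[0] == "literal")


if __name__ == "__main__":
    base = b"customer=1001;name=Alice;balance=250;status=active"
    new = b"customer=1001;name=Alice;balance=275;status=active"
    delta = make_delta(base, new)
    print(len(new), "bytes received,", literal_bytes(delta), "new bytes stored")
    print("restored correctly:", apply_delta(base, delta) == new)
```

In this toy, the restore path needs only the stored base block and the delta, which mirrors the point above that the index helps find similarity but is not required to restore.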
ProtecTIER Deduplication Operation and Results Example
Backup application writes data to ProtecTIER as it would to tape
Only unique data is stored, existing duplicate data is referenced
When data objects expire, references are removed and free space is reclaimed and reused
Backup Event             Amount Received   Amount Stored   Dedupe Ratio
First Full Backup        1 TB              250 GB          4:1
Incremental Backup       100 GB            10 GB           4.2:1
Incremental Backup       100 GB            10 GB           4.4:1
Second Full Backup       1 TB              10 GB           7.8:1
Incremental Backup       100 GB            10 GB           8:1
Third Full Backup        1 TB              10 GB           11:1
After two months . . .   7.8 TB            350 GB          22:1
Backups, especially full backups, are notoriously inefficient processes that repeatedly send large amounts of mostly redundant data. Deduplication can eliminate redundant data and dramatically reduce the amount of storage needed for backups. An important point to note about this slide is that the effect of deduplication is not instant but grows over time. The more data that is stored and the longer it is retained, the greater the deduplication ratio (the total amount received divided by the total amount actually stored) becomes and the more capacity is saved.
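The ratios in the example are cumulative: each one is the running total received divided by the running total stored. The short sketch below reproduces that arithmetic from the slide's own figures. It assumes decimal units (1 TB = 1000 GB), which is what makes the numbers line up, and the function name is hypothetical. The slide's values appear to be rounded or truncated to one decimal place.

```python
GB_PER_TB = 1000  # assumption: decimal units, which reproduce the slide's ratios

# (event, GB received, GB stored), copied from the example on the slide
EVENTS = [
    ("First Full Backup", 1 * GB_PER_TB, 250),
    ("Incremental Backup", 100, 10),
    ("Incremental Backup", 100, 10),
    ("Second Full Backup", 1 * GB_PER_TB, 10),
    ("Incremental Backup", 100, 10),
    ("Third Full Backup", 1 * GB_PER_TB, 10),
]


def cumulative_ratios(events):
    """Return the running received/stored ratio after each backup event."""
    total_received = 0
    total_stored = 0
    ratios = []
    for _name, received_gb, stored_gb in events:
        total_received += received_gb
        total_stored += stored_gb
        ratios.append(total_received / total_stored)
    return ratios


if __name__ == "__main__":
    for (name, _r, _s), ratio in zip(EVENTS, cumulative_ratios(EVENTS)):
        print(f"{name:<20} {ratio:.2f}:1")
    # The "after two months" row is a summary, so it is checked directly.
    print(f"{'After two months':<20} {7.8 * GB_PER_TB / 350:.2f}:1")
```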
Storage Impact from ProtecTIER Deduplication
Store up to 25 times more backup data on a given physical storage capacity
[Diagram labels: Master Server, Backup Server, ProtecTIER Server, Represented capacity, Physical capacity]
Significantly Reduces Replication Bandwidth
Deduplication enables large amounts of data to be replicated with significantly less bandwidth
Virtual cartridges can be cloned to tape at DR site
[Diagram: Primary Site (Backup Server, ProtecTIER Gateway, Represented capacity, Physical capacity) connected by an IP-based WAN link to Secondary Site (Backup Server, ProtecTIER Gateway, Physical capacity, Tape library)]
By dramatically reducing the amount of storage needed to hold backup data, a number of powerful data protection strategies can then be leveraged. For example, replication can be used to automatically and electronically move the deduplicated data to a secondary site for disaster recovery purposes, eliminating the need to physically transport tapes.
ProtecTIER Many-to-One Replication Overview
Up to 12 Branch Offices (spokes): Gateways and/or Appliances
1 target (hub): Appliance, Gateway, single or two-node cluster
IP based NR links
Virtual cartridges can be cloned to tape by the Main-Site B/U server
[Diagram labels: Backup Server, ProtecTIER Gateway, Physical capacity, Central / DR Site, Tape library]
This slide shows the power of deduplication to protect data stored at a large number of smaller remote sites and replicate that data back to a central location for additional protection and Disaster Recovery purposes. The amount of bandwidth needed is very small because only new unique data needs to be transmitted over the line, since existing data is already stored at the hub.
ProtecTIER Many-to-Many Native Replication Grid
Up to 4 hubs in a grid
Supports any combination of Gateways, Appliances, single or two-node clusters
[Diagram labels: Site A, Site B, Site C, Site D, Backup Server, ProtecTIER Gateway, Physical capacity]
ProtecTIER Support for Symantec OpenStorage (OST)
OST API separates the backup logic from the storage appliance logic and implementation
[Diagram labels: NetBackup Policy and Control, NetBackup Server, OpenStorage API, ProtecTIER OST Plugin, ProtecTIER Server]
IBM ProtecTIER: Backup storage appliance with Deduplication and Native Replication
IBM Confidential
IBM ProtecTIER® Deduplication Family
Scalable Capacity and Performance
TS7650G & TS7680 ProtecTIER Gateways: Highest Performance, Largest Capacity, High Availability. Backup: Up to 2000 MB/sec. Restore: Up to 2800 MB/sec. Up to 1 PB Useable Capacity
TS7650 ProtecTIER Appliances: Better Performance, Larger Capacity, Scalable. Up to 500 MB/sec. 7 TB to 36 TB Useable Capacity
TS7610 ProtecTIER Appliance Express: Good Performance, Entry Level, Easy to Install. Up to 100 MB/sec. 4 TB and 5.4 TB Useable Capacity
The IBM ProtecTIER product family ranges from the midrange TS7610 ProtecTIER Appliance Express system to the enterprise-class TS7650 ProtecTIER Cluster. All of these solutions run the same unique and patented ProtecTIER deduplication software, and all of the features and differentiation discussed in these slides are available to all ProtecTIER solutions. This is different from many of our competitors, who offer certain features and capabilities only for specific backup applications and whose performance claims are also tied to specific (often unrealistic) configurations.
ProtecTIER Differentiation
ProtecTIER Advantage: Data Integrity
Unique and patented HyperFactor® deduplication technology
The only production proven deduplication solution not based on a hash algorithm
Designed for 100% data integrity
Bit for bit comparison of data to ensure data is a duplicate
Can NEVER lose data due to a hash collision
Although the chance of losing data from a hash collision is low, it is NOT ZERO as it is with a ProtecTIER solution
A major differentiator for ProtecTIER versus hash-based deduplication systems is IBM’s unique and patented HyperFactor deduplication algorithm. HyperFactor was designed to provide 100% data integrity by doing a bit for bit comparison of data before declaring it a duplicate, rather than relying on a hash algorithm. Unlike most other products on the market, HyperFactor can never lose data due to a hash collision. Vendors of hash-based deduplication products will claim the odds of losing data due to a hash collision are the same as the odds of winning the lottery. That analogy is true as long as you admit that with every backup job you get thousands and thousands of lottery tickets. IT organizations have hit “the data loss lottery” in the past and will hit it again in the future. A hash collision will not only result in the loss of some data, it could cause corruption that results in the loss of all your backup data. Are you willing to take that risk? Some vendors claim to have “data invulnerability” mechanisms to avoid hash collisions. However, performing these optional extra checks significantly slows down system performance, so they are usually disabled by the vendor and/or the customer to maximize speed. ProtecTIER delivers both high performance and data integrity. You don’t have to choose one or the other.
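The difference between trusting a fingerprint and verifying the bytes can be shown with a small sketch. It is an illustration of the general principle only, not of HyperFactor (which the slide says is not hash-based) nor of any vendor's product. The 8-bit `weak_fingerprint` is deliberately poor so that a collision is easy to trigger, and both store classes are hypothetical.

```python
from typing import Dict, List, Tuple


def weak_fingerprint(chunk: bytes) -> int:
    """Deliberately weak 8-bit fingerprint so collisions are easy to produce."""
    return sum(chunk) % 256


class HashOnlyStore:
    """Treats any chunk with a known fingerprint as a duplicate."""

    def __init__(self) -> None:
        self._chunks: Dict[int, bytes] = {}

    def put(self, chunk: bytes) -> int:
        key = weak_fingerprint(chunk)
        self._chunks.setdefault(key, chunk)  # colliding chunk is silently dropped
        return key

    def get(self, key: int) -> bytes:
        return self._chunks[key]


class VerifyingStore:
    """Declares a duplicate only after a byte-for-byte comparison."""

    def __init__(self) -> None:
        self._buckets: Dict[int, List[bytes]] = {}

    def put(self, chunk: bytes) -> Tuple[int, int]:
        key = weak_fingerprint(chunk)
        bucket = self._buckets.setdefault(key, [])
        for slot, existing in enumerate(bucket):
            if existing == chunk:            # full comparison of the data
                return key, slot
        bucket.append(chunk)                 # same fingerprint, different data
        return key, len(bucket) - 1

    def get(self, ref: Tuple[int, int]) -> bytes:
        key, slot = ref
        return self._buckets[key][slot]


if __name__ == "__main__":
    first, second = b"ab", b"ba"             # different data, same fingerprint
    unsafe, safe = HashOnlyStore(), VerifyingStore()
    unsafe.put(first)
    safe.put(first)
    print("hash only restores:", unsafe.get(unsafe.put(second)))
    print("verified restores: ", safe.get(safe.put(second)))
```

Real hash-based products use far stronger hashes than this toy fingerprint, so collisions are rare rather than routine; the sketch only shows why a comparison of the actual data removes that failure mode entirely.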
ProtecTIER Advantage: Restore Performance
Restoring data from a ProtecTIER solution is even FASTER than backing up
ProtecTIER can easily restore at 2800MB/sec!
High restore performance not limited to certain backup applications or specific data sets like other vendors
High restore performance achieved on real data with realistic 20% change rate in production environments
Never requires agents on backup servers
Other vendors’ “CPU-centric” architectures are optimized for processing hashes, not moving data
The performance numbers stated by IBM for its ProtecTIER solutions are always realistic numbers that are achievable in real production environments. This is quite unlike our competitors, who claim exaggerated performance numbers that cannot be achieved in real production environments. Most even acknowledge this by hiding in the fine print a disclaimer like “these are maximum benchmark speeds that should not be used for configuration purposes”.
ProtecTIER Advantage: Scalability
A single ProtecTIER system can support up to 1 Petabyte of useable capacity
ProtecTIER supports the use of any IBM storage system (DS8000, DS5000, XIV, etc.) and most third party storage systems for the repository
IBM has hundreds of ProtecTIER systems with over 100TBs of useable capacity in production environments throughout the world
IBM always states “Useable Capacity” and never uses the deceptive “RAW capacity” terms like other vendors
The hidden costs associated with managing, maintaining, powering and cooling multiple appliances are significant and should not be ignored!
The capacity stated by IBM for its ProtecTIER solutions is always “Useable Capacity”. This is the real capacity that can be used by the system to store data. For example, if you have 10TB of useable capacity and achieve a deduplication ratio of 20:1, you can store 200TB of backup data. Most of our competitors use “RAW capacity” to make their systems seem bigger than they really are. RAW capacity is all of the disk drives summed up within a system in its maximum configuration. It does not take into consideration the effects of RAID, spares, and capacity needed for the application, metadata and other requirements that take away from the actual useable capacity of the system. Some vendors use RAW capacity to deceive you and they should not be trusted. In addition, these vendors often make big claims about “Logical capacity”, which is the usable capacity multiplied by a deduplication ratio of 50 or more. While achieving a dedupe ratio of 50 is possible, it is not common. Vendors that use these cheap tricks to deceive you should not be trusted!
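The capacity terms above are related by simple arithmetic, sketched here. The 10 TB at 20:1 case is the slide's own example; the function names and the 25% RAID/spares/metadata overhead figure are hypothetical values chosen only to illustrate the gap between raw and useable capacity, not numbers for any real system.

```python
def represented_capacity_tb(usable_tb: float, dedupe_ratio: float) -> float:
    """Backup data that fits on `usable_tb` at a given deduplication ratio."""
    return usable_tb * dedupe_ratio


def usable_from_raw_tb(raw_tb: float, overhead_fraction: float) -> float:
    """Useable capacity left after RAID, spares, metadata and similar overhead.

    `overhead_fraction` is an illustrative assumption, not a vendor figure.
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return raw_tb * (1 - overhead_fraction)


if __name__ == "__main__":
    # The slide's example: 10 TB useable at 20:1.
    print(represented_capacity_tb(10, 20), "TB of backup data")

    # Hypothetical: 100 TB raw with 25% overhead, quoted at 50:1 versus 20:1.
    usable = usable_from_raw_tb(100, 0.25)
    print(usable, "TB useable from 100 TB raw")
    print(represented_capacity_tb(usable, 50), "TB 'logical' at 50:1")
    print(represented_capacity_tb(usable, 20), "TB represented at 20:1")
```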
ProtecTIER Advantage: Global Deduplication
ProtecTIER Cluster with true Global Deduplication has been Generally Available and in production since 2008
Supported with all major backup applications and available for all Open Systems, System z and System i platforms
No agents or backup server upgrades required
Other vendors’ Global Deduplication capabilities are immature and incomplete, with very few if any systems in production
Other vendors’ Global Dedupe is restricted to certain models, works only with NetBackup OST and requires agents to be installed
Many vendors claim to have Global Deduplication but create multiple separate repositories that may contain redundant data!
ProtecTIER Advantage: Inline Deduplication
Example: Disk activity needed to ingest and deduplicate 10 TBs of backup data
Post Process Approach (Hash-based Post Process): Deduplicate after Storing. 10 TB Data: Write 10 TB, 2x Read 10 TB. Requires: > storage, > I/Os, > Time, > Effort, > Admin
ProtecTIER Inline Approach (HyperFactor): Deduplicate before Storing. 10 TB Data: 1x Read or Write 10 TB. Results: simple, faster, easier, cheaper, efficient
For every 10 Terabytes of data (for example) sent by the backup application server, a DeltaStor server must write 10 Terabytes of data to disk, read 10 Terabytes of data from disk, and then read another 10 Terabytes of data from disk. It must also read and write from its index database and write some amount of pointers. All this reading and writing will significantly hammer the SATA drives. ProtecTIER, on the other hand, will only read or write 10TBs of data for every 10 Terabytes of data sent by the backup application server. That is about one third of the disk activity (10 TB versus at least 30 TB), and it takes place on more reliable FC disks.
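The disk traffic comparison on this slide is a short calculation. The sketch below uses only the multipliers given in the notes (one write plus two reads for post-process, one pass for inline) and, as a simplifying assumption, ignores the index and pointer I/O the notes also mention. The function names are hypothetical.

```python
def post_process_disk_traffic_tb(ingest_tb: float) -> float:
    """Write the data once, then read it twice to deduplicate (per the notes)."""
    writes = 1 * ingest_tb
    reads = 2 * ingest_tb
    return writes + reads


def inline_disk_traffic_tb(ingest_tb: float) -> float:
    """Inline approach: a single read-or-write pass over the ingested data."""
    return 1 * ingest_tb


if __name__ == "__main__":
    ingest = 10  # TB sent by the backup application server
    post = post_process_disk_traffic_tb(ingest)
    inline = inline_disk_traffic_tb(ingest)
    print(f"post-process: {post} TB of disk traffic")
    print(f"inline:       {inline} TB of disk traffic")
    print(f"inline is {inline / post:.0%} of post-process traffic")
```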
ProtecTIER Advantage: Inline Deduplication
[Diagram: Inline Processing timeline (8:00 PM, 2:00 AM, 8:00 AM, 8:00 PM): Backup Server, ProtecTIER VT, Dedupe, Tape Library, Truck. SLA is Met]
[Diagram: Post Processing timeline (8:00 PM, 2:00 AM, 8:00 AM, 8:00 PM): Backup Server, VTL, Dedupe, Dedupe Overlap, Tape Library, Truck]
With an IBM ProtecTIER Solution you can . . .
Store up to 25 times more data on disk: Up to 25:1 reduction with 100% data integrity
Reduce backup and restore times: Fast inline deduplication up to 2000 MB/sec; Even faster restores up to 2800 MB/sec
Improve the reliability of backup operations: Eliminates mechanical & handling failures
Drive the cost of disk based backup down: Reduces energy, cooling, and space required
Increase data retention: Store more backup data on disk for a longer time with very little additional cost
For More Information on IBM’s ProtecTIER
IBM Customers
The main ProtecTIER Web Page
www.ibm.com/systems/storage/tape/protectier
IBM and Business Partners
Visit the IBM ProtecTIER Sales Kit on PartnerWorld
https://www-304.ibm.com/jct09002c/partnerworld/wps/servlet/mem/ContentHandler/ProtecTIER%20SalesKit/lc=en_US
Visit the IBM ProtecTIER Sales Kit on W3
http://w3-03.ibm.com/sales/support/ShowDoc.wss?docid=C469520B08856D52&infotype=SK&infosubtype=S0&node=doctype,S0|doctype,SKT|brands,B5000|clientset,IA|geography,AMR|industries,&appname=CC_CFSS
Trademarks and Disclaimers
© IBM Corporation 1994-2011. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. Information is provided "AS IS" without warranty of any kind. The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages.
IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Photographs shown may be engineering prototypes. Changes may be incorporated in production models.