RCF Status: Extended outage of the Mass Storage System (HPSS)


RCF Status
Extended outage of the Mass Storage System (HPSS) last Wednesday
– The latest transaction logs of the namespace DB were erroneously deleted in the production system by an administrator
– The last backup was taken ~1.5 hours before the mishap
– Investigated two possible routes to fix the problem
  1. Fix the database without restoring it from backup
  2. Restore the DB as far as possible from backup
  • Option 1 would have been too time-consuming (it would have required getting IBM involved, waiting for their response and checking DB consistency)
  • Option 2 was chosen because it restores the service much faster, in anticipation of continuing data taking (which had already resumed at that time). Information was extracted from the HPSS cache to identify the files that needed to be re-transferred
  • In summary
    • Length of service outage: ~13 hours
    • Data loss
      • PHENIX: no files were lost (files are kept in the buffer at the Counting House (CH) until safe on tape)
      • STAR: 118 files (files are deleted from the buffer once stored in the HPSS cache)
– Fortunately all this happened during an APEX day …
– Measures have been implemented to avoid this mistake happening again
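The re-transfer step can be illustrated with a minimal sketch. It assumes a hypothetical listing of cache entries with write timestamps and a known time of the last namespace backup; `CacheEntry`, `files_to_retransfer`, the paths and the timestamps are all invented for illustration and do not come from HPSS tooling or from the actual procedure used at RCF.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CacheEntry:
    """Hypothetical record of one file seen in the disk cache."""
    path: str
    written_at: datetime


def files_to_retransfer(entries, last_backup):
    """Return the paths written after the last namespace backup.

    Assumption: files written before the backup are known to the
    restored namespace database, while anything written later is
    missing from it and has to be sent again.
    """
    return sorted(e.path for e in entries if e.written_at > last_backup)


if __name__ == "__main__":
    # Illustrative timestamps only; not the real incident times.
    backup = datetime(2000, 1, 1, 12, 0)
    listing = [
        CacheEntry("/raw/run_a.dat", datetime(2000, 1, 1, 11, 30)),
        CacheEntry("/raw/run_b.dat", datetime(2000, 1, 1, 12, 45)),
        CacheEntry("/raw/run_c.dat", datetime(2000, 1, 1, 13, 10)),
    ]
    for path in files_to_retransfer(listing, backup):
        print(path)
```

Whether a file can actually be sent again depends on the buffer policy at the source: a file that was already deleted from the experiment's buffer cannot be re-transferred, which is consistent with the different outcomes reported for PHENIX and STAR.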

Raw Data Volume collected & archived since 11/26/07
[Figure: PHENIX Raw Data and STAR Raw Data volumes over time, starting 11/26; the only legible value label is 125 TB]

Raw Data Collected in RHIC Runs
[Figure: Run8 (d-Au only) compared with Run3 (d-Au only)]

Results from Data Taking
[Figure panels: PHENIX / RCF Network Link (10 Gbps max.); STAR / RCF Network Links (2 * 1 Gbps max.); Data Migration to Tape. Legible labels: 3 Gigabits/second, 1500 Megabits/second, 1500 GB/hour, 6,000 GB. The HPSS outage is marked on the plots.]
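The labels on this slide mix Gigabits/second, Megabits/second and GB/hour. A small conversion sketch makes them comparable; it assumes decimal units (1 GB = 10^9 bytes, 1 Gigabit = 10^9 bits), which the slide does not state, and the function names are invented for illustration.

```python
def gb_per_hour_to_gbit_per_s(gb_per_hour: float) -> float:
    """Convert GB/hour to Gigabits/second (decimal units assumed)."""
    return gb_per_hour * 8 / 3600


def gbit_per_s_to_gb_per_hour(gbit_per_s: float) -> float:
    """Convert Gigabits/second to GB/hour (decimal units assumed)."""
    return gbit_per_s / 8 * 3600


if __name__ == "__main__":
    # Rates that appear as labels on the slide.
    print(gb_per_hour_to_gbit_per_s(1500))  # 1500 GB/hour in Gigabits/second
    print(gbit_per_s_to_gb_per_hour(3.0))   # 3 Gigabits/second in GB/hour
    print(gbit_per_s_to_gb_per_hour(1.5))   # 1500 Megabits/second in GB/hour
```

Under these assumptions, 1500 GB/hour corresponds to roughly 3.3 Gigabits/second, 3 Gigabits/second to 1350 GB/hour, and 1500 Megabits/second to 675 GB/hour.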