Copyright © 2002 Legato Systems, Inc. Legato Confidential
Legato Systems, Inc - Confidential and Proprietary

Introduction
- Prerequisites for attending this TOI session
- Overview and Benefits of the new feature
- Installation considerations
- How to configure/enable the feature
- Using the feature
- Licensing considerations
- Architecture and internal Design
- Debugging techniques and tips
- Questions and Answers

Prerequisites
- Internal documentation (proprietary): d1783, d1682
- Non-proprietary: c_pr/spec/SMIS_v101.pdf, _v28 DSP0107.pdf

Overview and Benefits
- NetWorker's backup control and event management were limited, and monitoring and reporting were sparse.
- Detecting and classifying scheduled backup failures was difficult for the NetWorker administrator.
- The history of failures and events was not stored in a structured database.
- Runtime monitoring of backups was limited.
- Parallelism control was not centralized.

Overview and Benefits (cont.)
- Savegroup error reporting was inadequate.
- Savegroup failures were not easy to detect.
- Failure to back up some files was not treated as an error or a warning.
- Failures were reported only after the saveset completed.
- Runtime monitoring of savesets was limited.
- Control over individual savesets was limited.

Overview and Benefits (cont.)
- To solve these problems, a new jobs framework is used.
- The framework uses a new daemon called nsrjobd (the jobs daemon).
- The jobs daemon maintains a repository that stores information about jobs, such as status, indications, and job session information.
- This information is gathered at run time so that active jobs can be monitored.
- Jobs are managed and controlled from a central point, which makes it possible to stop an individual backup, for example from the GUI.
- Jobs are queued in the jobs daemon for central parallelism control.

Overview and Benefits (cont.)
- Savegroup starts jobs through the new central jobs daemon (nsrjobd).
- Savegroup receives information while processes run, rather than after the fact. This allows better inactivity timeout monitoring.
- Jobs report indications (events) continuously while they run.
- Savegroup monitors the indications and generates its error reporting from them.

Overview and Benefits (cont.)
Old framework

Overview and Benefits (cont.)
New framework
[Diagram labels: save or savefs, nsrexecd, nsrjobd, savegrp, nsrd; Server, Client]

Overview and Benefits (cont.)
System requirements to use the feature
- Standard requirements
- Needs more space under /nsr/res

Overview and Benefits
Where to learn more
- D1783
- D1786
- D
- NMC TOI

Installation Considerations
Changes to installation
- /nsr/res/jobsdb is created at installation.
- New binary on the server: nsrjobd.
- The RAP database used by nsrjobd does not export an RPC interface, but it is viewable on disk.

Configuring the Feature
How to enable and/or configure this feature
- Always enabled (cannot be disabled)
- New attributes in the NSR resource
  - Maximum Jobs DB size
  - Minimum Retention time
- New attributes in savegroup
  - Restart window: time limit for a valid restart (default: 12:00 hr)
  - Success threshold: threshold that determines success/failure based on indication severity (default: Warning)

Using the Feature
- The daemon is started by nsrd and runs only on the server, not on storage nodes or clients.
- The daemon performs all remote execution and gathers information about the client-side processes.
- The information is kept in permanent storage so that NMC can use it for reporting.

Using the Feature
- New commands: none
- GUI: changes in the GUI are described in the NMC TOI

Using the Feature
Attributes
- Minimum retention time: configures the minimum amount of time that records stay in the jobs database.
- Maximum Jobsdb size: configures the maximum amount of space that the records use (as reported by save -nq).
- Restart window: sets the time limit within which the last run is still considered a valid backup.

Using the Feature (cont.)
Success threshold
- Use the Success threshold to control when savesets are reported as failed.
- If the Success threshold is set to Warning (the default), the savegroup is reported as successful even if warning indications are generated.
- Setting the Success threshold to “Success” means that warnings are treated and reported as failures.
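
The threshold rule above can be illustrated with a minimal sketch. It is not NetWorker code: the `Severity` enum, its numeric ordering, the `ERROR` level, and the function name are assumptions made for illustration. The slides name only "Success" and "Warning" as threshold values.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ordering, lowest to highest severity (assumption).
    SUCCESS = 0
    WARNING = 1
    ERROR = 2


def saveset_succeeded(indications, threshold=Severity.WARNING):
    """Return True when no indication is more severe than the threshold."""
    # A saveset with no indications counts as a clean success.
    worst = max(indications, default=Severity.SUCCESS)
    return worst <= threshold


# Default threshold (Warning): warnings still count as success.
print(saveset_succeeded([Severity.WARNING]))
# Threshold "Success": the same warning is reported as a failure.
print(saveset_succeeded([Severity.WARNING], Severity.SUCCESS))
```

Under these assumptions, tightening the threshold changes only how existing indications are classified, not which indications are generated.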

Using the Feature (cont.)
[Figure: Group Properties - Advanced]

Using the Feature (cont.)
Report changes: Summary section

NetWorker savegroup: (notice) Default completed, Total 3 client(s), 1 Succeeded with warnings(s), 2 Succeeded. Please see group completion details for more information.
Succeeded with warnings: scoop.legato.com
Succeeded: greenland.devlab.legato.com, soft
Start time: Tue Jul 19 16:00:
End time: Tue Jul 19 16:01:

Using the Feature (cont.)
Indications

--- Unsuccessful Save Sets ---
* pa1pberde:c:\SFU\var\adm save: Saving files modified since Thu Feb 24 16:01:
* pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\.security
* pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\utmpx
* pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\wtmpx
* pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\
* pa1pberde:c:\SFU\var\adm C:\SFU\var\
* pa1pberde:c:\SFU\var\adm C:\SFU\
* pa1pberde:c:\SFU\var\adm C:\
* pa1pberde:c:\SFU\var\adm /
* pa1pberde:c:\SFU\var\adm pa1pberde: c:\SFU\var\adm level=incr, 8 KB 00:00:05 5 files
* : File C:\SFU\var\adm\.security could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)
* : File C:\SFU\var\adm\utmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)
* : File C:\SFU\var\adm\wtmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)

Using the Feature (cont.)
Previously completed save sets in a restart

--- Previously Completed Save Sets ---
aragorn: / level=full, 3831 MB 02:08: files
aragorn: /space level=full, 6907 MB 02:48: files
dev-nwserv: index:aragorn level=full, 63 MB 00:00:09 9 files

Licensing Considerations
- This feature is not licensed.

Questions and Answers
Any questions that have not been answered yet?

Architecture and Internal Design
Architectural diagram
[Diagram labels: savegrp, nsrjobd, Jobs database, nsrd, nsrexecd, save, Console/GUI, nsrmmd]

Architecture and Internal Design (cont.)
More notes on internal design
- The jobs daemon uses session channels wherever possible for remote execution and for communication with nsrd and savegrp.
- Every job gets a record in the jobs database. The record remains for a period of time and is then purged, based on the attributes set in the NSR resource.
- The daemon is multi-threaded, and not all threads are persistent. Depending on the OS, it may therefore appear that more than one nsrjobd is running.
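
One way to picture the purge is the sketch below. The slides do not say how the two NSR attributes interact, so the policy shown is an assumption: a record is removed only when it is older than the minimum retention time and the database is still over its maximum size. `JobRecord`, its fields, and `purge` are hypothetical names, not part of nsrjobd.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: int
    completed_at: float  # seconds since an arbitrary epoch
    size_bytes: int


def purge(records, now, min_retention_s, max_db_bytes):
    """Return the records kept after one purge pass (assumed policy)."""
    ordered = sorted(records, key=lambda r: r.completed_at)  # oldest first
    total = sum(r.size_bytes for r in ordered)
    survivors = []
    for rec in ordered:
        expired = now - rec.completed_at >= min_retention_s
        if expired and total > max_db_bytes:
            # Oldest expired record goes first while the DB is over size.
            total -= rec.size_bytes
        else:
            survivors.append(rec)
    return survivors


records = [JobRecord(1, 0.0, 100), JobRecord(2, 50.0, 100), JobRecord(3, 90.0, 100)]
kept = purge(records, now=100.0, min_retention_s=40.0, max_db_bytes=150)
print([r.job_id for r in kept])
```

In this sketch a record younger than the minimum retention time is never removed, even if the database is over size, which is one plausible reading of "minimum" retention.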

Architecture and Internal Design (cont.)
- Savegroup opens a bidirectional session channel with nsrjobd at start.
- Savegroup asks nsrjobd to start a remote job.
- nsrjobd opens a bidirectional session channel with the client's nsrexecd.
- nsrexecd forks the child job and keeps a bidirectional session channel to the job.
- The job reports state changes to nsrjobd.

Architecture and Internal Design (cont.)
- nsrjobd relays the state changes to savegroup.
- Once the job gets a media session, nsrd relays the session information to nsrjobd.
- Savegroup monitors for inactivity based on the media session information and the activity timestamp in the nsrjobd database.
- All stdout is redirected by nsrjobd to savegroup for backward compatibility.
- Savegroup uses the stdout messages for completion reporting.
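
The inactivity check described above can be sketched as a comparison of each job's last activity timestamp against a timeout. This is an illustration only: the dictionary layout, the function name, and the use of plain seconds are assumptions, and the real check also takes media session information into account.

```python
def inactive_jobs(last_activity, now, inactivity_timeout_s):
    """Return the ids of jobs whose last activity is older than the timeout.

    last_activity maps a job id to the timestamp (in seconds) of the most
    recent activity recorded for that job (hypothetical layout).
    """
    return sorted(
        job_id
        for job_id, timestamp in last_activity.items()
        if now - timestamp > inactivity_timeout_s
    )


# Job 1 has been silent for 100 s, job 2 for 10 s; the timeout is 30 s.
print(inactive_jobs({1: 0.0, 2: 90.0}, now=100.0, inactivity_timeout_s=30.0))
```

Because the timestamps are updated while jobs run, a check like this can be made continuously rather than only after a job exits.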

Architecture and Internal Design (cont.)
- The instrumented client binaries (save) can generate indication events for errors or warnings, as well as completion events for the job.
- nsrjobd relays these indications to savegroup and also stores them in the jobs database.
- Savegroup determines the success or failure of the backup based on indication severity.

Debugging Techniques and Tips
How to obtain debugging or tracking information
- Uses the standard debugging option -D with levels 1-9.
- All debugging and error output is logged to daemon.log.
- All output is prefixed with a date/time stamp and the daemon name.
- The database at /nsr/res/jobsdb can be viewed using standard RAP tools. It contains a record of the jobs that have run, which makes it a useful source of information for debugging.
- Core file locations follow the same convention as for all other daemons.
- Use -vvv to get verbose output from the remote client.
- -D is not relayed to spawned jobs.
- The verbose output is copied to daemon.log, and the temp file is retained as in 7.2.
- The Debug indication level is not used in 7.3. Sleeker, internationalized tracing of remote backup job failures is expected in a future release.

Debugging Techniques and Tips
Common pitfalls you or the customer may encounter
- Server machines need more memory, disk space, and CPU power than in the past. NetWorker still works with a modest low-end server configuration.
- Reported data (file sizes and completion times) will not exactly match what the GUI reports, because the values come from different sources. The numbers can be slightly skewed.
- If an instrumented client binary does not report the right indication level, a warning can look like an error, or vice versa.
- All messages are in the client's locale, so messages from clients in a different locale are not translated to the server's locale. (This will be addressed in the next release.)
- The retention period is too long, or the maximum size of the jobs database is too large.
- Client state transitions are lost, making savegroup appear hung. (nsrjobd periodically cleans up jobs that are in an incorrect state, which lets savegroup recover from the hang.)
- A very small restart window causes the previous backups to be considered invalid, so restarts take longer.
- A very large restart window can cause a restart to overlap with the next scheduled run. (Ideally the restart window should be half of the interval.)
- Grouping needs to be changed: important clients and savesets, for which a warning should be considered a failure, should be placed in groups with a Success threshold of “Success”.
- Reporting information is lost if the NMC daemon (gstd) is not run for a period longer than the minimum retention period.
- Savegroup is unable to spawn processes. Check the new authorization settings and the servers files.
- Customers wonder why there are no nsrexec's running on the server. This is as designed.
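
The restart window pitfalls can be made concrete with a small sketch. It assumes the window is measured from the start of the previous run to the time of the restart; the slides do not state the reference point, and the function names are hypothetical. The 12-hour default and the "half of the interval" guideline come from the slides.

```python
from datetime import datetime, timedelta

DEFAULT_RESTART_WINDOW = timedelta(hours=12)  # default stated in the slides


def previous_run_is_valid(previous_start, restart_time,
                          window=DEFAULT_RESTART_WINDOW):
    """Return True if the previous run still counts as a valid backup.

    Assumption: the window is measured from the start of the previous run.
    """
    return restart_time - previous_start <= window


def suggested_window(schedule_interval):
    """Apply the guideline that the window be half the schedule interval."""
    return schedule_interval / 2


start = datetime(2005, 7, 19, 16, 0)
print(previous_run_is_valid(start, start + timedelta(hours=6)))
print(previous_run_is_valid(start, start + timedelta(hours=13)))
print(suggested_window(timedelta(hours=24)))
```

With a daily schedule the guideline yields the 12-hour default, which keeps a restart from running into the next scheduled start.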

Debugging Techniques and Tips (cont.)
Error messages customers might see
daemon.log messages
- All jobs did not end gracefully….
  - Some jobs were not aborted at exit, and savegroup was forced to exit without waiting for all jobs to exit.
  - The completion report will not be valid for all jobs.
- Lost channel with server
  - Communication with nsrjobd was broken, which caused savegroup to abort.
  - If this message is seen repeatedly, nsrjobd is either too busy to handle requests or hung. If a restart does not solve the problem, a daemon diagnosis of nsrjobd (truss, pstack, etc.) might be needed.
- Aborting inactive job (%d)
  - The job has not saved data for longer than the inactivity timeout.
  - The network bandwidth to the client needs to be checked.
  - If the save process is hung in a disk read, a retry might resolve the issue.

Known Issues and Limitations
Known issues and/or bugs
- A restarted savegroup does not clone savesets from previous runs. (Existing issue in all past releases.)
  - Workaround: none. (Planned to be resolved in the next maintenance release.)
Limitations
- Older clients will not have indications.
  - Workaround: clients should be upgraded to 7.3.
- Not all binaries are fully instrumented to generate the new indications. (Gradual approach.)
- CPE will be trained to extend existing error messages into indications for 7.3 clients.

Questions and Answers
Any questions that have not been answered yet?

Demonstration
If time permits:
- show the db layout on disk and browsing of the db using nsradmin
- show savegrp -D9 and -vvv output and explain how to read the new debug messages
- show the temp files created (and how to clean up the debug temp files)

Questions and Answers
Any questions that have not been answered yet?
Thanks for attending