
1 TNPM v1.3 Flow Control

2 High Level
- Instead of each component having flow control settings that govern only its own directory, there is now one set of flow control settings for each datachannel root directory, covering all components that live in that directory.
- Components no longer monitor their own space usage. Instead, a Disk Usage Server (DUS) inside the AMGR monitors the space for each datachannel root directory on that host.
- Components ask the DUS whether there is enough space to write to disk, and stop processing when there is not.
- When the space remaining in a datachannel root directory becomes too low, the DUS tells all components that live in that root directory to free up some space (or all the space they can).
- Components try harder not to overuse space: they acquire only a few hours of data before processing it, and they stop when a few hours of output data are waiting to be picked up.

3 High Level
- Components can still become flow controlled (stopped) because there is not enough space or because the quota for the datachannel root directory has been exceeded.
- Components still store old data that is no longer needed in their done directory, and delete this data when more space is required.

4 Flow Control Overview
[Diagram: the DiskUsageServer inside the AMGR covers the datachannel root directory /dc, which contains /dc/CME.1.1, /dc/FTE.1.1, /dc/LDR.1, and /dc/DLDR.1. The components CME.1.1, FTE.1.1, LDR.1, and DLDR.1 each contain a DiskUsageClient.]
- Components ask the DUS if they can use more disk space.
- The DUS tells components to free disk space when necessary.

5 Managing Consumed Space
When disk consumption is < 80%:
- The DiskUsageServer continues to answer yes to space requests.
When disk consumption is >= 80%:
- The DiskUsageServer contacts all components that reside in this root directory and tells them to free up some space as they see fit. For example, each component may delete only 5 hourly directories, or only 50 files.
- The DiskUsageServer continues to answer yes to space requests.

6 Managing Consumed Space
When disk consumption is >= 90%:
- The DiskUsageServer contacts all components that reside in this root directory and tells them to free up all the space they can.
- The DiskUsageServer answers no to space requests, which stops all components in this root directory except the LDR and DLDR components. LDR and DLDR are allowed to run because the system cannot unblock itself unless they run. They are given 9% of the total quota in which to operate and load data, which can unblock the system provided no errors are occurring.
When disk consumption is >= 99%:
- The DiskUsageServer also answers no to space requests from the LDR and DLDR components in this root directory.

7 Managing Free Space
When free disk space <= FS_LL (the free space low limit, configured as FC_FSLL):
- The DiskUsageServer contacts all components that reside in this root directory and tells them to free up all the space they can.
- The DiskUsageServer answers no to space requests, which stops all components in this root directory.
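The threshold rules on the "Managing Consumed Space" and "Managing Free Space" slides can be summarized in a short sketch. This is an illustration in Python, not the DUS implementation: the names `DusDecision` and `decide` are hypothetical, and the sketch assumes that "disk consumption" is measured as a percentage of FC_QUOTA and that the free-space check is evaluated before the percentage checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DusDecision:
    purge: str           # "none", "some", or "all"
    allow_regular: bool  # answer given to components other than LDR/DLDR
    allow_loaders: bool  # answer given to LDR and DLDR


def decide(consumed: int, quota: int, free: int, fs_ll: int) -> DusDecision:
    """Map space figures (in bytes) to the behavior described on the slides."""
    # Free space at or below the low limit stops every component.
    if free <= fs_ll:
        return DusDecision("all", False, False)
    # Assumption: consumption is a percentage of the quota. Integer math
    # keeps the comparison exact at the thresholds.
    if consumed * 100 >= 99 * quota:
        return DusDecision("all", False, False)
    if consumed * 100 >= 90 * quota:
        return DusDecision("all", False, True)
    if consumed * 100 >= 80 * quota:
        return DusDecision("some", True, True)
    return DusDecision("none", True, True)
```

For example, with 237,341,696 bytes consumed against a quota of 2,800,000,000 (about 8.5%) and free space above the low limit, the sketch returns no purge request and a yes for every component.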

8 Good Citizen
Components try to behave as good citizens by:
- Acquiring and buffering only a few hours of data in advance in their do directory (default is 4 hours). This can be configured at the component level by modifying FC_MAX_DO_HOURS.
- Producing only a few hours of data in their output directory and stopping if this data is not picked up by downstream components (default is 4 hours). This can be configured at the component level by modifying FC_MAX_OUTPUT_HOURS.
- Honoring their retention interval and keeping only a certain number of hours of data in the done directory, even if space is available. This has not changed from the previous release. It can be configured at the component level by modifying FC_RETENTION_HOURS.
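A minimal sketch of the three "good citizen" checks, in Python. The function names are hypothetical, and the exact comparison the components use is not stated on the slide; the sketch assumes a component stops once the buffered amount reaches the configured limit. No default is given for FC_RETENTION_HOURS, so it is a required argument here.

```python
def may_acquire(do_hours: float, fc_max_do_hours: float = 4) -> bool:
    """True while the do directory holds less than FC_MAX_DO_HOURS of data."""
    return do_hours < fc_max_do_hours


def may_produce(output_hours: float, fc_max_output_hours: float = 4) -> bool:
    """True while less than FC_MAX_OUTPUT_HOURS of output awaits pickup."""
    return output_hours < fc_max_output_hours


def hours_to_purge(done_hours: float, fc_retention_hours: float) -> float:
    """Hours of done data beyond the retention interval (never negative)."""
    return max(0, done_hours - fc_retention_hours)
```

Under these assumptions a component holding 4 hours of unprocessed input would stop acquiring, and a component with 30 hours in done and a 24-hour retention interval would purge 6 hours.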

9 Supported Configurations
- Single datachannel root directory.
- Component directories are on the same disk (not mounted or linked).
[Diagram: Disk 1 holds the datachannel root, which contains FTE.1.]

10 Supported Configurations
- Multiple datachannel root directories (these can be on different disks).
- Component directories are NOT mounted or linked.
- You can create a root directory for each channel, or one for all FTEs, or use any other organization you choose.
[Diagram: Disk 1 holds Datachannel Root 1, which contains FTE.1.1. Disk 2 holds Datachannel Root 2, which contains FTE.2.1.]

11 New Restrictions
- Previously, if you were running low on disk space, you could mount or link a component directory (say CME.1.1) from another file system. This is no longer allowed.
- Instead of mounting or linking a component directory, you can mount another datachannel root directory and put some components in this new datachannel root directory.
- This new datachannel root directory must have its own DUS configuration settings.

12 Unsupported Configurations
- The datachannel root and component directories are on different disks.
- This is done by mounting or linking component directories.
- This is NOT SUPPORTED and will cause problems.
[Diagram: Disk 1 holds Datachannel Root 1, whose FTE.1.1 directory is a link or mount to Disk 2.]

13 Example DUS Configuration
AMGR.DC1C.DUS.1.FC_FSLL=150000000
AMGR.DC1C.DUS.1.FC_QUOTA=2800000000
AMGR.DC1C.DUS.1.LOCAL_ROOT_DIRECTORY=/opt/datachannel
AMGR.DC1C.DUS.1.REMOTE_PASSWORD=CACCDHDBCCCJ
AMGR.DC1C.DUS.1.REMOTE_ROOT_DIRECTORY=/opt/datachannel
AMGR.DC1C.DUS.1.REMOTE_USERNAME=pvuser
AMGR.DC1C.DUS.1.USE_SECURE_FILE_TRANSFER=TRUE
AMGR.DC1C.DUS.1.PORT_NUMBER=21

14 DUS Configuration Settings
- FC_FSLL is the free space low limit. When the disk has less than this amount of space available (in bytes), components become flow controlled (stopped).
- FC_QUOTA is the amount of space (in bytes) you wish to allocate to the components running in this datachannel root directory.
- LOCAL_ROOT_DIRECTORY is the full local path to the datachannel root directory.
- REMOTE_ROOT_DIRECTORY is the path to the datachannel root directory when accessing this directory via ftp or sftp.
- REMOTE_USERNAME is the username to use when accessing this datachannel root directory via ftp or sftp.
- REMOTE_PASSWORD is the password to use when accessing this datachannel root directory via ftp or sftp.
- USE_SECURE_FILE_TRANSFER specifies that sftp should be used when accessing this datachannel root directory from another host.
- PORT_NUMBER is the port number to use for ftp or sftp.
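To make the shape of these settings concrete, here is a small Python sketch that groups them by DUS instance and converts the numeric and boolean values. It assumes the settings are available as NAME=VALUE text lines in the form shown in the example configuration; how TNPM actually stores them is not covered by the slides, and `parse_dus_settings` is a hypothetical helper. The sample leaves out the remote credentials.

```python
INT_KEYS = {"FC_FSLL", "FC_QUOTA", "PORT_NUMBER"}
BOOL_KEYS = {"USE_SECURE_FILE_TRANSFER"}


def parse_dus_settings(lines):
    """Group NAME=VALUE lines by their DUS prefix, e.g. 'AMGR.DC1C.DUS.1'."""
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition("=")
        # Everything before the last dot identifies the DUS instance.
        prefix, _, key = name.rpartition(".")
        if key in INT_KEYS:
            value = int(value)
        elif key in BOOL_KEYS:
            value = value.upper() == "TRUE"
        result.setdefault(prefix, {})[key] = value
    return result


SAMPLE = """\
AMGR.DC1C.DUS.1.FC_FSLL=150000000
AMGR.DC1C.DUS.1.FC_QUOTA=2800000000
AMGR.DC1C.DUS.1.LOCAL_ROOT_DIRECTORY=/opt/datachannel
AMGR.DC1C.DUS.1.USE_SECURE_FILE_TRANSFER=TRUE
AMGR.DC1C.DUS.1.PORT_NUMBER=21
"""

settings = parse_dus_settings(SAMPLE.splitlines())
```

Each datachannel root directory gets its own entry in the result, which matches the rule that every root directory has its own DUS configuration.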

15 DUS Configuration in Topology Editor

16 Log Messages
V1:9017 2010.03.30-18.14.40 UTC AMGR.DC1C-4673:8272 FLOW_CTRL_STATE 1 Dir=/opt/datachannel Actual free space = 416,288,768 Free space low limit = 150,000,000 Actual consumed space = 237,341,696 Space quota = 2,800,000,000 Consumed space calc milliseconds =91
- The DUS inside the AMGR logs this message so you can see how much space is currently used and available on the filesystem.
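The figures in a FLOW_CTRL_STATE message can be pulled out with a regular expression. The sketch below is in Python and relies only on the field labels visible in the sample message; `parse_state` and `percent_of_quota` are hypothetical helpers, and other releases may word the message differently.

```python
import re

# Field labels as they appear in the sample FLOW_CTRL_STATE message.
FIELD_RE = re.compile(
    r"(Actual free space|Free space low limit|Actual consumed space|Space quota)"
    r"\s*=\s*([\d,]+)"
)


def parse_state(line):
    """Return the byte counts from a FLOW_CTRL_STATE line as integers."""
    return {name: int(value.replace(",", "")) for name, value in FIELD_RE.findall(line)}


def percent_of_quota(state):
    """Consumed space as a percentage of the quota."""
    return state["Actual consumed space"] * 100 / state["Space quota"]


SAMPLE = (
    "V1:9017 2010.03.30-18.14.40 UTC AMGR.DC1C-4673:8272 FLOW_CTRL_STATE 1 "
    "Dir=/opt/datachannel Actual free space = 416,288,768 "
    "Free space low limit = 150,000,000 Actual consumed space = 237,341,696 "
    "Space quota = 2,800,000,000 Consumed space calc milliseconds =91"
)

state = parse_state(SAMPLE)
```

For the sample message, consumption works out to roughly 8.5% of the quota, and free space is well above the low limit.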

17 Log Messages
2010.03.24-15.00.00 UTC DG.1.13-17864:2515 FLOW_CTRL_ON 1 Flow control is being asserted
- Components log this message when the system is low on available disk space and the DUS is answering no to components' space requests. The component is flow controlled (stopped) until more space becomes available.
2010.03.24-15.25.59 UTC DG.1.13-17864:2515 FLOW_CTRL_OFF 1 Flow control has been deasserted
- Components log this message when space has become available and they are returning to normal processing. The component is no longer flow controlled (stopped).

18 Log Messages
2010.03.30-18.15.05 UTC FTE.4.8-8977:7706 FLOW_CTRL_PROCESSING_PAUSED GYMDC39209W Processing paused because output at maximum
- Components log this message when there is too much data in the output directory waiting to be acquired by downstream components.
2010.03.23-19.33.49 UTC CME.1.2-26344:1784 FLOW_CTRL_PROCESSING_UNPAUSED GYMDC39211I Processing unpaused because no longer at max output
- Components log this message when enough output data has been acquired.

19 Log Messages
2010.03.24-17.06.15 UTC AMGR.DCAIX2-1622116:4888 FLOW_CTRL_PURGE_SOME 1 Notifying components in dir (/opt/proviso/datachannel) to purge some
- The DUS logs this message when it is telling components to delete some data from their done directory. This is normal and should not cause worry.
2010.03.24-17.06.17 UTC CME.2.2000-1646612:15281 FLOW_CTRL_PURGE_SOME 1 Server requests I purge some
- Components log this message when they are told to delete some data from their done directory. This is normal and should not cause worry.

20 Log Messages
2010.03.24-15.25.40 UTC AMGR.DC1C-4673:11897 FLOW_CTRL_PURGE_ALL 1 Notifying components in dir (/opt/proviso/datachannel) to purge all
- The DUS logs this message when it is telling components to delete all data from their done directory.
2010.03.24-15.25.41 UTC CME.1.13-19745:5271 FLOW_CTRL_PURGE_ALL 1 Server requests I purge all
- Components log this message when they are told to delete all data from their done directory.

21 Log Messages
2010.03.24-15.25.40 UTC AMGR.DC1C-4673:11897 FLOW_CTRL_QUOTA_FAILURE GYMDCDC10111 Error: Some error. Unable to get disk consumption for dir: /opt/datachannel
- The DUS logs this message when it encounters an error while running the du command.
2010.03.24-15.25.40 UTC AMGR.DC1C-4673:11897 FLOW_CTRL_FS_FAILURE GYMDCDC10157 Error: Some error. Unable to get free disk space for dir: /opt/datachannel
- The DUS logs this message when it encounters an error while calculating the amount of free space available on this filesystem.

22 Troubleshooting Tips
- Grep the log for FLOW_CTRL log messages.
- Run the du command manually on the root directory to make sure it works.
- Run the df command manually to see how much free space is available.
- If your system is catching up after some components were stopped, it is normal to see components log FLOW_CTRL_PROCESSING_PAUSED and FLOW_CTRL_PROCESSING_UNPAUSED as they rush ahead and downstream components are unable to keep up with the output of new data.
- BCOL and LDR have FLOW_CTRL_SKIP log messages that describe why BCOL or LDR is skipping the acquisition of data. Usually it is because too much data has already been acquired and buffered.
- CME logs NOT_ACQUIRING_TUPLES for a number of reasons. It could be flow controlled, or it could have already acquired and buffered too much data. The message can also indicate that CME is receiving data from some of its inputs but not others, because of a down collector or a stopped FTE or CME.
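As a complement to grepping by hand, the following Python sketch tallies FLOW_CTRL messages by name so that a burst of one kind stands out. It assumes only that each log line carries a message name beginning with FLOW_CTRL_, as in the samples on the log message slides; `count_flow_ctrl` is a hypothetical helper.

```python
import re
from collections import Counter

# Message names such as FLOW_CTRL_ON or FLOW_CTRL_PURGE_SOME.
FLOW_CTRL_RE = re.compile(r"\bFLOW_CTRL_[A-Z_]+\b")


def count_flow_ctrl(lines):
    """Count log lines per FLOW_CTRL message name."""
    counts = Counter()
    for line in lines:
        match = FLOW_CTRL_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


SAMPLE_LOG = [
    "2010.03.24-15.00.00 UTC DG.1.13-17864:2515 FLOW_CTRL_ON 1 Flow control is being asserted",
    "2010.03.24-15.25.59 UTC DG.1.13-17864:2515 FLOW_CTRL_OFF 1 Flow control has been deasserted",
    "2010.03.24-17.06.15 UTC AMGR.DCAIX2-1622116:4888 FLOW_CTRL_PURGE_SOME 1 Notifying components in dir (/opt/proviso/datachannel) to purge some",
    "2010.03.24-17.06.17 UTC CME.2.2000-1646612:15281 FLOW_CTRL_PURGE_SOME 1 Server requests I purge some",
]

counts = count_flow_ctrl(SAMPLE_LOG)
```

To scan a real log, pass an open file object instead of the sample list; lines without a FLOW_CTRL name are ignored.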

23 Troubleshooting Tips
- The system depends on LDR and DLDR being able to load data into the database and then delete that data from the disk. This means that LDR and DLDR are allowed to run even if other components are stopped because the system is low on disk space.
- When flow control problems happen, components back up from right to left (see diagram below). If your LDR is crashing, it will eventually cause CME, then FTE, then UBA to flow control.
- This means that when you notice a problem, you should start by looking at the components on the right to see if they are the cause.
[Diagram: components shown left to right as UBA, FTE, CME, LDR, DLDR, with a "start" label after DLDR. Caption: Flow control problems cause backups upstream.]
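The right-to-left rule can be expressed as a small heuristic. The Python sketch below is only an illustration of the advice above, not a diagnostic tool: `first_suspect` is a hypothetical helper, and the pipeline order UBA, FTE, CME, LDR is taken from the LDR example in the text.

```python
def first_suspect(pipeline, flow_controlled):
    """Return the component to inspect first, or None if nothing is stopped.

    pipeline is ordered from upstream (left) to downstream (right).
    """
    stopped = [i for i, name in enumerate(pipeline) if name in flow_controlled]
    if not stopped:
        return None
    rightmost = max(stopped)
    # The neighbor to the right of the stopped run is the likely cause. If the
    # run already reaches the right end, start with that last component.
    return pipeline[min(rightmost + 1, len(pipeline) - 1)]
```

With UBA, FTE, and CME all flow controlled, the sketch points at LDR, which matches the crashing-LDR scenario described on the slide.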

24 Upgrade
- All installations before upgrade should have one datachannel root directory per host.
- Check that there are no linked or mounted component directories under the datachannel root directory. If there are, they need to be reconfigured so that they are local directories under the main root directory or under a new mounted root directory.
- The Topology Editor will sum up the component quotas and set the default root directory quota to this sum. Check that this sum is not greater than the amount of disk space available.
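The check for linked or mounted component directories can be scripted. This Python sketch uses the standard `os.path.islink` and `os.path.ismount` calls and assumes component directories are direct children of the datachannel root (as in /dc/CME.1.1); `find_linked_or_mounted` is a hypothetical helper name.

```python
import os


def find_linked_or_mounted(root):
    """List (path, kind) for child directories that are links or mount points."""
    problems = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.islink(path):
            problems.append((path, "link"))
        elif os.path.isdir(path) and os.path.ismount(path):
            problems.append((path, "mount"))
    return problems
```

An empty result for a root directory such as /opt/datachannel means no direct child needs to be reconfigured before the upgrade.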

25 Environment Design Guidelines
- Never link or mount a component directory under a datachannel root directory.
- FC_QUOTA for a root directory should not exceed the amount of actual space available on the filesystem.
- FC_FSLL should be large enough to be useful. Setting this number too low will make it very hard to recover if the system runs out of space. Think of this number as the buffer of space that will be available to recover from running out of space.
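The two sizing guidelines can be turned into a quick sanity check. In this Python sketch, `check_dus_settings` is a hypothetical helper, "space available on the filesystem" is read as the filesystem capacity passed in by the caller, and the 5% minimum for FC_FSLL is an arbitrary illustrative threshold, since the slides do not say how large "large enough" is.

```python
def check_dus_settings(fc_quota, fc_fsll, fs_capacity, min_fsll_percent=5):
    """Return warnings for settings that break the design guidelines.

    All sizes are in bytes. min_fsll_percent is an illustrative threshold,
    not a value taken from the product.
    """
    warnings = []
    if fc_quota > fs_capacity:
        warnings.append("FC_QUOTA exceeds the space available on the filesystem")
    # Integer comparison avoids floating-point rounding at the boundary.
    if fc_fsll * 100 < min_fsll_percent * fs_capacity:
        warnings.append("FC_FSLL may be too small to allow recovery")
    return warnings
```

The capacity could be taken from the df output or from `shutil.disk_usage(path).total` in the Python standard library.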

