
Summary

Block backup tasks that run over a WAN link or through a router always fail two hours after they begin.

 

 

Symptoms

 

 

Block backup fails exactly 2 hours after data transfer starts


172.16.0.99 2/6/2017 5:04:15 pm SNBNCJ_100N Task 6 NDMP_LOG: id(3.520), type(NDMP_LOG_NORMAL), text(CDOT transport type is [NFS] for backup target: [/vol/v01k05054/[v01k05054]v01k05054@{544AE08A}])
... two hours later
172.19.8.13 2/6/2017 6:51:22 pm SNBNCX1031E Error calling fn (ms_recv_msg) rc (10053)
172.16.0.99 2/6/2017 7:04:15 pm SNBSVH_968E Create relationship(v01k05054/[v01k05054]v01k05054@{544AE08A}) failed with exception: NDMPSessionException(0, createRelationship exception: cm_recv_rec failed, localPort (49423), peer (172.19.8.13:58316), rc (10054), description (The connection to the module has been reset.), peerstring (ssndmpc 2.2/4.4 win-x64 14:22:07 Aug 1 2015))

or

172.27.1.67 6/12/2018 8:35:37 pm SNBNCJ_100N Task 5 NDMP_LOG: id(1.208), type(NDMP_LOG_NORMAL), text(SVP: Done File History. Files sent:[1])
... two hours later
172.27.1.67 6/12/2018 10:37:56 pm SNBSVH_969E Transfer backup(R:/[JOBNAME@{50C49958}) failed with exception: NDMPSessionException(0, transferBackup exception: cm_recv_rec failed, localPort (57146), peer (172.27.1.67:57147), rc (12004), description (cm_recv_rec reported 0-len read: peer module closed connection.), peerstring (ssndmpc 2.2/4.5 win-x64 23:08:19 Apr 10 2018))
172.27.1.67 6/12/2018 10:37:56 pm SNBSVH_245J Task 5 remaining retry count: 5

 

Resolution

The backup failure is caused by an idle connection timeout setting on the switch/router/firewall.

Most firewall administrators prefer to set the idle connection timeout low, often to just a few minutes. If a backup/restore job goes through a firewall, any DPX connection that stays idle longer than that timeout is terminated, which can happen within a few minutes of the job starting. Connections that transfer data rarely go idle, but control connections between DPX modules almost always go idle while the data transfer phase is in progress. The network device therefore usually terminates these control connections during the data transfer phase of the backup/restore job. When the connections are needed again for job control, the job encounters an error and fails.
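
The condition described above can be sketched as a simple comparison: a control connection survives only if the end-point sends a keep-alive probe before the firewall's idle timeout expires. The function name survives_idle_timeout and the 10-minute firewall timeout are assumptions made for illustration; 7200 seconds is the default keep-alive time on Windows and Linux, which is consistent with the two-hour failure seen in the logs.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the keep-alive time (seconds) is shorter
# than the firewall idle timeout (seconds), so a probe is sent in time.
survives_idle_timeout() {
    keepalive_secs=$1
    firewall_idle_secs=$2
    [ "$keepalive_secs" -lt "$firewall_idle_secs" ]
}

# Assumed firewall idle timeout of 10 minutes, for illustration only.
FIREWALL_IDLE_SECS=600

# Compare the default keep-alive time (7200 s) and a 5-minute value (300 s).
for keepalive in 7200 300; do
    if survives_idle_timeout "$keepalive" "$FIREWALL_IDLE_SECS"; then
        echo "keep-alive ${keepalive}s: control connection stays open"
    else
        echo "keep-alive ${keepalive}s: control connection is dropped while idle"
    fi
done
```

Replace FIREWALL_IDLE_SECS with the value reported by the firewall administrator before relying on the result.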

There are two solutions to this problem:

  • Send keep-alive packets more frequently, at an interval shorter than the idle connection timeout on the firewall. If the firewall administrator is not willing to raise that timeout, this is the only option. Change the TCP keep-alive timeout on the affected client and/or Master server to a value below the firewall timeout, such as 5 minutes (300000 milliseconds).
Based on the operating system of the end-point, perform one of the following procedures to modify the value of TCP keep-alive timeout:
Windows: 
Windows has no command that lists the TCP/IP kernel variables. If the registry value described below is absent, Windows uses its default keep-alive time of 7200000 milliseconds (two hours).
To set an explicit value for the keep-alive timeout, create or edit the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime (type REG_DWORD). Its value is the timeout in milliseconds.
Note: Reboot the server for this change to take effect.
Linux: 
sysctl -A | grep net.ipv4 
displays kernel variables for TCP/IP v4. The variable that controls keep-alive timeout is net.ipv4.tcp_keepalive_time. 
To set an explicit value, use the command: sysctl -w net.ipv4.tcp_keepalive_time=<value in seconds>. 
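
A minimal sketch of the Linux change, assuming a 5-minute keep-alive time (300 seconds, because this variable is in seconds rather than milliseconds). The names KEEPALIVE_SECS, APPLY and the file 90-dpx-keepalive.conf are illustrative choices, not DPX settings. Without APPLY=1 the script only prints the configuration line it would write.

```shell
#!/bin/sh
# Sketch: set the Linux TCP keep-alive time to 5 minutes.
# KEEPALIVE_SECS, APPLY and CONF_FILE are illustrative, not DPX settings.
KEEPALIVE_SECS=300
CONF_FILE=/etc/sysctl.d/90-dpx-keepalive.conf

# Format the setting as a sysctl configuration line.
keepalive_conf_line() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
}

if [ "${APPLY:-0}" = "1" ]; then
    # Requires root. Show the current value, then change it immediately.
    sysctl -n net.ipv4.tcp_keepalive_time
    sysctl -w net.ipv4.tcp_keepalive_time="$KEEPALIVE_SECS"
    # A value set with sysctl -w is lost at reboot, so also store it in a
    # configuration file that is read at boot.
    keepalive_conf_line "$KEEPALIVE_SECS" > "$CONF_FILE"
else
    # Dry run: print the line that would be written to CONF_FILE.
    keepalive_conf_line "$KEEPALIVE_SECS"
fi
```

Run it as root with APPLY=1 to make the change. Some distributions use /etc/sysctl.conf instead of /etc/sysctl.d, so adjust CONF_FILE as needed.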
Solaris / HP-UX: 
ndd /dev/tcp \? 
displays kernel variables for TCP/IP. The variable that controls keep-alive timeout is tcp_keepalive_interval. 
To get the current value of this variable, use the command: ndd /dev/tcp tcp_keepalive_interval. 
To set an explicit value, use the command: ndd -set /dev/tcp tcp_keepalive_interval <value in milliseconds>. 
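
The Solaris / HP-UX procedure can be sketched the same way, assuming a 5-minute keep-alive interval. The names KEEPALIVE_MINUTES, APPLY and minutes_to_ms are illustrative. The script uses expr and backticks so that it also runs under older Bourne shells. Without APPLY=1 it only prints the ndd command it would run.

```shell
#!/bin/sh
# Sketch: set tcp_keepalive_interval to 5 minutes on Solaris or HP-UX.
# KEEPALIVE_MINUTES and APPLY are illustrative names.
KEEPALIVE_MINUTES=5

# ndd expects milliseconds, so convert minutes to milliseconds.
minutes_to_ms() {
    expr "$1" \* 60000
}

KEEPALIVE_MS=`minutes_to_ms "$KEEPALIVE_MINUTES"`

if [ "${APPLY:-0}" = "1" ]; then
    # Requires root. Print the value before and after the change.
    ndd /dev/tcp tcp_keepalive_interval
    ndd -set /dev/tcp tcp_keepalive_interval "$KEEPALIVE_MS"
    ndd /dev/tcp tcp_keepalive_interval
else
    # Dry run: print the command that would be executed.
    echo "ndd -set /dev/tcp tcp_keepalive_interval $KEEPALIVE_MS"
fi
```

A value set with ndd is generally not kept across a reboot, so check the platform documentation for the supported way to reapply it at startup.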

Note: The keep-alive timeout is a system-wide setting. Changing it affects every socket connection created with the keep-alive option, not only those created by DPX.
  • Increase the idle connection timeout on the firewall to a value equal to or greater than 2 hours.