Summary



This article answers frequently asked questions and describes infrastructure requirements for DPX operations in a customer network.

Symptoms



Troubleshooting Network and Communication Problems FAQs

Resolution



Prerequisites

  1. Collect master server, device server, client operating system, and service pack information.
  2. Collect DPX version (SSPRODIR\.build or SSPRODIR\.swinfo) and patch level (SSPRODIR\updates\patches.level) data, where SSPRODIR is the installation folder.
  3. Consult the System Requirements and Compatibility matrix and validate the configuration: https://mysupport.catalogicsoftware.com/compatibility.php
  4. Confirm that the master server, device server, and client can ping each other by hostname and by IP address.
  5. Confirm that the DPX command prompt is functional:
    For Windows use: "BEX CMD Shell"
    For Linux use: ". ./bexenv"
    For Linux OES use: "backupexpress.sh"
  6. If applicable, collect logs from the NDMP proxy (the client node that communicates with the NetApp storage system).

1. Network/Communication Failure

There should not be a firewall, VPN, or third-party software in the environment because these items could block TCP or UDP ports between the master server, device servers, or client nodes. If there is a firewall, additional configuration is required before a backup job can run.

  1. Stop operating system-bundled firewall services. If firewall services cannot be stopped, additional configuration is required before a backup job can run.
  2. Use current driver and firmware for all NICs. Use recent firmware for LAN hardware (switches, routers).
  3. Configure each NIC with fixed parameters (e.g. 1 Gbps, Full Duplex) instead of autoconfiguration. Configure the switch port with the same fixed speed and duplex settings instead of "Auto".
  4. If the NIC offers TCP Offload Engine (TOE) or hibernation capabilities, disable these NIC settings to identify the root cause of failure:
    For NetApp: options ip.fastpath.enable off
  5. If there are multiple IP addresses bound to the NICs, or multiple NICs are configured as a TEAM to share one IP address, modify SSICMAPI (the registry string value under the Catalogic Software key, or the file under SSPRODIR, depending on operating system type) to append:
    -hn Preferred-Local-IP-Address-To-Bind
  6. Regarding firewall setup, see:

2. Socket keep-alive timeout value

Most firewall administrators prefer to set the socket keep-alive timeout value low, such as a few minutes. If a backup/restore job goes through a firewall, any idle DPX connections will terminate within a few minutes of the job starting. While it is not common for network connections transferring data to go idle, control connections between DPX modules almost always go idle when the data transfer phase is in progress. Such control connections are usually terminated at the firewall during the data transfer phase of the backup/restore job. When these connections need to be used again for job control, the job encounters an error and fails. There are two solutions to this problem:

  • Increase the firewall socket keep-alive timeout value to a value greater than 2 hours.
  • Send keep-alive packets more frequently (an interval less than the keep-alive timeout on the firewall).

If the firewall administrator is not willing to modify the value of the keep-alive timeout on the firewall, the only other option is to modify the TCP keep-alive timeout value on the end-points of the network connection (the second solution above). Based on the operating system of the end-point, perform one of the following procedures to modify the value of TCP keep-alive timeout:

Linux:

sysctl -A | grep net.ipv4

displays kernel variables for TCP/IP v4. The variable that controls keep-alive timeout is net.ipv4.tcp_keepalive_time.

To set an explicit value, use the command: sysctl -w net.ipv4.tcp_keepalive_time=<value in seconds>.
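
As a minimal sketch (not part of DPX), the following Python snippet reads the current value from /proc/sys/net/ipv4/tcp_keepalive_time, which is the procfs view of the same kernel variable, and checks it against the firewall timeout. The function names and the 300-second firewall timeout are illustrative assumptions.

```python
from pathlib import Path

# procfs view of the net.ipv4.tcp_keepalive_time kernel variable (Linux only).
KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_seconds(path=KEEPALIVE_PATH):
    """Return the kernel keep-alive time in seconds."""
    return int(path.read_text().strip())


def keepalive_beats_firewall(keepalive_s, firewall_timeout_s):
    """True if keep-alive packets start before the firewall drops an idle connection."""
    return keepalive_s < firewall_timeout_s


FIREWALL_TIMEOUT_S = 300  # hypothetical firewall idle timeout (5 minutes)

if KEEPALIVE_PATH.exists():  # skip quietly on non-Linux systems
    current = read_keepalive_seconds()
    print(current, keepalive_beats_firewall(current, FIREWALL_TIMEOUT_S))
```

If the second printed value is False, lower the kernel value with the sysctl -w command shown above.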

Solaris / HP-UX:

ndd /dev/tcp \?

displays kernel variables for TCP/IP. The variable that controls keep-alive timeout is tcp_keepalive_interval.

To get the current value of this variable, use the command: ndd /dev/tcp tcp_keepalive_interval.

To set an explicit value, use the command: ndd -set /dev/tcp tcp_keepalive_interval <value in milliseconds>.

Windows:

In Windows, there is no way to query the system for variables. See Microsoft KB article 120642 for tuning TCP parameters.

To set an explicit value for keep-alive timeout, create or edit the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime. Its value is the timeout in milliseconds.

Note: Reboot the server for this change to take effect.

Note: Tuning the keep-alive timeout value is a system-wide setting. It affects any socket connection created with the keep-alive option, not only those created by DPX.
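
Because the three platforms above expect different units, it is easy to enter a value that is off by a factor of 1000. The following Python sketch converts one interval in minutes into the unit each setting expects; the function name and dictionary keys are illustrative only.

```python
def keepalive_settings(minutes):
    """Express one keep-alive interval in the unit each platform expects."""
    seconds = minutes * 60
    return {
        "linux_tcp_keepalive_time": seconds,                    # seconds
        "solaris_hpux_tcp_keepalive_interval": seconds * 1000,  # milliseconds
        "windows_KeepAliveTime": seconds * 1000,                # milliseconds
    }


print(keepalive_settings(10))
```

For a 10-minute interval this yields 600 for Linux and 600000 for Solaris / HP-UX and Windows.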

3. TCP/IP connection

If the connection is between two modules running on the same machine, by default this uses a named pipe, but a TCP/IP connection can be used by specifying -lcl no in the SSICMAPI parameters.

4. Tickle Time

This is an option on the job handler, which is disabled by default. It is enabled by setting --tickletime X on the job handler where X is a period in minutes.

As long as messages continue to be ready for processing (on any connection), the time to tickle is checked. Once the elapsed time exceeds the setting, another round of TOK_DUMMY messages is sent. Since the tickle time is not checked while the message handler is waiting for a message to arrive (before the timeout expires), a tickle may not be sent out until the tickle time plus the timeout value has elapsed since the last tickle was sent. If the timeout value is 10 minutes, a setting of 15 minutes for tickle time results in a range of 15 to 25 minutes between tickles.
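
The range described above can be worked out for any pair of values. This is a minimal Python sketch of that arithmetic; the function name is illustrative and is not a DPX interface.

```python
def tickle_interval_range(tickle_minutes, timeout_minutes):
    """Return (shortest, longest) gap in minutes between two tickle rounds."""
    # Shortest: a message is ready as soon as the tickle time has elapsed.
    # Longest: the handler waits out a full timeout before checking again.
    return tickle_minutes, tickle_minutes + timeout_minutes


print(tickle_interval_range(15, 10))  # (15, 25)
```

When choosing a tickle time for a firewall, compare the longest gap, not the tickle time itself, with the firewall timeout.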

On Windows: String Value

HKLM\Software\Syncsort\BackupExpress\<node name>\0\SSJOBHND

Value Data

--tickletime 10

On UNIX: String value

To set the value on UNIX, modify the file ssjobhnd in the DPX bin directory. In this file, append the string " --tickletime 10 " to the end of the exec statement for jobhnd. The result may look similar to the following:

exec ./jobhnd --noconnectioncheck --maxjob 10 --tickletime 10

Note: The feature is not enabled by default, but will not cause any disruption or degradation to the normal operation of the product.
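
If the UNIX edit has to be repeated on several nodes, it can be scripted. The following Python sketch appends an option to the exec line of a launcher script only when the option is not already present. The function name and the sample script contents (including the #!/bin/sh line) are assumptions for illustration; check the actual ssjobhnd file before applying anything like this.

```python
def append_exec_option(script_text, flag, value):
    """Append 'flag value' to each exec line that does not already carry the flag."""
    out = []
    for line in script_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "exec" and flag not in tokens:
            line = f"{line.rstrip()} {flag} {value}"
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the original text had one.
    return (result + "\n") if script_text.endswith("\n") else result


# Hypothetical file contents, modeled on the exec statement shown above.
sample = "#!/bin/sh\nexec ./jobhnd --noconnectioncheck --maxjob 10"
print(append_exec_option(sample, "--tickletime", 10))
```

Running the function a second time on its own output leaves the text unchanged, so the option is never duplicated.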

5. Return Code 10060 (CM_ERR_ETIMEDOUT)

The 10060 error is a general TCP error that points to communications issues between computers. The cmagent and child processes package the information to back up on the remote machine and push it across the network. To test the underlying TCP/IP protocol and the communications path, ping the nodes involved continuously with at least a 1024 byte packet size. To do so, at a command prompt type:

   ping -l 1024 [ip address] -t

If the ping command times out, troubleshoot this as a network issue.
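
The ping options shown above are the Windows form. As a hedged sketch, the following Python helper builds an equivalent command for Windows or for Linux, where the common iputils ping uses -s for the payload size and runs continuously by default. Other UNIX variants use different options, and the function name and sample address are illustrative.

```python
import platform


def ping_command(host, size=1024, system=None):
    """Build a continuous ping command with the given payload size."""
    system = system or platform.system()
    if system == "Windows":
        # -l sets the buffer size, -t pings until interrupted.
        return ["ping", "-l", str(size), "-t", host]
    # Assumes Linux iputils ping: -s sets the payload size.
    return ["ping", "-s", str(size), host]


print(ping_command("10.0.0.5", system="Windows"))  # hypothetical address
print(ping_command("10.0.0.5", system="Linux"))
```

The returned list can be passed to subprocess.run to start the test.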

Alternatively, test your network connectivity by running testperf, a tool provided with DPX and located in the DPX tools directory. The testperf utility transfers data from the machine where you run the command to a destination machine and reports throughput statistics. Syntax:

   testperf send 10000 32768 -s [Destination hostname] -p -v -u -r

Note: cmagent must be running for testperf to function. Run the command once with Destination hostname set to your master server host name, and again with it set to the node in question. If the node in question is not the device server, run the second testperf with the device server's name.

Provide us with the output for each run of the testperf command.

Check the following:

  • Verify that the TCP/IP address servers (Domain Name Servers) are configured correctly on this PC and have the correct address for this server.
  • If the address is correct, check the HOSTS file on your computer for errors. If the IP address specified in the HOSTS file is incorrect for the server, correct it and try the operation again.
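
To make the HOSTS file check less error-prone, the entries for a server can be extracted programmatically. This Python sketch parses HOSTS-format text and returns every address mapped to a given name. The function name, host names, and addresses are hypothetical.

```python
def hosts_addresses(hosts_text, hostname):
    """Return every IP address that HOSTS-format text maps to hostname."""
    found = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        if hostname.lower() in (name.lower() for name in names):
            found.append(ip)
    return found


# Hypothetical HOSTS file contents.
sample = """
127.0.0.1   localhost
10.1.2.3    dpxmaster dpxmaster.example.com   # hypothetical master server
"""
print(hosts_addresses(sample, "dpxmaster"))  # ['10.1.2.3']
```

Compare the result with the address the server actually uses, and with the answer from socket.gethostbyname for the same name; more than one address, or a mismatch, points to a stale entry.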

6. KEEPALIVE option

By enabling the KEEPALIVE option when the TCP/IP connection is made, active communication between DPX modules is determined by the operating system's KEEPALIVE time setting.

To enable the KEEPALIVE option for TCP/IP sockets, the SSICMAPI must be modified on both servers at each end of the communication. Append the following value to SSICMAPI:

   /ska y or -ska on

Note that the KEEPALIVE time setting on the operating system must be less than the timeout setting on the firewall or the firewall drops the connection.
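
For background, this is how an application typically requests keep-alive on a TCP socket. The Python sketch below sets the standard SO_KEEPALIVE socket option; how DPX applies the option internally is not described in this article, so treat the snippet as an illustration of the operating system feature only.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with the operating system keep-alive option enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Once connected, the OS sends keep-alive packets after the system-wide
    # keep-alive time has passed with no traffic on the connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

The option only turns the mechanism on; the interval still comes from the operating system setting described in section 2.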

7. CMAGENT fails to restart with error 10048

The latest cmagent (*.cml) log in <SSPRODIR>\logs displays the following error:

     cmagent: Error(10048) on cm_ap_listen

If CMAGENT fails to restart with error 10048, use the -r option so that CMAGENT can reuse the socket even if it has not been completely released.

-r

Allows Reuse Of TCP Address + Port Number For CMAgent Listening EndPoint. Defaults: UNIX - on, Windows - off, Netware - off.

After adding the -r option, restart the DPX services.
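
Error 10048 is the Winsock code for "address already in use" (WSAEADDRINUSE). As an illustration of address reuse in general, the Python sketch below sets the standard SO_REUSEADDR option before binding a listening socket. That the -r option maps to this socket option is an assumption, not something this article states, and the exact behavior of SO_REUSEADDR differs between Windows and UNIX.

```python
import socket


def make_listener(port, reuse_address):
    """Create a TCP listener on the loopback address, optionally with address reuse."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_address:
        # Allow binding even if an earlier socket on this address and port
        # has not been fully released yet.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    return sock


listener = make_listener(0, reuse_address=True)  # port 0: the OS picks a free port
print(listener.getsockname()[1] > 0)
listener.close()
```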

8. DPX and Oracle Support

If a node has Oracle, edit the sbt11cfg.bex file with the same values as SSICMAPI.

9. skidletime

If the tml logs show 10054 and 10038 errors after multiple "waiting for any msg timeout (60)" messages, skidletime can be implemented. This is a setting in the tape mount manager module (tmm), --skidletime N, where N is a number of minutes. It resets the socket every N minutes so that the firewall does not close it. N must be less than the firewall inactive connection timeout.

To implement this change on Windows:

  1. Run regedit.
  2. Navigate to HKLM\Software\Syncsort\BackupExpress\<node name>\0.
  3. If a value named "SSTPTMM" does not exist at this location, create it as a new String Value.
  4. Edit the SSTPTMM value.
  5. Append --skidletime N to the value data, where N is the interval, in minutes, at which the socket is reset so that it does not close.
  6. Close regedit and restart DPX.

To implement this change on UNIX:

  1. Go to the bin directory under DPX.
  2. Edit the sstptmm file.
  3. Append --skidletime N to the end of the line, where N is the interval, in minutes, at which the socket is reset so that it does not close.
  4. Save and close the file.
  5. Restart DPX.

10. How do I configure the Microsoft Windows Firewall to work with DPX?

To allow DPX access through the Microsoft Windows Firewall, add either the ports DPX uses or the DPX Programs to the exceptions list for the Firewall:

1)    Open the Microsoft Windows Firewall using either of the following:

  • Select Start > Run, then type firewall.cpl and click OK.
  • Select Start > Settings > Control Panel > Windows Firewall.

On Windows Server 2008 R2, click Turn Windows Firewall on or off, which displays the Windows Firewall Setting dialog. In the Windows Firewall Setting dialog, select the Exceptions tab, then add either the ports used by DPX or the DPX programs to Exceptions:

a) To Add the DPX Ports:     

  1. Click Add Port:     
  2. In the Add a Port dialog, specify the ports and if they are TCP or UDP. If you want to specify the computers for which these ports are unblocked, click Change scope: to restrict access to a list of IP addresses, your subnet or any computer.     
  3. After you enter your port information, click OK.
  4. Click Add Port: to add any additional ports.

b) To Add the DPX Programs:     

  1. Click Add program:     
  2. In the Add a Program dialog, either select the program from the list or click Browse to locate the program. It is recommended that you add all the executable files (extension "exe") found in the $SSPRODIR\bin directory. If you are performing OSSV backups, you will also need to add $SSPRODIR\tools\jre\bin\java.exe. If you want to specify the computers for which this program is unblocked, click Change scope: to restrict access to a list of IP addresses, your subnet or any computer.
  3. After you select your program, click OK.
  4. Click Add Program: to add any additional programs.

2)    Under the Exceptions tab, verify that the check box next to your Program or Port is selected, then click OK.

Note: If you decide later that you do not want the program to be an exception, you must clear this check box. 

3)    Restart the DPX cmagent service on the node.

11. Errors 10060 or 10065

The node addition process requires unblocked communication between the management console/GUI and port 6123 on the node being added. If this communication is blocked, error 10060 or 10065 is displayed. Newer versions of Linux run a firewall by default. If the firewall is not properly configured with a regular security mode, the Linux operating system will turn on the firewall for most communications. Note that the procedure below applies to the host-side firewall on the node.

  1. Log in to the Linux node as a root user.
  2. Run the iptables -L command.
  3. Check the output of the iptables -L command to see if the firewall allows CMAGENT to communicate with the master server.
  4. Turn off the node's firewall (/etc/init.d/iptables stop) or configure it so that the DPX client can properly communicate with the master server.
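
After changing the firewall, it helps to confirm from the master server or management console machine that port 6123 on the node is reachable. The following Python sketch attempts a plain TCP connection; the function name and timeout are illustrative, and a successful connection shows only that the port is open, not that DPX is healthy.

```python
import socket


def can_connect(host, port=6123, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, timeouts,
        # and host names that do not resolve.
        return False
```

For example, can_connect("node-hostname") returning False while ping succeeds suggests that a firewall is still blocking the port or that cmagent is not running on the node.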

On SUSE Enterprise Linux, there is a service named "SuSEfirewall2_setup" that controls the firewall settings that have been configured with the YaST firewall utility. This service can be started, stopped and restarted using the "service" command.

Example: linux-w2mu:~ # service SuSEfirewall2_setup stop
