
Summary



Many modules require access to the data stored in the DPX catalog files. The database module (ssdb.exe on Windows, db on Unix) is responsible for maintaining these files, and serves as a central point of contact for the other modules. If this module is busy handling a prior request, such as cataloging files at the end of a backup job or supplying restore job information for a large number of files, it may not respond to requests by the other modules in a timely manner.

Symptoms



Connectivity-related failures occur while other jobs are running or while catalog-intensive activity is in progress.

One module (ssdatmgr.exe on Windows, datmgr on Unix) acts as a wrapper for all access to the database module, running a separate thread to handle each request from one of the other modules. If the database module does not respond within 2 minutes, this wrapper module times out and returns an error to the module that requested access. Depending on the nature of the request, the failure may cause a job to fail or may cause an error to be reported in the GUI.

Evidence that this timeout has happened will be found in the relevant ssdatmgr.*.log file on the master server, looking something like this:

Time(Jan 27 02:24:24):TID(1102064560): * [DMM Server Thread started on socket(8) by client(sstptmm@x.x.x.x)]

Time(Jan 27 02:26:24):TID(1102064560): | Error: DMMServer Thread caught exception while Connecting to DBCore: CMError - sockID(14), cmFunction(cm_ap_connect_str), cmErr(12032)

Time(Jan 27 02:26:24):TID(1102064560): | Error: DMMServer Thread is going to Error State (2300, Device Media Manager experienced an internal error (CMError - sockID(14), cmFunction(cm_ap_connect_str), cmErr(12032)) while processing request (DMM::downloadTabs).)

Note the 2 minute interval between the start of the affected thread and the connection error.
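The check described above can be automated. The following is a minimal Python sketch, not a Catalogic-supplied tool: it assumes the log line layout shown in the excerpt, and the function name find_db_timeouts and the placeholder year are illustrative choices. It pairs each thread start with a later DBCore connection error on the same TID and reports the pairs that are at least 2 minutes apart.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as: Time(Jan 27 02:24:24):TID(1102064560): <message>
# (layout assumed from the log excerpt in this article)
LINE_RE = re.compile(
    r"Time\((?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\):TID\((?P<tid>\d+)\): (?P<msg>.*)"
)


def find_db_timeouts(lines, threshold=timedelta(minutes=2)):
    """Return (tid, elapsed) for threads whose DBCore connection error
    arrived at least `threshold` after the thread started."""
    started = {}  # TID -> timestamp of the "thread started" line
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        # The log carries no year; a fixed leap year is assumed so that
        # every month/day parses. Intervals spanning New Year come out
        # negative and are therefore not reported.
        ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S")
        tid, msg = m["tid"], m["msg"]
        if "DMM Server Thread started" in msg:
            started[tid] = ts
        elif "while Connecting to DBCore" in msg and tid in started:
            elapsed = ts - started.pop(tid)
            if elapsed >= threshold:
                hits.append((tid, elapsed))
    return hits


sample = [
    "Time(Jan 27 02:24:24):TID(1102064560): * [DMM Server Thread started on socket(8) by client(sstptmm@x.x.x.x)]",
    "Time(Jan 27 02:26:24):TID(1102064560): | Error: DMMServer Thread caught exception while Connecting to DBCore: "
    "CMError - sockID(14), cmFunction(cm_ap_connect_str), cmErr(12032)",
]

for tid, elapsed in find_db_timeouts(sample):
    print(f"TID {tid}: DB connection error {elapsed} after thread start")
```

To scan a real log, pass an open file object in place of sample. The path to the ssdatmgr.*.log file depends on the installation and is not assumed here.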

 



Resolution



An update is available that allows the wrapper module to retry connections that fail in this way. If a timeout occurs, the wrapper module retries the connection up to 9 additional times. Each attempt takes 2 minutes to time out, so the wrapper module can spend up to 20 minutes in total trying to connect.
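The retry arithmetic and control flow can be illustrated with a short Python sketch. This is not the ssdatmgr implementation: the function names, the connect callable, and the use of TimeoutError are assumptions made for illustration only. It models one initial attempt plus 9 retries, each bounded by the 2 minute timeout.

```python
def worst_case_minutes(retries=9, timeout_minutes=2):
    """Total time spent if every attempt times out:
    one initial attempt plus `retries` further attempts."""
    return (1 + retries) * timeout_minutes


def connect_with_retries(connect, retries=9, log=print):
    """Call `connect()` until it succeeds or the retries are used up.

    `connect` is a hypothetical callable that raises TimeoutError when
    the database module does not answer within the timeout.
    """
    remaining = retries
    while True:
        try:
            return connect()
        except TimeoutError:
            if remaining == 0:
                raise  # all attempts failed; report the error to the caller
            log(f"timed out while trying to connect to DB. "
                f"Retrying {remaining} more times.")
            remaining -= 1


print(worst_case_minutes())  # (1 + 9) * 2
```

With the defaults, the caller only sees an error after ten consecutive timeouts, which is why a briefly busy database module no longer causes an immediate job failure.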

Evidence that this retry is taking place will be found in the ssdatmgr.*.log file looking like this (the example is for a tape mount manager operation):

Time(Mar 22 00:16:23):TID(1162877872): * [DMM Server Thread started on socket(4) by client(sstptmm@x.x.x.x)]

Time(Mar 22 00:18:23):TID(1162877872): | DMMServer timed out while trying to connect to DB. Retrying 9 more times.

Time(Mar 22 00:18:23):TID(1162877872): | Warning: CMAPI::Socket::connect caught Exception while closing socket : CMError - sockID(7), cmFunction(cm_closesocket), cmErr(10038) -- ignoring.

Time(Mar 22 00:19:47):TID(1162877872): | tCtx(1111468788): Loaded a tape in media library (juke000) from slot (9) to device bay (1:L2).

Time(Mar 22 00:21:37):TID(1162877872): | tCtx(1111468898): Media(000009)(APPENDABLE|INMOUNT) is unassigned from slot (juke000, 9)

Time(Mar 22 00:21:37):TID(1162877872): | tCtx(1111468898): Device(L2) mounted media(000009)

Note that the timeout is reported after 2 minutes have elapsed, together with the remaining retry count. The socket closure message with error 10038 is normal: it is logged as the failed connection attempt is cleaned up. In this case the retry succeeds before the next 2 minute timeout, and normal operation continues.
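To see how close any thread came to exhausting its retries, the remaining retry count can be extracted from the log. The sketch below is a minimal Python example that assumes the exact message wording shown in the excerpt above; the function name lowest_remaining_retries is hypothetical.

```python
import re

# Message wording assumed from the log excerpt in this article.
RETRY_RE = re.compile(
    r"TID\((\d+)\): \| DMMServer timed out while trying to connect to DB\. "
    r"Retrying (\d+) more times\."
)


def lowest_remaining_retries(lines):
    """Map each TID to the lowest 'Retrying N more times' value it logged.
    A value near 0 means the thread almost ran out of retries."""
    lowest = {}
    for line in lines:
        m = RETRY_RE.search(line)
        if m:
            tid, remaining = m.group(1), int(m.group(2))
            lowest[tid] = min(remaining, lowest.get(tid, remaining))
    return lowest


retry_sample = [
    "Time(Mar 22 00:16:23):TID(1162877872): * [DMM Server Thread started on socket(4) by client(sstptmm@x.x.x.x)]",
    "Time(Mar 22 00:18:23):TID(1162877872): | DMMServer timed out while trying to connect to DB. Retrying 9 more times.",
]

print(lowest_remaining_retries(retry_sample))
```

A thread that is absent from the result never timed out. A thread whose lowest value is small was close to failing despite the retries.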

It should not be necessary to extend this retry beyond the default 9 retries. Please contact Technical Support if it appears that more is needed.

This behavior has been addressed in a DPX Maintenance Update. The update can be automatically installed via the BEX Software Update System user interface or can be manually downloaded from the Catalogic Software Online Support site (mysupport.catalogicsoftware.com). Check the Software Update Release Notes for your BEX release referencing Issue ID 2748.