Summary

The DPX Open Storage Server solution has been deployed at many customer sites and operates reliably when correctly sized and configured. Because ideal conditions are not always achievable in the field, the solution can destabilize. One serious potential effect of this instability is that the integrity of recovery points (snapshots) stored on the server may be compromised. This often goes undetected until a data recovery is actually needed, at which point it can cause considerable anxiety and dissatisfaction. In most instances of snapshot corruption, the root cause has been identified and either a fix has been implemented or a workaround suggested. However, there are rare circumstances where the underlying cause is still unknown. There is therefore a pressing need for a mechanism that detects the condition early, so that problem diagnosis and corrective action can begin immediately.

 

Step By Step

 

Verification strategies

An Open Storage Server host stores recovery points in the form of “logical” snapshots. Each snapshot is realized as a collection of regular files on the file system, organized in a specific layout. Some of these files comprise the metadata that defines the associations between the files. As snapshots are created or deleted, files are added, updated, and removed. Some recovery actions may also affect the state of the system. All of these processes are carefully designed to ensure that the consistency of the data and metadata is maintained. However, environmental factors or events can disrupt these multi-step processes and cause inconsistencies. Additionally, unauthorized user action, particularly file deletion, can compromise data.
 
Any verification attempt must therefore ensure that both file data integrity and relationship (metadata) integrity are maintained. The strategies presented here address each of these aspects.
 

Metadata verification

The layout and organization of the files needed to manage the snapshots is intricate. The DPX Nibbler component, which is solely responsible for managing open storage server snapshots, also provides the ability to verify their logical consistency. This can be done by opening a DPX command shell and executing the command shown in Figure 1.
 
 

>nibbler -f -c "verifysnap [allbmp=y|nobmp=y]"


 
This command validates the association (linkage) metadata for snapshots residing on all data volumes of the open storage server host. It can optionally verify the bitmap consistency for the latest snapshot (the default), for all snapshots, or for none of them, as illustrated below.
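
For reference, the three variants could be invoked as follows. This is a sketch derived from the syntax in Figure 1; confirm the exact option names against your DPX version:

>nibbler -f -c "verifysnap"
>nibbler -f -c "verifysnap allbmp=y"
>nibbler -f -c "verifysnap nobmp=y"

The first form performs the default bitmap check on the latest snapshot only, the second checks bitmaps for all snapshots, and the third skips bitmap checks and validates linkage metadata only.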
 

Data verification

The snapshot data files represent the actual blocks of a primary client volume image. Once created, these files remain unchanged until they are either deleted or updated when the corresponding snapshot is expired or merged with one being expired; this takes place during the condense process. Any failure during such an update should be detected by the bitmap verification process described in the previous section. Beyond this, intentional or inadvertent user action and virus infection are other activities that could compromise the data files.
 
When the likelihood of malicious data tampering is low, verifying the integrity of snapshot data files may be of lesser concern. If tampering does occur, however, the effects can be catastrophic, since the contents of any restored image would be compromised. Environments with high security requirements may wish to employ this level of verification as an additional safeguard.
 
Verification can be performed using a checksum validation process. A prepackaged DPX utility script is provided to accomplish this task. The following two steps are required:

  • Generate checksum values for all data files
  • Validate the checksums against the current contents of the data files

To generate the checksums, open a DPX command shell and execute the commands shown in Figure 2.
 

 

>cd ..\sched\scripts
>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum createif --volume <datavolume> --filter SNAPMETADATA
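
As a concrete illustration, if the snapshot data resides on volume E: (the volume letter is an assumption; substitute your actual data volume), the generation command would read:

>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum createif --volume E: --filter SNAPMETADATA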
 

 
Once checksums have been generated for all the data files, open a DPX command shell and execute the commands shown in Figure 3 to validate the data files against their expected checksums and ensure that their contents have not been altered.

>cd ..\sched\scripts
>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum verify --volume <datavolume> --filter SNAPMETADATA
 

The checksum generation step may be repeated whenever new snapshots have been created or existing snapshots have been updated in order to keep the checksums up to date.

Automating the verification

The prior sections described techniques for performing manual verification of snapshot metadata and data files. While this may suffice for the occasional need, it is not realistic to rely on ad-hoc manual steps for such a crucial activity. This section illustrates how verification can be automated using the aforementioned tools.
 
To enable checksum generation for snapshot data files, existing block backup jobs must be updated to include a Post-Job script step as shown below.

 
 

 
For greater clarity, the complete script specification is shown in Figure 5.
 

 

jsescript@<DOSS destination> --runlimit 0 --jobid %JOBID --runclass com.syncsort.bex.utils.SnapValidator -- --checksum createif --volume <Destination vol> --filter SNAPMETADATA
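
Filled in with example values, the Post-Job entry might read as follows; the node name DOSS1 and volume E: are hypothetical placeholders for the actual DOSS destination and destination volume:

jsescript@DOSS1 --runlimit 0 --jobid %JOBID --runclass com.syncsort.bex.utils.SnapValidator -- --checksum createif --volume E: --filter SNAPMETADATA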

 
 
To enable verification of snapshot data and metadata files, a separate job should be defined and scheduled at the desired frequency. The following illustrations show how this may be done to perform a daily verification.
 

  • Define a new “Copy” job, selecting pre-created dummy source and destination folders on the open storage server host

 

  • In Destination Options, for the Existing Files Handling option, be sure to select “Replace Existing Files and Directories” so that each run performs actual copy activity. The Copy source folder must contain at least one file or directory for the job to complete successfully.
  • Define a Pre-Job script that performs checksum verification as well as bitmap validation for all snapshots on the open storage server host

 
The complete script definition is reproduced in Figure 8 for better clarity.
 

jsescript@<DOSS destination> --runlimit 0 --jobid %JOBID --runclass com.syncsort.bex.utils.SnapValidator -- --checksum verify --volume <Destination vol> --filter SNAPMETADATA -- verifysnap -- verifybmp all
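
With the same hypothetical values (node DOSS1, volume E:) substituted for the placeholders, the Pre-Job entry might read:

jsescript@DOSS1 --runlimit 0 --jobid %JOBID --runclass com.syncsort.bex.utils.SnapValidator -- --checksum verify --volume E: --filter SNAPMETADATA -- verifysnap -- verifybmp all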

  

  • Define a schedule for the verification job (daily shown)
  • Save the verification job

 
Since snapshot data files can be updated as described earlier, the schedule for a verification job must be chosen carefully when checksum validation is performed, in order to avoid false verification failures. The correct orchestration requires that any condense operation be followed by at least one successful block backup (whose Post-Job script creates or updates the relevant checksum data) before the verification job runs; for example, with purely illustrative times, condense at 01:00, block backup at 02:00, and verification at 04:00. This sequence is shown below.

(Figure: Schedule arrangement for correct verification)

About the script

The previous sections presented only a small glimpse of the capabilities of the SnapValidator script provided with DPX. The script is a versatile utility designed to work as an independent Java application in addition to being used as a Pre-Job or Post-Job script for a DPX job.
 
While it was primarily developed to assist with open storage server data verification needs, the script can easily be used as a general-purpose checksum generator or directory scanner. Consult the Javadoc (included in the userscripts.zip archive) for details regarding the supported options and their behavior.
 
To run the script as a Java application, the following DPX library dependencies must be accessible and added to the classpath: JSE.jar and bexclasses.zip. Launch the application using the following command[1], which will display brief usage syntax information.
 
 

>java -cp userscripts.zip;JSE.jar;bexclasses.zip com.syncsort.bex.utils.SnapValidator --help
 

 
By default, the script generates an MD5 hash checksum value, since this provides the best balance of performance and security for the verification use case. A different hash algorithm may be chosen by using the “--method” option and specifying any supported standard algorithm name as the option value. The script was tested with SHA-1, SHA-256, SHA-512, and MD2.
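
For example, to generate SHA-256 checksums instead of MD5, the Figure 2 command could be extended as shown below. The placement of the option is an assumption; consult the Javadoc for the authoritative syntax:

>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum createif --volume <datavolume> --filter SNAPMETADATA --method SHA-256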
 
Files can be excluded from the operation by using the “--filter” option and specifying any valid regular expression pattern as the option value. This provides finer-grained control when required.
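
As a hypothetical illustration, the following pattern would exclude both the metadata files and any file whose name ends in .tmp; verify any pattern against your actual file layout before relying on it:

>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum verify --volume <datavolume> --filter "SNAPMETADATA|.*\.tmp"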
 

Performance

The performance of the script when the checksum option is selected greatly depends on the I/O subsystem of the host since it reads the complete contents of each selected file under the specified root directory in order to compute the checksum value. For sparse files, like the open storage server snapshots, this will even read the “holes” and can therefore take a considerable time even though the actual size of the file on disk appears to be small.
 
To calculate the approximate time required for a specific execution based on the selected volume(s), root directory, and filter, first determine the total data size that must be processed by executing a run with the “--checksum scan” option, and note the reported size statistics for each volume.
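
A scan invocation might look like the following; the volume letter E: is assumed here to match the sample output below:

>java -cp userscripts.zip;..\..\lib\JSE.jar;..\..\lib\bexclasses.zip com.syncsort.bex.utils.SnapValidator --checksum scan --volume E: --filter SNAPMETADATA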
 
 

INFO: Size stats for directory(E:\.BackupExpressSnapshots): Total Logical Size(1,359,787,491,517); Total Real Size(N/A)

 
Next, execute another run using the desired checksum operation while selecting a representative sample subset of the data. Note the reported performance and size statistics:
 
 

INFO: Performance stats for directory(E:\.BackupExpressSnapshots\SSSV_DOSS_Win\{BEX-453A-5333322102A0}\[DOSS_Win]QA41CLIENTW@{9665DF2B}): Processed Files(9); Duration(177.78s); Rate(0.05 files/sec)
INFO: Size stats for directory(E:\.BackupExpressSnapshots\SSSV_DOSS_Win\{BEX-453A-5333322102A0}\[DOSS_Win]QA41CLIENTW@{9665DF2B}): Total Logical Size(21,485,706,611); Total Real Size(N/A)
 

 
The estimated run time for the actual operation can then be calculated by extrapolating the results of the sample run to the total expected data size.
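
Using the sample figures above as a worked illustration (actual numbers will differ by environment):

rate = 21,485,706,611 bytes / 177.78 s ≈ 120.9 MB/s
time = 1,359,787,491,517 bytes / 120.9 MB/s ≈ 11,250 s ≈ 3.1 hours

A full run against this volume would therefore be expected to take a little over three hours.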

 
[1] The command syntax shown assumes that all dependencies are collocated with the script archive.