STANDARDS CHANGE REQUEST - MD5 Checksums for Files
==================================================

Purpose: Ensure that the data integrity of every file on a PDS volume
is preserved. The proposed method is to recursively calculate MD5
checksums for every file on the volume and store them in a single file
named md5sums.txt in the root directory of the volume.

Date: 2004-11-22, revision 1.0

Working Group: J. Wilf (lead), T. King, M. McAuley

Background
==========

A true archive is required to ensure data integrity. That is, no data
can be lost or corrupted. This SCR is concerned with data integrity at
the file level, i.e., for the individual files in the archive.

For any given file, it is possible to verify that the file has not been
corrupted. This is done by calculating an MD5 checksum for the file: a
128-bit number, represented as a 32-character string of hexadecimal
digits, e.g., 754b9db19f79dbc4992f7166eb0f37ce. The MD5 checksum for a
file never changes as long as the file itself doesn't change. So an MD5
checksum can be recalculated for a file at any time, and if it matches
the original MD5 checksum of the file, the file has not been corrupted.

There are free utilities that calculate MD5 checksums for any given
file. The free md5deep utility (http://md5deep.sourceforge.net/) has
the additional ability to recursively calculate the MD5 checksums of
every file below a given root directory. For example, these are the
first few lines of the output of md5deep, run from the root directory
of the mgsm_2004 MGS MAG/ER volume:

  $ md5deep -lr .
  f8dd7758cb5231c9e7817c4710d00b6e ./aareadme.htm
  d8b83365f5e117b9665181944889da3d ./aareadme.lbl
  1e8d45f622e09b9e2998af1a6d67a296 ./aareadme.txt
  7dcfa51691ddd149a5a091ebe87b9bb1 ./errata.txt
  7f310bf58a37af7f9b16c4fe68a131fb ./voldesc.cat
  e270623efc55a088cacfd1aaf17aca27 ./browse/mag_cal/hgaazeld.ps
  b9fd3452f5b25209174eddf6fd178160 ./browse/mag_cal/hgaazelm.ps
  [etc...]
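
The recursive calculation that md5deep performs in the listing above
can be sketched in Python. This is a minimal illustration, not part of
the proposal: the function names md5_of_file and md5_listing are
hypothetical, the two-space separator between checksum and path is an
assumed layout, and skipping md5sums.txt itself is an assumption, since
the SCR does not say whether the checksum file should list itself.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 checksum of one file as 32 hexadecimal digits."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data products need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_listing(root, skip=("md5sums.txt",)):
    """Yield 'checksum  ./relative/path' lines for every file below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a repeatable order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel in skip:
                continue  # assumption: the checksum file does not list itself
            # Assumed layout: checksum, two spaces, path relative to root.
            yield "%s  ./%s" % (md5_of_file(full), rel)
```

Writing the lines produced by md5_listing(".") to md5sums.txt from the
root directory of a volume would give a file of the same general shape
as the md5deep output shown above.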

If md5deep is run from the root directory of a volume and the output is
stored in a file, then md5deep can later read that file and check the
integrity of every file on the volume, with a command line such as
this:

  $ md5deep -x md5sums.txt -rl .

With the -x option, md5deep lists only the files whose checksums are
not found in md5sums.txt, so a file that has been altered since the
checksums were recorded appears in the output.

This SCR proposes that the data integrity of every PDS volume be
ensured by recursively calculating the MD5 checksum of every file on
the volume and saving the checksums in a file named md5sums.txt at the
root level of the volume. The checksums in md5sums.txt can then be used
to verify that no file on the volume has been altered.

Current Urgency
===============

Ensuring the integrity of the data is a fundamental part of data
archiving. This SCR will ensure the integrity of all files delivered on
PDS volumes.

Recommendations
===============

Given the urgency described above, the PDS Project Engineer recommends
the following actions:

1. To Management Council: Approve the "MD5 Checksums for Files" SCR as
   soon as possible.

2. To the PDS Standards Lead: Upon Management Council approval, publish
   the changes in the next version of the PDS Standards Reference, as
   described in the paragraphs below.

Changes to the Standards Reference
==================================

The following changes to the PDS Standards Reference are required to
support this SCR:

Chapter 19, "Volume Organization and Naming," will be updated to state
that an md5sums.txt file is required at the root level, containing MD5
checksums for every file on the volume. Note that every drawing of the
PDS volume structure will also need to be updated.

[Note: Actual wording will be supplied when the concepts are agreed
upon]

Changes to the Data Dictionary - None
==============================

Changes to the PDS Tool Suite
=============================

As the above description of md5deep shows, there are freely available
utilities that can calculate and check MD5 checksums for all files on a
PDS volume. Therefore, no additional PDS utilities need to be written
to implement this SCR.
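
For illustration only, and not as a proposed addition to the tool
suite, the check that such a utility performs against md5sums.txt might
look like the following Python sketch. The function name verify_volume
is hypothetical, and the sketch assumes each line of md5sums.txt holds
a checksum, whitespace, and a path relative to the volume root. Unlike
md5deep, it matches by path and does not report files that are present
on the volume but absent from md5sums.txt.

```python
import hashlib
import os


def verify_volume(root, listing_name="md5sums.txt"):
    """Return (corrupted, missing) lists of paths named in the listing."""
    corrupted, missing = [], []
    with open(os.path.join(root, listing_name), "r") as listing:
        for line in listing:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # ignore blank lines
            # Assumed line layout: checksum, whitespace, relative path.
            expected, rel = line.split(None, 1)
            path = os.path.join(root, rel)
            if not os.path.isfile(path):
                missing.append(rel)
                continue
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Read in chunks so large files need not fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                corrupted.append(rel)
    return corrupted, missing
```

Calling verify_volume(".") from the root directory of a volume returns
two empty lists when every listed file is present and unchanged.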

Impact Statement
================

The PDS Standards Reference will be modified as described above. In
addition, there are the following impacts to the PDS System and its
operation:

1. Data Providers will have to calculate and save the MD5 checksums as
   described above.

2. It is recommended that the curating nodes for each data set
   calculate and save the MD5 checksums for the online volumes already
   in their possession.

3. It is recommended that the OODT-PDS product server be modified so
   that MD5 checksums can be retrieved upon request (if available) to
   check the integrity of data downloaded from the system.

Open Issues - None
===========

[END OF SCR]