Attendees: Keith Bennett (GEO), Ed Guinness (GEO), Steve Hughes (EN), Steve Joy (PPI), Joe Mafi (PPI), Anne Raugh (SBN), Elizabeth Rye (EN), Dan Scholes (GEO), Dick Simpson (RS), Susie Slavney (GEO), Tom Stein (GEO)

MD5 Checksum

Steve Joy summarized the SCR: the MD5_CHECKSUM keyword is already in the PSDD with pending status, but there are no guidelines on how or when to use it; this SCR is designed to address that. A number of us are already using it: PPI and the Cassini fields and particles teams have requirements in their SISes to provide checksums, and there are other users at Imaging and GEO.
Anne: What about multiple detached files referenced by the same label?
Joy: Can the keyword be used within an object other than an implicit or explicit FILE object? The main issue Todd is addressing here is that the MD5 checksum should be applied at the file level, even if the file contains multiple objects.
Anne: What about a detached label referencing an image in one file and a table in another?
Joy: Use explicit FILE objects.
Rye: We need both file and object checksums, depending on the purpose. File-level checksums cover internet transfers and media refreshing of already-archived data; object-level checksums are needed for label upgrades or data not yet fully archived. We must not drop the object checksum.
Anne: Define CHECKSUM as a class word, and define MD5_CHECKSUM to refer to the object in which it appears (if it refers to an IMAGE object, it should be in the IMAGE object).
Anne: Reserve the .md5 file extension for a file containing MD5 checksums.
Joy: There is a standard tool (md5deep) already out there for checking a DVD's worth of checksums. It expects a file called md5sums.txt; that is the file it creates if you run it, and the file it will use if one already exists on a DVD.
Anne: Will the software let you give it an alternate file name?
Joy: Don't know; Joe is running a test right now.
Anne: Another question: doesn't that md5sums.txt file need a label?
Rye: Yes, it does.
Anne: It will need to be detached, or it will presumably cause problems for the software.
Joy: Exactly.
Rye: We'll need to have that spelled out explicitly in this SCR, with examples.
Simpson: Is this md5deep a PDS tool?
Rye: No.
Joy: It's a standard...
Simpson: Let's be careful about using "standard"; is it PDS?
Joy: No.
Simpson: Then we could write our own.
Anne: Is the algorithm available?
Joy: The algorithm is published.
Anne: We need a copy someplace in the PDS that we can reference.
Rye: That's something else that needs to be stated explicitly in the SCR.
Anne: Yes; if it isn't stated somewhere else that PDS needs to maintain a copy of all standards it references, then we need to state it explicitly here.
Ed: Suggestion: having confirmed that this meeting is about Todd's SCR, can each node take a turn to go around and discuss the SCR that Todd sent out, rather than a free-form discussion?
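(Note: the MD5 algorithm Joy refers to is published as RFC 1321 and is available in common standard libraries. Purely as an illustration of the file-level checksum under discussion, and not part of the SCR text, a minimal Python sketch using the standard hashlib module is shown below; the file name and chunk size are hypothetical choices.)

    import hashlib

    def md5_of_file(path, chunk_size=65536):
        """Return the 32-character hexadecimal MD5 digest of a whole file."""
        digest = hashlib.md5()
        with open(path, "rb") as f:              # binary mode: the checksum covers every byte
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()                # lowercase 32-character hex string

    print(md5_of_file("DATA/IMAGE001.IMG"))      # hypothetical file name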
Simpson: I'm not sure that Steve finished his introductory remarks.
Joy: No more to add, really. Todd has two SCRs, or two parts, here: one is for use of the MD5_CHECKSUM keyword in a label; the other is for creating a complete set of checksums for all files on a volume (electronic or physical) and how to store that information.
Rye: Let's go through the nodes. (roughly 20 min into recording)
GEO (Ed): In favor of using MD5 checksums. Tom has investigated different kinds of checksums and agrees with Todd's statement that MD5 is sufficient and probably easiest to use. In general we're in favor of this idea for ensuring the integrity of files and archives. A couple of comments: GEO has already been using checksums. On the third page, item 2, the manifest file in the root directory: we have no objection to this, but would point out that if the information used to do integrity checking on a disk or file system resides on that same system, it runs the risk of becoming corrupted, and therefore is not sufficient for the intended purpose of maintaining the integrity of the archive. Within the node, that information should be kept somewhere separate, so that if the manifest file is corrupted, you have a backup somewhere. Another issue: PDS should have information in the Standards Reference on how to generate one's own checksum manifest, and software that we've decided is good should be on the software page, so that users who want a copy of the software to check archives can get it. Tom also brought up the point that simply having a manifest table doesn't necessarily make it easy for users to check individual files; you need a tool to go into the manifest table, get the checksum, and compare it with the checksum of the current file (a sketch of such a check follows this exchange). PDS may want to come up with a tool that does the checksum and compares it with the source manifest file. Also, impact assessment, item 2, "including md5 checksums in delivery of online products" is vague: in the document, do we include checksums or not? In general, we're in favor.
Steve Joy: I do not believe that Todd was implying that checksums be required for inclusion.
Simpson: Both impacts sort of assume checksums are required.
Simpson: The "theoretical lifespans" statements need to be cleaned up; conjectural statements are problematic when this is demonstrated to be true in all but extreme circumstances (my impression).
Rye: The only real concerns with MD5 are for security purposes (cryptography); for our purposes it is very good. The only case where I've heard that two different files could produce the same checksum is where two bytes are flipped within the same record; two bytes flipped from one record to another give different checksums, and one byte changed gives a different checksum. Need to confirm with Myche. ACTION: get more details on this.
Simpson: Under current and planned use, "the" Cassini SIS: I assume there's more than one, so that needs to be more specific. Under "proposed solution", I don't see anything here that addresses point number 3, which says that this would permit MD5 checksums to be included; this is happening already, so what this really provides is a legal framework under which they can be included. Under the changes to the Standards Reference, the first sentence is not right: it says that this file includes a checksum for every file on the volume, but it will not include the md5sums.txt file itself; that needs to be revised. Also a question: do we intend to include directories like the EXTRAS directory?
Joy: If you don't include the EXTRAS directory and then run md5deep, it will report errors for files not having been validated; it expects that every file beneath its directory should be included.
Rye: Will an error be reported on the md5sums.txt file itself?
Joy: No.
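(To make Tom's point about per-file checking concrete, here is a minimal, hypothetical sketch; it is not an existing PDS tool. It assumes the manifest uses the common md5sum/md5deep layout of one "<32-character hexadecimal digest>  <relative path>" pair per line; that layout is an assumption here, not something the SCR specifies. The file paths are hypothetical.)

    import hashlib

    def load_manifest(manifest_path):
        """Parse an md5sums.txt-style manifest: '<hex digest>  <relative path>' per line."""
        table = {}
        with open(manifest_path) as f:
            for line in f:
                if not line.strip():
                    continue
                digest, rel_path = line.split(None, 1)
                table[rel_path.strip()] = digest.lower()   # normalize case for comparison
        return table

    def check_file(manifest_path, rel_path):
        """Compare one file's current MD5 against the value recorded in the manifest."""
        recorded = load_manifest(manifest_path).get(rel_path)
        if recorded is None:
            return "NOT IN MANIFEST"
        with open(rel_path, "rb") as f:
            current = hashlib.md5(f.read()).hexdigest()
        return "OK" if current == recorded else "MISMATCH"

    print(check_file("md5sums.txt", "DATA/TABLE001.TAB"))

(For large files the digest would be computed in chunks, as in the earlier sketch, and the path given must match the form recorded in the manifest, i.e., relative to the volume root.)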
Simpson: Will it check files parallel to itself, like AAREADME.TXT?
Joy: It will work from its own directory down (i.e., it will include AAREADME.TXT and ERRATA.TXT).
Simpson: On the changes to the PSDD, I would like to see a revised proposed submission for the data dictionary. The first paragraph here ends in a sentence fragment. The second paragraph says "most standard checksum calculators return a 32-character hexadecimal string"; we need to know about any that return something other than this, and PDS ought to refuse to deal with anything that doesn't produce a 32-character hexadecimal string as output.
Joy: They all produce 32 characters; most use lower-case letters.
Simpson: So we're talking about case, not number of characters.
Joy: In PDS, if you have a lower-case string on the right-hand side of the equals sign, you have to quote it.
Simpson: I understand.
Joy: That's the issue: do you have to quote it or not? (roughly 30 min into recording)
Simpson: Do we care if it's upper or lower case?
Raugh: If you are treating it as character encoding, it is specific bytes, so yes, upper case and lower case make a difference.
Joy: I couldn't find any software that produced a checksum in upper case.
Raugh: That's interesting, because if they were using character encoding you'd expect upper case to be about as common as lower case.
Guinness: The tool that Tom has been using has produced all-uppercase checksums.
ISSUE: Are checksums case sensitive?
Rye: Whatever software we use, we need to find out whether case matters, or else make sure that the software is case insensitive.
Simpson: I believe it's supposed to be a hexadecimal number; my guess is case doesn't matter.
Raugh: If it's a hexadecimal number, you shouldn't see any letters past "F".
Simpson: I don't.
Rye: The problem isn't whether or not it's hexadecimal, but whether or not the case matters.
Guinness: If it's hexadecimal, case doesn't matter.
Joy: If your calculator puts it out in lower case, you have to quote it.
Rye: Is that why you're treating this as a text string? We have the capability of representing hexadecimal numbers; shouldn't we be doing it that way, so that case won't matter?
Simpson: Yes, and then you won't have to use quotes.
Joy: That's preferable.
Simpson: Store it as upper case, then let interpreting software interpret it as needed. (Summary: basically okay with the idea, but not ready to make this required.)
EN: (background discussion on whether PDS software can handle a 32-character mask)
Rye: We need to check into that, but if the software can't handle it, we should just fix the software.
Ron: Wants to treat it as a string to avoid problems with parsers not being able to handle this large a number.
Rye: It's not a string, it's a number, so treat it as a number.
Simpson: It's a 128-bit number, or a 32-character hexadecimal string; i.e., it's a number, and if it's a number it shouldn't be quoted. But it's a number many of our computers can't deal with.
Hughes: What do we specify as the general data type?
Rye: Non-decimal is what we're supposed to use for octal, binary, and hexadecimal, but if we can't handle the number...? Is this going to be treated as a string by the software that's attempting to validate it?
Simpson: The impact statement may have to include "if"s in a lot of places where it runs into adding 32-character hexadecimal numbers.
Hughes: That's where we run into the impasse.
Rye: We need to look into this more. (Simpson finished)
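(A small illustration of the case question, using the well-known MD5 digest of zero bytes of input. The label lines in the comments are hypothetical renderings of the two options discussed, a quoted lower-case string versus a non-decimal value; the 16#...# notation is assumed from ODL's based-integer syntax and is not taken from the SCR.)

    import hashlib

    lower = hashlib.md5(b"").hexdigest()   # MD5 of empty input: 'd41d8cd98f00b204e9800998ecf8427e'
    upper = lower.upper()

    # Interpreted as hexadecimal numbers, the two spellings are the same 128-bit value:
    assert int(lower, 16) == int(upper, 16)

    # The two label representations under discussion (illustrative only):
    #   MD5_CHECKSUM = "d41d8cd98f00b204e9800998ecf8427e"        (quoted lower-case text string)
    #   MD5_CHECKSUM = 16#D41D8CD98F00B204E9800998ECF8427E#      (non-decimal, based-integer form)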
Raugh: In addition to the previous points, so much of this SCR and its specific standards changes depends on one specific type of checksum, MD5, and worse, on one specific tool. I oppose the md5sums.txt file because it supports one specific tool for one specific type of checksum. Either we're going to allow files supporting checksums in general in the root directory... (35:53)
Anne: The ODL limitation on 32-character hexadecimal numbers may not be an easy fix.
Simpson: Is this a data management issue, and therefore outside of the archive?
Anne: In the paragraph about media maintenance, it's not correct that this checksum helps fix bad volumes; it only tells you that a volume is bad.
Joy: The point Todd was making was that if you have a mechanism to know that a copy is bad, you go...
Anne: Make sure it is not implied that the MD5 checksum adds anything in terms of saving the archive, as opposed to simply being an indicator that the archive has gone bad.
Guinness: Nobody's implying that; you still need to have a backup.
Anne: Under "Problem", 2nd paragraph: "the media has theoretical lifespans".
Simpson: There are two parts: fixing the PSDD entry, and adding the md5sums.txt file.
Anne: My problem is with the specific solution and the specific software. I support a general solution. (Raugh finished)
Mitch: Got confused reading the SCR about who is being required to do what and when. Electronic deliveries from data preparers get checksums in the root directory, but not hard-media deliveries? (Requirements appear to differ based on the method of delivery.)
Simpson: He's not requiring anything. (agreement from Guinness)
Mitch: I don't favor optional; this is essential for us to migrate data to new media, whether it comes from the data provider or the receiving node.
Rye: Recommendation: optional for PDS3, required for PDS4.
Mitch: That would be okay, but we (PDS) need to be making checksums for all data received, including past deliveries.
Rye: Agreed.
Anne: Concerned about requiring it, because we get all kinds of little files that aren't part of an archive; maintaining checksums for them could be a major deal. (43:20)
Rye: The need for checksums is also major for us as an archive organization; we need a mechanism for maintaining data integrity.
Anne: Most of those little files are ASCII; you can tell if there's a problem by looking at them.
Rye: For you, but that's not true for most nodes.
Anne: That's fine, but if you make it required, then I have to put up with thousands of files that aren't created by a pipeline, that don't have checksums and won't have checksums.
Ron: What if you didn't put them in a file, but just put them in a manifest?
Anne: Same problem; I'd have to be constantly updating the manifest.
Ron: I would recommend making it required and giving waivers where appropriate.
Anne: I also have a problem with making it required because it's a new thing that doesn't have a long shelf life.
Rye: You've already made some good points about needing a standard on checksums in general rather than specifying the MD5 checksum in particular; if we could come up with that general standard, then I believe it's important to require these.
Simpson: That's why I think it should be external to the archive, so that if we do change to a new 64-bit checksum, PDS can go back and regenerate all the checksums without having to redo the archive.
Rye: As for maintaining a manifest, that could easily be done by a Perl script run as a cron job; it would take some of us about an hour to put together such a script, given that MD5 utilities are already available. The script would validate the archive against the existing manifest and make any updates as necessary (a rough sketch follows this exchange).
Simpson: We would just need some change control so that people don't change checksums just because they think one is bad.
Rye: Optional vs. required?
Simpson: Okay with requiring it for PDS4, but I can see Anne's problem.
Guinness: For it if optional, but not if required.
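(Rye describes this as a Perl script; purely to illustrate the idea, the following is a rough Python sketch under the same assumptions as above, i.e., an md5sums.txt-style manifest of "<digest>  <relative path>" lines at the volume root. The archive path is hypothetical, and the sketch deliberately leaves the manifest untouched when a mismatch is found, which is one simple way to respect the change control Simpson asks for.)

    import hashlib, os

    def file_md5(path, chunk_size=65536):
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def validate_and_update(root, manifest_name="md5sums.txt"):
        """Check every file under root against the manifest, report changes, refresh the manifest."""
        manifest_path = os.path.join(root, manifest_name)
        recorded = {}
        if os.path.exists(manifest_path):
            with open(manifest_path) as f:
                for line in f:
                    if line.strip():
                        digest, rel = line.split(None, 1)
                        recorded[rel.strip()] = digest.lower()

        current = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                if rel == manifest_name:                 # the manifest does not list itself
                    continue
                current[rel] = file_md5(full)

        mismatched = False
        for rel in sorted(set(current) | set(recorded)):
            if rel not in recorded:
                print("NEW      ", rel)
            elif rel not in current:
                print("MISSING  ", rel)
            elif recorded[rel] != current[rel]:
                print("MISMATCH ", rel)                  # possible corruption; needs review
                mismatched = True

        if not mismatched:                               # leave the manifest alone if corruption is suspected
            with open(manifest_path, "w") as f:
                for rel in sorted(current):
                    f.write("%s  %s\n" % (current[rel], rel))

    validate_and_update("/archives/EXAMPLE_VOLUME")      # hypothetical archive root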
Rye: What about PDS4?
Guinness: This is a policy question, not a standards issue; it is for each node to decide how to maintain its archive. Have a policy that each node is required to maintain the integrity of its data sets, and each can choose the way it wants to do that.
Rye: Granted, but given that we transfer data among nodes a lot, we need at least enough of a standard that a machine can determine what sort of checksum is being used, so that I can validate the data on my end of a transfer.
Mitch(?): We can always leave open who generates the checksum, so that a node can decide whether to require it from the data provider or generate it itself.